Computer vision: LeNet-5, AlexNet, VGG-19, GoogLeNet

Import various modules that we need for this notebook.

In [3]:
%pylab inline

import copy

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from keras.datasets import mnist, cifar10
from keras.models import Sequential, Graph
from keras.layers.core import Dense, Dropout, Activation, Flatten, Reshape
from keras.optimizers import SGD, RMSprop
from keras.utils import np_utils
from keras.regularizers import l2
from keras.layers.convolutional import Convolution2D, MaxPooling2D, ZeroPadding2D, AveragePooling2D
from keras.callbacks import EarlyStopping
from keras.preprocessing.image import ImageDataGenerator
from keras.layers.normalization import BatchNormalization

from PIL import Image
Using Theano backend.
/Users/taylor/anaconda3/lib/python3.5/site-packages/theano/tensor/signal/downsample.py:5: UserWarning: downsample module has been moved to the pool module.
  warnings.warn("downsample module has been moved to the pool module.")
Populating the interactive namespace from numpy and matplotlib

Load the MNIST dataset, flatten the images, convert the class labels, and scale the data.

In [4]:
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(X_train.shape[0], 1, 28, 28).astype('float32') / 255
X_test = X_test.reshape(X_test.shape[0], 1, 28, 28).astype('float32') / 255
Y_train = np_utils.to_categorical(y_train, 10)
Y_test = np_utils.to_categorical(y_test, 10)

I. LeNet-5 for MNIST10

Here is my attempt to replicate the LeNet-5 model as closely as possibly the original paper: LeCun, Yann, Léon Bottou, Yoshua Bengio, and Patrick Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86, no. 11 (1998): 2278-2324.

As few modern neural network libraries allow for partially connected convolution layers, I've substituted this with a dropout layer. I've also replaced momentum with the Hessian approximation, and rescaled the learning rate schedule, though the proportional decay remains the same.

In [4]:
model = Sequential()

model.add(Convolution2D(6, 5, 5, border_mode='valid', input_shape = (1, 28, 28)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Activation("sigmoid"))

model.add(Convolution2D(16, 5, 5, border_mode='valid'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Activation("sigmoid"))
model.add(Dropout(0.5))

model.add(Convolution2D(120, 1, 1, border_mode='valid'))

model.add(Flatten())
model.add(Dense(84))
model.add(Activation("sigmoid"))
model.add(Dense(10))
model.add(Activation('softmax'))
In [12]:
l_rate = 1
sgd = SGD(lr=l_rate, mu=0.8)
model.compile(loss='categorical_crossentropy', optimizer=sgd)
model.fit(X_train, Y_train, batch_size=32, nb_epoch=2,
          verbose=1, show_accuracy=True, validation_data=(X_test, Y_test))

sgd = SGD(lr=0.8 * l_rate, mu=0.8)
model.compile(loss='categorical_crossentropy', optimizer=sgd)
model.fit(X_train, Y_train, batch_size=32, nb_epoch=3,
          verbose=1, show_accuracy=True, validation_data=(X_test, Y_test))

sgd = SGD(lr=0.4 * l_rate, mu=0.8)
model.compile(loss='categorical_crossentropy', optimizer=sgd)
model.fit(X_train, Y_train, batch_size=32, nb_epoch=3,
          verbose=1, show_accuracy=True, validation_data=(X_test, Y_test))

sgd = SGD(lr=0.2 * l_rate, mu=0.8)
model.compile(loss='categorical_crossentropy', optimizer=sgd)
model.fit(X_train, Y_train, batch_size=32, nb_epoch=4,
          verbose=1, show_accuracy=True, validation_data=(X_test, Y_test))

sgd = SGD(lr=0.08 * l_rate, mu=0.8)
model.compile(loss='categorical_crossentropy', optimizer=sgd)
model.fit(X_train, Y_train, batch_size=32, nb_epoch=8,
          verbose=1, show_accuracy=True, validation_data=(X_test, Y_test))
Train on 60000 samples, validate on 10000 samples
Epoch 1/2
60000/60000 [==============================] - 72s - loss: 0.1487 - acc: 0.9540 - val_loss: 0.0655 - val_acc: 0.9784
Epoch 2/2
60000/60000 [==============================] - 73s - loss: 0.1331 - acc: 0.9574 - val_loss: 0.0586 - val_acc: 0.9817
Train on 60000 samples, validate on 10000 samples
Epoch 1/3
60000/60000 [==============================] - 70s - loss: 0.1217 - acc: 0.9615 - val_loss: 0.0508 - val_acc: 0.9835
Epoch 2/3
60000/60000 [==============================] - 68s - loss: 0.1143 - acc: 0.9646 - val_loss: 0.0486 - val_acc: 0.9837
Epoch 3/3
60000/60000 [==============================] - 68s - loss: 0.1073 - acc: 0.9670 - val_loss: 0.0470 - val_acc: 0.9845
Train on 60000 samples, validate on 10000 samples
Epoch 1/3
60000/60000 [==============================] - 67s - loss: 0.0911 - acc: 0.9719 - val_loss: 0.0394 - val_acc: 0.9875
Epoch 2/3
60000/60000 [==============================] - 70s - loss: 0.0854 - acc: 0.9731 - val_loss: 0.0355 - val_acc: 0.9879
Epoch 3/3
60000/60000 [==============================] - 70s - loss: 0.0833 - acc: 0.9737 - val_loss: 0.0375 - val_acc: 0.9880
Train on 60000 samples, validate on 10000 samples
Epoch 1/4
60000/60000 [==============================] - 71s - loss: 0.0765 - acc: 0.9757 - val_loss: 0.0388 - val_acc: 0.9867
Epoch 2/4
60000/60000 [==============================] - 73s - loss: 0.0740 - acc: 0.9763 - val_loss: 0.0341 - val_acc: 0.9893
Epoch 3/4
60000/60000 [==============================] - 68s - loss: 0.0725 - acc: 0.9771 - val_loss: 0.0337 - val_acc: 0.9888
Epoch 4/4
60000/60000 [==============================] - 67s - loss: 0.0717 - acc: 0.9773 - val_loss: 0.0327 - val_acc: 0.9887
Train on 60000 samples, validate on 10000 samples
Epoch 1/8
60000/60000 [==============================] - 69s - loss: 0.0680 - acc: 0.9785 - val_loss: 0.0316 - val_acc: 0.9890
Epoch 2/8
60000/60000 [==============================] - 66s - loss: 0.0651 - acc: 0.9794 - val_loss: 0.0312 - val_acc: 0.9893
Epoch 3/8
60000/60000 [==============================] - 61s - loss: 0.0662 - acc: 0.9790 - val_loss: 0.0316 - val_acc: 0.9896
Epoch 4/8
60000/60000 [==============================] - 62s - loss: 0.0638 - acc: 0.9802 - val_loss: 0.0308 - val_acc: 0.9896
Epoch 5/8
60000/60000 [==============================] - 68s - loss: 0.0649 - acc: 0.9795 - val_loss: 0.0314 - val_acc: 0.9895
Epoch 6/8
60000/60000 [==============================] - 69s - loss: 0.0625 - acc: 0.9802 - val_loss: 0.0302 - val_acc: 0.9896
Epoch 7/8
60000/60000 [==============================] - 70s - loss: 0.0642 - acc: 0.9804 - val_loss: 0.0295 - val_acc: 0.9899
Epoch 8/8
60000/60000 [==============================] - 70s - loss: 0.0630 - acc: 0.9799 - val_loss: 0.0308 - val_acc: 0.9900
Out[12]:
<keras.callbacks.History at 0x125caffd0>
In [13]:
print("Test classification rate %0.05f" % model.evaluate(X_test, Y_test, show_accuracy=True)[1])
10000/10000 [==============================] - 2s     
Test classification rate 0.99000

And once again, let's look at the misclassified examples.

In [15]:
y_hat = model.predict_classes(X_test)
test_wrong = [im for im in zip(X_test,y_hat,y_test) if im[1] != im[2]]

plt.figure(figsize=(10, 10))
for ind, val in enumerate(test_wrong[:100]):
    plt.subplots_adjust(left=0, right=1, bottom=0, top=1)
    plt.subplot(10, 10, ind + 1)
    im = 1 - val[0].reshape((28,28))
    plt.axis("off")
    plt.text(0, 0, val[2], fontsize=14, color='blue')
    plt.text(8, 0, val[1], fontsize=14, color='red')
    plt.imshow(im, cmap='gray')
10000/10000 [==============================] - 2s     

II. LeNet-5 with "Distortions" (i.e., Data augmentation)

The LeNet paper also introduced the idea of adding tweaks to the input data set in order to artificially increase the trainin set size. They suggested slightly distorting the image by shifting or stretching the pixels. The idea is that these distortions should not change the output image classification. Keras has a pre-built library for doing this; let us try to use it here to improve the classification rate. Note that we do not want to flip the image, as this would change the meaning of some digits (6 & 9, for example). Minor rotations are okay, however.

In [40]:
# this will do preprocessing and realtime data augmentation
datagen = ImageDataGenerator(
    featurewise_center=False,  # set input mean to 0 over the dataset
    samplewise_center=False,  # set each sample mean to 0
    featurewise_std_normalization=False,  # divide inputs by std of the dataset
    samplewise_std_normalization=False,  # divide each input by its std
    zca_whitening=False,  # apply ZCA whitening
    rotation_range=25,  # randomly rotate images in the range (degrees, 0 to 180)
    width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
    horizontal_flip=False,  # randomly flip images
    vertical_flip=False)  # randomly flip images

datagen.fit(X_train)

We'll use the same adaptation of LeNet-5 architecture.

In [41]:
model = Sequential()

model.add(Convolution2D(6, 5, 5, border_mode='valid', input_shape = (1, 28, 28)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Activation("sigmoid"))

model.add(Convolution2D(16, 5, 5, border_mode='valid'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Activation("sigmoid"))
model.add(Dropout(0.5))

model.add(Convolution2D(120, 1, 1, border_mode='valid'))

model.add(Flatten())
model.add(Dense(84))
model.add(Activation("sigmoid"))
model.add(Dense(10))
model.add(Activation('softmax'))

Now we'll fit the model. Notice that the format for this is slightly different as the data is coming from datagen.flow rather than a single numpy array. We set the number of sample per epoch to be the same as before (60k). I am also using the non-augmented version with RMS prop for the first 2 epochs, as the details are not specified in the paper and this seems to greatly improve the convergence.

In [42]:
model.compile(loss='categorical_crossentropy', optimizer=RMSprop())
model.fit(X_train, Y_train, batch_size=32, nb_epoch=25,
          verbose=1, show_accuracy=True, validation_data=(X_test, Y_test))
Train on 60000 samples, validate on 10000 samples
Epoch 1/25
60000/60000 [==============================] - 71s - loss: 0.6566 - acc: 0.7880 - val_loss: 0.1605 - val_acc: 0.9489
Epoch 2/25
60000/60000 [==============================] - 71s - loss: 0.2248 - acc: 0.9302 - val_loss: 0.1098 - val_acc: 0.9662
Epoch 3/25
60000/60000 [==============================] - 76s - loss: 0.1778 - acc: 0.9443 - val_loss: 0.0785 - val_acc: 0.9747
Epoch 4/25
60000/60000 [==============================] - 73s - loss: 0.1497 - acc: 0.9514 - val_loss: 0.0692 - val_acc: 0.9769
Epoch 5/25
60000/60000 [==============================] - 68s - loss: 0.1387 - acc: 0.9556 - val_loss: 0.0608 - val_acc: 0.9794
Epoch 6/25
60000/60000 [==============================] - 70s - loss: 0.1265 - acc: 0.9597 - val_loss: 0.0604 - val_acc: 0.9791
Epoch 7/25
60000/60000 [==============================] - 68s - loss: 0.1219 - acc: 0.9623 - val_loss: 0.0540 - val_acc: 0.9819
Epoch 8/25
60000/60000 [==============================] - 73s - loss: 0.1161 - acc: 0.9635 - val_loss: 0.0506 - val_acc: 0.9836
Epoch 9/25
60000/60000 [==============================] - 68s - loss: 0.1130 - acc: 0.9642 - val_loss: 0.0537 - val_acc: 0.9818
Epoch 10/25
60000/60000 [==============================] - 70s - loss: 0.1060 - acc: 0.9660 - val_loss: 0.0486 - val_acc: 0.9838
Epoch 11/25
60000/60000 [==============================] - 71s - loss: 0.1028 - acc: 0.9678 - val_loss: 0.0460 - val_acc: 0.9854
Epoch 12/25
60000/60000 [==============================] - 69s - loss: 0.1007 - acc: 0.9682 - val_loss: 0.0464 - val_acc: 0.9839
Epoch 13/25
60000/60000 [==============================] - 72s - loss: 0.0973 - acc: 0.9691 - val_loss: 0.0454 - val_acc: 0.9848
Epoch 14/25
60000/60000 [==============================] - 73s - loss: 0.0964 - acc: 0.9689 - val_loss: 0.0449 - val_acc: 0.9853
Epoch 15/25
60000/60000 [==============================] - 73s - loss: 0.0947 - acc: 0.9703 - val_loss: 0.0421 - val_acc: 0.9861
Epoch 16/25
60000/60000 [==============================] - 71s - loss: 0.0922 - acc: 0.9706 - val_loss: 0.0441 - val_acc: 0.9868
Epoch 17/25
60000/60000 [==============================] - 71s - loss: 0.0911 - acc: 0.9709 - val_loss: 0.0513 - val_acc: 0.9829
Epoch 18/25
60000/60000 [==============================] - 75s - loss: 0.0910 - acc: 0.9719 - val_loss: 0.0398 - val_acc: 0.9859
Epoch 19/25
60000/60000 [==============================] - 77s - loss: 0.0894 - acc: 0.9716 - val_loss: 0.0406 - val_acc: 0.9873
Epoch 20/25
60000/60000 [==============================] - 68s - loss: 0.0883 - acc: 0.9714 - val_loss: 0.0477 - val_acc: 0.9854
Epoch 21/25
60000/60000 [==============================] - 75s - loss: 0.0866 - acc: 0.9731 - val_loss: 0.0400 - val_acc: 0.9876
Epoch 22/25
60000/60000 [==============================] - 71s - loss: 0.0832 - acc: 0.9735 - val_loss: 0.0412 - val_acc: 0.9866
Epoch 23/25
60000/60000 [==============================] - 69s - loss: 0.0823 - acc: 0.9744 - val_loss: 0.0429 - val_acc: 0.9870
Epoch 24/25
60000/60000 [==============================] - 74s - loss: 0.0855 - acc: 0.9732 - val_loss: 0.0391 - val_acc: 0.9870
Epoch 25/25
60000/60000 [==============================] - 64s - loss: 0.0833 - acc: 0.9738 - val_loss: 0.0394 - val_acc: 0.9876
Out[42]:
<keras.callbacks.History at 0x12c8d5438>

How does the performance stack up? Not quite as good as the non-distorted version, though notice how the classifier does not overfit the same was as it would without the data augmentation. I have a hunch that there is something non-optimal about the RMSprop implementation when using data augmentation.

At any rate, the true advantage of data augmentation comes when we have large models (regularization) or more complex learning tasks (generalization).

In [43]:
print("Test classification rate %0.05f" % model.evaluate(X_test, Y_test, show_accuracy=True)[1])
10000/10000 [==============================] - 2s     
Test classification rate 0.98760

III. OverFeat adaptation of AlexNet (2012)

An adaptation of the 'fast' model from AlexNet applied to MNIST-10.

In [44]:
model = Sequential()

# Layer 1
model.add(Convolution2D(96, 11, 11, input_shape = (1,28,28), border_mode='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# Layer 2
model.add(Convolution2D(256, 5, 5, border_mode='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# Layer 3
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, border_mode='same'))
model.add(Activation('relu'))

# Layer 4
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Activation('relu'))

# Layer 5
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(1024, 3, 3, border_mode='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# Layer 6
model.add(Flatten())
model.add(Dense(3072, init='glorot_normal'))
model.add(Activation('relu'))
model.add(Dropout(0.5))

# Layer 7
model.add(Dense(4096, init='glorot_normal'))
model.add(Activation('relu'))
model.add(Dropout(0.5))

# Layer 8
model.add(Dense(10, init='glorot_normal'))
model.add(Activation('softmax'))

As you can imagine, training this model (even on MNIST-10) is quite time consuming. I'll run just one Epoch with 10 samples to show how it works.

In [45]:
model.compile(loss='categorical_crossentropy', optimizer=RMSprop())
model.fit(X_train[:10], Y_train[:10], batch_size=1, nb_epoch=1,
          verbose=1, show_accuracy=True)
Epoch 1/1
10/10 [==============================] - 95s - loss: nan - acc: 0.1000    
Out[45]:
<keras.callbacks.History at 0x119c08f28>

The true power of this model really comes out when it is used on a larger corpus of images, such as ILSVRC and MS COCO, with images having a larger spatial size.

IV. VGG-19 Model

Now, let's load the VGG-19 model using pre-trained weights. First, we'll create a keras model as normal:

In [13]:
model = Sequential()
model.add(ZeroPadding2D((1,1),input_shape=(3,224,224)))
model.add(Convolution2D(64, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(64, 3, 3, activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(128, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(128, 3, 3, activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(256, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(256, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(256, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(256, 3, 3, activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(ZeroPadding2D((1,1)))
model.add(Convolution2D(512, 3, 3, activation='relu'))
model.add(MaxPooling2D((2,2), strides=(2,2)))

model.add(Flatten())
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1000, activation='softmax'))

We then load the weights of the model from a file (you can download this from the course website; it is not small, coming in at about half a gigabyte). We then have to compile the model, even though we have no intention of actually training it. This is because the compilation in part sets the forward propigation code, which we will need to do predictions.

In [14]:
model.load_weights("../../../class_data/keras/vgg19_weights.h5")

sgd = SGD(lr=0.1, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(optimizer=sgd, loss='categorical_crossentropy')

We will also load some metadata, that gives class labels to the output:

In [15]:
synsets = []
with open("../../../class_data/keras/synset_words.txt", "r") as f:
    synsets += f.readlines()
synsets = [x.replace("\n","") for x in synsets]

Now lets read in an image of a lion:

In [27]:
im = Image.open('img/lion.jpg').resize((224, 224), Image.ANTIALIAS)
plt.figure(figsize=(4, 4))
plt.axis("off")
plt.imshow(im)
im = np.array(im).astype(np.float32)

# scale the image, according to the format used in training
im[:,:,0] -= 103.939
im[:,:,1] -= 116.779
im[:,:,2] -= 123.68
im = im.transpose((2,0,1))
im = np.expand_dims(im, axis=0)

And now predict the class label from the VGG-19 model:

In [28]:
out = model.predict(im)
for index in np.argsort(out)[0][::-1][:10]:
    print("%01.4f - %s" % (out[0][index], synsets[index].replace("\n","")))
0.3274 - n02129165 lion, king of beasts, Panthera leo
0.2489 - n02125311 cougar, puma, catamount, mountain lion, painter, panther, Felis concolor
0.2208 - n02128757 snow leopard, ounce, Panthera uncia
0.0753 - n02128385 leopard, Panthera pardus
0.0631 - n02128925 jaguar, panther, Panthera onca, Felis onca
0.0360 - n02117135 hyena, hyaena
0.0091 - n02127052 lynx, catamount
0.0063 - n01882714 koala, koala bear, kangaroo bear, native bear, Phascolarctos cinereus
0.0024 - n02129604 tiger, Panthera tigris
0.0020 - n01883070 wombat

A relatively impressive result for an out of sample image!

V. GoogLeNet - Inception Module

An implementation of the Inception module, the basic building block of GoogLeNet (2014). As with OverFeat, I don't have enough compute power here to actually traing the model, but this does serve as a nice example of how to use the graph interface in keras.

In [46]:
model = Graph()
model.add_input(name='n00', input_shape=(1,28,28))

# layer 1
model.add_node(Convolution2D(64,1,1, activation='relu'), name='n11', input='n00')
model.add_node(Flatten(), name='n11_f', input='n11')

model.add_node(Convolution2D(96,1,1, activation='relu'), name='n12', input='n00')

model.add_node(Convolution2D(16,1,1, activation='relu'), name='n13', input='n00')

model.add_node(MaxPooling2D((3,3),strides=(2,2)), name='n14', input='n00')

# layer 2
model.add_node(Convolution2D(128,3,3, activation='relu'), name='n22', input='n12')
model.add_node(Flatten(), name='n22_f', input='n22')

model.add_node(Convolution2D(32,5,5, activation='relu'), name='n23', input='n13')
model.add_node(Flatten(), name='n23_f', input='n23')

model.add_node(Convolution2D(32,1,1, activation='relu'), name='n24', input='n14')
model.add_node(Flatten(), name='n24_f', input='n24')

# output layer
model.add_node(Dense(1024, activation='relu'), name='layer4',
               inputs=['n11_f', 'n22_f', 'n23_f', 'n24_f'], merge_mode='concat')
model.add_node(Dense(10, activation='softmax'), name='layer5', input='layer4')
model.add_output(name='output1',input='layer5')
In [48]:
model.compile(loss={'output1':'categorical_crossentropy'}, optimizer=RMSprop())
model.fit({'n00':X_train[:100], 'output1':Y_train[:100]}, nb_epoch=1, verbose=1)
Epoch 1/1
100/100 [==============================] - 24s - loss: 7.0162
Out[48]:
<keras.callbacks.History at 0x156ea1b38>

VI. Batch Normalization

Use the Batch Normalization of: Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015). We'll re-train LeNet-5, but use relu units.

In [38]:
model = Sequential()

model.add(Convolution2D(6, 5, 5, border_mode='valid', input_shape = (1, 28, 28)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Activation("relu"))

model.add(Convolution2D(16, 5, 5, border_mode='valid'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(BatchNormalization())
model.add(Activation("relu"))

model.add(Convolution2D(120, 1, 1, border_mode='valid'))

model.add(Flatten())
model.add(Dense(84))
model.add(Activation("relu"))
model.add(Dense(10))
model.add(Activation('softmax'))
In [39]:
model.compile(loss='categorical_crossentropy', optimizer=RMSprop())
model.fit(X_train, Y_train, batch_size=32, nb_epoch=8,
          verbose=1, show_accuracy=True, validation_data=(X_test, Y_test))
Train on 60000 samples, validate on 10000 samples
Epoch 1/8
60000/60000 [==============================] - 73s - loss: 0.2294 - acc: 0.9293 - val_loss: 0.0937 - val_acc: 0.9697
Epoch 2/8
60000/60000 [==============================] - 72s - loss: 0.0786 - acc: 0.9756 - val_loss: 0.0662 - val_acc: 0.9784
Epoch 3/8
60000/60000 [==============================] - 72s - loss: 0.0569 - acc: 0.9820 - val_loss: 0.0561 - val_acc: 0.9814
Epoch 4/8
60000/60000 [==============================] - 72s - loss: 0.0439 - acc: 0.9866 - val_loss: 0.0566 - val_acc: 0.9829
Epoch 5/8
60000/60000 [==============================] - 73s - loss: 0.0364 - acc: 0.9886 - val_loss: 0.0547 - val_acc: 0.9838
Epoch 6/8
60000/60000 [==============================] - 75s - loss: 0.0304 - acc: 0.9903 - val_loss: 0.0477 - val_acc: 0.9853
Epoch 7/8
60000/60000 [==============================] - 81s - loss: 0.0258 - acc: 0.9917 - val_loss: 0.0555 - val_acc: 0.9832
Epoch 8/8
60000/60000 [==============================] - 89s - loss: 0.0221 - acc: 0.9927 - val_loss: 0.0520 - val_acc: 0.9857
Out[39]:
<keras.callbacks.History at 0x12a0ee8d0>

VII. Residual block - as in ResNet (2015)

An example of the residual block used in the pre-print: "Deep Residual Learning for Image Recognition." (2015).

In [36]:
model = Graph()
model.add_input(name='input0', input_shape=(1,28,28))
model.add_node(Flatten(), name='input1', input='input0')
model.add_node(Dense(50),   name='input2', input='input1')

model.add_node(Dense(50, activation='relu'), name='middle1', input='input2')
model.add_node(Dense(50, activation='relu'), name='middle2', input='middle1')

model.add_node(Dense(512, activation='relu'), name='top1',
               inputs=['input2', 'middle2'], merge_mode='sum')
model.add_node(Dense(10, activation='softmax'), name='top2', input='top1')
model.add_output(name='top3',input='top2')
In [37]:
model.compile(loss={'top3':'categorical_crossentropy'}, optimizer=RMSprop())
model.fit({'input0':X_train, 'top3':Y_train}, nb_epoch=25, verbose=1,
          validation_data={'input0':X_test, 'top3':Y_test})
Train on 60000 samples, validate on 10000 samples
Epoch 1/25
60000/60000 [==============================] - 3s - loss: 0.3205 - val_loss: 0.1624
Epoch 2/25
60000/60000 [==============================] - 2s - loss: 0.1416 - val_loss: 0.1197
Epoch 3/25
60000/60000 [==============================] - 2s - loss: 0.1025 - val_loss: 0.1044
Epoch 4/25
60000/60000 [==============================] - 2s - loss: 0.0812 - val_loss: 0.0978
Epoch 5/25
60000/60000 [==============================] - 2s - loss: 0.0679 - val_loss: 0.0857
Epoch 6/25
60000/60000 [==============================] - 2s - loss: 0.0574 - val_loss: 0.0819
Epoch 7/25
60000/60000 [==============================] - 2s - loss: 0.0493 - val_loss: 0.1023
Epoch 8/25
60000/60000 [==============================] - 2s - loss: 0.0428 - val_loss: 0.0861
Epoch 9/25
60000/60000 [==============================] - 3s - loss: 0.0373 - val_loss: 0.0948
Epoch 10/25
60000/60000 [==============================] - 2s - loss: 0.0316 - val_loss: 0.0789
Epoch 11/25
60000/60000 [==============================] - 3s - loss: 0.0277 - val_loss: 0.0882
Epoch 12/25
60000/60000 [==============================] - 3s - loss: 0.0241 - val_loss: 0.0995
Epoch 13/25
60000/60000 [==============================] - 3s - loss: 0.0230 - val_loss: 0.0865
Epoch 14/25
60000/60000 [==============================] - 2s - loss: 0.0203 - val_loss: 0.0958
Epoch 15/25
60000/60000 [==============================] - 2s - loss: 0.0180 - val_loss: 0.1060
Epoch 16/25
60000/60000 [==============================] - 2s - loss: 0.0158 - val_loss: 0.0942
Epoch 17/25
60000/60000 [==============================] - 2s - loss: 0.0152 - val_loss: 0.0940
Epoch 18/25
60000/60000 [==============================] - 3s - loss: 0.0138 - val_loss: 0.0969
Epoch 19/25
60000/60000 [==============================] - 2s - loss: 0.0128 - val_loss: 0.1041
Epoch 20/25
60000/60000 [==============================] - 2s - loss: 0.0106 - val_loss: 0.0998
Epoch 21/25
60000/60000 [==============================] - 2s - loss: 0.0109 - val_loss: 0.1075
Epoch 22/25
60000/60000 [==============================] - 2s - loss: 0.0103 - val_loss: 0.1018
Epoch 23/25
60000/60000 [==============================] - 2s - loss: 0.0088 - val_loss: 0.1103
Epoch 24/25
60000/60000 [==============================] - 2s - loss: 0.0079 - val_loss: 0.1218
Epoch 25/25
60000/60000 [==============================] - 2s - loss: 0.0081 - val_loss: 0.1210
Out[37]:
<keras.callbacks.History at 0x115ced2e8>

VIII. Pure convolution

For reference, here is the architecture of a Pure Convolution network: Springenberg, J. T., Dosovitskiy, A., Brox, T., & Riedmiller, M. (2014). Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806.

In [6]:
model = Sequential()

model.add(Convolution2D(96, 5, 5, border_mode='valid', input_shape = (1, 28, 28)))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(2,2)))
model.add(Activation("relu"))

model.add(Convolution2D(192, 5, 5, border_mode='valid'))
model.add(MaxPooling2D(pool_size=(3, 3), strides=(2,2)))
model.add(Activation("relu"))

model.add(Convolution2D(192, 3, 3, border_mode='valid'))
model.add(Activation("relu"))
model.add(Convolution2D(192, 1, 1, border_mode='valid'))
model.add(Activation("relu"))
model.add(Convolution2D(10, 1, 1, border_mode='valid'))
model.add(Activation("relu"))

model.add(Flatten())
model.add(Activation('softmax'))
          
rms = RMSprop()
model.compile(loss='categorical_crossentropy', optimizer=rms)
In [ ]: