Basic neural networks in R

Today, we are going to look at my custom implementation of the back-propagation algorithm in R. In total it is under 60 lines of code, though a few of the steps take some thought to follow.

Implementation

Some helper functions to simplify the code:

randMat <- function(n,m) matrix(rnorm(n*m),ncol=m)   # n x m matrix of standard normal draws
zeroMat <- function(d) matrix(0,nrow=d[1],ncol=d[2]) # zero matrix with dimensions d = c(nrow, ncol)
zeroVec <- function(d) rep(0,d)                      # zero vector of length d
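
As a quick, purely illustrative sanity check, the helpers return objects of the expected shapes:

dim(randMat(3, 2))     # 3 2
dim(zeroMat(c(2, 4)))  # 2 4
length(zeroVec(5))     # 5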

I start by constructing a neural network object, built by passing a vector giving the sizes of each layer of the neural network. Notice that this vector needs to include the input and output layers (so the minimum length for a non-trivial network is three). This object contains everything that describes the architecture of the network: weights, biases, the cost function, the activation function, and the derivatives of both. There are no weights or biases associated with the input layer, so the weight and bias lists each contain one fewer element than the number of layers.

nnetObj <- function(sizes) {
  structure(list(sizes = sizes,
                 num_layers = length(sizes),
                 sigmoid = function(v) 1 / (1 + exp(-v)),
                 sigmoid_prime = function(v) (1 / (1 + exp(-v))) * (1 - 1 / (1 + exp(-v))),
                 cost = function(a, y) 0.5 * apply((a - y)^2,2,sum),
                 cost_derivative = function(activation, y) (activation - y),
                 biases = lapply(sizes[-1], rnorm),
                 weights = mapply(randMat, sizes[-1], sizes[-length(sizes)],
                                  SIMPLIFY=FALSE)),
      class="net")
}
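
For example (a made-up network used only for illustration; it is not needed for the MNIST application below), a network with 2 inputs, 3 hidden nodes, and 1 output should have a 3 x 2 and a 1 x 3 weight matrix, and bias vectors of lengths 3 and 1:

nn_small <- nnetObj(c(2, 3, 1))
lapply(nn_small$weights, dim)    # 3 2 and 1 3
lapply(nn_small$biases, length)  # 3 and 1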

We now need a function that takes a neural network and a single input vector (one row of the data matrix) and returns a list of the weighted inputs and activations for the current set of weights and biases in the network. This is done by iteratively applying the weights and biases to the activations in one layer to get the activations for the next layer. Notice that there is an activation for every layer, but no weighted input for the input layer, so the list of weighted inputs z has one fewer element than the list of activations.

feedforward <- function(nn, x) {
  activation <- x
  activations <- list(activation)
  zs <- list()

  for (k in 1:(nn$num_layers-1)) {
    b <- nn$biases[[k]]
    w <- nn$weights[[k]]
    z <- w %*% activation + b
    zs <- append(zs, list(z))
    activation <- nn$sigmoid(z)
    activations <- append(activations, list(activation))
  }

  structure(list(zs=zs, activations=activations),class="ff")
}
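
Continuing the illustrative nn_small example from above, feeding a single observation through returns one weighted input per non-input layer and one activation per layer:

ff_small <- feedforward(nn_small, c(0.5, -0.2))
length(ff_small$zs)           # 2
length(ff_small$activations)  # 3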

Now, we write a function that takes a neural network and the results of the feedforward step and computes the errors (delta in my notes) for every layer of the network, along with the gradients of the cost with respect to the biases and weights. The errors are calculated by looping backwards through the layers, applying our formula relating the errors in one layer to those in the next.

backprop <- function(nn, ff, y) {
  delta_nabla_b <- lapply(sapply(nn$biases,length),zeroVec)
  delta_nabla_w <- lapply(lapply(nn$weights,dim),zeroMat)

  # error and gradients for the output layer
  delta <- nn$cost_derivative(ff$activations[[nn$num_layers]], y) *
                nn$sigmoid_prime(ff$zs[[nn$num_layers-1]])
  delta_nabla_b[[nn$num_layers-1]] <- delta
  delta_nabla_w[[nn$num_layers-1]] <- delta %*% t(ff$activations[[nn$num_layers-1]])

  # propagate the errors backwards through the hidden layers
  for (k in 2:(nn$num_layers-1)) {
    z <- ff$zs[[nn$num_layers - k]]
    sp <- nn$sigmoid_prime(z)
    delta <- t(nn$weights[[nn$num_layers - k + 1]]) %*% delta * sp
    delta_nabla_b[[nn$num_layers - k]] <- delta
    delta_nabla_w[[nn$num_layers - k]] <- delta %*% t(ff$activations[[nn$num_layers - k]])
  }

  structure(list(delta_nabla_b=delta_nabla_b, delta_nabla_w=delta_nabla_w), class="bp")
}
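
One quick way to convince yourself the backward pass is correct is a finite-difference check on a small made-up network (the layer sizes, seed, and step size below are arbitrary choices, not part of the original code). The numerical derivative of the cost with respect to a single weight should closely match the corresponding entry of delta_nabla_w:

set.seed(42)
nn_check <- nnetObj(c(4, 3, 2))
x <- rnorm(4)
y <- c(1, 0)

bp <- backprop(nn_check, feedforward(nn_check, x), y)

# central-difference approximation for the (1,1) entry of the first weight matrix
eps <- 1e-6
nn_plus <- nn_check
nn_minus <- nn_check
nn_plus$weights[[1]][1,1]  <- nn_check$weights[[1]][1,1] + eps
nn_minus$weights[[1]][1,1] <- nn_check$weights[[1]][1,1] - eps
num_grad <- (nn_check$cost(feedforward(nn_plus, x)$activations[[nn_check$num_layers]], y) -
             nn_check$cost(feedforward(nn_minus, x)$activations[[nn_check$num_layers]], y)) / (2 * eps)

c(numerical = num_grad, backprop = bp$delta_nabla_w[[1]][1,1])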

Finally, I wrap all of this into a function that implements the stochastic gradient descent algorithm. It optionally accepts a test set, for which predictions are computed and the number of correctly classified test observations is printed at the end of each epoch. The tuning parameters are the mini-batch size, the number of epochs, and the learning rate. We pass in a neural network object and get back a copy of the network with the updated weights and biases.

sgd <- function(nn,X,Y,Xtest=NULL,Ytest=NULL,
                epochs=10L,m=10,eta=1,verbose=TRUE) {

  n <- nrow(X)

  for (j in 1:epochs) { # start of epoch

    # construct a list of mini-batches
    id <- sample(rep(1:(n/m),length.out=n))
    mini_batches <- split(1:n, id)

    for (mini_batch in mini_batches) { # run one mini-batch
      tm <- length(mini_batch)
      nabla_b <- lapply(sapply(nn$biases,length),zeroVec)
      nabla_w <- lapply(lapply(nn$weights,dim),zeroMat)

      for (i in mini_batch) {
        # feed forward step
        ff <- feedforward(nn, X[i,])

        # backward pass
        bp <- backprop(nn, ff, Y[i,])

        nabla_b <- mapply(`+`, nabla_b, bp$delta_nabla_b, SIMPLIFY=FALSE)
        nabla_w <- mapply(`+`, nabla_w, bp$delta_nabla_w, SIMPLIFY=FALSE)

      }

      # average the gradient over the mini-batch and scale by the negative learning rate
      nabla_b <- lapply(nabla_b, function(v) -1 * (eta / tm) *  v)
      nabla_w <- lapply(nabla_w, function(v) -1 * (eta / tm) *  v)

      # update weights and biases
      nn$biases <- mapply(`+`, nn$biases, nabla_b, SIMPLIFY=FALSE)
      nn$weights <- mapply(`+`, nn$weights, nabla_w, SIMPLIFY=FALSE)

    }

    # feed forward on test set
    if (!is.null(Ytest) && verbose) {
      Yhat <- Ytest
      for (i in 1:nrow(Xtest)) {
        x <- Xtest[i,]
        ff <- feedforward(nn, x)
        Yhat[i,] <- ff$activations[[nn$num_layers]]
      }
      YhatClass <- apply(Yhat, 1, which.max) - 1L
      YtestClass <- apply(Ytest, 1, which.max) - 1L
      cat(sprintf("Epoch %d - %05d / %05d\n",  j,
        sum(YtestClass == YhatClass),length(YtestClass)))
    }
  }
  return(nn)
}
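
As an aside, the mini-batch construction at the start of each epoch simply shuffles a set of group labels and splits the row indices by them. With the made-up values n = 10 and m = 3 (used here only for illustration), it gives groups of sizes 4, 3, and 3:

n <- 10; m <- 3
id <- sample(rep(1:(n/m), length.out = n))
mini_batches <- split(1:n, id)
sapply(mini_batches, length)  # 4 3 3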

Personally, I find this algorithm fairly straightforward once the data structure for the network is settled. The hardest part is making sure that the algorithm works correctly in vectorized form. The formula for nabla_w, while simple in its final form, took a bit of head scratching to work out (it turns out the matrix product conveniently does the right thing with multiple inputs by summing over them, as the short check below illustrates).
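
Here is a small check of that parenthetical claim, with arbitrary made-up dimensions: stacking two observations as columns, the single matrix product delta %*% t(act) equals the sum of the per-observation outer products, so all.equal should return TRUE.

delta <- matrix(rnorm(6), ncol = 2)  # 3 nodes in the current layer, 2 observations
act   <- matrix(rnorm(8), ncol = 2)  # 4 nodes in the previous layer, 2 observations
all.equal(delta %*% t(act),
          outer(delta[,1], act[,1]) + outer(delta[,2], act[,2]))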

Application to MNIST

Now, I want to read in the full version of the MNIST dataset. Notice that the images are larger and the number of samples is significantly increased.

train <- read.csv("../../../class_data/mnist/mnist_train.csv", as.is=TRUE, header=FALSE)
test <- read.csv("../../../class_data/mnist/mnist_test.csv", as.is=TRUE, header=FALSE)
dim(train)
## [1] 60000   785
dim(test)
## [1] 10000   785

I now need to scale the X inputs:

X <- as.matrix(train[,-1]) / 256
Xtest <- as.matrix(test[,-1]) / 256

And convert the categorical response into a matrix of 10 indicator variables:

Y <- matrix(0, nrow=nrow(X), ncol=10)
Ytest <- matrix(0, nrow=nrow(Xtest), ncol=10)
for (i in 0:9) {
  Y[train[,1] == i,i+1] <- 1
  Ytest[test[,1] == i,i+1] <- 1
}
head(Y)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,]    0    0    0    0    0    1    0    0    0     0
## [2,]    1    0    0    0    0    0    0    0    0     0
## [3,]    0    0    0    0    1    0    0    0    0     0
## [4,]    0    1    0    0    0    0    0    0    0     0
## [5,]    0    0    0    0    0    0    0    0    0     1
## [6,]    0    0    1    0    0    0    0    0    0     0
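
For reference (this is an equivalent alternative, not the construction used above), the same indicator matrix can be built without a loop by indexing rows of an identity matrix; all.equal should return TRUE:

Y2 <- diag(10)[train[,1] + 1, ]
all.equal(Y, Y2)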

We can now run a simple neural network with one hidden layer of 30 neurons. I’ll start by just building the architecture of the neural network:

sizes <- c(ncol(X), 30, ncol(Y))
nn <- nnetObj(sizes)

Do the dimensions of the output seem to make sense?

lapply(nn$weights, dim)
## [[1]]
## [1]  30 784
## 
## [[2]]
## [1] 10 30
lapply(nn$biases, length)
## [[1]]
## [1] 30
## 
## [[2]]
## [1] 10

Now, let’s train this over 5 epochs with a learning rate of 1 and mini-batch size of 1.

nn2 <- sgd(nn, X, Y, Xtest, Ytest, m=1, epochs=5, eta=1)
## Epoch 1 - 09058 / 10000
## Epoch 2 - 09243 / 10000
## Epoch 3 - 09212 / 10000
## Epoch 4 - 09263 / 10000
## Epoch 5 - 09222 / 10000

Can we improve things by changing the learning rate?

nn2 <- sgd(nn, X, Y, Xtest, Ytest, m=1, epochs=5, eta=2)
## Epoch 1 - 08697 / 10000
## Epoch 2 - 08880 / 10000
## Epoch 3 - 08846 / 10000
## Epoch 4 - 08894 / 10000
## Epoch 5 - 09063 / 10000

What about increasing the number of hidden nodes? What if we increase them to a total of 60 hidden nodes? For this, let’s also lower the learning rate, given the increase in the number of parameters we need to learn.

set.seed(1)
nn <- nnetObj(sizes <- c(784,60,10))
nn2 <- sgd(nn, X, Y, Xtest, Ytest, m=1, epochs=5, eta=0.5)
## Epoch 1 - 09180 / 10000
## Epoch 2 - 09388 / 10000
## Epoch 3 - 09441 / 10000
## Epoch 4 - 09486 / 10000
## Epoch 5 - 09541 / 10000

This improves on the classification rate we obtained with fewer hidden nodes. We will save the predictions on the test set for later:

# feed forward on test set
Yhat <- Ytest
for (i in 1:nrow(Xtest)) {
  x <- Xtest[i,]
  ff <- feedforward(nn2, x)
  Yhat[i,] <- ff$activations[[nn2$num_layers]]
}
YhatClass <- apply(Yhat, 1, which.max) - 1L
YtestClass <- apply(Ytest, 1, which.max) - 1L

Visualizing the output of a neural network

While neural networks are often opaque, we can sometimes benefit from trying to understand the learned weights in a fitted model. Running a small model with a single hidden layer of just 9 nodes, we can plot the weights associated with each of the hidden nodes:

nn <- nnetObj(sizes <- c(784,9,10))
nn2 <- sgd(nn, X, Y, Xtest, Ytest, m=1, epochs=15, eta=0.2, verbose=FALSE)

par(mfrow=c(3,3))
par(mar=c(0,0,0,0))
for (j in 1:9) {
  z <- nn2$weights[[1]][j,]
  z <- (z - min(z)) / (max(z) - min(z))
  z <- matrix(as.matrix(z),28,28,byrow=TRUE)
  plot(0,0,axes=FALSE,xlab="",ylab="",main="")
  rasterImage((1 - z),-1,-1,1,1)
  box()
}

These can be interpreted as filters on the image, and we can get some sense of the features that each node is capturing.

Misclassified points

As with our SVM implementation, we can attempt to understand where the neural network has the most difficulty by pulling out misclassified examples from the test set.

YhatClass <- apply(Yhat, 1, which.max) - 1L
YtestClass <- apply(Ytest, 1, which.max) - 1L
iset <- sample(which(YtestClass != YhatClass),7*7)
par(mar=c(0,0,0,0))
par(mfrow=c(7,7))
for (j in iset) {
  y <- matrix(as.matrix(Xtest[j,]),28,28,byrow=TRUE)
  y <- (1 - y)

  plot(0,0,xlab="",ylab="",axes=FALSE)
  rasterImage(y,-1,-1,1,1)
  box()
  text(-0.8,-0.7, YhatClass[j], cex=3, col="red")
}

Most of these still look classifiable to the human eye, though some are understandably difficult given how squished some of the digits are. In order to improve on this model, we need to build bigger, deeper networks. To do that, we need to make some small tweaks to our algorithm in order to fit the weights and biases in a more stable and accurate way.