Stochastic gradient descent

To illustrate stochastic gradient descent, let us use it to find the ordinary least squares solution in linear regression. I will first generate a small set of test data:

set.seed(42)
n <- 1000
p <- 10
X <- matrix(rnorm(n*p),ncol=p)
beta <- 1 / 1:p
y <- X %*% beta + rnorm(n, sd=0.5)

The true solution, which in this case we can calculate analytically, comes from solving the normal equations directly. We can get it in R with the following:

betaOls <- coef(lm(y ~ X - 1))
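
Equivalently, the normal equations can be solved directly; this is just a quick sanity check (a minimal aside, not used later), and it should agree with the lm() coefficients up to numerical error:

# solve t(X) %*% X %*% b = t(X) %*% y for b directly
betaDirect <- solve(crossprod(X), crossprod(X, y))
max(abs(betaDirect - betaOls))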

Standard gradient descent computes the full gradient of the loss, which for the mean squared error here is (2/n) * (t(X) %*% X %*% b - t(X) %*% y), and moves some amount in the opposite direction. Here we run this algorithm for 2000 iterations and plot the maximum absolute error against the OLS solution on the log scale:

eta <- 0.01
b <- rep(0, p)
E <- 2000
err <- rep(NA, E)
for (k in 1:E) {
  gradFull <- (2/n) * ((t(X) %*% X) %*% b - t(X) %*% y)
  b <- b - eta * gradFull
  err[k] <- max(abs(b - betaOls))
}

plot(err, log="y", type="l", xlab="iteration")
grid()

Notice that the log of the error decreases almost exactly linearly in the iteration number.
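
One way to quantify this (a quick check, not in the original notes) is to regress the log error on the iteration number over the early iterations; the slope estimates the per-iteration contraction rate:

# slope of log(err) versus iteration over the first 1000 iterations
iter <- 1:1000
coef(lm(log(err[iter]) ~ iter))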

I now construct a matrix of mini-batch samples, with each row holding the indices of one mini-batch. For example, with a batch size of 5 I get the following:

m <- 5
miniBatchIndex <- matrix(sample(1:n),ncol=m)
head(miniBatchIndex)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  385  961  208  758  589
## [2,]  381  248  101  611  264
## [3,]  678   69  807  681  411
## [4,]  754  642  757  214  725
## [5,]   62  974  293  177  586
## [6,]  893  865  392  738  474

Using these mini-batches, I calculate an estimate of the gradient from each batch in turn and again take a small step downhill from the current value of beta.

eta <- 0.01
b <- rep(0, p)
err <- rep(NA, nrow(miniBatchIndex))
for (j in 1:nrow(miniBatchIndex)) {
  ind <- miniBatchIndex[j,]
  gradMini <- (2/m) * ((t(X[ind,]) %*% X[ind,]) %*% b - t(X[ind,]) %*% y[ind])
  b <- b - eta * gradMini
  err[j] <- max(abs(b - betaOls))
}

plot(err, log="y", type="l", xlab="iteration")
grid()

Notice that this curve is much noisier, but gets under an error of 0.05 in only one pass through the data. Keep in mind that the 200 mini-batch updates here touch the same amount of data as just one iteration of the full gradient descent approach.
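
A quick way to confirm the first claim from the stored errors (not in the original code) is to find the first mini-batch update at which the error dips below 0.05:

# index of the first of the 200 mini-batch updates with error below 0.05
min(which(err < 0.05))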

Tweaking this code slightly, let’s iterate over 10 epochs and see how stochastic gradient descent behaves over a larger number of passes through the data.

eta <- 0.01
b <- rep(0, p)
E <- 10
err <- rep(NA, nrow(miniBatchIndex)*E)
for (k in 1:E) {
  miniBatchIndex <- matrix(sample(1:n),ncol=m)
  for (j in 1:nrow(miniBatchIndex)) {
    ind <- miniBatchIndex[j,]
    gradMini <- (1/m) * (2 * (t(X[ind,]) %*% X[ind,]) %*% b - 2 * t(X[ind,]) %*% y[ind])
    b <- b - eta * gradMini
    err[j+(k-1)*nrow(miniBatchIndex)] <- max(abs(b - betaOls))
  }
}

plot(err, log="y", type="l", axes=FALSE, xlab="epoch")
box()
axis(2)
axis(1, at=(0:E)*nrow(miniBatchIndex), 0:E)
abline(v=(1:E)*nrow(miniBatchIndex))

Notice that this does not actually appear to converge, but rather behaves essentially like a random walk after the first epoch. The issue is that we need a learning rate that decreases over time. Here we divide the original learning rate by the epoch number, which causes the steps to shrink as the epochs progress.

eta <- 0.01
b <- rep(0, p)
m <- 5
E <- 10
err <- rep(NA, nrow(miniBatchIndex)*E)
for (k in 1:E) {
  miniBatchIndex <- matrix(sample(1:n),ncol=m)
  for (j in 1:nrow(miniBatchIndex)) {
    ind <- miniBatchIndex[j,]
    gradMini <- (1/m) * (2 * (t(X[ind,]) %*% X[ind,]) %*% b - 2 * t(X[ind,]) %*% y[ind])
    b <- b - (eta) * (1 / k) * gradMini
    err[j+(k-1)*nrow(miniBatchIndex)] <- max(abs(b - betaOls))
  }
}

plot(err, log="y", type="l", axes=FALSE, xlab="epoch")
box()
axis(2)
axis(1, at=(0:E)*nrow(miniBatchIndex), 0:E)
abline(v=(1:E)*nrow(miniBatchIndex))

There is still a good amount of noise in the middle of the epoch (you can think of this as a type of bias, given that partway through an epoch some data points have been used more than others), but at the end of each epoch the value has improved compared to the value at the end of the prior epoch. We are able to get within 0.001 in just 10 passes through the data, something which would take nearly 500 iterations of the non-stochastic variant.
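
To make the comparison concrete, here is a rough check (not in the original notes; b2 and errFull are new names introduced only for this check) that re-runs the plain gradient descent loop and reports how many iterations it needs to reach the same 0.001 error level:

# re-run full gradient descent, keeping its error path separate
b2 <- rep(0, p)
errFull <- rep(NA, 2000)
for (k in 1:2000) {
  gradFull <- (2/n) * ((t(X) %*% X) %*% b2 - t(X) %*% y)
  b2 <- b2 - eta * gradFull
  errFull[k] <- max(abs(b2 - betaOls))
}
# first iteration at which the full-gradient error drops below 0.001
min(which(errFull < 0.001))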

Simple neural network

Support for running modern, deep neural networks in R is quite weak. Here I will show how to use the two most popular packages that do exist, but after today we will use my own code or call routines in other languages. To start, let’s load in the cars dataset:

set.seed(1)
x <- read.csv("../../data/cars.csv", as.is=TRUE)
names(x) <- tolower(names(x))
x$expensive <- as.numeric(x$retail > median(x$retail))
x$testFlag <- as.numeric(runif(nrow(x)) > 0.8)
head(x)
##                         sports suv wagon minivan pickup awd rwd retail
## Acura 3.5 RL                 0   0     0       0      0   0   0  43755
## Acura 3.5 RL Navigation      0   0     0       0      0   0   0  46100
## Acura MDX                    0   1     0       0      0   1   0  36945
## Acura NSX S                  1   0     0       0      0   0   1  89765
## Acura RSX                    0   0     0       0      0   0   0  23820
## Acura TL                     0   0     0       0      0   0   0  33195
##                         dealer engine cylinders horsepower citympg
## Acura 3.5 RL             39014    3.5         6        225      18
## Acura 3.5 RL Navigation  41100    3.5         6        225      18
## Acura MDX                33337    3.5         6        265      17
## Acura NSX S              79978    3.2         6        290      17
## Acura RSX                21761    2.0         4        200      24
## Acura TL                 30299    3.2         6        270      20
##                         highwaympg weight wheelbase length width expensive
## Acura 3.5 RL                    24   3880       115    197    72         1
## Acura 3.5 RL Navigation         24   3893       115    197    72         1
## Acura MDX                       23   4451       106    189    77         1
## Acura NSX S                     24   3153       100    174    71         1
## Acura RSX                       31   2778       101    172    68         0
## Acura TL                        28   3575       108    186    72         1
##                         testFlag
## Acura 3.5 RL                   0
## Acura 3.5 RL Navigation        0
## Acura MDX                      0
## Acura NSX S                    1
## Acura RSX                      0
## Acura TL                       1
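
As a quick sanity check on the two derived columns (not in the original notes), roughly half of the cars should be flagged as expensive and around 20 percent should land in the test set:

# proportion above the median retail price, and proportion held out
mean(x$expensive)
mean(x$testFlag)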

For a point of reference, I’ll fit a logistic regression on the response of whether the car is in the upper half of the dataset in terms of retail price:

out <- glm(expensive ~ weight + citympg + highwaympg + sports + suv,
                  data=x,
                  subset=(testFlag == 0),
                  family=binomial)
predGlm <- predict(out, x, type='response')

Using the nnet package, I can fit a neural network with one hidden layer of 10 nodes:

library(nnet)
out <- nnet(expensive ~ weight + citympg + highwaympg + sports + suv,
                  data=x,
                  subset=(testFlag == 0),
                  size=10)
## # weights:  71
## initial  value 96.797873 
## final  value 79.471698 
## converged
predNnet10 <- predict(out, x)
table(predNnet10)
## predNnet10
## 0.490566038097916 
##               387

Notice that the predictions are all identical: the network has collapsed to a single constant value. In truth, we do not have enough nodes here to learn anything useful. If I increase this number to 500, the results seem to mimic those of the glm model fairly well.

out <- nnet(expensive ~ weight + citympg + highwaympg + sports + suv,
                  data=x,
                  subset=(testFlag == 0),
                  size=500,
                  MaxNWts=7500L)
## # weights:  3501
## initial  value 137.069159 
## iter  10 value 69.829041
## iter  20 value 50.012859
## iter  30 value 44.158543
## iter  40 value 37.637928
## iter  50 value 36.391943
## iter  60 value 36.387061
## iter  70 value 36.369509
## iter  80 value 36.354144
## iter  90 value 36.346069
## iter 100 value 36.342067
## final  value 36.342067 
## stopped after 100 iterations
predNnet500 <- predict(out, x)
table(round(predNnet500,2))
## 
##    0 0.01 0.02 0.03 0.04 0.05 0.09  0.1 0.11 0.13 0.14 0.16 0.19  0.2 0.21 
##  128    3    1    1    1    1    1    1    2    1    1    1    2    2    2 
## 0.22 0.23 0.24 0.25 0.26  0.3 0.31 0.32 0.35 0.36 0.37  0.4 0.43 0.44 0.45 
##    1    1    1    1    1    1    1    1    1    2    1    1    1    1    1 
## 0.49  0.5 0.53 0.56 0.57 0.59 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.69 0.71 
##    1    1    1    3    1    2    3    1    1    1    3    2    1    1    3 
## 0.73 0.74 0.75 0.76 0.77 0.78 0.79  0.8 0.81 0.82 0.83 0.84 
##    2    1    1    1    8    3    3    4    7    9   26  133
cor(predNnet500, predGlm)
##              [,1]
## [1,] 0.9152229668

Increasing the number of nodes to 900, we see that the correlation actually drops; at this point, the capacity of the model is too large given the input data size.

out <- nnet(expensive ~ weight + citympg + highwaympg + sports + suv,
                  data=x,
                  subset=(testFlag == 0),
                  size=900,
                  MaxNWts=7500L)
## # weights:  6301
## initial  value 155.998595 
## iter  10 value 69.214097
## iter  20 value 48.827687
## iter  30 value 38.998937
## iter  40 value 37.654080
## iter  50 value 36.391560
## iter  60 value 35.767659
## iter  70 value 34.716419
## iter  80 value 34.633026
## iter  90 value 34.510777
## iter 100 value 34.330067
## final  value 34.330067 
## stopped after 100 iterations
predNnet900 <- predict(out, x)
table(round(predNnet900,2))
## 
##    0 0.03 0.04 0.32 0.33 0.35 0.37 0.38  0.4 0.44 0.46  0.5 0.57 0.64 0.66 
##  113    1    1    3   56    2    1    3    1    2    2    3    1    2    1 
## 0.72 0.73 0.77 0.78 0.79 0.82 0.83 0.84 0.85 
##    1    3    1    3    1    1    1    5  179
cor(predNnet900, predGlm)
##              [,1]
## [1,] 0.9003548153

Comparing classification accuracy on the training set (testFlag == 0) and the held-out test set (testFlag == 1), notice that the 500-node network edges out the glm on the training data but not on the test set, while the other two networks under-perform:

tapply(as.numeric(predGlm > 0.5) == x$expensive, x$testFlag, mean)
##            0            1 
## 0.8238993711 0.8840579710
tapply(as.numeric(predNnet10 > 0.5) == x$expensive, x$testFlag, mean)
##            0            1 
## 0.5094339623 0.4637681159
tapply(as.numeric(predNnet500 > 0.5) == x$expensive, x$testFlag, mean)
##            0            1 
## 0.8553459119 0.8695652174
tapply(as.numeric(predNnet900 > 0.5) == x$expensive, x$testFlag, mean)
##            0            1 
## 0.8616352201 0.7971014493

This is a general property of neural networks that will soon become painfully obvious: there is some way to construct a very predictive neural network, but unless it is very well tuned it will often get outperformed by much simpler models.

The nnet package is actually quite well written (the multinom function, sketched briefly at the end of this section, is particularly useful), but it only handles a single hidden layer. The neuralnet package allows for larger models, but I have never had much luck with it. I think its only real use is to quickly construct a visualization of a simple neural network model:

library(neuralnet)
out <- neuralnet(expensive ~ weight + citympg + highwaympg + sports + suv,
                  data=x,
                  err.fct="sse",
                  hidden=c(3,2),
                  linear.output=FALSE,
                  likelihood=TRUE)
plot(out)

Figure: the fitted neural network with two hidden layers (3 and 2 nodes).
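
Returning to the multinom function mentioned above, here is a minimal sketch of how it can be used; it fits a multinomial log-linear model through a single-layer network, and I am using the cylinders column (visible in the head(x) output) purely as an illustrative categorical response:

# multinom comes from nnet (already loaded above); predict() returns
# the most likely class by default
outMulti <- multinom(factor(cylinders) ~ weight + horsepower,
                     data=x, subset=(testFlag == 0))
predMulti <- predict(outMulti, x)
table(predMulti, x$cylinders)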