To illustrate stochastic gradient descent, let us use it to compute the ordinary least squares solution in linear regression. I will first generate a small set of simulated data:
set.seed(42)
n <- 1000
p <- 10
X <- matrix(rnorm(n * p), ncol=p)    # design matrix of standard normals
beta <- 1 / 1:p                      # true coefficients: 1, 1/2, ..., 1/10
y <- X %*% beta + rnorm(n, sd=0.5)   # response with Gaussian noise
The exact least squares solution, which in this case we can calculate analytically by solving the normal equations, can be obtained in R with the following:
betaOls <- coef(lm(y ~ X - 1))
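For reference, here is the same solution computed by solving the normal equations directly (a small check of my own; lm itself uses a QR decomposition rather than this explicit solve):
# Solve t(X) %*% X %*% b = t(X) %*% y for b; crossprod(X) is t(X) %*% X.
betaNormal <- solve(crossprod(X), crossprod(X, y))
max(abs(betaNormal - betaOls))   # should be numerically zero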
Standard gradient descent calculates the gradient over the entire dataset and moves some amount, controlled by the learning rate eta, in the opposite direction. Here we run this algorithm for 2000 iterations and plot the maximum absolute error relative to the least squares solution on the log scale:
eta <- 0.01          # learning rate
b <- rep(0, p)       # starting value for the coefficients
E <- 2000            # number of iterations
err <- rep(NA, E)
for (k in 1:E) {
  gradFull <- (2/n) * ((t(X) %*% X) %*% b - t(X) %*% y)   # gradient of the mean squared error
  b <- b - eta * gradFull
  err[k] <- max(abs(b - betaOls))
}
plot(err, log="y", type="l", xlab="iteration")
grid()
Notice that the log of the error decreases almost exactly linearly in the iteration number; this is the geometric (linear) convergence rate expected of gradient descent on a strongly convex problem such as this least squares objective.
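As a quick check of that linearity (my own addition), we can regress log(err) on the iteration number; a slope estimate together with an R-squared near one confirms the geometric decay:
# Fit log(error) against the iteration index to estimate the per-iteration decay.
fit <- lm(log(err) ~ seq_along(err))
coef(fit)                # slope is the log of the per-iteration contraction factor
summary(fit)$r.squared   # near 1 when the decay is geometric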
I now construct a matrix of mini-batch indices, with each row holding the observation indices for one mini-batch. For example, with a batch size of 5 I get the following:
m <- 5
miniBatchIndex <- matrix(sample(1:n),ncol=m)
head(miniBatchIndex)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 385 961 208 758 589
## [2,] 381 248 101 611 264
## [3,] 678 69 807 681 411
## [4,] 754 642 757 214 725
## [5,] 62 974 293 177 586
## [6,] 893 865 392 738 474
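To make the structure explicit (a small check of my own), there are n/m = 200 rows, each a disjoint mini-batch of 5 indices, and together the rows cover every observation exactly once:
dim(miniBatchIndex)                               # 200 rows (mini-batches) of 5 indices
length(unique(as.vector(miniBatchIndex))) == n    # TRUE: each observation appears exactly once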
Using these mini-batches, I calculate an estimate of the gradient from only m observations at a time and again move a small amount downhill from the current value of b.
eta <- 0.01
b <- rep(0, p)
err <- rep(NA, nrow(miniBatchIndex))
for (j in 1:nrow(miniBatchIndex)) {
  ind <- miniBatchIndex[j,]
  gradMini <- (2/m) * ((t(X[ind,]) %*% X[ind,]) %*% b - t(X[ind,]) %*% y[ind])
  b <- b - eta * gradMini
  err[j] <- max(abs(b - betaOls))
}
plot(err, log="y", type="l", xlab="iteration")
grid()
Notice that this curve is much noisier, but it gets the error below 0.05 in only one pass through the data. Keep in mind that the 200 mini-batch updates here touch the same amount of data as a single iteration of full gradient descent.
Tweaking this code slightly, let's iterate over 10 epochs and see how stochastic gradient descent behaves over a larger number of passes through the data.
eta <- 0.01
b <- rep(0, p)
E <- 10
err <- rep(NA, nrow(miniBatchIndex)*E)
for (k in 1:E) {
  miniBatchIndex <- matrix(sample(1:n), ncol=m)
  for (j in 1:nrow(miniBatchIndex)) {
    ind <- miniBatchIndex[j,]
    gradMini <- (1/m) * (2 * (t(X[ind,]) %*% X[ind,]) %*% b - 2 * t(X[ind,]) %*% y[ind])
    b <- b - eta * gradMini
    err[j + (k-1)*nrow(miniBatchIndex)] <- max(abs(b - betaOls))
  }
}
plot(err, log="y", type="l", axes=FALSE, xlab="epoch")
box()
axis(2)
axis(1, at=(0:E)*nrow(miniBatchIndex), 0:E)
abline(v=(1:E)*nrow(miniBatchIndex))
Notice that this does not actually appear to converge, but rather is essentially a random walk after the first epoch. The issue is that we need a learning rate that decreases over time. Here we divide the original learning rate by the epoch number, which causes the updates to slow down as the algorithm proceeds.
eta <- 0.01
b <- rep(0, p)
m <- 5
E <- 10
err <- rep(NA, nrow(miniBatchIndex)*E)
for (k in 1:E) {
  miniBatchIndex <- matrix(sample(1:n), ncol=m)
  for (j in 1:nrow(miniBatchIndex)) {
    ind <- miniBatchIndex[j,]
    gradMini <- (1/m) * (2 * (t(X[ind,]) %*% X[ind,]) %*% b - 2 * t(X[ind,]) %*% y[ind])
    b <- b - (eta / k) * gradMini   # learning rate decays with the epoch number k
    err[j + (k-1)*nrow(miniBatchIndex)] <- max(abs(b - betaOls))
  }
}
plot(err, log="y", type="l", axes=FALSE, xlab="epoch")
box()
axis(2)
axis(1, at=(0:E)*nrow(miniBatchIndex), 0:E)
abline(v=(1:E)*nrow(miniBatchIndex))
There is still a good amount of noise within each epoch (you can think of this as a type of bias, given that partway through an epoch some data points have been used more than others), but at the end of each epoch the value has improved compared to the value at the end of the prior epoch. We are able to get within 0.001 of the least squares solution in just 10 passes over the data, something which would take nearly 500 iterations of the non-stochastic variant.
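As a rough check of that comparison (my own addition; it simply re-runs the earlier full-gradient loop, since the original err vector has since been overwritten):
bFull <- rep(0, p)
errFull <- rep(NA, 2000)
for (k in 1:2000) {
  gradFull <- (2/n) * ((t(X) %*% X) %*% bFull - t(X) %*% y)
  bFull <- bFull - eta * gradFull
  errFull[k] <- max(abs(bFull - betaOls))
}
which(errFull < 0.001)[1]   # first full-gradient iteration with error below 0.001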
Support for running modern, deep neural networks in R is quite weak. Here I will show how to use the two most popular packages that do exist, but after today we will use my own code or call routines in other languages. To start, let's load in the cars dataset:
set.seed(1)
x <- read.csv("../../data/cars.csv", as.is=TRUE)
names(x) <- tolower(names(x))
x$expensive <- as.numeric(x$retail > median(x$retail))
x$testFlag <- as.numeric(runif(nrow(x)) > 0.8)
head(x)
## sports suv wagon minivan pickup awd rwd retail
## Acura 3.5 RL 0 0 0 0 0 0 0 43755
## Acura 3.5 RL Navigation 0 0 0 0 0 0 0 46100
## Acura MDX 0 1 0 0 0 1 0 36945
## Acura NSX S 1 0 0 0 0 0 1 89765
## Acura RSX 0 0 0 0 0 0 0 23820
## Acura TL 0 0 0 0 0 0 0 33195
## dealer engine cylinders horsepower citympg
## Acura 3.5 RL 39014 3.5 6 225 18
## Acura 3.5 RL Navigation 41100 3.5 6 225 18
## Acura MDX 33337 3.5 6 265 17
## Acura NSX S 79978 3.2 6 290 17
## Acura RSX 21761 2.0 4 200 24
## Acura TL 30299 3.2 6 270 20
## highwaympg weight wheelbase length width expensive
## Acura 3.5 RL 24 3880 115 197 72 1
## Acura 3.5 RL Navigation 24 3893 115 197 72 1
## Acura MDX 23 4451 106 189 77 1
## Acura NSX S 24 3153 100 174 71 1
## Acura RSX 31 2778 101 172 68 0
## Acura TL 28 3575 108 186 72 1
## testFlag
## Acura 3.5 RL 0
## Acura 3.5 RL Navigation 0
## Acura MDX 0
## Acura NSX S 1
## Acura RSX 0
## Acura TL 1
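A quick sanity check on the two constructed columns (my own addition): the response is split roughly in half by construction, and about 20 percent of the rows land in the test set.
mean(x$expensive)   # close to 0.5, since the split is at the median retail price
mean(x$testFlag)    # roughly 0.2, since runif(nrow(x)) > 0.8 about 20% of the time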
For a point of reference, I'll fit a logistic regression on the response of whether each car is in the upper half of the dataset in terms of retail price:
out <- glm(expensive ~ weight + citympg + highwaympg + sports + suv,
           data=x,
           subset=(testFlag == 0),
           family=binomial)
predGlm <- predict(out, x, type='response')
Using the nnet package, I can fit a neural network with one hidden layer of 10 nodes:
library(nnet)
out <- nnet(expensive ~ weight + citympg + highwaympg + sports + suv,
            data=x,
            subset=(testFlag == 0),
            size=10)
## # weights: 71
## initial value 96.797873
## final value 79.471698
## converged
predNnet10 <- predict(out, x)
table(predNnet10)
## predNnet10
## 0.490566038097916
## 387
Notice that the prediction results are all identical: the network has collapsed to a single constant value. In truth, we do not have enough nodes here to learn anything useful. If I increase the number of hidden nodes to 500, the results mimic those of the glm model fairly well.
out <- nnet(expensive ~ weight + citympg + highwaympg + sports + suv,
            data=x,
            subset=(testFlag == 0),
            size=500,
            MaxNWts=7500L)
## # weights: 3501
## initial value 137.069159
## iter 10 value 69.829041
## iter 20 value 50.012859
## iter 30 value 44.158543
## iter 40 value 37.637928
## iter 50 value 36.391943
## iter 60 value 36.387061
## iter 70 value 36.369509
## iter 80 value 36.354144
## iter 90 value 36.346069
## iter 100 value 36.342067
## final value 36.342067
## stopped after 100 iterations
predNnet500 <- predict(out, x)
table(round(predNnet500,2))
##
## 0 0.01 0.02 0.03 0.04 0.05 0.09 0.1 0.11 0.13 0.14 0.16 0.19 0.2 0.21
## 128 3 1 1 1 1 1 1 2 1 1 1 2 2 2
## 0.22 0.23 0.24 0.25 0.26 0.3 0.31 0.32 0.35 0.36 0.37 0.4 0.43 0.44 0.45
## 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1
## 0.49 0.5 0.53 0.56 0.57 0.59 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.69 0.71
## 1 1 1 3 1 2 3 1 1 1 3 2 1 1 3
## 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82 0.83 0.84
## 2 1 1 1 8 3 3 4 7 9 26 133
cor(predNnet500, predGlm)
## [,1]
## [1,] 0.9152229668
Increasing the number of nodes to 900, we see that the correlation actually drops; at this point, the capacity of the model is too large given the input data size.
out <- nnet(expensive ~ weight + citympg + highwaympg + sports + suv,
            data=x,
            subset=(testFlag == 0),
            size=900,
            MaxNWts=7500L)
## # weights: 6301
## initial value 155.998595
## iter 10 value 69.214097
## iter 20 value 48.827687
## iter 30 value 38.998937
## iter 40 value 37.654080
## iter 50 value 36.391560
## iter 60 value 35.767659
## iter 70 value 34.716419
## iter 80 value 34.633026
## iter 90 value 34.510777
## iter 100 value 34.330067
## final value 34.330067
## stopped after 100 iterations
predNnet900 <- predict(out, x)
table(round(predNnet900,2))
##
## 0 0.03 0.04 0.32 0.33 0.35 0.37 0.38 0.4 0.44 0.46 0.5 0.57 0.64 0.66
## 113 1 1 3 56 2 1 3 1 2 2 3 1 2 1
## 0.72 0.73 0.77 0.78 0.79 0.82 0.83 0.84 0.85
## 1 3 1 3 1 1 1 5 179
cor(predNnet900, predGlm)
## [,1]
## [1,] 0.9003548153
Evaluating these results on the training set (the column labelled 0) and the held-out test set (the column labelled 1), notice that the 500-node network beats the glm on the training data but falls slightly behind it on the test data, while the 10- and 900-node networks clearly under-perform out of sample.
tapply(as.numeric(predGlm > 0.5) == x$expensive, x$testFlag, mean)
## 0 1
## 0.8238993711 0.8840579710
tapply(as.numeric(predNnet10 > 0.5) == x$expensive, x$testFlag, mean)
## 0 1
## 0.5094339623 0.4637681159
tapply(as.numeric(predNnet500 > 0.5) == x$expensive, x$testFlag, mean)
## 0 1
## 0.8553459119 0.8695652174
tapply(as.numeric(predNnet900 > 0.5) == x$expensive, x$testFlag, mean)
## 0 1
## 0.8616352201 0.7971014493
This is a general property of neural networks that will soon become painfully obvious: there is some way to construct a very predictive neural network, but unless it is very well tuned it will often get outperformed by much simpler models.
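As one illustration of what tuning can look like (a sketch of my own, not fit above): nnet exposes a weight-decay penalty (decay) and an iteration cap (maxit), and revisiting the small 10-node network with those set is a natural first experiment. The particular values below are arbitrary starting points, and whether they help on this dataset is left for you to check.
# Hypothetical tuning sketch: weight decay plus more iterations on the 10-node network.
outTuned <- nnet(expensive ~ weight + citympg + highwaympg + sports + suv,
                 data=x,
                 subset=(testFlag == 0),
                 size=10,
                 decay=0.1,      # L2 penalty on the weights
                 maxit=500,      # allow more optimizer iterations
                 trace=FALSE)    # suppress the iteration log
predNnetTuned <- predict(outTuned, x)
tapply(as.numeric(predNnetTuned > 0.5) == x$expensive, x$testFlag, mean)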
The nnet package is actually quite well written (the multinom function is particularly useful), but it only handles a single hidden layer. The neuralnet package allows for larger models, but I have never had much luck using it. I think its only real use is to quickly construct a visualization of a simple neural network model:
library(neuralnet)
out <- neuralnet(expensive ~ weight + citympg + highwaympg + sports + suv,
                 data=x,
                 err.fct="sse",
                 hidden=c(3,2),
                 linear.output=FALSE,
                 likelihood=TRUE)
plot(out)