# Momentum

## Generate data

To illustrate the use of momentum in optimization algorithms, let us use it to solve for the ordinary least squares solution in linear regression. I will first generate a small test set of data:

```{r}
set.seed(42)
n <- 1000
p <- 10
X <- matrix(rnorm(n * p), ncol = p)
beta <- 1 / 1:p
y <- X %*% beta + rnorm(n, sd = 0.5)
```

The true solution, which in this case we can calculate analytically by solving the normal equations $X^\top X \beta = X^\top y$, can be obtained in R with:

```{r}
betaOls <- coef(lm(y ~ X - 1))
```

## Standard gradient descent

Standard gradient descent calculates the entire gradient and moves some amount in the opposite direction of the gradient. Here we run this algorithm for 100 iterations and plot the error on the log scale:

```{r}
eta <- 0.01
b <- rep(0, p)
E <- 100
err <- rep(NA, E)

for (k in 1:E) {
  # full gradient of the least squares objective (1/n) * ||y - Xb||^2
  gradFull <- (2/n) * ((t(X) %*% X) %*% b - t(X) %*% y)
  v <- -1 * eta * gradFull
  b <- b + v
  err[k] <- max(abs(b - betaOls))
}

plot(err, log = "y", type = "l", xlab = "iteration")
grid()
```

Notice that the log error decreases almost exactly linearly in the iteration number, and that the error is still relatively high after 100 iterations.

## Newton's method

Newton's method uses a quadratic approximation to the function to estimate not just which direction to move in, but also how far to move in that direction. It is often augmented with a learning rate, though this rate plays a different role than the one in gradient descent: under fairly weak assumptions, Newton's method converges even when the learning rate is equal to 1.

```{r}
gamma <- 0.5
b <- rep(0, p)
E <- 100
err <- rep(NA, E)

for (k in 1:E) {
  gradFull <- (2/n) * ((t(X) %*% X) %*% b - t(X) %*% y)
  # Hessian of the least squares objective; constant and positive definite
  hessFull <- (2/n) * t(X) %*% X
  v <- -1 * gamma * solve(hessFull) %*% gradFull
  b <- b + v
  err[k] <- max(abs(b - betaOls))
}

plot(err, log = "y", type = "l", xlab = "iteration")
grid()
```

Because the linear regression objective is itself quadratic, a learning rate of 1 would solve the problem exactly in a single step. We see that even setting it to 0.5 reaches double-precision accuracy within about 50 iterations.

## Basic momentum

What if, instead of computing the Hessian, we approximate that curvature information by using basic momentum, with `mu` set to 0.8?

```{r}
eta <- 0.01
v <- 0
mu <- 0.8
b <- rep(0, p)
E <- 100
err <- rep(NA, E)

for (k in 1:E) {
  gradFull <- (2/n) * ((t(X) %*% X) %*% b - t(X) %*% y)
  # velocity: decay the previous step by mu and add the new gradient step
  v <- v * mu - eta * gradFull
  b <- b + v
  err[k] <- max(abs(b - betaOls))
}

plot(err, log = "y", type = "l", xlab = "iteration")
grid()
```

Though slightly noisier at some points along the curve, this gives a drastically better convergence rate than plain gradient descent!

## Nesterov's accelerated gradient (i.e., momentum)

Next, we'll use Nesterov's trick: evaluate the gradient at the look-ahead point `b + mu * v` rather than at the current value of `b`:

```{r}
eta <- 0.01
v <- 0
mu <- 0.8
b <- rep(0, p)
E <- 100
err <- rep(NA, E)

for (k in 1:E) {
  # look-ahead point: where the momentum term alone would take us
  bhalf <- b + v * mu
  gradFull <- (2/n) * ((t(X) %*% X) %*% bhalf - t(X) %*% y)
  v <- v * mu - eta * gradFull
  b <- b + v
  err[k] <- max(abs(b - betaOls))
}

plot(err, log = "y", type = "l", xlab = "iteration")
grid()
```

The benefit here is slight, but still noticeable. For noisier functions, where the gradient changes rapidly, the difference becomes more pronounced.
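Before moving on to stochastic gradients, it helps to write down the update rules that the chunks above implement. Writing $g_k = \nabla f(b_k)$ for the gradient of the objective $f$ at the current iterate, $H$ for the Hessian, $\eta$ for the learning rate, $\gamma$ for Newton's step size, and $\mu$ for the momentum parameter (the notation here simply mirrors the variable names in the code):

$$
\begin{aligned}
\text{gradient descent:}\quad & b_{k+1} = b_k - \eta\, g_k \\
\text{Newton's method:}\quad & b_{k+1} = b_k - \gamma\, H^{-1} g_k \\
\text{basic momentum:}\quad & v_{k+1} = \mu\, v_k - \eta\, g_k, \qquad b_{k+1} = b_k + v_{k+1} \\
\text{Nesterov momentum:}\quad & v_{k+1} = \mu\, v_k - \eta\, \nabla f(b_k + \mu\, v_k), \qquad b_{k+1} = b_k + v_{k+1}
\end{aligned}
$$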
## Example of SGD

Finally, let's put this all together and use Nesterov momentum within stochastic gradient descent (SGD):

```{r}
eta <- 0.01
b <- rep(0, p)
m <- 5
E <- 10
mu <- 0.8
v <- 0
numBatches <- n / m   # mini-batches per epoch
err <- rep(NA, numBatches * E)

for (k in 1:E) {
  # reshuffle the observations into mini-batches of size m at the start of each epoch
  miniBatchIndex <- matrix(sample(1:n), ncol = m)
  for (j in 1:nrow(miniBatchIndex)) {
    ind <- miniBatchIndex[j, ]
    bhalf <- b + v * mu
    gradMini <- (2/m) * (t(X[ind, ]) %*% X[ind, ] %*% bhalf - t(X[ind, ]) %*% y[ind])
    # decay the learning rate by a factor of 1/k at each epoch
    v <- v * mu - eta * (1 / k) * gradMini
    b <- b + v
    err[j + (k - 1) * numBatches] <- max(abs(b - betaOls))
  }
}

plot(err, log = "y", type = "l", axes = FALSE, xlab = "epoch")
box()
axis(2)
axis(1, at = (0:E) * numBatches, labels = 0:E)
abline(v = (1:E) * numBatches)
```

Clearly, if we need to get close to the true solution in a short amount of time, SGD with momentum is a great technique. The benefits are even more pronounced when dealing with the non-convex loss functions of neural networks.
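As a quick sanity check, we can compare the final SGD-with-momentum estimate against the closed-form normal-equations solution computed directly (this assumes the objects `X`, `y`, `b`, and `betaOls` from the chunks above are still in the workspace):

```{r}
# Closed-form OLS solution via the normal equations, t(X) %*% X %*% beta = t(X) %*% y
betaNormal <- solve(crossprod(X), crossprod(X, y))

# The two closed-form computations should agree to near machine precision
max(abs(betaNormal - betaOls))

# Distance of the final SGD-with-momentum estimate from the OLS solution
max(abs(b - betaOls))
```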