Let’s now use a support vector machine to build a classification algorithm. Here, I am using the e1071 package in R, which is essentially a wrapper around the LIBSVM library. LIBSVM implements the algorithm we described today (i.e., it directly solves the dual problem).
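For reference, the soft-margin dual that LIBSVM solves can be written (in the standard notation, with labels $y_i \in \{-1, +1\}$, cost parameter $C$, and kernel $K$) as

$$\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
\quad \text{subject to} \quad 0 \le \alpha_i \le C, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0.$$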
We will start by loading the housing data:
set.seed(1)
x <- read.csv("../../data/CAPA.csv", as.is=TRUE)
names(x) <- tolower(names(x))
x <- na.omit(x)
And construct the same training and testing sets as in the previous class:
X <- x[, c(6,11:15,33:34)]             # the eight predictor columns
trainFlag <- (runif(nrow(x)) < 0.66)   # roughly a 2/3 training split
y <- as.numeric(x$statefp == 6)        # 1 for California tracts, 0 for Pennsylvania
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
Now fit a linear SVM. The svm function will try to guess whether you want to do classification or regression (we have not discussed the extension of SVMs to regression); I prefer to specify the type manually because the difference is easy to miss:
library(e1071)
out <- svm(Xtrain, ytrain, type="C-classification",
kernel="linear")
out
##
## Call:
## svm.default(x = Xtrain, y = ytrain, type = "C-classification",
## kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
## gamma: 0.125
##
## Number of Support Vectors: 1435
We then predict on the test set with the predict function. Note that predict returns a factor, so as.numeric maps the classes to 1 and 2; we subtract one to get back to the original 0/1 coding.
svmYhat <- as.numeric(predict(out, Xtest)) - 1L
mean(ytest != svmYhat)
## [1] 0.07451196
With a linear kernel, the SVM looks for a separating hyperplane in the original feature space; notice that this performs roughly as well as the tree-based classification algorithms.
If we instead set the kernel parameter to ‘radial’,
out <- svm(Xtrain, ytrain, type="C-classification",
kernel="radial")
out
##
## Call:
## svm.default(x = Xtrain, y = ytrain, type = "C-classification",
## kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.125
##
## Number of Support Vectors: 1389
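For reference, the radial kernel is K(x, x') = exp(−γ‖x − x'‖²), and e1071’s default γ is one over the number of predictors; with our eight predictor columns that gives the 0.125 reported in both fits above.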
The radial-kernel SVM performs noticeably better than the linear-kernel fit:
svmYhat <- as.numeric(predict(out, Xtest)) - 1L
mean(ytest != svmYhat)
## [1] 0.0577399
We can also change the cost parameter, and in this case observe an improvement over the default cost of 1:
out <- svm(Xtrain, ytrain, type="C-classification",
kernel="radial", cost=5)
svmYhat <- as.numeric(predict(out, Xtest)) - 1L
mean(ytest != svmYhat)
## [1] 0.05361562
As before, I want to fit an additive model on the training set and use it as a meta-predictor in the random forest and gradient boosted tree models. However, instead of directly predicting the classification response (which we could do), I will use the additive model to predict the most important variable in the tree models (the median number of rooms):
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-7. For overview type 'help("mgcv-package")'.
x.gam2 <- gam(median_rooms
~ s(median_house_value) + s(mean_household_income)
+ s(population) + s(total_units) + s(vacant_units)
+ s(median_household_income) + s(mean_household_size_owners)
+ s(mean_household_size_renters),
data=x, subset=trainFlag)
We then reconstruct the test and training matrices:
x$gamPred <- predict(x.gam2, x)   # append the GAM predictions as a new column
X <- x[, c(6,11:15,33:35)]        # the original eight predictors plus gamPred
y <- as.numeric(x$statefp == 6)
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
And re-run the random forest estimator:
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
rfObj <- randomForest(Xtrain, factor(ytrain),
Xtest, factor(ytest),
do.trace=FALSE, keep.forest=TRUE,
ntree=500L)
rfYhat <- predict(rfObj, Xtest)
As well as the support vector machines:
out <- svm(Xtrain, ytrain, type="C-classification",
kernel="linear", cost=100)
svmLinearYhat <- as.numeric(predict(out, Xtest)) - 1L
out <- svm(Xtrain, ytrain, type="C-classification",
kernel="radial")
svmRadialYhat <- as.numeric(predict(out, Xtest)) - 1L
And compare the results:
mean(ytest != rfYhat)
## [1] 0.04756668
mean(ytest != svmLinearYhat)
## [1] 0.04839153
mean(ytest != svmRadialYhat)
## [1] 0.03574374
So, when we use a non-linear kernel, the support vector machine performs noticeably better than the random forest in this example. We could also try to tune the cost and gamma parameters to get a slightly better overall fit.
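One way to do that tuning, if we wanted, is e1071’s tune.svm helper, which runs a cross-validated grid search on the training set; a small sketch, where the grid values are only an illustrative starting point:
# Cross-validated grid search over cost and gamma (10-fold by default).
# This refits many SVMs, so it can take a while on a data set of this size.
tuneOut <- tune.svm(Xtrain, factor(ytrain), type="C-classification",
                    kernel="radial",
                    cost=c(1, 5, 10, 50), gamma=c(0.05, 0.125, 0.5))
tuneOut$best.parameters   # the best (cost, gamma) pair found by the search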
How different are the predicted classes from the support vector machine compared to the random forest?
table(svmRadialYhat, rfYhat)
## rfYhat
## svmRadialYhat 0 1
## 0 967 66
## 1 25 2579
In order to do some form of stacking, we need to compute both estimators on a probability scale. This can be done as follows:
rfYhatProb <- predict(rfObj, Xtest, type="prob")[,2]
out <- svm(Xtrain, ytrain, type="C-classification",
kernel="radial", probability=TRUE)
svmRadialYhatProb <- attributes(predict(out, Xtest, probability=TRUE))$probabilities[,1]
cor(rfYhatProb, svmRadialYhatProb)
## [1] 0.9686548
Notice that, again, the two estimators are quite similar but not identical. If we blend them together, we get a slight improvement over the SVM alone:
newYhat <- as.numeric(rfYhatProb*0.4 + svmRadialYhatProb*0.6 > 0.5)
mean(newYhat != ytest)
## [1] 0.03464394
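The 0.4/0.6 weights above were chosen by hand. As a small sketch of how one might pick them, we can scan a grid of blending weights; here the scan is scored on the test set purely for illustration, whereas a held-out validation set would be the safer choice:
# Scan the weight placed on the random forest probabilities from 0 to 1.
w <- seq(0, 1, by=0.05)
err <- sapply(w, function(wi) {
  blended <- wi * rfYhatProb + (1 - wi) * svmRadialYhatProb
  mean(as.numeric(blended > 0.5) != ytest)
})
w[which.min(err)]   # weight on the random forest with the lowest test error
min(err)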
To understand where we are still making errors, we can construct the confusion matrix of the radial-kernel SVM predictions against the true test labels.
table(svmRadialYhat, ytest)
## ytest
## svmRadialYhat 0 1
## 0 978 55
## 1 75 2529
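Before interpreting the table, a quick check of the predicted class frequencies against the actual ones (a small sketch):
table(svmRadialYhat)   # how many test tracts we predict in each class
table(ytest)           # how many there actually are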
It appears that we are not overly biased towards predicting points in either CA or PA. Where do these errant points occur?
these <- which(svmRadialYhat != ytest)   # indices of the misclassified test tracts
par(mar=c(0,0,0,0))
# set up the plot region with invisible points, overlay the map tiles,
# and then draw the misclassified tracts on top
plot(x$longitude[!trainFlag][these], x$latitude[!trainFlag][these],
     axes=FALSE, xlab="", ylab="", col="white")
snippets::osmap(tiles.url="http://c.tile.stamen.com/toner/", alpha=0.5)
points(x$longitude[!trainFlag][these], x$latitude[!trainFlag][these],
       pch=19, cex=0.4, col="orange")
box()
Switching gears for a moment, let’s return to the problem of predicting the median house price, using just the California data.
Say that we wanted to use the county codes in a predictive model (this is quite relevant to problem set 3); how might we do this, particularly with support vector machines, which do not natively handle categorical variables? One approach is to build a hierarchical model, which first ranks the counties in some fashion and then uses that ranking as a predictor variable.
set.seed(1)
x <- read.csv("../../data/CAPA.csv", as.is=TRUE)
names(x) <- tolower(names(x))
x <- na.omit(x)
ca <- x[x$statefp==6,] # just take CA data
trainFlag <- (runif(nrow(ca)) < 0.66)
This hierarchical model does not need to be overly complex. For example, we can just take the average median house value within each county on the training set:
tab <- tapply(ca$median_house_value[trainFlag], ca$countyfp[trainFlag], mean)
tab
## 1 3 5 7 9 11 13 15
## 515177.3 395600.0 323742.9 256211.4 366466.7 213350.0 454139.4 207266.7
## 17 19 21 23 25 27 29 31
## 405636.0 234236.5 258660.0 303747.4 160538.1 289350.0 188555.9 252017.6
## 33 35 37 39 41 43 45 47
## 216500.0 211700.0 452806.0 237864.7 757447.6 264250.0 401793.3 203094.1
## 49 51 53 55 57 59 61 63
## 151866.7 419533.3 414477.6 510805.3 410490.9 522405.8 403832.7 217875.0
## 65 67 69 71 73 75 77 79
## 273487.7 282013.2 451857.1 264718.3 434141.0 701880.2 227706.0 527571.0
## 81 83 85 87 89 91 93 95
## 663035.7 503654.5 589356.1 591994.3 257251.9 304700.0 197236.4 333773.3
## 97 99 101 103 105 107 109 111
## 499596.6 219882.0 247230.8 194733.3 270175.0 184054.5 302162.5 496460.7
## 113 115
## 433689.3 181300.0
And attach this to the data set:
index <- match(ca$countyfp, names(tab))   # look up each tract's county in the table
ca$countyMHP <- tab[index]                # county-level average of the median house values
X <- ca[, c(6,11:15,33:35)]               # predictors, now including countyMHP
y <- log(ca$median_house_value)           # response: log median house value
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
Now we can fit a random forest with this meta-predictor variable included:
rfObj <- randomForest(Xtrain, ytrain, Xtest, ytest,
do.trace=FALSE, keep.forest=TRUE,
ntree=500)
rfObj
##
## Call:
## randomForest(x = Xtrain, y = ytrain, xtest = Xtest, ytest = ytest, ntree = 500, do.trace = FALSE, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.07314528
## % Var explained: 73.11
## Test set MSE: 0.08
## % Var explained: 72.72
importance(rfObj)
## IncNodePurity
## population 43.21423
## total_units 38.21183
## vacant_units 43.61819
## median_rooms 87.65921
## mean_household_size_owners 59.91909
## mean_household_size_renters 83.30742
## median_household_income 191.94969
## mean_household_income 354.59927
## countyMHP 391.37913
This is significantly better than the same model without the county summary statistic; incidentally, that statistic is now the most important variable in the random forest.
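As a quick check on that claim, one could drop the countyMHP column, refit the forest, and compare the test-set mean squared errors directly; a minimal sketch:
# Refit without the county meta-predictor and compare test-set MSE at 500 trees.
keep <- setdiff(names(Xtrain), "countyMHP")
rfObjBase <- randomForest(Xtrain[, keep], ytrain, Xtest[, keep], ytest,
                          ntree=500L)
c(with_county = tail(rfObj$test$mse, 1),
  without_county = tail(rfObjBase$test$mse, 1))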