Let’s now use a support vector machine to build a classification algorithm. Here, I am using the e1071 package in R, which is essentially a wrapper around the LIBSVM library. LIBSVM implements the algorithm we described today (i.e., it directly solves the dual problem).
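For reference, the soft-margin dual that LIBSVM solves can be written (in the standard notation, with labels $y_i \in \{-1, +1\}$, cost parameter $C$, and kernel $K$) as

$$\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
\quad \text{subject to} \quad 0 \le \alpha_i \le C, \;\; \sum_{i=1}^{n} \alpha_i y_i = 0.$$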
We will start by loading the housing data:
set.seed(1)
x <- read.csv("../../data/CAPA.csv", as.is=TRUE)
names(x) <- tolower(names(x))
x <- na.omit(x)
And construct the same training and testing sets as in the previous class:
X <- x[, c(6,11:15,33:34)]             # the eight predictor columns
trainFlag <- (runif(nrow(x)) < 0.66)   # roughly a 2/3 training split
y <- as.numeric(x$statefp == 6)        # 1 for California tracts, 0 for Pennsylvania
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
Now fit a linear SVM. The svm function will try to guess whether you want to do classification or regression (we have not discussed the extension of SVMs to regression); I prefer to specify the type manually because the difference is easy to miss:
library(e1071)
out <- svm(Xtrain, ytrain, type="C-classification",
kernel="linear")
out
##
## Call:
## svm.default(x = Xtrain, y = ytrain, type = "C-classification",
## kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
## gamma: 0.125
##
## Number of Support Vectors: 1435
We then predict on the test set with the predict function. Note that predict returns a factor, so as.numeric maps the classes to 1 and 2; we subtract one to get back to the original 0/1 coding.
svmYhat <- as.numeric(predict(out, Xtest)) - 1L
mean(ytest != svmYhat)
## [1] 0.07451196
With a linear kernel, the SVM looks for a separating hyperplane in the original feature space; notice that this performs roughly as well as the tree-based classification algorithms.
If we instead set the kernel parameter to ‘radial’,
out <- svm(Xtrain, ytrain, type="C-classification",
kernel="radial")
out
##
## Call:
## svm.default(x = Xtrain, y = ytrain, type = "C-classification",
## kernel = "radial")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.125
##
## Number of Support Vectors: 1389
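For reference, the radial kernel is K(x, x') = exp(−γ‖x − x'‖²), and e1071’s default γ is one over the number of predictors; with our eight predictor columns that gives the 0.125 reported in both fits above.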
The radial-kernel SVM performs noticeably better than the linear-kernel fit:
svmYhat <- as.numeric(predict(out, Xtest)) - 1L
mean(ytest != svmYhat)
## [1] 0.0577399
We can also change the cost parameter, and in this case observe an improvement over the default cost of 1:
out <- svm(Xtrain, ytrain, type="C-classification",
kernel="radial", cost=5)
svmYhat <- as.numeric(predict(out, Xtest)) - 1L
mean(ytest != svmYhat)
## [1] 0.05361562
As before, I want to fit an additive model on the training set and use it as a meta-predictor in the random forest and gradient boosted tree models. However, instead of directly predicting the classification response (which we could do), I will use the additive model to predict the most important variable in the tree models (the median number of rooms):
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-7. For overview type 'help("mgcv-package")'.
x.gam2 <- gam(median_rooms
~ s(median_house_value) + s(mean_household_income)
+ s(population) + s(total_units) + s(vacant_units)
+ s(median_household_income) + s(mean_household_size_owners)
+ s(mean_household_size_renters),
data=x, subset=trainFlag)
We then reconstruct the test and training matrices:
x$gamPred <- predict(x.gam2, x)   # append the GAM predictions as a new column
X <- x[, c(6,11:15,33:35)]        # the original eight predictors plus gamPred
y <- as.numeric(x$statefp == 6)
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
And re-run the random forest estimator:
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
rfObj <- randomForest(Xtrain, factor(ytrain),
Xtest, factor(ytest),
do.trace=FALSE, keep.forest=TRUE,
ntree=500L)
rfYhat <- predict(rfObj, Xtest)
As well as the support vector machines:
out <- svm(Xtrain, ytrain, type="C-classification",
kernel="linear", cost=100)
svmLinearYhat <- as.numeric(predict(out, Xtest)) - 1L
out <- svm(Xtrain, ytrain, type="C-classification",
kernel="radial")
svmRadialYhat <- as.numeric(predict(out, Xtest)) - 1L
And compare the results:
mean(ytest != rfYhat)
## [1] 0.04756668
mean(ytest != svmLinearYhat)
## [1] 0.04839153
mean(ytest != svmRadialYhat)
## [1] 0.03574374
So, when we use a non-linear kernel, the support vector machine performs noticeably better than the random forest in this example. We could also try to tune the cost and gamma parameters to get a slightly better overall fit.
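One way to do that tuning, if we wanted, is e1071’s tune.svm helper, which runs a cross-validated grid search on the training set; a small sketch, where the grid values are only an illustrative starting point:
# Cross-validated grid search over cost and gamma (10-fold by default).
# This refits many SVMs, so it can take a while on a data set of this size.
tuneOut <- tune.svm(Xtrain, factor(ytrain), type="C-classification",
                    kernel="radial",
                    cost=c(1, 5, 10, 50), gamma=c(0.05, 0.125, 0.5))
tuneOut$best.parameters   # the best (cost, gamma) pair found by the search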
How different are the predicted classes from the support vector machine compared to the random forest?
table(svmRadialYhat, rfYhat)
## rfYhat
## svmRadialYhat 0 1
## 0 967 66
## 1 25 2579
In order to do some form of stacking, we need to compute both estimators on a probability scale. This can be done as follows:
rfYhatProb <- predict(rfObj, Xtest, type="prob")[,2]
out <- svm(Xtrain, ytrain, type="C-classification",
kernel="radial", probability=TRUE)
svmRadialYhatProb <- attributes(predict(out, Xtest, probability=TRUE))$probabilities[,1]
cor(rfYhatProb, svmRadialYhatProb)
## [1] 0.9686548
Notice that, again, the two estimators are quite similar but not identical. If we blend them together, we get a slight improvement over the SVM alone:
newYhat <- as.numeric(rfYhatProb*0.4 + svmRadialYhatProb*0.6 > 0.5)
mean(newYhat != ytest)
## [1] 0.03464394
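The 0.4/0.6 weights above were chosen by hand. As a small sketch of how one might pick them, we can scan a grid of blending weights; here the scan is scored on the test set purely for illustration, whereas a held-out validation set would be the safer choice:
# Scan the weight placed on the random forest probabilities from 0 to 1.
w <- seq(0, 1, by=0.05)
err <- sapply(w, function(wi) {
  blended <- wi * rfYhatProb + (1 - wi) * svmRadialYhatProb
  mean(as.numeric(blended > 0.5) != ytest)
})
w[which.min(err)]   # weight on the random forest with the lowest test error
min(err)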
To understand where we are still making errors, we can construct the confusion matrix of the radial-kernel SVM predictions against the true test labels.
table(svmRadialYhat, ytest)
## ytest
## svmRadialYhat 0 1
## 0 978 55
## 1 75 2529
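Before interpreting the table, a quick check of the predicted class frequencies against the actual ones (a small sketch):
table(svmRadialYhat)   # how many test tracts we predict in each class
table(ytest)           # how many there actually are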
It appears that we are not overly biased towards predicting points in either CA or PA. Where do these errant points occur?
these <- which(svmRadialYhat != ytest)   # indices of the misclassified test tracts
par(mar=c(0,0,0,0))
# set up the plot region with invisible points, overlay the map tiles,
# and then draw the misclassified tracts on top
plot(x$longitude[!trainFlag][these], x$latitude[!trainFlag][these],
     axes=FALSE, xlab="", ylab="", col="white")
snippets::osmap(tiles.url="http://c.tile.stamen.com/toner/", alpha=0.5)
points(x$longitude[!trainFlag][these], x$latitude[!trainFlag][these],
       pch=19, cex=0.4, col="orange")
box()
Switching gears for a moment, let’s return to the problem of predicting the median house price, using just the California data.
Say that we wanted to use the county codes in a predictive model (this is quite relevant to problem set 3); how might we do this, particularly with support vector machines, which do not natively handle categorical variables? One approach is to build a hierarchical model, which first ranks the counties in some fashion and then uses that ranking as a predictor variable.
set.seed(1)
x <- read.csv("../../data/CAPA.csv", as.is=TRUE)
names(x) <- tolower(names(x))
x <- na.omit(x)
ca <- x[x$statefp==6,] # just take CA data
trainFlag <- (runif(nrow(ca)) < 0.66)
This hierarchical model does not need to be overly complex. For example, we can just take the average median house value within each county on the training set:
tab <- tapply(ca$median_house_value[trainFlag], ca$countyfp[trainFlag], mean)
tab
## 1 3 5 7 9 11 13 15
## 515177.3 395600.0 323742.9 256211.4 366466.7 213350.0 454139.4 207266.7
## 17 19 21 23 25 27 29 31
## 405636.0 234236.5 258660.0 303747.4 160538.1 289350.0 188555.9 252017.6
## 33 35 37 39 41 43 45 47
## 216500.0 211700.0 452806.0 237864.7 757447.6 264250.0 401793.3 203094.1
## 49 51 53 55 57 59 61 63
## 151866.7 419533.3 414477.6 510805.3 410490.9 522405.8 403832.7 217875.0
## 65 67 69 71 73 75 77 79
## 273487.7 282013.2 451857.1 264718.3 434141.0 701880.2 227706.0 527571.0
## 81 83 85 87 89 91 93 95
## 663035.7 503654.5 589356.1 591994.3 257251.9 304700.0 197236.4 333773.3
## 97 99 101 103 105 107 109 111
## 499596.6 219882.0 247230.8 194733.3 270175.0 184054.5 302162.5 496460.7
## 113 115
## 433689.3 181300.0
And attach this to the data set:
index <- match(ca$countyfp, names(tab))   # look up each tract's county in the table
ca$countyMHP <- tab[index]                # county-level average of the median house values
X <- ca[, c(6,11:15,33:35)]               # predictors, now including countyMHP
y <- log(ca$median_house_value)           # response: log median house value
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
Now we can fit a random forest with this meta-predictor variable included:
rfObj <- randomForest(Xtrain, ytrain, Xtest, ytest,
do.trace=FALSE, keep.forest=TRUE,
ntree=500)
rfObj
##
## Call:
## randomForest(x = Xtrain, y = ytrain, xtest = Xtest, ytest = ytest, ntree = 500, do.trace = FALSE, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.07314528
## % Var explained: 73.11
## Test set MSE: 0.08
## % Var explained: 72.72
importance(rfObj)
## IncNodePurity
## population 43.21423
## total_units 38.21183
## vacant_units 43.61819
## median_rooms 87.65921
## mean_household_size_owners 59.91909
## mean_household_size_renters 83.30742
## median_household_income 191.94969
## mean_household_income 354.59927
## countyMHP 391.37913
This is significantly better than the same model without the county summary statistic; incidentally, that statistic is now the most important variable in the random forest.
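As a quick check on that claim, one could drop the countyMHP column, refit the forest, and compare the test-set mean squared errors directly; a minimal sketch:
# Refit without the county meta-predictor and compare test-set MSE at 500 trees.
keep <- setdiff(names(Xtrain), "countyMHP")
rfObjBase <- randomForest(Xtrain[, keep], ytrain, Xtest[, keep], ytest,
                          ntree=500L)
c(with_county = tail(rfObj$test$mse, 1),
  without_county = tail(rfObjBase$test$mse, 1))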