Last week, we talked about using decision trees for regression and classification problems. Here, we are going to actually apply these techniques to the California and Pennsylvania housing datasets.
As a first task, let us again take just the California data and try to predict the logarithm of the median house value in each census tract.
set.seed(1)
x <- read.csv("../../data/CAPA.csv", as.is=TRUE)
names(x) <- tolower(names(x))
x <- na.omit(x)
ca <- x[x$statefp==6,] # just take CA data
trainFlag <- (runif(nrow(ca)) < 0.66)
In order to fit a single regression tree, we use the R package tree. As a starting point, I will just use longitude and latitude.
library(tree)
tf <- tree(log(median_house_value) ~ longitude + latitude, data = ca)
plot(tf)
text(tf, cex=0.75)
It is easy to incorporate more variables in the model. How much do these change the tree that is learned from the data?
library(tree)
tf <- tree(log(median_house_value) ~ longitude + latitude +
median_household_income + median_rooms,
data = ca)
plot(tf)
text(tf, cex=0.75)
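Before adding more predictors, it is worth asking whether a tree this size is really supported by the data. Below is a minimal sketch of cost-complexity pruning using cv.tree and prune.tree from the same package; the choice of eight terminal nodes is purely illustrative, and in practice you would read it off the cross-validation curve.
cvtf <- cv.tree(tf) # cross-validated deviance for a sequence of subtrees
plot(cvtf$size, cvtf$dev, type="b", xlab="tree size", ylab="deviance")
tfPruned <- prune.tree(tf, best=8) # keep roughly the best 8-leaf subtree
plot(tfPruned)
text(tfPruned, cex=0.75)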
Which variables might we want to incorporate if we want a highly predictive model? I am going to start by using these:
names(ca)[c(6,11:15,33:34)]
## [1] "population" "total_units"
## [3] "vacant_units" "median_rooms"
## [5] "mean_household_size_owners" "mean_household_size_renters"
## [7] "median_household_income" "mean_household_income"
I find it easier to use matrices directly rather than the formula interface in the randomForest package (the formula interface is much slower and makes subsequent predictions on new data more difficult). Here I define a matrix of predictors for both the training and testing sets, as well as the response for both training and testing.
X <- ca[, c(6,11:15,33:34)]
y <- log(ca$median_house_value)
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
I will fit a small random forest estimator to this data with only 15 trees. Notice that we can give the test dataset directly to the function; if we also turn on the do.trace option, it reports how well we are predicting on the test set as the model adds more trees. Also notice that when a test set is supplied, we need to tell R explicitly to keep the forest in the output object. Otherwise the forest gets thrown away, making predictions on new datasets impossible (I have no idea why they thought this would be a good default).
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
rfObj <- randomForest(Xtrain, ytrain, Xtest, ytest,
do.trace=TRUE, keep.forest=TRUE,
ntree=15L)
## | Out-of-bag | Test set |
## Tree | MSE %Var(y) | MSE %Var(y) |
## 1 | 0.1904 70.00 | 0.2044 70.87 |
## 2 | 0.1806 66.40 | 0.1571 54.47 |
## 3 | 0.1762 64.79 | 0.1445 50.11 |
## 4 | 0.1694 62.29 | 0.1366 47.39 |
## 5 | 0.1663 61.14 | 0.1285 44.55 |
## 6 | 0.1557 57.26 | 0.1241 43.05 |
## 7 | 0.1485 54.58 | 0.1214 42.08 |
## 8 | 0.1447 53.21 | 0.1194 41.41 |
## 9 | 0.1426 52.42 | 0.1177 40.80 |
## 10 | 0.139 51.09 | 0.1164 40.37 |
## 11 | 0.1335 49.08 | 0.1154 40.02 |
## 12 | 0.1324 48.67 | 0.1156 40.09 |
## 13 | 0.1298 47.74 | 0.1158 40.14 |
## 14 | 0.1274 46.82 | 0.1151 39.92 |
## 15 | 0.125 45.95 | 0.1147 39.77 |
rfObj
##
## Call:
## randomForest(x = Xtrain, y = ytrain, xtest = Xtest, ytest = ytest, ntree = 15L, do.trace = TRUE, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 15
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.1249941
## % Var explained: 54.05
## Test set MSE: 0.11
## % Var explained: 60.24
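The per-tree errors printed by do.trace are also stored on the fitted object, so we can plot them directly. Here is a quick sketch using rfObj from above, with the out-of-bag MSE as a solid line and the test set MSE as a dashed line.
plot(rfObj$mse, type="l", xlab="number of trees", ylab="MSE",
     ylim=range(c(rfObj$mse, rfObj$test$mse)))
lines(rfObj$test$mse, lty="dashed")
legend("topright", legend=c("out-of-bag", "test set"), lty=c("solid", "dashed"))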
Notice that the randomForest estimator automatically calculates the training and testing MSE. Note: I would argue that we are really treating Xtest and ytest as a validation set, but I have named them as such to follow the terminology of the randomForest function. Now, let’s turn off the do.trace option and run the estimator with 500 trees.
rfObj <- randomForest(Xtrain, ytrain, Xtest, ytest,
do.trace=FALSE, keep.forest=TRUE,
ntree=500L)
rfObj
##
## Call:
## randomForest(x = Xtrain, y = ytrain, xtest = Xtest, ytest = ytest, ntree = 500L, do.trace = FALSE, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.1040941
## % Var explained: 61.73
## Test set MSE: 0.11
## % Var explained: 61.73
rfYhat <- predict(rfObj, Xtest)
mean((rfYhat - ytest)^2)
## [1] 0.1103721
Here we have explicitly calculated the test set MSE as well. How much better does the model with 485 additional trees do compared to the simpler one? As mentioned last time, we can use a random forest to calculate the importance of each of the variables in the model, as measured by how often each variable is split on and how much those splits decrease the MSE. This is done with the importance function (the importance scores themselves are computed at fit time; it is predicting on new data, as we did above, that requires keep.forest=TRUE).
importance(rfObj)
## IncNodePurity
## population 70.42507
## total_units 65.93293
## vacant_units 93.05627
## median_rooms 157.68088
## mean_household_size_owners 90.45918
## mean_household_size_renters 119.44500
## median_household_income 270.05590
## mean_household_income 410.49938
The scale of the output is not terribly important, and we are only going to focus on the relative sizes of these importance scores. Household income seems to be the most important, followed by the median number of rooms. Do these seem reasonable to you?
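The randomForest package can also display these scores graphically. A one-line sketch using the fitted object from above, which simply sorts the variables by importance in a dot chart:
varImpPlot(rfObj)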
While the exact structure of one tree in a random forest is difficult to make too much sense of in general, I think it is useful to understand how these trees are being stored as a data structure in R. Let’s look at the first few rows of the first tree object of the random forest:
head(getTree(rfObj,k=1),15)
## left daughter right daughter split var split point status prediction
## 1 2 3 4 5.650 -3 12.81532
## 2 4 5 5 3.145 -3 12.73616
## 3 6 7 3 76.500 -3 13.02461
## 4 8 9 2 1954.000 -3 12.82533
## 5 10 11 7 46570.000 -3 12.60980
## 6 12 13 8 95428.500 -3 13.13880
## 7 14 15 7 77486.000 -3 12.89324
## 8 16 17 1 1988.500 -3 12.76969
## 9 18 19 8 69669.000 -3 12.90005
## 10 20 21 7 35629.000 -3 12.47093
## 11 22 23 8 76361.500 -3 12.81060
## 12 24 25 5 3.625 -3 12.81310
## 13 26 27 7 106924.000 -3 13.32329
## 14 28 29 5 2.865 -3 12.58295
## 15 30 31 5 2.875 -3 13.14507
Why do you think the R implementation is using a matrix to store this information rather than a hierarchical list that better represents the actual structure of the tree?
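Purely to illustrate why a flat matrix is enough, here is a sketch of walking a single stored tree by hand for one observation. Interior nodes carry a split variable and split point (observations with a value less than or equal to the split point go to the left daughter); terminal nodes are marked here by a status of -1 and carry the prediction. The helper function below is my own, for illustration only; predict() of course does all of this internally.
predictOneTree <- function(tr, xrow) {
  node <- 1 # start at the root, stored in row 1
  while (tr[node, "status"] != -1) { # -1 marks a terminal node
    if (xrow[[ tr[node, "split var"] ]] <= tr[node, "split point"]) {
      node <- tr[node, "left daughter"]
    } else {
      node <- tr[node, "right daughter"]
    }
  }
  tr[node, "prediction"]
}
predictOneTree(getTree(rfObj, k=1), Xtest[1, ])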
Finally, note that we can easily increase the number of trees in the random forest even after it has already been constructed. Here we add 10 more trees with the grow function:
rfObj <- grow(rfObj, how.many=10)
rfObj
##
## Call:
## randomForest(x = Xtrain, y = ytrain, xtest = Xtest, ytest = ytest, ntree = 500L, do.trace = FALSE, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 510
## No. of variables tried at each split: 2
Let’s now return to the additive models we studied two weeks ago. I will fit an additive model on the training set using the same exact variables we used in the random forest:
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-7. For overview type 'help("mgcv-package")'.
ca.gam2 <- gam(log(median_house_value)
~ s(median_household_income) + s(mean_household_income)
+ s(population) + s(total_units) + s(vacant_units)
+ s(median_rooms) + s(mean_household_size_owners)
+ s(mean_household_size_renters)
, data=ca, subset=trainFlag)
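Before using the model for prediction, it can be helpful to glance at the estimated smooth terms; this is just mgcv’s default plot method applied to the fitted object, with one panel per predictor.
plot(ca.gam2, pages=1)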
Now, we can find predictions from the generalized additive model for both the training set and testing set. I will then reconstruct the X matrices to use this prediction itself as a predictor variable in the model.
ca$gamPred <- predict(ca.gam2, ca)
X <- ca[, c(6,11:15,33:35)]
y <- log(ca$median_house_value)
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
Now we can run another random forest on this new dataset. Notice that the MSE decreases slightly using this new variable.
rfObj <- randomForest(Xtrain, ytrain, Xtest, ytest,
do.trace=FALSE, keep.forest=TRUE)
rfObj
##
## Call:
## randomForest(x = Xtrain, y = ytrain, xtest = Xtest, ytest = ytest, do.trace = FALSE, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.09915329
## % Var explained: 63.55
## Test set MSE: 0.11
## % Var explained: 62.7
rfYhat <- predict(rfObj, Xtest)
mean((rfYhat - ytest)^2)
## [1] 0.1075728
How important is this new variable in the random forest model?
importance(rfObj)
## IncNodePurity
## population 50.97247
## total_units 47.14099
## vacant_units 53.38192
## median_rooms 76.87409
## mean_household_size_owners 60.17520
## mean_household_size_renters 74.61663
## median_household_income 153.18022
## mean_household_income 247.40573
## gamPred 524.55112
Now, I want to try gradient boosting on this same dataset. For this I will use the excellent gbm package in R. There are two functions for fitting gradient boosted trees (or machines, as the package calls them): gbm and gbm.fit. The first takes a formula object and the second takes raw matrices. As I said before, I prefer to deal with the raw matrix version as it is faster, but one needs to be careful that the training and testing matrices are constructed in exactly the same way. We will start with 100 trees and a relatively high shrinkage factor of 0.1:
library(gbm)
## Loading required package: survival
## Loading required package: lattice
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="gaussian",
n.trees=100L, shrinkage=0.1)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.2497 nan 0.1000 0.0220
## 2 0.2315 nan 0.1000 0.0177
## 3 0.2157 nan 0.1000 0.0157
## 4 0.2024 nan 0.1000 0.0128
## 5 0.1910 nan 0.1000 0.0113
## 6 0.1809 nan 0.1000 0.0100
## 7 0.1723 nan 0.1000 0.0087
## 8 0.1647 nan 0.1000 0.0073
## 9 0.1581 nan 0.1000 0.0059
## 10 0.1520 nan 0.1000 0.0060
## 20 0.1175 nan 0.1000 0.0016
## 40 0.1007 nan 0.1000 0.0003
## 60 0.0976 nan 0.1000 0.0000
## 80 0.0968 nan 0.1000 -0.0001
## 100 0.0962 nan 0.1000 -0.0001
Notice that the variable importance scores look a bit more lopsided than in the random forest case. Why might this be?
summary(gbmObj,plotit=FALSE)
## var rel.inf
## gamPred gamPred 95.8896590
## mean_household_income mean_household_income 3.0322903
## mean_household_size_renters mean_household_size_renters 0.3405273
## median_household_income median_household_income 0.2254245
## median_rooms median_rooms 0.2145368
## mean_household_size_owners mean_household_size_owners 0.1490498
## population population 0.1485122
## total_units total_units 0.0000000
## vacant_units vacant_units 0.0000000
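For a rough comparison with the random forest, here is a quick sketch (using the objects already defined above) of the test set MSE of this 100-tree fit; it is not shown in the trace because we did not hand gbm.fit any held-out observations.
gbmYhat100 <- predict(gbmObj, Xtest, n.trees=100)
mean((gbmYhat100 - ytest)^2)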
If we turn down the shrinkage, notice that the training deviance (for the Gaussian loss, just a fancy way of saying the MSE) decreases much more slowly, and after 100 trees the variable importance scores are even more lopsided.
library(gbm)
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="gaussian",
n.trees=100L, shrinkage=0.01)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.2697 nan 0.0100 0.0023
## 2 0.2674 nan 0.0100 0.0022
## 3 0.2653 nan 0.0100 0.0022
## 4 0.2631 nan 0.0100 0.0022
## 5 0.2610 nan 0.0100 0.0021
## 6 0.2589 nan 0.0100 0.0021
## 7 0.2568 nan 0.0100 0.0020
## 8 0.2548 nan 0.0100 0.0020
## 9 0.2528 nan 0.0100 0.0020
## 10 0.2508 nan 0.0100 0.0019
## 20 0.2331 nan 0.0100 0.0016
## 40 0.2049 nan 0.0100 0.0012
## 60 0.1837 nan 0.0100 0.0009
## 80 0.1672 nan 0.0100 0.0007
## 100 0.1543 nan 0.0100 0.0006
summary(gbmObj,plotit=FALSE)
## var rel.inf
## gamPred gamPred 100
## population population 0
## total_units total_units 0
## vacant_units vacant_units 0
## median_rooms median_rooms 0
## mean_household_size_owners mean_household_size_owners 0
## mean_household_size_renters mean_household_size_renters 0
## median_household_income median_household_income 0
## mean_household_income mean_household_income 0
Now I will let it run for a while with a modest shrinkage penalty. We will then look at a plot of the MSE on the test set as a function of the number of trees.
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="gaussian",
n.trees=1e4, shrinkage=0.03, verbose=FALSE)
ntree <- seq(100,gbmObj$n.trees,length.out=100)
gbmYhat <- predict(gbmObj, Xtest, n.trees=ntree)
r <- gbmYhat - matrix(ytest,ncol=100,nrow=length(ytest),byrow=FALSE)
mse <- apply(r^2,2,mean)
plot(ntree, mse, type="l")
abline(h=min(mse),lty="dashed")
text(ntree[100],min(mse)+0.0004,signif(min(mse),3))
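The gbm package also ships its own heuristic for choosing the number of trees, gbm.perf. Since we did not reserve a held-out fraction inside gbm.fit, only the out-of-bag estimate is available here, and the package itself warns that this tends to be conservative; still, it is a useful sanity check against curves like the one above.
bestIter <- gbm.perf(gbmObj, method="OOB", plot.it=FALSE)
bestIter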
If we decrease the shrinkage penalty from 0.03 to 0.005, notice what happens to the MSE on the test set:
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="gaussian",
n.trees=1e4, shrinkage=0.005, verbose=FALSE)
ntree <- seq(500,gbmObj$n.trees,length.out=100)
gbmYhat <- predict(gbmObj, Xtest, n.trees=ntree)
r <- gbmYhat - matrix(ytest,ncol=100,nrow=length(ytest),byrow=FALSE)
mse <- apply(r^2,2,mean)
plot(ntree, mse, type="l")
abline(h=min(mse),lty="dashed")
text(ntree[100],min(mse)+0.0004,signif(min(mse),3))
Notice that the model does slightly worse after 2500 trees and takes slightly longer to reach its optimal value, but it does not suffer as much when the number of trees is increased past that point. What if we try a still lower amount of shrinkage?
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="gaussian",
n.trees=1e4, shrinkage=0.001, verbose=FALSE)
ntree <- seq(500,gbmObj$n.trees,length.out=100)
gbmYhat <- predict(gbmObj, Xtest, n.trees=ntree)
r <- gbmYhat - matrix(ytest,ncol=100,nrow=length(ytest),byrow=FALSE)
mse <- apply(r^2,2,mean)
plot(ntree, mse, type="l")
abline(h=min(mse),lty="dashed")
text(ntree[100],min(mse)+0.0004,signif(min(mse),3))
As a general rule of thumb, you want to slowly decrease the shrinkage and produce curves like this, generally until the point where the shrinkage penalty is so small that performance never really decays on the back end of the curve. I emphasize this because it is one of those important secrets of applied machine learning that is well known by a set of practitioners but not actually written down anywhere (anywhere prominent, at least).
For a final model, we might choose to use this last fit and pick only 8500 trees. Do you see now why we are treating the test set more like a validation set?
gbmYhat <- predict(gbmObj, Xtest, n.trees=8500)
mean((gbmYhat - ytest)^2)
## [1] 0.1111789
Now, notice that we can improve on both the random forest and gradient boosted models by averaging their predictions. As the random forest is slightly more predictive, I will weight it a bit higher:
mean((rfYhat - ytest)^2)
## [1] 0.1075728
mean((gbmYhat - ytest)^2)
## [1] 0.1111789
mean((rfYhat*0.8 + gbmYhat*0.2 - ytest)^2)
## [1] 0.107298
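The 0.8/0.2 split above was picked by hand; nothing stops us from scanning a grid of weights instead. A small sketch using the objects defined above (strictly speaking, this tuning should happen on a separate validation split rather than on the same test set):
w <- seq(0, 1, by=0.05)
blendMse <- sapply(w, function(a) mean((a*rfYhat + (1-a)*gbmYhat - ytest)^2))
w[which.min(blendMse)] # weight on the random forest giving the lowest MSE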
Taking this a step further, we can even weight together the random forest, gradient boosted tree, and additive model. This produces a slightly better model still.
gamYhat <- as.numeric(ca$gamPred[!trainFlag])
mean((rfYhat*0.8 + gbmYhat*0.1 + gamYhat*0.1 - ytest)^2)
## [1] 0.1070845
The process of blending together models is called stacking (among many other names), and it is an important technique for producing state-of-the-art predictive models. It is also possible to 'learn' the weights on a separate part of the validation set. In a way, this is what we did by using the additive model as part of the random forest and gradient boosted tree models.
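As a sketch of what learning the weights might look like, we can regress the response on the three sets of predictions. Here, purely for illustration, this is done on the same held-out set; with enough data, this regression would get its own split so the weights are not tuned on the final test data.
stackDf <- data.frame(y = ytest, rf = rfYhat, gbm = gbmYhat, gam = gamYhat)
stackFit <- lm(y ~ rf + gbm + gam - 1, data = stackDf) # no intercept
coef(stackFit) # learned (unconstrained) blending weights
mean((predict(stackFit, stackDf) - ytest)^2)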
I would like to note that while for this small example it seems that gradient boosting is a lot more work with little reward, this is not in general the case. Typically it will be able to significantly improve on the results of the random forest.
It may have seemed like I should have used a separate chunk of data to fit the additive model, as opposed to the random forest and gradient boosted trees. It is true that, given a large enough dataset, this would be ideal. However, I am not particularly worried about this affecting the output. Why? Additive models have a relatively low complexity when we only have 8 variables, and will rarely overfit the training data.
In general, there is no big mysterious difference between classification and regression with decision trees. Some of the details are important to understand in terms of how these methods are employed in practice though, so I think it is useful to actually look at an example. I will take the same dataset as in the previous section:
set.seed(1)
x <- read.csv("../../data/CAPA.csv", as.is=TRUE)
names(x) <- tolower(names(x))
x <- na.omit(x)
But now use the entire dataset, setting the class that we want to predict equal to which state the census tract is located in:
X <- x[, c(6,11:15,33:34)]
trainFlag <- (runif(nrow(x)) < 0.66)
y <- as.numeric(x$statefp == 6)
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
We can run the random forest model, explicitly coding the y variables as factors to tell the function to run classification:
rfObj <- randomForest(Xtrain, factor(ytrain),
Xtest, factor(ytest),
do.trace=FALSE, keep.forest=TRUE,
ntree=500L)
rfYhat <- predict(rfObj, Xtest)
To evaluate the model, I will use the misclassification rate on the test set:
mean(rfYhat != ytest)
## [1] 0.06598845
Now, as before, let’s look at the variable importance scores for this model. Do these surprise you at all?
importance(rfObj)
## MeanDecreaseGini
## population 161.6781
## total_units 133.4970
## vacant_units 149.5611
## median_rooms 847.5819
## mean_household_size_owners 379.0625
## mean_household_size_renters 549.5154
## median_household_income 314.5436
## mean_household_income 373.0121
We can also fit a gradient boosted tree to the classification problem. In order to tell the gbm.fit function that we want classification, I will set the distribution parameter to the 'bernoulli' loss. How do the important variables differ from those in the random forest? How does the misclassification rate compare to the random forest estimator?
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="bernoulli",
n.trees=1e4, shrinkage=0.005, verbose=FALSE)
gbmYhat <- as.numeric(predict(gbmObj, Xtest, n.trees=gbmObj$n.trees) > 0)
summary(gbmObj, plotit=FALSE)
## var rel.inf
## median_rooms median_rooms 37.7247915
## mean_household_size_renters mean_household_size_renters 27.0391940
## mean_household_income mean_household_income 17.6024644
## mean_household_size_owners mean_household_size_owners 10.0502211
## median_household_income median_household_income 5.7804208
## population population 0.8391667
## vacant_units vacant_units 0.5404817
## total_units total_units 0.4232599
mean(gbmYhat != ytest)
## [1] 0.06406379
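One aside on the prediction scale: by default, predict.gbm with the bernoulli loss returns values on the log-odds scale, which is why the cutoff above is zero. Asking for type="response" gives probabilities instead, and a 0.5 cutoff on those is equivalent; a quick sketch using the objects above:
gbmProb <- predict(gbmObj, Xtest, n.trees=gbmObj$n.trees, type="response")
mean(as.numeric(gbmProb > 0.5) != ytest) # same misclassification rate as above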
As before, I want to fit an additive model to the training set and then use this as a meta-predictor in the random forest and gradient boosted tree models. However, instead of directly predicting the classification response (which we could do), I will instead use the additive model to predict the most important variable in the tree models (the median number of rooms):
library(mgcv)
x.gam2 <- gam(median_rooms
~ s(median_house_value) + s(mean_household_income)
+ s(population) + s(total_units) + s(vacant_units)
+ s(median_household_income) + s(mean_household_size_owners)
+ s(mean_household_size_renters),
data=x, subset=trainFlag)
I will now compute predictions from this model and put them into the predictor matrices Xtrain and Xtest.
x$gamPred <- predict(x.gam2, x)
X <- x[, c(6,11:15,33:35)]
trainFlag <- (runif(nrow(x)) < 0.66)
y <- as.numeric(x$statefp == 6)
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
This gives an impressive improvement on the misclassification rate for the random forest model:
rfObj <- randomForest(Xtrain, factor(ytrain),
Xtest, factor(ytest),
do.trace=FALSE, keep.forest=TRUE,
ntree=500L)
rfYhat <- predict(rfObj, Xtest)
mean(rfYhat != ytest)
## [1] 0.04337085
To gain an understanding and appreciation for where we are still making errors, we can construct the confusion matrix of the predictions from the random forest estimator.
table(rfYhat, ytest)
## ytest
## rfYhat 0 1
## 0 980 43
## 1 115 2505
It appears that we are not overly biased towards predicting points in either CA or PA. Where do these errant points occur?
these <- which(rfYhat != ytest)
par(mar=c(0,0,0,0))
plot(x$longitude[!trainFlag][these], x$latitude[!trainFlag][these],
axes=FALSE, xlab="", ylab="", col="white")
snippets::osmap(tiles.url="http://c.tile.stamen.com/toner/",alpha=0.5)
points(x$longitude[!trainFlag][these], x$latitude[!trainFlag][these],
pch=19, cex=0.4, col="orange")
box()
As we may have expected, areas in the most urban parts of PA and the rural parts of CA are the most likely to be classified incorrectly.
Finally, we can see that the additive model meta-variable also has a similar improvement on the gradient boosted trees:
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="bernoulli",
n.trees=2500, shrinkage=0.05, verbose=FALSE)
gbmYhat <- as.numeric(predict(gbmObj, Xtest, n.trees=gbmObj$n.trees) > 0)
summary(gbmObj, plotit=FALSE)
## var rel.inf
## median_rooms median_rooms 26.6462615
## mean_household_size_renters mean_household_size_renters 25.5736402
## mean_household_income mean_household_income 16.1747340
## gamPred gamPred 12.7770390
## median_household_income median_household_income 8.1728206
## mean_household_size_owners mean_household_size_owners 7.2388484
## vacant_units vacant_units 1.7353594
## population population 1.0531226
## total_units total_units 0.6281745
Though it still fails to best the random forest estimator in this particular case:
mean(rfYhat != ytest)
## [1] 0.04337085
mean(gbmYhat != ytest)
## [1] 0.04858633
Though, as stated before, this is often not the case in more complex models.
We have looked at three methods for combining simple models into powerful prediction engines: random forests, gradient boosted trees, and stacking. We have seen that these ideas can be used in conjunction with one another to produce fairly powerful predictive models.