Last week, we talked about using decision trees for regression and classification problems. Here, we are going to actually apply these techniques to the California and Pennsylvania housing datasets.
As a first task, let us again take just the California data and try to predict the logarithm of the median house value in each census tract.
set.seed(1)
x <- read.csv("../../data/CAPA.csv", as.is=TRUE)
names(x) <- tolower(names(x))
x <- na.omit(x)
ca <- x[x$statefp==6,] # just take CA data
trainFlag <- (runif(nrow(ca)) < 0.66)
In order to fit a single regression tree, we use the R package tree. As a starting point, I will just use longitude and latitude.
library(tree)
tf <- tree(log(median_house_value) ~ longitude + latitude, data = ca)
plot(tf)
text(tf, cex=0.75)
It is easy to incorporate more variables in the model. How much do these change the tree that is learned from the data?
library(tree)
tf <- tree(log(median_house_value) ~ longitude + latitude +
median_household_income + median_rooms,
data = ca)
plot(tf)
text(tf, cex=0.75)
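Before adding more predictors, it is worth asking whether a tree this size is really supported by the data. Below is a minimal sketch of cost-complexity pruning using cv.tree and prune.tree from the same package; the choice of eight terminal nodes is purely illustrative, and in practice you would read it off the cross-validation curve.
cvtf <- cv.tree(tf) # cross-validated deviance for a sequence of subtrees
plot(cvtf$size, cvtf$dev, type="b", xlab="tree size", ylab="deviance")
tfPruned <- prune.tree(tf, best=8) # keep roughly the best 8-leaf subtree
plot(tfPruned)
text(tfPruned, cex=0.75)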
Which variables might we want to incorporate if we want a highly predictive model? I am going to start by using these:
names(ca)[c(6,11:15,33:34)]
## [1] "population" "total_units"
## [3] "vacant_units" "median_rooms"
## [5] "mean_household_size_owners" "mean_household_size_renters"
## [7] "median_household_income" "mean_household_income"
I find it easier to use matrices directly rather than the formula interface in the randomForest package (the formula interface is much slower and makes subsequent predictions on new data more difficult). Here I define a matrix of predictors for both the training and testing sets, as well as the response for both training and testing.
X <- ca[, c(6,11:15,33:34)]
y <- log(ca$median_house_value)
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
I will fit a small random forest estimator to this data with only 15 trees. Notice that we can give the test dataset directly to the function; if we also turn on the do.trace option, it reports how well we are predicting on the test set as the model adds more trees. Also notice that when a test set is supplied, we need to tell R explicitly to keep the forest in the output object. Otherwise the forest gets thrown away, making predictions on new datasets impossible (I have no idea why they thought this would be a good default).
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
rfObj <- randomForest(Xtrain, ytrain, Xtest, ytest,
do.trace=TRUE, keep.forest=TRUE,
ntree=15L)
## | Out-of-bag | Test set |
## Tree | MSE %Var(y) | MSE %Var(y) |
## 1 | 0.1904 70.00 | 0.2044 70.87 |
## 2 | 0.1806 66.40 | 0.1571 54.47 |
## 3 | 0.1762 64.79 | 0.1445 50.11 |
## 4 | 0.1694 62.29 | 0.1366 47.39 |
## 5 | 0.1663 61.14 | 0.1285 44.55 |
## 6 | 0.1557 57.26 | 0.1241 43.05 |
## 7 | 0.1485 54.58 | 0.1214 42.08 |
## 8 | 0.1447 53.21 | 0.1194 41.41 |
## 9 | 0.1426 52.42 | 0.1177 40.80 |
## 10 | 0.139 51.09 | 0.1164 40.37 |
## 11 | 0.1335 49.08 | 0.1154 40.02 |
## 12 | 0.1324 48.67 | 0.1156 40.09 |
## 13 | 0.1298 47.74 | 0.1158 40.14 |
## 14 | 0.1274 46.82 | 0.1151 39.92 |
## 15 | 0.125 45.95 | 0.1147 39.77 |
rfObj
##
## Call:
## randomForest(x = Xtrain, y = ytrain, xtest = Xtest, ytest = ytest, ntree = 15L, do.trace = TRUE, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 15
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.1249941
## % Var explained: 54.05
## Test set MSE: 0.11
## % Var explained: 60.24
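The per-tree errors printed by do.trace are also stored on the fitted object, so we can plot them directly. Here is a quick sketch using rfObj from above, with the out-of-bag MSE as a solid line and the test set MSE as a dashed line.
plot(rfObj$mse, type="l", xlab="number of trees", ylab="MSE",
     ylim=range(c(rfObj$mse, rfObj$test$mse)))
lines(rfObj$test$mse, lty="dashed")
legend("topright", legend=c("out-of-bag", "test set"), lty=c("solid", "dashed"))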
Notice that the randomForest estimator automatically calculates the training and testing MSE. Note: I would argue that we are really treating Xtest and ytest as a validation set, but I have named them as such to follow the terminology of the randomForest function. Now, let’s turn off the do.trace option and run the estimator with 500 trees.
rfObj <- randomForest(Xtrain, ytrain, Xtest, ytest,
do.trace=FALSE, keep.forest=TRUE,
ntree=500L)
rfObj
##
## Call:
## randomForest(x = Xtrain, y = ytrain, xtest = Xtest, ytest = ytest, ntree = 500L, do.trace = FALSE, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.1040941
## % Var explained: 61.73
## Test set MSE: 0.11
## % Var explained: 61.73
rfYhat <- predict(rfObj, Xtest)
mean((rfYhat - ytest)^2)
## [1] 0.1103721
Here we have explicitly calculated the test set MSE as well. How much better does the model with 485 additional trees do compared to the simpler one? As mentioned last time, we can use a random forest to calculate the importance of each of the variables in the model, as measured by how often each variable is split on and how much those splits decrease the MSE. This is done with the importance function (the importance scores themselves are computed at fit time; it is predicting on new data, as we did above, that requires keep.forest=TRUE).
importance(rfObj)
## IncNodePurity
## population 70.42507
## total_units 65.93293
## vacant_units 93.05627
## median_rooms 157.68088
## mean_household_size_owners 90.45918
## mean_household_size_renters 119.44500
## median_household_income 270.05590
## mean_household_income 410.49938
The scale of the output is not terribly important, and we are only going to focus on the relative sizes of these importance scores. Household income seems to be the most important, followed by the median number of rooms. Do these seem reasonable to you?
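The randomForest package can also display these scores graphically. A one-line sketch using the fitted object from above, which simply sorts the variables by importance in a dot chart:
varImpPlot(rfObj)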
While the exact structure of one tree in a random forest is difficult to make too much sense of in general, I think it is useful to understand how these trees are being stored as a data structure in R. Let’s look at the first few rows of the first tree object of the random forest:
head(getTree(rfObj,k=1),15)
## left daughter right daughter split var split point status prediction
## 1 2 3 4 5.650 -3 12.81532
## 2 4 5 5 3.145 -3 12.73616
## 3 6 7 3 76.500 -3 13.02461
## 4 8 9 2 1954.000 -3 12.82533
## 5 10 11 7 46570.000 -3 12.60980
## 6 12 13 8 95428.500 -3 13.13880
## 7 14 15 7 77486.000 -3 12.89324
## 8 16 17 1 1988.500 -3 12.76969
## 9 18 19 8 69669.000 -3 12.90005
## 10 20 21 7 35629.000 -3 12.47093
## 11 22 23 8 76361.500 -3 12.81060
## 12 24 25 5 3.625 -3 12.81310
## 13 26 27 7 106924.000 -3 13.32329
## 14 28 29 5 2.865 -3 12.58295
## 15 30 31 5 2.875 -3 13.14507
Why do you think the R implementation is using a matrix to store this information rather than a hierarchical list that better represents the actual structure of the tree?
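Purely to illustrate why a flat matrix is enough, here is a sketch of walking a single stored tree by hand for one observation. Interior nodes carry a split variable and split point (observations with a value less than or equal to the split point go to the left daughter); terminal nodes are marked here by a status of -1 and carry the prediction. The helper function below is my own, for illustration only; predict() of course does all of this internally.
predictOneTree <- function(tr, xrow) {
  node <- 1 # start at the root, stored in row 1
  while (tr[node, "status"] != -1) { # -1 marks a terminal node
    if (xrow[[ tr[node, "split var"] ]] <= tr[node, "split point"]) {
      node <- tr[node, "left daughter"]
    } else {
      node <- tr[node, "right daughter"]
    }
  }
  tr[node, "prediction"]
}
predictOneTree(getTree(rfObj, k=1), Xtest[1, ])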
Finally, note that we can easily increase the number of trees in the random forest even after it has already been constructed. Here we add 10 more trees with the grow function:
rfObj <- grow(rfObj, how.many=10)
rfObj
##
## Call:
## randomForest(x = Xtrain, y = ytrain, xtest = Xtest, ytest = ytest, ntree = 500L, do.trace = FALSE, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 510
## No. of variables tried at each split: 2
Let’s now return to the additive models we studied two weeks ago. I will fit an additive model on the training set using the same exact variables we used in the random forest:
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-7. For overview type 'help("mgcv-package")'.
ca.gam2 <- gam(log(median_house_value)
~ s(median_household_income) + s(mean_household_income)
+ s(population) + s(total_units) + s(vacant_units)
+ s(median_rooms) + s(mean_household_size_owners)
+ s(mean_household_size_renters)
, data=ca, subset=trainFlag)
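Before using the model for prediction, it can be helpful to glance at the estimated smooth terms; this is just mgcv’s default plot method applied to the fitted object, with one panel per predictor.
plot(ca.gam2, pages=1)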
Now, we can find predictions from the generalized additive model for both the training set and testing set. I will then reconstruct the X matrices to use this prediction itself as a predictor variable in the model.
ca$gamPred <- predict(ca.gam2, ca)
X <- ca[, c(6,11:15,33:35)]
y <- log(ca$median_house_value)
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
Now we can run another random forest on this new dataset. Notice that the MSE decreases slightly using this new variable.
rfObj <- randomForest(Xtrain, ytrain, Xtest, ytest,
do.trace=FALSE, keep.forest=TRUE)
rfObj
##
## Call:
## randomForest(x = Xtrain, y = ytrain, xtest = Xtest, ytest = ytest, do.trace = FALSE, keep.forest = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.09915329
## % Var explained: 63.55
## Test set MSE: 0.11
## % Var explained: 62.7
rfYhat <- predict(rfObj, Xtest)
mean((rfYhat - ytest)^2)
## [1] 0.1075728
How important is this new variable in the random forest model?
importance(rfObj)
## IncNodePurity
## population 50.97247
## total_units 47.14099
## vacant_units 53.38192
## median_rooms 76.87409
## mean_household_size_owners 60.17520
## mean_household_size_renters 74.61663
## median_household_income 153.18022
## mean_household_income 247.40573
## gamPred 524.55112
Now, I want to try gradient boosting on this same dataset. For this I will use the excellent gbm package in R. There are two functions for fitting gradient boosted trees (or machines, as the package calls them): gbm and gbm.fit. The first takes a formula object and the second takes raw matrices. As I said before, I prefer to deal with the raw matrix version as it is faster, but one needs to be careful that the training and testing matrices are constructed in exactly the same way. We will start with 100 trees and a relatively high shrinkage factor of 0.1:
library(gbm)
## Loading required package: survival
## Loading required package: lattice
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="gaussian",
n.trees=100L, shrinkage=0.1)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.2497 nan 0.1000 0.0220
## 2 0.2315 nan 0.1000 0.0177
## 3 0.2157 nan 0.1000 0.0157
## 4 0.2024 nan 0.1000 0.0128
## 5 0.1910 nan 0.1000 0.0113
## 6 0.1809 nan 0.1000 0.0100
## 7 0.1723 nan 0.1000 0.0087
## 8 0.1647 nan 0.1000 0.0073
## 9 0.1581 nan 0.1000 0.0059
## 10 0.1520 nan 0.1000 0.0060
## 20 0.1175 nan 0.1000 0.0016
## 40 0.1007 nan 0.1000 0.0003
## 60 0.0976 nan 0.1000 0.0000
## 80 0.0968 nan 0.1000 -0.0001
## 100 0.0962 nan 0.1000 -0.0001
Notice that the variable importance scores look a bit more lopsided than in the random forest case. Why might this be?
summary(gbmObj,plotit=FALSE)
## var rel.inf
## gamPred gamPred 95.8896590
## mean_household_income mean_household_income 3.0322903
## mean_household_size_renters mean_household_size_renters 0.3405273
## median_household_income median_household_income 0.2254245
## median_rooms median_rooms 0.2145368
## mean_household_size_owners mean_household_size_owners 0.1490498
## population population 0.1485122
## total_units total_units 0.0000000
## vacant_units vacant_units 0.0000000
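For a rough comparison with the random forest, here is a quick sketch (using the objects already defined above) of the test set MSE of this 100-tree fit; it is not shown in the trace because we did not hand gbm.fit any held-out observations.
gbmYhat100 <- predict(gbmObj, Xtest, n.trees=100)
mean((gbmYhat100 - ytest)^2)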
If we turn down the shrinkage, notice that the training deviance (for the Gaussian loss, just a fancy way of saying the MSE) decreases much more slowly, and after 100 trees the variable importance scores are even more lopsided.
library(gbm)
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="gaussian",
n.trees=100L, shrinkage=0.01)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 0.2697 nan 0.0100 0.0023
## 2 0.2674 nan 0.0100 0.0022
## 3 0.2653 nan 0.0100 0.0022
## 4 0.2631 nan 0.0100 0.0022
## 5 0.2610 nan 0.0100 0.0021
## 6 0.2589 nan 0.0100 0.0021
## 7 0.2568 nan 0.0100 0.0020
## 8 0.2548 nan 0.0100 0.0020
## 9 0.2528 nan 0.0100 0.0020
## 10 0.2508 nan 0.0100 0.0019
## 20 0.2331 nan 0.0100 0.0016
## 40 0.2049 nan 0.0100 0.0012
## 60 0.1837 nan 0.0100 0.0009
## 80 0.1672 nan 0.0100 0.0007
## 100 0.1543 nan 0.0100 0.0006
summary(gbmObj,plotit=FALSE)
## var rel.inf
## gamPred gamPred 100
## population population 0
## total_units total_units 0
## vacant_units vacant_units 0
## median_rooms median_rooms 0
## mean_household_size_owners mean_household_size_owners 0
## mean_household_size_renters mean_household_size_renters 0
## median_household_income median_household_income 0
## mean_household_income mean_household_income 0
Now I will let it run for a while with a modest shrinkage penalty. We will then look at a plot of the MSE on the test set as a function of the number of trees.
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="gaussian",
n.trees=1e4, shrinkage=0.03, verbose=FALSE)
ntree <- seq(100,gbmObj$n.trees,length.out=100)
gbmYhat <- predict(gbmObj, Xtest, n.trees=ntree)
r <- gbmYhat - matrix(ytest,ncol=100,nrow=length(ytest),byrow=FALSE)
mse <- apply(r^2,2,mean)
plot(ntree, mse, type="l")
abline(h=min(mse),lty="dashed")
text(ntree[100],min(mse)+0.0004,signif(min(mse),3))
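The gbm package also ships its own heuristic for choosing the number of trees, gbm.perf. Since we did not reserve a held-out fraction inside gbm.fit, only the out-of-bag estimate is available here, and the package itself warns that this tends to be conservative; still, it is a useful sanity check against curves like the one above.
bestIter <- gbm.perf(gbmObj, method="OOB", plot.it=FALSE)
bestIter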
If we decrease the shrinkage penalty from 0.03 to 0.005, notice what happens to the MSE on the test set:
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="gaussian",
n.trees=1e4, shrinkage=0.005, verbose=FALSE)
ntree <- seq(500,gbmObj$n.trees,length.out=100)
gbmYhat <- predict(gbmObj, Xtest, n.trees=ntree)
r <- gbmYhat - matrix(ytest,ncol=100,nrow=length(ytest),byrow=FALSE)
mse <- apply(r^2,2,mean)
plot(ntree, mse, type="l")
abline(h=min(mse),lty="dashed")
text(ntree[100],min(mse)+0.0004,signif(min(mse),3))
Notice that the model does slightly worse after 2500 trees and takes slightly longer to reach its optimal value, but it does not suffer as much when the number of trees is increased past that point. What if we try a still lower amount of shrinkage?
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="gaussian",
n.trees=1e4, shrinkage=0.001, verbose=FALSE)
ntree <- seq(500,gbmObj$n.trees,length.out=100)
gbmYhat <- predict(gbmObj, Xtest, n.trees=ntree)
r <- gbmYhat - matrix(ytest,ncol=100,nrow=length(ytest),byrow=FALSE)
mse <- apply(r^2,2,mean)
plot(ntree, mse, type="l")
abline(h=min(mse),lty="dashed")
text(ntree[100],min(mse)+0.0004,signif(min(mse),3))
As a general rule of thumb, you want to slowly decrease the shrinkage and produce curves like this, generally until the point where the shrinkage penalty is so small that performance never really decays on the back end of the curve. I emphasize this because it is one of those important secrets of applied machine learning that is well known by a set of practitioners but not actually written down anywhere (anywhere prominent, at least).
For a final model, we might choose to use this last fit and pick only 8500 trees. Do you see now why we are treating the test set more like a validation set?
gbmYhat <- predict(gbmObj, Xtest, n.trees=8500)
mean((gbmYhat - ytest)^2)
## [1] 0.1111789
Now, notice that we can improve on both the random forest and gradient boosted models by averaging their predictions. As the random forest is slightly more predictive, I will weight it a bit higher:
mean((rfYhat - ytest)^2)
## [1] 0.1075728
mean((gbmYhat - ytest)^2)
## [1] 0.1111789
mean((rfYhat*0.8 + gbmYhat*0.2 - ytest)^2)
## [1] 0.107298
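The 0.8/0.2 split above was picked by hand; nothing stops us from scanning a grid of weights instead. A small sketch using the objects defined above (strictly speaking, this tuning should happen on a separate validation split rather than on the same test set):
w <- seq(0, 1, by=0.05)
blendMse <- sapply(w, function(a) mean((a*rfYhat + (1-a)*gbmYhat - ytest)^2))
w[which.min(blendMse)] # weight on the random forest giving the lowest MSE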
Taking this a step further, we can even weight together the random forest, gradient boosted tree, and additive model. This produces a slightly better model still.
gamYhat <- as.numeric(ca$gamPred[!trainFlag])
mean((rfYhat*0.8 + gbmYhat*0.1 + gamYhat*0.1 - ytest)^2)
## [1] 0.1070845
The process of blending together models is called stacking (among many other names), and it is an important technique for producing state-of-the-art predictive models. It is also possible to 'learn' the weights on a separate part of the validation set. In a way, this is what we did by using the additive model as part of the random forest and gradient boosted tree models.
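As a sketch of what learning the weights might look like, we can regress the response on the three sets of predictions. Here, purely for illustration, this is done on the same held-out set; with enough data, this regression would get its own split so the weights are not tuned on the final test data.
stackDf <- data.frame(y = ytest, rf = rfYhat, gbm = gbmYhat, gam = gamYhat)
stackFit <- lm(y ~ rf + gbm + gam - 1, data = stackDf) # no intercept
coef(stackFit) # learned (unconstrained) blending weights
mean((predict(stackFit, stackDf) - ytest)^2)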
I would like to note that while for this small example it seems that gradient boosting is a lot more work with little reward, this is not in general the case. Typically it will be able to significantly improve on the results of the random forest.
It may have seemed like I should have used a separate chunk of data to fit the additive model, as opposed to the random forest and gradient boosted trees. It is true that, given a large enough dataset, this would be ideal. However, I am not particularly worried about this affecting the output. Why? Additive models have a relatively low complexity when we only have 8 variables, and will rarely overfit the training data.
In general, there is no big mysterious difference between classification and regression with decision trees. Some of the details are important to understand in terms of how these methods are employed in practice though, so I think it is useful to actually look at an example. I will take the same dataset as in the previous section:
set.seed(1)
x <- read.csv("../../data/CAPA.csv", as.is=TRUE)
names(x) <- tolower(names(x))
x <- na.omit(x)
But now use the entire dataset, setting the class that we want to predict equal to which state the census tract is located in:
X <- x[, c(6,11:15,33:34)]
trainFlag <- (runif(nrow(x)) < 0.66)
y <- as.numeric(x$statefp == 6)
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
We can run the random forest model, explicitly coding the y variables as factors to tell the function to run classification:
rfObj <- randomForest(Xtrain, factor(ytrain),
Xtest, factor(ytest),
do.trace=FALSE, keep.forest=TRUE,
ntree=500L)
rfYhat <- predict(rfObj, Xtest)
To evaluate the model, I will use the misclassification rate on the test set:
mean(rfYhat != ytest)
## [1] 0.06598845
Now, as before, let’s look at the variable importance scores for this model. Do these surprise you at all?
importance(rfObj)
## MeanDecreaseGini
## population 161.6781
## total_units 133.4970
## vacant_units 149.5611
## median_rooms 847.5819
## mean_household_size_owners 379.0625
## mean_household_size_renters 549.5154
## median_household_income 314.5436
## mean_household_income 373.0121
We can also fit a gradient boosted tree to the classification problem. In order to tell the gbm.fit function that we want classification, I will set the distribution parameter to the 'bernoulli' loss. How do the important variables differ from those in the random forest? How does the misclassification rate compare to the random forest estimator?
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="bernoulli",
n.trees=1e4, shrinkage=0.005, verbose=FALSE)
gbmYhat <- as.numeric(predict(gbmObj, Xtest, n.trees=gbmObj$n.trees) > 0)
summary(gbmObj, plotit=FALSE)
## var rel.inf
## median_rooms median_rooms 37.7247915
## mean_household_size_renters mean_household_size_renters 27.0391940
## mean_household_income mean_household_income 17.6024644
## mean_household_size_owners mean_household_size_owners 10.0502211
## median_household_income median_household_income 5.7804208
## population population 0.8391667
## vacant_units vacant_units 0.5404817
## total_units total_units 0.4232599
mean(gbmYhat != ytest)
## [1] 0.06406379
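One aside on the prediction scale: by default, predict.gbm with the bernoulli loss returns values on the log-odds scale, which is why the cutoff above is zero. Asking for type="response" gives probabilities instead, and a 0.5 cutoff on those is equivalent; a quick sketch using the objects above:
gbmProb <- predict(gbmObj, Xtest, n.trees=gbmObj$n.trees, type="response")
mean(as.numeric(gbmProb > 0.5) != ytest) # same misclassification rate as above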
As before, I want to fit an additive model to the training set and then use this as a meta-predictor in the random forest and gradient boosted tree models. However, instead of directly predicting the classification response (which we could do), I will instead use the additive model to predict the most important variable in the tree models (the median number of rooms):
library(mgcv)
x.gam2 <- gam(median_rooms
~ s(median_house_value) + s(mean_household_income)
+ s(population) + s(total_units) + s(vacant_units)
+ s(median_household_income) + s(mean_household_size_owners)
+ s(mean_household_size_renters),
data=x, subset=trainFlag)
I will now compute predictions from this model and put them into the predictor matrices Xtrain and Xtest.
x$gamPred <- predict(x.gam2, x)
X <- x[, c(6,11:15,33:35)]
trainFlag <- (runif(nrow(x)) < 0.66)
y <- as.numeric(x$statefp == 6)
Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
This gives an impressive improvement on the misclassification rate for the random forest model:
rfObj <- randomForest(Xtrain, factor(ytrain),
Xtest, factor(ytest),
do.trace=FALSE, keep.forest=TRUE,
ntree=500L)
rfYhat <- predict(rfObj, Xtest)
mean(rfYhat != ytest)
## [1] 0.04337085
To gain an understanding and appreciation for where we are still making errors, we can construct the confusion matrix of the predictions from the random forest estimator.
table(rfYhat, ytest)
## ytest
## rfYhat 0 1
## 0 980 43
## 1 115 2505
It appears that we are not overly biased towards predicting points in either CA or PA. Where do these errant points occur?
these <- which(rfYhat != ytest)
par(mar=c(0,0,0,0))
plot(x$longitude[!trainFlag][these], x$latitude[!trainFlag][these],
axes=FALSE, xlab="", ylab="", col="white")
snippets::osmap(tiles.url="http://c.tile.stamen.com/toner/",alpha=0.5)
points(x$longitude[!trainFlag][these], x$latitude[!trainFlag][these],
pch=19, cex=0.4, col="orange")
box()
As we may have expected, areas in the most urban parts of PA and the rural parts of CA are the most likely to be classified incorrectly.
Finally, we can see that the additive model meta-variable also has a similar improvement on the gradient boosted trees:
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="bernoulli",
n.trees=2500, shrinkage=0.05, verbose=FALSE)
gbmYhat <- as.numeric(predict(gbmObj, Xtest, n.trees=gbmObj$n.trees) > 0)
summary(gbmObj, plotit=FALSE)
## var rel.inf
## median_rooms median_rooms 26.6462615
## mean_household_size_renters mean_household_size_renters 25.5736402
## mean_household_income mean_household_income 16.1747340
## gamPred gamPred 12.7770390
## median_household_income median_household_income 8.1728206
## mean_household_size_owners mean_household_size_owners 7.2388484
## vacant_units vacant_units 1.7353594
## population population 1.0531226
## total_units total_units 0.6281745
Though it still fails to best the random forest estimator in this particular case:
mean(rfYhat != ytest)
## [1] 0.04337085
mean(gbmYhat != ytest)
## [1] 0.04858633
Though, as stated before, this is often not the case in more complex models.
We have looked at three methods for combining simple models into powerful prediction engines: random forests, gradient boosted trees, and stacking. We have seen that these ideas can be used in conjunction with one another to produce fairly powerful predictive models.