# Tree-based models

Last week, we talked about using decision trees for regression and classification problems. Here, we are going to actually apply these techniques to the California and Pennsylvania housing datasets.

## Regression

As a first task, let us again take just the California data and try to predict the logarithm of the median house value in each census tract.

```{r}
set.seed(1)
x <- read.csv("../../data/CAPA.csv", as.is=TRUE)
names(x) <- tolower(names(x))
x <- na.omit(x)
ca <- x[x$statefp == 6,]  # just take CA data
trainFlag <- (runif(nrow(ca)) < 0.66)
```

In order to fit a single regression tree, we use the R package **tree**. As a starting point, I will just use longitude and latitude.

```{r}
library(tree)
tf <- tree(log(median_house_value) ~ longitude + latitude, data=ca)
plot(tf)
text(tf, cex=0.75)
```

It is easy to incorporate more variables in the model. How much do these change the tree that is learned from the data?

```{r}
library(tree)
tf <- tree(log(median_house_value) ~ longitude + latitude +
             median_household_income + median_rooms, data=ca)
plot(tf)
text(tf, cex=0.75)
```

Which variables might we want to incorporate if we want a highly predictive model? I am going to start by using these:

```{r}
names(ca)[c(6,11:15,33:34)]
```

I find it easier to use matrices directly rather than the formula interface in the **randomForest** package (the formula interface is much slower and makes subsequent predictions on new data more difficult). Here I define a matrix of predictors for both the training and testing sets, as well as the response for both training and testing.

```{r}
X <- ca[, c(6,11:15,33:34)]
y <- log(ca$median_house_value)

Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
```

I will fit a small random forest estimator to this data with only 15 trees. Notice that I can pass the test dataset directly to the function; if I also turn on the *do.trace* option, this shows how well we are predicting on the test set as the model learns more trees. Also notice that when a test set is supplied, we need to tell R to keep the forest in the output object. Otherwise the forest gets thrown away, making predictions on new datasets or calculating variable importance impossible (I have no idea why they thought this would be a good default).

```{r}
library(randomForest)
rfObj <- randomForest(Xtrain, ytrain, Xtest, ytest,
                      do.trace=TRUE, keep.forest=TRUE, ntree=15L)
rfObj
```

Notice that the randomForest estimator automatically calculates the training (out-of-bag) and testing MSE. Note: I would argue that we are really treating **Xtest** and **ytest** as a validation set, but I have named them this way to follow the terminology of the *randomForest* function.

Now, let's turn off the *do.trace* option and run the estimator with 500 trees.

```{r}
rfObj <- randomForest(Xtrain, ytrain, Xtest, ytest,
                      do.trace=FALSE, keep.forest=TRUE, ntree=500L)
rfObj

rfYhat <- predict(rfObj, Xtest)
mean((rfYhat - ytest)^2)
```

Here we have explicitly calculated the MSE as well. How much better does the addition of 485 more trees do compared to the simpler model?
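One way to answer this directly: when a test set is supplied, the fitted object stores the test-set MSE after each tree in `rfObj$test$mse` and the out-of-bag MSE in `rfObj$mse`. Here is a minimal sketch (not part of the original analysis) that plots how the error decays as trees are added, assuming the 500-tree fit from the chunk above is still in memory:

```{r}
# A sketch: plot the stored test-set and out-of-bag MSE as a function of the
# number of trees in the forest.
plot(rfObj$test$mse, type="l", xlab="number of trees", ylab="MSE")
lines(rfObj$mse, lty="dashed")
legend("topright", legend=c("test set", "out-of-bag"), lty=c("solid", "dashed"))
```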
As mentioned last time, we can use a random forest to calculate the importance of each of the variables in the model, as measured by how often each variable is split on and how much it decreases the MSE. This is done with the *importance* function (remember, this only works if you set *keep.forest* to TRUE).

```{r}
importance(rfObj)
```

The scale of the output is not terribly important; we are only going to focus on the relative sizes of these importance scores. Household income seems to be the most important, followed by the median number of rooms. Do these seem reasonable to you?

While the exact structure of one tree in a random forest is difficult to make much sense of in general, I think it is useful to understand how these trees are stored as a data structure in R. Let's look at the first few rows of the first tree object of the random forest:

```{r}
head(getTree(rfObj, k=1), 15)
```

Why do you think the R implementation uses a matrix to store this information rather than a hierarchical list that better represents the actual structure of the tree?

Finally, note that we can easily increase the number of trees in the random forest even after it has already been constructed. Here we add 10 more trees with the *grow* function:

```{r}
rfObj <- grow(rfObj, how.many=10)
rfObj
```

Let's now return to the additive models we studied two weeks ago. I will fit an additive model on the training set, using exactly the same variables we used in the random forest:

```{r}
library(mgcv)
ca.gam2 <- gam(log(median_house_value) ~ s(median_household_income)
  + s(mean_household_income) + s(population) + s(total_units)
  + s(vacant_units) + s(median_rooms) + s(mean_household_size_owners)
  + s(mean_household_size_renters), data=ca, subset=trainFlag)
```

Now, we can find predictions from the generalized additive model for both the training set and testing set. I will then reconstruct the **X** matrices to use this prediction itself as a predictor variable in the model.

```{r}
ca$gamPred <- predict(ca.gam2, ca)

X <- ca[, c(6,11:15,33:35)]
y <- log(ca$median_house_value)

Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
```

Now we can run another random forest on this new dataset. Notice that the MSE decreases slightly with this new variable.

```{r}
rfObj <- randomForest(Xtrain, ytrain, Xtest, ytest,
                      do.trace=FALSE, keep.forest=TRUE)
rfObj
```

```{r}
rfYhat <- predict(rfObj, Xtest)
mean((rfYhat - ytest)^2)
```

How important is this new variable in the random forest model?

```{r}
importance(rfObj)
```

Now, I want to try gradient boosting on this same dataset. For this I will use the excellent **gbm** package in R. There are two functions for fitting gradient boosted trees (or machines, as the package calls them): *gbm* and *gbm.fit*. The first takes a formula object and the second takes raw matrices. As said before, I prefer to deal with the raw matrix version as it is faster, but one needs to be careful that the training and testing matrices are constructed in exactly the same way. We will start with 100 trees and a relatively high shrinkage factor of 0.1:

```{r}
library(gbm)
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="gaussian",
                  n.trees=100L, shrinkage=0.1)
```

Notice that the variable importance scores look a bit more lopsided than in the random forest case. Why might this be?

```{r}
summary(gbmObj, plotit=FALSE)
```

If we turn down the shrinkage, notice that the training deviance (a fancy way of saying the MSE) decreases much more slowly, and after 100 trees the variable importance scores are even more lopsided.

```{r}
library(gbm)
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="gaussian",
                  n.trees=100L, shrinkage=0.01)
summary(gbmObj, plotit=FALSE)
```
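To see this slower decay directly, here is a minimal sketch (not part of the original analysis): **gbm** stores the training deviance after each boosting iteration in the `train.error` component of the fitted object, so we can plot it for the shrinkage 0.01 fit from the chunk above.

```{r}
# A sketch: plot the training deviance after each boosting iteration.
plot(gbmObj$train.error, type="l",
     xlab="number of trees", ylab="training deviance")
```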
Now I will let it run for a while with a modest shrinkage penalty. We will then look at a plot of the MSE on the test set as a function of the number of trees.

```{r}
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="gaussian",
                  n.trees=1e4, shrinkage=0.03, verbose=FALSE)

ntree <- seq(100, gbmObj$n.trees, length.out=100)
gbmYhat <- predict(gbmObj, Xtest, n.trees=ntree)
r <- gbmYhat - matrix(ytest, ncol=100, nrow=length(ytest), byrow=FALSE)
mse <- apply(r^2, 2, mean)

plot(ntree, mse, type="l")
abline(h=min(mse), lty="dashed")
text(ntree[100], min(mse)+0.0004, signif(min(mse),3))
```

If we decrease the shrinkage penalty from 0.03 to 0.005, notice what happens to the MSE on the test set:

```{r}
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="gaussian",
                  n.trees=1e4, shrinkage=0.005, verbose=FALSE)

ntree <- seq(500, gbmObj$n.trees, length.out=100)
gbmYhat <- predict(gbmObj, Xtest, n.trees=ntree)
r <- gbmYhat - matrix(ytest, ncol=100, nrow=length(ytest), byrow=FALSE)
mse <- apply(r^2, 2, mean)

plot(ntree, mse, type="l")
abline(h=min(mse), lty="dashed")
text(ntree[100], min(mse)+0.0004, signif(min(mse),3))
```

Notice that the model does slightly worse after 2,500 trees and takes slightly longer to reach its optimal value, but it does not suffer as much when the number of trees is increased past that point. What if we try a still lower amount of shrinkage?

```{r}
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="gaussian",
                  n.trees=1e4, shrinkage=0.001, verbose=FALSE)

ntree <- seq(500, gbmObj$n.trees, length.out=100)
gbmYhat <- predict(gbmObj, Xtest, n.trees=ntree)
r <- gbmYhat - matrix(ytest, ncol=100, nrow=length(ytest), byrow=FALSE)
mse <- apply(r^2, 2, mean)

plot(ntree, mse, type="l")
abline(h=min(mse), lty="dashed")
text(ntree[100], min(mse)+0.0004, signif(min(mse),3))
```

**As a general rule of thumb, you want to slowly decrease the shrinkage, producing curves like these, generally until the shrinkage is so small that the test error never really turns back up at the end of the curve.** This is in bold because it is one of those important secrets of applied machine learning that is well known by a set of practitioners but not actually written down anywhere (anywhere prominent, at least).

For a final model, we might choose to use this last fit and, based on this MSE curve, use only 8,500 trees. Do you see now why we are treating the test set more like a validation set?

```{r}
gbmYhat <- predict(gbmObj, Xtest, n.trees=8500)
mean((gbmYhat - ytest)^2)
```

Now, notice that we can improve on both the random forest and gradient boosted models by averaging their predictions. As the random forest is slightly more predictive, I will weight it a bit higher:

```{r}
mean((rfYhat - ytest)^2)
mean((gbmYhat - ytest)^2)
mean((rfYhat*0.8 + gbmYhat*0.2 - ytest)^2)
```

Taking this a step further, we can even weight together the random forest, gradient boosted tree, and additive model. This produces a slightly better model still.

```{r}
gamYhat <- as.numeric(ca$gamPred[!trainFlag])
mean((rfYhat*0.8 + gbmYhat*0.1 + gamYhat*0.1 - ytest)^2)
```

The process of blending together models is called *stacking* (amongst many other names), and it is an important technique in producing state-of-the-art predictive models. It is also possible to "learn" the weights on a separate part of the validation set. In a way, this is what we did in using the additive model as part of the random forest and gradient boosted tree models.
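As a minimal sketch of that idea (purely illustrative, and fit here on the same held-out set we have been using rather than on a genuinely separate split), we could regress the held-out response on the three sets of predictions to get data-driven weights:

```{r}
# A sketch: learn stacking weights by regressing the held-out response on the
# individual model predictions (no intercept).
stackDf <- data.frame(y=ytest, rf=rfYhat, gbm=gbmYhat, gam=gamYhat)
stackFit <- lm(y ~ rf + gbm + gam - 1, data=stackDf)
coef(stackFit)
mean((predict(stackFit, stackDf) - ytest)^2)
```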
I would like to note that, while for this small example it seems that gradient boosting is a lot more work for little reward, this is not generally the case. Typically it will be able to improve significantly on the results of the random forest.

It may have seemed like I should use a separate chunk of data to fit the additive model, as opposed to the random forest and gradient boosted trees. It is true that, given a large enough dataset, this would be ideal. However, I am not particularly worried about this affecting the output. Why? Additive models have relatively low complexity when we only have 8 variables, and will rarely overfit the training data.

## Classification

In general, there is no big mysterious difference between classification and regression with decision trees. Some of the details are important to understand in terms of how these methods are employed in practice, though, so I think it is useful to actually look at an example. I will take the same dataset as in the previous section:

```{r}
set.seed(1)
x <- read.csv("../../data/CAPA.csv", as.is=TRUE)
names(x) <- tolower(names(x))
x <- na.omit(x)
```

But now use the entire dataset, setting the class that we want to predict equal to the state in which the census tract is located:

```{r}
X <- x[, c(6,11:15,33:34)]
trainFlag <- (runif(nrow(x)) < 0.66)
y <- as.numeric(x$statefp == 6)

Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
```

We can run the random forest model, explicitly coding the response as a factor to tell the function to run classification:

```{r}
rfObj <- randomForest(Xtrain, factor(ytrain), Xtest, factor(ytest),
                      do.trace=FALSE, keep.forest=TRUE, ntree=500L)
rfYhat <- predict(rfObj, Xtest)
```

To evaluate the model, I will use the misclassification rate on the test set:

```{r}
mean(rfYhat != ytest)
```

Now, as before, let's look at the variable importance scores for this model. Do these surprise you at all?

```{r}
importance(rfObj)
```

We can also fit a gradient boosted tree to the classification problem. In order to tell the *gbm.fit* function that we want classification, I will set the *distribution* parameter to the 'bernoulli' loss. How do the important variables differ from those in the random forest? How does the misclassification rate compare to the random forest estimator?

```{r}
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="bernoulli",
                  n.trees=1e4, shrinkage=0.005, verbose=FALSE)
gbmYhat <- as.numeric(predict(gbmObj, Xtest, n.trees=gbmObj$n.trees) > 0)
summary(gbmObj, plotit=FALSE)
mean(gbmYhat != ytest)
```

As before, I want to fit an additive model on the training set and then use this as a meta-predictor in the random forest and gradient boosted tree models. However, instead of directly predicting the classification response (which we could do), I will instead use the additive model to predict the most important variable in the tree models (the median number of rooms):

```{r}
library(mgcv)
x.gam2 <- gam(median_rooms ~ s(median_house_value)
  + s(mean_household_income) + s(population) + s(total_units)
  + s(vacant_units) + s(median_household_income)
  + s(mean_household_size_owners) + s(mean_household_size_renters),
  data=x, subset=trainFlag)
```
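As an aside, here is a minimal sketch of the direct option mentioned above, a logistic additive model for the state indicator itself; it is purely illustrative (the model name `x.gamClass` and the choice of smooth terms are my own) and is not used in the rest of the analysis:

```{r}
# A sketch: fit the additive model directly to the classification response
# and compute its test-set misclassification rate.
x.gamClass <- gam(as.numeric(statefp == 6) ~ s(median_house_value)
  + s(mean_household_income) + s(population) + s(total_units)
  + s(vacant_units) + s(median_household_income) + s(median_rooms)
  + s(mean_household_size_owners) + s(mean_household_size_renters),
  data=x, subset=trainFlag, family=binomial())
gamClassYhat <- as.numeric(predict(x.gamClass, x[!trainFlag,], type="response") > 0.5)
mean(gamClassYhat != ytest)
```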
I will now find the predictions of this variable and put them into the predictor matrices **Xtrain** and **Xtest**.

```{r}
x$gamPred <- predict(x.gam2, x)

X <- x[, c(6,11:15,33:35)]
y <- as.numeric(x$statefp == 6)

Xtrain <- X[trainFlag,]
ytrain <- y[trainFlag]
Xtest <- X[!trainFlag,]
ytest <- y[!trainFlag]
```

This gives an impressive improvement in the misclassification rate for the random forest model:

```{r}
rfObj <- randomForest(Xtrain, factor(ytrain), Xtest, factor(ytest),
                      do.trace=FALSE, keep.forest=TRUE, ntree=500L)
rfYhat <- predict(rfObj, Xtest)
mean(rfYhat != ytest)
```

To get a sense of where we are still making errors, we can construct the *confusion matrix* of the predictions from the random forest estimator.

```{r}
table(rfYhat, ytest)
```

It appears that we are not overly biased towards predicting points in either CA or PA. Where do these errant points occur?

```{r, fig.width=9, fig.height=3}
these <- which(rfYhat != ytest)
par(mar=c(0,0,0,0))
plot(x$longitude[!trainFlag][these], x$latitude[!trainFlag][these],
     axes=FALSE, xlab="", ylab="", col="white")
snippets::osmap(tiles.url="http://c.tile.stamen.com/toner/", alpha=0.5)
points(x$longitude[!trainFlag][these], x$latitude[!trainFlag][these],
       pch=19, cex=0.4, col="orange")
box()
```

As we may have expected, areas in the most urban parts of PA and the rural parts of CA are the most likely to be classified incorrectly.

Finally, we can see that the additive model meta-variable yields a similar improvement for the gradient boosted trees:

```{r}
gbmObj <- gbm.fit(Xtrain, ytrain, distribution="bernoulli",
                  n.trees=2500, shrinkage=0.05, verbose=FALSE)
gbmYhat <- as.numeric(predict(gbmObj, Xtest, n.trees=gbmObj$n.trees) > 0)
summary(gbmObj, plotit=FALSE)
```

Though it still fails to best the random forest estimator in this particular case:

```{r}
mean(rfYhat != ytest)
mean(gbmYhat != ytest)
```

Though, as stated before, this is often not the case with more complex problems.

## Overview

We have looked at three methods for combining simple models into powerful prediction engines:

* Bagging (random forests)
* Boosting (gradient boosted trees)
* Stacking

We have seen that these ideas can be used in conjunction with one another to produce fairly powerful predictive models.