We’ve used decision trees before to describe the decision-making process as a sequence of actions and conditions. In this section, we’ll use decision trees to make predictions. You can think of a prediction as a decision task, where you need to decide which value of \(y\) to use for a given \(x\). Similar to a decision-making tree, a predictive tree model is a nested sequence of if-else statements that maps any input data point \(x\) to a predicted output \(y\). Each if-else statement checks a feature of \(x\) and sends the data point left or right along a branch of the tree. At the end of the branch, a single value of \(y\) is predicted.
Figure 18.1 shows a decision tree for predicting a chess piece given a four-dimensional input vector that describes the types of moves available to the piece. The tree is a sequence of nested if-else statements that check the values of the input vector. The tree has six leaves, one for each chess piece, and has a depth of four. The tree is a predictive model that maps a four-dimensional input vector to a single categorical output with six possible values.
Figure 18.1: Elementary tree scheme; visualization of the splitting process
The prediction algorithm is simple. Start at the root node and move down the tree until you reach a leaf node. The process of building a tree, given a set of training data, is more complicated and has three main components:
Splitting. The process of dividing the training data into subsets based on the value of a single feature. The goal is to create subsets that are as homogeneous as possible. The subsets are then used to create the nodes of the tree.
Stopping. The process of deciding when to stop splitting. The goal is to create a tree that is as accurate as possible without overfitting the training data.
Pruning. The process of removing nodes from the tree that do not improve the accuracy of the tree. The goal is to create a tree that is as accurate as possible without overfitting the training data.
The splitting process is the most important part of tree building. At each step, the splitting process needs to decide on the feature index \(j\) to be used for splitting and on the location of the split. For a binary variable there is only one possible split location, but for a continuous variable there are many. The goal is to find the split that creates the most homogeneous subsets. In the case of regression trees, the best split is the one that minimizes the sum of squared errors; in the case of classification trees, it is the one that minimizes the Gini impurity, a measure of how homogeneous the subsets are.
The widely used CART algorithm works on the data table \(X \in \mathbb{R}^{n\times p}\) during the splitting process. It loops over every element \(x_{ij}\) and picks the split value that creates the most homogeneous subsets. A split creates two subsets: the left subset \(X_L = \{x \mid x_j < x_{ij}\}\) and the right subset \(X_R = \{x \mid x_j \ge x_{ij}\}\).
To measure how homogeneous the sets are, in the case of regression, we calculate \(\hat y_L = \mathrm{Average}(y_i),\ i \in I_L\) and \(\hat y_R = \mathrm{Average}(y_i),\ i \in I_R\), where \(I_L\) and \(I_R\) are the indices of the observed inputs \(x_i\) in the left and right subsets. The best split is the one that minimizes the sum of squared errors \[
\text{SSE} = \sum_{i\in I_L} (y_i - \hat{y}_L)^2 + \sum_{i\in I_R} (y_i - \hat{y}_R)^2
\]
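To make the split search concrete, here is a minimal sketch (not from the text) of the exhaustive search over features and split locations that minimizes the SSE above; `X` is assumed to be a numeric matrix and `y` a numeric response vector.

```r
# Exhaustive best-split search for one regression-tree node (illustrative sketch).
# Assumes X is an n x p numeric matrix and y a numeric response of length n.
best_split <- function(X, y) {
  best <- list(sse = Inf, j = NA, s = NA)
  for (j in seq_len(ncol(X))) {            # candidate feature index j
    for (s in unique(X[, j])) {            # candidate split location x_ij
      left <- X[, j] < s
      if (!any(left) || all(left)) next    # skip splits that leave a subset empty
      sse <- sum((y[left]  - mean(y[left]))^2) +
             sum((y[!left] - mean(y[!left]))^2)
      if (sse < best$sse) best <- list(sse = sse, j = j, s = s)
    }
  }
  best                                     # feature index and split point with the lowest SSE
}
```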
Let’s start with a quick demo and look at the data. We’ll use the Hitters data set from the ISLR package. The data set contains information about Major League Baseball players. The goal is to predict the salary of a player based on his performance and experience. The data set contains 322 observations and 20 variables. The three variables we will analyze are:
Hits: Number of hits in 1986
Years: Number of years in the major leagues
Salary: 1987 annual salary on opening day in thousands of dollars
The question we are interested in is: do experience and performance affect the salary of a baseball player?
It seems like there is a relation between the number of years, the number of hits, and the salary. The bottom-left corner of the input space is mostly occupied by low-salary players. We will use the CART algorithm to find the optimal splits.
We use the tree function, which has a syntax similar to lm. Then we call prune.tree to find the best tree with 3 leaves (terminal nodes). Each terminal node corresponds to a region.
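A sketch of that workflow, assuming the ISLR and tree packages; modeling log(Salary) is an assumption here (a common choice for this example), not something fixed by the text.

```r
library(ISLR)   # Hitters data
library(tree)   # tree(), prune.tree(), partition.tree()

hitters <- na.omit(Hitters)                        # drop players with missing Salary
big.tree <- tree(log(Salary) ~ Years + Hits, data = hitters)
small.tree <- prune.tree(big.tree, best = 3)       # best subtree with 3 leaves

plot(small.tree); text(small.tree)                 # the nested if-else statements
partition.tree(small.tree)                         # the 3 regions in (Years, Hits) space
```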
The prediction is rather straightforward. The tree divides the predictor space (that is, the set of possible values for \(x_1, x_2, \ldots, x_p\)) into \(J\) distinct and non-overlapping boxes, \(R_1, R_2, \ldots, R_J\). For every observation that falls into the region \(R_j\), we make the same prediction, which is simply the mean of the response values for the training observations in \(R_j\).
The overall goal of building a tree is to find regions that minimize the total Residual Sum of Squares (RSS) \[
\mathrm{RSS} = \sum_{j=1}^J\sum_{i \in R_j}(y_i - \bar{y}_j)^2 \rightarrow \mathrm{minimize}
\]
Unfortunately, it is computationally infeasible (an NP-hard problem) to consider every possible partition of the feature space into \(J\) boxes. We can find a good approximate solution using a top-down approach (the CART algorithm). As mentioned earlier, at each iteration we decide on which variable \(j\) to split and on the split point \(s\), \[
R_1(j, s) = \{x\mid x_j < s\} \mbox{ and } R_2(j, s) = \{x\mid x_j \ge s\},
\] thus, we seek to minimize (in the case of a regression tree) \[
\min_{j,s}\left[ \sum_{i:x_i\in R_1}(y_i - \bar{y}_1)^2 + \sum_{i:x_i \in R_2}(y_i - \bar{y}_2)^2\right]
\] As a result, every observed input point belongs to a single region.
Now let’s discuss how many regions we should have. At one extreme end, we can have \(n\) regions, one for each observation. Then the tree model will work similar to the one-nearest neighbor model. At the other end, we can have one big region for the entire input space and then every prediction will be the same (average across observed \(y\)’s). Both models can be used but usually the best one is in the middle. The number of regions (branches) controls the complexity of the model. We need to find a good size on the variance-bias scale. A smaller tree with fewer splits (that is, fewer regions \(R_1,...,R_J\)) might lead to lower variance and better interpretation at the cost of a little bias.
How do we construct a tree with a “manageable” number of branches? This is accomplished through the steps of forward tree construction and backward pruning. The forward step is a greedy algorithm that begins with a single region and divides it into two. This procedure is repeated until a certain stopping criterion is met. A practical method is to continue building the tree until the Residual Sum of Squares (RSS) plateaus. However, this method can be myopic as an initially unproductive split might be followed by a highly beneficial one, leading to a significant decrease in RSS in subsequent iterations. A more effective strategy is to grow an extensive tree \(T_0\), and then trim it down to obtain a subtree. The size of the subtree can be determined using cross-validation. However, be aware that the number of subtrees can be exponential!
Instead of considering all possible subtrees, we will do cost-complexity pruning, also known as weakest-link pruning. We consider a sequence of trees indexed by a nonnegative tuning parameter \(\alpha\). For each value of \(\alpha\) there corresponds a subtree \(T \subset T_0\) that minimizes \[
\sum_{m=1}^{|T|}\sum_{i:x_i\in R_m}(y_i - \bar{y}_m)^2 + \alpha |T|
\] The parameter \(\alpha\) balances the complexity of the subtree and its adherence to the training data. When we increment \(\alpha\) starting from zero, branches are predictably and sequentially pruned from the tree, making it straightforward to acquire the entire series of subtrees as a function of \(\alpha\). We determine the optimal value \(\hat \alpha\) through cross-validation. Afterward, we refer back to the complete data set and extract the subtree that corresponds to \(\hat \alpha\).
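With the tree package, this cross-validation over the cost-complexity sequence can be done with cv.tree; a hedged sketch (the Hitters tree from above is refit here so the chunk stands alone):

```r
library(ISLR); library(tree)
hitters <- na.omit(Hitters)
big.tree <- tree(log(Salary) ~ Years + Hits, data = hitters)  # the large tree T_0

cv.out <- cv.tree(big.tree)                 # CV deviance along the pruning sequence
plot(cv.out$size, cv.out$dev, type = "b",
     xlab = "tree size |T|", ylab = "CV deviance")
best.size <- cv.out$size[which.min(cv.out$dev)]
final.tree <- prune.tree(big.tree, best = best.size)  # subtree matching the chosen alpha
```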
A classification tree operates much like a regression tree. The prediction is made based on the “majority vote”, which means selecting the class that appears most frequently within the region. The process of developing a classification tree is largely the same as that of a regression tree, involving recursive binary splitting. However, instead of using the Residual Sum of Squares (RSS), we use the classification error rate, which is the proportion of observations in that region that do not belong to the most prevalent class.
We start by introducing some notation \[
p_{mk} = \dfrac{1}{N_m}\sum_{x_i \in R_m} I(y_i=k),
\] which is the proportion of observations of class \(k\) in region \(m\).
Within region \(m\) we then compute \[
p_m = \max_k p_{mk},~~~ E_m = 1-p_m
\] i.e., \(p_m\) is the proportion of the most frequent class in region \(m\) and \(E_m\) is the classification error rate in that region.
For a new input \(x\), the predicted class probabilities are \[
P(y=k \mid x) = \sum_{j=1}^J p_{jk}\, I(x \in R_j)
\]
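A tiny numerical illustration of the node proportions and the majority-vote rule; the labels below are made up for the example.

```r
# Class proportions p_mk, majority-vote prediction, and error E_m in one region.
y_region <- factor(c("rook", "knight", "knight", "bishop", "knight"))  # made-up labels
p_mk <- table(y_region) / length(y_region)   # proportion of each class in the region
p_m  <- max(p_mk)                            # share of the most frequent class
E_m  <- 1 - p_m                              # misclassification rate in the region
prediction <- names(which.max(p_mk))         # "knight" by majority vote
```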
An alternative method to evaluate the quality of a split in a classification tree is through the use of the Gini Index or Cross-Entropy. Let’s consider a scenario where we have an equal number of observations in each class, say 400 in each.
Now, suppose we create a tree that results in two regions: one with a distribution of (300,100) and the other with (100,300). This means that in the first region, 300 observations belong to one class and 100 to the other, and vice versa in the second region.
Consider another scenario where we have a different tree that results in two regions with distributions of (200,400) and (200,0).
In both cases, the misclassification rate is 0.25, meaning that 25% of the observations are incorrectly classified. However, the second tree is more desirable. Why is that? The second tree has a region with no misclassifications at all (200,0), which means it’s perfectly classifying all observations in that region. This is an ideal situation in classification problems. On the other hand, the first tree, despite having the same overall misclassification rate, doesn’t have any region where all observations are correctly classified.
This illustrates that while the misclassification rate is a useful metric, it doesn’t always capture the full picture. Other metrics like the Gini Index or Cross-Entropy can provide a more nuanced view of the quality of a split, taking into account not just the overall error rate, but also the distribution of errors across different regions.
In short, we prefer splits that produce “pure” nodes, and the Gini index and cross-entropy capture this preference better than the raw misclassification rate.
The Gini index: \[
G_m = \sum_{k=1}^K p_{mk}(1-p_{mk})
\] It measures the variance across the \(K\) classes. It takes on a small value if all of the \(p_{mk}\)’s are close to zero or one.
An alternative to the Gini index is cross-entropy (a.k.a. deviance), given by \[
D_m = -\sum_{k=1}^Kp_{mk}\log p_{mk}
\] It is near zero if the \(p_{mk}\)’s are all near zero or near one. The Gini index and the cross-entropy lead to similar results.
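Here is the earlier (300,100)/(100,300) versus (200,400)/(200,0) comparison worked out numerically; the helper functions are ours, not from the text.

```r
gini    <- function(p) sum(p * (1 - p))
entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))
prop    <- function(counts) counts / sum(counts)

# Tree 1: regions (300,100) and (100,300); Tree 2: regions (200,400) and (200,0).
# Both misclassify 200 of 800 observations (rate 0.25), but the impurities differ.
(400 * gini(prop(c(300, 100))) + 400 * gini(prop(c(100, 300)))) / 800  # 0.375
(600 * gini(prop(c(200, 400))) + 200 * gini(prop(c(200, 0))))   / 800  # about 0.333

(400 * entropy(prop(c(300, 100))) + 400 * entropy(prop(c(100, 300)))) / 800  # about 0.562
(600 * entropy(prop(c(200, 400))) + 200 * entropy(prop(c(200, 0))))   / 800  # about 0.477
# Both measures are lower for the second tree, so both prefer the purer split.
```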
Now we apply the tree model to the Boston housing dataset.
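The fit used in the next chunks; a hedged sketch assuming df2 holds the Boston data (MASS::Boston) with the two predictors lstat and dis and the response medv. The exact tree size used in the text is not shown, so this is just a plausible fit.

```r
library(MASS)   # Boston data
library(tree)
df2 <- Boston[, c("lstat", "dis", "medv")]
boston.tree <- tree(medv ~ lstat + dis, data = df2)   # regression tree on two predictors
```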
```r
plot(boston.tree, type = "u")   # plot tree and partition in x.
text(boston.tree, col = "blue", label = c("yval"), cex = .8)
partition.tree(boston.tree)
```
Next, we get predictions on a 2D grid:
```r
pv = seq(from = .01, to = .99, by = .05)
x1q = quantile(df2$lstat, probs = pv)
x2q = quantile(df2$dis, probs = pv)
xx = expand.grid(x1q, x2q)   # matrix with two columns using all combinations of x1q and x2q
dfpred = data.frame(dis = xx[, 2], lstat = xx[, 1])
lmedpred = predict(boston.tree, dfpred)
```
Decision trees are incredibly intuitive and simple to explain. They can be even more straightforward to understand than linear regression models. Some theorists argue that decision trees mimic human decision-making processes more accurately than other regression and classification methods we’ve discussed in previous chapters. Decision trees can be visually represented, making them easily interpretable, even for those without a deep understanding of the underlying mechanics, particularly when the trees are not overly complex. Decision trees can effortlessly manage qualitative predictors, eliminating the need to create dummy variables.
Disadvantages of Decision Trees:
Large trees can exhibit high variance. This means that a minor change in the data can lead to a significant change in the final estimated tree, making the model unstable. Conversely, small trees, while more stable, may not be powerful predictors as they might oversimplify the problem. It can be challenging to find a balance between bias and variance when using decision trees. A model with too much bias oversimplifies the problem and performs poorly, while a model with too much variance overfits the data and may not generalize well to unseen data.
There are several techniques used to address the issue of overfitting in decision trees. We considered the pruning technique, which reduces the complexity of the final classifier and hence improves predictive accuracy by reducing overfitting. Two other methods are random forests and boosting. Random Forests is an ensemble method where multiple decision trees are created and their predictions are averaged (for regression) or combined by majority vote (for classification). Boosting is another ensemble technique where trees are built sequentially such that each subsequent tree aims to reduce the bias of the combined classifier.
In the bagging approach, we treat the sample as if it were the population and then take iid draws. That is, you sample with replacement so that you can get the same original sample value more than once in a bootstrap sample.
To Bootstrap Aggregate (Bag), we:
Take \(B\) bootstrap samples from the training data, each of the same size as the training data.
Fit a large tree to each bootstrap sample (we know how to do this fast!). This will give us \(B\) trees.
Combine the results from each of the \(B\) trees to get an overall prediction.
When the target variable \(y\) is numeric, the bagging process is straightforward. The final prediction is simply the average of the predictions from each of the \(B\) trees. However, when \(y\) is categorical, the process of combining results from different trees is less straightforward. One common approach is to use a voting system. In this system, each tree in the ensemble makes a prediction for a given input \(x\). The predicted category that receives the most votes (out of \(B\) total votes) is chosen as the final prediction. Another approach is to average the predicted probabilities \(\hat p\) from each tree. This method can provide a more nuanced prediction, especially in cases where the voting results are close.
Despite the potential benefits of averaging predicted probabilities, most software implementations of bagging for decision trees use the voting method. This is likely due to its simplicity and intuitive appeal. However, the best method to use can depend on the specific characteristics of the problem at hand.
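A minimal bagging sketch for a numeric response using the tree package; the formula, the value of B, and the use of the Boston variables are illustrative choices, not prescribed by the text.

```r
library(tree)
df2 <- MASS::Boston[, c("lstat", "dis", "medv")]

B <- 100
preds <- matrix(NA_real_, nrow = nrow(df2), ncol = B)
for (b in 1:B) {
  idx <- sample(nrow(df2), replace = TRUE)            # bootstrap sample (iid draws with replacement)
  fit <- tree(medv ~ lstat + dis, data = df2[idx, ])  # large tree on the bootstrap sample
  preds[, b] <- predict(fit, newdata = df2)           # predictions on the original inputs
}
yhat.bag <- rowMeans(preds)                           # average over the B trees
```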
The simple idea behind every ensemble model is that the variance of the average is lower than the variance of an individual model. Say we have \(B\) models \(f_1(x),\ldots,f_B(x)\); then we combine them as \[
f_{avg}(x) = \dfrac{1}{B}\sum_{b=1}^Bf_b(x)
\] Combining models helps fight overfitting. On the negative side, ensembles are harder to interpret.
Let’s experiment with the number of trees in the model.
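A hedged reconstruction of this experiment with the randomForest package, using all predictors at each split (mtry equal to the number of predictors, i.e., bagging) and 1,000 versus 5,000 trees; the plotting choices are ours.

```r
library(randomForest)
df2 <- MASS::Boston[, c("lstat", "dis", "medv")]

bag.1k <- randomForest(medv ~ lstat + dis, data = df2, mtry = 2, ntree = 1000)
bag.5k <- randomForest(medv ~ lstat + dis, data = df2, mtry = 2, ntree = 5000)

oo <- order(df2$lstat)
plot(df2$lstat[oo], df2$medv[oo], pch = 21, bg = "grey", xlab = "lstat", ylab = "medv")
lines(df2$lstat[oo], predict(bag.1k, newdata = df2)[oo], col = "red",  lwd = 2)
lines(df2$lstat[oo], predict(bag.5k, newdata = df2)[oo], col = "blue", lwd = 2)
```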
With 1,000 and 5,000 trees the fit is not bad and very similar.
Note that because our method averages over multiple trees, the fit is no longer a simple step function!
Random Forest
In the bagging technique, the trees can become highly correlated, which prevents the achievement of a \(1/B\) reduction in variance. This happens because most, if not all, of the trees will use the most influential predictor in the top split. As a result, bagged trees tend to look very similar to each other.
Random Forests, on the other hand, introduce an element of randomness that helps to decorrelate the trees, making the ensemble more robust and improving prediction accuracy. This randomness comes into play when considering a split in a tree. Instead of considering all \(p\) predictors for a split, a random sample of \(m\) predictors is chosen as split candidates. This subset of predictors is different for each split, which means that different trees are likely to use different predictors in the top split, leading to a more diverse set of trees.
The number of predictors considered at each split, \(m\), is typically chosen to be the square root of the total number of predictors, \(p\). This choice is a rule of thumb that often works well in practice, but it can be tuned based on the specific characteristics of the dataset.
By decorrelating the trees, Random Forests can often achieve better performance than bagging, especially when there’s a small number of very strong predictors in the dataset. In such cases, bagging can end up with an ensemble of very similar trees that all rely heavily on these strong predictors, while Random Forests can leverage the other, weaker predictors more effectively.
One of the “interpretation” tools that comes with ensemble models is the importance rank: the total amount by which the deviance (loss) is decreased due to splits over a given predictor, averaged over all trees.
```r
library(MASS)           # Boston data
library(randomForest)
rf.boston = randomForest(medv ~ ., data = Boston, mtry = 6, ntree = 50, maxnodes = 50)
yhat.rf = predict(rf.boston, newdata = Boston)
oo = order(Boston$lstat)
plot(Boston$lstat[oo], Boston$medv[oo], pch = 21, bg = "grey", xlab = "lstat", ylab = "medv")  # plot data
lines(Boston$lstat[oo], yhat.rf[oo], col = 'red', lwd = 3)  # step function fit
```
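To extract the importance ranking from such a fit, the model needs to record the loss decrease per predictor; a sketch with the randomForest package (setting importance = TRUE adds the permutation-based %IncMSE measure as well — this refit is ours, not from the text):

```r
library(MASS); library(randomForest)
rf.imp <- randomForest(medv ~ ., data = Boston, mtry = 6, ntree = 500, importance = TRUE)
importance(rf.imp)    # per-predictor decrease in node impurity (and in MSE)
varImpPlot(rf.imp)    # importance rank as a plot
```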
Boosting
Boosting, like Random Forests, is a method that combines multiple trees to create a more powerful predictive model. However, the approach it takes is quite distinct.
Here’s how Boosting works:
Initially, a single decision tree is fitted to the data.
This initial tree is intentionally made weak, meaning it doesn’t perfectly fit the data.
We then examine the residuals, which represent the portion of the target variable \(y\) not explained by the weak tree.
A new tree is then fitted to these residuals, essentially trying to predict the error of the first tree.
This new tree is also “weakened” or “shrunk”. The prediction from this tree is then added to the prediction of the first tree.
This process is repeated iteratively. In each iteration, a new tree is fitted to the residuals of the current ensemble of trees, shrunk, and then added to the ensemble.
The final model is the sum of all these “shrunk” trees. The key idea behind Boosting is to iteratively improve the model by focusing on the parts of the data that the current model is not explaining well (the residuals). Each new tree is trying to correct the mistakes of the ensemble of previous trees. By adding together many weak models (the shrunk trees), Boosting can often achieve a strong overall model.
Pick a loss function \(L\) that reflects the setting; e.g., for continuous \(y\), we could take \(L(y_i, \theta_i) = (y_i - \theta_i)^2\). We want to solve \[\mathrm{minimize}_{\beta \in \mathbb{R}^M} \sum_{i=1}^n L \left(y_i, \sum_{j=1}^M \beta_j \cdot T_j(x_i)\right)\]
Here \(j\) indexes all trees of a fixed size (e.g., depth = 5), so \(M\) is huge.
This space is simply too big to optimize over directly.
Gradient boosting is basically a version of gradient descent that is forced to work with trees.
First, think of the optimization as \(\min_\theta f(\theta)\) over the predicted values \(\theta\) (subject to \(\theta\) coming from trees). The algorithm is:
Set \(f(x)=0\) (a constant predictor) and \(r_i=y_i\) for all \(i\)
For \(b=1,2,\ldots,B\)
Fit a tree \(f_b\) with \(d\) splits to the training set \((X,r)\)
Update the model \[f(x) \leftarrow f(x) +\lambda f_b(x)\]
Update the residuals \[r_i \leftarrow r_i - \lambda f_b(x_i)\]
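A literal sketch of this loop; rpart is used here only because its maxdepth option makes it easy to grow small trees (depth d stands in for “d splits”), and B, lambda, and d are illustrative defaults, so treat this as a sketch rather than a reference implementation.

```r
library(rpart)

boost_fit <- function(X, y, B = 500, lambda = 0.2, d = 2) {
  df <- data.frame(X)
  r  <- y                                      # f(x) = 0, so residuals start at r_i = y_i
  trees <- vector("list", B)
  for (b in 1:B) {
    df$r <- r
    trees[[b]] <- rpart(r ~ ., data = df,
                        control = rpart.control(maxdepth = d, cp = 0))  # small tree on (X, r)
    r <- r - lambda * predict(trees[[b]], df)  # shrink this tree's fit and update residuals
  }
  structure(list(trees = trees, lambda = lambda), class = "boost_fit")
}

# The final model is the sum of the shrunk trees.
predict.boost_fit <- function(object, newdata, ...) {
  object$lambda * rowSums(sapply(object$trees, predict, newdata = data.frame(newdata)))
}

# usage sketch on the Boston data:
# df2  <- MASS::Boston[, c("lstat", "dis", "medv")]
# fit  <- boost_fit(df2[, c("lstat", "dis")], df2$medv)
# yhat <- predict(fit, df2[, c("lstat", "dis")])
```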
Here are some boosting fits where we vary the number of trees but fix the depth at 2 (suitable with a single predictor \(x\)) and the shrinkage \(\lambda\) at 0.2.
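A hedged reconstruction of these fits with the gbm package (the text does not name the software, so this is an assumption), using a single predictor, depth-2 trees, shrinkage 0.2, and two illustrative tree counts:

```r
library(gbm)
df2 <- MASS::Boston[, c("lstat", "dis", "medv")]

boost.100 <- gbm(medv ~ lstat, data = df2, distribution = "gaussian",
                 n.trees = 100,  interaction.depth = 2, shrinkage = 0.2)
boost.1k  <- gbm(medv ~ lstat, data = df2, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 2, shrinkage = 0.2)

yhat.100 <- predict(boost.100, newdata = df2, n.trees = 100)
yhat.1k  <- predict(boost.1k,  newdata = df2, n.trees = 1000)
```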
Advantages of Boosting compared to Random Forests:
Performance: Boosting, in many cases, provides better predictive accuracy than Random Forests. By focusing on the residuals or mistakes, Boosting can incrementally improve model performance.
Model Interpretability: While both methods are not as interpretable as a single decision tree, Boosting models can sometimes be more interpretable than Random Forests, especially when the number of weak learners (trees) is small.
Disadvantages of Boosting compared to Random Forests:
Computation Time and Complexity: Boosting can be more computationally intensive than Random Forests. This is because trees are built sequentially in Boosting, while in Random Forests, they are built independently and can be parallelized.
Overfitting: Boosting can overfit the training data if the number of trees is too large, or if the trees are too complex. This is less of a problem with Random Forests, which are less prone to overfitting due to the randomness injected into the tree building process.
Outliers: Boosting can be sensitive to outliers since it tries to correct the mistakes of the predecessors. On the other hand, Random Forests are more robust to outliers.
Noise: Boosting can overemphasize instances that are hard to classify and can overfit to noise, whereas Random Forests are more robust to noise.
Remember, the choice between Boosting and Random Forests (or any other model) should be guided by the specific requirements of your task, including the nature of your data, the computational resources available, and the trade-off between interpretability and predictive accuracy.