We’ve used decision trees before to describe the decision-making process as a sequence of actions and conditions. In this section, we’ll use decision trees to make predictions. You can think of a prediction as a decision task, where you need to decide which value of \(y\) to use for a given \(x\). Similar to a decision-making tree, a predictive tree model is a nested sequence of if-else statements that maps any input data point \(x\) to a predicted output \(y\). Each if-else statement checks a feature of \(x\) and sends the data point left or right along a branch of the tree. At the end of the branch, a single value of \(y\) is predicted.
Figure 18.1 shows a decision tree for predicting a chess piece given a four-dimensional input vector that describes the types of moves available to the piece. The tree is a sequence of nested if-else statements that check the values of the input vector. The tree has six leaves, one for each chess piece, and has a depth of four. The tree is a predictive model that maps a four-dimensional input vector to a single categorical output with six possible values.
Figure 18.1: Elementary tree scheme; visualization of the splitting process
The prediction algorithm is simple. Start at the root node and move down the tree until you reach a leaf node. The process of building a tree, given a set of training data, is more complicated and has three main components:
Splitting. The process of dividing the training data into subsets based on the value of a single feature. The goal is to create subsets that are as homogeneous as possible. The subsets are then used to create the nodes of the tree.
Stopping. The process of deciding when to stop splitting. The goal is to create a tree that is as accurate as possible without overfitting the training data.
Pruning. The process of removing nodes from the tree that do not improve the accuracy of the tree. The goal is to create a tree that is as accurate as possible without overfitting the training data.
The splitting process is the most important part of tree building. At each step, the splitting process needs to decide on the feature index \(j\) to be used for splitting and on the location of the split. For a binary variable there is only one possible split location, but for a continuous variable there are many. The goal is to find the split that creates the most homogeneous subsets. In the case of regression trees, the best split is the one that minimizes the sum of squared errors; in the case of classification trees, it is the one that minimizes the Gini impurity, a measure of how homogeneous the subsets are.
The widely used CART algorithm works on the data table \(X \in \mathbb{R}^{n\times p}\) during the splitting process. It loops over every element \(x_{ij}\) and picks the split value that creates the most homogeneous subsets. A split creates two subsets: the left subset \(X_L = \{x \mid x_j < x_{ij}\}\) and the right subset \(X_R = \{x \mid x_j \ge x_{ij}\}\).
To measure how homogeneous the sets are, in the case of regression, we calculate \(\hat y_L = \mathrm{Average}(y_i),\ i \in I_L\) and \(\hat y_R = \mathrm{Average}(y_i),\ i \in I_R\), where \(I_L\) and \(I_R\) are the indices of the observed inputs \(x_i\) in the left and right subsets. The best split is the one that minimizes the sum of squared errors \[
\text{SSE} = \sum_{i\in I_L} (y_i - \hat{y}_L)^2 + \sum_{i\in I_R} (y_i - \hat{y}_R)^2
\]
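To make the split search concrete, here is a minimal sketch (not from the text) of the exhaustive search over features and split locations that minimizes the SSE above; `X` is assumed to be a numeric matrix and `y` a numeric response vector.

```r
# Exhaustive best-split search for one regression-tree node (illustrative sketch).
# Assumes X is an n x p numeric matrix and y a numeric response of length n.
best_split <- function(X, y) {
  best <- list(sse = Inf, j = NA, s = NA)
  for (j in seq_len(ncol(X))) {            # candidate feature index j
    for (s in unique(X[, j])) {            # candidate split location x_ij
      left <- X[, j] < s
      if (!any(left) || all(left)) next    # skip splits that leave a subset empty
      sse <- sum((y[left]  - mean(y[left]))^2) +
             sum((y[!left] - mean(y[!left]))^2)
      if (sse < best$sse) best <- list(sse = sse, j = j, s = s)
    }
  }
  best                                     # feature index and split point with the lowest SSE
}
```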
Let’s start with a quick demo and look at the data. We’ll use the Hitters data set from the ISLR package. The data set contains information about Major League Baseball players. The goal is to predict the salary of a player based on his performance and experience. The data set contains 322 observations and 20 variables. The three variables we will analyze are:
Hits: Number of hits in 1986
Years: Number of years in the major leagues
Salary: 1987 annual salary on opening day in thousands of dollars
The question we are interested in is: do experience and performance affect the salary of a baseball player?
It seems like there is a relation between the number of years, the number of hits, and the salary. The bottom-left corner of the input space is mostly occupied by low-salary players. We will use the CART algorithm to find the optimal splits.
We use the tree function, which has a syntax similar to lm. Then we call prune.tree to find the best tree with 3 leaves (terminal nodes). Each terminal node corresponds to a region.
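A sketch of that workflow, assuming the ISLR and tree packages; modeling log(Salary) is an assumption here (a common choice for this example), not something fixed by the text.

```r
library(ISLR)   # Hitters data
library(tree)   # tree(), prune.tree(), partition.tree()

hitters <- na.omit(Hitters)                        # drop players with missing Salary
big.tree <- tree(log(Salary) ~ Years + Hits, data = hitters)
small.tree <- prune.tree(big.tree, best = 3)       # best subtree with 3 leaves

plot(small.tree); text(small.tree)                 # the nested if-else statements
partition.tree(small.tree)                         # the 3 regions in (Years, Hits) space
```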
The prediction is rather straightforward. The tree divides the predictor space (that is, the set of possible values for \(x_1, x_2, \ldots, x_p\)) into \(J\) distinct and non-overlapping boxes, \(R_1, R_2, \ldots, R_J\). For every observation that falls into the region \(R_j\), we make the same prediction, which is simply the mean of the response values for the training observations in \(R_j\).
The overall goal of building a tree is to find regions that minimize the total Residual Sum of Squares (RSS) \[
\mathrm{RSS} = \sum_{j=1}^J\sum_{i \in R_j}(y_i - \bar{y}_j)^2 \rightarrow \mathrm{minimize}
\]
Unfortunately, it is computationally infeasible (an NP-hard problem) to consider every possible partition of the feature space into \(J\) boxes. We can find a good approximate solution using a top-down approach (the CART algorithm). As mentioned earlier, at each iteration we decide on which variable \(j\) to split and on the split point \(s\), \[
R_1(j, s) = \{x\mid x_j < s\} \mbox{ and } R_2(j, s) = \{x\mid x_j \ge s\},
\] thus, we seek to minimize (in the case of a regression tree) \[
\min_{j,s}\left[ \sum_{i:x_i\in R_1}(y_i - \bar{y}_1)^2 + \sum_{i:x_i \in R_2}(y_i - \bar{y}_2)^2\right]
\] As a result, every observed input point belongs to a single region.
Now let’s discuss how many regions we should have. At one extreme end, we can have \(n\) regions, one for each observation. Then the tree model will work similar to the one-nearest neighbor model. At the other end, we can have one big region for the entire input space and then every prediction will be the same (average across observed \(y\)’s). Both models can be used but usually the best one is in the middle. The number of regions (branches) controls the complexity of the model. We need to find a good size on the variance-bias scale. A smaller tree with fewer splits (that is, fewer regions \(R_1,...,R_J\)) might lead to lower variance and better interpretation at the cost of a little bias.
How do we construct a tree with a “manageable” number of branches? This is accomplished through the steps of forward tree construction and backward pruning. The forward step is a greedy algorithm that begins with a single region and divides it into two. This procedure is repeated until a certain stopping criterion is met. A practical method is to continue building the tree until the Residual Sum of Squares (RSS) plateaus. However, this method can be myopic as an initially unproductive split might be followed by a highly beneficial one, leading to a significant decrease in RSS in subsequent iterations. A more effective strategy is to grow an extensive tree \(T_0\), and then trim it down to obtain a subtree. The size of the subtree can be determined using cross-validation. However, be aware that the number of subtrees can be exponential!
Instead of considering all possible subtrees, we will do cost-complexity pruning, also known as weakest-link pruning. We consider a sequence of trees indexed by a nonnegative tuning parameter \(\alpha\). For each value of \(\alpha\) there corresponds a subtree \(T \subset T_0\) that minimizes \[
\sum_{m=1}^{|T|}\sum_{i:x_i\in R_m}(y_i - \bar{y}_m)^2 + \alpha |T|
\] The parameter \(\alpha\) balances the complexity of the subtree and its adherence to the training data. When we increment \(\alpha\) starting from zero, branches are predictably and sequentially pruned from the tree, making it straightforward to acquire the entire series of subtrees as a function of \(\alpha\). We determine the optimal value \(\hat \alpha\) through cross-validation. Afterward, we refer back to the complete data set and extract the subtree that corresponds to \(\hat \alpha\).
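With the tree package, this cross-validation over the cost-complexity sequence can be done with cv.tree; a hedged sketch (the Hitters tree from above is refit here so the chunk stands alone):

```r
library(ISLR); library(tree)
hitters <- na.omit(Hitters)
big.tree <- tree(log(Salary) ~ Years + Hits, data = hitters)  # the large tree T_0

cv.out <- cv.tree(big.tree)                 # CV deviance along the pruning sequence
plot(cv.out$size, cv.out$dev, type = "b",
     xlab = "tree size |T|", ylab = "CV deviance")
best.size <- cv.out$size[which.min(cv.out$dev)]
final.tree <- prune.tree(big.tree, best = best.size)  # subtree matching the chosen alpha
```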
A classification tree operates much like a regression tree. The prediction is made based on the “majority vote”, which means selecting the class that appears most frequently within the region. The process of developing a classification tree is largely the same as that of a regression tree, involving recursive binary splitting. However, instead of using the Residual Sum of Squares (RSS), we use the classification error rate, which is the proportion of observations in that region that do not belong to the most prevalent class.
We start by introducing some notation \[
p_{mk} = \dfrac{1}{N_m}\sum_{x_i \in R_m} I(y_i=k),
\] which is the proportion of observations of class \(k\) in region \(m\).
Within region \(m\) we then compute \[
p_m = \max_k p_{mk},~~~ E_m = 1-p_m
\] i.e., \(p_m\) is the proportion of the most frequent class in region \(m\) and \(E_m\) is the classification error rate in that region.
For a new input \(x\), the predicted class probabilities are \[
P(y=k \mid x) = \sum_{j=1}^J p_{jk}\, I(x \in R_j)
\]
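A tiny numerical illustration of the node proportions and the majority-vote rule; the labels below are made up for the example.

```r
# Class proportions p_mk, majority-vote prediction, and error E_m in one region.
y_region <- factor(c("rook", "knight", "knight", "bishop", "knight"))  # made-up labels
p_mk <- table(y_region) / length(y_region)   # proportion of each class in the region
p_m  <- max(p_mk)                            # share of the most frequent class
E_m  <- 1 - p_m                              # misclassification rate in the region
prediction <- names(which.max(p_mk))         # "knight" by majority vote
```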
An alternative method to evaluate the quality of a split in a classification tree is through the use of the Gini Index or Cross-Entropy. Let’s consider a scenario where we have an equal number of observations in each class, say 400 in each.
Now, suppose we create a tree that results in two regions: one with a distribution of (300,100) and the other with (100,300). This means that in the first region, 300 observations belong to one class and 100 to the other, and vice versa in the second region.
Consider another scenario where we have a different tree that results in two regions with distributions of (200,400) and (200,0).
In both cases, the misclassification rate is 0.25, meaning that 25% of the observations are incorrectly classified. However, the second tree is more desirable. Why is that? The second tree has a region with no misclassifications at all (200,0), which means it’s perfectly classifying all observations in that region. This is an ideal situation in classification problems. On the other hand, the first tree, despite having the same overall misclassification rate, doesn’t have any region where all observations are correctly classified.
This illustrates that while the misclassification rate is a useful metric, it doesn’t always capture the full picture. Other metrics like the Gini Index or Cross-Entropy can provide a more nuanced view of the quality of a split, taking into account not just the overall error rate, but also the distribution of errors across different regions.
In short, we prefer splits that produce “pure” nodes, and the Gini index and cross-entropy capture this preference better than the raw misclassification rate.
The Gini index: \[
G_m = \sum_{k=1}^K p_{mk}(1-p_{mk})
\] It measures the variance across the \(K\) classes. It takes on a small value if all of the \(p_{mk}\)’s are close to zero or one.
An alternative to the Gini index is cross-entropy (a.k.a. deviance), given by \[
D_m = -\sum_{k=1}^Kp_{mk}\log p_{mk}
\] It is near zero if the \(p_{mk}\)’s are all near zero or near one. The Gini index and the cross-entropy lead to similar results.
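Here is the earlier (300,100)/(100,300) versus (200,400)/(200,0) comparison worked out numerically; the helper functions are ours, not from the text.

```r
gini    <- function(p) sum(p * (1 - p))
entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))
prop    <- function(counts) counts / sum(counts)

# Tree 1: regions (300,100) and (100,300); Tree 2: regions (200,400) and (200,0).
# Both misclassify 200 of 800 observations (rate 0.25), but the impurities differ.
(400 * gini(prop(c(300, 100))) + 400 * gini(prop(c(100, 300)))) / 800  # 0.375
(600 * gini(prop(c(200, 400))) + 200 * gini(prop(c(200, 0))))   / 800  # about 0.333

(400 * entropy(prop(c(300, 100))) + 400 * entropy(prop(c(100, 300)))) / 800  # about 0.562
(600 * entropy(prop(c(200, 400))) + 200 * entropy(prop(c(200, 0))))   / 800  # about 0.477
# Both measures are lower for the second tree, so both prefer the purer split.
```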
Now we apply the tree model to the Boston housing dataset.
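The fit used in the next chunks; a hedged sketch assuming df2 holds the Boston data (MASS::Boston) with the two predictors lstat and dis and the response medv. The exact tree size used in the text is not shown, so this is just a plausible fit.

```r
library(MASS)   # Boston data
library(tree)
df2 <- Boston[, c("lstat", "dis", "medv")]
boston.tree <- tree(medv ~ lstat + dis, data = df2)   # regression tree on two predictors
```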
```r
plot(boston.tree, type = "u")   # plot tree and partition in x.
text(boston.tree, col = "blue", label = c("yval"), cex = .8)
partition.tree(boston.tree)
```
Next, we get predictions on a 2D grid:
```r
pv = seq(from = .01, to = .99, by = .05)
x1q = quantile(df2$lstat, probs = pv)
x2q = quantile(df2$dis, probs = pv)
xx = expand.grid(x1q, x2q)   # matrix with two columns using all combinations of x1q and x2q
dfpred = data.frame(dis = xx[, 2], lstat = xx[, 1])
lmedpred = predict(boston.tree, dfpred)
```
Decision trees are incredibly intuitive and simple to explain. They can be even more straightforward to understand than linear regression models. Some theorists argue that decision trees mimic human decision-making processes more accurately than other regression and classification methods we’ve discussed in previous chapters. Decision trees can be visually represented, making them easily interpretable, even for those without a deep understanding of the underlying mechanics, particularly when the trees are not overly complex. Decision trees can effortlessly manage qualitative predictors, eliminating the need to create dummy variables.
Disadvantages of Decision Trees:
Large trees can exhibit high variance. This means that a minor change in the data can lead to a significant change in the final estimated tree, making the model unstable. Conversely, small trees, while more stable, may not be powerful predictors as they might oversimplify the problem. It can be challenging to find a balance between bias and variance when using decision trees. A model with too much bias oversimplifies the problem and performs poorly, while a model with too much variance overfits the data and may not generalize well to unseen data.
There are several techniques used to address the issue of overfitting in decision trees. We considered the pruning technique, which reduces the complexity of the final classifier and hence improves predictive accuracy by reducing overfitting. Two other methods are random forests and boosting. Random Forests is an ensemble method where multiple decision trees are created and their predictions are averaged (for regression) or combined by majority vote (for classification). Boosting is another ensemble technique where trees are built sequentially such that each subsequent tree aims to reduce the bias of the combined classifier.
In the bagging approach, we treat the sample as if it were the population and then take iid draws. That is, you sample with replacement so that you can get the same original sample value more than once in a bootstrap sample.
To Bootstrap Aggregate (Bag), we:
Take \(B\) bootstrap samples from the training data, each of the same size as the training data.
Fit a large tree to each bootstrap sample (we know how to do this fast!). This will give us \(B\) trees.
Combine the results from each of the \(B\) trees to get an overall prediction.
When the target variable \(y\) is numeric, the bagging process is straightforward. The final prediction is simply the average of the predictions from each of the \(B\) trees. However, when \(y\) is categorical, the process of combining results from different trees is less straightforward. One common approach is to use a voting system. In this system, each tree in the ensemble makes a prediction for a given input \(x\). The predicted category that receives the most votes (out of \(B\) total votes) is chosen as the final prediction. Another approach is to average the predicted probabilities \(\hat p\) from each tree. This method can provide a more nuanced prediction, especially in cases where the voting results are close.
Despite the potential benefits of averaging predicted probabilities, most software implementations of bagging for decision trees use the voting method. This is likely due to its simplicity and intuitive appeal. However, the best method to use can depend on the specific characteristics of the problem at hand.
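A minimal bagging sketch for a numeric response using the tree package; the formula, the value of B, and the use of the Boston variables are illustrative choices, not prescribed by the text.

```r
library(tree)
df2 <- MASS::Boston[, c("lstat", "dis", "medv")]

B <- 100
preds <- matrix(NA_real_, nrow = nrow(df2), ncol = B)
for (b in 1:B) {
  idx <- sample(nrow(df2), replace = TRUE)            # bootstrap sample (iid draws with replacement)
  fit <- tree(medv ~ lstat + dis, data = df2[idx, ])  # large tree on the bootstrap sample
  preds[, b] <- predict(fit, newdata = df2)           # predictions on the original inputs
}
yhat.bag <- rowMeans(preds)                           # average over the B trees
```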
The simple idea behind every ensemble model is that the variance of the average is lower than the variance of an individual model. Say we have \(B\) models \(f_1(x),\ldots,f_B(x)\); then we combine them as \[
f_{avg}(x) = \dfrac{1}{B}\sum_{b=1}^Bf_b(x)
\] Combining models helps fight overfitting. On the negative side, ensembles are harder to interpret.
Let’s experiment with the number of trees in the model.
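A hedged reconstruction of this experiment with the randomForest package, using all predictors at each split (mtry equal to the number of predictors, i.e., bagging) and 1,000 versus 5,000 trees; the plotting choices are ours.

```r
library(randomForest)
df2 <- MASS::Boston[, c("lstat", "dis", "medv")]

bag.1k <- randomForest(medv ~ lstat + dis, data = df2, mtry = 2, ntree = 1000)
bag.5k <- randomForest(medv ~ lstat + dis, data = df2, mtry = 2, ntree = 5000)

oo <- order(df2$lstat)
plot(df2$lstat[oo], df2$medv[oo], pch = 21, bg = "grey", xlab = "lstat", ylab = "medv")
lines(df2$lstat[oo], predict(bag.1k, newdata = df2)[oo], col = "red",  lwd = 2)
lines(df2$lstat[oo], predict(bag.5k, newdata = df2)[oo], col = "blue", lwd = 2)
```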
With 1,000 and 5,000 trees the fit is not bad and very similar.
Note that because our method averages over multiple trees, the fit is no longer a simple step function!
Random Forest
In the bagging technique, the trees can become highly correlated, which prevents the achievement of a \(1/B\) reduction in variance. This happens because most, if not all, of the trees will use the most influential predictor in the top split. As a result, bagged trees tend to look very similar to each other.
Random Forests, on the other hand, introduce an element of randomness that helps to decorrelate the trees, making the ensemble more robust and improving prediction accuracy. This randomness comes into play when considering a split in a tree. Instead of considering all \(p\) predictors for a split, a random sample of \(m\) predictors is chosen as split candidates. This subset of predictors is different for each split, which means that different trees are likely to use different predictors in the top split, leading to a more diverse set of trees.
The number of predictors considered at each split, \(m\), is typically chosen to be the square root of the total number of predictors, \(p\). This choice is a rule of thumb that often works well in practice, but it can be tuned based on the specific characteristics of the dataset.
By decorrelating the trees, Random Forests can often achieve better performance than bagging, especially when there’s a small number of very strong predictors in the dataset. In such cases, bagging can end up with an ensemble of very similar trees that all rely heavily on these strong predictors, while Random Forests can leverage the other, weaker predictors more effectively.
One of the “interpretation” tools that comes with ensemble models is the importance rank: the total amount by which the deviance (loss) is decreased due to splits over a given predictor, averaged over all trees.
```r
library(MASS)           # Boston data
library(randomForest)
rf.boston = randomForest(medv ~ ., data = Boston, mtry = 6, ntree = 50, maxnodes = 50)
yhat.rf = predict(rf.boston, newdata = Boston)
oo = order(Boston$lstat)
plot(Boston$lstat[oo], Boston$medv[oo], pch = 21, bg = "grey", xlab = "lstat", ylab = "medv")  # plot data
lines(Boston$lstat[oo], yhat.rf[oo], col = 'red', lwd = 3)  # step function fit
```
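To extract the importance ranking from such a fit, the model needs to record the loss decrease per predictor; a sketch with the randomForest package (setting importance = TRUE adds the permutation-based %IncMSE measure as well — this refit is ours, not from the text):

```r
library(MASS); library(randomForest)
rf.imp <- randomForest(medv ~ ., data = Boston, mtry = 6, ntree = 500, importance = TRUE)
importance(rf.imp)    # per-predictor decrease in node impurity (and in MSE)
varImpPlot(rf.imp)    # importance rank as a plot
```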
Boosting
Boosting, like Random Forests, is a method that combines multiple trees to create a more powerful predictive model. However, the approach it takes is quite distinct.
Here’s how Boosting works:
Initially, a single decision tree is fitted to the data.
This initial tree is intentionally made weak, meaning it doesn’t perfectly fit the data.
We then examine the residuals, which represent the portion of the target variable \(y\) not explained by the weak tree.
A new tree is then fitted to these residuals, essentially trying to predict the error of the first tree.
This new tree is also “weakened” or “shrunk”. The prediction from this tree is then added to the prediction of the first tree.
This process is repeated iteratively. In each iteration, a new tree is fitted to the residuals of the current ensemble of trees, shrunk, and then added to the ensemble.
The final model is the sum of all these “shrunk” trees. The key idea behind Boosting is to iteratively improve the model by focusing on the parts of the data that the current model is not explaining well (the residuals). Each new tree is trying to correct the mistakes of the ensemble of previous trees. By adding together many weak models (the shrunk trees), Boosting can often achieve a strong overall model.
Pick a loss function \(L\) that reflects the setting; e.g., for continuous \(y\), we could take \(L(y_i, \theta_i) = (y_i - \theta_i)^2\). We want to solve \[\mathrm{minimize}_{\beta \in \mathbb{R}^M} \sum_{i=1}^n L \left(y_i, \sum_{j=1}^M \beta_j \cdot T_j(x_i)\right)\]
Here \(j\) indexes all trees of a fixed size (e.g., depth = 5), so \(M\) is huge.
This space is simply too big to optimize over directly.
Gradient boosting is basically a version of gradient descent that is forced to work with trees.
First, think of the optimization as \(\min_\theta f(\theta)\) over the predicted values \(\theta\) (subject to \(\theta\) coming from trees). The algorithm is:
Set \(f(x)=0\) (a constant predictor) and \(r_i=y_i\) for all \(i\)
For \(b=1,2,\ldots,B\)
Fit a tree \(f_b\) with \(d\) splits to the training set \((X,r)\)
Update the model \[f(x) \leftarrow f(x) +\lambda f_b(x)\]
Update the residuals \[r_i \leftarrow r_i - \lambda f_b(x_i)\]
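A literal sketch of this loop; rpart is used here only because its maxdepth option makes it easy to grow small trees (depth d stands in for “d splits”), and B, lambda, and d are illustrative defaults, so treat this as a sketch rather than a reference implementation.

```r
library(rpart)

boost_fit <- function(X, y, B = 500, lambda = 0.2, d = 2) {
  df <- data.frame(X)
  r  <- y                                      # f(x) = 0, so residuals start at r_i = y_i
  trees <- vector("list", B)
  for (b in 1:B) {
    df$r <- r
    trees[[b]] <- rpart(r ~ ., data = df,
                        control = rpart.control(maxdepth = d, cp = 0))  # small tree on (X, r)
    r <- r - lambda * predict(trees[[b]], df)  # shrink this tree's fit and update residuals
  }
  structure(list(trees = trees, lambda = lambda), class = "boost_fit")
}

# The final model is the sum of the shrunk trees.
predict.boost_fit <- function(object, newdata, ...) {
  object$lambda * rowSums(sapply(object$trees, predict, newdata = data.frame(newdata)))
}

# usage sketch on the Boston data:
# df2  <- MASS::Boston[, c("lstat", "dis", "medv")]
# fit  <- boost_fit(df2[, c("lstat", "dis")], df2$medv)
# yhat <- predict(fit, df2[, c("lstat", "dis")])
```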
Here are some boosting fits where we vary the number of trees but fix the depth at 2 (suitable with a single predictor \(x\)) and the shrinkage \(\lambda\) at 0.2.
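A hedged reconstruction of these fits with the gbm package (the text does not name the software, so this is an assumption), using a single predictor, depth-2 trees, shrinkage 0.2, and two illustrative tree counts:

```r
library(gbm)
df2 <- MASS::Boston[, c("lstat", "dis", "medv")]

boost.100 <- gbm(medv ~ lstat, data = df2, distribution = "gaussian",
                 n.trees = 100,  interaction.depth = 2, shrinkage = 0.2)
boost.1k  <- gbm(medv ~ lstat, data = df2, distribution = "gaussian",
                 n.trees = 1000, interaction.depth = 2, shrinkage = 0.2)

yhat.100 <- predict(boost.100, newdata = df2, n.trees = 100)
yhat.1k  <- predict(boost.1k,  newdata = df2, n.trees = 1000)
```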
Advantages of Boosting compared to Random Forests:
Performance: Boosting, in many cases, provides better predictive accuracy than Random Forests. By focusing on the residuals or mistakes, Boosting can incrementally improve model performance.
Model Interpretability: While both methods are not as interpretable as a single decision tree, Boosting models can sometimes be more interpretable than Random Forests, especially when the number of weak learners (trees) is small.
Disadvantages of Boosting compared to Random Forests:
Computation Time and Complexity: Boosting can be more computationally intensive than Random Forests. This is because trees are built sequentially in Boosting, while in Random Forests, they are built independently and can be parallelized.
Overfitting: Boosting can overfit the training data if the number of trees is too large, or if the trees are too complex. This is less of a problem with Random Forests, which are less prone to overfitting due to the randomness injected into the tree building process.
Outliers: Boosting can be sensitive to outliers since it tries to correct the mistakes of the predecessors. On the other hand, Random Forests are more robust to outliers.
Noise: Boosting can overemphasize instances that are hard to classify and can overfit to noise, whereas Random Forests are more robust to noise.
Remember, the choice between Boosting and Random Forests (or any other model) should be guided by the specific requirements of your task, including the nature of your data, the computational resources available, and the trade-off between interpretability and predictive accuracy.