Trees 🌲🌲🌲

Regression tree

Recursive binary splitting

Consider every predictor and every possible cutpoint, and choose the pair whose split gives the tree with the lowest RSS. Repeat within each resulting region
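
A minimal sketch of one greedy split search, assuming numeric predictors and midpoints between distinct values as candidate cutpoints. The names `rss` and `best_split` and the toy data are made up for illustration.

```python
def rss(values):
    """Residual sum of squares around the mean of `values`."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Return (predictor index, cutpoint, total RSS) of the best single split.

    X is a list of rows (one list of numbers per observation), y the responses.
    A row goes to the left region when its value is below the cutpoint.
    """
    best = None
    for j in range(len(X[0])):
        # Assumption: candidate cutpoints are midpoints between distinct values.
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            cut = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < cut]
            right = [yi for row, yi in zip(X, y) if row[j] >= cut]
            total = rss(left) + rss(right)
            if best is None or total < best[2]:
                best = (j, cut, total)
    return best


# Toy data: predictor 0 separates the responses perfectly, predictor 1 does not.
X = [[1, 10], [2, 20], [3, 10], [4, 20]]
y = [1.0, 1.0, 5.0, 5.0]
split = best_split(X, y)  # (0, 2.5, 0.0)
```

A full tree would call `best_split` again on the left and right regions until a stopping rule is met, such as a minimum number of observations per region.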

Tree pruning

Cost complexity pruning
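
A rough sketch of the idea, assuming the usual criterion RSS + alpha × (number of terminal nodes). The candidate subtrees and their numbers are hypothetical; in practice alpha would be chosen by cross-validation.

```python
def cost_complexity(rss, n_leaves, alpha):
    """Penalized fit: training RSS plus alpha per terminal node."""
    return rss + alpha * n_leaves


def pick_subtree(candidates, alpha):
    """Return the (name, rss, n_leaves) candidate with the lowest penalized cost."""
    return min(candidates, key=lambda c: cost_complexity(c[1], c[2], alpha))


# Hypothetical subtrees of one large tree: (name, training RSS, terminal nodes).
candidates = [("full", 10.0, 8), ("medium", 14.0, 4), ("stump", 30.0, 2)]

small_alpha = pick_subtree(candidates, 0.0)   # no penalty: the full tree wins
mid_alpha = pick_subtree(candidates, 2.0)     # 26 vs 22 vs 34: medium wins
large_alpha = pick_subtree(candidates, 10.0)  # 90 vs 54 vs 50: stump wins
```

Larger alpha values favor smaller trees, trading training fit for lower variance.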

Classification trees

Make recursive splits by classification error rate, Gini index, or cross entropy
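
The three criteria can be sketched for a single node as follows, using the class proportions in that node. Function names are illustrative, and natural log is assumed for the cross entropy.

```python
import math


def class_proportions(labels):
    """Fraction of observations in the node belonging to each class."""
    n = len(labels)
    return [labels.count(c) / n for c in set(labels)]


def classification_error(p):
    """1 minus the proportion of the most common class."""
    return 1 - max(p)


def gini(p):
    """Gini index: sum of p_k * (1 - p_k) over classes."""
    return sum(pk * (1 - pk) for pk in p)


def cross_entropy(p):
    """Cross entropy: -sum of p_k * log(p_k), skipping empty classes."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)


pure = class_proportions(["a", "a", "a"])        # one class only
mixed = class_proportions(["a", "a", "b", "b"])  # evenly mixed
```

All three are zero for a pure node and largest for an evenly mixed one. Gini and cross entropy are more sensitive to node purity than the error rate.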

Categorical variables are ok, no need for dummy variables

Trees are easy to explain, display, and interpret

Tree prediction performance can be improved by methods such as bagging and boosting, which aggregate many decision trees


Trees have high variance, and bagging is a way to reduce that variance

Bootstrap aggregation

Create many training sets from the same data by sampling with replacement

Fit a tree to each set and aggregate the predictions (average for regression, majority vote for classification)
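
A small sketch of the bootstrap and aggregation steps. To stay short, each "model" here is only the mean of its bootstrap sample, a stand-in assumption where a real implementation would fit a tree. All names are illustrative.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items from data, sampling with replacement."""
    n = len(data)
    return [data[rng.randrange(n)] for _ in range(n)]


def average(predictions):
    """Aggregate regression predictions by averaging."""
    return sum(predictions) / len(predictions)


def majority_vote(predictions):
    """Aggregate class predictions by taking the most common label."""
    return max(set(predictions), key=predictions.count)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
data = [2.0, 4.0, 6.0, 8.0, 10.0]
samples = [bootstrap_sample(data, rng) for _ in range(100)]

# Stand-in for "fit a tree, then predict": each model predicts its sample mean.
predictions = [average(s) for s in samples]
bagged = average(predictions)

vote = majority_vote(["yes", "no", "yes"])  # "yes"
```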

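To estimate test error, there is no need for CV or a validation set. Instead, use

  • out-of-bag (OOB) observations: the observations that a given tree was not fit on
  • out-of-bag (OOB) error: the error of predictions made for each observation using only the trees for which it was out of bag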
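
A sketch of how an out-of-bag error could be computed for regression. As a stand-in assumption, each model predicts the mean of its in-bag responses instead of fitting a tree. Function names are illustrative.

```python
import random


def bootstrap_indices(n, rng):
    """Indices of one bootstrap sample of size n."""
    return [rng.randrange(n) for _ in range(n)]


def oob_indices(n, in_bag):
    """Indices that never appear in the bootstrap sample."""
    seen = set(in_bag)
    return [i for i in range(n) if i not in seen]


def oob_error(y, n_models, rng):
    """Mean squared error using, for each observation, only the models
    for which that observation was out of bag."""
    n = len(y)
    sums = [0.0] * n
    counts = [0] * n
    for _ in range(n_models):
        in_bag = bootstrap_indices(n, rng)
        # Stand-in model (not a tree): predict the in-bag mean everywhere.
        prediction = sum(y[i] for i in in_bag) / n
        for i in oob_indices(n, in_bag):
            sums[i] += prediction
            counts[i] += 1
    errors = [(y[i] - sums[i] / counts[i]) ** 2 for i in range(n) if counts[i] > 0]
    if not errors:
        raise ValueError("no observation was ever out of bag")
    return sum(errors) / len(errors)


rng = random.Random(0)
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
err = oob_error(y, 200, rng)
```

Each bootstrap sample leaves out roughly a third of the observations on average, so with enough trees every observation gets an OOB prediction.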

Bagging improves prediction accuracy but decreases interpretability

Predictor importance can be measured by the total decrease in RSS (regression trees) or Gini index (classification trees) from splits on that predictor, averaged over all trees
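
A sketch of the importance calculation, assuming each tree is represented as a list of (predictor, decrease in RSS or Gini) pairs recorded at its splits. That representation and the numbers are hypothetical.

```python
from collections import defaultdict


def variable_importance(trees):
    """Total decrease attributed to each predictor, averaged over all trees."""
    totals = defaultdict(float)
    for splits in trees:
        for predictor, decrease in splits:
            totals[predictor] += decrease
    return {p: total / len(trees) for p, total in totals.items()}


# Hypothetical record of splits from two bagged trees.
trees = [
    [("age", 4.0), ("income", 1.0)],
    [("age", 2.0), ("age", 2.0)],
]
importance = variable_importance(trees)
# age: (4 + 2 + 2) / 2 = 4.0, income: 1 / 2 = 0.5
```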


Just as we learn from previous mistakes, models can learn from previous models

  • Sequential training

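  • AdaBoost gives more weight to misclassified samples in the next round

  • Gradient boosting fits each new tree to the residuals (the negative gradient of the loss) of the current model

  • XGBoost (extreme gradient boosting)
  • Since boosting is sequential, it's slow. XGBoost speeds it up by parallelizing the work within each tree (such as the split search); the trees themselves are still built one after another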
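
A sketch of one reweighting round in the style of discrete AdaBoost, assuming the weighted error is strictly between 0 and 1. The function name and toy inputs are illustrative.

```python
import math


def adaboost_round(weights, misclassified):
    """Return (alpha, new weights) after one boosting round.

    weights: current sample weights.
    misclassified: True where the current weak learner was wrong.
    Assumes 0 < weighted error < 1, otherwise alpha is undefined.
    """
    total = sum(weights)
    err = sum(w for w, m in zip(weights, misclassified) if m) / total
    alpha = 0.5 * math.log((1 - err) / err)  # weight of this learner's vote
    # Increase weights of misclassified samples, decrease the others.
    updated = [
        w * math.exp(alpha if m else -alpha)
        for w, m in zip(weights, misclassified)
    ]
    norm = sum(updated)
    return alpha, [w / norm for w in updated]


weights = [0.25, 0.25, 0.25, 0.25]
misclassified = [False, False, False, True]
alpha, new_weights = adaboost_round(weights, misclassified)
```

After normalization the misclassified sample carries half of the total weight, so the next learner is pushed to get it right.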

Random forests 🌳🌴🌲

At each split, consider only a random sample of m predictors as split candidates

This decorrelates the trees
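
A sketch of the predictor sampling step, assuming the common default of m close to the square root of the number of predictors p. The function name is illustrative.

```python
import math
import random


def candidate_predictors(p, rng, m=None):
    """Pick m of the p predictor indices, without replacement, for one split."""
    if m is None:
        m = max(1, round(math.sqrt(p)))  # assumed default: m is about sqrt(p)
    return sorted(rng.sample(range(p), m))


rng = random.Random(0)
subset = candidate_predictors(16, rng)            # 4 of the 16 predictors
all_predictors = candidate_predictors(5, rng, 5)  # m = p: every predictor
```

A fresh subset would be drawn at every split of every tree. Setting m equal to p considers all predictors at each split, which reduces to plain bagging.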

As a measure of node purity, Gini index and cross entropy could be used.

January 16, 2020