BDA Random Forests, Feb. 2016 - BigData@UCSD
Random Forests
Feb. 2016
Roger Bohn
Big Data Analytics
1
Harold Colson on good library data catalogs
• Google Scholar http://scholar.google.com
• Web of Science http://uclibs.org/PID/12610
• Business Source Complete http://uclibs.org/PID/126938
• INSPEC http://uclibs.org/PID/22771
• ACM Digital Library http://www.acm.org/dl/
• IEEE Xplore http://www.ieee.org/ieeexplore/
• PubMed http://www.ncbi.nlm.nih.gov/sites/entrez?tool=cdl&otool=cdlotool
• See page http://libguides.ucsd.edu/data-statistics
2
Random Forests (DM Rattle + R)
• Build many decision trees (e.g., 500).
• For each tree:
  • Select a random subset of the training set (N);
  • Choose different subsets of variables for each node of the decision tree (m << M);
  • Build the tree without pruning (i.e., overfit).
• Classify a new entity using every decision tree:
  • Each tree "votes" for the entity.
  • The decision with the largest number of votes wins!
  • The proportion of votes is the resulting score.
• Outcome is a pseudo probability: 0 ≤ prob ≤ 1.
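The voting step above is simple enough to sketch in a few lines of base R. This is a toy illustration of how votes become a decision and a score, not the internals of any RF package; the function name rf_vote and the vote counts are invented for the example:

```r
# Toy illustration of RF voting: each tree casts one vote, the majority
# class wins, and the vote share becomes a score between 0 and 1.
rf_vote <- function(tree_predictions) {
  counts <- table(tree_predictions)
  list(decision = names(which.max(counts)),                # majority wins
       score    = max(counts) / length(tree_predictions))  # pseudo probability
}

votes  <- c(rep("No", 380), rep("Yes", 120))  # pretend 500 trees voted
result <- rf_vote(votes)
result$decision  # "No"
result$score     # 0.76
```

With 380 of 500 votes for "No", the forest classifies "No" with score 0.76, matching the 0 ≤ prob ≤ 1 claim above.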
3
RF on weather data

Example: RF on Weather Data
set.seed(42)
(m <- randomForest(RainTomorrow ~ ., weather[train, -c(1:2, 23)],
                   na.action=na.roughfix, importance=TRUE))

## Call:
##  randomForest(formula=RainTomorrow ~ ., data=weath...
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
##
##         OOB estimate of error rate: 13.67%
## Confusion matrix:
##      No Yes class.error
## No  211   4      0.0186
## Yes  31  10      0.7561
http://togaware.com Copyright 2014, [email protected] 21/36
4
Mechanics of RFs
• Each tree uses a random bag of observations, sampled with replacement; roughly two-thirds of the rows end up in-bag, one-third out.
• Each time a split in a tree is considered, a random selection of m predictors is chosen as candidates from the full set of p predictors. The split uses one of those m predictors, just like a single tree.
• A fresh selection of m predictors is taken at each split.
• Typically m ≈ √p: the number of predictors considered at each split is approximately the square root of the total number of predictors. (randomForest's default: floor(sqrt(ncol(x))) for classification, max(floor(ncol(x)/3), 1) for regression.)
• If the tree is deep, most of the p variables get considered at least once.
• We do not prune the trees. (This speeds up computation, among other effects.)
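As a concrete illustration of these defaults, here is a small base-R sketch. The helper names (default_mtry, candidates_for_split) are hypothetical; the formulas mirror the randomForest defaults quoted above:

```r
# Defaults mirroring randomForest: mtry = floor(sqrt(p)) for
# classification, max(floor(p/3), 1) for regression.
default_mtry <- function(p, classification = TRUE) {
  if (classification) floor(sqrt(p)) else max(floor(p / 3), 1)
}

# A fresh random subset of m predictors is drawn at every split.
candidates_for_split <- function(predictors, m) {
  sample(predictors, m)  # sampled without replacement
}

default_mtry(20)                          # 4 (as in the weather example)
default_mtry(20, classification = FALSE)  # 6
set.seed(1)
cands <- candidates_for_split(paste0("x", 1:20), default_mtry(20))
cands  # four predictor names, redrawn at each split
```

This matches the weather output above: with roughly 20 predictors, 4 variables are tried at each split.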
5
“Model” is 100s of small Trees
� Each tree is quick to solve, so computationally tractable
• Example model from RF:

## Tree 1 Rule 1 Node 30 Decision No
##
## 1: Evaporation <= 9
## 2: Humidity3pm <= 71
## 3: Cloud3pm <= 2.5
## 4: WindDir9am IN ("NNE")
## 5: Sunshine <= 10.25
## 6: Temp3pm <= 17.55

• Final decision (yes/no, or a level) just like a single tree.
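Such a rule is just a conjunction of conditions along one path of the tree. As an illustration, the rule printed above transcribed into plain R (the function name and the example observation values are invented for the sketch):

```r
# One tree's rule path as a conjunction of conditions: the tree votes
# "No" only when every condition holds; otherwise this rule is silent.
tree1_rule1 <- function(obs) {
  if (obs$Evaporation <= 9     &&
      obs$Humidity3pm <= 71    &&
      obs$Cloud3pm    <= 2.5   &&
      obs$WindDir9am %in% c("NNE") &&
      obs$Sunshine    <= 10.25 &&
      obs$Temp3pm     <= 17.55) "No" else NA
}

obs <- list(Evaporation = 5, Humidity3pm = 60, Cloud3pm = 2,
            WindDir9am = "NNE", Sunshine = 9, Temp3pm = 15)
tree1_rule1(obs)  # "No"
```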
6
Error rates

Example: Error Rate
Error rate decreases quickly then flattens over the 500 trees.
plot(m)
[Figure: plot(m). OOB error rate (y-axis, 0 to 0.8) versus number of trees (x-axis, 0 to 500) for model m.]
7
Properties of RFs
• Often works better than other methods.
• Runs efficiently on large data sets.
• Can handle hundreds of input variables.
• Gives estimates of variable importance.
• Results are easy to use, but too complex to summarize ("black box").
• Cross-validation is built in:
  • Each tree uses a random set of observations, drawn with replacement.
  • The omitted (out-of-bag) observations are the validation set for that tree.
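The "built-in cross-validation" follows from the arithmetic of sampling with replacement: each bootstrap draw leaves out roughly a third of the rows. A minimal base-R sketch:

```r
# Sampling n rows with replacement leaves about 1/e (~37%) of the rows
# out-of-bag (OOB); those rows are the validation set for that tree.
set.seed(42)
n      <- 1000
in_bag <- unique(sample(n, n, replace = TRUE))
oob    <- setdiff(seq_len(n), in_bag)
length(oob) / n  # close to 1/e, i.e. roughly 0.37
```

Averaging each tree's error on its own OOB rows yields the "OOB estimate of error rate" shown in the weather output.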
8
Random Forests
Example: Variable Importance
Helps understand the knowledge captured.
varImpPlot(m, main="Variable Importance")
[Figure: varImpPlot(m), "Variable Importance", two dot-chart panels. Left panel, MeanDecreaseAccuracy (scale 0 to 15), most to least important: Sunshine, Cloud3pm, Pressure3pm, Temp3pm, WindGustSpeed, MaxTemp, Pressure9am, Temp9am, Humidity3pm, MinTemp, Cloud9am, WindSpeed3pm, WindSpeed9am, Humidity9am, WindGustDir, Evaporation, WindDir9am, WindDir3pm, Rainfall, RainToday. Right panel, MeanDecreaseGini (scale 0 to 8), most to least important: Pressure3pm, Sunshine, Cloud3pm, Pressure9am, WindGustSpeed, Humidity3pm, MinTemp, Temp3pm, Temp9am, MaxTemp, Humidity9am, WindSpeed3pm, WindSpeed9am, Cloud9am, Evaporation, WindDir9am, WindGustDir, WindDir3pm, Rainfall, RainToday.]
9
R code
• randomForest is one RF program. There are others.

ds   <- weather[train, -c(1:2, 23)]
form <- RainTomorrow ~ .
m.rp <- rpart(form, data=ds)
m.rf <- randomForest(form, data=ds,
                     na.action=na.roughfix, importance=TRUE)
10
randomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
             mtry=if (!is.null(y) && !is.factor(y))
                    max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
             replace=TRUE, classwt=NULL, cutoff, strata,
             sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
             nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
             maxnodes = NULL,
             importance=FALSE, localImp=FALSE, nPerm=1,
             proximity, oob.prox=proximity,
             norm.votes=TRUE, do.trace=FALSE,
             keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
             keep.inbag=FALSE, ...)
Understanding Random Forests
Recall how CART is used in practice:
• Split to lower deviance until leaves hit minimum size.
• Create a set of candidate trees by pruning back from this.
• Choose the best among those trees by cross-validation.

Random Forests avoid the need for CV.

Each tree b is not overly complicated because you only work with a limited set of variables.

Your predictions are not "optimized to noise" because they are averages of trees fit to many different subsets.

RFs are a great go-to model for nonparametric prediction.
22
11
Mechanics: combining trees
• The RF builds 500 trees, giving 500 small models.
  • Check this! With many variables you may need more trees.
• The final prediction or classification is based on voting.
  • Usually unweighted voting: all trees equal.
  • Votes can be weighted, e.g. the most successful trees get the highest weights.
• For classification: the majority of trees determines the classification.
• For prediction problems (continuous outcomes): the average prediction of all the trees becomes the RF's prediction.
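Both combination rules fit in a line or two of base R. A toy sketch with three trees and three cases (the votes and prediction values are invented):

```r
# Classification: each column of `votes` is one new case, each row one tree.
votes <- matrix(c("Yes", "No", "No",    # tree 1's votes on three cases
                  "No",  "No", "Yes",   # tree 2
                  "No",  "No", "Yes"),  # tree 3
                nrow = 3, byrow = TRUE)
classify <- apply(votes, 2, function(v) names(which.max(table(v))))
classify  # "No" "No" "Yes": the majority of trees decides each case

# Regression: the forest's prediction is the unweighted mean across trees.
preds <- matrix(c(1.0, 2.0,
                  1.2, 2.2,
                  0.8, 1.8), nrow = 3, byrow = TRUE)
colMeans(preds)  # 1 2
```

Weighted voting would replace `table(v)` with a sum of per-tree weights, and `colMeans` with a weighted average.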
12
Random Forests
CART is an effective way to choose a single tree, but often there are many possible trees that fit the data similarly well.

An alternative approach is to make use of random forests.

• Sample B subsets of the data + variables: e.g., observations 1, 5, 20, ... and inputs 2, 10, 17, ...
• Fit a tree to each subset, to get B fitted trees, T_b.
• Average prediction across trees:
  - for regression: E[y|x] = (1/B) * sum_{b=1}^{B} T_b(x).
  - for classification: let {T_b(x)}, b = 1, ..., B, vote on y.

You're "shaking the data" and averaging across fits.
21
Case study: Comparing methods
A larger example: California Housing Data
Median home values in census tracts, along with
• Latitude and longitude of tract centers.
• Population totals and median income.
• Average room/bedroom numbers, home age.

The goal is to predict log(MedVal) for census tracts.

Difficult regression: covariate effects change with location. How they change is probably not linear.
28
13
From: Matt Taddy, Chicago Booth School faculty.chicagobooth.edu/matt.taddy/teaching
Single tree result

CART Dendrogram for CA housing

[Figure: dendrogram. Root split: medianIncome < 3.5471; further splits on medianIncome (2.51025, 4.5287, 5.5892, 7.393), latitude (34.465, 37.905), longitude (-117.775, -120.275), AveRooms < 4.70574, and AveOccupancy < 2.41199. Twelve leaves with fitted log(MedVal) values from 11.08 to 12.98.]

Income is dominant, with location important for low income. Cross-validation favors the most complicated tree: 12 leaves.
29
14
CART fit for CA housing data
Under-estimating the coast, over-estimating the central valley?
31
15
randomForest fit for CA housing data
No big residuals! (Although still missing the LA and SF effects.) Overfit? From out-of-sample prediction it appears not.
32
16
LASSO fit for CA housing data
Looks like over-estimates in the Bay, under-estimates in OC.
30
17
CA housing: out-of-sample prediction
[Figure: boxplots of out-of-sample PVE (proportion of variance explained; y-axis from -0.5 to 1.0) by model: LASSO, CART, RF.]
Trees outperform LASSO: gain from nonlinear interaction. RF is better still than CART: benefits of model averaging.
33
18
Evaluating importance of one variable
• Regression models: the size of the coefficient indicates importance (not the t statistic, although usually they are correlated).
• Random forest: two different estimation methods (mean decrease in accuracy, mean decrease in Gini).
• Any algorithm: leave one variable out of the final solution.
  • How much does the result change?
  • How much does performance decrease on the validation set? R^2 or classification errors.
• General comment: with high multicollinearity among variables, no one variable is important.
  • Omit variable 17; variables 18 and 19 are similar.
  • This is a philosophical problem, not a technique issue.
• Can also leave out sets of related variables (e.g. all demographic variables).
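The leave-one-variable-out idea works for any model. A base-R sketch using lm on invented toy data (the variable names and the r2 helper are made up for the illustration; the same loop applies to an RF or any other fitted model):

```r
# Drop-one-variable importance: refit without each variable and record
# the fall in validation R^2. Toy data; lm stands in for any model.
set.seed(7)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 2 * d$x1 + 0.5 * d$x2 + rnorm(n)  # x3 is pure noise
train <- 1:150; valid <- 151:200

r2 <- function(form) {  # validation-set R^2 for a given formula
  m <- lm(form, data = d[train, ])
  p <- predict(m, d[valid, ])
  1 - sum((d$y[valid] - p)^2) / sum((d$y[valid] - mean(d$y[valid]))^2)
}

full <- r2(y ~ x1 + x2 + x3)
drop <- c(x1 = full - r2(y ~ x2 + x3),  # large drop: x1 matters
          x2 = full - r2(y ~ x1 + x3),  # small drop
          x3 = full - r2(y ~ x1 + x2))  # near zero: noise variable
round(drop, 3)
```

If x2 and x3 were highly correlated copies of each other, dropping either alone would barely move R^2, which is exactly the multicollinearity caveat above.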
19
Other concepts using trees

Roundup on tree-based learning

We've seen two techniques for building tree models.
• CART: recursive partitions, pruned back by CV.
• randomForest: average many simple CART trees.

There are many other tree-based algorithms.
• Boosted Trees: repeatedly fit simple trees to residuals. Fast, but it is tough to avoid over-fit (requires full CV).
• Bayesian Additive Regression Trees: mix many simple trees. Robust prediction, but suffers with non-constant variance.
• Dynamic Trees: grow sequential 'particle' trees. Good online, but the fit depends on data ordering.

Trees are poor in high dimension, but fitting them to low-dimension factors (principal components) is a good option.
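To give a flavor of "repeatedly fit simple trees to residuals", here is a hand-rolled boosting sketch with one-split regression stumps on invented data. The stump fitter is a toy, not a library routine:

```r
# Boosting sketch: at each round, fit a one-split "stump" to the current
# residuals and add a shrunken copy of its prediction to the ensemble.
fit_stump <- function(x, r) {
  best <- NULL; best_sse <- Inf
  for (ct in quantile(x, probs = seq(0.1, 0.9, by = 0.1))) {
    left <- x <= ct
    pl <- mean(r[left]); pr <- mean(r[!left])
    sse <- sum((r[left] - pl)^2) + sum((r[!left] - pr)^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(cut = ct, left = pl, right = pr)
    }
  }
  best
}
predict_stump <- function(s, x) ifelse(x <= s$cut, s$left, s$right)

set.seed(3)
x <- runif(300)
y <- sin(2 * pi * x) + rnorm(300, sd = 0.2)

pred <- rep(0, length(y)); shrink <- 0.3
for (b in 1:200) {                  # 200 boosting rounds
  s    <- fit_stump(x, y - pred)    # fit a stump to the residuals
  pred <- pred + shrink * predict_stump(s, x)
}
mean((y - pred)^2)  # far below var(y): residuals shrink each round
```

Unlike RF's independent trees, each stump here depends on all previous ones, which is why boosting needs careful stopping (full CV) to avoid over-fit.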
35
20
Generalize: Groups of different models!
• Many models are better than any one model.
• Each model is better at classifying some situations.
• "Boosting" algorithms build on this idea.
21
Model Averaging

This technique of "model averaging" is central to many advanced nonparametric learning algorithms:

ensemble learning, mixture of experts, Bayesian averages, ...

It works best with flexible but simple models.

Recall lasso as a stabilized version of stepwise regression (if you jitter the data your estimates stay pretty constant).

Model averaging is a way to take arbitrary unstable methods and make them stable. This makes training easier.

The probability of rain on a new day is the average P(rain) across some trees that split on forecast, others on sky. We don't get tied to one way of deciding about umbrellas.
23
22
Comparing algorithms

| Property                             | Single tree | Random forest       | Logistic/regression         | LASSO              |
|--------------------------------------|-------------|---------------------|-----------------------------|--------------------|
| Nonlinear relationships?             | Good        | Very good           | Must pre-guess interactions | Same as regression |
| Explain to audience?                 | Good        | Difficult           | Good (most audiences)       | Difficult          |
| Large p                              | Erratic     | Good                | Poor                        | Good               |
| Variable importance                  | No          | Yes, although "odd" | Yes, very good              | Yes                |
| Handle continuous outcomes (predict) | Yes         | Yes                 | Yes                         | Yes                |
| Handle discrete outcomes (classify)  | Directly    | Directly            | Transform, e.g. logistic    | Transform          |
| OTSUs                                | Few         | Medium              | Interpretation              | Normalize          |
23
Comparing algorithms

| Property                             | Single tree | Random forest       | Logistic/regression         | LASSO     |
|--------------------------------------|-------------|---------------------|-----------------------------|-----------|
| Nonlinear relationships?             | Good        | Good                | Must pre-guess interactions | Same      |
| Explain to audience?                 | Very good   | Poor                | Very good if trained        | Medium    |
| Selecting variables (large p)        | Decent      | Good                | Poor                        | Very good |
| Variable importance                  | Weak        | Relative importance | Absolute importance         | Same      |
| Handle continuous outcomes (predict) | Yes         | Yes                 | Yes                         | Yes       |
| Handle discrete outcomes (classify)  | Yes         | Yes                 | Yes                         | Yes       |

Number of OTSUs? Who are we kidding? All have plenty of OTSUs. Hence the importance of validation, then test.
24