BDA Random Forests, Feb. 2016 - BigData@UCSD
Random Forests
Feb. 2016
Roger Bohn
Big Data Analytics
1
Harold Colson on good library data catalogs
• Google Scholar http://scholar.google.com
• Web of Science http://uclibs.org/PID/12610
• Business Source Complete http://uclibs.org/PID/126938
• INSPEC http://uclibs.org/PID/22771
• ACM Digital Library http://www.acm.org/dl/
• IEEE Xplore http://www.ieee.org/ieeexplore/
• PubMed http://www.ncbi.nlm.nih.gov/sites/entrez?tool=cdl&otool=cdlotool
• See page http://libguides.ucsd.edu/data-statistics
2
Random Forests (DM Rattle + R)
• Build many decision trees (e.g., 500).
• For each tree:
  • Select a random subset of the training set (N);
  • Choose different subsets of variables for each node of the decision tree (m << M);
  • Build the tree without pruning (i.e., overfit).
• Classify a new entity using every decision tree:
  • Each tree "votes" for the entity.
  • The decision with the largest number of votes wins!
  • The proportion of votes is the resulting score.
• Outcome is a pseudo probability: 0 ≤ prob ≤ 1.
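The voting step above is simple enough to sketch in a few lines of base R. This is a toy illustration of how votes become a decision and a score, not the internals of any RF package; the function name rf_vote and the vote counts are invented for the example:

```r
# Toy illustration of RF voting: each tree casts one vote, the majority
# class wins, and the vote share becomes a score between 0 and 1.
rf_vote <- function(tree_predictions) {
  counts <- table(tree_predictions)
  list(decision = names(which.max(counts)),                # majority wins
       score    = max(counts) / length(tree_predictions))  # pseudo probability
}

votes  <- c(rep("No", 380), rep("Yes", 120))  # pretend 500 trees voted
result <- rf_vote(votes)
result$decision  # "No"
result$score     # 0.76
```

With 380 of 500 votes for "No", the forest classifies "No" with score 0.76, matching the 0 ≤ prob ≤ 1 claim above.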
3
RF on weather data

Example: RF on Weather Data
set.seed(42)
(m <- randomForest(RainTomorrow ~ ., weather[train, -c(1:2, 23)],
                   na.action=na.roughfix, importance=TRUE))

## Call:
##  randomForest(formula=RainTomorrow ~ ., data=weath...
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
##
##         OOB estimate of error rate: 13.67%
## Confusion matrix:
##      No Yes class.error
## No  211   4      0.0186
## Yes  31  10      0.7561
http://togaware.com Copyright 2014, [email protected] 21/36
4
Mechanics of RFs
• Each tree uses a random bag of observations, sampled with replacement; roughly two-thirds of the rows end up in-bag, one-third out.
• Each time a split in a tree is considered, a random selection of m predictors is chosen as candidates from the full set of p predictors. The split uses one of those m predictors, just like a single tree.
• A fresh selection of m predictors is taken at each split.
• Typically m ≈ √p: the number of predictors considered at each split is approximately the square root of the total number of predictors. (randomForest's default: floor(sqrt(ncol(x))) for classification, max(floor(ncol(x)/3), 1) for regression.)
• If the tree is deep, most of the p variables get considered at least once.
• We do not prune the trees. (This speeds up computation, among other effects.)
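As a concrete illustration of these defaults, here is a small base-R sketch. The helper names (default_mtry, candidates_for_split) are hypothetical; the formulas mirror the randomForest defaults quoted above:

```r
# Defaults mirroring randomForest: mtry = floor(sqrt(p)) for
# classification, max(floor(p/3), 1) for regression.
default_mtry <- function(p, classification = TRUE) {
  if (classification) floor(sqrt(p)) else max(floor(p / 3), 1)
}

# A fresh random subset of m predictors is drawn at every split.
candidates_for_split <- function(predictors, m) {
  sample(predictors, m)  # sampled without replacement
}

default_mtry(20)                          # 4 (as in the weather example)
default_mtry(20, classification = FALSE)  # 6
set.seed(1)
cands <- candidates_for_split(paste0("x", 1:20), default_mtry(20))
cands  # four predictor names, redrawn at each split
```

This matches the weather output above: with roughly 20 predictors, 4 variables are tried at each split.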
5
“Model” is 100s of small Trees
� Each tree is quick to solve, so computationally tractable
• Example model from RF:

## Tree 1 Rule 1 Node 30 Decision No
##
## 1: Evaporation <= 9
## 2: Humidity3pm <= 71
## 3: Cloud3pm <= 2.5
## 4: WindDir9am IN ("NNE")
## 5: Sunshine <= 10.25
## 6: Temp3pm <= 17.55

• Final decision (yes/no, or a level) just like a single tree.
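Such a rule is just a conjunction of conditions along one path of the tree. As an illustration, the rule printed above transcribed into plain R (the function name and the example observation values are invented for the sketch):

```r
# One tree's rule path as a conjunction of conditions: the tree votes
# "No" only when every condition holds; otherwise this rule is silent.
tree1_rule1 <- function(obs) {
  if (obs$Evaporation <= 9     &&
      obs$Humidity3pm <= 71    &&
      obs$Cloud3pm    <= 2.5   &&
      obs$WindDir9am %in% c("NNE") &&
      obs$Sunshine    <= 10.25 &&
      obs$Temp3pm     <= 17.55) "No" else NA
}

obs <- list(Evaporation = 5, Humidity3pm = 60, Cloud3pm = 2,
            WindDir9am = "NNE", Sunshine = 9, Temp3pm = 15)
tree1_rule1(obs)  # "No"
```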
6
Error rates

Example: Error Rate
Error rate decreases quickly then flattens over the 500 trees.
plot(m)
[Figure: plot(m). OOB error rate (y-axis, 0 to 0.8) versus number of trees (x-axis, 0 to 500) for model m.]
7
Properties of RFs
• Often works better than other methods.
• Runs efficiently on large data sets.
• Can handle hundreds of input variables.
• Gives estimates of variable importance.
• Results are easy to use, but too complex to summarize ("black box").
• Cross-validation is built in:
  • Each tree uses a random set of observations, drawn with replacement.
  • The omitted (out-of-bag) observations are the validation set for that tree.
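The "built-in cross-validation" follows from the arithmetic of sampling with replacement: each bootstrap draw leaves out roughly a third of the rows. A minimal base-R sketch:

```r
# Sampling n rows with replacement leaves about 1/e (~37%) of the rows
# out-of-bag (OOB); those rows are the validation set for that tree.
set.seed(42)
n      <- 1000
in_bag <- unique(sample(n, n, replace = TRUE))
oob    <- setdiff(seq_len(n), in_bag)
length(oob) / n  # close to 1/e, i.e. roughly 0.37
```

Averaging each tree's error on its own OOB rows yields the "OOB estimate of error rate" shown in the weather output.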
8
Random Forests
Example: Variable Importance
Helps understand the knowledge captured.
varImpPlot(m, main="Variable Importance")
[Figure: varImpPlot(m), "Variable Importance", two dot-chart panels. Left panel, MeanDecreaseAccuracy (scale 0 to 15), most to least important: Sunshine, Cloud3pm, Pressure3pm, Temp3pm, WindGustSpeed, MaxTemp, Pressure9am, Temp9am, Humidity3pm, MinTemp, Cloud9am, WindSpeed3pm, WindSpeed9am, Humidity9am, WindGustDir, Evaporation, WindDir9am, WindDir3pm, Rainfall, RainToday. Right panel, MeanDecreaseGini (scale 0 to 8), most to least important: Pressure3pm, Sunshine, Cloud3pm, Pressure9am, WindGustSpeed, Humidity3pm, MinTemp, Temp3pm, Temp9am, MaxTemp, Humidity9am, WindSpeed3pm, WindSpeed9am, Cloud9am, Evaporation, WindDir9am, WindGustDir, WindDir3pm, Rainfall, RainToday.]
9
R code
• randomForest is one RF program. There are others.

ds   <- weather[train, -c(1:2, 23)]
form <- RainTomorrow ~ .
m.rp <- rpart(form, data=ds)
m.rf <- randomForest(form, data=ds,
                     na.action=na.roughfix, importance=TRUE)
10
randomForest(x, y=NULL, xtest=NULL, ytest=NULL, ntree=500,
             mtry=if (!is.null(y) && !is.factor(y))
                    max(floor(ncol(x)/3), 1) else floor(sqrt(ncol(x))),
             replace=TRUE, classwt=NULL, cutoff, strata,
             sampsize = if (replace) nrow(x) else ceiling(.632*nrow(x)),
             nodesize = if (!is.null(y) && !is.factor(y)) 5 else 1,
             maxnodes = NULL,
             importance=FALSE, localImp=FALSE, nPerm=1,
             proximity, oob.prox=proximity,
             norm.votes=TRUE, do.trace=FALSE,
             keep.forest=!is.null(y) && is.null(xtest), corr.bias=FALSE,
             keep.inbag=FALSE, ...)
Understanding Random Forests
Recall how CART is used in practice:
• Split to lower deviance until leaves hit minimum size.
• Create a set of candidate trees by pruning back from this.
• Choose the best among those trees by cross-validation.

Random Forests avoid the need for CV.

Each tree b is not overly complicated because you only work with a limited set of variables.

Your predictions are not "optimized to noise" because they are averages of trees fit to many different subsets.

RFs are a great go-to model for nonparametric prediction.
22
11
Mechanics: combining trees
• The RF builds 500 trees, giving 500 small models.
  • Check this! With many variables you may need more trees.
• The final prediction or classification is based on voting.
  • Usually unweighted voting: all trees equal.
  • Votes can be weighted, e.g. the most successful trees get the highest weights.
• For classification: the majority of trees determines the classification.
• For prediction problems (continuous outcomes): the average prediction of all the trees becomes the RF's prediction.
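Both combination rules fit in a line or two of base R. A toy sketch with three trees and three cases (the votes and prediction values are invented):

```r
# Classification: each column of `votes` is one new case, each row one tree.
votes <- matrix(c("Yes", "No", "No",    # tree 1's votes on three cases
                  "No",  "No", "Yes",   # tree 2
                  "No",  "No", "Yes"),  # tree 3
                nrow = 3, byrow = TRUE)
classify <- apply(votes, 2, function(v) names(which.max(table(v))))
classify  # "No" "No" "Yes": the majority of trees decides each case

# Regression: the forest's prediction is the unweighted mean across trees.
preds <- matrix(c(1.0, 2.0,
                  1.2, 2.2,
                  0.8, 1.8), nrow = 3, byrow = TRUE)
colMeans(preds)  # 1 2
```

Weighted voting would replace `table(v)` with a sum of per-tree weights, and `colMeans` with a weighted average.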
12
Random Forests
CART is an effective way to choose a single tree, but often there are many possible trees that fit the data similarly well.

An alternative approach is to make use of random forests.

• Sample B subsets of the data + variables: e.g., observations 1, 5, 20, ... and inputs 2, 10, 17, ...
• Fit a tree to each subset, to get B fitted trees, T_b.
• Average prediction across trees:
  - for regression: E[y|x] = (1/B) * sum_{b=1}^{B} T_b(x).
  - for classification: let {T_b(x)}, b = 1, ..., B, vote on y.

You're "shaking the data" and averaging across fits.
21
Case study: Comparing methods
A larger example: California Housing Data
Median home values in census tracts, along with
• Latitude and longitude of tract centers.
• Population totals and median income.
• Average room/bedroom numbers, home age.

The goal is to predict log(MedVal) for census tracts.

Difficult regression: covariate effects change with location. How they change is probably not linear.
28
13
From: Matt Taddy, Chicago Booth School faculty.chicagobooth.edu/matt.taddy/teaching
Single tree result

CART Dendrogram for CA housing

[Figure: dendrogram. Root split: medianIncome < 3.5471; further splits on medianIncome (2.51025, 4.5287, 5.5892, 7.393), latitude (34.465, 37.905), longitude (-117.775, -120.275), AveRooms < 4.70574, and AveOccupancy < 2.41199. Twelve leaves with fitted log(MedVal) values from 11.08 to 12.98.]

Income is dominant, with location important for low income. Cross-validation favors the most complicated tree: 12 leaves.
29
14
CART fit for CA housing data
Under-estimating the coast, over-estimating the central valley?
31
15
randomForest fit for CA housing data
No big residuals! (Although still missing the LA and SF effects.) Overfit? From out-of-sample prediction it appears not.
32
16
LASSO fit for CA housing data
Looks like over-estimates in the Bay, under-estimates in OC.
30
17
CA housing: out-of-sample prediction
[Figure: boxplots of out-of-sample PVE (proportion of variance explained; y-axis from -0.5 to 1.0) by model: LASSO, CART, RF.]
Trees outperform LASSO: gain from nonlinear interaction. RF is better still than CART: benefits of model averaging.
33
18
Evaluating importance of one variable
• Regression models: the size of the coefficient indicates importance (not the t statistic, although usually they are correlated).
• Random forest: two different estimation methods (mean decrease in accuracy, mean decrease in Gini).
• Any algorithm: leave one variable out of the final solution.
  • How much does the result change?
  • How much does performance decrease on the validation set? R^2 or classification errors.
• General comment: with high multicollinearity among variables, no one variable is important.
  • Omit variable 17; variables 18 and 19 are similar.
  • This is a philosophical problem, not a technique issue.
• Can also leave out sets of related variables (e.g. all demographic variables).
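The leave-one-variable-out idea works for any model. A base-R sketch using lm on invented toy data (the variable names and the r2 helper are made up for the illustration; the same loop applies to an RF or any other fitted model):

```r
# Drop-one-variable importance: refit without each variable and record
# the fall in validation R^2. Toy data; lm stands in for any model.
set.seed(7)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 2 * d$x1 + 0.5 * d$x2 + rnorm(n)  # x3 is pure noise
train <- 1:150; valid <- 151:200

r2 <- function(form) {  # validation-set R^2 for a given formula
  m <- lm(form, data = d[train, ])
  p <- predict(m, d[valid, ])
  1 - sum((d$y[valid] - p)^2) / sum((d$y[valid] - mean(d$y[valid]))^2)
}

full <- r2(y ~ x1 + x2 + x3)
drop <- c(x1 = full - r2(y ~ x2 + x3),  # large drop: x1 matters
          x2 = full - r2(y ~ x1 + x3),  # small drop
          x3 = full - r2(y ~ x1 + x2))  # near zero: noise variable
round(drop, 3)
```

If x2 and x3 were highly correlated copies of each other, dropping either alone would barely move R^2, which is exactly the multicollinearity caveat above.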
19
Other concepts using trees

Roundup on tree-based learning

We've seen two techniques for building tree models.
• CART: recursive partitions, pruned back by CV.
• randomForest: average many simple CART trees.

There are many other tree-based algorithms.
• Boosted Trees: repeatedly fit simple trees to residuals. Fast, but it is tough to avoid over-fit (requires full CV).
• Bayesian Additive Regression Trees: mix many simple trees. Robust prediction, but suffers with non-constant variance.
• Dynamic Trees: grow sequential 'particle' trees. Good online, but the fit depends on data ordering.

Trees are poor in high dimension, but fitting them to low-dimension factors (principal components) is a good option.
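To give a flavor of "repeatedly fit simple trees to residuals", here is a hand-rolled boosting sketch with one-split regression stumps on invented data. The stump fitter is a toy, not a library routine:

```r
# Boosting sketch: at each round, fit a one-split "stump" to the current
# residuals and add a shrunken copy of its prediction to the ensemble.
fit_stump <- function(x, r) {
  best <- NULL; best_sse <- Inf
  for (ct in quantile(x, probs = seq(0.1, 0.9, by = 0.1))) {
    left <- x <= ct
    pl <- mean(r[left]); pr <- mean(r[!left])
    sse <- sum((r[left] - pl)^2) + sum((r[!left] - pr)^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(cut = ct, left = pl, right = pr)
    }
  }
  best
}
predict_stump <- function(s, x) ifelse(x <= s$cut, s$left, s$right)

set.seed(3)
x <- runif(300)
y <- sin(2 * pi * x) + rnorm(300, sd = 0.2)

pred <- rep(0, length(y)); shrink <- 0.3
for (b in 1:200) {                  # 200 boosting rounds
  s    <- fit_stump(x, y - pred)    # fit a stump to the residuals
  pred <- pred + shrink * predict_stump(s, x)
}
mean((y - pred)^2)  # far below var(y): residuals shrink each round
```

Unlike RF's independent trees, each stump here depends on all previous ones, which is why boosting needs careful stopping (full CV) to avoid over-fit.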
35
20
Generalize: Groups of different models!
• Many models are better than any one model.
• Each model is better at classifying some situations.
• "Boosting" algorithms build on this idea.
21
Model Averaging

This technique of "model averaging" is central to many advanced nonparametric learning algorithms:

ensemble learning, mixture of experts, Bayesian averages, ...

It works best with flexible but simple models.

Recall lasso as a stabilized version of stepwise regression (if you jitter the data your estimates stay pretty constant).

Model averaging is a way to take arbitrary unstable methods and make them stable. This makes training easier.

The probability of rain on a new day is the average P(rain) across some trees that split on forecast, others on sky. We don't get tied to one way of deciding about umbrellas.
23
22
Comparing algorithms

| Property                             | Single tree | Random forest       | Logistic/regression         | LASSO              |
|--------------------------------------|-------------|---------------------|-----------------------------|--------------------|
| Nonlinear relationships?             | Good        | Very good           | Must pre-guess interactions | Same as regression |
| Explain to audience?                 | Good        | Difficult           | Good (most audiences)       | Difficult          |
| Large p                              | Erratic     | Good                | Poor                        | Good               |
| Variable importance                  | No          | Yes, although "odd" | Yes, very good              | Yes                |
| Handle continuous outcomes (predict) | Yes         | Yes                 | Yes                         | Yes                |
| Handle discrete outcomes (classify)  | Directly    | Directly            | Transform, e.g. logistic    | Transform          |
| OTSUs                                | Few         | Medium              | Interpretation              | Normalize          |
23
Comparing algorithms

| Property                             | Single tree | Random forest       | Logistic/regression         | LASSO     |
|--------------------------------------|-------------|---------------------|-----------------------------|-----------|
| Nonlinear relationships?             | Good        | Good                | Must pre-guess interactions | Same      |
| Explain to audience?                 | Very good   | Poor                | Very good if trained        | Medium    |
| Selecting variables (large p)        | Decent      | Good                | Poor                        | Very good |
| Variable importance                  | Weak        | Relative importance | Absolute importance         | Same      |
| Handle continuous outcomes (predict) | Yes         | Yes                 | Yes                         | Yes       |
| Handle discrete outcomes (classify)  | Yes         | Yes                 | Yes                         | Yes       |

Number of OTSUs? Who are we kidding? All have plenty of OTSUs. Hence the importance of validation, then test.
24