

Random Forests:
Classification



The approach taken by random forests (http://en.wikipedia.org/wiki/Random_forest) is to build multiple decision trees (often many hundreds) from different subsets of entities from the dataset, and from different subsets of the variables, to obtain substantial performance gains over single tree classifiers. Each decision tree is built from a sample of the full dataset, and a random sample of the available variables is used at each node of each tree. Having built an ensemble of models, the final decision is the majority vote of the models for classification, or the average of the models for regression. The generalisation error rate from random forests tends to compare favourably with boosting approaches, yet the approach is more robust to noise in the data.

Note, as we can illustrate with the audit data, that building a random forest and evaluating it on the training dataset gives "perfect" results (which might make you suspect overfitting), but on the test data the performance is realistic (and generally still good).
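A minimal sketch of this effect, using the iris data so the example is self-contained (the audit data behaves the same way; the 100/50 train/test split here is purely illustrative):

> library(randomForest)
> set.seed(42)
> train <- sample(nrow(iris), 100)
> iris.rf <- randomForest(Species ~ ., iris[train,])
> # Re-predicting the training data: (near) perfect results
> table(predict(iris.rf, iris[train,]), iris[train, "Species"])
> # Predicting the held-out test data: a realistic error rate
> table(predict(iris.rf, iris[-train,]), iris[-train, "Species"])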

In random forests, if there are many noise variables, increase the number of variables considered at each node.
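The number of variables tried at each node is controlled by the mtry argument of randomForest(); for classification the default is the square root of the number of variables. A one-line illustration on the iris data (where the default is 2):

> # Consider 3 of the 4 variables at each split instead of the default 2
> iris.rf <- randomForest(Species ~ ., iris, mtry=3)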

Random forests are implemented in R by the randomForest package:

> library(randomForest)

The classwt option in the current randomForest package does not fully work and should be avoided. The sampsize and strata options can be used together. Note that if strata is not specified, the class labels are used as the strata.

Here's an example using the iris data:



> iris.rf <- randomForest(Species ~ ., iris, sampsize=c(10, 20, 10))

This will randomly sample 10, 20 and 10 entities from the three classes of species (with replacement) to grow each tree.

You can also name the classes in the sampsize specification:

> samples <- c(setosa=10, versicolor=20, virginica=10)
> iris.rf <- randomForest(Species ~ ., iris, sampsize=samples)

You can also stratify the sampling on a variable other than the class labels, for example to even up the class distribution. Andy Liaw gives an example using multi-center clinical trial data, where you want to draw the same number of patients from each center to grow each tree. You can do something like:

> randomForest(..., strata=center,
               sampsize=rep(min(table(center)), nlevels(center)))

This samples the same number of patients (the minimum available at any one center) from each center to grow each tree.
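A self-contained sketch of the same idea, using a simulated data frame with a hypothetical center variable (all names and data here are illustrative, not from the book):

> set.seed(42)
> n <- 300
> trial <- data.frame(center=factor(sample(c("A", "B", "C"), n, replace=TRUE)),
+                     x1=rnorm(n), x2=rnorm(n),
+                     outcome=factor(sample(c("no", "yes"), n, replace=TRUE)))
> # Draw min(table(center)) patients from each center for every tree
> trial.rf <- randomForest(trial[c("x1", "x2")], trial$outcome,
+                          strata=trial$center,
+                          sampsize=rep(min(table(trial$center)),
+                                       nlevels(trial$center)))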

The importance option allows us to review the importance of each variable in determining the outcome. Two measures are calculated: the first is the mean decrease in prediction accuracy when the values of the variable are randomly permuted, scaled by its standard error; the second is the total decrease in node impurity (measured by the Gini index) from splitting on the variable, averaged over all trees.
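For example, with the iris data the measures can be listed and plotted (importance() and varImpPlot() are part of the randomForest package):

> iris.rf <- randomForest(Species ~ ., iris, importance=TRUE)
> # Mean decrease in accuracy and mean decrease in Gini per variable
> importance(iris.rf)
> # Dotchart of both importance measures
> varImpPlot(iris.rf)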


