DATA MINING
Desktop Survival Guide
by Graham Williams

Sampling Data

The Sample tab allows us to partition our dataset into a training dataset and a testing dataset, and to select different random samples if we wish to explore the sensitivity of our models to different data samples.

[Figure: rattle-audit-sample]

Here we specify how to partition the dataset for exploratory and modelling purposes. By default Rattle builds two subsets of the dataset: a training dataset from which to build models, and a testing dataset used to measure the performance of the model. The default split is 70% training and 30% testing, but you are welcome to turn sampling off or to choose a different split. A very small sample may be required to perform some explorations on the resulting smaller dataset, or to build models using the more computationally expensive algorithms (like support vector machines).
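The 70%/30% partition described above can be sketched in plain R. This is only an illustration of the idea, not necessarily the code Rattle runs internally; the data frame ds and the variable names nobs, train and test are assumptions made for the example.

```r
# Hypothetical stand-in data; in practice this would be your own dataset.
set.seed(123)
ds <- data.frame(x = rnorm(200), y = rbinom(200, 1, 0.5))

nobs  <- nrow(ds)
train <- sample(nobs, round(0.7 * nobs))   # row indices for the training dataset
test  <- setdiff(seq_len(nobs), train)     # the remaining rows form the testing dataset

train_ds <- ds[train, ]
test_ds  <- ds[test, ]

dim(train_ds)
dim(test_ds)
```

Because the testing rows are defined as everything not chosen for training, the two subsets never overlap and together cover the whole dataset.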

R uses random numbers to generate samples, which may present a problem with regard to repeatable modelling: each time the sample function is called we get a different random sample. However, R provides the set.seed function to set the seed for the random numbers it generates next. Thus, by setting the seed to the same number each time, you can be assured of obtaining the same sample.
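A small sketch of this behaviour, using only the base R functions set.seed and sample (the variable names first, second and third are just for illustration):

```r
# Setting the same seed before each call reproduces the same sample.
set.seed(123)
first <- sample(100, 10)

set.seed(123)
second <- sample(100, 10)

identical(first, second)   # TRUE: same seed, same sample

# Without resetting the seed, the generator simply continues its stream,
# so this sample will generally differ from the two above.
third <- sample(100, 10)
```

The seed must be set immediately before the sampling step: any other use of random numbers in between moves the generator along and changes the sample.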

Rattle allows you to specify the random number generator seed. By default this is the number 123, but you can change it to get a different random sample on the next Execute. By keeping the random number generator seed constant you can be sure of getting the same model, and by changing it you can explore the sensitivity of the modelling to different samples of the dataset.
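One way to explore this sensitivity outside of Rattle is to repeat the partition with several seeds and compare the performance of the resulting models on the testing dataset. The sketch below makes several assumptions for illustration only: it uses R's built-in mtcars dataset, a simple linear model of mpg on wt, root mean squared error as the measure, and a hypothetical helper function rmse_for_seed.

```r
# Hypothetical helper: partition with the given seed, build a model on the
# training rows, and return the RMSE on the testing rows.
rmse_for_seed <- function(seed, data = mtcars, train_frac = 0.7) {
  set.seed(seed)
  n     <- nrow(data)
  train <- sample(n, round(train_frac * n))
  fit   <- lm(mpg ~ wt, data = data[train, ])
  pred  <- predict(fit, newdata = data[-train, ])
  sqrt(mean((data$mpg[-train] - pred)^2))
}

seeds <- c(123, 456, 789)
rmse  <- sapply(seeds, rmse_for_seed)
names(rmse) <- seeds
print(rmse)
```

If the error varies widely from seed to seed, the model is sensitive to the particular sample chosen, which is more likely with a small dataset such as this one.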

Often in modelling we build our model on a training dataset and then test its performance on a test dataset.

Subsections

Moving into R

Brought to you by Togaware.