DATA MINING
Desktop Survival Guide by Graham Williams
Evaluating the outcomes of data mining is important. We need to measure how any model we build will perform on previously unseen cases. Such a measure also lets us compare a model against other models we might choose to build, whether with the same model builder or a very different one. A common approach is to measure the error rate: the proportion of cases that the model classifies incorrectly (or, equivalently, the accuracy: the proportion it classifies correctly). Common methods for presenting and estimating the empirical error rate include confusion matrices and cross-validation.
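To make the error rate and the confusion matrix concrete, here is a minimal sketch in plain Python. It is an illustration only and not Rattle's own implementation: the function names `confusion_matrix` and `error_rate` and the eight labelled cases are made up for this example.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs, i.e. a two-way cross-tabulation."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))


def error_rate(actual, predicted):
    """Proportion of cases whose predicted label differs from the actual label."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one case is required")
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)


# Made-up labels for eight previously unseen cases (illustrative only).
actual = ["yes", "no", "no", "yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "yes", "no", "no", "no", "yes", "no"]

matrix = confusion_matrix(actual, predicted)
rate = error_rate(actual, predicted)

# Print the matrix with actual classes as rows and predicted classes as columns.
labels = sorted(set(actual) | set(predicted))
print("actual/predicted", *labels)
for a in labels:
    print(a, *(matrix[(a, p)] for p in labels))
print(f"error rate = {rate:.3f}, accuracy = {1 - rate:.3f}")
```

The off-diagonal counts of the matrix are the misclassified cases, so the error rate is their sum divided by the total number of cases, and the accuracy is one minus the error rate.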
In this chapter we explore Rattle's various tools for reporting the performance of a model and its various approaches to evaluating the output of data mining. We cover the confusion matrix (produced underneath with the table function), Rattle's new Risk Chart for effectively displaying model performance, including a measure of the success of each case, and the ROCR package for the graphical presentation of numerous evaluations, including the common approaches included in Rattle. Moving into R illustrates how to fine-tune the presentations for your own needs.
This chapter also touches on issues around the deployment of our models, in particular Rattle's Scoring option, which allows us to load a new dataset, apply our model to that dataset, and save the scores, together with the identity data, to a file for follow-up action.
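The general shape of scoring can be sketched in a few lines of Python. This is a hedged illustration of the idea and not a description of how Rattle's Scoring option works: `toy_model`, the `id` and `income` fields, and the threshold are all hypothetical stand-ins for a fitted model and a real dataset.

```python
import csv
import io


def score_cases(rows, id_field, model):
    """Apply `model` to each row and pair the score with the row's identifier."""
    return [{id_field: row[id_field], "score": model(row)} for row in rows]


def write_scores(scored, id_field, handle):
    """Write identifier and score columns as CSV to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=[id_field, "score"])
    writer.writeheader()
    writer.writerows(scored)


def toy_model(row):
    """Hypothetical stand-in for a fitted model: a fixed rule on one variable."""
    return 1 if float(row["income"]) > 50000 else 0


# Made-up new dataset; in practice this would be loaded from a file.
new_data = [
    {"id": "A1", "income": "62000"},
    {"id": "A2", "income": "31000"},
]

scored = score_cases(new_data, "id", toy_model)

# An in-memory buffer keeps the sketch self-contained.
buffer = io.StringIO()
write_scores(scored, "id", buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a handle from `open(path, "w", newline="")` in place of the buffer. Keeping the identifier next to each score is what lets the output be matched back to the original cases.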