DATA MINING
Desktop Survival Guide by Graham Williams
Weka runs as a Java application. Thus, a user can simply obtain the
appropriate Java archive file weka.jar and then start up the
application with Java. On GNU/Linux and Unix this is usually:
java -jar weka.jar
On MS/Windows, the javaw launcher (which starts Java without a console window) can be used instead:
javaw -jar weka.jar
On start-up you will see the Weka GUI Chooser (Figure ). From here you can either run the system from a simple command line interface (Simple CLI), or else start up an interactive data explorer and modeller with the Explorer button.
The Weka Explorer (Figure ) can be used to interactively load data, pre-process the data, and run the modelling tools. Figure shows a dataset that has been loaded: the left pane lists the variables found in the CSV file, and the right pane plots the distribution of the output variable (yexno).
To load a CSV file, for example, click on the Open file... button. This will bring up the Weka Open dialogue (Figure ). Click on the button labelled Arff data files and change it to CSV data files. Then browse to the CSV file you wish to load. In our example this is wine-nominal.csv. Double-click the name, or select it and then click the Open button, to import the data.
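The dialogue defaults to ARFF because that is Weka's native file format, in which each variable is declared before the data rows. To give a feel for the difference from CSV, here is a rough plain-Java sketch (it does not use Weka) that rewrites a small CSV text as ARFF. It assumes a header row, comma-separated values with no quoting, no missing values, and that every column is nominal. The class name CsvToArffSketch, the relation name, and the sample columns are made up for illustration and are not taken from wine-nominal.csv.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: Weka's own CSV import handles far more cases than this.
final class CsvToArffSketch {

    // Convert simple CSV text to ARFF text, treating every column as nominal.
    // Assumes a header row, no quoting, no missing values, equal-length rows.
    static String toArff(String relation, String csv) {
        String[] lines = csv.trim().split("\\R");
        String[] names = lines[0].split(",");

        // Collect the distinct values of each column, in order of appearance.
        List<Set<String>> values = new ArrayList<>();
        for (int c = 0; c < names.length; c++) {
            values.add(new LinkedHashSet<>());
        }
        for (int r = 1; r < lines.length; r++) {
            String[] cells = lines[r].split(",");
            for (int c = 0; c < names.length; c++) {
                values.get(c).add(cells[c].trim());
            }
        }

        // Header: one @attribute line per column, listing its nominal values.
        StringBuilder out = new StringBuilder("@relation " + relation + "\n\n");
        for (int c = 0; c < names.length; c++) {
            out.append("@attribute ").append(names[c].trim())
               .append(" {").append(String.join(",", values.get(c))).append("}\n");
        }

        // Data section: the CSV rows, unchanged apart from trimming.
        out.append("\n@data\n");
        for (int r = 1; r < lines.length; r++) {
            out.append(lines[r].trim()).append("\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical data, made up for illustration.
        String csv = "colour,Class\nred,a\nwhite,b\nred,b\n";
        System.out.print(toArff("demo", csv));
    }
}
```

In practice there is no need to do this by hand, since the Open dialogue performs the import for you once CSV data files is selected.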
To start building models, go to the Classify tab (Figure ). The default model builder is ZeroR, a very basic model builder indeed! Click the Choose button to select from over 60 model builders. For example, under Trees you could choose J48, which is an implementation of C4.5. You will also find support vector machines (SMO under Functions), random forests (under Trees), and AdaBoost (under Meta). Once you have chosen your model builder, the corresponding command is shown in the text box to the right of the button. Click in here to change any of the parameters, or to read some documentation about the chosen method. From the drop-down menu above the Start button, choose the output variable (in this case we have chosen Class). When you are ready to build your model, click on the Start button.
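ZeroR is a useful baseline because it is so simple: for a nominal output variable it ignores the input variables and always predicts the most frequent class. The following is a minimal plain-Java sketch of that idea, not Weka's implementation. The class name ZeroRSketch, the tie-breaking rule (first label seen wins), and the sample labels are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative stand-in for the idea behind ZeroR; this is not Weka's code.
final class ZeroRSketch {

    // Return the most frequent label; ties go to the label seen first
    // (an assumption of this sketch).
    static String majorityClass(List<String> labels) {
        if (labels.isEmpty()) {
            throw new IllegalArgumentException("need at least one label");
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String label : labels) {
            counts.merge(label, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // Fraction of labels that the constant prediction gets right.
    static double accuracy(String prediction, List<String> labels) {
        long hits = labels.stream().filter(prediction::equals).count();
        return (double) hits / labels.size();
    }

    public static void main(String[] args) {
        // Hypothetical class labels, made up for illustration.
        List<String> train = List.of("yes", "no", "yes", "yes", "no");
        String prediction = majorityClass(train);
        System.out.println("Constant prediction: " + prediction);
        System.out.println("Training accuracy: " + accuracy(prediction, train));
    }
}
```

Any model builder you choose from the list should do at least as well as this constant prediction on your data; if it does not, the inputs are probably contributing little.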
A tree built this way will list, for each branch, the number of training instances and the number of these that are misclassified.
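To make those two numbers concrete, the sketch below computes them for a single leaf (the end of a branch): the leaf predicts the majority class of the training instances that reach it, and every instance with a different class counts as misclassified. This is a plain-Java illustration rather than Weka code. The class name LeafSummarySketch, the "label (n/m)" layout, and the sample labels are assumptions, and the counts here are whole numbers, whereas Weka's counts can be fractional.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: summarises one leaf of a classification tree.
final class LeafSummarySketch {

    // Build "label (n/m)", where n training instances reach the leaf and
    // m of them do not carry the leaf's majority label; "/m" is dropped
    // when m is zero. The layout is an assumption of this sketch.
    static String summarise(List<String> labelsAtLeaf) {
        if (labelsAtLeaf.isEmpty()) {
            throw new IllegalArgumentException("leaf has no training instances");
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String label : labelsAtLeaf) {
            counts.merge(label, 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        int total = labelsAtLeaf.size();
        int misclassified = total - bestCount;
        return misclassified == 0
                ? best + " (" + total + ")"
                : best + " (" + total + "/" + misclassified + ")";
    }

    public static void main(String[] args) {
        // Hypothetical class labels of the training instances at one leaf.
        System.out.println(summarise(List.of("yes", "yes", "no", "yes")));
    }
}
```

A leaf with a large second number relative to the first is one where the tree's prediction is unreliable on the training data.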
Copyright © 2004-2006 [email protected]