DATA MINING
Desktop Survival Guide
by Graham Williams

Textual Summaries

We saw in an earlier chapter some of the R functions that help us get a basic picture of the scope and type of data in any dataset. These report the most basic information, including the number and names of columns and rows (for data frames), and a summary of the data values themselves. We illustrate this again with the wine dataset:

> load("wine.RData")
> dim(wine)
[1] 178  14
> nrow(wine)
[1] 178
> ncol(wine)
[1] 14
> colnames(wine)
 [1] "Type"            "Alcohol"         "Malic"           "Ash"
 [5] "Alcalinity"      "Magnesium"       "Phenols"         "Flavanoids"
 [9] "Nonflavanoids"   "Proanthocyanins" "Color"           "Hue"
[13] "Dilution"        "Proline"
> rownames(wine)
  [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12"
 [13] "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24"
[...]
[157] "157" "158" "159" "160" "161" "162" "163" "164" "165" "166" "167" "168"
[169] "169" "170" "171" "172" "173" "174" "175" "176" "177" "178"

This gives us an idea of the shape of the data. We are dealing with a relatively small dataset of 178 entities and 14 variables.
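
If wine.RData is not at hand, the same shape checks can be tried on any data frame. The sketch below assumes the iris dataset that ships with R as a stand-in; it is not part of the original example, and the variable names are only illustrative.

```r
# Assumption: iris (from R's built-in datasets package) stands in for wine.
data(iris)

shape <- dim(iris)            # integer vector: c(rows, columns)
n_entities <- nrow(iris)      # number of rows (entities)
n_variables <- ncol(iris)     # number of columns (variables)

cat("entities:", n_entities, " variables:", n_variables, "\n")
print(colnames(iris))         # names of the variables
```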

Next, we'd like to see what the data itself looks like. We can list the first few rows of the data using head:

> head(wine)
  Type Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoids
1    1   14.23  1.71 2.43       15.6       127    2.80       3.06          0.28
2    1   13.20  1.78 2.14       11.2       100    2.65       2.76          0.26
3    1   13.16  2.36 2.67       18.6       101    2.80       3.24          0.30
4    1   14.37  1.95 2.50       16.8       113    3.85       3.49          0.24
5    1   13.24  2.59 2.87       21.0       118    2.80       2.69          0.39
6    1   14.20  1.76 2.45       15.2       112    3.27       3.39          0.34
  Proanthocyanins Color  Hue Dilution Proline
1            2.29  5.64 1.04     3.92    1065
2            1.28  4.38 1.05     3.40    1050
3            2.81  5.68 1.03     3.17    1185
4            2.18  7.80 0.86     3.45    1480
5            1.82  4.32 1.04     2.93     735
6            1.97  6.75 1.05     2.85    1450

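By default head lists six rows. As a small sketch, again assuming the built-in iris data as a stand-in for wine, the n argument changes how many rows are listed, and tail does the same for the end of the data frame:

```r
# Assumption: iris (built into R) is used in place of wine.
data(iris)

first_rows <- head(iris, n = 3)   # first three rows
last_rows <- tail(iris, n = 3)    # last three rows, original row names kept

print(first_rows)
print(last_rows)
```

Looking at both ends of the data can reveal ordering effects, such as rows sorted by the class variable.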
Next we might look at the structure of the data using the str (structure) function. This provides a basic overview of both values and their data type:

> str(wine)
`data.frame':   178 obs. of  14 variables:
 $ Type           : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ Alcohol        : num  14.2 13.2 13.2 14.4 13.2 ...
 $ Malic          : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
 $ Ash            : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
 $ Alcalinity     : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
 $ Magnesium      : int  127 100 101 113 118 112 96 121 97 98 ...
 $ Phenols        : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
 $ Flavanoids     : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
 $ Nonflavanoids  : num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
 $ Proanthocyanins: num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
 $ Color          : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
 $ Hue            : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
 $ Dilution       : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
 $ Proline        : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

We are now starting to get an idea of what the data itself looks like. The categorical variable Type would appear to be something that we might want to model, that is, the output variable. The remaining variables are all numeric, a mixture of integers and real numbers.
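
The split into categorical and numeric variables can also be worked out programmatically rather than read off the str listing. The following is a minimal sketch, assuming the built-in iris data as a stand-in for wine; the variable names are illustrative only.

```r
# Assumption: iris (built into R) is used in place of wine.
data(iris)

# Class of each column, as a named character vector.
var_classes <- sapply(iris, class)
print(var_classes)

# Separate the categorical (factor) columns from the numeric ones.
categorical_vars <- names(iris)[sapply(iris, is.factor)]
numeric_vars <- names(iris)[sapply(iris, is.numeric)]

cat("categorical:", categorical_vars, "\n")
cat("numeric:", numeric_vars, "\n")
```

Applied to wine, the same two sapply calls would be expected to single out Type as the only factor, in line with the str output above.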

The final step in the first look at the data is to get a summary of each variable using summary:

> summary(wine)
 Type      Alcohol          Malic            Ash          Alcalinity
 1:59   Min.   :11.03   Min.   :0.740   Min.   :1.360   Min.   :10.60
 2:71   1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210   1st Qu.:17.20
 3:48   Median :13.05   Median :1.865   Median :2.360   Median :19.50
        Mean   :13.00   Mean   :2.336   Mean   :2.367   Mean   :19.49
        3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558   3rd Qu.:21.50
        Max.   :14.83   Max.   :5.800   Max.   :3.230   Max.   :30.00
[...]

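The same summary can be requested for a single variable. As a sketch, assuming the built-in iris data in place of wine, summary on a numeric column returns the six-number summary, while table counts the levels of a factor:

```r
# Assumption: iris (built into R) is used in place of wine.
data(iris)

# Six-number summary of one numeric variable.
sepal_summary <- summary(iris$Sepal.Length)
print(sepal_summary)

# Frequency of each level of the categorical variable.
species_counts <- table(iris$Species)
print(species_counts)
```

For wine, the equivalent calls would be summary(wine$Alcohol) and table(wine$Type).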