Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Textual Summaries

We saw in Chapter [*] some of the R functions that give us a basic picture of the scope and type of data in a dataset. The most basic information includes the number and names of the columns and rows (for data frames) and a summary of the data values themselves. We illustrate this again with the wine dataset (see Section [*]):

> load("wine.RData")
> dim(wine)
[1] 178  14
> nrow(wine)
[1] 178
> ncol(wine)
[1] 14
> colnames(wine)
 [1] "Type"            "Alcohol"         "Malic"           "Ash"            
 [5] "Alcalinity"      "Magnesium"       "Phenols"         "Flavanoids"     
 [9] "Nonflavanoids"   "Proanthocyanins" "Color"           "Hue"            
[13] "Dilution"        "Proline"        
> rownames(wine)
  [1] "1"   "2"   "3"   "4"   "5"   "6"   "7"   "8"   "9"   "10"  "11"  "12" 
 [13] "13"  "14"  "15"  "16"  "17"  "18"  "19"  "20"  "21"  "22"  "23"  "24" 
[...]
[157] "157" "158" "159" "160" "161" "162" "163" "164" "165" "166" "167" "168"
[169] "169" "170" "171" "172" "173" "174" "175" "176" "177" "178"

This gives us an idea of the shape of the data. We are dealing with a relatively small dataset of 178 entities and 14 variables.
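
If wine.RData is not to hand, the same shape functions can be tried on a small stand-in. The sketch below is only an illustration: wine6 is a hypothetical name for a data frame built from the first six rows, and four of the columns, of the head(wine) listing in this section.

```r
# Hypothetical stand-in for wine: six rows and four columns copied from
# the head(wine) listing in this section (not the full 178 x 14 dataset).
wine6 <- data.frame(
  Type      = factor(c(1, 1, 1, 1, 1, 1), levels = 1:3),
  Alcohol   = c(14.23, 13.20, 13.16, 14.37, 13.24, 14.20),
  Malic     = c(1.71, 1.78, 2.36, 1.95, 2.59, 1.76),
  Magnesium = c(127L, 100L, 101L, 113L, 118L, 112L)
)

# dim() reports rows first, then columns; nrow() and ncol() return the
# same two numbers individually.
shape <- dim(wine6)
print(shape)
print(nrow(wine6))
print(ncol(wine6))

# Column names come back as a character vector.
print(colnames(wine6))
```

Because dim returns both counts at once, it is usually the quickest check that a dataset loaded with the expected number of rows and columns.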

Next, we'd like to see what the data itself looks like. We can list the first few rows of the data using head:

> head(wine)
  Type Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoids
1    1   14.23  1.71 2.43       15.6       127    2.80       3.06          0.28
2    1   13.20  1.78 2.14       11.2       100    2.65       2.76          0.26
3    1   13.16  2.36 2.67       18.6       101    2.80       3.24          0.30
4    1   14.37  1.95 2.50       16.8       113    3.85       3.49          0.24
5    1   13.24  2.59 2.87       21.0       118    2.80       2.69          0.39
6    1   14.20  1.76 2.45       15.2       112    3.27       3.39          0.34
  Proanthocyanins Color  Hue Dilution Proline
1            2.29  5.64 1.04     3.92    1065
2            1.28  4.38 1.05     3.40    1050
3            2.81  5.68 1.03     3.17    1185
4            2.18  7.80 0.86     3.45    1480
5            1.82  4.32 1.04     2.93     735
6            1.97  6.75 1.05     2.85    1450
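
By default head lists six rows. As a minimal sketch, the n argument changes how many rows are listed, and the companion function tail lists rows from the end instead. The data frame wine6 is a hypothetical stand-in built from a few of the values shown above, not the full wine dataset.

```r
# Hypothetical stand-in: three columns from the six rows listed above.
wine6 <- data.frame(
  Type    = factor(c(1, 1, 1, 1, 1, 1), levels = 1:3),
  Alcohol = c(14.23, 13.20, 13.16, 14.37, 13.24, 14.20),
  Hue     = c(1.04, 1.05, 1.03, 0.86, 1.04, 1.05)
)

# Ask for the first three rows rather than the default six.
first3 <- head(wine6, n = 3)
print(first3)

# tail() works the same way from the other end of the data frame.
last2 <- tail(wine6, n = 2)
print(last2)
```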

Next we might look at the structure of the data using the str (structure) function. This gives a compact overview of each variable's data type together with its first few values:

> str(wine)
`data.frame':	178 obs. of  14 variables:
 $ Type           : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
 $ Alcohol        : num  14.2 13.2 13.2 14.4 13.2 ...
 $ Malic          : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
 $ Ash            : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
 $ Alcalinity     : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
 $ Magnesium      : int  127 100 101 113 118 112 96 121 97 98 ...
 $ Phenols        : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
 $ Flavanoids     : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
 $ Nonflavanoids  : num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
 $ Proanthocyanins: num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
 $ Color          : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
 $ Hue            : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
 $ Dilution       : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
 $ Proline        : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

We are now starting to get an idea of what the data itself looks like. The categorical variable Type appears to be something we might want to model, that is, the output variable. The remaining variables are all numeric, a mixture of integers and real numbers.
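
The split between the categorical variable and the numeric ones can also be obtained programmatically. The following is a sketch under assumptions: wine6 is a hypothetical stand-in built from a few of the values listed earlier, and the names types, factor_cols and numeric_cols are introduced only for illustration.

```r
# Hypothetical stand-in with one factor, two real-valued and one
# integer column, mirroring the types reported by str(wine).
wine6 <- data.frame(
  Type      = factor(c(1, 1, 1, 1, 1, 1), levels = 1:3),
  Alcohol   = c(14.23, 13.20, 13.16, 14.37, 13.24, 14.20),
  Malic     = c(1.71, 1.78, 2.36, 1.95, 2.59, 1.76),
  Magnesium = c(127L, 100L, 101L, 113L, 118L, 112L)
)

# The class of each column, as a named character vector.
types <- sapply(wine6, class)
print(types)

# Candidate output variables (factors) and numeric input variables.
# is.numeric() is TRUE for both integer and real-valued columns.
factor_cols  <- names(wine6)[sapply(wine6, is.factor)]
numeric_cols <- names(wine6)[sapply(wine6, is.numeric)]
print(factor_cols)
print(numeric_cols)
```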

The final step in the first look at the data is to get a summary of each variable using summary:

> summary(wine)
 Type      Alcohol          Malic            Ash          Alcalinity   
 1:59   Min.   :11.03   Min.   :0.740   Min.   :1.360   Min.   :10.60  
 2:71   1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210   1st Qu.:17.20  
 3:48   Median :13.05   Median :1.865   Median :2.360   Median :19.50  
        Mean   :13.00   Mean   :2.336   Mean   :2.367   Mean   :19.49  
        3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558   3rd Qu.:21.50  
        Max.   :14.83   Max.   :5.800   Max.   :3.230   Max.   :30.00 
[...]
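
The summary function can also be applied to a single variable, and table gives the per-level counts that summary reports for a factor. The sketch below uses a hypothetical stand-in, wine6, built from the six rows listed earlier, so its numbers differ from the full-dataset summary above.

```r
# Hypothetical stand-in: two columns from the six rows listed earlier.
wine6 <- data.frame(
  Type    = factor(c(1, 1, 1, 1, 1, 1), levels = 1:3),
  Alcohol = c(14.23, 13.20, 13.16, 14.37, 13.24, 14.20)
)

# For a numeric variable summary() gives the minimum, quartiles,
# median, mean and maximum.
s <- summary(wine6$Alcohol)
print(s)

# For a factor the counts per level are reported; table() returns the
# same counts. All six sample rows are of Type 1.
counts <- table(wine6$Type)
print(counts)
```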

Copyright © 2004-2006 [email protected]