DATA MINING
Desktop Survival Guide by Graham Williams
The Arrhythmia dataset will be used to illustrate issues with data cleaning.
The dataset is of moderate size (392Kb), with 452 entities. It has 280 variables, one being an output variable with 16 values. Some 40 of the input variables are categorical. Although a meta-data file on the repository lists the variables, we may not want to name them all just now (there are too many to do by hand). Instead we give meaningful names to a few and leave the rest with R's default names. As with other data from the UCI repository, ? is used for missing values, and we deal with that when we read the downloaded data into R.
> UCI <- "ftp://ftp.ics.uci.edu/pub"
> REPOS <- "ml-repos/machine-learning-databases"
> cardiac.url <- sprintf("%s/%s/arrhythmia/arrhythmia.data", UCI, REPOS)
> download.file(cardiac.url, "cardiac.data")
> cardiac <- read.csv("cardiac.data", header=F, na.strings="?")
> summary(cardiac)
       V1              V2               V3              V4
 Min.   : 0.00   Min.   :0.0000   Min.   :105.0   Min.   :  6.00
 1st Qu.:36.00   1st Qu.:0.0000   1st Qu.:160.0   1st Qu.: 59.00
 Median :47.00   Median :1.0000   Median :164.0   Median : 68.00
 Mean   :46.47   Mean   :0.5509   Mean   :166.2   Mean   : 68.17
 3rd Qu.:58.00   3rd Qu.:1.0000   3rd Qu.:170.0   3rd Qu.: 79.00
 Max.   :83.00   Max.   :1.0000   Max.   :780.0   Max.   :176.00
[...]
> str(cardiac)
`data.frame':   452 obs. of  280 variables:
 $ V1  : int  75 56 54 55 75 13 40 49 44 50 ...
 $ V2  : int  0 1 0 0 0 0 1 1 0 1 ...
 $ V3  : int  190 165 172 175 190 169 160 162 168 167 ...
 $ V4  : int  80 64 95 94 80 51 52 54 56 67 ...
 $ V5  : int  91 81 138 100 88 100 77 78 84 89 ...
 $ V6  : int  193 174 163 202 181 167 129 0 118 130 ...
 $ V7  : int  371 401 386 380 360 321 377 376 354 383 ...
 $ V8  : int  174 149 185 179 177 174 133 157 160 156 ...
 $ V9  : int  121 39 102 143 103 91 77 70 63 73 ...
 $ V10 : int  -16 25 96 28 -16 107 77 67 61 85 ...
[...]
 $ V278: num  23.3 20.4 12.3 34.6 25.4 13.5 14.3 15.8 12.5 20.1 ...
 $ V279: num  49.4 38.8 49 61.6 62.8 31.1 20.5 19.8 30.9 25.1 ...
 $ V280: int  8 6 10 1 7 14 1 1 1 10 ...
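To see the effect of the na.strings argument in isolation, the following minimal sketch reads a few made-up rows from an in-memory character vector rather than from a downloaded file. The name sample.rows and its values are invented for illustration and are not taken from the arrhythmia data.

```r
# A few made-up rows in the UCI style: no header, "?" marks a missing value.
sample.rows <- c("75,0,190,80",
                 "56,1,?,64",
                 "54,0,172,?")

# Without na.strings, a column containing "?" cannot be parsed as a number,
# so it is read as text.
raw <- read.csv(text=sample.rows, header=FALSE, stringsAsFactors=FALSE)

# With na.strings="?", each "?" becomes NA and the column stays numeric.
clean <- read.csv(text=sample.rows, header=FALSE, na.strings="?")

class(raw$V3)          # text column, because of the "?"
class(clean$V3)        # integer column with an NA
colSums(is.na(clean))  # count of missing values in each column
```

Counting the missing values per column in this way is a reasonable first check after reading any file that uses a special missing-value marker.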
We will now give names to a few columns, then save the data to a cleaner CSV file and to a binary RData file. In both, ? will be NA and every column will have a name: some that we have given, and the rest as assigned by R.
> colnames(cardiac)[1:4] <- c("Age", "Gender", "Height", "Weight")
> write.table(cardiac, "cardiac.csv", sep=",", row.names=F)
> save(cardiac, file="cardiac.RData", compress=TRUE)
> dim(cardiac)
[1] 452 280
> str(cardiac)
`data.frame':   452 obs. of  280 variables:
 $ Age   : int  75 56 54 55 75 13 40 49 44 50 ...
 $ Gender: int  0 1 0 0 0 0 1 1 0 1 ...
 $ Height: int  190 165 172 175 190 169 160 162 168 167 ...
 $ Weight: int  80 64 95 94 80 51 52 54 56 67 ...
 $ V5    : int  91 81 138 100 88 100 77 78 84 89 ...
[...]
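As a quick check that both saved forms keep the column names and the NA values, the sketch below round-trips a small made-up data frame through temporary files. The data frame toy and the temporary file names are illustrative assumptions; with the real data one would use cardiac, cardiac.csv and cardiac.RData instead.

```r
# A small made-up data frame standing in for cardiac, with one missing value.
toy <- data.frame(V1=c(75L, 56L, 54L),
                  V2=c(0L, 1L, 0L),
                  V3=c(190L, NA, 172L))
colnames(toy)[1:3] <- c("Age", "Gender", "Height")

# Temporary files, so that nothing is written to the working directory.
csv.file <- tempfile(fileext=".csv")
rdata.file <- tempfile(fileext=".RData")

# Save in both forms, using the same calls as the session above.
write.table(toy, csv.file, sep=",", row.names=FALSE)
save(toy, file=rdata.file, compress=TRUE)

# write.table stores a missing value as NA, which read.csv treats as
# missing by default, so na.strings is no longer needed.
from.csv <- read.csv(csv.file)

# load() restores the object under its saved name; a fresh environment
# keeps it from overwriting the toy object already in the workspace.
env <- new.env()
load(rdata.file, envir=env)

identical(colnames(from.csv), colnames(toy))  # do the names survive the CSV?
identical(env$toy, toy)                       # is the RData copy identical?
sum(is.na(from.csv$Height))                   # missing values read back as NA
```

The CSV form is convenient for sharing with other tools, while the RData form preserves the R object as it was saved, including column types.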
Copyright © 2004-2006 [email protected]