Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Remove Non-Numeric Columns

We might only be interested in the numeric data, so we remove all columns that are not numeric from a dataset. We can use the survey dataset to illustrate this. First load the dataset and have a look at the column names and their types. We use the lapply function to apply the class function to each column of the data frame.



> load("survey.RData")
> colnames(survey)
 [1] "Age"            "Workclass"      "fnlwgt"         "Education"     
 [5] "Education.Num"  "Marital.Status" "Occupation"     "Relationship"  
 [9] "Race"           "Sex"            "Capital.Gain"   "Capital.Loss"  
[13] "Hours.Per.Week" "Native.Country" "Salary.Group"  
> lapply(survey, class)
$Age
[1] "integer"

$Workclass
[1] "factor"

$fnlwgt
[1] "integer"

$Education
[1] "factor"

$Education.Num
[1] "integer"

$Marital.Status
[1] "factor"

$Occupation
[1] "factor"

$Relationship
[1] "factor"

$Race
[1] "factor"

$Sex
[1] "factor"

$Capital.Gain
[1] "integer"

$Capital.Loss
[1] "integer"

$Hours.Per.Week
[1] "integer"

$Native.Country
[1] "factor"

$Salary.Group
[1] "factor"
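
For a more compact listing, sapply can be used in place of lapply. The following is only a sketch on a small made-up data frame (toy and its columns are illustrative, not part of the survey dataset). Note that if any column has more than one class (an ordered factor, for example), sapply falls back to returning a list.

```r
# Toy data frame standing in for survey (hypothetical columns).
toy <- data.frame(Age   = c(39L, 50L, 38L),
                  Sex   = factor(c("Male", "Male", "Female")),
                  Hours = c(40L, 13L, 40L))

# sapply simplifies the per-column results into a named character vector,
# one entry per column, instead of the list that lapply returns.
classes <- sapply(toy, class)
print(classes)
```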

We can now use is.numeric to select the numeric columns and store the result in a new dataset. Here sapply applies is.numeric to each column and returns a logical vector, which is then used to index the columns:

> survey.numeric <- survey[,sapply(survey, is.numeric)]
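
One thing to watch for with this kind of indexing: if only a single column is selected, the result is simplified to a plain vector rather than a data frame. The sketch below, on a made-up data frame (toy is hypothetical), shows two ways to keep a data frame in that case.

```r
# Toy data frame with just one numeric column (hypothetical data).
toy <- data.frame(Age  = c(39L, 50L, 38L),
                  Sex  = factor(c("Male", "Male", "Female")),
                  Race = factor(c("White", "White", "Black")))

num <- sapply(toy, is.numeric)   # TRUE for Age only

# Matrix-style indexing drops a single column to a plain vector by default.
vec <- toy[, num]

# drop = FALSE keeps the result as a one-column data frame.
frm <- toy[, num, drop = FALSE]

# List-style indexing (no comma) also returns a data frame.
frm2 <- toy[num]
```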

Alternatively, you could build a list of the columns to remove and then remove them from the dataset itself, rather than creating a second dataset that needs extra storage.

First build a vector of the indices of the columns to remove, and reverse it, since after we remove a column all the columns to its right shift left and their indices decrease by one. We use sapply to identify the numeric columns (those for which is.numeric is true) and negate the result to pick out the non-numeric ones.



> rmcols <- rev(seq(1,ncol(survey))[!as.logical(sapply(survey, is.numeric))])
> rmcols
[1] 15 14 10  9  8  7  6  4  2
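
To see why the order matters, the following sketch deletes columns from a small made-up data frame (toy and its column names are hypothetical) in both ascending and descending index order. Assigning NULL to a column deletes it.

```r
# Toy data frame: two numeric columns (n1, n2) around two factors (f1, f2).
toy <- data.frame(n1 = 1:3,
                  f1 = factor(c("x", "y", "z")),
                  f2 = factor(c("p", "q", "r")),
                  n2 = c(0.5, 1.5, 2.5))

nonnum <- which(!sapply(toy, is.numeric))   # positions 2 and 3

# Ascending order: once column 2 (f1) is gone, f2 moves to position 2 and
# n2 to position 3, so the second deletion removes n2 instead of f2.
forward <- toy
for (i in nonnum) forward[[i]] <- NULL

# Descending order: deleting from the right leaves earlier positions alone.
backward <- toy
for (i in rev(nonnum)) backward[[i]] <- NULL

print(colnames(forward))    # "n1" "f2"  (wrong: a factor survived)
print(colnames(backward))   # "n1" "n2"  (only numeric columns remain)
```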

Now remove the columns from the dataset by setting each one to NULL.



> for (i in rmcols) survey[[i]] <- NULL
> colnames(survey)
[1] "Age"            "fnlwgt"         "Education.Num"  "Capital.Gain"  
[5] "Capital.Loss"   "Hours.Per.Week"
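
A variation worth considering is to delete by column name rather than by position, since names do not shift as columns are removed and so no reversal is needed. This is only a sketch on a made-up data frame (toy and rmnames are hypothetical names).

```r
# Toy data frame standing in for survey (hypothetical columns).
toy <- data.frame(Age   = c(39L, 50L, 38L),
                  Sex   = factor(c("Male", "Male", "Female")),
                  Hours = c(40L, 13L, 40L),
                  Race  = factor(c("White", "White", "Black")))

# Names of the non-numeric columns; the order does not matter here.
rmnames <- colnames(toy)[!sapply(toy, is.numeric)]

# Delete each named column by setting it to NULL.
for (nm in rmnames) toy[[nm]] <- NULL

print(colnames(toy))   # "Age" "Hours"
```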

This same process can be used to remove or retain columns of any type, simply by using the appropriate R function: e.g., is.factor, is.logical, is.integer, or is.numeric.
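
As an illustration, the sketch below applies three of these predicates to a small made-up data frame (toy is hypothetical). It also shows that is.numeric is true for both integer and double columns, whereas is.integer is true only for integer columns, and that neither is true for factors.

```r
# Toy data frame with one column of each kind (hypothetical data).
toy <- data.frame(Age     = c(39L, 50L, 38L),        # integer
                  Rate    = c(0.5, 1.5, 2.5),        # double
                  Sex     = factor(c("Male", "Male", "Female")),
                  Citizen = c(TRUE, FALSE, TRUE))    # logical

# drop = FALSE keeps a data frame even when one column is selected.
factors  <- toy[, sapply(toy, is.factor),  drop = FALSE]
integers <- toy[, sapply(toy, is.integer), drop = FALSE]
numerics <- toy[, sapply(toy, is.numeric), drop = FALSE]

print(colnames(factors))    # "Sex"
print(colnames(integers))   # "Age"
print(colnames(numerics))   # "Age" "Rate"
```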

Copyright © 2004-2006 [email protected]