Large datasets often present challenges for R on memory-limited machines. While you may be able to load a large dataset, processing or modelling it may then fail with an error indicating that memory could not be allocated.
To maximise R's capabilities on large datasets, be sure to run a 64-bit operating system (e.g., Debian GNU/Linux, http://www.togaware.com/linux/survivor) on 64-bit hardware (e.g., AMD64, http://en.wikipedia.org/wiki/AMD64) with plenty of RAM (e.g., 16GB). Such capable machines are quite affordable.
Selecting and subsetting the required data within a database (e.g., through the RODBC package) or through other means (e.g., using Python) will generally be faster than loading the full dataset into R and subsetting it there.
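As a minimal sketch of this approach using RODBC (the DSN name, table, columns, and condition here are placeholders for illustration, not from the book):

library(RODBC)
channel <- odbcConnect("myDSN")                # "myDSN" is an assumed ODBC data source
ds <- sqlQuery(channel,
               "SELECT id, age, income
                FROM customers
                WHERE income > 50000")         # retrieve only the rows and columns needed
odbcClose(channel)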
The amount of memory currently in use by, and allocated to, the R process is given by the memory.size function. On MS/Windows you may also need to increase the memory available to R using the command-line flag --max-mem-size.
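For example, a sketch of starting R on MS/Windows with a larger memory allocation (check the exact option syntax and limits against your version of R):

Rgui.exe --max-mem-size=2047M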
The memory.size example below indicates that some 470MB is in use and altogether about 1GB has been allocated.
> memory.size()        # Current memory in use: 470MB
[1] 477706008
> memory.size(TRUE)    # Current memory allocated: 1GB
[1] 1050681344
The memory limit currently in force in R is reported by the memory.limit function, which can also be used to set the limit.
> memory.limit()             # Current memory limit: 1GB
[1] 1073741824
> memory.limit(2073741824)   # New memory limit: 2GB
NULL
> memory.limit()
[1] 2684354560
A suggested process is to work with a subset of the data loaded into memory, choosing a dataset small enough to make this viable. Explore the data, explore the choice of models, and prototype the final analysis using this smaller dataset. For the final full analysis you may then need plenty of RAM and to allow R to run overnight.
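One simple way to obtain such a subset, assuming the full dataset has already been loaded as ds, is to take a random sample of the rows (the sample size here is purely illustrative):

set.seed(42)                                # for a repeatable sample
ds.small <- ds[sample(nrow(ds), 10000), ]   # work with 10,000 rows for exploration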
As a rough guide, a data frame of 150,000 rows and some 55 columns will require about 500MB of RAM.
Also, note the difference between data frames and arrays/matrices. For example, rbind'ing data frames is much more expensive than rbind'ing arrays/matrices, as the sketch below suggests. However, an array/matrix must hold data of a single data type, whereas a data frame can have different data types in different columns. A number of functions are written to handle either data frames or matrices (e.g., rpart) and it is best, where possible, to supply a matrix in these cases. The coercion back to a data frame can always be done afterwards.
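A rough sketch of the cost difference (the sizes and repetition counts are arbitrary) compares the time to repeatedly rbind a row onto a matrix and onto the equivalent data frame:

m  <- matrix(runif(1e5), ncol=10)    # a 10,000 x 10 numeric matrix
df <- as.data.frame(m)               # the same data as a data frame

system.time(for (i in 1:200) m2  <- rbind(m, m[1, ]))    # matrix rbind
system.time(for (i in 1:200) df2 <- rbind(df, df[1, ]))  # data frame rbind: much slower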
Note that to convert a data frame to a matrix you can use as.matrix (if the data frame contains any non-numeric columns, the result will be a character matrix):

> m <- as.matrix(dframe)
To obtain an estimate of the amount of memory being used by an object
in R use the object.size function:
> object.size(ds)    # Object ds is using 181MB
[1] 181694428
The following function can be used to explore memory requirements:
sizes <- function(rows, cols=1)
{
  # Estimate the per-cell storage by measuring a known numeric vector.
  testListLength <- 1000
  cellSize <- object.size(seq(0.5, testListLength/2, 0.5))/testListLength
  cells <- rows * cols
  required <- cells * cellSize
  if (required > 1e12)
    result <- sprintf("%dTB", required %/% 1e12)
  else if (required > 1e9)
    result <- sprintf("%dGB", required %/% 1e9)
  else if (required > 1e6)
    result <- sprintf("%dMB", required %/% 1e6)
  else if (required > 1e3)
    result <- sprintf("%dKB", required %/% 1e3)
  else
    result <- sprintf("%dBytes", required)
  return(result)
}
For example, on a 32-bit machine, a one million row dataset with 400 columns might require about 3GB of memory!
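Using the function above, that figure can be checked directly (the exact value returned will vary slightly with the platform's per-object storage overheads):

> sizes(1000000, 400)
[1] "3GB"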