Togaware DATA MINING
Desktop Survival Guide
by Graham Williams

Normalising Data

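R's scale function re-centers and re-scales the columns of a numeric matrix. Re-centering subtracts each column's mean from every value in that column. Re-scaling then divides each centered value by the column's standard deviation. The root-mean-square is used as the divisor only when centering is turned off (center=FALSE). The result is a new matrix that records the centers and divisors it used in the attributes scaled:center and scaled:scale; the original data is left unchanged.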
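The difference between the two divisors can be checked on a small example. The following is a minimal sketch using a made-up matrix x (not the wine data); it compares the scaled:scale attribute against the column standard deviations when centering is on, and against the root-mean-square, computed here as sqrt(sum(x^2)/(n-1)), when centering is off.

```r
# Toy matrix, invented for illustration only.
x <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), ncol = 2)

# Default call: subtract the column means, then divide by the column
# standard deviations.
z <- scale(x)
center_ok <- isTRUE(all.equal(attr(z, "scaled:center"), colMeans(x)))
sd_ok     <- isTRUE(all.equal(attr(z, "scaled:scale"), apply(x, 2, sd)))

# Without centering, the divisor is the root-mean-square of each column,
# using n - 1 in the denominator.
r   <- scale(x, center = FALSE)
rms <- sqrt(colSums(x^2) / (nrow(x) - 1))
rms_ok <- isTRUE(all.equal(attr(r, "scaled:scale"), rms))

print(c(center = center_ok, sd = sd_ok, rms = rms_ok))
```

Once a column has been centered, its root-mean-square (with n - 1 in the denominator) is the same number as its standard deviation, which is why the two descriptions are easy to confuse.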



> ds <- wine[1:20,c(2,9,14)]
> summary(ds)
    Alcohol      Nonflavanoids       Proline    
 Min.   :13.16   Min.   :0.1700   Min.   : 735  
 1st Qu.:13.72   1st Qu.:0.2600   1st Qu.:1061  
 Median :14.11   Median :0.2950   Median :1280  
 Mean   :14.01   Mean   :0.2970   Mean   :1235  
 3rd Qu.:14.32   3rd Qu.:0.3225   3rd Qu.:1352  
 Max.   :14.83   Max.   :0.4300   Max.   :1680  
> ds
   Alcohol Nonflavanoids Proline
1    14.23          0.28    1065
2    13.20          0.26    1050
3    13.16          0.30    1185
4    14.37          0.24    1480
5    13.24          0.39     735
6    14.20          0.34    1450
7    14.39          0.30    1290
8    14.06          0.31    1295
9    14.83          0.29    1045
10   13.86          0.22    1045
11   14.10          0.22    1510
12   14.12          0.26    1280
13   13.75          0.29    1320
14   14.75          0.43    1150
15   14.38          0.29    1547
16   13.63          0.30    1310
17   14.30          0.33    1280
18   13.83          0.40    1130
19   14.19          0.32    1680
20   13.64          0.17     845
> scale(ds)
      Alcohol Nonflavanoids    Proline
1   0.4630901   -0.27054355 -0.7184008
2  -1.7198976   -0.58883009 -0.7819386
3  -1.8046738    0.04774298 -0.2100983
4   0.7598069   -0.90711662  1.0394785
5  -1.6351214    1.48003239 -2.1162325
6   0.3995079    0.68431605  0.9124029
7   0.8021950    0.04774298  0.2346663
8   0.1027912    0.20688625  0.2558456
9   1.7347334   -0.11140029 -0.8031179
10 -0.3210899   -1.22540316 -0.8031179
11  0.1875674   -1.22540316  1.1665541
12  0.2299555   -0.58883009  0.1923078
13 -0.5542245   -0.11140029  0.3617419
14  1.5651810    2.11660546 -0.3583532
15  0.7810009   -0.11140029  1.3232807
16 -0.8085532    0.04774298  0.3193834
17  0.6114485    0.52517278  0.1923078
18 -0.3846721    1.63917565 -0.4430703
19  0.3783139    0.36602952  1.8866493
20 -0.7873591   -2.02111950 -1.6502886
attr(,"scaled:center")
      Alcohol Nonflavanoids       Proline 
      14.0115        0.2970     1234.6000 
attr(,"scaled:scale")
      Alcohol Nonflavanoids       Proline 
   0.47183042    0.06283646  236.07991510 
> ds
   Alcohol Nonflavanoids Proline
1    14.23          0.28    1065
2    13.20          0.26    1050
3    13.16          0.30    1185
4    14.37          0.24    1480
5    13.24          0.39     735
6    14.20          0.34    1450
7    14.39          0.30    1290
8    14.06          0.31    1295
9    14.83          0.29    1045
10   13.86          0.22    1045
11   14.10          0.22    1510
12   14.12          0.26    1280
13   13.75          0.29    1320
14   14.75          0.43    1150
15   14.38          0.29    1547
16   13.63          0.30    1310
17   14.30          0.33    1280
18   13.83          0.40    1130
19   14.19          0.32    1680
20   13.64          0.17     845
> summary(scale(ds))
    Alcohol           Nonflavanoids           Proline          
 Min.   :-1.805e+00   Min.   :-2.021e+00   Min.   :-2.116e+00  
 1st Qu.:-6.125e-01   1st Qu.:-5.888e-01   1st Qu.:-7.343e-01  
 Median : 2.088e-01   Median :-3.183e-02   Median : 1.923e-01  
 Mean   :-3.381e-15   Mean   :-6.217e-16   Mean   : 3.886e-16  
 3rd Qu.: 6.485e-01   3rd Qu.: 4.058e-01   3rd Qu.: 4.994e-01  
 Max.   : 1.735e+00   Max.   : 2.117e+00   Max.   : 1.887e+00
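
The scaled:center and scaled:scale attributes shown in the output above make the transformation reversible. As a rough sketch, the code below rebuilds ds by typing in the 20 rows listed above (so it runs without the wine dataset) and then undoes the scaling with sweep. The variable names center, spread and restored are chosen here for illustration.

```r
# The 20 rows of ds as printed in the listing, entered by hand.
ds <- data.frame(
  Alcohol = c(14.23, 13.20, 13.16, 14.37, 13.24, 14.20, 14.39, 14.06, 14.83, 13.86,
              14.10, 14.12, 13.75, 14.75, 14.38, 13.63, 14.30, 13.83, 14.19, 13.64),
  Nonflavanoids = c(0.28, 0.26, 0.30, 0.24, 0.39, 0.34, 0.30, 0.31, 0.29, 0.22,
                    0.22, 0.26, 0.29, 0.43, 0.29, 0.30, 0.33, 0.40, 0.32, 0.17),
  Proline = c(1065, 1050, 1185, 1480, 735, 1450, 1290, 1295, 1045, 1045,
              1510, 1280, 1320, 1150, 1547, 1310, 1280, 1130, 1680, 845)
)

z      <- scale(ds)
center <- attr(z, "scaled:center")  # column means that were subtracted
spread <- attr(z, "scaled:scale")   # column standard deviations used as divisors

# Undo the scaling: multiply each column by its divisor, then add the mean back.
restored <- sweep(sweep(z, 2, spread, "*"), 2, center, "+")

# Compare values only; the attributes of the two matrices differ.
print(all.equal(as.matrix(ds), restored, check.attributes = FALSE))
```

The same two attributes can be passed as the center and scale arguments of scale to apply an identical transformation to further rows, for example scale(newdata, center = center, scale = spread), where newdata is a hypothetical data frame with the same three columns.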

The rescaler function from Hadley Wickham's reshape package supports five methods for rescaling or standardising data: rescale to the range [0, 1]; subtract the mean and divide by the standard deviation; subtract the median and divide by the median absolute deviation; convert the values to ranks; and do nothing.
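
To make the five methods concrete without requiring the reshape package, here is a minimal base-R sketch. The function rescale_sketch and its type labels are hypothetical names invented for this illustration and are not the reshape API. The robust variant assumes R's mad with its default consistency constant, which may differ from what a particular package uses.

```r
# Hypothetical helper (not part of reshape): base-R versions of the five methods.
rescale_sketch <- function(x, type = c("sd", "range", "robust", "rank", "none")) {
  type <- match.arg(type)
  switch(type,
    sd     = (x - mean(x)) / sd(x),              # subtract mean, divide by std. dev.
    range  = (x - min(x)) / (max(x) - min(x)),   # rescale to [0, 1]
    robust = (x - median(x)) / mad(x),           # subtract median, divide by MAD
    rank   = rank(x),                            # convert values to ranks
    none   = x                                   # do nothing
  )
}

# First five Proline values from the listing above.
proline <- c(1065, 1050, 1185, 1480, 735)

print(range(rescale_sketch(proline, "range")))
print(rescale_sketch(proline, "rank"))
print(rescale_sketch(proline, "robust"))
```

The sketch does not handle missing values or constant columns (where the divisor would be zero); a production version would need to decide how to treat both.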



Copyright © 2004-2006 [email protected]