 DATA MINING
Desktop Survival Guide
by Graham Williams

## Graphing Means and Error Bars

The simplest plot of means is achieved using the plotmeans function of the gplots package. The example uses the wine dataset, aggregating the data into the three classes defined by Type and plotting the mean of the value for Phenols and Magnesium for each class.

[Figure: rplot-line-means]

```
library("gplots")
load("wine.Rdata")
attach(wine)
pdf("graphics/rplot-line-means.pdf")
par(mfrow=c(1,2))
plotmeans(Magnesium ~ Type)
plotmeans(Phenols ~ Type)
dev.off()
```

http://rattle.togaware.com/code/rplot-line-means.R

Both plots are placed onto the one plotting canvas (using par(mfrow=c(1,2))). They are placed side-by-side, which exaggerates the bars around the means. A visual inspection indicates that the three groups have quite different means for Magnesium and for Phenols, and that the separation between the groups is more pronounced for Phenols.
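The quantities behind such a plot can be sketched in base R. The sketch below assumes that the bars drawn by plotmeans are t-based 95% confidence intervals around each group mean (our understanding of the gplots default, not something stated above). Because wine.Rdata may not be at hand, the built-in iris dataset is used as a stand-in, and the names `grp`, `y` and `ci` are purely illustrative.

```r
# Stand-in data (an assumption): iris replaces wine so that the sketch
# runs without wine.Rdata. Species plays the role of Type.
grp <- iris$Species
y   <- iris$Sepal.Length

groups <- split(y, grp)        # one numeric vector per class
n <- sapply(groups, length)    # group sizes
m <- sapply(groups, mean)      # group means
s <- sapply(groups, sd)        # group standard deviations

# Half-width of a t-based 95% confidence interval for each group mean.
half <- qt(0.975, df = n - 1) * s / sqrt(n)

ci <- data.frame(n = n, mean = m, lower = m - half, upper = m + half)
print(ci)
```

With the wine data loaded, replacing `iris$Species` with `wine$Type` and `iris$Sepal.Length` with `wine$Phenols` should give the numbers corresponding to the second plot, under the same assumption about the bars.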

We can evaluate this statistically using R. Comparing the means between different subsets of a dataset is called [analysis of variance](http://en.wikipedia.org/wiki/analysis_of_variance) or [ANOVA](http://en.wikipedia.org/wiki/ANOVA). Here we compare the means of Magnesium, and, separately, the means of Phenols across the Types.

```
> anova(lm(Phenols ~ Type))
Analysis of Variance Table

Response: Phenols
           Df Sum Sq Mean Sq F value    Pr(>F)
Type        2 35.857  17.928  93.733 < 2.2e-16 ***
Residuals 175 33.472   0.191
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

> anova(lm(Magnesium ~ Type))
Analysis of Variance Table

Response: Magnesium
           Df  Sum Sq Mean Sq F value    Pr(>F)
Type        2  4491.0  2245.5  12.430 8.963e-06 ***
Residuals 175 31615.1   180.7
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

In both cases the Pr(>F) value is well below 0.05, so at the 95% confidence level we conclude that the group means are not all equal.
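To see where the F value and Pr(>F) columns come from, the arithmetic can be redone by hand from the Df and Sum Sq entries printed above. This is only a sketch: the inputs are the rounded figures as displayed, so the results agree with the tables only up to rounding, and `f_stat` is an illustrative helper rather than anything from R itself.

```r
# F statistic: mean square of the grouping term over the residual mean square.
# f_stat is a hypothetical helper introduced for this illustration.
f_stat <- function(ss_group, df_group, ss_resid, df_resid) {
  (ss_group / df_group) / (ss_resid / df_resid)
}

# Entries copied from the two ANOVA tables (rounded as displayed).
f_phenols   <- f_stat(35.857, 2, 33.472, 175)
f_magnesium <- f_stat(4491.0, 2, 31615.1, 175)

# Pr(>F) is the upper-tail probability of the F distribution.
p_phenols   <- pf(f_phenols,   df1 = 2, df2 = 175, lower.tail = FALSE)
p_magnesium <- pf(f_magnesium, df1 = 2, df2 = 175, lower.tail = FALSE)

print(c(F = f_phenols,   p = p_phenols))
print(c(F = f_magnesium, p = p_magnesium))
```

The value for Phenols is so small that R reports it only as being below 2.2e-16 in the printed table.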

If, however, we look at just Types 2 and 3 and compare the means of the two groups, the picture changes:

```
> wine23 <- wine[Type!=1,]
> attach(wine23)
> anova(lm(Magnesium ~ Type))
Analysis of Variance Table

Response: Magnesium
           Df  Sum Sq Mean Sq F value  Pr(>F)
Type        1   649.8   649.8  3.0141 0.08518 .
Residuals 117 25221.9   215.6
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

With a Pr(>F) of 0.08518, which is larger than 0.05, the means for Magnesium across these two groups are not significantly different at the 95% level. The difference is, however, significant at the 90% level of confidence, as indicated by the period following the number in the output, which the legend below the table associates with 0.1 (10%).
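The legend can be read as a lookup from a p-value to a symbol. The following minimal sketch reproduces that lookup for the two Magnesium results reported above. The helper `signif_code` is hypothetical, written only for this illustration, and we assume each interval in the legend is closed on the right (so that exactly 0.05 would receive a single star).

```r
# Map p-values to the symbols shown in the "Signif. codes" legend.
# signif_code is a hypothetical helper, not part of R.
signif_code <- function(p) {
  as.character(cut(p,
                   breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                   labels = c("***", "**", "*", ".", " "),
                   include.lowest = TRUE))
}

# Pr(>F) for Magnesium: all three Types, then Types 2 and 3 only.
p_values <- c(all_types = 8.963e-06, types_2_3 = 0.08518)
codes <- signif_code(p_values)

print(data.frame(p = unname(p_values), code = codes,
                 row.names = names(p_values)))
```

The first value falls in the interval from 0 to 0.001 and so earns three stars, while 0.08518 lies between 0.05 and 0.1 and so earns only the period.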