Cluster analysis can be used to find clusters that are most
interesting according to some criteria. For example, we might
cluster the spam7 data of the DAAG package (without using yesno in the
clustering) and then score the clusters depending on the proportion of
yes cases within the cluster. The following R code will build K
clusters (user specified) and return a score for each cluster.
# Some ideas here from Felix Andrews
kmeans.scores <- function(x, centers, cases)
{
  clust <- kmeans(x, centers)
  # Iterate over each cluster to generate the scores
  scores <- c()
  for (i in 1:centers)
  {
    # Count number of TRUE cases in the cluster
    # as the proportion of the cluster size
    scores[i] <- sum( cases[clust$cluster == i] == TRUE ) / clust$size[i]
  }
  # Add the scores as another element to the kmeans list
  clust$scores <- scores
  return(clust)
}
We can now run this on our data with:
> library(DAAG)
> clust <- kmeans.scores(spam7[,1:6], centers=10, spam7["yesno"]=="y")
> clust$scores
 [1] 0.7037037 0.1970109 0.5995763 0.7656250 0.8043478 1.0000000 0.4911628
 [8] 0.7446809 0.6086957 0.6043956
> clust$size
 [1]  162 2208  472  128   46    5 1075   47  276  182
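Thus cluster 5, with 46 members, has a high proportion of positive
cases (about 80%) and may be a cluster we are interested in exploring
further. Clusters 4, 8, and 1, each with a score above 0.7, are also
probably worth exploring. Cluster 6 has a score of 1 but only 5
members. Because kmeans starts from randomly chosen centers, the
cluster numbering and scores can differ from run to run.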
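To make this comparison less ad hoc, the scores and sizes can be put into one table and sorted. The following is only a minimal sketch and not part of the original analysis: the two vectors are copied from the output above, and the minimum cluster size of 30 (min.size) is an arbitrary threshold assumed purely for illustration.

```r
# Scores and sizes copied from the output shown above
scores <- c(0.7037037, 0.1970109, 0.5995763, 0.7656250, 0.8043478,
            1.0000000, 0.4911628, 0.7446809, 0.6086957, 0.6043956)
sizes  <- c(162, 2208, 472, 128, 46, 5, 1075, 47, 276, 182)

ranked <- data.frame(cluster = seq_along(scores), size = sizes, score = scores)

# Assumption: ignore clusters with fewer than 30 members (arbitrary cutoff)
min.size <- 30
ranked <- ranked[ranked$size >= min.size, ]

# Sort by score, highest first, and show the top four clusters
ranked <- ranked[order(ranked$score, decreasing = TRUE), ]
print(head(ranked, 4))
```

With these values the top four rows are clusters 5, 4, 8, and 1, and cluster 6 is dropped by the size filter. A different cutoff would give a different shortlist.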
Now that we have built some clusters we can generate some rules that
describe the clusters:
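library(rpart)
hotspots <- function(x, cluster, cases)
{
  # Overall proportion of positive cases
  overall <- sum(cases) / length(cases)
  x.clusters <- cbind(x, cluster)
  # Build a classification tree that predicts the cluster from the inputs
  tree <- rpart(cluster ~ ., data = x.clusters, method = "class")
  # tree <- prune(tree, cp = 0.06)
  nodes <- rownames(tree$frame)
  # Each path from the root to a node is a rule describing that node
  paths <- path.rpart(tree, nodes = nodes)
  # TO BE CONTINUED (as written, the function returns paths invisibly)
}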
And to use it:
> h <- hotspots(spam7[,1:6], clust$cluster, spam7["yesno"]=="y")
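Since the function above is unfinished, it may help to see what path.rpart returns and how its result could be turned into readable rules. The sketch below is one possible illustration and not the author's continuation. It assumes the built-in iris data as a stand-in for spam7 (so that it runs without the DAAG package), uses 3 clusters, and keeps only the leaf nodes of the tree.

```r
library(rpart)

set.seed(42)  # kmeans starts from random centers; fix the seed for repeatability

# Stand-in data: the four numeric columns of iris (not the spam7 data)
x <- iris[, 1:4]
cluster <- kmeans(x, centers = 3)$cluster

# Fit a classification tree that predicts cluster membership
x.clusters <- cbind(x, cluster = factor(cluster))
tree <- rpart(cluster ~ ., data = x.clusters, method = "class")

# Keep only the leaf nodes; each leaf corresponds to one rule
leaves <- rownames(tree$frame)[tree$frame$var == "<leaf>"]
paths <- path.rpart(tree, nodes = as.numeric(leaves), print.it = FALSE)

# Drop the leading "root" entry and join the remaining conditions
rules <- sapply(paths, function(p) paste(p[-1], collapse = " & "))
print(rules)
```

Each element of rules is named by its node number in the tree and holds the conjunction of split conditions leading to that leaf. Scoring each rule by its proportion of positive cases would be a natural next step, but that is left open here.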
Copyright © 2004-2006 [email protected]