DATA MINING
Desktop Survival Guide by Graham Williams
A simple example from e-commerce is that of an on-line retailer of DVDs, maintaining a database of all purchases made by each customer. (They will also, of course, have web log data about what the customers browsed.) The retailer might be interested to know which DVDs regularly appear together in purchases, and then use this information to make recommendations to other customers.
The input data consists of "transactions" like the following. Each line records the purchase history of one customer, with purchases separated by commas (i.e., CSV format):
Sixth Sense,LOTR1,Harry Potter1,Green Mile,LOTR2
Gladiator,Patriot,Braveheart
LOTR1,LOTR2
Gladiator,Patriot,Sixth Sense
Gladiator,Patriot,Sixth Sense
Gladiator,Patriot,Sixth Sense
Harry Potter1,Harry Potter2
Gladiator,Patriot
Gladiator,Patriot,Sixth Sense
Sixth Sense,LOTR,Galdiator,Green Mile
This data might be stored in the file DVD.csv which can be directly loaded into R using the read.transactions function of the arules package:
> library(arules)
> dvd.transactions <- read.transactions("DVD.csv", sep=",")
> dvd.transactions
transactions in sparse format with
 10 transactions (rows) and
 11 items (columns)
This tells us that there are, in total, 11 distinct items across all the baskets. The read.transactions function can also read data from a file with a transaction ID and a single item per line (using the format="single" option).
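The item count can be checked without arules. The following is a minimal base R sketch, assuming the ten transactions are typed in as strings rather than read from DVD.csv; the names dvd.lines, baskets and items are illustrative only. Note that the misspelt "Galdiator" in the last transaction counts as an item in its own right, which is why the total is 11.

```r
# The ten example transactions, one basket per line (copied from the data above).
# "Galdiator" is left misspelt on purpose: that is how it appears in the data.
dvd.lines <- c(
  "Sixth Sense,LOTR1,Harry Potter1,Green Mile,LOTR2",
  "Gladiator,Patriot,Braveheart",
  "LOTR1,LOTR2",
  "Gladiator,Patriot,Sixth Sense",
  "Gladiator,Patriot,Sixth Sense",
  "Gladiator,Patriot,Sixth Sense",
  "Harry Potter1,Harry Potter2",
  "Gladiator,Patriot",
  "Gladiator,Patriot,Sixth Sense",
  "Sixth Sense,LOTR,Galdiator,Green Mile"
)

# Split each line on commas to get one character vector per basket.
baskets <- strsplit(dvd.lines, ",", fixed = TRUE)

# Distinct items across all baskets.
items <- sort(unique(unlist(baskets)))

length(baskets)  # number of transactions: 10
length(items)    # number of distinct items: 11
```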
For example, if the data consists of:
1,Sixth Sense
1,LOTR1
1,Harry Potter1
1,Green Mile
1,LOTR2
2,Gladiator
2,Patriot
2,Braveheart
3,LOTR1
3,LOTR2
4,Gladiator
4,Patriot
4,Sixth Sense
5,Gladiator
5,Patriot
5,Sixth Sense
6,Gladiator
6,Patriot
6,Sixth Sense
7,Harry Potter1
7,Harry Potter2
8,Gladiator
8,Patriot
9,Gladiator
9,Patriot
9,Sixth Sense
10,Sixth Sense
10,LOTR
10,Galdiator
10,Green Mile
then it can be loaded with:
> dvd.transactions <- read.transactions("DVD.csv", format="single", sep=",", cols=c(1,2))
> dvd.transactions
transactions in sparse format with
 10 transactions (rows) and
 11 items (columns)
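To make the relationship between the two file layouts concrete, here is a small base R sketch that turns baskets into the one-item-per-line layout. It uses only the first three transactions to stay short, and the names baskets, single and single.lines are illustrative assumptions rather than anything arules requires.

```r
# First three example baskets, one character vector per transaction.
baskets <- list(
  c("Sixth Sense", "LOTR1", "Harry Potter1", "Green Mile", "LOTR2"),
  c("Gladiator", "Patriot", "Braveheart"),
  c("LOTR1", "LOTR2")
)

# Repeat each transaction ID once per item in its basket.
single <- data.frame(
  id   = rep(seq_along(baskets), lengths(baskets)),
  item = unlist(baskets),
  stringsAsFactors = FALSE
)

# One "ID,item" line per purchase, as expected by format="single".
single.lines <- paste(single$id, single$item, sep = ",")
writeLines(single.lines)
```

The resulting lines could be saved with writeLines(single.lines, "some-file.csv") and then read back using the format="single" option.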
A summary of the dataset is obtained in the usual way:
> summary(dvd.transactions)
transactions as itemMatrix in sparse format with
 10 rows (elements/itemsets/transactions) and
 11 columns (items)

most frequent items:
    Gladiator       Patriot   Sixth Sense    Green Mile
            6             6             6             2
Harry Potter1       (Other)
            2             8

element (itemset/transaction) length distribution:
2 3 4 5
3 5 1 1

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2.00    2.25    3.00    3.00    3.00    5.00

includes extended transaction information - examples:
  transactionIDs
1              1
2              2
3              3
The dataset is identified as a sparse matrix consisting of 10 rows (transactions in this case) and 11 columns (items). The number of columns corresponds to the total number of distinct items in the dataset; internally, the transactions are represented as a binary matrix with one column for each item. A distribution across the most frequent items (Gladiator appears in 6 "baskets") is followed by a distribution over the length of each transaction (one transaction has 5 items in the "basket"). The final extended transaction information can be ignored in this simple example, but is explained for the more complex example that follows.
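The binary matrix view can be sketched in base R. This is an illustration of the idea only, not how arules stores the data internally (arules uses a sparse representation); the names baskets, items, incidence and item.counts are assumptions made for the example.

```r
# The ten example baskets ("Galdiator" is misspelt in the source data).
baskets <- list(
  c("Sixth Sense", "LOTR1", "Harry Potter1", "Green Mile", "LOTR2"),
  c("Gladiator", "Patriot", "Braveheart"),
  c("LOTR1", "LOTR2"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Harry Potter1", "Harry Potter2"),
  c("Gladiator", "Patriot"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Sixth Sense", "LOTR", "Galdiator", "Green Mile")
)
items <- sort(unique(unlist(baskets)))

# One row per transaction, one column per item; TRUE where the item is present.
incidence <- t(sapply(baskets, function(b) items %in% b))
colnames(incidence) <- items
dim(incidence)                   # 10 rows, 11 columns

# Column sums give item frequencies; row sums give basket lengths.
item.counts <- sort(colSums(incidence), decreasing = TRUE)
head(item.counts, 3)             # the three items that each appear 6 times
table(rowSums(incidence))        # length distribution of the baskets
summary(rowSums(incidence))      # min, quartiles, mean and max of basket length
```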
Association rules can now be built from the dataset:
> dvd.apriori <- apriori(dvd.transactions)

parameter specification:
 confidence minval smax arem  aval originalSupport support minlen
        0.8    0.1    1 none FALSE            TRUE     0.1      1
 maxlen target   ext
      5  rules FALSE

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[11 item(s), 10 transaction(s)] done [0.00s].
sorting and recoding items ... [7 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 done [0.00s].
writing ... [7 rule(s)] done [0.00s].
creating S4 object  ... done [0.01s].
The output here begins with a summary of the parameters chosen for the algorithm. The default values of confidence (0.8) and support (0.1) are noted, in addition to the minimum and maximum number of items in an itemset (minlen=1 and maxlen=5). The default target is rules, but you could instead target itemsets or hyperedges. These parameters can be set in the call to apriori with the parameter argument, which takes a list of keyword arguments.
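As a sketch of how such a parameter list might be passed, the following builds the transactions directly from a list and supplies several parameters at once. The particular values are hypothetical and chosen only to show the syntax; the call is guarded so that the script still runs where arules is not installed.

```r
# Hypothetical parameter values, for illustration only.
params <- list(support = 0.2, confidence = 0.9,
               minlen = 2, maxlen = 3, target = "rules")

if (requireNamespace("arules", quietly = TRUE)) {
  library(arules)

  # The ten example baskets ("Galdiator" is misspelt in the source data).
  baskets <- list(
    c("Sixth Sense", "LOTR1", "Harry Potter1", "Green Mile", "LOTR2"),
    c("Gladiator", "Patriot", "Braveheart"),
    c("LOTR1", "LOTR2"),
    c("Gladiator", "Patriot", "Sixth Sense"),
    c("Gladiator", "Patriot", "Sixth Sense"),
    c("Gladiator", "Patriot", "Sixth Sense"),
    c("Harry Potter1", "Harry Potter2"),
    c("Gladiator", "Patriot"),
    c("Gladiator", "Patriot", "Sixth Sense"),
    c("Sixth Sense", "LOTR", "Galdiator", "Green Mile")
  )

  # A list of character vectors can be coerced to a transactions object.
  dvd.transactions <- as(baskets, "transactions")

  # Pass the whole list through the parameter argument.
  dvd.rules <- apriori(dvd.transactions, parameter = params)
  inspect(dvd.rules)
}
```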
We view the actual results of the modelling with the inspect function:
> inspect(dvd.apriori)
  lhs                         rhs           support confidence     lift
1 {LOTR1}                  => {LOTR2}           0.2          1 5.000000
2 {LOTR2}                  => {LOTR1}           0.2          1 5.000000
3 {Green Mile}             => {Sixth Sense}     0.2          1 1.666667
4 {Gladiator}              => {Patriot}         0.6          1 1.666667
5 {Patriot}                => {Gladiator}       0.6          1 1.666667
6 {Sixth Sense, Gladiator} => {Patriot}         0.4          1 1.666667
7 {Sixth Sense, Patriot}   => {Gladiator}       0.4          1 1.666667
The rules are listed in order of decreasing lift.
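The three measures can be recomputed by hand to see where the numbers come from. The sketch below uses the standard definitions (support of a rule is the fraction of baskets containing both sides, confidence is that support divided by the support of the left-hand side, and lift is confidence divided by the support of the right-hand side). The helper names support and rule.measures are illustrative and are not part of arules.

```r
# The ten example baskets ("Galdiator" is misspelt in the source data).
baskets <- list(
  c("Sixth Sense", "LOTR1", "Harry Potter1", "Green Mile", "LOTR2"),
  c("Gladiator", "Patriot", "Braveheart"),
  c("LOTR1", "LOTR2"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Harry Potter1", "Harry Potter2"),
  c("Gladiator", "Patriot"),
  c("Gladiator", "Patriot", "Sixth Sense"),
  c("Sixth Sense", "LOTR", "Galdiator", "Green Mile")
)

# Fraction of baskets that contain every item in `itemset`.
support <- function(itemset) {
  mean(sapply(baskets, function(b) all(itemset %in% b)))
}

# Support, confidence and lift for the rule lhs => rhs.
rule.measures <- function(lhs, rhs) {
  supp <- support(c(lhs, rhs))
  conf <- supp / support(lhs)
  lift <- conf / support(rhs)
  c(support = supp, confidence = conf, lift = lift)
}

rule.measures("LOTR1", "LOTR2")                          # rule 1 in the listing
rule.measures(c("Sixth Sense", "Gladiator"), "Patriot")  # rule 6 in the listing
```

For rule 1 this gives support 0.2, confidence 1 and lift 5, and for rule 6 support 0.4, confidence 1 and lift of about 1.667, in agreement with the inspect output.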
We can change the parameters to get other association rules. For example, we might reduce the support and deliver many more rules (81 rules):
> dvd.apriori <- apriori(dvd.transactions, par=list(supp=0.01))
Similarly, we might instead lower the minimum confidence:
> dvd.apriori <- apriori(dvd.transactions, par=list(conf=0.1))
Copyright © 2004-2006 [email protected]