DATA MINING
Desktop Survival Guide by Graham Williams
The Apriori algorithm is the original association rule algorithm. Each
transaction is thought of as a basket of items (which we might
represent as a set such as {A, B, C}). The algorithm searches for
collections of items that appear together in multiple baskets (e.g.,
{A, B}). From these so-called itemsets it identifies rules like
A, B ⇒ C, which we read as indicating that A and B
appearing in a transaction typically entails that C will also
appear in the transaction.
The basis of an association analysis algorithm is the generation of
frequent itemsets. However, naïve approaches will be quite expensive
in computational time with even moderately sized databases. The
Apriori algorithm takes advantage of the simple apriori
observation that all subsets of a frequent itemset must also be
frequent. That is, if {A, B, C} is a frequent itemset
then so must each of the smaller itemsets {A, B}, {A, C}, {B, C},
{A}, {B}, and {C}. This observation allows the algorithm to consider a
significantly reduced search space by starting with frequent
individual items (eliminating rare items). We can then combine these
into itemsets containing just two items and retain only those that are
frequent enough. Similarly for itemsets containing three items, and so
on.
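This level-wise search can be sketched in a few lines of Python. The transactions below are an invented toy dataset, and the function names are mine, not part of any particular library; the point is the pruning step, where a candidate itemset is discarded if any of its subsets is not already known to be frequent:

```python
from itertools import combinations

# Hypothetical toy dataset: each transaction is a basket (set) of items.
baskets = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

def frequent_itemsets(baskets, min_support=0.4):
    """Level-wise search: grow frequent itemsets one item at a time,
    pruning any candidate that has an infrequent subset (the apriori
    observation)."""
    n = len(baskets)
    support = lambda items: sum(items <= b for b in baskets) / n
    # Level 1: frequent individual items (rare items are eliminated here).
    items = {i for b in baskets for i in b}
    frequent = {frozenset([i]) for i in items if support({i}) >= min_support}
    all_frequent = set(frequent)
    k = 2
    while frequent:
        # Candidates: unions of frequent (k-1)-itemsets that have k items.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if support(c) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent
```

With `min_support=0.4` (at least 2 of the 5 baskets), every non-empty subset of {A, B, C} turns out to be frequent in this toy data, so the search stops after level 3.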
Suppose we have a rule of the form A ⇒ B. We
call A the antecedent and B the
consequent, and both are non-empty sets of items.
The concept of frequent enough is a parameter of the
algorithm, used to control the number of association rules discovered.
This support specifies how frequently the
items must appear in the whole dataset before the items can be
considered as a candidate association rule. For example, the user may
choose to consider only sets of items that occur in at least 5% of
all transactions. Formally, we define the support for a
collection of items I as the proportion of all baskets in
which all items in I appear. Then we can define the support
for an association rule as:

support(A ⇒ B) = support(A ∪ B)
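Under this definition, support is computed by counting baskets. A minimal sketch, using an invented set of baskets for illustration:

```python
# Hypothetical baskets for illustration.
baskets = [
    {"milk", "bread", "cheese"},
    {"milk", "bread"},
    {"milk", "cheese"},
    {"bread"},
]

def support(itemset, baskets):
    """Proportion of baskets that contain every item in the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

# The support of the rule {milk, bread} => {cheese} is the support of
# the union of antecedent and consequent.
print(support({"milk", "bread", "cheese"}, baskets))  # 0.25
```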
A second parameter, the confidence,
calculates the proportion of transactions containing A
that also contain B. The confidence specifies a minimal
probability for the association rule. For example, the user may
choose to generate only rules that are true at least 90% of the time
(that is, when A appears in the basket, B also
appears in the same basket at least 90% of the time). Formally:

confidence(A ⇒ B) = support(A ∪ B) / support(A)
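Confidence follows directly from the support counts. A small sketch, again with invented baskets; here the rule {milk, bread} ⇒ {cheese} holds in only one of the two baskets containing its antecedent:

```python
# Hypothetical baskets for illustration.
baskets = [
    {"milk", "bread", "cheese"},
    {"milk", "bread"},
    {"milk", "cheese"},
    {"bread"},
]

def support(itemset, baskets):
    """Proportion of baskets that contain every item in the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """confidence(A => B) = support(A ∪ B) / support(A)."""
    return (support(antecedent | consequent, baskets)
            / support(antecedent, baskets))

print(confidence({"milk", "bread"}, {"cheese"}, baskets))  # 0.5
```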
Copyright © 2004-2006 [email protected] Support further development through the purchase of the PDF version of the book.