Agrawal, R., Imielinski, T. and Swami, A. (1993). Mining Association Rules Between Sets of Items in Large Databases. ACM SIGMOD International Conference on Management of Data, Washington DC, USA, 22. ACM Press. 207-216.
Although limited in its capabilities, it has since become the basis of many mining routines.
The problem can be summarised as follows:
Given a large set of transactions (t1, .. tn) which consist of one or more of a large number of items (i1, .. im), report those items that occur together frequently.
Apriori works by creating frequent itemsets of increasing length. It is assisted in this by the observation that if an itemset I of length k is not frequent, then nor will any longer itemsets that contain I.
Determining if a rule is “interesting” is a tough question. The two metrics used for association rule mining are:
Determining if a rule is “interesting” is a tough question. The two metrics used for association rule mining are:
The first step is to make a list of all separate items that occur in the dataset together with the number of times they occur.
The next stage is to create 2-item itemsets by creating every possible combination of the 1-item itemsets.
The next stage is to create 2-item itemsets by creating every possible combination of the 1-item itemsets.
This is done by first:
Creating all possible 2-item itemsets from the 1-item itemsets. These are termed the C2 sets.
Then the C2 sets are checked against the dataset itself to see which actually occur. These are then known as the L2 itemsets.
The C2 itemsets includes every 2-item combination of Biscuits, Bread, Chocolate, Cream and Newspaper.
The C2 itemsets includes every 2-item combination of Biscuits, Bread, Chocolate, Cream and Newspaper.
After this, 3-item itemsets are created (first by creating the C3 itemset and then the L3 itemsets) and so on.
After this, 3-item itemsets are created (first by creating the C3 itemset and then the L3 itemsets) and so on.
Note that for a C3 itemset ABC to be created we need ALL of AB, BC and AC as L2 itemsets. Similarly, to form a C4 itemset ABCD to be created we need ALL of ABC, ABD, ACD and BCD as L3 itemsets.
This goes on until no more itemsets can be created.
The full list of k-itemsets (k≥2) is thus:
We then form rules from these itemsets.
Rules are constructed by taking each combination of each frequent itemset (on each side of the rule) and calculating confidence values (ie. P(B | A) = tc(A->B)/tc(A) where tc(x) is the number of transactions containing all items in x.
Rules are constructed by taking each combination of each frequent itemset (on each side of the rule) and calculating confidence values (ie. P(B | A) = tc(A->B)/tc(A) where tc(x) is the number of transactions containing all items in x.
A specified confidence level is used to cull those rules not reaching some specified level of accuracy.