Agrawal, R., Imielinski, T. and Swami, A. (1993). Mining Association Rules Between Sets of Items in Large Databases. ACM SIGMOD International Conference on Management of Data, Washington DC, USA, 22. ACM Press. 207-216.
Although limited in its capabilities, it has since become the basis of many mining routines.
The problem can be summarised as follows:
Given a large set of transactions (t1, .. tn) which consist of one or more of a large number of items (i1, .. im), report those items that occur together frequently.
Apriori works by creating frequent itemsets of increasing length. It is assisted in this by the observation that if an itemset I of length k is not frequent, then nor will any longer itemsets that contain I.
Determining if a rule is “interesting” is a tough question. The two metrics used for association rule mining are:
Determining if a rule is “interesting” is a tough question. The two metrics used for association rule mining are:
Support (σ) - P(A∪B) - Applied to Itemsets
ie. How often do A and B appear in the dataset together.
Confidence (γ) - P(B|A) - Applied to Rules
ie. Of the times A appears in the dataset, how often does it appear with B. Note this is different from P(B|A)
Consider a transaction set as follows:
Consider a transaction set as follows:
Trans 1 Butter, Biscuits, Cream, Newspaper, Bread, Chocolate
Trans 2 Cream, Newspaper, Tea, Oil, Chocolate
Trans 3 Chocolate, Cereal, Bread
Trans 4 Chocolate, Flour, Biscuits, Newspaper
Trans 5 Chocolate, Biscuits, Newspaper
The first step is to make a list of all separate items that occur in the dataset together with the number of times they occur.
The next stage is to create 2-item itemsets by creating every possible combination of the 1-item itemsets.
The next stage is to create 2-item itemsets by creating every possible combination of the 1-item itemsets.
This is done by first:
Creating all possible 2-item itemsets from the 1-item itemsets. These are termed the C2 sets.
Then the C2 sets are checked against the dataset itself to see which actually occur. These are then known as the L2 itemsets.
The C2 itemsets includes every 2-item combination of Biscuits, Bread, Chocolate, Cream and Newspaper.
The C2 itemsets includes every 2-item combination of Biscuits, Bread, Chocolate, Cream and Newspaper.
After this, 3-item itemsets are created (first by creating the C3 itemset and then the L3 itemsets) and so on.
After this, 3-item itemsets are created (first by creating the C3 itemset and then the L3 itemsets) and so on.
Note that for a C3 itemset ABC to be created we need ALL of AB, BC and AC as L2 itemsets. Similarly, to form a C4 itemset ABCD to be created we need ALL of ABC, ABD, ACD and BCD as L3 itemsets.
This goes on until no more itemsets can be created.
The full list of k-itemsets (k≥2) is thus:
We then form rules from these itemsets.
Rules are constructed by taking each combination of each frequent itemset (on each side of the rule) and calculating confidence values (ie. P(B | A) = tc(A->B)/tc(A) where tc(x) is the number of transactions containing all items in x.
Rules are constructed by taking each combination of each frequent itemset (on each side of the rule) and calculating confidence values (ie. P(B | A) = tc(A->B)/tc(A) where tc(x) is the number of transactions containing all items in x.
A specified confidence level is used to cull those rules not reaching some specified level of accuracy.
Biscuits -> Chocolate 100%
Biscuits -> Chocolate 100%
Chocolate -> Biscuits 60% Not reported
Biscuits ∧ Chocolate -> Newspaper 100%
Chocolate ∧ Newspaper -> Biscuits 75% Not reported
Newspaper ∧ Biscuits -> Chocolate 100%
Biscuits -> Chocolate ∧ Newspaper 100%
Chocolate -> Newspaper ∧ Biscuits 60% Not reported
Newspaper -> Biscuits ∧ Chocolate 75% Not reported