Prediction uses that model together with new data to predict new values.
Predictive Accuracy - how well does the model predict new values?
Capability - how well does the process handle different forms of data?
Speed - how fast is the model building and/or the prediction process?
Robustness - how well does the model deal with spurious data?
Scalability - how well does the process handle increases in the volume of data?
Interpretability - how understandable are the results?
Incrementation - how well can the model be updated when given new data?
Class Attribute - the attribute that holds the different values for the groups we want to classify. E.g. if we are trying to classify n different types of disease, then diseasetype (which would have n distinct values) would be the class attribute.
Sample - part of a dataset. In classification terms almost always a partitioning based on the class attribute.
Test Attribute - another attribute we use to split the dataset into samples such that the diversity of values of the class attribute within each sample is lower.
Entropy - amount of disorder with respect to the class attribute.
Information (Gain) - a theoretical measure of (the increase in) knowledge held by a given sample.
Select the class attribute.
Repeat until all attributes used up or tree complete
    For each branch (each sample at that point in the tree)
        If not yet sufficiently accurate
            Select the attribute (and the values of that attribute) that best distinguishes the different values in the target attribute
            Add these branches to the tree
        Else terminate branch
Create Rules from Tree
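The loop above can be sketched in Python roughly as follows. This is a minimal sketch, not a definitive implementation: it is written recursively rather than level by level, the names build_tree, best_attribute and min_purity are hypothetical, "sufficiently accurate" is assumed to mean that the majority class reaches a purity threshold, and "best distinguishes" is assumed to mean lowest remaining entropy. The four rows are taken from the table in these notes, keeping only Dept, Car and LtL.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Disorder of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def best_attribute(rows, attributes, target):
    """Pick the attribute whose split leaves the least weighted entropy."""
    def remaining(attr):
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row[target])
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return min(attributes, key=remaining)

def build_tree(rows, attributes, target, min_purity=1.0):
    """Return a nested dict; a leaf is simply the majority class label."""
    labels = [row[target] for row in rows]
    majority, count = Counter(labels).most_common(1)[0]
    # Terminate the branch if it is accurate enough or no attributes remain.
    if count / len(rows) >= min_purity or not attributes:
        return majority
    attr = best_attribute(rows, attributes, target)
    rest = [a for a in attributes if a != attr]
    branches = {}
    for value in sorted({row[attr] for row in rows}):
        subset = [row for row in rows if row[attr] == value]
        branches[value] = build_tree(subset, rest, target, min_purity)
    return {attr: branches}

# Four rows from the table in these notes (Dept, Car and LtL columns only).
rows = [
    {"Dept": "Sales", "Car": "Yes", "LtL": "Unlikely"},
    {"Dept": "Sales", "Car": "No", "LtL": "Unlikely"},
    {"Dept": "IT", "Car": "No", "LtL": "Likely"},
    {"Dept": "IT", "Car": "Yes", "LtL": "Maybe"},
]
tree = build_tree(rows, ["Dept", "Car"], "LtL")
print(tree)
```

Lowering min_purity below 1.0 makes branches terminate earlier, which is one simple way to get the "sufficiently accurate" stopping test.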
Based (loosely) on the physics concept of Entropy (the idea of disorder)
We choose the attribute that most decreases entropy (i.e. has the greatest information gain); that is, the attribute whose split best separates the values of the target attribute. The chosen attribute may be different for each branch.
Steps are as follows (full equations given in Han and Kamber, p286):
Determine the information needed to classify a given dataset on a specified target attribute by working out the sum of the information needed to classify each sample (each distinct value) of the target attribute.
Choosing each attribute in turn, work out the entropy of choosing that attribute by calculating the sum of the (weighted) information for each subset based on a value of the attribute.
The Gain for that attribute is thus the original information less the entropy (the disorder) remaining.
Choose the attribute with the highest Gain (i.e. the attribute that leaves the least disorder with respect to the target attribute).
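As a rough illustration of these steps, the sketch below computes the gain of each categorical attribute for the table in these notes, with LtL as the target attribute. The function names info and gain are hypothetical, the question marks are dropped from the column names, and Age is left out because it is numeric and would first need to be split into ranges. This follows the usual information-gain formulation and is not a transcription of the equations in Han and Kamber.

```python
from collections import Counter, defaultdict
from math import log2

COLUMNS = ["Age", "Dept", "Married", "Children", "Car", "LtL"]
TABLE = """
20 Sales Yes Yes Yes Unlikely
18 IT Yes Yes No Likely
45 IT Yes No No Maybe
63 Sales No No Yes Unlikely
23 Mktg No Yes Yes Maybe
17 Admin No No No Likely
19 Admin Yes No No Unlikely
51 Mgmt Yes No Yes Unlikely
31 Mgmt No No Yes Unlikely
34 IT Yes No No Maybe
41 Sales Yes Yes No Unlikely
34 IT No Yes No Likely
25 Mktg Yes Yes No Maybe
61 Sales Yes No No Unlikely
34 IT No No Yes Maybe
56 Mktg Yes Yes Yes Maybe
23 Admin Yes Yes Yes Unlikely
"""
rows = [dict(zip(COLUMNS, line.split())) for line in TABLE.strip().splitlines()]

def info(labels):
    """Step 1: information needed to classify a set of target values (bits)."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def gain(rows, attr, target="LtL"):
    """Steps 2 and 3: original information less the weighted entropy remaining."""
    before = info([r[target] for r in rows])
    subsets = defaultdict(list)
    for r in rows:
        subsets[r[attr]].append(r[target])
    after = sum(len(s) / len(rows) * info(s) for s in subsets.values())
    return before - after

# Step 4: choose the attribute with the highest gain.
gains = {a: gain(rows, a) for a in ["Dept", "Married", "Children", "Car"]}
best = max(gains, key=gains.get)
for attr, g in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{attr:10s} {g:.3f}")
print("best:", best)
```

Dept comes out on top here, which matches the table: every Sales, Mktg and Mgmt row has a single LtL value, so splitting on Dept leaves disorder only in the IT and Admin samples.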
Run down the tree depth-first, creating the rules from the attribute-value pairs on each limb.
Amalgamate (and maybe simplify) the rules that end up predicting the same value.
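The two steps above might look like this in Python. It is a sketch under stated assumptions: the tree is assumed to be a nested dict of the form {attribute: {value: subtree_or_leaf}}, the function names are hypothetical, and the example tree is made up for illustration rather than derived from the data in these notes. Simplification of rule terms is not attempted.

```python
def extract_rules(tree, conditions=()):
    """Depth-first walk; yields (conditions, predicted_class) for each leaf."""
    if not isinstance(tree, dict):
        yield list(conditions), tree
        return
    (attr, branches), = tree.items()  # each internal node tests one attribute
    for value, subtree in branches.items():
        yield from extract_rules(subtree, conditions + ((attr, value),))

def merge_rules(rules):
    """Amalgamate rules that predict the same class value."""
    merged = {}
    for conds, label in rules:
        merged.setdefault(label, []).append(conds)
    return merged

def format_rules(merged):
    """Render each class as one IF ... THEN rule with OR-ed alternatives."""
    lines = []
    for label, bodies in merged.items():
        alts = [" AND ".join(f"{a} = {v}" for a, v in conds) or "TRUE"
                for conds in bodies]
        lines.append("IF " + " OR ".join(f"({alt})" for alt in alts)
                     + f" THEN {label}")
    return lines

# Hypothetical tree, for illustration only.
tree = {"Dept": {"Sales": "Unlikely", "Mgmt": "Unlikely",
                 "IT": {"Car": {"No": "Likely", "Yes": "Maybe"}}}}
rule_lines = format_rules(merge_rules(extract_rules(tree)))
for line in rule_lines:
    print(line)
```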
Pruning Techniques:
Prepruning - stop growing a branch of the tree as soon as that branch is sufficiently accurate.
Postpruning - create the tree to completion and then remove either:
Branches that are unnecessarily complex, or
Rule terms that do not affect (or have little effect on) the accuracy of the result.
Age  Dept   Married?  Children?  Car?  LtL?
20   Sales  Yes       Yes        Yes   Unlikely
18   IT     Yes       Yes        No    Likely
45   IT     Yes       No         No    Maybe
63   Sales  No        No         Yes   Unlikely
23   Mktg   No        Yes        Yes   Maybe
17   Admin  No        No         No    Likely
19   Admin  Yes       No         No    Unlikely
51   Mgmt   Yes       No         Yes   Unlikely
31   Mgmt   No        No         Yes   Unlikely
34   IT     Yes       No         No    Maybe
41   Sales  Yes       Yes        No    Unlikely
34   IT     No        Yes        No    Likely
25   Mktg   Yes       Yes        No    Maybe
61   Sales  Yes       No         No    Unlikely
34   IT     No        No         Yes   Maybe
56   Mktg   Yes       Yes        Yes   Maybe
23   Admin  Yes       Yes        Yes   Unlikely
This is sometimes confused …
Sequential Pattern Mining will be taken to mean the mining of long sequences of tokens for frequently occurring subsequences.
Time-series mining is the analysis of large amounts of data time-stamped in some way.
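To make the sequential pattern mining definition concrete, here is a minimal sketch that counts frequently occurring subsequences in one long token sequence. It makes simplifying assumptions that the notes do not state: subsequences are contiguous and of a fixed length, and "frequent" means occurring at least min_support times. The function name and parameters are hypothetical.

```python
from collections import Counter

def frequent_subsequences(tokens, length, min_support):
    """Count every contiguous run of `length` tokens; keep the frequent ones."""
    counts = Counter(
        tuple(tokens[i:i + length]) for i in range(len(tokens) - length + 1)
    )
    return {seq: n for seq, n in counts.items() if n >= min_support}

tokens = list("abcabcabd")
result = frequent_subsequences(tokens, length=2, min_support=2)
print(result)
```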