A merger of (at least) four disciplines


Classification builds a model of the data



Date: 25.07.2018


  • Classification builds a model of the data

  • Prediction uses that model together with new data to predict new values.




  • Predictive Accuracy - how well does the model predict new values?

  • Capability - how well does the process handle different forms of data?

  • Speed - how fast is the model building and/or the prediction process?

  • Robustness - how well does the model deal with spurious data?

  • Scalability - how well does the process handle increases in the volume of data?

  • Interpretability - how understandable are the results?

  • Incrementation - how well can the model be updated when given new data?




  • Class Attribute - the attribute that holds the different values for the groups we want to classify. E.g., if we are trying to classify n different types of disease, then diseasetype (which would have n distinct values) would be the class attribute.

  • Sample - part of a dataset. In classification terms, almost always a partition based on the class attribute.

  • Test Attribute - another attribute used to split the dataset into samples such that the diversity of values of the class attribute within each sample is lower.

  • Entropy - the amount of disorder with respect to the class attribute.

  • Information (Gain) - a theoretical measure of (the increase in) knowledge held by a given sample.




  • Select the class attribute.

  • Repeat until all attributes are used up or the tree is complete

    • For each branch (each sample at that point in the tree)
      • If not yet sufficiently accurate
        • Select the attribute (and the values of that attribute) that best distinguishes the different values in the target attribute
        • Add these branches to the tree
      • Else terminate branch
  • Create Rules from Tree
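
The loop above can be sketched in Python as follows. This is a minimal illustration, not the method from the slides: it assumes that rows are dictionaries, that "sufficiently accurate" means the sample holds a single class value, and that the best attribute is the one with the highest information gain. The names `entropy`, `split`, `gain` and `build_tree`, and the toy data, are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(rows, class_attr):
    """Disorder of the class attribute within a sample (a list of dicts)."""
    counts = Counter(row[class_attr] for row in rows)
    total = len(rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def split(rows, attr):
    """Partition a sample by the values of one test attribute."""
    parts = {}
    for row in rows:
        parts.setdefault(row[attr], []).append(row)
    return parts

def gain(rows, attr, class_attr):
    """Entropy before the split minus the weighted entropy after it."""
    remaining = sum(len(part) / len(rows) * entropy(part, class_attr)
                    for part in split(rows, attr).values())
    return entropy(rows, class_attr) - remaining

def build_tree(rows, attrs, class_attr):
    """Return a class label (leaf) or an (attribute, {value: subtree}) node."""
    labels = Counter(row[class_attr] for row in rows)
    # Terminate the branch: the sample is pure (assumed stopping rule),
    # or no test attributes are left, so fall back to the majority label.
    if len(labels) == 1 or not attrs:
        return labels.most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, class_attr))
    rest = [a for a in attrs if a != best]
    return (best, {value: build_tree(part, rest, class_attr)
                   for value, part in split(rows, best).items()})

# Hypothetical toy data; "buys" is the class attribute.
rows = [
    {"married": "yes", "car": "yes", "buys": "yes"},
    {"married": "yes", "car": "no",  "buys": "yes"},
    {"married": "no",  "car": "yes", "buys": "no"},
    {"married": "no",  "car": "no",  "buys": "no"},
]
tree = build_tree(rows, ["married", "car"], "buys")
print(tree)
```

In this toy data `married` separates the two class values completely while `car` does not separate them at all, so the tree has a single split on `married`.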




  • Based (loosely) on the physics concept of Entropy (the idea of disorder)

  • We choose the attribute that most decreases entropy (has the greatest information gain), i.e. the attribute whose values best distinguish the values of the target attribute. This may be a different attribute for each branch.

  • Steps are as follows (full equations given in Han and Kamber, p286):

    • Determine the information needed to classify a given dataset on a specified target attribute by working out the sum of the information needed to classify each sample (each distinct value) of the target attribute.
    • Choosing each attribute in turn, work out the entropy of choosing that attribute by calculating the sum of the (weighted) information for each subset based on a value of the attribute.
    • The Gain for that attribute is thus the original information less the entropy (the disorder) remaining.
    • Choose the attribute with the highest Gain (i.e. the attribute that leaves the least disorder with respect to the target attribute).
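
For reference, the steps above correspond to the usual ID3-style formulation, sketched below. The notation is an assumption and may differ from the cited text: s is the number of rows in the dataset, s_i the number of rows with the i-th of m target values, and s_{ij} the number of rows that have both the j-th of v values of attribute A and the i-th target value.

```latex
% Information needed to classify the dataset, with p_i = s_i / s
I(s_1, \dots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i

% Entropy of attribute A: weighted information of the v subsets it induces
E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \dots + s_{mj}}{s} \, I(s_{1j}, \dots, s_{mj})

% Gain: original information less the entropy remaining after the split
\mathrm{Gain}(A) = I(s_1, \dots, s_m) - E(A)
```

As a hypothetical illustration, a dataset of 14 rows with 9 in one class and 5 in the other needs I(9, 5) ≈ 0.940 bits.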



  • Run down the tree in a depth-first manner, creating the rules from the attribute-value pairs on each limb.

  • Amalgamate (and maybe simplify) the rules that end up predicting the same value.

  • Pruning Techniques:

    • Prepruning - stop growing a branch once it is sufficiently accurate.
    • Postpruning - create the tree to completion and then remove either:
      • Branches that are unnecessarily complex, or
      • Rule terms that do not affect (or have little effect on) the accuracy of the result.
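
Rule creation from a finished tree can be sketched as below. This is only an illustration under stated assumptions: the tree is represented as nested `(attribute, {value: subtree})` tuples with plain strings as leaves, and "amalgamating" is reduced to grouping rules by predicted value, with no simplification or pruning. The function names and the example tree are hypothetical.

```python
def extract_rules(tree, conditions=()):
    """Walk the tree depth-first; emit one (conditions, label) rule per leaf."""
    if not isinstance(tree, tuple):  # leaf: a class label
        return [(list(conditions), tree)]
    attr, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, conditions + ((attr, value),)))
    return rules

def amalgamate(rules):
    """Group the condition lists of rules that predict the same value."""
    grouped = {}
    for conditions, label in rules:
        grouped.setdefault(label, []).append(conditions)
    return grouped

# Hypothetical tree: internal nodes are (attribute, {value: subtree}) tuples.
tree = ("married", {
    "yes": "buys",
    "no": ("car", {"yes": "buys", "no": "does not buy"}),
})

rules = extract_rules(tree)
for conditions, label in rules:
    body = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
    print(f"IF {body} THEN {label}")

by_label = amalgamate(rules)
```

Each root-to-leaf path becomes one rule, and `by_label` collects the alternative condition lists that lead to the same prediction, which is where simplification would start.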


[Example table; only the column headers survive: Age, Dept, Married?, Children?, Car?, LtL?]








  • This is sometimes confused …

    • Sequential Pattern Mining will be taken to mean the mining of long sequences of tokens for frequently occurring subsequences.
    • Time-series mining is the analysis of large amounts of data time-stamped in some way.


