A merger of (at least) four disciplines





Date: 25.07.2018


  • Needs to traverse the database once for each level: with n transactions and k levels of itemsets, the routine is O(nk).

  • Cannot handle

    • Hierarchies
    • Categorical Values
    • Continuous Values
  • Has been improved on considerably by a number of new algorithms

    • FP Growth, for example, finds frequent itemsets without generating candidates first.
    • There are multi-level association rule generation routines that accommodate concept hierarchies.
    • There are routines that handle time and space,
    • There are routines that work on parallel machines,
    • There are routines that spot competitor items, and so on.
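The level-wise traversal noted in the first bullet can be sketched as follows. This is a minimal illustration of Apriori-style candidate generation and counting, not an optimized implementation; the function name `apriori`, the `passes` counter, and the toy `baskets` data are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Level-wise frequent itemset search; returns (frequent, passes)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # Level-1 candidates: every single item seen in the data.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent, passes = {}, 0
    while candidates:
        passes += 1  # one full counting scan of the database per level
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates and prune any
        # candidate that has an infrequent k-subset.
        k = len(next(iter(level))) + 1 if level else 0
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in level for s in combinations(union, k - 1)
            ):
                candidates.add(union)
    return frequent, passes


# Hypothetical toy data, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent, passes = apriori(baskets, min_support=0.5)
```

Each iteration of the `while` loop rescans all n transactions, which is where the O(nk) database traversal cost comes from; candidate-free methods such as FP Growth avoid this repeated scanning.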



  • While Apriori uses transaction databases, conventional relational databases can also be used. For example, the following relation:

    • Id      Sex  Department  Age  License?  Location
    • 21788A  M    Sales       31   Y         Adelaide
    • 21771H  F    Mgmt        42   N         Sydney
    • 12299I  M    Payroll     …
    • :
  • This can be translated into a transaction dataset:

    • Id.21788A, Sex.M, Department.Sales, Age.31, License.Y, Location.Adelaide
    • Id.21771H, Sex.F, Department.Mgmt, Age.42, License.N, Location.Sydney
    • Id.12299I, Sex.M, Department.Payroll …
    • :
  • This can then be mined to create rules such as:

    • Sex.M, Department.Sales -> License.Y σ(15%), γ(72%)
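The translation from relation to transactions can be sketched in a few lines. The example below uses only the two complete rows shown above and assumes that σ and γ denote support and confidence; the helper names `to_transaction`, `support`, and `confidence` are illustrative, not part of any library.

```python
rows = [
    {"Id": "21788A", "Sex": "M", "Department": "Sales", "Age": 31,
     "License": "Y", "Location": "Adelaide"},
    {"Id": "21771H", "Sex": "F", "Department": "Mgmt", "Age": 42,
     "License": "N", "Location": "Sydney"},
]


def to_transaction(row):
    """Turn one relational row into a set of Attribute.Value items."""
    return frozenset(f"{attr}.{value}" for attr, value in row.items())


def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """support(antecedent and consequent) / support(antecedent)."""
    ante = frozenset(antecedent)
    s = support(transactions, ante)
    return support(transactions, ante | frozenset(consequent)) / s if s else 0.0


transactions = [to_transaction(r) for r in rows]
rule_support = support(transactions, {"Sex.M", "Department.Sales", "License.Y"})
rule_confidence = confidence(transactions, {"Sex.M", "Department.Sales"}, {"License.Y"})
```

With only two rows the resulting figures are not meaningful; the 15% and 72% quoted in the rule above would come from the full relation.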



  • Note that three attributes need special attention:

    • Id will never appear in a rule, as it will never have the required support.
      • This is generally desirable as it effectively confidentialises the rules.
    • Age, if left as it is in the database, is unlikely to be part of a rule as the values are too diverse. Thus we need to segment (or discretize) the range, e.g.
      • Age.31-40, Age.41-50, etc.
      • The problem is often deciding on the ranges to use:
        • Some are predetermined by the interests of the user.
        • Others may be derived automatically; there are routines to do this, such as binning and analyzing association rules after generation.
        • Finally, Age can be put in a hierarchy, as for Location below.
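As a minimal sketch of the simplest option, fixed-width binning, the function below maps an age to an item of the form Age.31-40. The name `age_item` and the defaults `width=10` and `start=1` are assumptions chosen to match the example ranges above; it expects ages of at least `start`.

```python
def age_item(age, width=10, start=1):
    """Map an age to a fixed-width bin item such as 'Age.31-40'."""
    low = ((age - start) // width) * width + start
    return f"Age.{low}-{low + width - 1}"


items = [age_item(a) for a in (31, 40, 42)]
```

Fixed-width bins are only one choice; ranges set by the user or derived from the data would replace the arithmetic in `age_item`.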



    • Location may be easier to interpret if it is placed in a hierarchy, for example with Adelaide generalized to its state, SA.
    • Thus, while the rule
    • Sex.M, Location.Adelaide -> License.Y
    • may not have the required support, the rule
    • Sex.M, Location.SA -> License.Y
    • might.
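The effect of moving up a hierarchy can be illustrated with a small sketch. The `PARENT` mapping and the four records below are hypothetical and exist only to show that generalizing an item can never lower its support.

```python
# Hypothetical one-level hierarchy: city -> state.
PARENT = {"Adelaide": "SA", "Mount Gambier": "SA", "Sydney": "NSW"}

# Hypothetical records: (sex, location, license).
records = [
    ("M", "Adelaide", "Y"),
    ("M", "Mount Gambier", "Y"),
    ("M", "Sydney", "N"),
    ("F", "Adelaide", "Y"),
]


def items(sex, location, license_, generalize=False):
    """Build a transaction, optionally replacing the city by its state."""
    loc = PARENT.get(location, location) if generalize else location
    return frozenset({f"Sex.{sex}", f"Location.{loc}", f"License.{license_}"})


def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


city_level = [items(*r) for r in records]
state_level = [items(*r, generalize=True) for r in records]

city_support = support(city_level, {"Sex.M", "Location.Adelaide", "License.Y"})
state_support = support(state_level, {"Sex.M", "Location.SA", "License.Y"})
```

In this toy data the state-level itemset has twice the support of the city-level one, so a rule that fails a minimum-support threshold at the city level may pass it at the state level.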



  • There are a few problems with hierarchies:

    • As the hierarchy is ascended, the support a rule needs before it is worth reporting increases. For example:
    • Peas -> Coffee
    • might be considered interesting, but:
    • Grocery -> Beverage
    • would have to reach a very high level of support before it would be considered useful.
    • There may be more than one hierarchy. Which one do we use?
      • If they are (supposed to be) independent, then we can use them all, as an association between them might be interesting.
      • If there are known dependencies (e.g. between public holidays and the day of the week), then we must stop such associations dominating our rules.



  • Linking in external files as additional attributes can be extremely useful. For example, to the admission date field of a hospital record we might link:

    • Meteorological data
    • Public holiday data
    • Day of the week
    • Pollen Count
    • Lunar Cycles
  • Similarly, on home postcode we might link:

    • SLA data
    • Census data
    • Geographic data, and so on.
  • As with hierarchies, we have to ensure that known associations between externally linked data do not dominate the mining routine. When mining hospital data, for example, we do not want to discover that:

    • Temperature.Low -> PollenCount.Low
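One simple way to keep such rules out of the results is to discard any rule whose items all come from externally linked attributes. This is only one possible filter, not necessarily the approach intended here; the attribute names in `EXTERNAL` and the second rule in `rules` are hypothetical.

```python
# Hypothetical names for attributes that were linked in from external files.
EXTERNAL = {"Temperature", "PollenCount", "PublicHoliday", "DayOfWeek"}


def attribute(item):
    """Return the attribute part of an Attribute.Value item."""
    return item.split(".", 1)[0]


def is_external_only(antecedent, consequent):
    """True if every item in the rule comes from a linked external attribute."""
    return all(attribute(i) in EXTERNAL for i in (*antecedent, *consequent))


rules = [
    (("Temperature.Low",), ("PollenCount.Low",)),    # known external association
    (("PollenCount.High",), ("Diagnosis.Asthma",)),  # hypothetical hospital rule
]
kept = [r for r in rules if not is_external_only(*r)]
```

Rules that mix an external attribute with a field from the hospital record are kept, since those are the associations the linking was meant to expose.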



  • Consider the following three rules:

    • A -> C σ(15%), γ(72%)
    • B -> C σ(16%), γ(67%)
    • AB -> C σ(3%), γ(12%)
  • The last rule is clearly interesting: it implies that while A and B are each associated with C, A and B very rarely occur together, and when they do they are only loosely correlated with C.

  • This is the common form of association rules for competitors. That is, the existence of one suppresses the other and vice versa.

  • The common way to calculate competitors is to work out the expected support and confidence and to compare it with what was observed.




  • Consider the following:

    • Transactions - 100,000
    • Purchases of A - 50,000
    • Purchases of B - 40,000
    • Purchases of A and B - 15,000
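Using the figures above, the comparison of expected and observed values can be worked through directly. The sketch assumes that "expected" means what independence of A and B would predict; the variable names are illustrative.

```python
n = 100_000     # transactions
n_a = 50_000    # purchases of A
n_b = 40_000    # purchases of B
n_ab = 15_000   # purchases of A and B together

# Under independence, P(A and B) = P(A) * P(B).
expected_ab = n_a * n_b / n          # expected joint purchases: 20,000
ratio = n_ab / expected_ab           # observed / expected: 0.75

# Confidence of A -> B: expected is P(B), observed is P(B | A).
expected_confidence = n_b / n        # 0.40
observed_confidence = n_ab / n_a     # 0.30
```

A ratio below 1 means A and B are bought together less often than independence would predict, which is the pattern expected of competing items.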
