The term Higher Semantics is used here to cover extensions to data mining to accommodate specialist semantic domains, specifically:
Space
Time
The Web
Note that all of these (and others) can appear in all of the technologies discussed so far, i.e., you can have temporal association rule mining, clustering of web data, spatial classification, and so on.
They are dealt with separately as they are commonly occurring but non-trivial types of data.
Spatial data arises in many domains:
Medical - e.g., someone's brain, viral configurations, …
Chemical - e.g., atomic structure
Biological - e.g., environmental, molecular
Electronic - e.g., reported faults in a VLSI circuit, …
All of these can be represented by large collections of geometric objects. Indeed, the geometry selected has a dramatic effect on the data mining processes that can be used.
Geometries include:
Points (in n dimensions)
Lines (finite and infinite)
Planes / Areas
Volumes
Space itself
A definition of a spatial DBMS (from Ralf Güting):
A spatial database system is a database system that:
offers spatial data types in its data model and query language, and
supports spatial data types in its implementation, providing at least spatial indexing and efficient algorithms for spatial join.
Data mining of spatial data needs to take into account the geometries and topologies of spatial objects. It thus has to include relationships such as the following (a sketch appears after this list):
Inside, overlaps
Inside MBR, overlaps MBR (MBR = Minimum Bounding Rectangle)
“Close to”
“East of”
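A minimal sketch of how such relationships might be tested over Minimum Bounding Rectangles. The MBR class and predicate names are illustrative assumptions; a real system would use a spatial library and an R-tree index rather than pairwise tests.

    from dataclasses import dataclass

    @dataclass
    class MBR:
        # Minimum Bounding Rectangle of a spatial object.
        xmin: float
        ymin: float
        xmax: float
        ymax: float

    def overlaps_mbr(a: MBR, b: MBR) -> bool:
        # True if the two rectangles share any region.
        return (a.xmin <= b.xmax and b.xmin <= a.xmax and
                a.ymin <= b.ymax and b.ymin <= a.ymax)

    def inside_mbr(a: MBR, b: MBR) -> bool:
        # True if a lies entirely within b.
        return (a.xmin >= b.xmin and a.xmax <= b.xmax and
                a.ymin >= b.ymin and a.ymax <= b.ymax)

    def east_of(a: MBR, b: MBR) -> bool:
        # A directional relationship: a lies entirely east of b.
        return a.xmin >= b.xmax

    def close_to(a: MBR, b: MBR, distance: float) -> bool:
        # "Close to": the gap between the rectangles is within the given distance.
        dx = max(b.xmin - a.xmax, a.xmin - b.xmax, 0.0)
        dy = max(b.ymin - a.ymax, a.ymin - b.ymax, 0.0)
        return (dx * dx + dy * dy) ** 0.5 <= distance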
For example, in classification these relationships will also drive when cells are coalesced. They will also make a dramatic difference to the query language extensions that may be available.
Temporal data mining can be reduced to longitudinal (ordered dataset) data mining by restricting the relationship set to before / after / equals.
It can be combined with spatial semantics to give 3D or 4D spatio-temporal rules.
There are 13 interval-to-interval relationships (Allen's interval algebra).
These can be put in a transitivity table to imply further relationships between linked intervals (sketched below).
E.g., if A is before B and B meets C, then A is before C.
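As a sketch, a fragment of such a transitivity table can be encoded as a simple lookup. Only the "before" and "meets" relations of the 13 are shown; the names and the representation of intervals as (start, end) pairs are illustrative assumptions.

    def before(a, b):
        # Interval a = (start, end) lies strictly before interval b.
        return a[1] < b[0]

    def meets(a, b):
        # Interval a ends exactly where interval b starts.
        return a[1] == b[0]

    # TRANSITIVITY[(r1, r2)] gives the relation(s) implied between A and C
    # when r1 holds between A and B and r2 holds between B and C.
    # A full table would be 13 x 13, and some cells contain several relations.
    TRANSITIVITY = {
        ("before", "before"): {"before"},
        ("before", "meets"):  {"before"},
        ("meets",  "before"): {"before"},
        ("meets",  "meets"):  {"before"},
    }

    # The example above: A before B and B meets C implies A before C.
    assert TRANSITIVITY[("before", "meets")] == {"before"}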
Similar to spatial data mining, temporal data mining needs to take into account temporal semantics.
As an example of this consider the following scenario.
While a non-temporal association rule might suggest that the presence of stands of River Red Gum Eucalypts is associated with the presence of the endangered Red-Tailed Black Cockatoo, a temporal association rule may indicate that the presence of the Cockatoo usually occurs some time after the Eucalypt stand has reached maturity. This may indicate that a recovery plan for the endangered Cockatoo would involve maintaining what might otherwise be considered ageing stands of River Red Gums.
A separate but related area is longitudinal analysis.
Temporal analysis deals with timestamped data.
Longitudinal analysis deals with data that is (merely) ordered in time.
Architecturally, we can mine multiple datasets to discover trends in rules over time (a sketch follows the list below).
The result is a set of rules that:
Are implicitly longitudinal.
Exist at sufficient strength in at least a certain number of datasets. This helps to rule out spurious, one-off but dominant anomalies. In the past, capturing these low-support but persistent rules required lowering the support threshold, which increased the volume of results.
Associate sets of items in a different manner, i.e., they associate itemsets: "when A is associated with B, then C is associated with D". This is quite a common observation.
Can be regenerated without reference to the original data, which is faster.
Can be extended to patterns of support values.
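A minimal sketch of this architecture, assuming a hypothetical per-dataset miner mine_rules(data) that returns a mapping {rule: support}; the thresholds are illustrative.

    def persistent_rules(datasets, min_support=0.05, min_datasets=3):
        # datasets is a time-ordered sequence; mine each period separately.
        history = {}                                  # rule -> [(period, support), ...]
        for period, data in enumerate(datasets):
            for rule, support in mine_rules(data).items():   # mine_rules is hypothetical
                if support >= min_support:
                    history.setdefault(rule, []).append((period, support))
        # Keep rules that recur in enough periods: this filters out spurious,
        # one-off anomalies without having to lower the support threshold, and
        # the support histories can later be re-analysed without the raw data.
        return {rule: hist for rule, hist in history.items()
                if len(hist) >= min_datasets}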
Interestingness remains one of the toughest problems, particularly when dealing with temporal data mining and an explosion of possible rules.
As well as Support and Confidence there are other metrics for Interestingness, driven by differing requirements; one common example is sketched below.
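Lift is one widely used additional metric (offered here as an example, not drawn from the text above); a sketch computing support, confidence and lift for a rule A -> B:

    def rule_metrics(transactions, a, b):
        # transactions: list of frozensets; a, b: frozensets of items (rule A -> B).
        n = len(transactions)
        n_a  = sum(1 for t in transactions if a <= t)
        n_b  = sum(1 for t in transactions if b <= t)
        n_ab = sum(1 for t in transactions if (a | b) <= t)
        support    = n_ab / n
        confidence = n_ab / n_a
        lift       = confidence / (n_b / n)    # > 1: A and B co-occur more than chance
        return support, confidence, lift

    ts = [frozenset(t) for t in ({"milk", "bread"}, {"milk"},
                                 {"bread"}, {"milk", "bread"})]
    print(rule_metrics(ts, frozenset({"milk"}), frozenset({"bread"})))
    # -> (0.5, 0.666..., 0.888...)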
Content Mining
It may be useful to cluster web pages according to various criteria to find, for example, all other sites "similar" to a given site.
Structure Mining
Mining the structure of web pages to determine
cliques,
home pages,
authoritative pages, etc.
Usage Mining
Commonly done through web logs; this is also known as Web Log Mining.
Can be done client side or server side.
Can track either general access characteristics or individual user accesses.
Content Mining
Analyses either
the HTML (normally) returned when a web page is retrieved, or
the content of a specific search request.
In some ways this is an extension of search engine technology and is a type of text mining.
Can focus on
Keywords and terms (including the use of synonyms, hierarchies, etc.)
Similarities between pages (for example, to find similar pages or to detect copyright infringements; see the sketch after this list).
Can report:
A category (and therefore perhaps an importance) of a given page,
The frequency and types of change.
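A sketch of one common way to measure similarity between pages: cosine similarity over term-frequency vectors. A real system would add tokenisation, stop-word removal and TF-IDF weighting.

    import math
    from collections import Counter

    def cosine_similarity(text_a: str, text_b: str) -> float:
        va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
        dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
        norm = (math.sqrt(sum(c * c for c in va.values())) *
                math.sqrt(sum(c * c for c in vb.values())))
        return dot / norm if norm else 0.0

    # Scores near 1.0 indicate near-duplicate pages, which is useful both for
    # "find similar pages" and for detecting copyright infringements.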
Structure Mining
Analyses the manner in which the web is put together.
Can be used to discover:
Cliques - sets of pages that all link to each other. For example, it may find a clique of pages dealing with Nanotechnology because they all link to each other.
Home pages and Hubs - pages that are at the centre of a network of pages or at the top of a hierarchy of pages.
Authoritative pages - pages pointed to by other pages. Used, for example, by Google (a simplified sketch follows).
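A simplified sketch in the spirit of Kleinberg's HITS algorithm, which formalises hubs and authorities; the graph representation and iteration count are illustrative assumptions.

    def hits(links, iterations=20):
        # links: dict mapping each page to the set of pages it points to.
        pages = set(links) | {q for targets in links.values() for q in targets}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iterations):
            # A page is authoritative if good hubs point to it.
            auth = {p: sum(hub[q] for q in pages if p in links.get(q, set()))
                    for p in pages}
            # A page is a good hub if it points to authoritative pages.
            hub = {p: sum(auth[q] for q in links.get(p, set())) for p in pages}
            # Normalise so the scores stay bounded.
            sa, sh = sum(auth.values()) or 1.0, sum(hub.values()) or 1.0
            auth = {p: s / sa for p, s in auth.items()}
            hub = {p: s / sh for p, s in hub.items()}
        return hub, auth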
Usage Mining
Analyses the web traffic on either a general or an individual basis, i.e., it will categorise users into groups.
This is extremely useful for simplifying a population into manageable groups.
It becomes a problem when a stereotyped group is adversely discriminated against on the basis of a category listed in anti-discrimination legislation.
It also becomes a problem when systems are not constructed to cater for members of a category who do not conform to the category average.
Privacy-preserving techniques include:
Data perturbation
Change the data by a small amount such that the value itself is incorrect but its effect on statistical calculations is negligible (see the sketch after this list).
Data swapping
Swap values within an attribute so that an individual value is unreliable but statistical calculations remain correct.
Statistical Query Control
Monitor which queries have been executed and disallow any that might result in compromise.
Incorporation of Ethical Filters / Alerters
Allows sensitive rules to be either masked or augmented so that miners are alerted to the sensitivity of the rule.
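Minimal sketches of the first two techniques (illustrative only; production systems calibrate the noise carefully, e.g. for differential privacy guarantees):

    import random

    def perturb(values, scale=0.01):
        # Data perturbation: add small zero-mean noise so each individual value
        # is wrong, but aggregates (means, totals) are barely affected.
        return [v + random.gauss(0.0, scale) for v in values]

    def swap(values):
        # Data swapping: shuffle values within the attribute so no value can be
        # tied back to its record, while column statistics remain exact.
        shuffled = list(values)
        random.shuffle(shuffled)
        return shuffled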
Data mining is an inductive process.
Therefore any discovered rule can be wrong, either because the dataset is unrepresentative or because an individual object does not conform.
Data mining provides evidence for further investigation - not proof.
Use a paradigm that puts DM into proper context.
I.e., use an accepted Knowledge Discovery process such as CRISP-DM.
A Legal Issue
Anyone who generates discovered rules is considered the ‘author’ of those rules - at least under (Australian) law,
Anyone who then promulgates the rules is the ‘distributor’ of the rule,
If a rule can be considered defamatory then, generally speaking, the author and the distributor can be sued unless the rule can be proven true (or the promulgation was done in an appropriate manner).
But wait … data mining works on induction …
Some attributes have a higher sensitivity than others.
Rules including these attributes should be flagged or, in some cases (depending on the authority of the user), perhaps even suppressed.
Research has suggested modifying the APRIORI algorithm to cater for sensitivity; a sketch of the general idea follows.
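The published modifications differ in detail; as a minimal sketch of the general idea, sensitivity can at least be enforced as a post-filter over the rules an Apriori-style miner produces. The attribute names and rule format here are invented for illustration.

    SENSITIVE = {"ethnicity", "religion", "health_status"}   # illustrative attributes

    def screen_rules(rules, user_is_authorised=False):
        # rules: iterable of (antecedent, consequent) pairs of attribute names.
        screened = []
        for antecedent, consequent in rules:
            touched = (set(antecedent) | set(consequent)) & SENSITIVE
            if touched and not user_is_authorised:
                continue                       # suppress the rule entirely
            screened.append({"rule": (antecedent, consequent),
                             "sensitive": sorted(touched)})   # flag for the miner
        return screened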
Why follow a structured Knowledge Discovery process?
To ensure that the mining process achieves what is intended:
Applicability of rules produced
Semantics of rules match what is required
To minimise effort and maximise results.
To adhere to acceptable (and defensible) ethical and process standards.
To enable teams to work on the mining process.
To ensure that the process works effectively, i.e.:
What is produced matches what was expected/requested,
Nothing forgotten by mistake,
Each step verified, etc.
Knowledge Discovery is to Data Mining what Software Engineering is to Programming
Quite a few frameworks have been proposed.
One used for scientific discovery was discussed earlier.
The most common framework is arguably CRISP-DM (the CRoss-Industry Standard Process for Data Mining), developed by an industry-oriented team who wanted to set out what they felt was an appropriate process.