The term Higher Semantics is used here to cover extensions to data mining to accommodate specialist semantic domains, specifically:
Space
Time
The Web
Note that all of these (and others) can appear in all of the technologies discussed so far, i.e., you can have temporal association rule mining, clustering of web data, spatial classification, and so on.
They are dealt with separately as they are commonly occurring but non-trivial types of data.
Relate data to some space.
Geographical - GIS, LIS, Planning, …
Conceptual - dimensions relate to some abstract dimensions, …
Astronomical
Medical - e.g., someone’s brain, viral configurations, …
Chemical - e.g., Atomic Structure
Biological - environmental, molecular
Electronic - e.g., reported faults in a VLSI circuit, …
All can be represented by large collections of geometric objects. Indeed, the geometry selected has a dramatic effect on the data mining processes that can be used.
Geometries include:
Points (in n dimensions)
Lines (finite and infinite)
Planes / Areas
Volumes
Space itself
A definition of a spatial DBMS (from Ralf Güting):
A spatial database system is a database system,
It offers spatial data types in its data model and query language,
It supports spatial data types in its implementation, providing at least spatial indexing and efficient algorithms for spatial join.
Data mining of spatial data needs to take into account the geometries and topologies of spatial objects. It thus has to include relationships such as:
Inside, overlaps
Inside MBP, overlaps MBP
“Close to”
“East of”
For example, in classification these relationships will also drive when cells are coalesced.
They will also make a dramatic difference to the query language extensions that may be available:
SELECT *
FROM ROAD R, GEOBOUNDARIES G
WHERE G.STATENAME = 'SA'
AND R.NAME = 'A12'
AND R INTERSECTS G
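As a minimal sketch of how such topological predicates behave, the following uses the shapely Python package (an assumption; any spatial library with line/polygon types would do) and invented stand-in geometries rather than real road or boundary data:

from shapely.geometry import LineString, Polygon

state = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])   # invented stand-in for a state boundary
road = LineString([(-2, 5), (12, 5)])                    # invented stand-in for a road

print(road.intersects(state))                    # True  - analogous to R INTERSECTS G
print(road.within(state))                        # False - the 'inside' relationship
print(road.envelope.intersects(state.envelope))  # cheap pre-test on bounding boxes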
Association
CloseTo(Beach) ∧ IceCream → SunTanLotion (support%, confidence%)
¬CloseTo(Beach) ∧ IceCream → Strawberries (support%, confidence%)
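A minimal sketch of how the support and confidence of the first rule might be counted once the spatial predicate CloseTo(Beach) has been evaluated and stored as an ordinary item (the transactions below are invented for illustration):

transactions = [
    {"CloseTo(Beach)", "IceCream", "SunTanLotion"},
    {"CloseTo(Beach)", "IceCream", "SunTanLotion"},
    {"CloseTo(Beach)", "IceCream"},
    {"IceCream", "Strawberries"},
]
antecedent = {"CloseTo(Beach)", "IceCream"}
consequent = {"SunTanLotion"}

n = len(transactions)
n_ante = sum(antecedent <= t for t in transactions)                 # antecedent present
n_rule = sum((antecedent | consequent) <= t for t in transactions)  # whole rule present

support = n_rule / n
confidence = n_rule / n_ante if n_ante else 0.0
print(f"support={support:.2f} confidence={confidence:.2f}")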
Clustering
Mining can be done two ways:
Cluster on the non-spatial attributes (aspatial dominance) first, then reason over spatial attributes.
“Areas close to deserts have hot-dry summers”.
Cluster on the spatial attributes (spatial dominance) then reason over non-spatial attributes.
“South Australia has mild winters”.
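A minimal sketch of the spatial-dominance approach, assuming scikit-learn's KMeans and an invented set of weather-station records (coordinates and temperatures are illustrative only):

import numpy as np
from sklearn.cluster import KMeans

# columns: latitude, longitude, mean winter temperature (invented values)
stations = np.array([
    [-34.9, 138.6, 15.0],
    [-35.0, 138.5, 14.5],
    [-34.8, 138.7, 15.5],
    [-42.9, 147.3,  8.0],
    [-43.0, 147.2,  7.5],
])

# cluster on the spatial attributes first ...
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(stations[:, :2])

# ... then reason over the non-spatial attribute within each spatial cluster
for c in range(2):
    temps = stations[labels == c, 2]
    print(f"spatial cluster {c}: mean winter temperature {temps.mean():.1f}")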
Spatial mining can be combined with temporal data mining to form spatio-temporal data mining. In this case there are six options (orderings of the spatial, temporal and non-spatio-temporal attributes):
Space first, time second, non-S/T last,
Time first, non-S/T second, space last, etc.
Absolute
Point in time - 12:05am, Monday, September 15, 2003
Interval - Week beginning Monday, September 15, 2003
Relative
Point in time - This time next week
Interval - During the flight to London
Absolute is relatively easy to deal with:
Must accommodate temporal hierarchies, such as weeks, weekends, calendars (lunar, Gregorian, Julian, Chinese, university, …); a sketch of such a roll-up appears below.
Relative is more complex:
Must be accommodated through reasoning
Can be a mixture of the two.
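A minimal sketch of rolling an absolute point in time up a simple temporal hierarchy, using only the Python standard library (the chosen hierarchy levels are illustrative):

from datetime import datetime

t = datetime(2003, 9, 15, 0, 5)        # 12:05am, Monday, September 15, 2003

hierarchy = {
    "hour":     t.hour,
    "day":      t.strftime("%A"),
    "weekend":  t.weekday() >= 5,      # Saturday or Sunday
    "iso_week": t.isocalendar()[1],    # the week beginning Monday, September 15, 2003
    "month":    t.strftime("%B"),
    "year":     t.year,
}
print(hierarchy)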
Uni-directional and uniform.
Has the ability to suggest cause and effect.
Primitives can be events or intervals.
q.v. Allen (1983), Freksa (1992).
Can be reduced to longitudinal (ordered dataset) data mining by restricting set to before / after / equals.
Can be combined with spatial semantics to give 3D or 4D spatio-temporal rules.
13 interval to interval relationships.
Can be put in a transitivity table to imply further relationships between linked intervals.
E.g., if
A before B and
B meets C then
A before C.
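A minimal sketch of two of Allen's relations over (start, end) intervals, together with the single transitivity-table entry used above (before composed with meets implies before):

def before(a, b):
    return a[1] < b[0]          # A ends before B starts

def meets(a, b):
    return a[1] == b[0]         # A ends exactly where B starts

A, B, C = (1, 3), (4, 6), (6, 9)
assert before(A, B) and meets(B, C)
assert before(A, C)             # inferred: A before B, B meets C  =>  A before C
print("A before C inferred from the transitivity table")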
Similar to spatial, needs to take into account temporal semantics.
As an example of this consider the following scenario.
While a non-temporal association rule might suggest that the presence of stands of River Red Gum Eucalypts is associated with the presence of the endangered Red-Tailed Black Cockatoo, a temporal association rule may indicate that the presence of Cockatoo usually occurs some time after the Eucalypt stand has reached maturity. This may indicate that a recovery plan for the endangered Cockatoo would involve maintaining what might otherwise be considered ageing stands of River Red Gums.
A separate but related area is longitudinal analysis.
Temporal analysis deals with timestamped data.
Longitudinal analysis deals with data that is (merely) ordered in time.
Architecturally, we can mine multiple datasets to discover trends in rules over time.
The result is a set of rules that:
Are implicitly longitudinal.
Exist at sufficient strength in at least a certain number of datasets. This helps to rule out spurious, one-off but dominant anomalies. In the past, capturing these low-support but persistent rules required lowering the support threshold, which increases the volume of results (see the sketch after this list).
Associate sets of items in a different manner, i.e., they associate itemsets: “When A is associated with B then C is associated with D”. Quite a common observation.
Can be regenerated without reference to the original data -> faster.
Can be extended to patterns of support values.
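A minimal sketch of the persistence idea: keep only those rules that reach a (deliberately low) support threshold in at least a minimum number of the time-ordered datasets. The per-period rule supports and thresholds below are invented:

periods = [
    {"A->B": 0.12, "C->D": 0.04, "E->F": 0.30},   # dataset for period 1
    {"A->B": 0.11, "C->D": 0.05},                  # dataset for period 2
    {"A->B": 0.13, "C->D": 0.04, "G->H": 0.40},   # dataset for period 3
]
MIN_SUPPORT = 0.03   # low enough to catch weak but persistent rules
MIN_PERIODS = 3      # ...provided they persist across every period

counts = {}
for rules in periods:
    for rule, support in rules.items():
        if support >= MIN_SUPPORT:
            counts[rule] = counts.get(rule, 0) + 1

persistent = [rule for rule, c in counts.items() if c >= MIN_PERIODS]
print(persistent)    # one-off rules such as G->H drop out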
Interestingness remains one of the toughest problems, particularly when dealing with temporal data mining and an explosion of possible rules.
As well as Support and Confidence there are other metrics for Interestingness, arising from different requirements:
Rare occurrences
Perhaps in consultation with a knowledge base
Changes in associations, classifications...
Anomalies,
Trends, etc.
Counting measures
Support, etc.
Statistical measures or Validity
Confidence, χ², etc. E.g., two standard deviations away from expected values (a sketch follows this list).
Threshold measures
More interested when close to user-defined thresholds, etc.
Attribute-value interest
Associating relative interest at the level of attribute-value. E.g., more interested in rules that pertain to Eastern Suburbs locations.
Actionability or Applicability
Rules that result in actionable results.
Novelty
Rules that either contradict existing data or represent new knowledge. Requires a knowledge base.
Behavioural Deviation
Rules that show a deviation in behaviour over time, ordering or space.
Representativeness
Whether a rule can be said to hold more generally.
Provability
Linked to applicability but may involve attribute-level knowledge of whether something can be tested.
Understandability
Short enough to comprehend?
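As a minimal sketch of a statistical interestingness measure, the χ² statistic for a rule A → B can be computed from a 2×2 contingency table of co-occurrence counts (the counts below are invented):

# rows: A present / absent; columns: B present / absent (invented counts)
table = [[40, 10],
         [20, 30]]

n = sum(sum(row) for row in table)
row_tot = [sum(row) for row in table]
col_tot = [sum(col) for col in zip(*table)]

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_tot[i] * col_tot[j] / n
        chi2 += (table[i][j] - expected) ** 2 / expected

print(f"chi-squared = {chi2:.2f}")   # compare with 3.84, the 95% point for 1 d.o.f.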
Content Mining
Mining either
The content of web pages, or
The content of search results.
May be useful to cluster webpages according to various criteria to find, for example, all other sites “similar” to another site.
Structure Mining
Mining the structure of web pages to determine
cliques,
home pages,
authoritative pages, etc.
Usage Mining
Commonly done through web logs; this is also known as Web Log Mining.
Can be done client side or server side.
Can track general access characteristics or individual user accesses.
Analyses either
the HTML (normally) returned when a web page is requested, or
the content of a specific search request.
In some ways this is an extension of search engines and is a type of text mining.
Can focus on
Keywords and terms (including the use of synonyms, hierarchies, etc.)
Similarities between pages (for example, to find similar pages or to find copyright infringements).
Can report:
A category (and therefore perhaps an importance) of a given page,
The frequency and types of change.
Analyses the manner in which the web is put together.
Can be used to discover:
Cliques - set of pages that are linked to each other. For example, may find a clique of pages dealing with Nanotechnology as they all link to each other.
Home pages and Hubs - pages that are at the centre of a network of pages or at the top of a hierarchy of pages.
Authoritative pages - pages pointed to by other pages. Used, for example, by Google.
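A minimal sketch of scoring hubs and authoritative pages from link structure alone, in the spirit of Kleinberg's HITS algorithm (the link graph below is invented):

links = {                      # page -> pages it links to (invented graph)
    "home": ["a", "b"],
    "hub1": ["a", "b", "c"],
    "hub2": ["a", "c"],
    "a": [], "b": [], "c": [],
}
pages = list(links)
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):            # simple power iteration
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    na = sum(v * v for v in auth.values()) ** 0.5 or 1.0
    nh = sum(v * v for v in hub.values()) ** 0.5 or 1.0
    auth = {p: v / na for p, v in auth.items()}
    hub = {p: v / nh for p, v in hub.items()}

print(sorted(auth, key=auth.get, reverse=True)[:2])   # most authoritative pages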
Analyses the web traffic on either a general or an individual basis, i.e., it will categorise:
The way a website is traversed, or
The way in which a user is traversing a website.
General WUM:
The way in which (groups of) users navigate around a website. Might suggest improvements to a site.
May categorise types of visitor into browsers, serious buyers, browsers who turn into buyers and those who don’t, etc. (a sketch follows below).
Individual WUM:
Monitors the way in which an individual uses a site to:
Categorise the user,
Personalise the website (for example through popup adverts)
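A minimal sketch of individual usage categorisation from an invented clickstream; the page names, sessions and the browser/buyer rule are all illustrative:

clicks = [                                  # (user, page) pairs from an invented log
    ("u1", "/home"), ("u1", "/product/42"), ("u1", "/checkout"),
    ("u2", "/home"), ("u2", "/product/42"), ("u2", "/product/7"),
    ("u3", "/home"),
]

sessions = {}
for user, page in clicks:
    sessions.setdefault(user, []).append(page)

for user, pages in sessions.items():
    kind = "buyer" if "/checkout" in pages else "browser"
    print(f"{user}: visited {len(pages)} pages, categorised as {kind}")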
Again, may look at mining a general population of users or track individuals:
In general mode, may look at:
the types of site visited by members of an organisation,
the way in which users visit sites (length of time on a page, etc.),
Suggest possible ways to reduce network traffic.
In individual mode, might:
Characterise users according to the sites they visit.
DMKD often uses sensitive/secure data
Need identifiers to link datasets
For example, a primary key is often needed to link relations in a corporate database for association rule mining.
This means that data cannot be completely confidentialised.
The ability to find information about individuals from aggregated information.
Consider the following three queries:
Find the average salary of the most highly paid person at Flinders University
Find the average salary of the most highly paid 20 people at Flinders University
Find the average salary of the most highly paid 21 people at Flinders University
Query 1 would be rejected (by a general purpose statistical DB Engine) as it relates to too few entities in the database.
Queries 2 and 3 would be accepted.
Salary of the 21st most highly paid person = 21 × Answer21 − 20 × Answer20 ← Compromise (where Answer20 and Answer21 are the averages returned by queries 2 and 3).
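The compromise in numbers, as a minimal sketch (the 21 salaries are invented):

salaries = sorted([180, 175, 170, 168, 165, 160, 158, 155, 152, 150,
                   148, 145, 143, 140, 138, 135, 133, 130, 128, 125,
                   95], reverse=True)        # 21 invented salaries, in $000s

answer20 = sum(salaries[:20]) / 20           # query 2: average of the top 20
answer21 = sum(salaries[:21]) / 21           # query 3: average of the top 21

inferred = 21 * answer21 - 20 * answer20     # salary of the 21st person
print(round(inferred, 2), salaries[20])      # the individual's salary is recovered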
Assuming that all members of a group conform to the average of the group.
Done all the time by many agencies:
Insurance companies,
Government departments,
… etc.
Extremely useful for simplifying a population into manageable groups.
Becomes a problem when a stereotyped group is adversely discriminated against on the basis of a category listed in the Anti-Discrimination Legislation.
Also becomes a problem when systems are not constructed to cater for members of a category who do not conform to the category average.
Data perturbation.
Change the data by a small amount such that the value itself is incorrect but its effect on statistical calculations is negligible (a sketch follows at the end of this list),
Data swapping
Swap values in attributes so that the value is unreliable but statistical calculations are correct.
Statistical Query Control
Monitor which queries have been executed and disallow any that might result in compromise.
Incorporation of Ethical Filters / Alerters
Allows sensitive rules to be either masked or augmented so that miners are alerted to the sensitivity of the rule.
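A minimal sketch of the data perturbation idea (invented salaries, invented noise scale): each individual value becomes unreliable while the mean is approximately preserved:

import random

random.seed(0)
salaries = [52_000, 61_000, 48_000, 75_000, 58_000, 66_000]
perturbed = [s + random.gauss(0, 1_000) for s in salaries]   # small zero-mean noise

print(sum(salaries) / len(salaries))      # true mean
print(sum(perturbed) / len(perturbed))    # close to the true mean
print(salaries[0], perturbed[0])          # but this individual's value is now wrong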
Data mining is an inductive process.
Therefore the rule can be wrong, either because the dataset is unrepresentative or because an object does not conform.
Data mining provides evidence for further investigation - not proof.
Use a paradigm that puts DM into proper context.
Ie. use an accepted Knowledge Discovery process such as CRISP-DM.
A Legal Issue
Anyone who generates discovered rules is considered the ‘author’ of those rules - at least under (Australian) law,
Anyone who then promulgates the rules is the ‘distributor’ of the rule,
If a rule can be considered defamatory then, generally speaking, the author and the distributor can be sued unless the rule can be proven true (or the promulgation was done in an appropriate manner).
But wait … Data mining works on induction ...
Some attributes have a higher sensitivity than others.
Gender,
Native language
Religion
Aboriginality, etc
Particularly when a minority.
Rules including these attributes should be flagged or in some cases (depending on the authority of the user) perhaps even suppressed.
Research has suggested the modification of the APRIORI algorithm to cater for sensitivity.
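The published modifications are not reproduced here, but a minimal sketch of the flagging idea applied to already-mined rules might look like the following (the sensitive-attribute list, the rules and the flagging policy are illustrative only):

SENSITIVE = {"gender", "religion", "native_language", "aboriginality"}

rules = [                                   # invented rules: (antecedent, consequent)
    ("age>50 & gender=F", "declined_loan"),
    ("suburb=Eastern & income>100k", "approved_loan"),
]

def attributes(clause):
    # crude parse of 'attr=value' / 'attr>value' terms joined by '&'
    return {term.split("=")[0].split(">")[0].strip() for term in clause.split("&")}

for antecedent, consequent in rules:
    used = attributes(antecedent) | attributes(consequent)
    if used & SENSITIVE:
        print(f"FLAGGED ({used & SENSITIVE}): {antecedent} -> {consequent}")
    else:
        print(f"ok: {antecedent} -> {consequent}")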
To ensure that the mining process achieves what is intended.
Applicability of rules produced
Semantics of rules match what is required
To minimise effort and maximise results.
To adhere to acceptable (and defensible) ethical and process standards.
To enable teams to work on the mining process.
To ensure that the process works effectively, i.e.:
What is produced matches what was expected/requested,
Nothing forgotten by mistake,
Each step verified, etc.
Knowledge Discovery is to Data Mining what Software Engineering is to Programming
Quite a few frameworks proposed.
One used for scientific discovery was discussed earlier.
The most common framework is arguably the CRISP-DM (CRoss-Industry Standard Process for Data Mining) framework, developed by an industry-oriented team who wanted to set out what they felt was an appropriate process.