A merger of (at least) four disciplines

Date: 25.07.2018

Selection

Navigation

Coordinated Displays

Distortion

Fisheye View
Perspective Wall

Object Characteristics

Shape
Size
Colour

Animation

The term Higher Semantics is used here to cover extensions to data mining to accommodate specialist semantic domains, specifically:

Space
Time
The Web

Note that all of these (and others) can appear in all of the technologies discussed so far. I.e., you can have temporal association rule mining, clustering of web data, spatial classification, and so on.
They are dealt with separately as they are commonly occurring but non-trivial types of data.

Relate data to some space.

Geographical - GIS, LIS, Planning, …
Conceptual - dimensions relate to some abstract dimensions, …
Astronomical
Medical - eg., someone’s brain, viral configurations, …
Chemical - eg., Atomic Structure
Biological - environmental, molecular
Electronic - eg. Reported faults in a VLSI circuit, …

All can be represented by large collections of geometric objects. Indeed, the geometry selected has a dramatic effect on the data mining processes that can be used.

Geometries include:

Points (in n dimensions)
Lines (finite and infinite)
Planes / Areas
Volumes
Space itself

A definition of a spatial DBMS (from Ralf Güting):

A spatial database system is a database system,
It offers spatial data types in its data model and query language,
It supports spatial data types in its implementation, providing at least spatial indexing and efficient algorithms for spatial join.

Data mining of spatial data needs to take into account the geometries and topologies of spatial objects. It therefore has to include relationships such as:

Inside, overlaps
Inside MBP, overlaps MBP
“Close to”
“East of”

For example, in classification these relationships will also drive when cells are coalesced.
They will also make a dramatic difference to the query language extensions that may be available.
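The relationships listed above can be sketched as simple predicates. The following is a minimal illustration only: it assumes every object is reduced to an axis-aligned bounding box (a stand-in for whatever bounding geometry is actually used), that x grows eastwards, and that "close to" means a centre-to-centre distance under a chosen threshold. The names Box, inside, overlaps, close_to and east_of are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box; x grows eastwards, y grows northwards."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def centre(self):
        return ((self.xmin + self.xmax) / 2, (self.ymin + self.ymax) / 2)


def inside(a, b):
    """True if box a lies entirely within box b."""
    return (b.xmin <= a.xmin and a.xmax <= b.xmax
            and b.ymin <= a.ymin and a.ymax <= b.ymax)


def overlaps(a, b):
    """True if the two boxes share some area."""
    return (a.xmin < b.xmax and b.xmin < a.xmax
            and a.ymin < b.ymax and b.ymin < a.ymax)


def close_to(a, b, threshold):
    """True if the box centres are within `threshold` of each other (assumed definition)."""
    (ax, ay), (bx, by) = a.centre(), b.centre()
    return math.hypot(ax - bx, ay - by) <= threshold


def east_of(a, b):
    """True if a starts at or beyond the eastern edge of b."""
    return a.xmin >= b.xmax


# Hypothetical objects for illustration.
beach = Box(0, 0, 10, 2)
kiosk = Box(4, 0.5, 5, 1.5)
town = Box(20, 0, 30, 10)
print(inside(kiosk, beach), east_of(town, beach), close_to(town, beach, 5))
```

Predicates of this kind are what a spatial miner would evaluate when building items such as Closeto(Beach).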

SELECT *
FROM ROAD R, GEOBOUNDARIES G
WHERE G.STATENAME = 'SA'
AND R.NAME = 'A12'
AND R INTERSECTS G

Association

Closeto(Beach), IceCream → SunTanLotion (),()
¬Closeto(Beach), IceCream → Strawberries (),()

Clustering

Mining can be done in two ways:

Cluster on the non-spatial attributes (aspatial dominance) first, then reason over spatial attributes.

“Areas close to deserts have hot-dry summers”.

Cluster on the spatial attributes (spatial dominance) then reason over non-spatial attributes.

“South Australia has mild winters”.

Spatial mining can be combined with temporal data mining to form spatio-temporal data mining. In this case there are six options:

Space first, time second, non-S/T last,
Time first, non-S/T second, space last, etc.
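The six options are simply the orderings of the three kinds of attribute. As a small sketch (the labels are taken from the list above; the variable names are illustrative), they can be enumerated with the standard library:

```python
from itertools import permutations

# The three kinds of attribute a spatio-temporal miner can give priority to.
DIMENSIONS = ("space", "time", "non-S/T")

# 3! = 6 possible orders in which the dimensions can be processed.
orderings = list(permutations(DIMENSIONS))

for first, second, last in orderings:
    print(f"{first} first, {second} second, {last} last")
```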

Absolute

Point in time - 12:05am, Monday, September 15, 2003
Interval - Week beginning Monday, September 15, 2003

Relative

Point in time - This time next week
Interval - During the flight to London

Absolute is relatively easy to deal with:

Must accommodate temporal hierarchies, such as weeks, weekends, calendars (lunar, Gregorian, Julian, Chinese, university, …)

Relative is more complex:

Must be accommodated through reasoning

Can be a mixture of the two.
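For absolute time, placing a timestamp within a temporal hierarchy is mostly a lookup. The sketch below covers only the Gregorian calendar and ISO weeks, using the time point quoted earlier as input. The function name temporal_context and the returned keys are assumptions made for illustration. Other calendars (lunar, university, and so on) would need their own mappings.

```python
from datetime import datetime, timedelta

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def temporal_context(ts):
    """Place an absolute time point within a few levels of a temporal hierarchy."""
    _, iso_week, _ = ts.isocalendar()
    # Monday of the week that contains the timestamp.
    week_beginning = (ts - timedelta(days=ts.weekday())).date()
    return {
        "weekday": WEEKDAYS[ts.weekday()],
        "iso_week": iso_week,
        "week_beginning": week_beginning,
        "is_weekend": ts.weekday() >= 5,
    }


# 12:05am, Monday, September 15, 2003
ctx = temporal_context(datetime(2003, 9, 15, 0, 5))
print(ctx)
```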

Uni-directional and uniform.
Has the ability to suggest cause and effect.
Primitives can be events or intervals.

qv. Allen 83, Freksa 92

Can be reduced to longitudinal (ordered dataset) data mining by restricting the set of relationships to before / after / equals.
Can be combined with spatial semantics to give 3D or 4D spatio-temporal rules.

13 interval to interval relationships.
Can be put in a transitivity table to imply further relationships between linked intervals.
Eg. If

A before B and
B meets C then
A before C.
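The interval relationships can be made concrete with a small classifier. This is a sketch, assuming intervals are (start, end) pairs with start < end and using the conventional names for the 13 relations. It checks the transitivity example above on one set of hypothetical intervals rather than implementing the full transitivity table.

```python
def allen_relation(a, b):
    """Return the relation of interval a to interval b (one of 13)."""
    a1, a2 = a
    b1, b2 = b
    if a2 < b1:
        return "before"
    if b2 < a1:
        return "after"
    if a2 == b1:
        return "meets"
    if b2 == a1:
        return "met-by"
    if a1 == b1 and a2 == b2:
        return "equals"
    if a1 == b1:
        return "starts" if a2 < b2 else "started-by"
    if a2 == b2:
        return "finishes" if a1 > b1 else "finished-by"
    if b1 < a1 and a2 < b2:
        return "during"
    if a1 < b1 and b2 < a2:
        return "contains"
    # Remaining cases: the intervals partially overlap.
    return "overlaps" if a1 < b1 else "overlapped-by"


# Hypothetical intervals matching the example: A before B, B meets C.
A, B, C = (1, 3), (5, 8), (8, 10)
print(allen_relation(A, B), allen_relation(B, C), allen_relation(A, C))
```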

Similar to spatial mining, temporal mining needs to take temporal semantics into account.

As an example of this consider the following scenario.

While a non-temporal association rule might suggest that the presence of stands of River Red Gum Eucalypts is associated with the presence of the endangered Red-Tailed Black Cockatoo, a temporal association rule may indicate that the Cockatoo usually appears some time after the Eucalypt stand has reached maturity. This may indicate that a recovery plan for the endangered Cockatoo would involve maintaining what might otherwise be considered ageing stands of River Red Gums.

A separate but related area is longitudinal analysis.

Temporal analysis deals with timestamped data.
Longitudinal analysis deals with data that is (merely) ordered in time.
Architecturally, we can mine multiple datasets to discover trends in rules over time.

The result is a set of rules that:

Are implicitly longitudinal.
Exist at sufficient strength in at least a certain number of datasets. This helps to rule out spurious, one-off but dominant anomalies. In the past, capturing these low-support but persistent rules required lowering the support threshold, which increases the volume of results.
Associate sets of items in a different manner, i.e. they associate itemsets: “When A is associated with B then C is associated with D”. Quite a common observation.
Can be regenerated without reference to the original data -> faster.
Can be extended to patterns of support values.
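A minimal sketch of the "sufficient strength in at least a certain number of datasets" idea follows. It assumes each dataset has already been mined separately, so the input is a time-ordered list of rule-to-support mappings. The function name, thresholds and rule labels are all hypothetical, and only the mined rules are consulted, not the original data.

```python
from collections import defaultdict


def persistent_rules(rule_sets, min_support, min_datasets):
    """Keep rules meeting min_support in at least min_datasets of the rule sets.

    rule_sets: list (ordered in time) of dicts mapping rule -> support.
    Returns rule -> list of (dataset index, support) where it was strong enough.
    """
    history = defaultdict(list)
    for period, rules in enumerate(rule_sets):
        for rule, support in rules.items():
            if support >= min_support:
                history[rule].append((period, support))
    return {rule: seen for rule, seen in history.items()
            if len(seen) >= min_datasets}


# Hypothetical rule sets mined from three successive datasets.
rule_sets = [
    {"A->B": 0.04, "C->D": 0.30},   # C->D is dominant but appears only once
    {"A->B": 0.05},
    {"A->B": 0.05, "E->F": 0.02},
]
result = persistent_rules(rule_sets, min_support=0.03, min_datasets=3)
print(result)
```

The per-dataset support values kept in the result are what a later step could inspect for patterns of support over time.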

Interestingness remains one of the toughest problems, particularly when dealing with temporal data mining and an explosion of possible rules.
As well as Support and Confidence there are other metrics for Interestingness, arising from different requirements:

Rare occurrences

Perhaps in consultation with a knowledge base

Changes in associations, classifications...
Anomalies,
Trends, etc.

Counting measures

Support, etc.

Statistical measures or Validity

Confidence, 2, etc. Eg. Two standard deviations away from expected values.

Threshold measures

More interested when close to user-defined thresholds, etc.

Attribute-value interest

Associating relative interest at the level of attribute-value. Eg. More interested in rules that pertain to Eastern Suburbs locations.

Actionability or Applicability

Rules that result in actionable results.

Novelty

Rules that either contradict existing data or represent new knowledge. Requires a knowledge base.

Behavioural Deviation

Rules that show a deviation in behaviour over time, ordering or space.

Representativeness

Whether a rule can be said to hold more generally.

Provability

Linked to applicability but may involve attribute-level knowledge of whether something can be tested.

Understandability

Short enough to comprehend?

Content Mining

Mining either

The content of web pages, or
The content of search results.

May be useful to cluster webpages according to various criteria to find, for example, all other sites “similar” to another site.

Structure Mining

Mining the structure of web pages to determine

cliques,
home pages,
authoritative pages, etc.

Usage Mining

Commonly done through web logs, and so also known as Web Log Mining.

Can be done client side or server side.
Can track general access characteristics or individual user accesses.

Analyses either

the HTML (normally) returned when a web site is requested, or
the content of a specific search request.

In some ways it is an extension of search engines and a type of text mining.
Can focus on

Keywords and terms (including the use of synonyms, hierarchies, etc.)
Similarities between pages (for example, to find similar pages or to find copyright infringements).

Can report:

A category (and therefore perhaps an importance) of a given page,
The frequency and types of change.
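One simple way to measure similarity between pages is the Jaccard coefficient over their term sets. The sketch below is only one of many possible measures. It assumes the page text has already been extracted from the HTML and ignores synonyms and term hierarchies. The function names are illustrative.

```python
import re


def terms(text):
    """Lower-case alphabetic terms appearing in a page's text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(page_a, page_b):
    """Shared terms divided by all terms; 1.0 means identical term sets."""
    ta, tb = terms(page_a), terms(page_b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical page texts.
print(jaccard("data mining of web pages", "web data mining"))
```

A score near 1.0 between pages from different sites might flag a near-copy worth checking, while moderate scores suggest "similar" pages.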

Analyses the manner in which the web is put together.
Can be used to discover:

Cliques - set of pages that are linked to each other. For example, may find a clique of pages dealing with Nanotechnology as they all link to each other.
Home pages and Hubs - pages that are at the centre of a network of pages or at the top of a hierarchy of pages.
Authoritative pages - pages pointed to by other pages. Used, for example, by Google.
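On a small link graph these structures can be found by brute force. The sketch below assumes the graph is a dictionary mapping each page to the set of pages it links to, treats a clique as a group of pages that all link to each other in both directions, and uses the in-link count as a crude proxy for authority. It is not the algorithm any search engine actually uses, and the page names are hypothetical.

```python
from itertools import combinations


def in_link_counts(links):
    """Number of other pages pointing at each page."""
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in targets:
            if target != source:
                counts[target] = counts.get(target, 0) + 1
    return counts


def is_clique(pages, links):
    """True if every pair of pages links to each other in both directions."""
    return all(b in links.get(a, ()) and a in links.get(b, ())
               for a, b in combinations(pages, 2))


def cliques_of_size(links, size):
    """All groups of `size` mutually linked pages (brute force, small graphs only)."""
    return [set(group) for group in combinations(sorted(links), size)
            if is_clique(group, links)]


# Hypothetical link graph.
links = {
    "nano1": {"nano2", "nano3", "portal"},
    "nano2": {"nano1", "nano3", "portal"},
    "nano3": {"nano1", "nano2"},
    "portal": set(),
    "blog": {"portal"},
}
counts = in_link_counts(links)
print(cliques_of_size(links, 3), max(counts, key=counts.get))
```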

Analyses the web traffic on either a general or an individual basis. Ie. will categorise:

The way a website is traversed, or
The way in which a user is traversing a website.

General WUM:

The way in which (groups of) users navigate around a website. Might suggest improvements to a site.
May categorise types of visitor into browsers, serious buyers, browsers who turn into buyers and those who don’t etc.

Individual WUM:

Monitors the way in which an individual uses a site to:

Categorise the user,
Personalise the website (for example through popup adverts)
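Both the general and the individual view can be derived from the same log. The sketch below assumes the web log has already been parsed into (user, timestamp, page) records. The record layout, the "/checkout" page and the buyer/browser rule are hypothetical simplifications.

```python
from collections import Counter, defaultdict


def paths_by_user(log):
    """Order each user's page requests by timestamp."""
    paths = defaultdict(list)
    for user, _, page in sorted(log, key=lambda rec: (rec[0], rec[1])):
        paths[user].append(page)
    return dict(paths)


def transition_counts(paths):
    """General view: how often visitors move from one page to the next."""
    counts = Counter()
    for pages in paths.values():
        counts.update(zip(pages, pages[1:]))
    return counts


def categorise(pages):
    """Individual view: a deliberately naive buyer/browser split."""
    return "buyer" if "/checkout" in pages else "browser"


# Hypothetical parsed log records.
log = [
    ("u1", 1, "/home"), ("u1", 2, "/product"), ("u1", 3, "/checkout"),
    ("u2", 1, "/home"), ("u2", 2, "/product"), ("u2", 3, "/home"),
]
paths = paths_by_user(log)
print(transition_counts(paths).most_common(1))
print({user: categorise(pages) for user, pages in paths.items()})
```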

Again, may look at mining a general population of users or track individuals:
In general mode, may look at:

the types of site visited by members of an organisation,
the way in which users visit sites (length of time on a page, etc.),
Suggest possible ways to reduce network traffic.

In individual mode, might:

Characterise users according to the sites they visit.

DMKD often uses sensitive/secure data
Need identifiers to link datasets

For example - often needs primary key to link relations in a corporate database for association rule mining.
This means that data cannot be completely confidentialised.

The ability to find information about individuals from aggregated information.

Consider the following three queries:

Find the average salary of the most highly paid person at Flinders University
Find the average salary of the most highly paid 20 people at Flinders University
Find the average salary of the most highly paid 21 people at Flinders University

Query 1 would be rejected (by a general purpose statistical DB Engine) as it relates to too few entities in the database.
Queries 2 and 3 would be accepted.
Salary of 21st most highly paid person = (21 × Answer to Query 3) - (20 × Answer to Query 2) <- Compromise
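The compromise is plain arithmetic: each average multiplied by its group size gives the group total, and the difference between the two totals is the 21st salary, i.e. 21 × (average of top 21) - 20 × (average of top 20). The sketch below uses made-up salaries and a hypothetical minimum query-set size of 5 to stand in for the statistical database engine.

```python
def average_of_top(salaries, k, min_group=5):
    """Aggregate query that refuses to answer for very small groups."""
    if k < min_group:
        raise PermissionError("query relates to too few entities")
    top = sorted(salaries, reverse=True)[:k]
    return sum(top) / k


# Hypothetical salaries: 50,000 to 170,000 in steps of 5,000 (25 people).
salaries = list(range(50_000, 175_000, 5_000))

answer_20 = average_of_top(salaries, 20)
answer_21 = average_of_top(salaries, 21)

# Group totals differ by exactly one person's salary.
inferred = 21 * answer_21 - 20 * answer_20
actual = sorted(salaries, reverse=True)[20]
print(inferred, actual)
```

Both queries are individually acceptable, yet together they disclose one individual's salary.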

Assuming that all members of a group conform to the average of the group.
Done all the time by many agencies:

Insurance companies,
Government departments,
… etc.

Extremely useful for simplifying a population into manageable groups.
Becomes a problem when a stereotyped group is adversely discriminated against on the basis of a category listed in the Anti-Discrimination Legislation.
Also becomes a problem when systems are not constructed to cater for members of a category that do not conform to the category average.

Data perturbation.

Change the data by a small amount such that the value itself is incorrect but its effect on statistical calculations is negligible.

Data swapping

Swap values in attributes so that the value is unreliable but statistical calculations are correct.
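Both techniques can be sketched in a few lines. The example assumes a single numeric attribute, zero-mean Gaussian noise for perturbation and a random shuffle across records for swapping. The data, noise scale and function names are hypothetical, and real schemes are considerably more careful about preserving relationships between attributes.

```python
import random
from statistics import mean


def perturb(values, scale, rng):
    """Add small zero-mean noise: individual values change, aggregates barely move."""
    return [v + rng.gauss(0, scale) for v in values]


def swap(values, rng):
    """Shuffle values between records: each record's value is unreliable,
    but the column's distribution (and so its mean) is unchanged."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(0)                       # seeded for repeatability
ages = [rng.randint(18, 90) for _ in range(1000)]

perturbed = perturb(ages, 1.0, rng)
swapped = swap(ages, rng)

print(mean(ages), mean(perturbed), mean(swapped))
```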

Statistical Query Control

Monitor which queries have been executed and disallow any that might result in compromise.

Incorporation of Ethical Filters / Alerters

Allows sensitive rules to be either masked or augmented so that miners are alerted to the sensitivity of the rule.

Data mining is an inductive process.

Therefore the rule can be wrong, either because the dataset is unrepresentative or because an object does not conform.

Data mining provides evidence for further investigation - not proof.

Use a paradigm that puts DM into proper context.
Ie. use an accepted Knowledge Discovery process such as CRISP-DM.

A Legal Issue

Anyone who generates discovered rules is considered the ‘author’ of those rules - at least under (Australian) law,
Anyone who then promulgates the rules is the ‘distributor’ of the rule,
If a rule can be considered defamatory then, generally speaking, the author and the distributor can be sued unless the rule can be proven true (or the promulgation was done in an appropriate manner).
But wait … Data mining works on induction ...

Some attributes have a higher sensitivity than others.

Gender,
Native language
Religion
Aboriginality, etc

Particularly when a minority.
Rules including these attributes should be flagged or in some cases (depending on the authority of the user) perhaps even suppressed.
Research has suggested the modification of the APRIORI algorithm to cater for sensitivity.
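The flag-or-suppress idea can be sketched as a filter applied to rules after mining. This is not the modified Apriori algorithm referred to above, only a post-processing illustration. The attribute names, the authorisation flag and the three outcomes are assumptions.

```python
# Hypothetical list of attributes treated as sensitive.
SENSITIVE = {"Gender", "NativeLanguage", "Religion", "Aboriginality"}


def review_rule(rule_attributes, user_is_authorised):
    """Decide how a discovered rule should be presented to this user.

    Returns "show", "flag" (show with a sensitivity alert) or "suppress".
    """
    touched = SENSITIVE & set(rule_attributes)
    if not touched:
        return "show"
    return "flag" if user_is_authorised else "suppress"


print(review_rule({"Suburb", "Income"}, user_is_authorised=False))
print(review_rule({"Religion", "Income"}, user_is_authorised=True))
print(review_rule({"Religion", "Income"}, user_is_authorised=False))
```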

To ensure that the mining process achieves what is intended.

Applicability of rules produced
Semantics of rules match what is required

To minimise effort and maximise results.
To adhere to acceptable (and defensible) ethical and process standards.
To enable teams to work on the mining process.
To ensure that the process works effectively. Ie.

What is produced matches what was expected/requested,
Nothing forgotten by mistake,
Each step verified, etc.

Knowledge Discovery is to Data Mining what Software Engineering is to Programming

Quite a few frameworks proposed.
One used for scientific discovery was discussed earlier.
The most common framework is arguably the CRISP-DM (CRoss-Industry Standard Process for Data Mining) framework, developed by an industry-oriented team who wanted to set out what they felt was an appropriate process.
Consists of 6 stages.

Business Understanding

Determine Business Objectives
Assess Situation
Determine Data Mining Goals
Produce Project Plan

Data Understanding

Collect (Initial) Data
Describe Data
Explore Data
Verify Data Quality

Data Preparation

Select Data
Clean Data
Construct Data
Integrate Data
Format Data
