Most SPM algorithms talk about discovering episodes.
- A serial episode is one where the tokens in the episode are totally ordered.
- A parallel episode is one where the tokens are only partially ordered (as might be the case if the data come from a number of sensors being polled).
Application areas for this include:
- Genome sequencing,
- Sensor Processing,
- Text Compression,
- Any area where there are a lot of discrete tokens in a stream.
The simplest method is to use the Apriori principle:
- find all the frequent strings of length 1,
- find all the frequent strings of length k from the frequent strings of length k-1.
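The two steps above can be sketched as follows. This is a minimal illustration, not any particular published algorithm: it counts contiguous strings level by level and uses the Apriori principle to prune candidates whose (k-1)-length substrings were not themselves frequent.

```python
from collections import Counter

def frequent_strings(stream, min_support, max_len=5):
    """Apriori-style search for frequent contiguous strings in a token stream.

    A length-k candidate is counted only if both of its length-(k-1)
    substrings were frequent at the previous level (the Apriori principle).
    """
    # Level 1: count single tokens.
    counts = Counter(stream)
    prev = {tok for tok, c in counts.items() if c >= min_support}
    result = {s: counts[s] for s in prev}

    for k in range(2, max_len + 1):
        level = Counter()
        for i in range(len(stream) - k + 1):
            s = stream[i:i + k]
            # Prune: both (k-1)-substrings must already be frequent.
            if s[:-1] in prev and s[1:] in prev:
                level[s] += 1
        prev = {s for s, c in level.items() if c >= min_support}
        if not prev:
            break
        result.update({s: level[s] for s in prev})
    return result
```

For example, `frequent_strings("ABABABC", 2)` keeps A, B, AB, BA, ABA and BAB, while C (support 1) and everything containing it is pruned before ever being counted at the next level.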
Alternatively, some of the more advanced association rule mining algorithms, such as FP-Growth, can be adapted for the purpose. One issue is support: clearly, different levels of support need to be set, as shorter strings are always more likely to occur than longer ones, yet the longer ones are likely to be more interesting.
- Support must exceed the background noise. For k different tokens in a stream of n tokens, a given p-length string will occur at random about n/k^p times. Thus for k = 20, n = 100M, on average you will get:
- p = 3: 12,500 random occurrences
- p = 4: 625 random occurrences
- p = 5: 31 random occurrences … and so on.
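The background-noise figures above come straight from the n/k^p estimate, as this small check shows:

```python
def expected_random_occurrences(n, k, p):
    """Expected count of one specific length-p string in a uniform random
    stream of n tokens drawn from an alphabet of k symbols: about n / k**p."""
    return n / k ** p

for p in (3, 4, 5):
    print(p, expected_random_occurrences(100_000_000, 20, p))
# prints: 3 12500.0 / 4 625.0 / 5 31.25
```

So a length-5 string seen, say, a few hundred times in this stream is already well above chance, whereas a length-3 string needs tens of thousands of occurrences before it stands out.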
Four extensions to sequence mining can be envisaged:
- Interacting Episodes
- Discovering whether two frequent episodes have some temporal relationship (e.g. DEF tends to occur just before XYZ).
- Periodic Patterns
- Do some patterns appear only at certain times? Or in groups?
- Competitor Patterns
- Are there patterns that never occur together? (This is similar to the disassociation rules discussed earlier).
- User Interfaces
The work at Flinders is looking at interacting subsequences, i.e. when one sequence only occurs when another happens. Two interacting sequences can have one or more distinct relationships. Using Allen's interval relationships, this is one or more of 13. If you insist on also knowing their midpoints, this rises to 49 relationships, but reduces to 11 relationships if the intervals are the same length.
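Allen's 13 relations mentioned above can be classified with a chain of endpoint comparisons. The sketch below assumes intervals are given as (start, end) pairs with start < end; the relation names are the standard ones (before/after, meets/met-by, overlaps/overlapped-by, starts/started-by, during/contains, finishes/finished-by, equal).

```python
def allen_relation(a, b):
    """Classify two intervals a = (s1, e1) and b = (s2, e2) into one of
    Allen's 13 interval relations, by comparing their endpoints."""
    (s1, e1), (s2, e2) = a, b
    if e1 < s2:  return "before"
    if e2 < s1:  return "after"
    if e1 == s2: return "meets"
    if e2 == s1: return "met-by"
    if s1 == s2 and e1 == e2: return "equal"
    if s1 == s2: return "starts" if e1 < e2 else "started-by"
    if e1 == e2: return "finishes" if s1 > s2 else "finished-by"
    if s2 < s1 and e1 < e2: return "during"
    if s1 < s2 and e2 < e1: return "contains"
    # Only partial overlaps remain at this point.
    return "overlaps" if s1 < s2 else "overlapped-by"
```

For instance, `allen_relation((1, 5), (3, 8))` returns `"overlaps"`, and `allen_relation((2, 4), (1, 9))` returns `"during"`.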
A common form of sequence is a pattern that recurs. Moreover, some patterns may only occur during specified intervals. This might be according to:
- Some calendar
- Financial, Gregorian, lunar, …
- Some other pattern
- I.e. during one pattern, another recurs, but only then.
- Some internal absolute time stamping.
- For example, the token stream may have time stamps embedded, e.g. EAECBDAtEACCBAtCADEDEBACt has the pattern ABA (as a subsequence) between each time stamp t.
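The timestamp example above can be checked mechanically: split the stream on the embedded time stamp token, then test whether the pattern occurs as an ordered (possibly gapped) subsequence of every segment. This is only an illustrative sketch of that idea:

```python
def is_subsequence(pattern, segment):
    """True if pattern occurs in segment as an ordered, possibly
    gapped, subsequence (each `in` test consumes the iterator)."""
    it = iter(segment)
    return all(tok in it for tok in pattern)

stream = "EAECBDAtEACCBAtCADEDEBACt"
segments = [s for s in stream.split("t") if s]   # split on the time stamp token
print(all(is_subsequence("ABA", s) for s in segments))  # prints: True
```

Each of the three segments (EAECBDA, EACCBA, CADEDEBAC) contains A…B…A in order, so the pattern recurs in every inter-timestamp interval.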
Text (data) mining is mining from natural language text. As such, algorithms have to be adapted (although sometimes not greatly) to the language in question. Extensions to existing algorithms can apply the same ideas to text:
- Association rule mining can produce groups of words occurring together in documents, which might be used to form keywords for a wider set of documents,
- Clustering can be used to group similar documents,
- Characterisation can be used to determine why a group of documents has been grouped, etc.
Early text mining attempted to infer the major parts of a piece of text (and their importance) from where the text occurred.
- Titles indicate more important information than in-text words; similarly indexes, section headings, etc.
- Having determined the sections, text within sections can be associated with more confidence than text between sections.
- Further importance can be inferred from proximity within sentence structure.
- Markup languages can help further by providing indicators as to the start of sections.
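One simple way to act on these positional cues is to weight each word by where it appeared. The weights and the `(kind, text)` input format below are hypothetical choices for illustration, not values from any particular system:

```python
from collections import defaultdict

# Hypothetical weights: titles count most, headings next, body text least.
WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}

def weighted_term_counts(parts):
    """parts: list of (kind, text) pairs, where kind says where in the
    document the text occurred (as recovered from markup, say). Words from
    titles and headings score higher, reflecting the idea that position
    signals importance."""
    scores = defaultdict(float)
    for kind, text in parts:
        w = WEIGHTS.get(kind, 1.0)
        for word in text.lower().split():
            scores[word] += w
    return dict(scores)
```

For example, `weighted_term_counts([("title", "Data Mining"), ("body", "mining text")])` scores "mining" at 4.0 (title plus body) against 1.0 for "text", even though both appear only twice and once respectively.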
Any dataset which can be viewed as a corpus of related documents can be mined:
- Reports
- Academic Papers
- Websites
- News articles
- Transcripts
- Hansard
- The Australian constitution …
The word can be viewed as the major token. However, in natural language, the same term can be a:
- Synonym: a word or phrase that means exactly or nearly the same as another word or phrase in the same language, for example shut is a synonym of close.
- Homonym: two or more words having the same spelling but different meanings and origins (e.g., pole and pole)
- Meronym: a term that denotes part of something but which is used to refer to the whole of it, e.g., faces when used to mean people in I see several familiar faces present.
- Metonym: a word, name, or expression used as a substitute for something else with which it is closely associated. For example, Canberra is a metonym for the federal government.
- Hypernym: a word with a broad meaning that more specific words fall under; a superordinate. For example, colour is a hypernym of red. (opposite is hyponym).