The process of understanding the world around us is based on a form of induction. Scientific induction is the process by which scientists arrive at improvements in the way we view the physical laws of the universe. To avoid the problem of experimenter's bias, we adopt "null hypotheses" and try to refute them. Other frameworks are discussed later.
Reiter's "Closed World Assumption", adopted implicitly by most databases, states (in general terms) that whatever is not stored in the database is false. This allows conventional DB queries to be:
- Sound - nothing in the answer that shouldn't be there, and
- Complete - nothing left out of the answer that should be there.
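The contrast can be sketched in a few lines of Python (the facts and names are illustrative, not from any real database): under the CWA, a query over the stored facts never answers "unknown" - absence means false.

```python
# A minimal sketch of the Closed World Assumption.
# The "database" stores only facts known to be true.
enrolled = {("alice", "CS101"), ("bob", "CS101"), ("alice", "MA200")}

def is_enrolled(student, course):
    # Under the CWA, absence from the database means false --
    # there is no third "unknown" answer.
    return (student, course) in enrolled

print(is_enrolled("alice", "CS101"))  # True  (stored, so true)
print(is_enrolled("bob", "MA200"))    # False (not stored, so assumed false)
```

Relative to the stored facts, such queries are sound and complete in exactly the sense above.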
Data mining is not that clean.
- It either does not assume the "Closed World Assumption" or uses algorithms that generalise/summarise the data (in sociology, if applied to people, this would also be called stereotyping).
- I.e., data mining often assumes that database data is evidential or indicative of future events.
- It may produce rules that are contradicted by some of the data (i.e., it may produce unsound answers).
- In summary - it works by induction.
This means that data mining cannot be relied upon to give (deductively) correct answers. Typical output produced by data mining includes:
- Customers who live in postcode 5098 and who have less than $1,000 in their account are likely to default on their mortgage payments.
- Partners of male patients who smoke are at a higher risk of breast cancer.
- There is a lower incidence of hip fractures amongst women on HRT.
- There is a correlation between schizophrenia and people born in March in the Northern Hemisphere.
- People who live close to the coast tend to pay their rates late.
- People who use BPay tend to pay rates in four installments rather than in one payment.
- Businesses located close to environmentally sensitive areas are generally SMEs.
- Wet or hot weather tends to cause an increase in sick days.
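Rules of this shape are typically scored by their support and confidence. A minimal sketch, using invented toy records for the BPay example above (the field names and counts are illustrative):

```python
# Toy ratepayer records; score the rule {uses_bpay} -> {pays_in_installments}.
records = [
    {"uses_bpay", "pays_in_installments"},
    {"uses_bpay", "pays_in_installments"},
    {"uses_bpay"},                 # contradicts the rule
    {"pays_in_installments"},
    {"single_payment"},
]

def support(itemset):
    # Fraction of records containing every item in the itemset.
    return sum(itemset <= r for r in records) / len(records)

def confidence(antecedent, consequent):
    # Of the records matching the antecedent, the fraction also
    # matching the consequent.
    return support(antecedent | consequent) / support(antecedent)

print(support({"uses_bpay", "pays_in_installments"}))       # 0.4
print(confidence({"uses_bpay"}, {"pays_in_installments"}))  # ~0.667
```

Note that the third record contradicts the rule, yet the rule may still be reported - this is the "unsound answers" point above.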
Discrete Values. - All values of the attribute are regarded as equally distant from each other; for example, values in attributes such as surname, Tax File Number, ISBNs, etc.
Continuously Varying. - The distances between attribute values are regarded as proportional to their numerical difference. Note this variation may be linear (eg. temperature) or non-linear, perhaps even non-Euclidean, (eg. epochs, global distances).
Stepwise Varying. - The difference between values is related to the category to which an attribute value is assigned, and the distance between attribute values is determined by these categories, (eg. age ranges, income brackets).
Categorical. - Objects are assigned to categories that themselves have little or no direct relationship with those in other categories; however, a distance (even if zero) can be assigned within a category, (eg. viruses (human or computer), software upgrades).
Hierarchical/Graph-based. - The differences between values are related to their positions in a hierarchy or graph of concepts, (eg. plant taxonomy systems, suburb-city-state-country).
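The first three attribute types can be illustrated with simple distance functions; the values and income brackets below are invented for the sketch:

```python
# Illustrative distance functions for the attribute types above.

def discrete_distance(a, b):
    # Discrete values: all distinct values equally distant
    # (e.g. surnames, Tax File Numbers).
    return 0 if a == b else 1

def continuous_distance(a, b):
    # Continuously varying: proportional to numerical difference
    # (e.g. temperature; assumes the linear case).
    return abs(a - b)

def stepwise_distance(a, b, brackets):
    # Stepwise varying: distance determined by category membership
    # (e.g. income brackets; bracket bounds are invented).
    def bracket(x):
        return sum(x >= lower for lower in brackets)
    return abs(bracket(a) - bracket(b))

income_brackets = [0, 20_000, 50_000, 100_000]
print(discrete_distance("Smith", "Jones"))                 # 1
print(continuous_distance(15.0, 21.5))                     # 6.5
print(stepwise_distance(25_000, 30_000, income_brackets))  # 0 (same bracket)
print(stepwise_distance(25_000, 60_000, income_brackets))  # 1
```

A hierarchical/graph-based distance would instead count edges on the path between two nodes in the concept hierarchy.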
Imperfection can be viewed in two ways: at the information level and at the data level. Information-level imperfection can occur because of:
- Human inaccuracies in collection (wrong cohort, timing, location, preprocessing),
- Computer errors (wrong dataset being used, errors in algorithm, etc.)
- Deliberate misinformation.
Information level imperfection can only be handled to a limited extent.
Data-level imperfection can be divided into five categories (à la Parsons 1996):
- Uncertainty.
- Imprecision.
- Lack of Granularity
- eg. "The temperature in SA on Monday was 15°".
- Vagueness
- the fuzziness implied by terms such as `young' or `short',
- Incompleteness
- Some information held but there is a lack of all relevant information.
- Inconsistency
- contradictory information.
Note that only the last case requires reference between data items. Data-level imperfection can be (but often is not) manageable.
When writing algorithms, you can (sometimes inadvertently) include constraints (aka features!).
- Non-optimal Solutions
- Some routines will find a solution, but it might be non-optimal (a local minimum), eg. k-means.
- Order Dependence
- Apriori, for example, is dependent on the order in which items are presented to the algorithm.
- Instability
- A very small change in the data or parameters given to some algorithms will markedly change the solution found. This is especially the case with DTI.
- Failure to be discriminating
- Some routines can turn large amounts of data into large amounts of useless rules. Support and Confidence are often not the best metrics when evaluating association rules.
- Time and Space Complexity
- As the volume of data increases, algorithms that are polynomial or worse will suffer. Parallel and/or distributed algorithms may help here.
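The non-optimal-solution and instability points can both be seen with a tiny one-dimensional k-means: on the same data, two different initial centres converge to two different stable solutions. This is an illustrative sketch (data and initialisations are invented), not a production implementation:

```python
# A tiny 1-D k-means (k=2) showing sensitivity to the initial centres.
def kmeans_1d(points, centres, iters=20):
    for _ in range(iters):
        clusters = [[], []]
        for p in points:
            # Assign each point to its nearest centre.
            nearest = min((abs(p - c), j) for j, c in enumerate(centres))[1]
            clusters[nearest].append(p)
        # Recompute each centre as its cluster's mean (keep it if empty).
        centres = [sum(c) / len(c) if c else centres[j]
                   for j, c in enumerate(clusters)]
    return sorted(centres)

data = [0.0, 1.0, 5.0, 6.0, 10.0, 11.0]

# Two initialisations, two different stable (locally optimal) answers:
print(kmeans_1d(data, [0.0, 1.0]))    # [0.5, 8.0]
print(kmeans_1d(data, [10.0, 11.0]))  # [3.0, 10.5]
```

Both answers are fixed points of the algorithm; neither run can tell you which (if either) is globally optimal, which is why k-means is usually restarted from several random initialisations.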