[UNIT - 2 (except Apriori algorithm)]

1. Data Generalization and Summarization based characterization:

Data generalization is a process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.

In general, data generalization summarizes data by replacing relatively low-level values (e.g. numeric values for an attribute age) with higher-level concepts (e.g. young, middle-aged, and senior), or by reducing the number of dimensions to summarize data in a concept space involving fewer dimensions (e.g. removing birth_date and telephone_number when summarizing the behaviour of a group of students).
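
As a minimal sketch of this kind of replacement (the cut-off ages 30 and 55 are illustrative assumptions, not fixed by the text):

```python
def generalize_age(age: int) -> str:
    """Map a low-level numeric age to a higher-level concept."""
    if age < 30:
        return "young"
    elif age < 55:
        return "middle_aged"
    return "senior"

print([generalize_age(a) for a in (19, 34, 62)])  # ['young', 'middle_aged', 'senior']
```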

Concept description: a form of data generalization. Here, a concept refers to a data collection such as frequent_buyers, graduate_students, and so on. It is not a simple enumeration of the data; instead, concept description generates descriptions for data characterization and comparison.
- Characterization: provides a concise and succinct summarization of the given collection of data.
- Comparison: provides descriptions comparing two or more collections of data.

Approaches:
1. Data Cube Approach (OLAP approach)
2. Attribute-oriented induction approach
Data Cube Approach:
Performs computation and stores results in data cubes. It is based on materialized views of the data, which typically have been precomputed in a data warehouse. It performs offline aggregation before an OLAP or data mining query is submitted for processing.

Advantages​:
a). An efficient implementation of data generalization
b). Computation of various kinds of measures, e.g. count(), sum(), average(), max()
c). Generalization and specialization can be performed on a data cube
by roll-up and drill-down

Disadvantages​:
a). Handles only dimensions of simple non-numeric data and measures
of simple aggregated numeric values.
b). Lacks intelligent analysis: it cannot tell which dimensions should be used or what level the generalization should reach.
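
To make roll-up concrete, here is a small sketch using pandas (an assumed tool; the cube data is invented for illustration). Rolling up generalizes away a dimension by re-aggregating; drilling down goes the other way, to finer levels:

```python
import pandas as pd

# A tiny materialized "cube": sales pre-aggregated by (city, item, quarter).
cube = pd.DataFrame({
    "city":    ["Vancouver", "Vancouver", "Toronto", "Toronto"],
    "item":    ["phone", "tv", "phone", "tv"],
    "quarter": ["Q1", "Q1", "Q1", "Q1"],
    "sales":   [100, 150, 80, 120],
})

# Roll-up: drop the 'city' dimension and re-aggregate at the coarser level.
rolled_up = cube.groupby(["item", "quarter"], as_index=False)["sales"].sum()
print(rolled_up)  # two rows: total sales per (item, quarter)
```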

Attribute-oriented induction approach:

This is basically a query-oriented, generalization-based, online data analysis technique.

General idea:
1. Collect the task-relevant data using a database query.
2. Perform generalization based on the examination of the number of
each attribute’s distinct values in the relevant data set. The
generalization is performed by either attribute removal or attribute
generalization.
3. Apply aggregation by merging identical, generalized tuples and
accumulating their respective counts. This reduces the size of the
generalized data set.
4. The resulting generalized relation can be mapped into different
forms (e.g. charts or rules) for presentation to the user.
Basic principles of attribute-oriented induction:

Data focusing: analyze the task-relevant data, including dimensions; the result is the initial relation.

Attribute removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher-level concepts are expressed in terms of other attributes.

Attribute generalization: if there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.

Attribute-threshold control: the threshold typically ranges from 2 to 8, and experts and users should be able to modify the threshold values.

Generalized relation threshold control: controls the size of the final relation/rule.
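
A minimal sketch of these steps in Python (the tuples, the generalization operator for age, and the threshold are invented for illustration):

```python
from collections import Counter

# Step 1: task-relevant data collected by a query (invented student tuples).
rows = [
    {"name": "Ann", "age": 21, "major": "CS"},
    {"name": "Bob", "age": 24, "major": "CS"},
    {"name": "Cal", "age": 35, "major": "EE"},
]

THRESHOLD = 2  # attribute-threshold control: max distinct values allowed per attribute

def generalize_age(age):
    # Assumed generalization operator: climb the concept hierarchy for 'age'.
    return "young" if age < 30 else "middle_aged"

generalized = []
for row in rows:
    t = dict(row)
    # Attribute removal: 'name' has many distinct values and no generalization operator.
    del t["name"]
    # Attribute generalization: 'age' exceeds the threshold but has an operator.
    if len({r["age"] for r in rows}) > THRESHOLD:
        t["age"] = generalize_age(t["age"])
    generalized.append(tuple(sorted(t.items())))

# Step 3: merge identical generalized tuples and accumulate their counts.
for tup, count in Counter(generalized).items():
    print(dict(tup), "count =", count)
```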

2. Analytical Characterization and Analysis of attribute relevance:

Methods should be introduced to perform attribute (or dimension) relevance analysis in order to:
- filter out statistically irrelevant or weakly relevant attributes
- retain or rank the most relevant attributes for the descriptive mining
task

Class characterization which includes the analysis of attribute/dimension relevance is called analytical characterization.

Why is it required?
- The first limitation of class characterization for multidimensional
data analysis in data warehouses and OLAP tools is the handling
of complex objects.
- The second limitation is the lack of an automated generalization
process.
- A user may include too few attributes in the analysis, causing the resulting mined descriptions to be incomplete and not comprehensive.
- A user may include too many attributes for analysis, increasing the
complexity of descriptions.
- It is non-trivial for users to determine which dimensions should be included in the analysis of class characteristics. Data relations often contain 50 to 100 attributes, and a user may have little knowledge regarding which attributes should be selected for effective data mining.

Attribute Relevance Criteria:

- An attribute is considered highly relevant with respect to a given
class if it is likely that the values of the attribute may be used to
distinguish the class from others.
- Even with the same dimension, different levels of concepts may
have dramatically different powers for distinguishing a class from
others.

This implies that the analysis of dimension relevance should be performed at multiple levels of abstraction, and only the most relevant levels of a dimension should be included in the analysis.

Steps for Attribute Relevance Analysis:

1. Data collection:​ Collect data for both classes i.e. target and
contrasting.
2. Preliminary Relevance Analysis using conservative Attribute
Oriented Induction:​ ​Attribute Oriented Induction should employ
Attribute Thresholds that are reasonably large.
3. Evaluate relevance using the selected measure: Attributes are sorted by rank according to their relevance.
4. Remove irrelevant and weakly relevant attributes: This step results in an initial target-class working relation and an initial contrasting-class working relation.
5. Generate concept descriptions using Attribute-Oriented Induction.

Relevance Measures:
- The general idea behind attribute relevance analysis is to compute
some measure which is used to quantify the relevance of an
attribute with respect to a given class or concept.

- Quantitative relevance measures determine the classifying power of an attribute within a set of data.

Methods​:

Information-Theoretic Approach:
- Decision tree:
● each internal node tests an attribute
● each branch corresponds to an attribute value
● each leaf node assigns a classification

- ID3 algorithm:
● build decision tree based on training objects with known
class labels to classify testing objects
● rank attributes with information gain measure
● minimal height: the least number of tests to classify an object

Entropy and Information Gain:
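
In the ID3 setting, with s training samples, m classes C_1, ..., C_m, s_i samples in class C_i, and an attribute A taking v distinct values (with s_ij samples of class C_i having value a_j for A), the standard definitions are:

$$I(s_1, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i, \qquad p_i = \frac{s_i}{s}$$

$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \; I(s_{1j}, \ldots, s_{mj})$$

$$\mathrm{Gain}(A) = I(s_1, \ldots, s_m) - E(A)$$

The attribute with the highest information gain (the greatest entropy reduction) is the most relevant. A minimal Python sketch of this computation, on invented toy data:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """I(s1, ..., sm): expected information needed to classify a sample."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, labels):
    """Gain(attr) = I(labels) - E(attr): entropy reduction from splitting on attr."""
    n = len(rows)
    e_attr = 0.0
    for value in {r[attr] for r in rows}:
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        e_attr += (len(subset) / n) * entropy(subset)
    return entropy(labels) - e_attr

# Toy data (invented): 'major' perfectly separates the two classes, so its gain is 1 bit.
rows = [{"major": "CS"}, {"major": "CS"}, {"major": "EE"}, {"major": "EE"}]
labels = ["target", "target", "contrast", "contrast"]
print(info_gain(rows, "major", labels))  # 1.0
```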


3. Association Rule Mining:

It is the process of finding frequent patterns, associations, correlations, or causal structures among sets of items in transaction databases, and of understanding customer buying habits by finding associations and correlations between the different data items that customers place in their “shopping basket”.

Applications: basket data analysis, cross-marketing, catalog design, web log analysis, fraud detection.

Given:
- database of transactions
- each transaction is a list of items purchased by a customer in a
visit

Find:
- all rules that correlate the presence of one set of items with that of
another set of items
- e.g. 98% of the people who purchased tires and auto accessories
also get automotive services done

Basic rule measures: Support and Confidence

A => B [s, c]

Support: denotes the frequency of the rule within transactions. A high value means that the rule involves a large part of the database.

support (A => B [s, c]) = P(A ∪ B), i.e. the probability that a transaction contains both A and B

Confidence: denotes the percentage of transactions containing A that also contain B.

confidence (A => B [s, c]) = P(B | A) = support(A ∪ B) / support(A)
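
A minimal sketch computing both measures over a toy transaction database (the baskets are invented for illustration):

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset: P(A ∪ B)."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """P(B | A) = support(A ∪ B) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

baskets = [
    {"tires", "auto_accessories", "automotive_services"},
    {"tires", "auto_accessories", "automotive_services"},
    {"tires", "milk"},
    {"bread"},
]
# Rule: {tires, auto_accessories} => {automotive_services}
print(support(baskets, {"tires", "auto_accessories", "automotive_services"}))      # 0.5
print(confidence(baskets, {"tires", "auto_accessories"}, {"automotive_services"}))  # 1.0
```
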
4. Constraint-based association mining:

- A data mining process may uncover thousands of rules from a given dataset, most of which end up being unrelated or uninteresting to users.

- Often, users have a good sense of which “direction” of mining may lead
to interesting patterns and the form of “patterns” or rules they want to
find.

- A good heuristic is to have the users specify such intuitions or expectations as constraints to confine the search space. This strategy is known as constraint-based mining.

- The constraints can include the following:
● Knowledge type constraints​: These specify the type of knowledge
to be mined, such as association, correlation, classification or
clustering.
● Data constraints​: These specify the set of task-relevant data.
● Dimension/level constraints​: These specify the desired dimensions
(or attributes) of data, the abstraction levels, or the level of concept
hierarchies to be used in mining.
● Interestingness constraints​: These specify thresholds on statistical
measures of rule interestingness such as support, confidence and
correlation.
● Rule constraints: These specify the form of, or conditions on, the rules to be mined.

These constraints can be specified using a high-level declarative data mining query language and user interface.
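
As an illustrative sketch only (not a standard query-language syntax; all names and values are invented), the following shows how interestingness and rule constraints might confine a set of candidate rules:

```python
# Candidate rules as (antecedent, consequent, support, confidence).
rules = [
    ({"tires", "auto_accessories"}, {"automotive_services"}, 0.50, 0.98),
    ({"milk"}, {"bread"}, 0.02, 0.40),
    ({"tv"}, {"phone"}, 0.10, 0.95),
]

MIN_SUPPORT, MIN_CONFIDENCE = 0.05, 0.90   # interestingness constraints
REQUIRED_ITEM = "tires"                    # rule constraint: antecedent must mention tires

for ante, cons, sup, conf in rules:
    if sup >= MIN_SUPPORT and conf >= MIN_CONFIDENCE and REQUIRED_ITEM in ante:
        print(ante, "=>", cons, f"[support={sup}, confidence={conf}]")
```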

5. Data mining primitives:

- Each user will have a data mining task in mind, i.e. some form of data
analysis that he or she would like to have performed.

- A data mining task can be specified in the form of a data mining query,
which is input to the data mining system.

- A data mining query is defined in terms of data mining task primitives. These primitives allow the user to interactively communicate with the data mining system during discovery in order to direct the mining process, or to examine the findings from different angles or depths.

- The data mining primitives specify the following:
● The set of task-relevant data to be mined: This specifies the portion of the database or the set of data in which the user is interested. This includes the database attributes.
● The kind of knowledge to be mined​: This specifies the data-mining
functions to be performed, such as characterization, discrimination,
association or correlation analysis, classification, prediction,
clustering, etc.
● The background knowledge to be used in the discovery process​:
This is useful for guiding the knowledge discovery process and for
evaluating the patterns found. ​Concept hierarchies ​are a popular
form of background knowledge, which allow data to be mined at
multiple levels of abstraction.
● The interestingness measures and thresholds for pattern
evaluation​: They may be used to guide the mining process or, after
discovery, to evaluate the discovered patterns. For example,
interestingness measures for association rules include support and
confidence.
● The expected representation for visualizing the discovered
patterns​: This refers to the form in which the discovered patterns
are to be displayed, which may include rules, tables, charts,
graphs, decision trees, and cubes.

A data mining query language can be designed to incorporate these primitives, allowing users to flexibly interact with data mining systems.
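
Purely for illustration (the field names are assumptions, not a standard API), the five primitives could be bundled into a single query specification like this:

```python
mining_query = {
    # 1. Task-relevant data: the portion of the database to mine.
    "data": {"relation": "customers", "attributes": ["age", "income", "purchases"]},
    # 2. Kind of knowledge: the data mining function to perform.
    "knowledge_type": "association",
    # 3. Background knowledge: e.g. a concept hierarchy for 'age'.
    "concept_hierarchies": {"age": ["value", "young/middle_aged/senior", "all"]},
    # 4. Interestingness measures and thresholds for pattern evaluation.
    "thresholds": {"support": 0.05, "confidence": 0.70},
    # 5. Expected representation of the discovered patterns.
    "presentation": "rules",
}
```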
