Approaches:
1. Data Cube Approach (OLAP approach)
2. Attribute-oriented induction approach
Data Cube Approach:
Performs the computation and stores the results in data cubes. It is based on
materialized views of the data, which have typically been precomputed in a
data warehouse. Aggregation is performed offline, before an OLAP or data
mining query is submitted for processing.
Advantages:
a). An efficient implementation of data generalization
b). Computation of various kinds of measures, e.g. count( ), sum( ),
average( ), max( )
c). Generalization and specialization can be performed on a data cube
by roll-up and drill-down
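The idea of precomputing measures and then rolling up can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fact rows, the city-to-country hierarchy, and the helper names `build_cuboid` and `roll_up` are hypothetical and do not come from any particular OLAP system.

```python
from collections import defaultdict

# Hypothetical base facts: (city, quarter, units_sold)
facts = [
    ("Milan", "Q1", 10),
    ("Rome", "Q1", 5),
    ("Milan", "Q2", 7),
    ("Lyon", "Q1", 4),
]

# Hypothetical concept hierarchy used for roll-up: city -> country
CITY_TO_COUNTRY = {"Milan": "Italy", "Rome": "Italy", "Lyon": "France"}


def build_cuboid(rows):
    """Offline step: precompute count and sum per (city, quarter) cell."""
    cells = defaultdict(lambda: {"count": 0, "sum": 0})
    for city, quarter, units in rows:
        cell = cells[(city, quarter)]
        cell["count"] += 1
        cell["sum"] += units
    return dict(cells)


def roll_up(cuboid, hierarchy):
    """Generalize the city dimension to country by merging stored cells."""
    rolled = defaultdict(lambda: {"count": 0, "sum": 0})
    for (city, quarter), cell in cuboid.items():
        target = rolled[(hierarchy[city], quarter)]
        target["count"] += cell["count"]
        target["sum"] += cell["sum"]
    return dict(rolled)


by_city = build_cuboid(facts)                    # materialized ahead of any query
by_country = roll_up(by_city, CITY_TO_COUNTRY)   # answered from stored cells only

# average( ) is derived from the stored count( ) and sum( ) measures
italy_q1 = by_country[("Italy", "Q1")]
average = italy_q1["sum"] / italy_q1["count"]
```

Note that the roll-up never touches the base facts again; it reuses the precomputed cells, which is where the efficiency of the approach comes from.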
Disadvantages:
a). Handles only dimensions of simple non-numeric data and measures
of simple aggregated numeric values.
b). Lacks intelligent analysis: it cannot tell which dimensions should be
used or what level the generalization should reach.
Attribute-Oriented Induction Approach:
General idea:
1. Collect the task-relevant data using a database query.
2. Perform generalization by examining the number of distinct values
of each attribute in the relevant data set. The generalization is
performed by either attribute removal or attribute generalization.
3. Apply aggregation by merging identical, generalized tuples and
accumulating their respective counts. This reduces the size of the
generalized data set.
4. The resulting generalized relation can be mapped into different
forms (e.g. charts or rules) for presentation to the user.
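The four steps above can be sketched as follows. This is a minimal illustration, not a full implementation: the student rows, the `CITY_TO_COUNTRY` hierarchy, and the age bins in `age_to_range` are all hypothetical, and the choice of which attribute to remove or generalize is hard-coded rather than derived from thresholds.

```python
from collections import Counter

# Hypothetical concept hierarchy: low-level value -> higher-level concept
CITY_TO_COUNTRY = {"Milan": "Italy", "Rome": "Italy", "Lyon": "France"}


def age_to_range(age):
    """Generalize a numeric age into a coarse range (hypothetical bins)."""
    return "20-29" if age < 30 else "30+"


# Step 1: task-relevant data, as if returned by a database query.
# Columns: (name, city, age)
students = [
    ("Anna", "Milan", 23),
    ("Luca", "Rome", 25),
    ("Marie", "Lyon", 31),
    ("Paolo", "Milan", 34),
]


def generalize(row):
    """Step 2: drop 'name' (attribute removal) and lift the other attributes."""
    _name, city, age = row
    return (CITY_TO_COUNTRY[city], age_to_range(age))


# Step 3: merge identical generalized tuples and accumulate their counts.
generalized = Counter(generalize(row) for row in students)

# Step 4: present the generalized relation, here as plain text rows.
for (country, age_range), count in sorted(generalized.items()):
    print(f"country={country}, age={age_range}, count={count}")
```

Four input tuples collapse into three generalized tuples, each carrying a count, which is the size reduction mentioned in step 3.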
Basic principles of attribute-oriented induction:
Data focusing: select the task-relevant data, including dimensions; the
result is the initial relation.
Attribute removal: remove attribute A if there is a large set of distinct
values for A but (1) there is no generalization operator on A, or (2) A’s
higher level concepts are expressed in terms of other attributes.
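The attribute removal rule can be written as a small decision function. This is a sketch under assumptions: the function name `choose_action`, the `threshold` parameter that defines "a large set of distinct values", and the "generalize" fallback for attributes that do have a usable hierarchy are illustrative choices, not part of the rule as stated.

```python
def choose_action(values, threshold, has_hierarchy, covered_by_other=False):
    """Decide how to treat one attribute during induction.

    values           -- the attribute's values in the initial relation
    threshold        -- hypothetical limit on the number of distinct values
    has_hierarchy    -- True if a generalization operator exists for it
    covered_by_other -- True if its higher-level concepts are already
                        expressed by other attributes
    """
    distinct = len(set(values))
    if distinct <= threshold:
        return "keep"        # few distinct values: nothing to do
    if not has_hierarchy or covered_by_other:
        return "remove"      # conditions (1) and (2) of the removal rule
    return "generalize"      # assumed fallback: climb the concept hierarchy


# Hypothetical attributes of a small relation
print(choose_action(["Anna", "Luca", "Marie", "Paolo"], 2, has_hierarchy=False))
print(choose_action(["Milan", "Rome", "Lyon"], 2, has_hierarchy=True))
print(choose_action(["M", "F", "M"], 2, has_hierarchy=False))
```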
Why is attribute relevance analysis required?
- The first limitation of class characterization for multidimensional
data analysis in data warehouses and OLAP tools is the handling
of complex objects.
- The second limitation is the lack of an automated generalization
process.
- A user may include too few attributes in the analysis, causing the
resulting mined descriptions to be incomplete.
- A user may include too many attributes for analysis, increasing the
complexity of descriptions.
- It is non-trivial for users to determine which dimensions should be
included in the analysis of class characteristics. Data relations
often contain 50 to 100 attributes, and a user may have little
knowledge of which attributes should be selected for effective
data mining.
Steps of attribute relevance analysis:
1. Data collection: collect data for both classes, i.e. the target class
and the contrasting class.
2. Preliminary relevance analysis using conservative attribute-oriented
induction: the induction should employ attribute thresholds that are
reasonably large, so that more attributes are retained for the
relevance evaluation that follows.
3. Evaluate each remaining attribute using the selected relevance
measure: attributes are then sorted by rank according to their relevance.
4. Remove irrelevant and weakly relevant attributes: this step results
in an initial target class working relation and an initial contrasting
class working relation.
5. Generate the concept description using attribute-oriented induction.
Relevance Measures:
- The general idea behind attribute relevance analysis is to compute
some measure which is used to quantify the relevance of an
attribute with respect to a given class or concept.
Methods:
Information-Theoretic Approach:
- Decision tree:
● each internal node tests an attribute
● each branch corresponds to an attribute value
● each leaf node assigns a classification
- ID3 algorithm:
● build decision tree based on training objects with known
class labels to classify testing objects
● rank attributes with information gain measure
● minimal height: the least number of tests to classify an object
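Ranking attributes by information gain can be illustrated with a short Python sketch. The toy rows and the attribute names (`major`, `gender`, `class`) are hypothetical; the entropy and gain computations follow the standard definitions, with gain taken as the class entropy minus the weighted entropy of the partitions induced by the attribute.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Expected information (in bits) needed to classify an object."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, label="class"):
    """Entropy of the class labels minus the weighted entropy after a split."""
    labels = [row[label] for row in rows]
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[label])
    remainder = sum(
        len(part) / len(rows) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


# Hypothetical training objects with known class labels
rows = [
    {"major": "science", "gender": "M", "class": "graduate"},
    {"major": "science", "gender": "F", "class": "graduate"},
    {"major": "arts", "gender": "M", "class": "undergraduate"},
    {"major": "arts", "gender": "F", "class": "undergraduate"},
]

# Rank the attributes from most to least relevant
ranked = sorted(
    ["major", "gender"],
    key=lambda attr: information_gain(rows, attr),
    reverse=True,
)
print(ranked)
```

In this toy data `major` separates the two classes perfectly while `gender` carries no information about the class, so `major` is ranked first and would be tested at the root of the tree.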
Association rule mining:
Given:
- database of transactions
- each transaction is a list of items purchased by a customer in a
visit
Find:
- all rules that correlate the presence of one set of items with that of
another set of items
- e.g. 98% of the people who purchased tires and auto accessories
also get automotive services done
- rule form: A => B [s, c], where s is the support and c is the confidence
of the rule
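Reading s as support and c as confidence, which is the usual interpretation of this notation, both values can be computed directly from a transaction database. The sketch below uses a hypothetical four-transaction database and hypothetical helper names; support is the fraction of transactions containing all items of an itemset, and confidence is the fraction of transactions containing A that also contain B.

```python
# Hypothetical transaction database: one set of items per customer visit
transactions = [
    {"tires", "auto accessories", "automotive services"},
    {"tires", "auto accessories", "automotive services"},
    {"tires", "auto accessories"},
    {"automotive services"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing A that also contain B."""
    return support(antecedent | consequent, transactions) / support(
        antecedent, transactions
    )


A = {"tires", "auto accessories"}
B = {"automotive services"}

s = support(A | B, transactions)
c = confidence(A, B, transactions)
print(f"A => B [s={s:.2f}, c={c:.2f}]")
```

Here two of the four transactions contain both A and B, and three contain A, so the rule holds with support 0.5 and confidence 2/3.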
- Often, users have a good sense of which “direction” of mining may lead
to interesting patterns and the form of “patterns” or rules they want to
find.
- Each user will have a data mining task in mind, i.e. some form of data
analysis that he or she would like to have performed.
- A data mining task can be specified in the form of a data mining query,
which is input to the data mining system.