Outline
DATA MINING
1. OVERVIEW OF DATA MINING
Data Mining as a Part of the Knowledge Discovery Process.
Knowledge Discovery in Databases, abbreviated as KDD,
encompasses more than data mining.
Example
Consider a transaction database maintained by a specialty
consumer goods retailer. Suppose the client data includes a
customer name, zip code, phone number, date of purchase, item
code, price, quantity, and total amount.
A variety of new knowledge can be discovered by KDD
processing on this client database.
During data selection, data about specific items or categories of
items, or from stores in a specific region or area of the country,
may be selected.
The data cleansing process then may correct invalid zip codes or
eliminate records with incorrect phone prefixes. Enrichment
enhances the data with additional sources of information. For
example, given the client names and phone numbers, the store
may purchase other data about age, income, and credit rating
and append them to each record.
Data transformation and encoding may be done to reduce the
amount of data.
Example (cont.)
The result of mining may be to discover the
following types of new information:
Association rules: e.g., whenever a customer buys video
equipment, he or she also buys another electronic gadget.
Sequential patterns: e.g., suppose a customer buys a camera,
and within three months he or she buys photographic supplies;
then within six months he or she is likely to buy an accessory item.
This defines a sequential pattern of transactions. A customer who
buys more than twice in the regular periods may be likely to buy at
least once during the Christmas period.
Classification trees: e.g., customers may be classified by
frequency of visits, by types of financing used, by amount of
purchase, or by affinity for types of items, and some revealing
statistics may be generated for such classes.
We can see that many possibilities exist for discovering new
knowledge about buying patterns, relating factors such as age,
income group, place of residence, to what and how much the
customers purchase.
This information can then be utilized
to plan additional store locations based on demographics
or to plan marketing strategies.
As this retail store example shows, data mining must be
preceded by significant data preparation before it can yield useful
information that can directly influence business decisions.
The results of data mining may be reported in a variety of
formats, such as listings, graphic outputs, summary tables, or
visualization.
Goals of Data Mining and
Knowledge Discovery
Data mining is carried out with some end goals.
These goals fall into the following classes:
Prediction: Data mining can show how certain attributes
within the data will behave in the future.
Identification: Data patterns can be used to identify the
existence of an item, an event, or an activity.
Classification: Data mining can partition the data so that
different classes or categories can be identified based on
combinations of parameters.
Data Mining: On What Kind
of Data?
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
World Wide Web
Types of Knowledge
Discovered During Data
Mining.
Data mining addresses inductive knowledge, which
discovers new rules and patterns from the supplied data.
Knowledge can be represented in many forms: In an
unstructured sense, it can be represented by rules. In a
structured form, it may be represented in decision trees,
semantic networks, or hierarchies of classes or frames.
It is common to describe the knowledge discovered
during data mining in five ways:
Association rules: These rules correlate the presence of a set
of items with another range of values for another set of variables.
Types of Knowledge Discovered
(cont.)
Classification hierarchies: The goal is to work from an
existing set of events or transactions to create a hierarchy
of classes.
Patterns within time series
Sequential patterns: A sequence of actions or events is
sought. Detection of sequential patterns is equivalent to
detecting associations among events with certain temporal
relationships.
Clustering: A given population of events can be
partitioned into sets of similar elements.
Main functional phases of the KD process
Learning the application domain:
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation:
Find useful features, dimensionality/variable reduction, invariant
representation.
Choosing functions of data mining:
summarization, classification, regression, association, clustering.
[Figure: Main phases of data mining. Labels shown: Pattern Evaluation/Presentation, Task-relevant Data, Data Cleaning, Data Integration, Data Sources.]
2. ASSOCIATION RULES
What Is Association Rule
Mining?
Association rule mining is finding frequent patterns,
associations, correlations, or causal structures
among sets of items or objects in transaction
databases, relational databases, and other
information repositories.
Applications:
Basket data analysis,
cross-marketing,
catalog design,
clustering, classification, etc.
Rule form: Body → Head [support, confidence].
Association rule mining
Examples.
buys(x, diapers) → buys(x, beers) [0.5%, 60%]
major(x, CS) ∧ takes(x, DB) → grade(x, A) [1%, 75%]
That is, for a rule A → B:
support, s: probability that a transaction contains both A and B,
s = P(A ∪ B)
confidence, c: conditional probability that a transaction
having A also contains B,
c = P(B | A) = P(A ∪ B) / P(A).
Rules that satisfy both a minimum support threshold
(min_sup) and a minimum confidence threshold
(min_conf) are called strong.
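The two measures can be estimated directly from a list of transactions. The following is a minimal sketch, not taken from the slides: the basket data, the function names `support` and `confidence`, and the thresholds 0.4 and 0.6 are all assumptions chosen for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, body, head):
    """Estimate P(head | body) as support(body and head) / support(body)."""
    both = support(transactions, set(body) | set(head))
    return both / support(transactions, body)


# Hypothetical toy basket data (an assumption, not from the slides).
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]

s = support(baskets, {"diapers", "beer"})       # 2 of the 4 baskets
c = confidence(baskets, {"diapers"}, {"beer"})  # 2 of the 3 baskets with diapers

# Arbitrary illustrative thresholds: min_sup = 0.4, min_conf = 0.6.
is_strong = s >= 0.4 and c >= 0.6
```

Note that confidence is conditioned on the rule body, so `confidence(baskets, {"beer"}, {"diapers"})` generally gives a different value for the reversed rule.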
Frequent item set
A set of items is referred to as an itemset. An itemset that
contains k items is a k-itemset. The occurrence frequency
of an itemset is the number of transactions that contain
the itemset.
An itemset satisfies minimum support if the occurrence
frequency of the itemset is greater than or equal to the
product of min_sup and the total number of transactions in
D. The number of transactions required for the itemset to
satisfy minimum support is referred to as the minimum
support count.
If an itemset satisfies minimum support, then it is a
frequent itemset. The set of frequent k-itemsets is
commonly denoted by Lk.
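As a rough illustration of these definitions, the sketch below finds the frequent k-itemsets of a small database by brute-force enumeration. It is not the Apriori algorithm, only a direct reading of the definition. The database `D`, the function name `frequent_itemsets`, and the value min_sup = 0.5 are assumptions.

```python
from itertools import combinations
from math import ceil


def frequent_itemsets(transactions, min_sup, k):
    """Return {k-itemset: occurrence frequency} for all frequent k-itemsets."""
    # Minimum support count: transactions needed to satisfy min_sup.
    min_count = ceil(min_sup * len(transactions))
    items = sorted(set().union(*transactions))
    result = {}
    for candidate in combinations(items, k):
        # Occurrence frequency: number of transactions containing the itemset.
        count = sum(1 for t in transactions if set(candidate) <= t)
        if count >= min_count:
            result[candidate] = count
    return result


# Hypothetical transaction database D (an assumption, not from the slides).
D = [{1, 2, 5}, {2, 4}, {1, 2, 4}, {1, 3}]

L1 = frequent_itemsets(D, 0.5, 1)  # frequent 1-itemsets
L2 = frequent_itemsets(D, 0.5, 2)  # frequent 2-itemsets
```

With four transactions and min_sup = 0.5, the minimum support count is 2, so an itemset must appear in at least two transactions to be frequent.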
Data Mining Association
Rules
Bar-code technology made it possible for retail
organizations to collect and store massive amounts
of sales data, referred to as the basket data.
Successful organizations view such databases as
important pieces of the marketing infrastructure.
Organizations are interested in instituting
information-driven marketing processes, managed
by database technology, that enable marketers to
develop and implement customized marketing
programs and strategies.
Example 2.3. From Example 2.2, suppose the data contain
the frequent itemset l = {I1, I2, I5}. The nonempty proper
subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, and {I5}.
Each such subset s gives a candidate association rule s → (l − s).
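The candidate rules for this itemset can be enumerated mechanically: each nonempty proper subset becomes a rule body, and the remaining items become the head. The sketch below shows only the rule forms. Keeping just the strong rules would need the support counts from Example 2.2, which are not reproduced here. The function name `candidate_rules` is an assumption.

```python
from itertools import combinations


def candidate_rules(itemset):
    """Yield (body, head) pairs, one per nonempty proper subset of `itemset`."""
    items = sorted(itemset)
    for size in range(1, len(items)):
        for body in combinations(items, size):
            head = tuple(i for i in items if i not in body)
            yield body, head


rules = list(candidate_rules({"I1", "I2", "I5"}))
for body, head in rules:
    print(" AND ".join(body), "=>", " AND ".join(head))
```

A frequent itemset of n items yields 2^n - 2 candidate rules, which is six for this three-item example.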
3. CLASSIFICATION
Classification is the process of learning a model that
describes different classes of data. The classes are
predetermined.
Example: In a banking application, customers who apply
for a credit card may be classified as a good risk, a fair
risk, or a poor risk. Because the classes are known in advance,
this type of activity is also called supervised learning.
Once the model is built, then it can be used to classify
new data.
The first step, of learning the model, is accomplished by using a
training set of data that has already been classified. Each record
in the training data contains an attribute, called the class label,
that indicates which class the record belongs to.
The model that is produced is usually in the form of a decision
tree or a set of rules.
Some of the important issues with regard to the model and the
algorithm that produces the model include:
the model's ability to predict the correct class of the new data.
Example 3.1
Example 3.1: Suppose that we have a database of customers
on the AllElectronics mailing list. The database describes
attributes of the customers, such as their name, age, income,
occupation, and credit rating. The customers can be classified
as to whether or not they have purchased a computer at
AllElectronics.
Suppose that new customers are added to the database and
that you would like to notify these customers of an upcoming
computer sale. Sending promotional literature to every new
customer in the database can be quite costly. A more cost-efficient
method would be to target only those new customers
who are likely to purchase a new computer. A classification
model can be constructed and used for this purpose.
Figure 2 shows a decision tree for the concept
buys_computer, indicating whether or not a customer at
AllElectronics is likely to purchase a computer.
Each internal node represents a test on an attribute.
Each leaf node represents a class.
Algorithm for decision tree
induction
Input: a set of training data records R1, R2, ..., Rm and a set of
attributes A1, A2, ..., An
Output: decision tree
Basic algorithm (a greedy algorithm)
- Tree is constructed in a top-down recursive divide-
and-conquer manner
- At start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they
are discretized in advance)
- Examples are partitioned recursively based on
selected attributes
- Test attributes are selected on the basis of a heuristic
or statistical measure (e.g., information gain)
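Information gain, the example measure named above, is commonly defined as follows. The notation is an assumption not found in the slides: D is the set of training examples at a node, p_i is the fraction of D in class i (out of c classes), and D_1, ..., D_v are the partitions of D induced by the v values of attribute A.

```latex
\mathrm{Info}(D) = -\sum_{i=1}^{c} p_i \log_2 p_i,
\qquad
\mathrm{Gain}(A) = \mathrm{Info}(D) - \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j)
```

Under this measure, the attribute with the largest gain is chosen as the test at the node.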
Conditions for stopping partitioning
- All samples for a given node belong to the same
class
- There are no remaining attributes for further
partitioning; majority voting is employed for
classifying the leaf
- There are no samples left.
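The greedy procedure and its stopping conditions can be sketched as a short recursive function. This is an ID3-style illustration under stated assumptions, not necessarily the exact algorithm the slides intend. The attribute values in `RAW` are assumed, chosen so that the last column matches the class column of the training table in these slides. The names `entropy`, `info_gain`, and `build_tree` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(rows, target):
    """Entropy of the class distribution in `rows`."""
    counts = Counter(r[target] for r in rows)
    return -sum(c / len(rows) * log2(c / len(rows)) for c in counts.values())


def info_gain(rows, attr, target):
    """Reduction in entropy obtained by partitioning `rows` on `attr`."""
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset, target)
    return entropy(rows, target) - remainder


def build_tree(rows, attrs, target):
    classes = [r[target] for r in rows]
    if len(set(classes)) == 1:   # stop: all samples belong to one class
        return classes[0]
    if not attrs:                # stop: no attributes left, use majority vote
        return Counter(classes).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    rest = [a for a in attrs if a != best]
    branches = {}
    # Only values present in `rows` get a branch, so the "no samples left"
    # case cannot arise in this sketch.
    for value in sorted(set(r[best] for r in rows)):
        subset = [r for r in rows if r[best] == value]
        branches[value] = build_tree(subset, rest, target)
    return {best: branches}


COLUMNS = ("age", "income", "student", "credit_rating", "buys_computer")
# Assumed attribute values; only the class column comes from the slides.
RAW = """
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no
"""
rows = [dict(zip(COLUMNS, line.split())) for line in RAW.strip().splitlines()]
tree = build_tree(rows, list(COLUMNS[:-1]), "buys_computer")
```

The tree is returned as nested dictionaries of the form `{attribute: {value: subtree_or_class}}`, a representation chosen here for brevity.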
Training data tuples from the
AllElectronics customer database
Class (buys_computer), tuples 1 to 14 in order:
No, No, Yes, Yes, Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, No
Extracting Classification Rules
from Trees
Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand.
Example
IF age = <=30 AND student = no THEN buys_computer = no
IF age = <=30 AND student = yes THEN buys_computer = yes
IF age = 31..40 THEN buys_computer = yes
IF age = >40 AND credit_rating = excellent THEN buys_computer = no
IF age = >40 AND credit_rating = fair THEN buys_computer = yes
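The path-to-rule conversion can be sketched as a recursive walk over the tree. The nested-dictionary representation and the function name `extract_rules` are assumptions. The tree literal is written to encode the five rules listed above.

```python
def extract_rules(tree, target, conditions=()):
    """Return one IF-THEN rule for each path from the root to a leaf."""
    if not isinstance(tree, dict):     # leaf node: holds the class prediction
        body = " AND ".join(f"{attr} = {value}" for attr, value in conditions)
        return [f"IF {body} THEN {target} = {tree}"]
    (attr, branches), = tree.items()   # internal node: a test on one attribute
    rules = []
    for value, subtree in branches.items():
        # Each attribute-value pair along the path extends the conjunction.
        rules += extract_rules(subtree, target, conditions + ((attr, value),))
    return rules


# Tree encoded as {attribute: {value: subtree_or_class}} (assumed representation).
tree = {
    "age": {
        "<=30": {"student": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

rules = extract_rules(tree, "buys_computer")
for rule in rules:
    print(rule)
```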
Neural Networks and
Classification
A neural network is a technique derived from AI that uses
generalized approximation and provides an iterative
method to carry it out. ANNs use the curve-fitting
approach to infer a function from a set of samples.
This technique provides a learning approach; it is
driven by a test sample that is used for the initial
inference and learning. With this kind of learning method,
responses to new inputs may be interpolated
from the known samples. This interpolation depends on
the model developed by the learning method.
ANN and classification
ANNs can be classified into two categories: supervised
and unsupervised networks. Adaptive methods that
attempt to reduce the output error are supervised
learning methods, whereas those that develop internal
representations without sample outputs are called
unsupervised learning methods.
ANNs can learn from information on a specific
problem. They perform well on classification tasks and
are therefore useful in data mining.
Information processing at a
neuron in an ANN
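In the usual textbook model, a neuron forms a weighted sum of its inputs, adds a bias, and passes the result through an activation function. The sketch below assumes a sigmoid activation. The function name `neuron_output` and all numeric values are hypothetical, since the figure itself is not reproduced here.

```python
from math import exp


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, squashed by a sigmoid."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + exp(-net))


# Hypothetical inputs and weights, chosen so the net input is zero.
y = neuron_output([1.0, 0.0, 1.0], [0.5, -0.3, 0.2], bias=-0.7)
```

Supervised training adjusts the weights and bias so that the output error on the training samples shrinks.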
4. Regression
Predictive data mining
Descriptive data mining
Regression
Regression function
Regression function: Y = f(X, β)
X: predictor/independent variables
Y: response/dependent variables
β: regression coefficients
X: used to explain the changes of response variable Y.
Y: used to describe the target phenomenon.
The relationship between Y and X can be represented
by the functional dependence of Y on X.
β describes the influence of X on Y.
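For one concrete case, assume the regression function is a straight line, f(X) = beta0 + beta1 * X, fitted by ordinary least squares. The slides do not fix the form of f, so the linear form, the function name `fit_line`, and the data points are all assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares estimates (beta0, beta1) for y = beta0 + beta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = sxy / sxx                 # slope: change in Y per unit change in X
    beta0 = mean_y - beta1 * mean_x   # intercept
    return beta0, beta1


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

beta0, beta1 = fit_line(xs, ys)
prediction = beta0 + beta1 * 5.0      # predicted response for a new X = 5
```

The fitted slope expresses how strongly the predictor influences the response, which is the role the slide assigns to the regression coefficients.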
Applications of regression
Data Mining
Preprocessing stage
Data Mining stage
Descriptive data mining
Predictive data mining
5. CLUSTERING
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters.
Cluster analysis
Grouping a set of data objects into clusters.
Clustering is unsupervised learning: no predefined
classes, no class-labeled training samples.
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
General Applications of
Clustering
Pattern Recognition
Spatial Data Analysis
create thematic maps in GIS by clustering feature spaces
detect spatial clusters and explain them in spatial data
mining
Image Processing
Economic Science (especially market research)
World Wide Web
Document classification
Cluster Weblog data to discover groups of similar access
patterns
Examples of Clustering
Applications
Marketing: Help marketers discover distinct groups
in their customer bases, and then use this
knowledge to develop targeted marketing programs.
Land use: Identification of areas of similar land use
in an earth observation database.
Insurance: Identifying groups of motor insurance
policy holders with a high average claim cost.
City-planning: Identifying groups of houses
according to their house type, value, and
geographical location.
Earthquake studies: Observed earthquake
epicenters should be clustered along continental
faults.
The K-Means Clustering Method
Input: a database D of m records, r1, r2, ..., rm, and
a desired number of clusters k.
Output: a set of k clusters that minimizes the square
error criterion.
Given k, the k-means algorithm is implemented in 4
steps:
Step 1: Randomly choose k records as the initial cluster
centers.
Step 2: Assign each record ri to the cluster such that the
distance between ri and the cluster centroid (mean) is the
smallest among the k clusters.
Step 3: Recalculate the centroid (mean) of each cluster
based on the records assigned to the cluster.
Step 4: Go back to Step 2; stop when no record changes
its cluster assignment.
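The square error criterion named in the output is commonly written as below, with C_i denoting cluster i and m_i its mean. The symbol E and the use of Euclidean distance are assumptions.

```latex
E = \sum_{i=1}^{k} \sum_{r \in C_i} \lVert r - m_i \rVert^{2}
```

Steps 2 and 3 each leave E unchanged or reduce it, which is why the iteration settles.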
The algorithm begins by randomly choosing k records to
represent the centroids (means), m1, m2, ..., mk, of the
clusters C1, C2, ..., Ck. All the records are placed in a
given cluster based on the distance between the record
and the cluster mean. If the distance between mi and
record rj is the smallest among all cluster means, then
record rj is placed in cluster Ci.
Once all records have been placed in a cluster, the mean
for each cluster is recomputed.
Then the process repeats, by examining each record
again and placing it in the cluster whose mean is closest.
Several iterations may be needed, but the algorithm will
converge, although it may terminate at a local optimum.
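The four steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: records are numeric tuples, distance is squared Euclidean, and a seeded random generator makes the initial choice reproducible. The names `k_means`, `squared_distance`, and `mean`, and the sample records, are hypothetical.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(records, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(records, k)   # Step 1: pick k records at random
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each record to the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda i: squared_distance(r, centroids[i]))
            for r in records
        ]
        if new_assignment == assignment:  # Step 4: stop when nothing changes
            break
        assignment = new_assignment
        # Step 3: recompute each centroid from its assigned records.
        for i in range(k):
            members = [r for r, a in zip(records, assignment) if a == i]
            if members:                   # keep the old centroid if a cluster empties
                centroids[i] = mean(members)
    clusters = [
        [r for r, a in zip(records, assignment) if a == i] for i in range(k)
    ]
    return clusters, centroids


# Hypothetical 2-D records forming two well-separated groups.
records = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.0)]
clusters, centroids = k_means(records, 2)
```

Because the result depends on the initial centers, a common practice is to run the algorithm with several seeds and keep the clustering with the lowest square error.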
Clustering of a set of objects based on the k-means method.
Hierarchical Clustering
A hierarchical clustering method works by grouping data
objects into a tree of clusters.
In general, there are two types of hierarchical clustering
methods:
Agglomerative hierarchical clustering: This bottom-up strategy
starts by placing each object in its own cluster and then merges
these atomic clusters into larger and larger clusters, until all of the
objects are in a single cluster or until certain termination
conditions are satisfied. Most hierarchical clustering methods belong
to this category. They differ only in their definition of intercluster
similarity.
Divisive hierarchical clustering: This top-down strategy does the
reverse of agglomerative hierarchical clustering by starting with all
objects in one cluster. It subdivides the cluster into smaller and
smaller pieces, until each object forms a cluster on its own or until
certain termination conditions are satisfied, such as a desired number
of clusters being obtained or the distance between the two closest
clusters exceeding a certain threshold distance.
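The bottom-up strategy can be sketched as a loop that repeatedly merges the two closest clusters. The sketch assumes one-dimensional points, single-link distance as the intercluster similarity, and a desired number of clusters as the termination condition. Those choices, the function names, and the sample points are all assumptions.

```python
def single_link(a, b):
    """Intercluster distance: smallest gap between any two members."""
    return min(abs(x - y) for x in a for y in b)


def agglomerative(points, num_clusters):
    clusters = [[p] for p in points]         # each object starts in its own cluster
    while len(clusters) > num_clusters:      # termination condition
        # Find the pair of clusters with the smallest intercluster distance.
        i, j = min(
            (
                (i, j)
                for i in range(len(clusters))
                for j in range(i + 1, len(clusters))
            ),
            key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] += clusters[j]           # merge cluster j into cluster i
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical 1-D data.
points = [1.0, 1.2, 5.0, 5.1, 9.0]
three = agglomerative(points, 3)
two = agglomerative(points, 2)
```

Replacing `single_link` with a complete-link or average-link distance gives a different method of the same family, in line with the remark that these methods differ in their definition of intercluster similarity.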
Agglomerative and divisive hierarchical clustering on data objects {a, b, c,
d, e}
7. POTENTIAL APPLICATIONS OF
DM
Database analysis and decision support
Market analysis and management
target marketing, customer relation management, market
basket analysis, cross selling, market segmentation
Risk analysis and management
Forecasting, customer retention, improved underwriting,
quality control, competitive analysis
Fraud detection and management
Other Applications
Text mining (news group, email, documents) and Web
analysis.
Intelligent query answering
Market Analysis and Management
Cross-market analysis
Associations/correlations between product sales
Prediction based on the association information
Customer profiling
data mining can tell you what types of customers buy what
products (clustering or classification)
Identifying customer requirements
identifying the best products for different customers
use prediction to find what factors will attract new
customers
Provides summary information
various multidimensional summary reports
statistical summary information (data central tendency and
variation)
Fraud Detection and
Management
Applications
widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach
use historical data to build models of fraudulent behavior
and use data mining to help identify similar instances
Examples
auto insurance: detect a group of people who stage
accidents to collect on insurance
money laundering: detect suspicious money transactions
(US Treasury's Financial Crimes Enforcement Network)
medical insurance: detect professional patients, rings of
doctors, and rings of references