
Data Mining

1
Outline

1. Overview of data mining


2. Association rules
3. Classification
4. Regression
5. Clustering
6. Other Data Mining problems
7. Applications of data mining

2
DATA MINING

Data mining refers to the mining or discovery of new
information in terms of patterns or rules from vast
amounts of data.
To be practically useful, data mining must be carried out
efficiently on large files and databases.
This chapter briefly reviews the state-of-the-art of this
extensive field of data mining.
Data mining uses techniques from such areas as
machine learning,
statistics,
neural networks,
and genetic algorithms.

3
1. OVERVIEW OF DATA MINING
Data Mining as a Part of the Knowledge Discovery Process
Knowledge Discovery in Databases, abbreviated as
KDD, encompasses more than data mining.

The knowledge discovery process comprises six
phases: data selection, data cleansing, enrichment,
data transformation or encoding, data mining, and
the reporting and displaying of the discovered
information.

4
Example
Consider a transaction database maintained by a specialty
consumer goods retailer. Suppose the client data includes a
customer name, zip code, phone number, date of purchase, item
code, price, quantity, and total amount.
A variety of new knowledge can be discovered by KDD
processing on this client database.
During data selection, data about specific items or categories of
items, or from stores in a specific region or area of the country,
may be selected.
The data cleansing process then may correct invalid zip codes or
eliminate records with incorrect phone prefixes. Enrichment
enhances the data with additional sources of information. For
example, given the client names and phone numbers, the store
may purchase other data about age, income, and credit rating
and append them to each record.
Data transformation and encoding may be done to reduce the
amount of data.
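
The preparation steps in this example can be sketched in Python. This is only an illustration under stated assumptions: the records, the `demographics` lookup, and the helper names (`select`, `cleanse`, `enrich`, `transform`) are hypothetical, and cleansing is simplified to dropping records whose zip code is not five digits.

```python
# Hypothetical client records; field names follow the attributes listed above.
records = [
    {"name": "Ann", "zip": "30301", "item_code": "CAM-01", "price": 200.0, "quantity": 1},
    {"name": "Bob", "zip": "3030", "item_code": "CAM-02", "price": 150.0, "quantity": 2},
    {"name": "Cy", "zip": "94105", "item_code": "TV-07", "price": 500.0, "quantity": 1},
]
# Hypothetical purchased demographic data, keyed by customer name.
demographics = {"Ann": {"income": "high"}, "Bob": {"income": "medium"}}

def select(rows, item_prefix):
    """Data selection: keep only records for one category of items."""
    return [r for r in rows if r["item_code"].startswith(item_prefix)]

def cleanse(rows):
    """Data cleansing (simplified): drop records with an invalid zip code."""
    return [r for r in rows if len(r["zip"]) == 5 and r["zip"].isdigit()]

def enrich(rows, extra):
    """Enrichment: append externally sourced attributes to each record."""
    return [{**r, **extra.get(r["name"], {})} for r in rows]

def transform(rows):
    """Transformation: reduce each record to a region code and a total amount."""
    return [{"region": r["zip"][:3], "income": r.get("income"),
             "total": r["price"] * r["quantity"]} for r in rows]

prepared = transform(enrich(cleanse(select(records, "CAM")), demographics))
print(prepared)
```

Only the prepared records would then be handed to the mining step.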

5
Example (cont.)
The result of mining may be to discover the
following type of new information:
Association rules: e.g., whenever a customer buys video
equipment, he or she also buys another electronic gadget.
Sequential patterns: e.g., suppose a customer buys a camera,
and within three months he or she buys photographic supplies,
then within six months he or she is likely to buy an accessory item.
This defines a sequential pattern of transactions. A customer who
buys more than twice in the regular periods may be likely to buy at
least once during the Christmas period.
Classification trees: e.g., customers may be classified by
frequency of visits, by types of financing used, by amount of
purchase, or by affinity for types of items, and some revealing
statistics may be generated for such classes.

6
We can see that many possibilities exist for discovering new
knowledge about buying patterns, relating factors such as age,
income group, place of residence, to what and how much the
customers purchase.
This information can then be utilized
to plan additional store locations based on demographics,
to run store promotions,
to combine items in advertisements, or
to plan seasonal marketing strategies.
As this retail store example shows, data mining must be
preceded by significant data preparation before it can yield useful
information that can directly influence business decisions.
The results of data mining may be reported in a variety of
formats, such as listings, graphic outputs, summary tables, or
visualization.

7
Goals of Data Mining and
Knowledge Discovery
Data mining is carried out with some end goals.
These goals fall into the following classes:
Prediction: Data mining can show how certain attributes
within the data will behave in the future.
Identification: Data patterns can be used to identify the
existence of an item, an event or an activity.
Classification: Data mining can partition the data so that
different classes or categories can be identified based on
combinations of parameters.

8
Data Mining: On What Kind
of Data?
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories
Object-oriented and object-relational databases
Spatial databases
Time-series data and temporal data
Text databases and multimedia databases
Heterogeneous and legacy databases
World Wide Web

9
Types of Knowledge
Discovered During Data
Mining.
Data mining addresses inductive knowledge, which
discovers new rules and patterns from the supplied data.
Knowledge can be represented in many forms: In an
unstructured sense, it can be represented by rules. In a
structured form, it may be represented in decision trees,
semantic networks, or hierarchies of classes or frames.
It is common to describe the knowledge discovered
during data mining in five ways:
Association rules: These rules correlate the presence of a set
of items with another range of values for another set of variables.

10
Types of Knowledge Discovered
(cont.)
Classification hierarchies: The goal is to work from an
existing set of events or transactions to create a hierarchy
of classes.
Patterns within time series
Sequential patterns: A sequence of actions or events is
sought. Detection of sequential patterns is equivalent to
detecting associations among events with certain temporal
relationships.
Clustering: A given population of events can be
partitioned into sets of similar elements.

11
Main functional phases of the KD process
Learning the application domain:
relevant prior knowledge and goals of application
Creating a target data set: data selection
Data cleaning and preprocessing: (may take 60% of effort!)
Data reduction and transformation:
Find useful features, dimensionality/variable reduction, invariant
representation.
Choosing functions of data mining
summarization, classification, regression, association, clustering.

Choosing the mining algorithm(s)


Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.

Use of discovered knowledge

12
Main phases of data mining
[Figure: Data Sources → Data Cleaning and Data Integration → Data Warehouse → Selection/Transformation → Task-relevant Data → Data Mining → Patterns → Pattern Evaluation/Presentation]

13
2. ASSOCIATION RULES
What Is Association Rule
Mining?
Association rule mining is finding frequent patterns,
associations, correlations, or causal structures
among sets of items or objects in transaction
databases, relational databases, and other
information repositories.
Applications:
Basket data analysis,
cross-marketing,
catalog design,
clustering, classification, etc.
Rule form: Body → Head [support, confidence].

14
Association rule mining
Examples.
buys(x, diapers) → buys(x, beers) [0.5%, 60%]
major(x, CS) ∧ takes(x, DB) → grade(x, A) [1%, 75%]

Association Rule Mining Problem:


Given: (1) database of transactions, (2) each
transaction is a list of items (purchased by a
customer in a visit)
Find: all rules that correlate the presence of one set of
items with that of another set of items
E.g., 98% of people who purchase tires and auto
accessories also get automotive services done.
15
Support and confidence

For a rule A → B:
support, s, is the probability that a transaction contains A ∪ B
(that is, both A and B):
s = P(A ∪ B)
confidence, c, is the conditional probability that a transaction
having A also contains B:
c = P(B | A)
Rules that satisfy both a minimum support threshold
(min_sup) and a minimum confidence threshold
(min_conf) are called strong.

16
Frequent item set
A set of items is referred to as an itemset. An itemset that
contains k items is a k-itemset. The occurrence frequency
of an itemset is the number of transactions that contain
the itemset.
An itemset satisfies minimum support if the occurrence
frequency of the itemset is greater than or equal to the
product of min_sup and the total number of transactions in
the database D. The number of transactions required for the
itemset to satisfy minimum support is referred to as the minimum
support count.
If an itemset satisfies minimum support, then it is a
frequent itemset. The set of frequent k-itemsets is
commonly denoted by Lk.

17
Data Mining Association
Rules
Bar-code technology made it possible for retail
organizations to collect and store massive amounts
of sales data, referred to as the basket data
Successful organizations view such databases as
important pieces of the marketing infrastructure
Organizations are interested in instituting
information-driven marketing processes, managed
by database technology, that enable marketers to
develop and implement customized marketing
programs and strategies

May 21, 2002


Data Mining Association
Rules
For example, 98% of customers that purchase tires
and auto accessories also get automotive services
done
Finding all such rules is valuable for cross-marketing
and attached mailing applications
Other applications include catalog design, add-on
sales, store layout, and customer segmentation based
on buying patterns
Databases involved in these applications are very
large, so it is imperative to have fast algorithms for this
task
Example of Confidence and
Support
Given the following table of transactions:
{1 2 3 4}

{1 2 4 5}

{1 2 5}

1 → 2 has 100% confidence, with 100% support
3 → 4 has 100% confidence, with 33% support
2 → 3 has 33% confidence, with 33% support
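
The percentages in this example can be checked with a short Python sketch. The helper names `support` and `confidence` are illustrative and not taken from the slides. Transactions are modeled as Python sets, and confidence is computed as the support of body and head together divided by the support of the body.

```python
# The three transactions from the table above.
transactions = [{1, 2, 3, 4}, {1, 2, 4, 5}, {1, 2, 5}]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(body, head, transactions):
    """Support of body and head together, divided by support of body."""
    both = set(body) | set(head)
    return support(both, transactions) / support(body, transactions)

for body, head in [({1}, {2}), ({3}, {4}), ({2}, {3})]:
    print(body, "->", head,
          f"support={support(body | head, transactions):.0%}",
          f"confidence={confidence(body, head, transactions):.0%}")
```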



Example 2.1
Transaction-ID Items_bought
-------------------------------------------
2000 A, B, C
1000 A, C
4000 A, D
5000 B, E, F

With a minimum support of 50% and a minimum confidence of
50%, we have
A → C (50%, 66.6%)
C → A (50%, 100%)
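
As a rough check of Example 2.1, the sketch below enumerates the frequent itemsets by brute force. It is not the Apriori algorithm (there is no candidate pruning), and the function name `frequent_itemsets` is an assumption made for illustration. The minimum support count is the product of min_sup and the number of transactions, as defined earlier.

```python
from itertools import combinations

# Transactions from Example 2.1, keyed by transaction ID.
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def frequent_itemsets(transactions, min_sup):
    """Return {itemset: support count} for every itemset meeting min_sup."""
    baskets = list(transactions.values())
    min_count = min_sup * len(baskets)  # minimum support count
    items = sorted(set().union(*baskets))
    frequent = {}
    for k in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, k):
            count = sum(set(candidate) <= basket for basket in baskets)
            if count >= min_count:
                frequent[frozenset(candidate)] = count
                found_any = True
        if not found_any:  # no frequent k-itemset, so no larger one exists
            break
    return frequent

frequent = frequent_itemsets(transactions, min_sup=0.5)
for itemset, count in frequent.items():
    print(sorted(itemset), count)

# Confidence of A -> C is count({A, C}) / count({A}); C -> A divides by count({C}).
conf_a_c = frequent[frozenset({"A", "C"})] / frequent[frozenset({"A"})]
conf_c_a = frequent[frozenset({"A", "C"})] / frequent[frozenset({"C"})]
```

The only frequent 2-itemset is {A, C}, which is why the example reports rules between A and C only.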

21
Example 2.3. From Example 2.2, suppose the data contain
the frequent itemset l = {I1, I2, I5}. The nonempty proper
subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2} and {I5}.
The resulting association rules are as shown below:

I1 ∧ I2 → I5    confidence = 2/4 = 50%
I1 ∧ I5 → I2    confidence = 2/2 = 100%
I2 ∧ I5 → I1    confidence = 2/2 = 100%
I1 → I2 ∧ I5    confidence = 2/6 = 33%
I2 → I1 ∧ I5    confidence = 2/7 = 29%
I5 → I1 ∧ I2    confidence = 2/2 = 100%

If the minimum confidence threshold is, say, 70%, then only
the second, third and last rules above are output.
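
Rule generation from a single frequent itemset can be sketched as follows. The support counts are read off the numerators and denominators of the confidence fractions in Example 2.3. The function name `rules_from_itemset` is hypothetical, and the arrows are written in ASCII.

```python
from itertools import combinations

# Support counts implied by the confidence fractions in Example 2.3.
support_count = {
    frozenset({"I1"}): 6,
    frozenset({"I2"}): 7,
    frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4,
    frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2,
    frozenset({"I1", "I2", "I5"}): 2,
}

def rules_from_itemset(itemset, support_count, min_conf):
    """Return (body, head, confidence) for each rule meeting min_conf."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):  # nonempty proper subsets only
        for body in combinations(sorted(itemset), size):
            body = frozenset(body)
            conf = support_count[itemset] / support_count[body]
            if conf >= min_conf:
                rules.append((sorted(body), sorted(itemset - body), conf))
    return rules

strong = rules_from_itemset({"I1", "I2", "I5"}, support_count, min_conf=0.7)
for body, head, conf in strong:
    print(" AND ".join(body), "->", " AND ".join(head), f"confidence = {conf:.0%}")
```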

22
3. CLASSIFICATION
Classification is the process of learning a model that
describes different classes of data. The classes are
predetermined.
Example: In a banking application, customers who apply
for a credit card may be classified as a good risk, a fair
risk or a poor risk. Because the classes are known in advance,
this type of activity is also called supervised learning.
Once the model is built, then it can be used to classify
new data.

23
The first step, of learning the model, is accomplished by using a
training set of data that has already been classified. Each record
in the training data contains an attribute, called the class label,
that indicates which class the record belongs to.
The model that is produced is usually in the form of a decision
tree or a set of rules.
Some of the important issues with regard to the model and the
algorithm that produces the model include:
the model's ability to predict the correct class of new data,
the computational cost associated with the algorithm,
the scalability of the algorithm.
Let us examine the approach where the model is in the form of a
decision tree.
A decision tree is simply a graphical representation of the
description of each class or in other words, a representation of
the classification rules.

24
Example 3.1
Example 3.1: Suppose that we have a database of customers
on the AllElectronics mailing list. The database describes
attributes of the customers, such as their name, age, income,
occupation, and credit rating. The customers can be classified
as to whether or not they have purchased a computer at
AllElectronics.
Suppose that new customers are added to the database and
that you would like to notify these customers of an upcoming
computer sale. To send out promotional literature to every new
customer in the database can be quite costly. A more cost-efficient
method would be to target only those new customers
who are likely to purchase a new computer. A classification
model can be constructed and used for this purpose.
Figure 2 shows a decision tree for the concept
buys_computer, indicating whether or not a customer at
AllElectronics is likely to purchase a computer.

25
Each internal node represents a test on an attribute. Each leaf
node represents a class.

A decision tree for the concept buys_computer, indicating whether or not
a customer at AllElectronics is likely to purchase a computer.

26
Algorithm for decision tree
induction
Input: a set of training data records R1, R2, …, Rm and a set of
attributes A1, A2, …, An
Output: decision tree
Basic algorithm (a greedy algorithm)
- Tree is constructed in a top-down recursive divide-and-conquer
manner
- At start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they
are discretized in advance)
- Examples are partitioned recursively based on
selected attributes
- Test attributes are selected on the basis of a heuristic
or statistical measure (e.g., information gain)

27
Conditions for stopping partitioning
- All samples for a given node belong to the same
class
- There are no remaining attributes for further
partitioning; in this case, majority voting is employed for
classifying the leaf
- There are no samples left.
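
The greedy, top-down procedure and its stopping conditions can be sketched in Python. This is a minimal illustration rather than a reference implementation. It assumes categorical attributes, uses information gain as the selection measure, and runs on four hypothetical training rows. Because it only creates branches for attribute values that actually occur in the current partition, the "no samples left" case never arises in this sketch.

```python
from collections import Counter
from math import log2

def entropy(rows, target):
    """Entropy (in bits) of the class label distribution in rows."""
    counts = Counter(row[target] for row in rows)
    total = len(rows)
    return -sum((n / total) * log2(n / total) for n in counts.values())

def information_gain(rows, attr, target):
    """Reduction in entropy obtained by partitioning rows on attr."""
    total = len(rows)
    remainder = 0.0
    for value in {row[attr] for row in rows}:
        subset = [row for row in rows if row[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def majority_class(rows, target):
    return Counter(row[target] for row in rows).most_common(1)[0][0]

def build_tree(rows, attributes, target):
    classes = {row[target] for row in rows}
    if len(classes) == 1:              # all samples belong to the same class
        return classes.pop()
    if not attributes:                 # no attributes left: majority voting
        return majority_class(rows, target)
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in {row[best] for row in rows}:
        subset = [row for row in rows if row[best] == value]
        branches[value] = build_tree(subset, remaining, target)
    return {best: branches}

# Hypothetical training rows, for illustration only.
rows = [
    {"student": "no", "credit_rating": "fair", "buys_computer": "no"},
    {"student": "no", "credit_rating": "excellent", "buys_computer": "no"},
    {"student": "yes", "credit_rating": "fair", "buys_computer": "yes"},
    {"student": "yes", "credit_rating": "excellent", "buys_computer": "yes"},
]
tree = build_tree(rows, ["student", "credit_rating"], "buys_computer")
print(tree)
```

On these rows, `student` separates the classes perfectly (gain of 1 bit) while `credit_rating` has zero gain, so `student` becomes the root test.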

28
Training data tuples from the
AllElectronics customer database

Class: No, No, Yes, Yes, Yes, No, Yes, No, Yes, Yes, Yes, Yes, Yes, No
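
Of the 14 class labels listed above, 9 are Yes and 5 are No. Assuming the class column refers to buys_computer, and that information gain (named earlier as a selection measure) is computed with base-2 entropy, the expected information needed to classify a tuple in this training set works out as:

```latex
I(9, 5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940 \text{ bits}
```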

29
Extracting Classification Rules
from Trees
Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand.
Example
IF age = <=30 AND student = no THEN buys_computer = no
IF age = <=30 AND student = yes THEN buys_computer = yes
IF age = 31…40 THEN buys_computer = yes
IF age = >40 AND credit_rating = excellent THEN buys_computer = no
IF age = >40 AND credit_rating = fair THEN buys_computer = yes
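
The one-rule-per-path idea can be sketched as a short recursive traversal. The nested-dictionary encoding of the tree is an assumption made for illustration; its shape is inferred from the five rules above, and the middle age range is written in ASCII as `31...40`.

```python
# Tree implied by the rules above: each dict maps an attribute to its branches.
tree = {
    "age": {
        "<=30": {"student": {"no": "no", "yes": "yes"}},
        "31...40": "yes",
        ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
    }
}

def extract_rules(node, conditions=()):
    """Yield one IF-THEN rule string per root-to-leaf path."""
    if not isinstance(node, dict):  # leaf: holds the class prediction
        yield "IF " + " AND ".join(conditions) + f" THEN buys_computer = {node}"
        return
    (attribute, branches), = node.items()
    for value, subtree in branches.items():
        # Each attribute-value pair along the path joins the conjunction.
        yield from extract_rules(subtree, conditions + (f"{attribute} = {value}",))

rules = list(extract_rules(tree))
for rule in rules:
    print(rule)
```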

30
Neural Networks and
Classification
A neural network is a technique derived from AI that uses
generalized approximation and provides an iterative
method to carry it out. ANNs use the curve-fitting
approach to infer a function from a set of samples.
This technique provides a learning approach; it is
driven by a test sample that is used for the initial
inference and learning. With this kind of learning method,
responses to new inputs can be interpolated
from the known samples. This interpolation depends on
the model developed by the learning method.

31
ANN and classification
ANNs can be classified into 2 categories: supervised
and unsupervised networks. Adaptive methods that
attempt to reduce the output error are supervised
learning methods, whereas those that develop internal
representations without sample outputs are called
unsupervised learning methods.
ANNs can learn from information on a specific
problem. They perform well on classification tasks and
are therefore useful in data mining.

32
Information processing at a
neuron in an ANN
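
A common way to model the processing at a single neuron is a weighted sum of the inputs passed through an activation function. The sketch below assumes a sigmoid activation and a bias term. Neither detail is stated in the slides, and the name `neuron_output` is illustrative.

```python
from math import exp

def neuron_output(inputs, weights, bias):
    """Weighted sum of inputs plus bias, squashed by a sigmoid (assumed)."""
    net = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + exp(-net))

# Hypothetical inputs and weights, for illustration only.
print(neuron_output([1.0, 0.5], [0.4, -0.2], bias=0.1))
```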

33
4. Regression
Predictive data mining
Descriptive data mining

Definition (J. Han et al., 2001 & 2006): Regression is a
method used to predict continuous values for given input.

34
Regression

Regression analysis can be used to model the
relationship between one or more independent or
predictor variables and one or more response or
dependent variables.
Categories
Linear regression and nonlinear regression
Uni-variate and multi-variate regression
parametric, nonparametric and semi-parametric

35
Regression function
Regression function: Y = f(X, β)
X: predictor/independent variables
Y: response/dependent variables
β: regression coefficients
X: used to explain the changes of the response variable Y.
Y: used to describe the target phenomenon.
The relationship between Y and X can be represented
by the functional dependence of Y on X.
β describes the influence of X on Y.
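
For the simplest case, one predictor and a linear f, the regression coefficients can be estimated by ordinary least squares. The sketch below makes that assumption. The function name `fit_line`, the coefficient names `b0` and `b1`, and the data are all hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (intercept, slope) for y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
b0, b1 = fit_line(xs, ys)
print(b0, b1)

# Predicting a continuous value for a new input.
prediction = b0 + b1 * 5.0
```

Here the slope `b1` plays the role of the coefficient that describes how strongly X influences Y.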

36
Applications of regression

Data Mining
Preprocessing stage
Data Mining stage
Descriptive data mining
Predictive data mining

Application areas: biology, agriculture, social
issues, economy, business, ...
37
Some problems with
regression
Some assumptions going along with regression.
Danger of extrapolation.
Evaluation of regression models.
Other advanced techniques for regression:
Artificial Neural Network (ANN)
Support Vector Machine (SVM)

38
5. CLUSTERING
What is Cluster Analysis?
Cluster: a collection of data objects
Similar to one another within the same cluster
Dissimilar to the objects in other clusters.
Cluster analysis
Grouping a set of data objects into clusters.
Clustering is unsupervised learning: no predefined
classes, no class-labeled training samples.
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms

39
General Applications of
Clustering
Pattern Recognition
Spatial Data Analysis
create thematic maps in GIS by clustering feature spaces
detect spatial clusters and explain them in spatial data
mining
Image Processing
Economic Science (especially market research)
World Wide Web
Document classification
Cluster Weblog data to discover groups of similar access
patterns

40
Examples of Clustering
Applications
Marketing: Help marketers discover distinct groups
in their customer bases, and then use this
knowledge to develop targeted marketing programs.
Land use: Identification of areas of similar land use
in an earth observation database.
Insurance: Identifying groups of motor insurance
policy holders with a high average claim cost.
City-planning: Identifying groups of houses
according to their house type, value, and
geographical location.
Earthquake studies: Observed earthquake
epicenters should be clustered along continental
faults.

41
The K-Means Clustering Method
Input: a database D of m records r1, r2, …, rm and
a desired number of clusters k.
Output: a set of k clusters that minimizes the squared-error
criterion.
Given k, the k-means algorithm is implemented in 4
steps:
Step 1: Randomly choose k records as the initial cluster
centers.
Step 2: Assign each record ri to the cluster such that the
distance between ri and the cluster centroid (mean) is the
smallest among the k clusters.
Step 3: Recalculate the centroid (mean) of each cluster
based on the records assigned to the cluster.
Step 4: Go back to Step 2; stop when no record changes
cluster.
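
The four steps can be sketched in plain Python. The sketch assumes numeric records compared by squared Euclidean distance, a fixed random seed for repeatability, and an iteration cap as a safety net. The function names and the sample points are hypothetical.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def nearest(record, centroids):
    """Index of the centroid closest to record."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(record, centroids[i]))

def centroid(members):
    """Component-wise mean of a non-empty list of records."""
    n = len(members)
    return tuple(sum(coords) / n for coords in zip(*members))

def k_means(records, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(records, k)          # Step 1: random initial centers
    assignment = None
    for _ in range(max_iter):
        new_assignment = [nearest(r, centroids) for r in records]  # Step 2
        if new_assignment == assignment:        # Step 4: nothing changed, stop
            break
        assignment = new_assignment
        for i in range(k):                      # Step 3: recompute the means
            members = [r for r, a in zip(records, assignment) if a == i]
            if members:                         # keep old center if cluster is empty
                centroids[i] = centroid(members)
    return centroids, assignment

# Hypothetical two-dimensional records.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, assignment = k_means(points, k=2)
print(centroids, assignment)
```

Different seeds may yield different initial centers, so the result is a local optimum of the squared-error criterion rather than a guaranteed global one.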

42
The algorithm begins by randomly choosing k records to
represent the centroids (means) m1, m2, …, mk of the
clusters C1, C2, …, Ck. All the records are placed in a
given cluster based on the distance between the record
and the cluster mean. If the distance between mi and
record rj is the smallest among all cluster means, then
record rj is placed in cluster Ci.
Once all records have been placed in a cluster, the mean
for each cluster is recomputed.
Then the process repeats, by examining each record
again and placing it in the cluster whose mean is closest.
Several iterations may be needed, but the algorithm will
converge, although it may terminate at a local optimum.

43
Clustering of a set of objects based on the k-means method.

44
Hierarchical Clustering
A hierarchical clustering method works by grouping data
objects into a tree of clusters.
In general, there are two types of hierarchical clustering
methods:
Agglomerative hierarchical clustering: This bottom-up strategy
starts by placing each object in its own cluster and then merges
these atomic clusters into larger and larger clusters, until all of the
objects are in a single cluster or until certain termination
conditions are satisfied. Most hierarchical clustering methods belong
to this category. They differ only in their definition of intercluster
similarity.
Divisive hierarchical clustering: This top-down strategy does the
reverse of agglomerative hierarchical clustering by starting with all
objects in one cluster. It subdivides the cluster into smaller and
smaller pieces, until each object forms a cluster on its own or until
certain termination conditions are satisfied, such as a desired number
of clusters being obtained or the distance between the two closest
clusters exceeding a certain threshold distance.
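
The bottom-up strategy can be sketched as follows. The sketch assumes single-link similarity (the distance between two clusters is the distance between their closest members) and uses a desired number of clusters as the termination condition. The one-dimensional positions assigned to the objects a to e are hypothetical.

```python
# Hypothetical one-dimensional positions for five objects.
positions = {"a": 10, "b": 15, "c": 50, "d": 66, "e": 70}

def single_link(c1, c2, positions):
    """Distance between the closest pair of members of two clusters."""
    return min(abs(positions[p] - positions[q]) for p in c1 for q in c2)

def agglomerative(objects, positions, target_clusters):
    clusters = [[o] for o in objects]   # start: each object in its own cluster
    while len(clusters) > target_clusters:
        pairs = [(single_link(clusters[i], clusters[j], positions), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)            # the two closest clusters
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

clusters = agglomerative(sorted(positions), positions, target_clusters=2)
print(clusters)
```

With these positions, d and e merge first, then a and b, and then c joins d and e. Setting `target_clusters=1` would continue until all objects are in a single cluster.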

45
Agglomerative and divisive hierarchical clustering on data objects {a, b, c,
d, e}

46
7. POTENTIAL APPLICATIONS OF
DM
Database analysis and decision support
Market analysis and management
target marketing, customer relation management, market
basket analysis, cross selling, market segmentation
Risk analysis and management
Forecasting, customer retention, improved underwriting,
quality control, competitive analysis
Fraud detection and management
Other Applications
Text mining (news group, email, documents) and Web
analysis.
Intelligent query answering

47
Market Analysis and Management

Where are the data sources for analysis?


Credit card transactions, discount coupons, customer
complaint calls, plus (public) lifestyle studies
Target marketing
Find clusters of model customers who share the same
characteristics: interest, income level, spending habits, etc.
Determine customer purchasing patterns over time
Conversion of single to a joint bank account: marriage, etc.

48
Cross-market analysis
Associations/correlations between product sales
Prediction based on the association information
Customer profiling
data mining can tell you what types of customers buy what
products (clustering or classification)
Identifying customer requirements
identifying the best products for different customers
use prediction to find what factors will attract new
customers
Provides summary information
various multidimensional summary reports
statistical summary information (data central tendency and
variation)

49
Fraud Detection and
Management
Applications
widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach
use historical data to build models of fraudulent behavior
and use data mining to help identify similar instances
Examples
auto insurance: detect a group of people who stage
accidents to collect on insurance
money laundering: detect suspicious money transactions
(US Treasury's Financial Crimes Enforcement Network)
medical insurance: detect professional patients and rings of
doctors and rings of references

50
