
Teaching an Introductory Course in Data Mining

Richard J. Roiger Computer and Information Sciences Dept. Minnesota State University, Mankato USA Email: richard.roiger@mnsu.edu Web site: krypton.mnsu.edu/~roiger

Teaching an Introductory Course in Data Mining

Designed for university instructors teaching in information science or computer science departments who wish to introduce a data mining course or unit into their curriculum.
Appropriate for anyone interested in a detailed overview of data mining as a problem-solving tool. Will emphasize material found in the text Data Mining: A Tutorial-Based Primer, published by Addison-Wesley in 2003. Additional materials covering the most recent trends in data mining will also be presented. Participants will have the opportunity to experience the data mining process. Each participant will receive a complimentary copy of the aforementioned text together with a CD containing PowerPoint slides and a student version of iDA.

Questions to Answer
What constitutes data mining?
Where does data mining fit in a CS or IS curriculum?
Can I use data mining to solve my problem?
How do I use data mining to solve my problem?

What Constitutes Data Mining?


Finding interesting patterns in data
Model building
Inductive learning
Generalization

What Constitutes Data Mining?


Business applications
beer and diapers
valid vs. invalid credit purchases
churn analysis

Web applications
crawler vs. human being
user browsing habits

What Constitutes Data Mining?


Medical applications
microarray data mining
disease diagnosis

Scientific applications
earthquake detection
gamma-ray bursts

Where does data mining fit in a CS or IS curriculum?


Intelligent Systems

        | Computer Science | Information Systems
Minimum | 1                | 1
Maximum | 5                | 1

Where does data mining fit in a CS or IS curriculum?


Decision Theory

        | Computer Science | Information Systems
Minimum | 0                | 3
Maximum | 0                | 3

Can I use data mining to solve my problem?


Do I have access to the data?
Is the data easily obtainable?
Do I have access to the right attributes?

How do I use data mining to solve my problem?


What strategies should I apply?
What data mining techniques should I use?
How do I evaluate results?
How do I apply what has been learned?
Have I adhered to all data privacy issues?

Data Mining: A First View


Chapter 1

Data Mining
The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.

Knowledge Discovery in Databases (KDD)


The application of the scientific method to data mining. Data mining is one step of the KDD process.

Computers & Learning


Computers are good at learning concepts. Concepts are the output of a data mining session.

Supervised Learning
Build a learner model using data instances of known origin. Use the model to determine the outcome of new instances of unknown origin.

Supervised Learning: A Decision Tree Example

Decision Tree
A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.

Table 1.1 Hypothetical Training Data for Disease Diagnosis

Patient ID# | Sore Throat | Fever | Swollen Glands | Congestion | Headache | Diagnosis
1  | Yes | Yes | Yes | Yes | Yes | Strep throat
2  | No  | No  | No  | Yes | Yes | Allergy
3  | Yes | Yes | No  | Yes | No  | Cold
4  | Yes | No  | Yes | No  | No  | Strep throat
5  | No  | Yes | No  | Yes | No  | Cold
6  | No  | No  | No  | Yes | No  | Allergy
7  | No  | No  | Yes | No  | No  | Strep throat
8  | Yes | No  | No  | Yes | Yes | Allergy
9  | No  | Yes | No  | Yes | Yes | Cold
10 | Yes | Yes | No  | Yes | Yes | Cold

Swollen Glands
  = Yes: Diagnosis = Strep Throat
  = No: Fever
    = Yes: Diagnosis = Cold
    = No: Diagnosis = Allergy
Figure 1.1 A decision tree for the data in Table 1.1

Table 1.2 Data Instances with an Unknown Classification

Patient ID# | Sore Throat | Fever | Swollen Glands | Congestion | Headache | Diagnosis
11 | No  | No  | Yes | Yes | Yes | ?
12 | Yes | Yes | No  | No  | Yes | ?
13 | No  | No  | No  | No  | Yes | ?

Production Rules
IF Swollen Glands = Yes THEN Diagnosis = Strep Throat IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy
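The production rules above are directly executable. A minimal sketch (not from the text) that applies them to the unclassified instances of Table 1.2:

```python
# Apply the three production rules derived from the decision tree of Figure 1.1.
# Attributes not tested by the rules (Congestion, Headache, ...) are omitted.

def diagnose(patient):
    """Classify a patient record using the production rules."""
    if patient["Swollen Glands"] == "Yes":
        return "Strep Throat"          # Rule 1
    if patient["Fever"] == "Yes":
        return "Cold"                  # Rule 2: Swollen Glands = No & Fever = Yes
    return "Allergy"                   # Rule 3: Swollen Glands = No & Fever = No

# Instances 11-13 from Table 1.2.
new_patients = [
    {"ID": 11, "Swollen Glands": "Yes", "Fever": "No"},
    {"ID": 12, "Swollen Glands": "No",  "Fever": "Yes"},
    {"ID": 13, "Swollen Glands": "No",  "Fever": "No"},
]

for p in new_patients:
    print(p["ID"], diagnose(p))
```

Instance 11 follows the Swollen Glands = Yes branch, instance 12 the Fever = Yes branch, and instance 13 the remaining branch.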

Unsupervised Clustering
A data mining method that builds models from data without predefined classes.

The Acme Investors Dataset


Table 1.3 Acme Investors Incorporated

Customer ID | Account Type | Margin Account | Transaction Method | Trades/Month | Sex | Age | Favorite Recreation | Annual Income
1005 | Joint      | No  | Online | 12.5 | F | 30-39 | Tennis  | 40-59K
1013 | Custodial  | No  | Broker | 0.5  | F | 50-59 | Skiing  | 80-99K
1245 | Joint      | No  | Online | 3.6  | M | 20-29 | Golf    | 20-39K
2110 | Individual | Yes | Broker | 22.3 | M | 30-39 | Fishing | 40-59K
1001 | Individual | Yes | Online | 5.0  | M | 40-49 | Golf    | 60-79K

The Acme Investors Dataset & Supervised Learning


1. Can I develop a general profile of an online investor?
2. Can I determine if a new customer is likely to open a margin account?
3. Can I build a model to predict the average number of trades per month for a new investor?
4. What characteristics differentiate female and male investors?

The Acme Investors Dataset & Unsupervised Clustering


1. What attribute similarities group customers of Acme Investors together?
2. What differences in attribute values segment the customer database?

1.3 Is Data Mining Appropriate for My Problem?

Data Mining or Data Query?


Shallow Knowledge Multidimensional Knowledge Hidden Knowledge Deep Knowledge

Shallow Knowledge
Shallow knowledge is factual. It can be easily stored and manipulated in a database.

Multidimensional Knowledge
Multidimensional knowledge is also factual. Online Analytical Processing (OLAP) tools are used to manipulate multidimensional knowledge.

Hidden Knowledge
Hidden knowledge represents patterns or regularities in data that cannot be easily found using database query. However, data mining algorithms can find such patterns with ease.

Data Mining vs. Data Query: An Example


Use data query if you already know approximately what you are looking for. Use data mining to find regularities in data that are not obvious.

1.4 Expert Systems or Data Mining?

Expert System
A computer program that emulates the problem-solving skills of one or more human experts.

Knowledge Engineer
A person trained to interact with an expert in order to capture their knowledge.

Data → Data Mining Tool → IF Swollen Glands = Yes THEN Diagnosis = Strep Throat

Human Expert → Knowledge Engineer → Expert System Building Tool → IF Swollen Glands = Yes THEN Diagnosis = Strep Throat

Figure 1.2 Data mining vs. expert systems

1.5 A Simple Data Mining Process Model

Operational Database → SQL Queries → Data Warehouse → Data Mining → Interpretation & Evaluation → Result Application

Figure 1.3 A simple data mining process model

1.6 Why Not Simple Search?


Nearest Neighbor Classifier K-nearest Neighbor Classifier

Nearest Neighbor Classifier


Classification is performed by searching the training data for the instance closest in distance to the unknown instance.
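A minimal sketch (not from the text) of the search described above, using made-up two-attribute points and Euclidean distance:

```python
import math

def nearest_neighbor(unknown, training):
    """Return the class label of the training instance closest to `unknown`.

    `training` is a list of (point, class_label) pairs.
    """
    closest = min(training, key=lambda inst: math.dist(inst[0], unknown))
    return closest[1]

# Hypothetical training instances: two clusters of class "A" and class "B".
training = [((1.0, 1.0), "A"), ((2.0, 2.0), "A"), ((6.0, 6.0), "B")]

print(nearest_neighbor((5.5, 6.2), training))
```

A K-nearest neighbor classifier generalizes this by taking the majority class among the K closest training instances rather than just the single closest one.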

Customer Intrinsic Value

[Scatter plot: dashed line = intrinsic (predicted) value; X marks = actual value]

Figure 1.4 Intrinsic vs. actual customer value

Data Mining: A Closer Look


Chapter 2

2.1 Data Mining Strategies

Data Mining Strategies

  Unsupervised Clustering
  Supervised Learning
    Classification
    Estimation
    Prediction
  Market Basket Analysis

Figure 2.1 A hierarchy of data mining strategies

Data Mining Strategies: Classification


Learning is supervised.
The dependent variable is categorical.
Well-defined classes.
Current rather than future behavior.

Data Mining Strategies: Estimation


Learning is supervised.
The dependent variable is numeric.
Well-defined classes.
Current rather than future behavior.

Data Mining Strategies: Prediction


The emphasis is on predicting future rather than current outcomes. The output attribute may be categorical or numeric.

Classification, Estimation or Prediction?


The nature of the data determines whether a model is suitable for classification, estimation, or prediction.

The Cardiology Patient Dataset


This dataset contains 303 instances. Each instance holds information about a patient who either has or does not have a heart condition.

The Cardiology Patient Dataset


138 instances represent patients with heart disease. 165 instances contain information about patients free of heart disease.

Table 2.1 Cardiology Patient Data


Attribute Name | Mixed Values | Numeric Values | Comments
Age | Numeric | Numeric | Age in years
Sex | Male, Female | 1, 0 | Patient gender
Chest Pain Type | Angina, Abnormal Angina, NoTang, Asymptomatic | 1-4 | NoTang = Nonanginal pain
Blood Pressure | Numeric | Numeric | Resting blood pressure upon hospital admission
Cholesterol | Numeric | Numeric | Serum cholesterol
Fasting Blood Sugar < 120 | True, False | 1, 0 | Is fasting blood sugar less than 120?
Resting ECG | Normal, Abnormal, Hyp | 0, 1, 2 | Hyp = Left ventricular hypertrophy
Maximum Heart Rate | Numeric | Numeric | Maximum heart rate achieved
Induced Angina? | True, False | 1, 0 | Does the patient experience angina as a result of exercise?
Old Peak | Numeric | Numeric | ST depression induced by exercise relative to rest
Slope | Up, Flat, Down | 1-3 | Slope of the peak exercise ST segment
Number Colored Vessels | 0, 1, 2, 3 | 0, 1, 2, 3 | Number of major vessels colored by fluoroscopy
Thal | Normal, Fix, Rev | 3, 6, 7 | Normal, fixed defect, reversible defect
Concept Class | Healthy, Sick | 1, 0 | Angiographic disease status

Table 2.2 Most and Least Typical Instances from the Cardiology Domain
Attribute Name | Most Typical Healthy Class | Least Typical Healthy Class | Most Typical Sick Class | Least Typical Sick Class
Age | 52 | 63 | 60 | 62
Sex | Male | Male | Male | Female
Chest Pain Type | NoTang | Angina | Asymptomatic | Asymptomatic
Blood Pressure | 138 | 145 | 125 | 160
Cholesterol | 223 | 233 | 258 | 164
Fasting Blood Sugar < 120 | False | True | False | False
Resting ECG | Normal | Hyp | Hyp | Hyp
Maximum Heart Rate | 169 | 150 | 141 | 145
Induced Angina? | False | False | True | False
Old Peak | 0 | 2.3 | 2.8 | 6.2
Slope | Up | Down | Flat | Down
Number of Colored Vessels | 0 | 0 | 1 | 3
Thal | Normal | Fix | Rev | Rev

Classification, Estimation or Prediction?


The next two slides each contain a rule generated from this dataset. Is either of these rules predictive?

A Healthy Class Rule for the Cardiology Patient Dataset


IF 169 <= Maximum Heart Rate <= 202
THEN Concept Class = Healthy
Rule accuracy: 85.07%
Rule coverage: 34.55%

A Sick Class Rule for the Cardiology Patient Dataset


IF Thal = Rev & Chest Pain Type = Asymptomatic
THEN Concept Class = Sick
Rule accuracy: 91.14%
Rule coverage: 52.17%

Data Mining Strategies: Unsupervised Clustering

Unsupervised Clustering can be used to:


determine if relationships can be found in the data.
evaluate the likely performance of a supervised model.
find a best set of input attributes for supervised learning.
detect outliers.

Data Mining Strategies: Market Basket Analysis


Find interesting relationships among retail products. Uses association rule algorithms.

2.2 Supervised Data Mining Techniques

The Credit Card Promotion Database

Table 2.3 The Credit Card Promotion Database

Income Range ($) | Magazine Promotion | Watch Promotion | Life Insurance Promotion | Credit Card Insurance | Sex | Age
40-50K | Yes | No  | No  | No  | Male   | 45
30-40K | Yes | Yes | Yes | No  | Female | 40
40-50K | No  | No  | No  | No  | Male   | 42
30-40K | Yes | Yes | Yes | Yes | Male   | 43
50-60K | Yes | No  | Yes | No  | Female | 38
20-30K | No  | No  | No  | No  | Female | 55
30-40K | Yes | No  | Yes | Yes | Male   | 35
20-30K | No  | Yes | No  | No  | Male   | 27
30-40K | Yes | No  | No  | No  | Male   | 43
30-40K | Yes | Yes | Yes | No  | Female | 41
40-50K | No  | Yes | Yes | No  | Female | 43
20-30K | No  | Yes | Yes | No  | Male   | 29
50-60K | Yes | Yes | Yes | No  | Female | 39
40-50K | No  | Yes | No  | No  | Male   | 55
20-30K | No  | No  | Yes | Yes | Female | 19

A Hypothesis for the Credit Card Promotion Database


A combination of one or more of the dataset attributes differentiates Acme Credit Card Company card holders who have taken advantage of the life insurance promotion from those card holders who have chosen not to participate in the promotional offer.

Supervised Data Mining Techniques: Production Rules

A Production Rule for the Credit Card Promotion Database


IF Sex = Female & 19 <= Age <= 43
THEN Life Insurance Promotion = Yes
Rule Accuracy: 100.00%
Rule Coverage: 66.67%

Production Rule Accuracy & Coverage


Rule accuracy is a between-class measure. Rule coverage is a within-class measure.

Supervised Data Mining Techniques: Neural Networks

Input Layer

Hidden Layer

Output Layer

Figure 2.2 A multilayer fully connected neural network

Table 2.4 Neural Network Training: Actual and Computed Output

Instance Number | Life Insurance Promotion | Computed Output
1  | 0 | 0.024
2  | 1 | 0.998
3  | 0 | 0.023
4  | 1 | 0.986
5  | 1 | 0.999
6  | 0 | 0.050
7  | 1 | 0.999
8  | 0 | 0.262
9  | 0 | 0.060
10 | 1 | 0.997
11 | 1 | 0.999
12 | 1 | 0.776
13 | 1 | 0.999
14 | 0 | 0.023
15 | 1 | 0.999

Supervised Data Mining Techniques: Statistical Regression


Life insurance promotion = 0.5909 (credit card insurance) - 0.5455 (sex) + 0.7727

2.3 Association Rules

Comparing Association Rules & Production Rules


Association rules can have one or several output attributes. Production rules are limited to one output attribute. With association rules, an output attribute for one rule can be an input attribute for another rule.

Two Association Rules for the Credit Card Promotion Database


IF Sex = Female & Age = over40 & Credit Card Insurance = No
THEN Life Insurance Promotion = Yes

IF Sex = Female & Age = over40
THEN Credit Card Insurance = No & Life Insurance Promotion = Yes

2.4 Clustering Techniques

Cluster 1
  # Instances: 3
  Sex: Male => 3, Female => 0
  Age: 43.3
  Credit Card Insurance: Yes => 0, No => 3
  Life Insurance Promotion: Yes => 0, No => 3

Cluster 2
  # Instances: 5
  Sex: Male => 3, Female => 2
  Age: 37.0
  Credit Card Insurance: Yes => 1, No => 4
  Life Insurance Promotion: Yes => 2, No => 3

Cluster 3
  # Instances: 7
  Sex: Male => 2, Female => 5
  Age: 39.9
  Credit Card Insurance: Yes => 2, No => 5
  Life Insurance Promotion: Yes => 7, No => 0

Figure 2.3 An unsupervised clustering of the credit card database

2.5 Evaluating Performance

Evaluating Supervised Learner Models

Confusion Matrix
A matrix used to summarize the results of a supervised classification. Entries along the main diagonal are correct classifications. Entries other than those on the main diagonal are classification errors.

Table 2.5 A Three-Class Confusion Matrix


   | Computed C1 | Computed C2 | Computed C3
C1 | C11 | C12 | C13
C2 | C21 | C22 | C23
C3 | C31 | C32 | C33

Two-Class Error Analysis

Table 2.6 A Simple Confusion Matrix

       | Computed Accept | Computed Reject
Accept | True Accept     | False Reject
Reject | False Accept    | True Reject

Table 2.7 Two Confusion Matrices Each Showing a 10% Error Rate

Model A | Computed Accept | Computed Reject
Accept  | 600             | 25
Reject  | 75              | 300

Model B | Computed Accept | Computed Reject
Accept  | 600             | 75
Reject  | 25              | 300
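A quick sketch (not from the text) verifying that both matrices in Table 2.7 show the same 10% error rate: the errors are the off-diagonal entries.

```python
def error_rate(matrix):
    """matrix[i][j] = count of class-i instances computed as class j."""
    total = sum(sum(row) for row in matrix)
    errors = sum(matrix[i][j]
                 for i in range(len(matrix))
                 for j in range(len(matrix)) if i != j)
    return errors / total

model_a = [[600, 25], [75, 300]]   # rows: Accept, Reject
model_b = [[600, 75], [25, 300]]

print(error_rate(model_a), error_rate(model_b))
```

Both models misclassify 100 of 1,000 instances, yet Model A makes 75 false accepts and Model B only 25, which may matter greatly depending on the cost of each error type.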

Evaluating Numeric Output


Mean absolute error
Mean squared error
Root mean squared error

Mean Absolute Error


The average absolute difference between classifier predicted output and actual output.

Mean Squared Error


The average of the sum of squared differences between classifier predicted output and actual output.

Root Mean Squared Error


The square root of the mean squared error.
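The three measures above can be sketched in a few lines, here applied to the first five actual/computed output pairs of Table 2.4:

```python
import math

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(mse(actual, predicted))

actual    = [0, 1, 0, 1, 1]
predicted = [0.024, 0.998, 0.023, 0.986, 0.999]

print(mae(actual, predicted), mse(actual, predicted), rmse(actual, predicted))
```

Because squaring magnifies large differences, a few badly predicted instances (such as instance 8 in Table 2.4) raise MSE and RMSE more than MAE.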

Comparing Models by Measuring Lift

[Lift chart: number responding (0 to 1,200) vs. percent sampled (0 to 100) for a targeted mailing and a mass mailing]

Figure 2.4 Targeted vs. mass mailing

Computing Lift
Lift = P(Ci | Sample) / P(Ci | Population)

Table 2.8 Two Confusion Matrices: No Model and an Ideal Model

No Model | Computed Accept | Computed Reject
Accept   | 1,000           | 0
Reject   | 99,000          | 0

Ideal Model | Computed Accept | Computed Reject
Accept      | 1,000           | 0
Reject      | 0               | 99,000

Table 2.9 Two Confusion Matrices for Alternative Models with Lift Equal to 2.25

Model X | Computed Accept | Computed Reject
Accept  | 540             | 460
Reject  | 23,460          | 75,540

Model Y | Computed Accept | Computed Reject
Accept  | 450             | 550
Reject  | 19,550          | 79,450
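The lift formula can be checked against Table 2.9. A sketch, treating the set of computed accepts as the sample:

```python
def lift(true_accepts_in_sample, sample_size, total_accepts, population_size):
    """Lift = P(Ci | sample) / P(Ci | population)."""
    p_sample = true_accepts_in_sample / sample_size
    p_population = total_accepts / population_size
    return p_sample / p_population

# The population of Table 2.8 holds 1,000 accepts among 100,000 instances.
# Model X: 540 true accepts among 24,000 computed accepts.
# Model Y: 450 true accepts among 20,000 computed accepts.
print(lift(540, 24_000, 1_000, 100_000))
print(lift(450, 20_000, 1_000, 100_000))
```

Both models evaluate to a lift of 2.25, the value quoted in the table title, even though Model X reaches more of the true accepts by mailing to a larger sample.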

Unsupervised Model Evaluation

Unsupervised Model Evaluation (cluster quality)


All clustering techniques compute some measure of cluster quality. One evaluation method is to calculate the sum of squared error differences between the instances of each cluster and their cluster center. Smaller values indicate clusters of higher quality.

Supervised Learning for Unsupervised Model Evaluation


1. Designate each formed cluster as a class and assign each class an arbitrary name.
2. Choose a random sample of instances from each class for supervised learning.
3. Build a supervised model from the chosen instances.
4. Employ the remaining instances to test the correctness of the model.

Basic Data Mining Techniques


Chapter 3

3.1 Decision Trees

An Algorithm for Building Decision Trees


1. Let T be the set of training instances.
2. Choose an attribute that best differentiates the instances in T.
3. Create a tree node whose value is the chosen attribute.
   - Create child links from this node where each link represents a unique value for the chosen attribute.
   - Use the child link values to further subdivide the instances into subclasses.
4. For each subclass created in step 3:
   - If the instances in the subclass satisfy predefined criteria, or if the set of remaining attribute choices for this path is null, specify the classification for new instances following this decision path.
   - If the subclass does not satisfy the criteria and there is at least one attribute to further subdivide the path of the tree, let T be the current set of subclass instances and return to step 2.
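A compact sketch of the algorithm above. The slide leaves the attribute-selection measure open; entropy-based information gain is used here as one common choice, and the demonstration data is a reduced form of Table 1.1:

```python
from collections import Counter
import math

def entropy(rows, target):
    """Shannon entropy of the target attribute over `rows`."""
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def build_tree(rows, attributes, target):
    classes = {r[target] for r in rows}
    if len(classes) == 1 or not attributes:      # step 4 stopping criteria
        return Counter(r[target] for r in rows).most_common(1)[0][0]

    # Step 2: choose the attribute whose split leaves the least entropy.
    def remainder(attr):
        values = {r[attr] for r in rows}
        return sum(entropy([r for r in rows if r[attr] == v], target)
                   * sum(1 for r in rows if r[attr] == v) / len(rows)
                   for v in values)

    best = min(attributes, key=remainder)
    node = {}
    for v in {r[best] for r in rows}:            # step 3: one child link per value
        subset = [r for r in rows if r[best] == v]
        node[(best, v)] = build_tree(subset,
                                     [a for a in attributes if a != best],
                                     target)
    return node

rows = [
    {"Swollen Glands": "Yes", "Fever": "Yes", "Diagnosis": "Strep"},
    {"Swollen Glands": "No",  "Fever": "No",  "Diagnosis": "Allergy"},
    {"Swollen Glands": "No",  "Fever": "Yes", "Diagnosis": "Cold"},
    {"Swollen Glands": "Yes", "Fever": "No",  "Diagnosis": "Strep"},
]
print(build_tree(rows, ["Swollen Glands", "Fever"], "Diagnosis"))
```

On this data the algorithm roots the tree at Swollen Glands, matching Figure 1.1.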

Table 3.1 The Credit Card Promotion Database

Income Range | Life Insurance Promotion | Credit Card Insurance | Sex | Age
40-50K | No  | No  | Male   | 45
30-40K | Yes | No  | Female | 40
40-50K | No  | No  | Male   | 42
30-40K | Yes | Yes | Male   | 43
50-60K | Yes | No  | Female | 38
20-30K | No  | No  | Female | 55
30-40K | Yes | Yes | Male   | 35
20-30K | No  | No  | Male   | 27
30-40K | No  | No  | Male   | 43
30-40K | Yes | No  | Female | 41
40-50K | Yes | No  | Female | 43
20-30K | Yes | No  | Male   | 29
50-60K | Yes | No  | Female | 39
40-50K | No  | No  | Male   | 55
20-30K | Yes | Yes | Female | 19

Income Range
  = 20-30K: 2 Yes, 2 No
  = 30-40K: 4 Yes, 1 No
  = 40-50K: 1 Yes, 3 No
  = 50-60K: 2 Yes

Figure 3.1 A partial decision tree with root node = income range


Credit Card Insurance
  = No: 6 Yes, 6 No
  = Yes: 3 Yes, 0 No

Figure 3.2 A partial decision tree with root node = credit card insurance


Age
  <= 43: 9 Yes, 3 No
  > 43: 0 Yes, 3 No

Figure 3.3 A partial decision tree with root node = age

Decision Trees for the Credit Card Promotion Database


Age
  > 43: No (3/0)
  <= 43: Sex
    = Female: Yes (6/0)
    = Male: Credit Card Insurance
      = No: No (4/1)
      = Yes: Yes (2/0)

Figure 3.4 A three-node decision tree for the credit card database


Credit Card Insurance
  = Yes: Yes (3/0)
  = No: Sex
    = Female: Yes (6/1)
    = Male: No (6/1)

Figure 3.5 A two-node decision tree for the credit card database

Table 3.2 Training Data Instances Following the Path in Figure 3.4 to Credit Card Insurance = No

Income Range | Life Insurance Promotion | Credit Card Insurance | Sex | Age
40-50K | No  | No | Male | 42
20-30K | No  | No | Male | 27
30-40K | No  | No | Male | 43
20-30K | Yes | No | Male | 29

Decision Tree Rules

A Rule for the Tree in Figure 3.4


IF Age <= 43 & Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No

A Simplified Rule Obtained by Removing Attribute Age


IF Sex = Male & Credit Card Insurance = No
THEN Life Insurance Promotion = No

Other Methods for Building Decision Trees


CART CHAID

Advantages of Decision Trees


Easy to understand.
Map nicely to a set of production rules.
Applied to real problems.
Make no prior assumptions about the data.
Able to process both numerical and categorical data.

Disadvantages of Decision Trees


Output attribute must be categorical.
Limited to one output attribute.
Decision tree algorithms are unstable.
Trees created from numeric datasets can be complex.

3.2 Generating Association Rules

Confidence and Support

Rule Confidence
Given a rule of the form If A then B, rule confidence is the conditional probability that B is true when A is known to be true.

Rule Support
The minimum percentage of instances in the database that contain all items listed in a given association rule.
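The two measures above can be sketched directly against the data of Table 3.3, here for the rule "IF Magazine Promotion = Yes THEN Life Insurance Promotion = Yes":

```python
def confidence(rows, antecedent, consequent):
    """P(consequent | antecedent) over the dataset."""
    covered = [r for r in rows
               if all(r[a] == v for a, v in antecedent.items())]
    hits = [r for r in covered
            if all(r[c] == v for c, v in consequent.items())]
    return len(hits) / len(covered)

def support(rows, items):
    """Fraction of instances containing every listed item."""
    return sum(all(r[a] == v for a, v in items.items())
               for r in rows) / len(rows)

# Magazine Promotion / Life Insurance Promotion columns of Table 3.3.
mag  = ["Yes", "Yes", "No", "Yes", "Yes", "No", "Yes", "No", "Yes", "Yes"]
life = ["No", "Yes", "No", "Yes", "Yes", "No", "Yes", "No", "No", "Yes"]
rows = [{"Magazine": m, "LifeIns": l} for m, l in zip(mag, life)]

print(confidence(rows, {"Magazine": "Yes"}, {"LifeIns": "Yes"}))  # = 5/7
print(support(rows, {"Magazine": "Yes", "LifeIns": "Yes"}))       # = 5/10
```

The confidence of 5/7 matches the (5/7) annotation given with the corresponding two-item set rule later in this section.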

Mining Association Rules: An Example

Table 3.3 A Subset of the Credit Card Promotion Database

Magazine Promotion | Watch Promotion | Life Insurance Promotion | Credit Card Insurance | Sex
Yes | No  | No  | No  | Male
Yes | Yes | Yes | No  | Female
No  | No  | No  | No  | Male
Yes | Yes | Yes | Yes | Male
Yes | No  | Yes | No  | Female
No  | No  | No  | No  | Female
Yes | No  | Yes | Yes | Male
No  | Yes | No  | No  | Male
Yes | No  | No  | No  | Male
Yes | Yes | Yes | No  | Female

Table 3.4 Single-Item Sets

Single-Item Sets | Number of Items
Magazine Promotion = Yes | 7
Watch Promotion = Yes | 4
Watch Promotion = No | 6
Life Insurance Promotion = Yes | 5
Life Insurance Promotion = No | 5
Credit Card Insurance = No | 8
Sex = Male | 6
Sex = Female | 4

Table 3.5 Two-Item Sets

Two-Item Sets | Number of Items
Magazine Promotion = Yes & Watch Promotion = No | 4
Magazine Promotion = Yes & Life Insurance Promotion = Yes | 5
Magazine Promotion = Yes & Credit Card Insurance = No | 5
Magazine Promotion = Yes & Sex = Male | 4
Watch Promotion = No & Life Insurance Promotion = No | 4
Watch Promotion = No & Credit Card Insurance = No | 5
Watch Promotion = No & Sex = Male | 4
Life Insurance Promotion = No & Credit Card Insurance = No | 5
Life Insurance Promotion = No & Sex = Male | 4
Credit Card Insurance = No & Sex = Male | 4
Credit Card Insurance = No & Sex = Female | 4

Two Possible Two-Item Set Rules


IF Magazine Promotion = Yes
THEN Life Insurance Promotion = Yes (5/7)

IF Life Insurance Promotion = Yes
THEN Magazine Promotion = Yes (5/5)

Three-Item Set Rules


IF Watch Promotion = No & Life Insurance Promotion = No
THEN Credit Card Insurance = No (4/4)

IF Watch Promotion = No
THEN Life Insurance Promotion = No & Credit Card Insurance = No (4/6)

General Considerations
We are interested in association rules that show a lift in product sales where the lift is the result of the product's association with one or more other products. We are also interested in association rules that show a lower than expected confidence for a particular association.

3.3 The K-Means Algorithm


1. Choose a value for K, the total number of clusters.
2. Randomly choose K points as cluster centers.
3. Assign the remaining instances to their closest cluster center.
4. Calculate a new cluster center for each cluster.
5. Repeat steps 3 and 4 until the cluster centers do not change.
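A bare-bones sketch of the steps above, run on the six points of Table 3.6 with K = 2 and the initial centers chosen (for reproducibility, not randomly) as instances 1 and 6:

```python
import math

def k_means(points, centers, iterations=20):
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Step 3: assign each instance to its closest cluster center.
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)),
                    key=lambda i: math.dist(p, centers[i]))
            clusters[i].append(p)
        # Step 4: recompute each center as the mean of its cluster.
        new_centers = [tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else ctr
                       for cl, ctr in zip(clusters, centers)]
        if new_centers == centers:      # step 5: stop when centers settle
            break
        centers = new_centers
    return centers, clusters

points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]
centers, clusters = k_means(points, [(1.0, 1.5), (5.0, 6.0)])
print(centers)
```

With this start the algorithm converges to centers (1.8, 2.7) and (5.0, 6.0), the best of the three outcomes reported in Table 3.7; other random starts can settle on the poorer outcomes, which is why K-Means is usually run several times.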

An Example Using K-Means

Table 3.6 K-Means Input Values

Instance | X | Y
1 | 1.0 | 1.5
2 | 1.0 | 4.5
3 | 2.0 | 1.5
4 | 2.0 | 3.5
5 | 3.0 | 2.5
6 | 5.0 | 6.0


Figure 3.6 A coordinate mapping of the data in Table 3.6

Table 3.7 Several Applications of the K-Means Algorithm (K = 2)

Outcome | Cluster Centers | Cluster Points | Squared Error
1 | (2.67, 4.67), (2.00, 1.83) | {2, 4, 6}, {1, 3, 5} | 14.50
2 | (1.5, 1.5), (2.75, 4.125)  | {1, 3}, {2, 4, 5, 6} | 15.94
3 | (1.8, 2.7), (5, 6)         | {1, 2, 3, 4, 5}, {6} | 9.60


Figure 3.7 A K-Means clustering of the data in Table 3.6 (K = 2)

General Considerations
Requires real-valued data.
We must select the number of clusters present in the data.
Works best when the clusters in the data are of approximately equal size.
Attribute significance cannot be determined.
Lacks explanation capabilities.

3.4 Genetic Learning

Genetic Learning Operators


Crossover
Mutation Selection

Genetic Algorithms and Supervised Learning

[Flow: population elements are scored by a fitness function against the training data; elements are kept or thrown out, and kept elements become candidates for crossover and mutation]
Figure 3.8 Supervised genetic learning

Table 3.8 An Initial Population for Supervised Genetic Learning

Population Element | Income Range | Life Insurance Promotion | Credit Card Insurance | Sex | Age
1 | 20-30K | No  | Yes | Male   | 30-39
2 | 30-40K | Yes | No  | Female | 50-59
3 | ?      | No  | No  | Male   | 40-49
4 | 30-40K | Yes | Yes | Male   | 40-49

Table 3.9 Training Data for Genetic Learning

Training Instance | Income Range | Life Insurance Promotion | Credit Card Insurance | Sex | Age
1 | 30-40K | Yes | Yes | Male   | 30-39
2 | 30-40K | Yes | No  | Female | 40-49
3 | 50-60K | Yes | No  | Female | 30-39
4 | 20-30K | No  | No  | Female | 50-59
5 | 20-30K | No  | No  | Male   | 20-29
6 | 30-40K | No  | No  | Male   | 40-49

Before crossover:

Element | Income Range | Life Insurance Promotion | Credit Card Insurance | Sex | Age
#1 | 20-30K | No  | Yes | Male   | 30-39
#2 | 30-40K | Yes | No  | Female | 50-59

After crossover:

Element | Income Range | Life Insurance Promotion | Credit Card Insurance | Sex | Age
#2 | 30-40K | Yes | Yes | Male   | 30-39
#1 | 20-30K | No  | No  | Female | 50-59

Figure 3.9 A crossover operation

Table 3.10 A Second-Generation Population

Population Element | Income Range | Life Insurance Promotion | Credit Card Insurance | Sex | Age
1 | 20-30K | No  | No  | Female | 50-59
2 | 30-40K | Yes | Yes | Male   | 30-39
3 | ?      | No  | No  | Male   | 40-49
4 | 30-40K | Yes | Yes | Male   | 40-49

Genetic Algorithms and Unsupervised Clustering

[P instances I1 ... Ip, each with attributes a1 ... an, are mapped to K candidate solutions S1 ... SK, each holding cluster-center elements E11, E12, E21, E22, ..., Ek1, Ek2]

Figure 3.10 Unsupervised genetic clustering

Table 3.11 A First-Generation Population for Unsupervised Clustering

 | S1 | S2 | S3
Solution elements (initial population) | (1.0,1.0) (5.0,5.0) | (3.0,2.0) (3.0,5.0) | (4.0,3.0) (5.0,1.0)
Fitness score | 11.31 | 9.78 | 15.55
Solution elements (second generation) | (5.0,1.0) (5.0,5.0) | (3.0,2.0) (3.0,5.0) | (4.0,3.0) (1.0,1.0)
Fitness score | 17.96 | 9.78 | 11.34
Solution elements (third generation) | (5.0,5.0) (1.0,5.0) | (3.0,2.0) (3.0,5.0) | (4.0,3.0) (1.0,1.0)
Fitness score | 13.64 | 9.78 | 11.34

General Considerations
Global optimization is not guaranteed.
The fitness function determines the complexity of the algorithm.
Genetic algorithms can explain their results provided the fitness function is understandable.
Transforming the data to a form suitable for genetic learning can be a challenge.

3.5 Choosing a Data Mining Technique

Initial Considerations
Is learning supervised or unsupervised?
Is explanation required?
What is the interaction between input and output attributes?
What are the data types of the input and output attributes?

Further Considerations
Do We Know the Distribution of the Data?
Do We Know Which Attributes Best Define the Data?
Does the Data Contain Missing Values?
Is Time an Issue?
Which Technique Is Most Likely to Give a Best Test Set Accuracy?

An Excel-based Data Mining Tool


Chapter 4

[Flow: Interface → Data → Preprocessor. Large datasets are routed through a heuristic agent. The mining technique is either neural networks or ESX; when explanation is required, ESX output can be passed to RuleMaker to generate rules. A report generator writes results to Excel sheets.]

Figure 4.1 The iDA system architecture

4.2 ESX: A Multipurpose Tool for Data Mining

Root Level:     Root
Concept Level:  C1  C2  ...  Cn
Instance Level: I11 I12 ... I1j | I21 I22 ... I2k | In1 In2 ... Inl

Figure 4.3 An ESX concept hierarchy

Table 4.1 Credit Card Promotion Database: iDAV Format

Income Range (C I) | Magazine Promotion (C I) | Watch Promotion (C I) | Life Insurance Promotion (C I) | Credit Card Insurance (C I) | Sex (C I) | Age (R I)
40-50K | Yes | No  | No  | No  | Male   | 45
30-40K | Yes | Yes | Yes | No  | Female | 40
40-50K | No  | No  | No  | No  | Male   | 42
30-40K | Yes | Yes | Yes | Yes | Male   | 43
50-60K | Yes | No  | Yes | No  | Female | 38
20-30K | No  | No  | No  | No  | Female | 55
30-40K | Yes | No  | Yes | Yes | Male   | 35
20-30K | No  | Yes | No  | No  | Male   | 27
30-40K | Yes | No  | No  | No  | Male   | 43
30-40K | Yes | Yes | Yes | No  | Female | 41
40-50K | No  | Yes | Yes | No  | Female | 43
20-30K | No  | Yes | Yes | No  | Male   | 29
50-60K | Yes | Yes | Yes | No  | Female | 39
40-50K | No  | Yes | No  | No  | Male   | 55
20-30K | No  | No  | Yes | Yes | Female | 19

Figure 4.10 Class 3 summary results

Knowledge Discovery in Databases


Chapter 5

5.1 A KDD Process Model

Step 1: Goal Identification (defined goals)
Step 2: Create Target Data (from the data warehouse or a transactional database)
Step 3: Data Preprocessing (target data → cleansed data)
Step 4: Data Transformation (cleansed data → transformed data, often a flat file)
Step 5: Data Mining (transformed data → data model)
Step 6: Interpretation & Evaluation
Step 7: Taking Action

Figure 5.1 A seven-step KDD process model

The Scientific Method:
  Define the Problem
  Formulate a Hypothesis
  Perform an Experiment
  Draw Conclusions
  Verify Conclusions

A KDD Process Model:
  Identify the Goal
  Create Target Data
  Data Preprocessing
  Data Transformation
  Data Mining
  Interpretation / Evaluation
  Take Action

Figure 5.2 Applying the scientific method to data mining

Step 1: Goal Identification


Define the Problem.
Choose a Data Mining Tool.
Estimate Project Cost.
Estimate Project Completion Time.
Address Legal Issues.
Develop a Maintenance Plan.

Step 2: Creating a Target Dataset

Figure 5.3 The Acme credit card database

Step 3: Data Preprocessing


Noisy Data Missing Data

Noisy Data
Locate Duplicate Records.
Locate Incorrect Attribute Values.
Smooth Data.

Preprocessing Missing Data


Discard Records With Missing Values.
Replace Missing Real-valued Items With the Class Mean.
Replace Missing Values With Values Found Within Highly Similar Instances.

Processing Missing Data While Learning


Ignore Missing Values.
Treat Missing Values As Equal Compares.
Treat Missing Values As Unequal Compares.

Step 4: Data Transformation


Data Normalization
Data Type Conversion
Attribute and Instance Selection

Data Normalization
Decimal Scaling
Min-Max Normalization
Normalization Using Z-scores
Logarithmic Normalization
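The four normalization methods listed above can each be written in a line or two. A hypothetical sketch, applied to the age attribute of the credit card promotion database:

```python
import math

def decimal_scaling(values):
    """Divide by a power of 10 so every value falls below 1 in magnitude."""
    k = len(str(int(max(abs(v) for v in values))))  # digits in the largest value
    return [v / 10 ** k for v in values]

def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    """Center on the mean and scale by the (population) standard deviation."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

def log_normalize(values):
    """Natural-log transform; requires strictly positive values."""
    return [math.log(v) for v in values]

ages = [45, 40, 42, 43, 38, 55, 35, 27]
print(min_max(ages))
```

Min-max normalization maps the youngest card holder (27) to 0.0 and the oldest (55) to 1.0; which method is appropriate depends on the mining technique and the spread of the attribute.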

Attribute and Instance Selection


Eliminating Attributes
Creating Attributes
Instance Selection

Table 5.1 An Initial Population for Genetic Attribute Selection

Population Element | Income Range | Magazine Promotion | Watch Promotion | Credit Card Insurance | Sex | Age
1 | 1 | 0 | 0 | 1 | 1 | 1
2 | 0 | 0 | 0 | 1 | 0 | 1
3 | 0 | 0 | 0 | 0 | 1 | 1

Step 5: Data Mining


1. Choose training and test data.
2. Designate a set of input attributes.
3. If learning is supervised, choose one or more output attributes.
4. Select learning parameter values.
5. Invoke the data mining tool.

Step 6: Interpretation and Evaluation


Statistical analysis.
Heuristic analysis.
Experimental analysis.
Human analysis.

Step 7: Taking Action


Create a report.
Relocate retail items.
Mail promotional information.
Detect fraud.
Fund new research.

5.9 The Crisp-DM Process Model


1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Deployment

The Data Warehouse


Chapter 6

6.1 Operational Databases

Data Modeling and Normalization


One-to-One Relationships One-to-Many Relationships Many-to-Many Relationships

Data Modeling and Normalization


First Normal Form Second Normal Form Third Normal Form

[Diagram: a Vehicle-Type entity (Type ID, Make, Year) related to a Customer entity (Customer ID, Income Range).]

Figure 6.1 A simple entity-relationship diagram

The Relational Model

Table 6.1a Relational Table for Vehicle-Type

Type ID   Make        Year
4371      Chevrolet   1995
6940      Cadillac    2000
4595      Chevrolet   2001
2390      Cadillac    1997

Table 6.1b Relational Table for Customer

Customer ID   Income Range ($)   Type ID
0001          70-90K             2390
0002          30-50K             4371
0003          70-90K             6940
0004          30-50K             4595
0005          70-90K             2390

Table 6.2 Join of Tables 6.1a and 6.1b

Customer ID   Income Range ($)   Type ID   Make        Year
0001          70-90K             2390      Cadillac    1997
0002          30-50K             4371      Chevrolet   1995
0003          70-90K             6940      Cadillac    2000
0004          30-50K             4595      Chevrolet   2001
0005          70-90K             2390      Cadillac    1997

6.2 Data Warehouse Design

The Data Warehouse


A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process (W.H. Inmon).

Granularity
Granularity is a term used to describe the level of detail of stored information.

[Process diagram: operational database(s) and external data feed an ETL (Extract/Transform/Load) routine that loads the data warehouse; the warehouse in turn supplies a dependent data mart, a decision support system, and reports. An independent data mart is shown loaded separately.]

Figure 6.2 A data warehouse process model

Entering Data into the Warehouse


Independent Data Mart
ETL (Extract, Transform, Load) Routine
Metadata

Structuring the Data Warehouse: Two Methods


Structure the warehouse model using the star schema Structure the warehouse model as a multidimensional array

The Star Schema


Fact Table Dimension Tables Slowly Changing Dimensions

[Star schema: a central fact table (Cardholder Key, Purchase Key, Location Key, Time Key, Amount) linked to four dimension tables — Purchase (Category: Supermarket, Travel & Entertainment, Auto & Vehicle, Retail, Restaurant, Miscellaneous), Time (Month, Day, Quarter, Year), Cardholder (Name, Gender, Income Range), and Location (Street, City, State, Region).]

Figure 6.3 A star schema for credit card purchases

The Multidimensionality of the Star Schema

[Cube diagram: the fact table of Figure 6.3 viewed as a multidimensional cube with Cardholder Key, Purchase Key, Time Key, and Location Key as its dimensions.]

Figure 6.4 Dimensions of the fact table shown in Figure 6.3

Additional Relational Schemas


Snowflake Schema Constellation Schema

[Constellation schema: two fact tables share dimension tables. The promotion fact table (Cardholder Key, Promotion Key, Time Key, Response) links to Promotion (Description, Cost), Time, and Cardholder dimensions; the purchase fact table (Cardholder Key, Purchase Key, Location Key, Time Key, Amount) links to Purchase, Time, Cardholder, and Location dimensions.]

Figure 6.5 A constellation schema for credit card purchases and promotions

Decision Support: Analyzing the Warehouse Data


Reporting Data Analyzing Data Knowledge Discovery

6.3 On-line Analytical Processing

OLAP Operations
Slice: a single-dimension operation
Dice: a multidimensional operation
Roll-up: a higher level of generalization
Drill-down: a greater level of detail
Rotation: view data from a new perspective
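Slice and roll-up can be sketched on a dictionary-based cube; the purchase amounts below are hypothetical, chosen only to make the operations concrete.

```python
from collections import defaultdict

# A tiny cube: (month, category, region) -> total purchase amount.
cube = {
    ("Jan", "Retail", "One"): 120.0, ("Feb", "Retail", "One"): 80.0,
    ("Jan", "Travel", "Two"): 200.0, ("Oct", "Retail", "One"): 60.0,
    ("Nov", "Retail", "One"): 90.0,  ("Dec", "Retail", "One"): 150.0,
}

# Slice: fix a single dimension (here, category = Retail).
retail = {k: v for k, v in cube.items() if k[1] == "Retail"}

# Roll-up: generalize months to quarters and re-aggregate.
quarter = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
           "Oct": "Q4", "Nov": "Q4", "Dec": "Q4"}
rollup = defaultdict(float)
for (month, cat, region), amount in cube.items():
    rollup[(quarter[month], cat, region)] += amount

print(rollup[("Q4", "Retail", "One")])  # 60 + 90 + 150 = 300.0
```

Drill-down is simply the reverse mapping, moving from quarters back to the stored monthly detail.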

[Cube diagram with dimensions Month (Jan.-Dec.), Category (Supermarket, Restaurant, Travel, Vehicle, Retail, Miscellaneous), and Region (One-Four). The highlighted cell holds Month = Dec., Category = Vehicle, Region = Two, Amount = 6,720, Count = 110.]

Figure 6.6 A multidimensional cube for credit card purchases

Concept Hierarchy
A mapping that allows attributes to be viewed from varying levels of detail.

Region

State

City

Street Address

Figure 6.7 A concept hierarchy for location

[The cube of Figure 6.6 after a roll-up: the Month dimension is generalized to quarters (Q1-Q4). The highlighted cell holds Month = Oct./Nov./Dec., Category = Supermarket, Region = One.]

Figure 6.8 Rolling up from months to quarters

Formal Evaluation Techniques


Chapter 7

7.1 What Should Be Evaluated?


1. Supervised Model
2. Training Data
3. Attributes
4. Model Builder
5. Parameters
6. Test Set Evaluation

[Diagram: instances, data, and attributes are divided into training data and test data; the model builder applies parameter settings to the training data to produce a supervised model, which is then evaluated against the test data.]

Figure 7.1 Components for supervised learning

Single-Valued Summary Statistics


Mean Variance Standard deviation

The Normal Distribution

[Bell curve showing areas under the normal distribution: 34.13% of values lie within one standard deviation on either side of the mean, 13.54% between one and two standard deviations, 2.14% between two and three, and 0.13% beyond three.]

Figure 7.2 A normal distribution

Normal Distributions & Sample Means


A distribution of means taken from random sets of independent samples of equal size is distributed normally. Any sample mean will vary less than two standard errors from the population mean 95% of the time.

A Classical Model for Hypothesis Testing


P = (X1 - X2) / sqrt(v1/n1 + v2/n2)

where P is the significance score;
X1 and X2 are sample means for the independent samples;
v1 and v2 are variance scores for the respective means;
n1 and n2 are corresponding sample sizes.

Equation 7.2

Table 7.1 A Confusion Matrix for the Null Hypothesis

                          Computed Accept   Computed Reject
Accept Null Hypothesis    True Accept       Type 1 Error
Reject Null Hypothesis    Type 2 Error      True Reject

7.3 Computing Test Set Confidence Intervals


Classifier Error Rate (E) = (# of test set errors) / (# of test set instances)

Equation 7.3

Computing 95% Confidence Intervals


1. Given a test set sample S of size n, compute the error rate E.
2. Compute the sample variance as V = E(1 - E).
3. Compute the standard error as SE = sqrt(V / n).
4. Calculate the upper bound error as E + 2(SE).
5. Calculate the lower bound error as E - 2(SE).
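The five steps above reduce to a small Python function; the 10-errors-in-100-instances example is made up for illustration.

```python
import math

def error_confidence_interval(errors, n):
    """95% confidence bounds for a test set error rate."""
    e = errors / n                 # step 1: sample error rate
    v = e * (1 - e)                # step 2: sample variance
    se = math.sqrt(v / n)          # step 3: standard error
    return e - 2 * se, e + 2 * se  # steps 4-5: lower and upper bounds

low, high = error_confidence_interval(10, 100)  # 10 errors on 100 instances
print(round(low, 3), round(high, 3))            # 0.04 0.16
```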

Cross Validation
Used when ample test data is not available.
Partition the dataset into n fixed-size units.
n - 1 units are used for training and the nth unit is used as a test set.
Repeat this process until each of the fixed-size units has been used as test data.
Model correctness is taken as the average of all training-test trials.
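The procedure above can be sketched as a generic loop. Here `train_and_score` stands in for whatever model builder is used, and the toy scoring function at the bottom is purely illustrative.

```python
import random

def cross_validate(instances, n_folds, train_and_score):
    """Average accuracy over n train/test trials."""
    data = instances[:]
    random.shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(train_and_score(train, test))
    return sum(scores) / n_folds

# Toy stand-in for a real model builder: "accuracy" is the test fold's
# share of its expected size, so every trial scores 1.0 here.
score = cross_validate(list(range(20)), 5, lambda tr, te: len(te) / 4)
print(score)  # 1.0
```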

Bootstrapping
Used when ample training and test data is not available. Bootstrapping allows instances to appear more than once in the training data.
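A sketch of bootstrap sampling: instances are drawn with replacement, so some appear more than once in the training data, and (on average) roughly a third are left out and can serve as test data.

```python
import random

def bootstrap_split(instances):
    """Sample n instances with replacement for training;
    instances never drawn form the test set."""
    n = len(instances)
    train = [random.choice(instances) for _ in range(n)]
    test = [x for x in instances if x not in train]
    return train, test

train, test = bootstrap_split(list(range(100)))
print(len(train), len(test))  # train is always 100; about 37 are left out
```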

7.4 Comparing Supervised Learner Models

Comparing Models with Independent Test Data


P = |E1 - E2| / sqrt(q(1 - q)(1/n1 + 1/n2))

where E1 = The error rate for model M1 E2 = The error rate for model M2 q = (E1 + E2)/2 n1 = the number of instances in test set A n2 = the number of instances in test set B

Equation 7.4
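Equation 7.4 can be computed directly; the error rates and test set sizes below are made up for illustration. By the two-standard-errors criterion stated earlier in the chapter, a P value of 2 or more suggests a significant difference between the models.

```python
import math

def compare_models(e1, e2, n1, n2):
    """P statistic of Equation 7.4 for models tested
    on independent test sets of sizes n1 and n2."""
    q = (e1 + e2) / 2
    return abs(e1 - e2) / math.sqrt(q * (1 - q) * (1 / n1 + 1 / n2))

print(round(compare_models(0.20, 0.25, 100, 100), 3))  # 0.847
```

Since 0.847 < 2, this hypothetical difference in error rates would not be judged significant.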

Comparing Models with a Single Test Dataset


P = |E1 - E2| / sqrt(q(1 - q)(2/n))

where E1 = The error rate for model M1 E2 = The error rate for model M2 q = (E1 + E2)/2 n = the number of test set instances

Equation 7.5

7.5 Attribute Evaluation

Locating Redundant Attributes with Excel


Correlation Coefficient Positive Correlation Negative Correlation Curvilinear Relationship

Creating a Scatterplot Diagram with MS Excel

Hypothesis Testing for Numerical Attribute Significance


Pij = (Xi - Xj) / sqrt(vi/ni + vj/nj)

where Xi is the class i mean and Xj is the class j mean for attribute A;
vi is the class i variance and vj is the class j variance for attribute A;
ni is the number of instances in Ci and nj is the number of instances in Cj.

Equation 7.6

7.6 Unsupervised Evaluation Techniques


Unsupervised Clustering for Supervised Evaluation Supervised Evaluation for Unsupervised Clustering Additional Methods

7.7 Evaluating Supervised Models with Numeric Output

Mean Squared Error


mse = [(a1 - c1)^2 + (a2 - c2)^2 + ... + (ai - ci)^2 + ... + (an - cn)^2] / n

where for the ith instance,
ai = actual output value
ci = computed output value

Equation 7.7

Mean Absolute Error


mae = (|a1 - c1| + |a2 - c2| + ... + |an - cn|) / n

where for the ith instance,
ai = actual output value
ci = computed output value

Equation 7.8
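Equations 7.7 and 7.8 translate directly into code; the actual and computed output values below are made-up numbers for illustration.

```python
def mse(actual, computed):
    """Mean squared error (Equation 7.7)."""
    return sum((a - c) ** 2 for a, c in zip(actual, computed)) / len(actual)

def mae(actual, computed):
    """Mean absolute error (Equation 7.8)."""
    return sum(abs(a - c) for a, c in zip(actual, computed)) / len(actual)

actual = [10.0, 12.0, 9.0, 11.0]
computed = [9.0, 12.5, 9.5, 13.0]
print(mse(actual, computed), mae(actual, computed))  # 1.375 1.0
```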

Neural Networks
Chapter 8

8.1 Feed-Forward Neural Networks

[Network diagram: three input layer nodes (Nodes 1-3, holding the values 1.0, 0.4, and 0.7) are fully connected through weights W1j, W2j, W3j and W1i, W2i, W3i to two hidden layer nodes (j and i), which connect through Wjk and Wik to a single output layer node k.]

Figure 8.1 A fully connected feed-forward neural network

The Sigmoid Function


f(x) = 1 / (1 + e^-x)

where e is the base of natural logarithms, approximated by 2.718282.

Equation 8.2
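Equation 8.2 in code form; the function squashes any real input into the open interval (0, 1), which is why it is the usual node activation function for feed-forward networks.

```python
import math

def sigmoid(x):
    """The sigmoid function of Equation 8.2."""
    return 1.0 / (1.0 + math.exp(-x))

for x in (-6, 0, 6):
    print(x, round(sigmoid(x), 4))  # approximately 0.0025, 0.5, 0.9975
```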

[Plot of f(x) against x for x from -6 to 6: the sigmoid curve rises smoothly from near 0 to near 1, passing through 0.5 at x = 0.]

Figure 8.2 The sigmoid function

Supervised Learning with FeedForward Networks


Backpropagation Learning Genetic Learning

Unsupervised Clustering with Self-Organizing Maps

[Network diagram: two input layer nodes (Node 1, Node 2) connect to a 3x3 grid of output layer nodes.]

Figure 8.3 A 3x3 Kohonen network with two input layer nodes

8.3 Neural Network Explanation


Sensitivity Analysis Average Member Technique

8.4 General Considerations


What input attributes will be used to build the network?
How will the network output be represented?
How many hidden layers should the network contain?
How many nodes should there be in each hidden layer?
What condition will terminate network training?

Neural Network Strengths


Work well with noisy data.
Can process numeric and categorical data.
Appropriate for applications requiring a time element.
Have performed well in several domains.
Appropriate for supervised learning and unsupervised clustering.

Weaknesses
Lack explanation capabilities.
May not provide optimal solutions to problems.
Overtraining can be a problem.

Statistical Techniques
Chapter 10

10.1 Linear Regression Analysis


f(x1, x2, x3, ..., xn) = a1x1 + a2x2 + a3x3 + ... + anxn + c

Equation 10.1
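For the single-attribute case of Equation 10.1, the least-squares slope and intercept have a closed form; a sketch using made-up points that lie exactly on a line:

```python
def fit_line(xs, ys):
    """Least-squares estimates of slope a and intercept c for f(x) = a*x + c,
    the single-attribute case of Equation 10.1."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

a, c = fit_line([1, 2, 3, 4], [3, 5, 7, 9])  # points on y = 2x + 1
print(a, c)  # 2.0 1.0
```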

Multiple Linear Regression with Excel

Regression Trees

[Tree diagram: Test 1 branches (< and >=) to Test 2 and Test 3. Test 2 branches to linear regression models LRM1 and LRM2; Test 3 branches to LRM3 and Test 4; Test 4 branches to LRM4 and LRM5.]

Figure 10.2 A generic model tree

10.2 Logistic Regression

Transforming the Linear Regression Model


Logistic regression is a nonlinear regression technique that associates a conditional probability with each data instance.

The Logistic Regression Model


p(y = 1 | x) = e^(ax + c) / (1 + e^(ax + c))

where e is the base of natural logarithms, often denoted as exp.

Equation 10.7

10.3 Bayes Classifier


P(H | E) = [P(E | H) P(H)] / P(E)

where H is the hypothesis to be tested and E is the evidence associated with H.

Equation 10.9

Bayes Classifier: An Example

Table 10.4 Data for Bayes Classifier

Magazine    Watch       Life Insurance   Credit Card
Promotion   Promotion   Promotion        Insurance     Sex
Yes         No          No               No            Male
Yes         Yes         Yes              Yes           Female
No          No          No               No            Male
Yes         Yes         Yes              Yes           Male
Yes         No          Yes              No            Female
No          No          No               No            Female
Yes         Yes         Yes              Yes           Male
No          No          No               No            Male
Yes         No          No               No            Male
Yes         Yes         Yes              No            Female

The Instance to be Classified


Magazine Promotion = Yes Watch Promotion = Yes Life Insurance Promotion = No Credit Card Insurance = No Sex = ?

Table 10.5 Counts and Probabilities for Attribute Sex

                    Magazine         Watch            Life Insurance   Credit Card
                    Promotion        Promotion        Promotion        Insurance
Sex                 Male    Female   Male    Female   Male    Female   Male    Female
Yes                 4       3        2       2        2       3        2       1
No                  2       1        4       2        4       1        4       3
Ratio: yes/total    4/6     3/4      2/6     2/4      2/6     3/4      2/6     1/4
Ratio: no/total     2/6     1/4      4/6     2/4      4/6     1/4      4/6     3/4

Computing The Probability For Sex = Male


P(sex = male | E) = [P(E | sex = male) P(sex = male)] / P(E)

Equation 10.10

Conditional Probabilities for Sex = Male


P(magazine promotion = yes | sex = male) = 4/6
P(watch promotion = yes | sex = male) = 2/6
P(life insurance promotion = no | sex = male) = 4/6
P(credit card insurance = no | sex = male) = 4/6
P(E | sex = male) = (4/6)(2/6)(4/6)(4/6) = 8/81

The Probability for Sex=Male Given Evidence E


P(sex = male | E) ≈ 0.0593 / P(E)

The Probability for Sex=Female Given Evidence E


P(sex = female | E) ≈ 0.0281 / P(E)
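The whole computation can be checked directly in Python; every probability below is read from Tables 10.4 and 10.5.

```python
# Conditional probabilities for evidence E: magazine = yes, watch = yes,
# life insurance promotion = no, credit card insurance = no (Table 10.5).
p_e_male = (4/6) * (2/6) * (4/6) * (4/6)    # = 8/81
p_e_female = (3/4) * (2/4) * (1/4) * (3/4)  # = 18/256

# Priors: 6 of the 10 instances in Table 10.4 are male.
score_male = p_e_male * (6 / 10)
score_female = p_e_female * (4 / 10)

print(round(score_male, 4), round(score_female, 4))  # 0.0593 0.0281
print("male" if score_male > score_female else "female")  # male
```

Since both scores share the same denominator P(E), comparing the numerators is enough: the classifier predicts sex = male.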

Zero-Valued Attribute Counts


(n + kp) / (d + k)

where k is a value between 0 and 1 (usually 1), and p is an equal fractional part of the total number of possible values for the attribute.

Equation 10.12

Missing Data
With Bayes classifier missing data items are ignored.

Numeric Data
f(x) = [1 / (sqrt(2π) s)] e^(-(x - m)^2 / (2s^2))

where
e = the exponential function
m = the class mean for the given numerical attribute
s = the class standard deviation for the attribute
x = the attribute value

Equation 10.13

10.4 Clustering Algorithms

Agglomerative Clustering
1. Place each instance into a separate partition.
2. Until all instances are part of a single cluster:
   a. Determine the two most similar clusters.
   b. Merge the clusters chosen into a single cluster.
3. Choose a clustering formed by one of the step 2 iterations as a final result.
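The merge loop above can be sketched in a few lines. Stopping when k clusters remain is one way to realize step 3; the single-link similarity function and the 1-D points are illustrative choices, not prescribed by the algorithm.

```python
def agglomerate(instances, similarity, k):
    """Merge the two most similar clusters until k clusters remain."""
    clusters = [[x] for x in instances]           # step 1: singletons
    while len(clusters) > k:                      # step 2
        i, j = max(
            ((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
            key=lambda ab: similarity(clusters[ab[0]], clusters[ab[1]]),
        )                                         # 2a: most similar pair
        clusters[i] = clusters[i] + clusters[j]   # 2b: merge
        del clusters[j]
    return clusters

# Single-link similarity on 1-D points: negated distance of closest members.
sim = lambda a, b: -min(abs(x - y) for x in a for y in b)
print(agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], sim, 3))
# [[1.0, 1.2], [5.0, 5.1], [9.0]]
```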

Conceptual Clustering
1. Create a cluster with the first instance as its only member.
2. For each remaining instance, take one of two actions at each tree level:
   a. Place the new instance into an existing cluster.
   b. Create a new concept cluster having the new instance as its only member.

Expectation Maximization
The EM (expectation-maximization) algorithm is a statistical technique that makes use of the finite Gaussian mixtures model.

Expectation Maximization
A mixture is a set of n probability distributions where each distribution represents a cluster. The mixtures model assigns each data instance a probability that it would have a certain set of attribute values given it was a member of a specified cluster.

Expectation Maximization
The EM algorithm is similar to the K-Means procedure in that a set of parameters is recomputed until a desired convergence is achieved. In the simplest case, there are two clusters, a single real-valued attribute, and the probability distributions are normal.

EM Algorithm (two-class, one attribute scenario)


1. Guess initial values for the five parameters.
2. Until a termination criterion is achieved:
   a. Use the probability density function for normal distributions to compute the cluster probability for each instance.
   b. Use the probability scores assigned to each instance in step 2(a) to re-estimate the parameters.
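A sketch of the two-class, one-attribute case. The initial guesses, the fixed iteration count used as the termination criterion, and the sample data are all arbitrary choices made for illustration.

```python
import math

def normal_pdf(x, m, s):
    # Probability density function for a normal distribution.
    return math.exp(-(x - m) ** 2 / (2 * s * s)) / (math.sqrt(2 * math.pi) * s)

def em_two_clusters(data, iterations=50):
    # Step 1: guess the five parameters (two means, two standard
    # deviations, and one mixing weight).
    m1, m2 = min(data), max(data)
    s1 = s2 = (m2 - m1) / 4 or 1.0
    w = 0.5
    for _ in range(iterations):  # step 2: fixed-count termination criterion
        # 2(a) expectation: cluster-1 probability for each instance.
        p1 = [w * normal_pdf(x, m1, s1) /
              (w * normal_pdf(x, m1, s1) + (1 - w) * normal_pdf(x, m2, s2))
              for x in data]
        # 2(b) maximization: re-estimate the five parameters.
        n1 = sum(p1)
        n2 = len(data) - n1
        m1 = sum(p * x for p, x in zip(p1, data)) / n1
        m2 = sum((1 - p) * x for p, x in zip(p1, data)) / n2
        s1 = math.sqrt(sum(p * (x - m1) ** 2 for p, x in zip(p1, data)) / n1) or 1e-6
        s2 = math.sqrt(sum((1 - p) * (x - m2) ** 2 for p, x in zip(p1, data)) / n2) or 1e-6
        w = n1 / len(data)
    return m1, m2

means = sorted(em_two_clusters([1.0, 1.1, 0.9, 5.0, 5.2, 4.8]))
print([round(m, 1) for m in means])  # means settle near 1.0 and 5.0
```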

Specialized Techniques
Chapter 11

11.1 Time-Series Analysis


Time-series Problems: Prediction applications with one or more timedependent attributes.

Table 11.1 Weekly Average Closing Prices for the Nasdaq and Dow Jones Industrial Average

Week     Nasdaq    Dow        Nasdaq-1   Dow-1      Nasdaq-2   Dow-2
         Average   Average    Average    Average    Average    Average
200003   4176.75   11413.28   3968.47    11587.96   3847.25    11224.10
200004   4052.01   10967.60   4176.75    11413.28   3968.47    11587.96
200005   4104.28   10992.38   4052.01    10967.60   4176.75    11413.28
200006   4398.72   10726.28   4104.28    10992.38   4052.01    10967.60
200007   4445.53   10506.68   4398.72    10726.28   4104.28    10992.38
200008   4535.15   10121.31   4445.53    10506.68   4398.72    10726.28
200009   4745.58   10167.38   4535.15    10121.31   4445.53    10506.68
200010   4949.09   9952.52    4745.58    10167.38   4535.15    10121.31
200011   4742.40   10223.11   4949.09    9952.52    4745.58    10167.38
200012   4818.01   10937.36   4742.40    10223.11   4949.09    9952.52

11.2 Mining the Web

Web-Based Mining
(identifying the goal)
Decrease the average number of pages visited by a customer before a purchase transaction.
Increase the average number of pages viewed per user session.
Increase Web server efficiency.
Personalize Web pages for customers.
Determine those products that tend to be purchased or viewed together.
Decrease the total number of item returns.
Increase visitor retention rates.

Web-Based Mining
(preparing the data)
Data is stored in Web server log files, typically in the form of clickstream sequences Server log files provide information in extended common log file format

Extended Common Log File Format


Host Address Date/Time Request Status Bytes Referring Page Browser Type

Extended Common Log File Format

80.202.8.93 - - [16/Apr/2002:22:43:28 -0600] "GET /grbts/images/msu-new-color.gif HTTP/1.1" 200 5006 "http://grb.mnsu.edu/doc/index.html" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 2000) Opera 6.01 [nb]

134.29.41.219 - - [17/Apr/2002:19:23:30 -0600] "GET /resindoc/images/resin_powered.gif HTTP/1.1" 200 571 "http://grb.mnsu.edu/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Q312461)"
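The fields of this format can be pulled apart with a regular expression. The pattern below is one plausible way to do it, applied to the second log entry above; real log preparation would also need to tolerate malformed lines.

```python
import re

# Named groups for the extended common log file format fields listed above.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<datetime>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<browser>[^"]*)"'
)

line = ('134.29.41.219 - - [17/Apr/2002:19:23:30 -0600] '
        '"GET /resindoc/images/resin_powered.gif HTTP/1.1" 200 571 '
        '"http://grb.mnsu.edu/" '
        '"Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Q312461)"')

entry = LOG_PATTERN.match(line).groupdict()
print(entry["host"], entry["status"], entry["referrer"])
```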

Preparing the Data


(the session file)
A session file is a file created by the data preparation process.

Each instance of a session file represents a single user session.

Preparing the Data


(the session file)
A user session is a set of pageviews requested by a single user from a single Web server. A pageview contains one or more page files each forming a display window in a Web browser. Each pageview is tagged with a unique uniform resource identifier (URI).

[Pipeline diagram: Web server logs → data preparation → session file → data mining algorithm(s) → learner model.]

Figure 11.1 A generic Web usage model

Preparing the Data


(the session file)
Creating the session file is difficult:
Identify individual users in a log file.
Host addresses are of limited help.
Host address combined with referring page is beneficial.
One user page request may generate multiple log file entries from several types of servers.
Easiest when sites are allowed to use cookies.

Web-Based Mining (mining the data)


Traditional techniques such as association rule generators or clustering methods can be applied.
Sequence miners, which are special data mining algorithms used to discover frequently accessed Web pages that occur in the same order, are often used.

Web-Based Mining (evaluating results)


Consider four hypothetical pageview instances
P5 P4 P10 P3 P15 P2 P1
P2 P4 P10 P8 P15 P4 P15 P1
P4 P3 P7 P11 P14 P8 P2 P10
P1 P3 P10 P11 P4 P15 P9

Evaluating Results (association rules)


An association rule generator outputs the following rule from our session data.
IF P4 & P10 THEN P15 {3/4}

This rule states that P4, P10, and P15 appear together in three session instances. Also, all four instances have P4 and P10 appearing in the same session instance.

Evaluating Results
(unsupervised clustering)
Use agglomerative clustering to place session instances into clusters.

Instance similarity is computed by dividing the number of distinct pageviews a pair of instances share by the total number of distinct pageviews contained within the two instances.

Evaluating Results
(unsupervised clustering)
Consider the following session instances:
P5 P4 P10 P3 P15 P2 P1
P2 P4 P10 P8 P15 P4 P15 P1

The two instances share five distinct pageviews (P1, P2, P4, P10, P15) out of eight distinct pageviews in total, so the computed similarity is 5/8 = 0.625.

Evaluating Results
(summary statistics)
Summary statistics about the activities taking place at a Web site can be obtained using a Web server log analyzer. The output of the analyzer is an aggregation of log file data displayed in graphical format.

Web-Based Mining
(Taking Action)
Implement a strategy based on created user profiles to personalize the Web pages viewed by site visitors.
Adapt the indexing structure of a Web site to better reflect the paths followed by typical users.
Set up online advertising promotions for registered Web site customers.
Send e-mail to promote products of likely interest to a select group of registered customers.
Modify the content of a Web site by grouping products likely to be purchased together, removing products of little interest, and expanding the offerings of high-demand products.

Data Mining for Web Site Evaluation


Web site evaluation is concerned with determining whether the actual use of a site matches the intentions of its designer.

Data Mining for Web Site Evaluation


Data mining can help with site evaluation by determining the frequent patterns and routes traveled by the user population. Sequential ordering of pageviews is of primary interest. Sequence miners are used to determine pageview order sequencing.

Data Mining for Personalization


The goal of personalization is to present Web users with what interests them without requiring them to ask for it directly. Manual techniques force users to register at a Web site and to fill in questionnaires. Data mining can be used to automate personalization.

Data Mining for Personalization


Automatic personalization is accomplished by creating usage profiles from stored session data.

Data Mining for Web Site Adaptation


The index synthesis problem: Given a Web site and a visitor access log, create new index pages containing collections of links to related but currently unlinked pages.

11.3 Mining Textual Data


Train: Create an attribute dictionary. Filter: Remove common words. Classify: Classify new documents.

11.4 Improving Performance


Bagging Boosting Instance Typicality

Data Mining Standards


Grossman, R.L., Hornick, M.F., Meyer, G., Data Mining Standards Initiatives, Communications of the ACM, August 2002, Vol. 45, No. 8.

Privacy & Data Mining


Inference is the process of users posing queries and deducing unauthorized information from the legitimate responses that they receive. Data mining offers sophisticated tools to deduce sensitive patterns from data.

Privacy & Data Mining (an example)


Unnamed health records are public information. People's names are public. Associating people with their individual health records is private information.

Privacy & Data Mining (an example)


Former employees have their employment records stored in a data warehouse. An employer uses data mining to build a classification model to differentiate employees relative to their termination:
They quit.
They were fired.
They were laid off.
They retired.
The employer now uses the model to classify current employees. He fires employees likely to quit and lays off employees likely to retire. Is this ethical?

Privacy & Data Mining (handling the inference problem)


Given a database and a data mining tool, apply the tool to determine if sensitive information can be deduced.
Use an inference controller to detect the motives of the user.
Give only samples of the data to the user, thereby preventing the user from building a data mining model.

Privacy & Data Mining

Thuraisingham, B., Web Data Mining and Applications in Business Intelligence and Counter-Terrorism, CRC Press, 2003.

Data Mining Software


http://datamining.itsc.uah.edu/adam/binary.html
http://www.cs.waikato.ac.nz/ml/weka/
http://magix.fri.uni-lj.si/orange/
http://www.kdnuggets.com
http://grb.mnsu.edu/grbts/ts.jsp

Data Mining Textbooks


Berry, M.J., Linoff, G., Data Mining Techniques: For Marketing, Sales, and Customer Support, Wiley, 1997.
Han, J., Kamber, M., Data Mining: Concepts and Techniques, Academic Press, 2001.
Roiger, R.J., Geatz, M.W., Data Mining: A Tutorial-Based Primer, Addison-Wesley, 2003.
Tan, P., Steinbach, M., Kumar, V., Introduction to Data Mining, Addison-Wesley, 2005.
Witten, I.H., Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Academic Press, 2000.

Data Mining Resources


AI Magazine
Communications of the ACM
SIGKDD Explorations
Computer Magazine
PC AI
IEEE Transactions on Knowledge and Data Engineering

Data Mining A Tutorial-Based Primer


Part I: Data Mining Fundamentals
Part II: Tools for Knowledge Discovery
Part III: Advanced Data Mining Techniques
Part IV: Intelligent Systems

Part I: Data Mining Fundamentals


Chapter 1: Data Mining: A First View
Chapter 2: Data Mining: A Closer Look
Chapter 3: Basic Data Mining Techniques
Chapter 4: An Excel-Based Data Mining Tool

Part II: Tools for Knowledge Discovery


Chapter 5: Knowledge Discovery in Databases
Chapter 6: The Data Warehouse
Chapter 7: Formal Evaluation Techniques

Part III: Advanced Data Mining Techniques


Chapter 8: Neural Networks
Chapter 9: Building Neural Networks with IDA
Chapter 10: Statistical Techniques
Chapter 11: Specialized Techniques

Part IV: Intelligent Systems


Chapter 12: Rule-Based Systems
Chapter 13: Managing Uncertainty in Rule-Based Systems
Chapter 14: Intelligent Agents
