Richard J. Roiger, Computer and Information Sciences Dept., Minnesota State University, Mankato, USA. Email: richard.roiger@mnsu.edu. Web site: krypton.mnsu.edu/~roiger
Designed for university instructors teaching in information science or computer science departments who wish to introduce a data mining course or unit into their curriculum.
Appropriate for anyone interested in a detailed overview of data mining as a problem-solving tool. Will emphasize material found in the text Data Mining: A Tutorial-Based Primer, published by Addison-Wesley in 2003. Additional materials covering the most recent trends in data mining will also be presented. Participants will have the opportunity to experience the data mining process. Each participant will receive a complimentary copy of the aforementioned text together with a CD containing PowerPoint slides and a student version of IDA.
Questions to Answer
What constitutes data mining? Where does data mining fit in a CS or IS curriculum? Can I use data mining to solve my problem? How do I use data mining to solve my problem?
Web applications
crawler vs. human being
user browsing habits
Scientific applications
earthquake detection
gamma-ray bursts
Minimum
1 1
Maximum
5 1
Minimum
0 3
Maximum
0 3
Data Mining
The process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data.
Supervised Learning
Build a learner model using data instances of known origin. Use the model to determine the outcome of new instances of unknown origin.
Decision Tree
A tree structure where non-terminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes.
Sore Throat   Fever   Swollen Glands   Congestion   Headache   Diagnosis
Yes           Yes     Yes              Yes          Yes        Strep throat
No            No      No               Yes          Yes        Allergy
Yes           Yes     No               Yes          No         Cold
Yes           No      Yes              No           No         Strep throat
No            Yes     No               Yes          No         Cold
No            No      No               Yes          No         Allergy
No            No      Yes              No           No         Strep throat
Yes           No      No               Yes          Yes        Allergy
No            Yes     No               Yes          Yes        Cold
Yes           Yes     No               Yes          Yes        Cold
Swollen Glands
|-- Yes: Diagnosis = Strep Throat
|-- No:  Fever
         |-- Yes: Diagnosis = Cold
         |-- No:  Diagnosis = Allergy
Sore Throat   Fever   Swollen Glands   Congestion   Headache   Diagnosis
No            No      Yes              Yes          Yes        ?
Yes           Yes     No               No           Yes        ?
No            No      No               No           Yes        ?
Production Rules
IF Swollen Glands = Yes THEN Diagnosis = Strep Throat
IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold
IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy
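The three rules translate directly into executable form. A minimal sketch in Python (the function and parameter names are illustrative, not from the text):

```python
def diagnose(swollen_glands, fever):
    """Apply the production rules generated from the decision tree."""
    # IF Swollen Glands = Yes THEN Diagnosis = Strep Throat
    if swollen_glands:
        return "Strep Throat"
    # IF Swollen Glands = No & Fever = Yes THEN Diagnosis = Cold
    if fever:
        return "Cold"
    # IF Swollen Glands = No & Fever = No THEN Diagnosis = Allergy
    return "Allergy"

print(diagnose(swollen_glands=True, fever=False))   # Strep Throat
print(diagnose(swollen_glands=False, fever=False))  # Allergy
```

Note that the rule order matters: the Swollen Glands test is applied first, exactly as in the tree.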
Unsupervised Clustering
A data mining method that builds models from data without predefined classes.
Account Type   Margin Account   Transaction Method   Trades/Month   Sex   Age     Favorite Recreation   Annual Income
Joint          No               Online               12.5           F     30-39   Tennis                40-59K
Custodial      No               Broker               0.5            F     50-59   Skiing                80-99K
Joint          No               Online               3.6            M     20-29   Golf                  20-39K
Individual     Yes              Broker               22.3           M     30-39   Fishing               40-59K
Individual     Yes              Online               5.0            M     40-49   Golf                  60-79K
Can I develop a general profile of an online investor?
Can I determine if a new customer is likely to open a margin account?
Can I build a model to predict the average number of trades per month for a new investor?
What characteristics differentiate female and male investors?
Shallow Knowledge
Shallow knowledge is factual. It can be easily stored and manipulated in a database.
Multidimensional Knowledge
Multidimensional knowledge is also factual. Online Analytical Processing (OLAP) tools are used to manipulate multidimensional knowledge.
Hidden Knowledge
Hidden knowledge represents patterns or regularities in data that cannot be easily found using a database query. However, data mining algorithms can find such patterns with ease.
Expert System
A computer program that emulates the problem-solving skills of one or more human experts.
Knowledge Engineer
A person trained to interact with an expert in order to capture the expert's knowledge.
Data
Human Expert
Knowledge Engineer
Operational Database
SQL Queries
Data Warehouse
Data Mining
Result Application
Unsupervised Clustering
Supervised Learning
Classification
Estimation
Prediction
Attribute Name | Mixed Values | Numeric Values | Comments
Age | Numeric | Numeric | Age in years
Sex | Male, Female | 1, 0 | Patient gender
Chest Pain Type | Angina, Abnormal Angina, NoTang, Asymptomatic | 1-4 | NoTang = Nonanginal pain
Blood Pressure | Numeric | Numeric | Resting blood pressure upon hospital admission
Cholesterol | Numeric | Numeric | Serum cholesterol
Fasting Blood Sugar < 120 | True, False | 1, 0 | Is fasting blood sugar less than 120?
Resting ECG | Normal, Abnormal, Hyp | 0, 1, 2 | Hyp = Left ventricular hypertrophy
Maximum Heart Rate | Numeric | Numeric | Maximum heart rate achieved
Induced Angina? | True, False | 1, 0 | Does the patient experience angina as a result of exercise?
Old Peak | Numeric | Numeric | ST depression induced by exercise relative to rest
Slope | Up, Flat, Down | 1-3 | Slope of the peak exercise ST segment
Number of Colored Vessels | 0, 1, 2, 3 | 0, 1, 2, 3 | Number of major vessels colored by fluoroscopy
Thal | Normal, Fix, Rev | 3, 6, 7 | Normal, fixed defect, reversible defect
Class | Healthy, Sick | 1, 0 | Angiographic disease status

Table 2.2 Most and Least Typical Instances from the Cardiology Domain
Magazine Promotion   Sex      Age
Yes                  Male     45
Yes                  Female   40
No                   Male     42
Yes                  Male     43
Yes                  Female   38
No                   Female   55
Yes                  Male     35
No                   Male     27
Yes                  Male     43
Yes                  Female   41
No                   Female   43
No                   Male     29
Yes                  Female   39
No                   Male     55
No                   Female   19
[Figure: feed-forward network with an input layer, hidden layer, and output layer]

Computed Output (one value per instance):
0.024 0.998 0.023 0.986 0.999 0.050 0.999 0.262 0.060 0.997 0.999 0.776 0.999 0.023 0.999
Cluster 1: 3 instances; Sex: Male => 3, Female => 0; Age: 43.3; Credit Card Insurance:
Cluster 2: 5 instances; Sex: Male => 3, Female => 2; Age: 37.0; Credit Card Insurance:
Cluster 3: 7 instances; Sex: Male => 2, Female => 5; Age: 39.9; Credit Card Insurance:
Confusion Matrix
A matrix used to summarize the results of a supervised classification. Entries along the main diagonal are correct classifications. Entries other than those on the main diagonal are classification errors.
       Computed C1   Computed C2   Computed C3
C1        C11           C12           C13
C2        C21           C22           C23
C3        C31           C32           C33

                  Accept          Reject
Computed Accept   True Accept     False Accept
Computed Reject   False Reject    True Reject
Table 2.7 Two Confusion Matrices Each Showing a 10% Error Rate
Model A
                  Accept   Reject
Computed Accept      600       75
Computed Reject       25      300

Model B
                  Accept   Reject
Computed Accept      600       25
Computed Reject       75      300
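A quick computation confirms that both matrices show a 10% error rate. A sketch, with rows as the computed class and columns as the actual class:

```python
def error_rate(matrix):
    # Main-diagonal entries are correct classifications;
    # all off-diagonal entries are classification errors.
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return (total - correct) / total

model_a = [[600, 75],
           [25, 300]]
model_b = [[600, 25],
           [75, 300]]
print(error_rate(model_a))  # 0.1
print(error_rate(model_b))  # 0.1
```

The two models make the same number of mistakes but distribute them differently between false accepts and false rejects, which is why error rate alone cannot choose between them.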
[Figure: lift chart plotting Number Responding (0 to 1200) against % Sampled (0 to 100)]
Computing Lift
Lift = P(Ci | Sample) / P(Ci | Population)
                  Accept    Reject
Computed Accept    1,000    99,000
Computed Reject        0         0

Ideal Model
                  Accept    Reject
Computed Accept    1,000         0
Computed Reject        0    99,000
Table 2.9 Two Confusion Matrices for Alternative Models with Lift Equal to 2.25
Model X
                  Accept    Reject
Computed Accept      540    23,460
Computed Reject      460    75,540

Model Y
                  Accept    Reject
Computed Accept      450    19,550
Computed Reject      550    79,450
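The lift for both matrices can be verified directly. This sketch assumes, as in the matrices above, a population of 100,000 containing 1,000 actual responders:

```python
def lift(true_accept, false_accept, pop_accept, pop_total):
    # Lift = P(Ci | Sample) / P(Ci | Population), arranged as a single
    # integer product over an integer product to avoid rounding.
    sample_size = true_accept + false_accept
    return (true_accept * pop_total) / (sample_size * pop_accept)

# Model X mails 24,000 people and reaches 540 responders;
# Model Y mails 20,000 people and reaches 450.
print(lift(540, 23460, 1000, 100000))  # 2.25
print(lift(450, 19550, 1000, 100000))  # 2.25
```

Both samples respond at 2.25%, against a 1% population rate, so each model multiplies the base response rate by 2.25.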
Income Range
|-- 20-30K: 2 Yes 2 No
|-- 30-40K: 4 Yes 1 No
|-- 40-50K: 1 Yes 3 No
|-- 50-60K: 2 Yes

Figure 3.1 A partial decision tree with root node = income range
Credit Card Insurance
|-- No:  6 Yes 6 No
|-- Yes: 3 Yes 0 No

Figure 3.2 A partial decision tree with root node = credit card insurance
Age
|-- <= 43: 9 Yes 3 No
|-- > 43:  0 Yes 3 No
Age
|-- <= 43: Sex
|          |-- Female: Yes (6/0)
|          |-- Male:   Credit Card Insurance
|                      |-- No:  No (4/1)
|                      |-- Yes: Yes (2/0)
|-- > 43:  No (3/0)

Figure 3.4 A three-node decision tree for the credit card database
Credit Card Insurance
|-- No:  Sex
|        |-- Female: Yes (6/1)
|        |-- Male:   No (6/1)
|-- Yes: Yes (3/0)

Figure 3.5 A two-node decision tree for the credit card database
Table 3.2 Training Data Instances Following the Path in Figure 3.4 to Credit Card Insurance = No
Income Range   Sex    Age
40-50K         Male   42
20-30K         Male   27
30-40K         Male   43
20-30K         Male   29
Rule Confidence
Given a rule of the form If A then B, rule confidence is the conditional probability that B is true when A is known to be true.
Rule Support
The minimum percentage of instances in the database that contain all items listed in a given association rule.
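Both measures are simple ratios over the instances. A sketch using hypothetical market-basket transactions (the item names are illustrative, not from the text):

```python
def support(transactions, items):
    # Fraction of all transactions containing every listed item.
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(transactions, antecedent, consequent):
    # Conditional probability that the consequent holds when the
    # antecedent is known to hold.
    has_a = [t for t in transactions if antecedent <= t]
    if not has_a:
        return 0.0
    return sum(1 for t in has_a if consequent <= t) / len(has_a)

baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"},
           {"bread"}, {"milk", "eggs"}]
print(support(baskets, {"milk", "bread"}))       # 2 of 4 baskets = 0.5
print(confidence(baskets, {"milk"}, {"bread"}))  # 2 of 3 milk baskets
```

Association rule generators such as Apriori keep only item sets whose support clears the stated minimum, then report rules whose confidence is high enough to be interesting.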
Watch Promotion
No Yes No Yes No No No Yes No Yes
Sex
Male Female Male Male Female Female Male Male Male Female
Number of Items
7 4 6 5 5 8 6 4
Magazine Promotion = Yes & Watch Promotion = No
Magazine Promotion = Yes & Life Insurance Promotion = Yes
Magazine Promotion = Yes & Credit Card Insurance = No
Magazine Promotion = Yes & Sex = Male
Watch Promotion = No & Life Insurance Promotion = No
Watch Promotion = No & Credit Card Insurance = No
Watch Promotion = No & Sex = Male
Life Insurance Promotion = No & Credit Card Insurance = No
Life Insurance Promotion = No & Sex = Male
Credit Card Insurance = No & Sex = Male
Credit Card Insurance = No & Sex = Female
General Considerations
We are interested in association rules that show a lift in product sales, where the lift is the result of a product's association with one or more other products. We are also interested in association rules that show a lower-than-expected confidence for a particular association.
Table 3.6
Instance   Y
1          1.5
2          4.5
3          1.5
4          3.5
5          2.5
6          6.0

[Figure: plot of the six instances, f(x) from 0 to 7 over the range 0 to 6]
Cluster Centers: (2.67, 4.67) and (2.00, 1.83)
Cluster Points: 2, 4, 6
Squared Error:
(1.5, 1.5)

[Figure: plot of the clustering outcome]
General Considerations
Requires real-valued data.
We must select the number of clusters present in the data.
Works best when the clusters in the data are of approximately equal size.
Attribute significance cannot be determined.
Lacks explanation capabilities.
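The K-Means procedure behind these observations fits in a few lines. A minimal two-dimensional sketch (initializing from the first K points is a simplification; a real run would use random starting centers and several restarts):

```python
import math

def kmeans(points, k, iterations=20):
    # Initialize centers with the first k points.
    centers = [list(p) for p in points[:k]]
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [math.dist(p, c) for c in centers]
            clusters[d.index(min(d))].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = [sum(x) / len(cl) for x in zip(*cl)]
    return centers, clusters

centers, clusters = kmeans([(1, 1), (1.5, 2), (8, 8), (9, 9)], k=2)
print(centers)  # [[1.25, 1.5], [8.5, 8.5]]
```

Because only distances to cluster means are used, the data must be real-valued, and the result depends on both K and the initial centers, which is exactly the set of limitations listed above.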
Income Range   Sex      Age
20-30K         Male     30-39
30-40K         Female   50-59
?              Male     40-49
30-40K         Male     40-49
Income Range   Sex      Age
30-40K         Male     30-39
30-40K         Female   40-49
50-60K         Female   30-39
20-30K         Female   50-59
20-30K         Male     20-29
30-40K         Male     40-49
Population Element #1: Sex = Male,   Age = 30-39
Population Element #2: Sex = Male,   Age = 30-39
Population Element #2: Sex = Female, Age = 50-59
Population Element #1: Sex = Female, Age = 50-59
Income Range   Sex      Age
20-30K         Female   50-59
30-40K         Male     30-39
?              Male     40-49
30-40K         Male     40-49
[Figure: supervised genetic learning. Solutions S1 through SK are evaluated against P training instances I1, I2, ..., Ip, each described by attributes a1, a2, ..., an, producing fitness evaluations such as E22, Ek1, and Ek2]
Solutions
Solution elements (initial population) -> Fitness score
Solution elements (second generation) -> Fitness score
Solution elements (third generation) -> Fitness score
General Considerations
Global optimization is not guaranteed.
The fitness function determines the complexity of the algorithm.
Genetic algorithms can explain their results, provided the fitness function is understandable.
Transforming the data to a form suitable for genetic learning can be a challenge.
Initial Considerations
Is learning supervised or unsupervised?
Is explanation required?
What is the interaction between input and output attributes?
What are the data types of the input and output attributes?
Further Considerations
Do We Know the Distribution of the Data?
Do We Know Which Attributes Best Define the Data?
Does the Data Contain Missing Values?
Is Time an Issue?
Which Technique Is Most Likely to Give a Best Test Set Accuracy?
[Figure: data mining architecture. The Interface passes Data to a PreProcessor; a Heuristic Agent selects the Mining Technique (Neural Networks or ESX) based on whether the dataset is large and whether an Explanation is required; when rules are to be generated, RuleMaker produces Rules; results flow to the Report Generator and Excel Sheets]
Root Level:     Root
Concept Level:  C1  C2  ...  Cn
Instance Level: I11 I12 ... I1j   ...   In1 In2 ... Inl
Magazine Promotion (C I)   Watch Promotion (C I)   Sex (C I)   Age (R I)
Yes                        No                      Male        45
Yes                        Yes                     Female      40
No                         No                      Male        42
Yes                        Yes                     Male        43
Yes                        No                      Female      38
No                         No                      Female      55
Yes                        No                      Male        35
No                         Yes                     Male        27
Yes                        No                      Male        43
Yes                        Yes                     Female      41
No                         Yes                     Female      43
No                         Yes                     Male        29
Yes                        Yes                     Female      39
No                         Yes                     Male        55
No                         No                      Female      19
Transactional
Database
Identify the Goal -> Create Target Data -> Data Preprocessing -> Data Transformation -> Data Mining
Noisy Data
Locate Duplicate Records. Locate Incorrect Attribute Values. Smooth Data.
Data Normalization
Decimal Scaling Min-Max Normalization Normalization using Z-scores Logarithmic Normalization
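Each of the four methods is a one-line transformation. A sketch over hypothetical attribute values:

```python
import math

values = [10.0, 20.0, 30.0, 40.0, 100.0]  # hypothetical attribute values

# Decimal scaling: divide by 10^k so every value falls within [-1, 1].
k = len(str(int(max(abs(v) for v in values))))
decimal_scaled = [v / 10 ** k for v in values]

# Min-max normalization to the interval [0, 1].
lo, hi = min(values), max(values)
min_max = [(v - lo) / (hi - lo) for v in values]

# Z-score normalization: subtract the mean, divide by the standard deviation.
mean = sum(values) / len(values)
std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
z_scores = [(v - mean) / std for v in values]

# Logarithmic normalization: compress a large value range.
logs = [math.log(v) for v in values]

print(min_max[0], min_max[-1])  # 0.0 1.0
```

Min-max and decimal scaling preserve the shape of the distribution; z-scores center it at zero; the logarithm compresses long right tails.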
Income Range   Magazine Promotion   Watch Promotion   Sex   Age
1              0                    0                 1     1
0              0                    0                 0     1
0              0                    0                 1     1
Vehicle-Type
Type ID   Make        Year
4371      Chevrolet   1995
6940      Cadillac    2000
4595      Chevrolet   2001
2390      Cadillac    1997

Customer
Type ID
2390
4371
6940
4595
2390

Customer joined with Vehicle-Type
Type ID   Make        Year
2390      Cadillac    1997
4371      Chevrolet   1995
6940      Cadillac    2000
4595      Chevrolet   2001
2390      Cadillac    1997
Granularity
Granularity is a term used to describe the level of detail of stored information.
Data Warehouse
Report
Purchase Dimension
Purchase Key   Category
1              Supermarket
2              Travel & Entertainment
3              Auto & Vehicle
4              Retail
5              Restaurant
6              Miscellaneous

Time Dimension
Time Key   Month   Day   Quarter   Year
5          Dec     31    4         2001
8          Jan     3     1         2002
10         Jan     5     1         2002

Promotion Fact Table
Cardholder Key   Promotion Key   Time Key
1                1               5
2                1               5

Promotion Dimension
Promotion Key: 1 . . .
Response: Yes No . . .

[Figure: fact tables keyed by Cardholder Key, Purchase Key, Time Key, Location Key, and Promotion Key, joined to the dimension tables above]

Figure 6.5 A constellation schema for credit card purchases and promotions
OLAP Operations
Slice: a single-dimension operation
Dice: a multidimensional operation
Roll-up: a higher level of generalization
Drill-down: a greater level of detail
Rotation: view data from a new perspective
[Figure: multidimensional cube with dimensions Month, Region, and Category (Supermarket, Restaurant, Travel, Vehicle, Retail, Miscellaneous); the highlighted cell is Month = Dec, Category = Vehicle, Region = Two, with Amount = 6,720 and Count = 110]
Concept Hierarchy
A mapping that allows attributes to be viewed from varying levels of detail.
Region
  State
    City
      Street Address
[Figure: rotated cube with dimensions Time (Q1 through Q4), Region (One, Two, Three, Four), and Category (Miscellaneous, Supermarket, Travel, Vehicle, Retail, Restaurant)]
Parameters
Supervised Model
Evaluation
[Figure: the normal curve. 34.13% of values fall within one standard deviation on either side of the mean, 13.59% between one and two standard deviations, 2.14% between two and three, and 0.13% beyond three]
                  Accept          Reject
Computed Accept   True Accept     Type 1 Error
Computed Reject   Type 2 Error    True Reject

T = |X1 - X2| / sqrt( v1/n1 + v2/n2 )

Equation 7.3
Cross Validation
Used when ample test data is not available. Partition the dataset into n fixed-size units. n-1 units are used for training and the remaining unit is used as a test set. Repeat this process until each of the fixed-size units has been used as test data. Model correctness is taken as the average over all training-test trials.
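The procedure can be sketched with the model-building step stubbed out; the scoring function below is a stand-in for a real train-and-test cycle:

```python
def cross_validate(data, n, train_and_test):
    # Partition the dataset into n roughly fixed-size units; each unit
    # serves once as the test set while the rest are used for training.
    folds = [data[i::n] for i in range(n)]
    scores = []
    for i in range(n):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(train_and_test(train, test))
    # Model correctness is the average over all training-test trials.
    return sum(scores) / n

# Hypothetical scorer: fraction of even numbers in the test fold.
score = cross_validate(list(range(10)), 5,
                       lambda train, test: sum(x % 2 == 0 for x in test) / len(test))
print(score)  # 0.5
```

Every instance is tested exactly once and trained on n-1 times, which is what makes the averaged score a low-bias estimate when test data is scarce.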
Bootstrapping
Used when ample training and test data is not available. Bootstrapping allows instances to appear more than once in the training data.
T = |E1 - E2| / sqrt( q(1 - q)(1/n1 + 1/n2) )

where
E1 = the error rate for model M1
E2 = the error rate for model M2
q = (E1 + E2)/2
n1 = the number of instances in test set A
n2 = the number of instances in test set B

Equation 7.4
T = |E1 - E2| / sqrt( q(1 - q)(2/n) )

where
E1 = the error rate for model M1
E2 = the error rate for model M2
q = (E1 + E2)/2
n = the number of test set instances

Equation 7.5
T = |Xi - Xj| / sqrt( vi/ni + vj/nj )

where
vi is the class i variance and vj is the class j variance for attribute A
ni is the number of instances in Ci and nj is the number of instances in Cj

Equation 7.6
mse = ( (a1 - c1)^2 + (a2 - c2)^2 + ... + (ai - ci)^2 + ... + (an - cn)^2 ) / n

where for the ith instance,
ai = actual output value
ci = computed output value

Equation 7.7
mae = ( |a1 - c1| + |a2 - c2| + ... + |an - cn| ) / n

Equation 7.8
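Both error measures compute directly from paired actual and computed outputs. A sketch, with hypothetical values chosen to be exact in binary floating point:

```python
def mean_squared_error(actual, computed):
    # Average of squared differences between actual and computed outputs.
    return sum((a - c) ** 2 for a, c in zip(actual, computed)) / len(actual)

def mean_absolute_error(actual, computed):
    # Average of absolute differences; less sensitive to occasional
    # large deviations than the squared form.
    return sum(abs(a - c) for a, c in zip(actual, computed)) / len(actual)

actual = [1.0, 0.0, 1.0, 1.0]      # hypothetical target outputs
computed = [0.5, 0.25, 0.75, 0.5]  # hypothetical model outputs
print(mean_squared_error(actual, computed))   # 0.15625
print(mean_absolute_error(actual, computed))  # 0.375
```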
Neural Networks
Chapter 8
[Figure: feed-forward network computation. Input nodes 2 and 3 connect to hidden nodes i and j through weights W2i, W3i, and W3j (example values 0.4 and 0.7); hidden node j connects to output node k through weight Wjk, and node i through Wik]
Equation 8.2
Figure 8.3 A 3x3 Kohonen network with two input layer nodes
Weaknesses
Lack explanation capabilities.
May not provide optimal solutions to problems.
Overtraining can be a problem.
Statistical Techniques
Chapter 10
Equation 10.1
Regression Trees
Test 1
|-- <:  Test 2
|       |-- <:  LRM1
|       |-- >=: LRM2
|-- >=: Test 3
        |-- <:  LRM3
        |-- >=: Test 4
                |-- <:  LRM4
                |-- >=: LRM5
Equation 10.7
Equation 10.9
Watch Promotion   Sex
No                Male
Yes               Female
No                Male
Yes               Male
No                Female
No                Female
Yes               Male
No                Male
No                Male
Yes               Female

               Watch Promotion = Yes   Watch Promotion = No
Sex = Male     2 (2/6)                 4 (4/6)
Sex = Female   2 (2/4)                 2 (2/4)
Equation 10.10
Equation 10.12
Missing Data
With the Bayes classifier, missing data items are simply ignored.
Numeric Data
f(x) = ( 1 / (sqrt(2*pi) * s) ) * e^( -(x - m)^2 / (2*s^2) )

where
e = the exponential function
m = the class mean for the given numerical attribute
s = the class standard deviation for the attribute
x = the attribute value

Equation 10.13
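Equation 10.13 evaluates directly. A sketch, with a hypothetical class mean and standard deviation:

```python
import math

def normal_pdf(x, m, s):
    # f(x) = 1 / (sqrt(2*pi) * s) * e^(-(x - m)^2 / (2 * s^2))
    return math.exp(-(x - m) ** 2 / (2 * s ** 2)) / (math.sqrt(2 * math.pi) * s)

# Density of attribute value 50 for a class with mean 45 and
# standard deviation 10; the Bayes classifier multiplies this
# into the running conditional probability for the class.
print(normal_pdf(50, m=45, s=10))
```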
Agglomerative Clustering
1. Place each instance into a separate partition.
2. Until all instances are part of a single cluster:
   a. Determine the two most similar clusters.
   b. Merge the clusters chosen into a single cluster.
3. Choose a clustering formed by one of the step 2 iterations as a final result.
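The three steps can be sketched directly; here similarity is taken as the distance between cluster means over a single real-valued attribute, a simplification of the general case:

```python
def agglomerate(values, target_clusters):
    # Step 1: each instance starts in its own partition.
    clusters = [[v] for v in values]
    # Step 2: repeatedly merge the two most similar (closest-mean) clusters.
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                mi = sum(clusters[i]) / len(clusters[i])
                mj = sum(clusters[j]) / len(clusters[j])
                d = abs(mi - mj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    # Step 3: stop at the chosen level of the merge hierarchy.
    return clusters

print(agglomerate([1.0, 1.2, 5.0, 5.3, 9.9], target_clusters=3))
# [[1.0, 1.2], [5.0, 5.3], [9.9]]
```

Running the loop all the way to one cluster and recording each merge yields the full hierarchy from which step 3 selects a result.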
Conceptual Clustering
1. Create a cluster with the first instance as its only member.
2. For each remaining instance, take one of two actions at each tree level:
   a. Place the new instance into an existing cluster.
   b. Create a new concept cluster having the new instance as its only member.
Expectation Maximization
The EM (expectation-maximization) algorithm is a statistical technique that makes use of the finite Gaussian mixtures model.
Expectation Maximization
A mixture is a set of n probability distributions where each distribution represents a cluster. The mixtures model assigns each data instance a probability that it would have a certain set of attribute values given it was a member of a specified cluster.
Expectation Maximization
The EM algorithm is similar to the K-Means procedure in that a set of parameters is recomputed until a desired convergence is achieved. In the simplest case there are two clusters, a single real-valued attribute, and the probability distributions are normal.
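For that simplest case, the recomputation loop can be sketched as follows (fixed unit variance and equal cluster priors are simplifying assumptions; a full implementation re-estimates both):

```python
import math

def normal(x, m, s):
    return math.exp(-(x - m) ** 2 / (2 * s ** 2)) / (math.sqrt(2 * math.pi) * s)

def em_two_clusters(data, m1, m2, s=1.0, iterations=25):
    for _ in range(iterations):
        # Expectation: probability that each instance belongs to cluster 1.
        w = [normal(x, m1, s) / (normal(x, m1, s) + normal(x, m2, s))
             for x in data]
        # Maximization: recompute each cluster mean from the weights.
        m1 = sum(wi * x for wi, x in zip(w, data)) / sum(w)
        m2 = sum((1 - wi) * x for wi, x in zip(w, data)) / sum(1 - wi for wi in w)
    return m1, m2

# Hypothetical one-attribute data drawn from groups near 1 and near 8.
m1, m2 = em_two_clusters([0.8, 1.1, 1.3, 7.8, 8.1, 8.4], m1=0.0, m2=10.0)
print(round(m1, 2), round(m2, 2))  # 1.07 8.1
```

Unlike K-Means, each instance contributes fractionally to every cluster mean, weighted by its membership probability.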
Specialized Techniques
Chapter 11
Table 11.1 Weekly Average Closing Prices for the Nasdaq and Dow Jones Industrial Average
Week      Nasdaq Average   Dow Average   Nasdaq-1 Average   Dow-1 Average   Nasdaq-2 Average   Dow-2 Average
2000-03   4176.75          11413.28      3968.47            11587.96        3847.25            11224.10
2000-04   4052.01          10967.60      4176.75            11413.28        3968.47            11587.96
2000-05   4104.28          10992.38      4052.01            10967.60        4176.75            11413.28
2000-06   4398.72          10726.28      4104.28            10992.38        4052.01            10967.60
2000-07   4445.53          10506.68      4398.72            10726.28        4104.28            10992.38
2000-08   4535.15          10121.31      4445.53            10506.68        4398.72            10726.28
2000-09   4745.58          10167.38      4535.15            10121.31        4445.53            10506.68
2000-10   4949.09          9952.52       4745.58            10167.38        4535.15            10121.31
2000-11   4742.40          10223.11      4949.09            9952.52         4745.58            10167.38
2000-12   4818.01          10937.36      4742.40            10223.11        4949.09            9952.52
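Lagged attributes such as Nasdaq-1 and Nasdaq-2 are produced by shifting the original series. A sketch using the first few Nasdaq values:

```python
def lag(series, k):
    # The value at week t becomes the "t minus k" attribute;
    # the first k weeks have no history available.
    return [None] * k + series[:-k]

nasdaq = [4176.75, 4052.01, 4104.28, 4398.72]
print(lag(nasdaq, 1))  # [None, 4176.75, 4052.01, 4104.28]
print(lag(nasdaq, 2))  # [None, None, 4176.75, 4052.01]
```

Adding the lagged columns turns the time-series problem into an ordinary supervised learning problem over instances with past values as input attributes.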
Web-Based Mining
(identifying the goal)
Decrease the average number of pages visited by a customer before a purchase transaction.
Increase the average number of pages viewed per user session.
Increase Web server efficiency.
Personalize Web pages for customers.
Determine those products that tend to be purchased or viewed together.
Decrease the total number of item returns.
Increase visitor retention rates.
Web-Based Mining
(preparing the data)
Data is stored in Web server log files, typically in the form of clickstream sequences. Server log files provide information in extended common log file format.
80.202.8.93 - - [16/Apr/2002:22:43:28 -0600] "GET /grbts/images/msu-new-color.gif HTTP/1.1" 200 5006 "http://grb.mnsu.edu/doc/index.html" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 2000) Opera 6.01 [nb] 134.29.41.219 - - [17/Apr/2002:19:23:30 -0600] "GET /resindoc/images/resin_powered.gif HTTP/1.1" 200 571 "http://grb.mnsu.edu/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Q312461)"
Data Preparation
Session File
Learner Model
This rule states that P4, P10, and P15 appear in three session instances. Also, four instances have P4 and P10 appearing in the same session instance.
Evaluating Results
(unsupervised clustering)
Use agglomerative clustering to place session instances into clusters.
Instance similarity is computed by dividing the total number of pageviews each pair of instances share by the total number of pageviews contained within the instances.
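Reading "total number of pageviews contained within the instances" as the distinct pages appearing in either session, the measure is the Jaccard ratio. A sketch with hypothetical sessions:

```python
def similarity(session_a, session_b):
    # Pageviews shared by the two sessions, divided by the distinct
    # pageviews contained within the pair (the Jaccard ratio).
    shared = session_a & session_b
    return len(shared) / len(session_a | session_b)

# Hypothetical sessions recorded as sets of page identifiers.
s1 = {"P3", "P4", "P10", "P15"}
s2 = {"P4", "P8", "P10"}
print(similarity(s1, s2))  # 2 shared of 5 distinct pages = 0.4
```

Agglomerative clustering then repeatedly merges the pair of session clusters with the highest similarity.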
Evaluating Results
(unsupervised clustering)
Consider the following session instances:
P5 P4 P10 P3 P15 P2 P1 P2 P4 P10 P8 P15 P4 P15 P1
Evaluating Results
(summary statistics)
Summary statistics about the activities taking place at a Web site can be obtained using a Web server log analyzer. The output of the analyzer is an aggregation of log file data displayed in graphical format.
Web-Based Mining
(Taking Action)
Implement a strategy based on created user profiles to personalize the Web pages viewed by site visitors.
Adapt the indexing structure of a Web site to better reflect the paths followed by typical users.
Set up online advertising promotions for registered Web site customers.
Send e-mail to promote products of likely interest to a select group of registered customers.
Modify the content of a Web site by grouping products likely to be purchased together, removing products of little interest, and expanding the offerings of high-demand products.