https://www-users.cs.umn.edu/~kumar001/dmbook/index.php#item3
https://web.stanford.edu/~hastie/ElemStatLearn/
Based on
– purchases at department/grocery stores, e-commerce
   Amazon handles millions of visits/day
– Bank/Credit Card transactions
Computers have become cheaper and more powerful
Competitive Pressure is Strong
– Provide better, customized services for an edge (e.g. in
Customer Relationship Management)
Improving health care and reducing costs
Predicting the impact of climate change
Prediction Methods
– Use some variables to predict unknown or
future values of other variables.
Description Methods
– Find human-interpretable patterns that
describe the data.
From [Fayyad, et al.] Advances in Knowledge Discovery and Data Mining, 1996
[Figure: data table with columns Tid, Refund, Marital Status, Taxable Income, Cheat]
Training Set (class attribute: Credit Worthy)

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Graduate            5                           Yes
2    Yes       High School         2                           No
3    No        Undergrad           1                           No
4    Yes       High School         10                          Yes
…    …         …                   …                           …

[Figure: decision tree model with nodes Employed (Yes/No), Education ({High school, Undergrad} vs. Graduate) and Number of years, ending in Yes/No leaves]

Test Set

Tid  Employed  Level of Education  # years at present address  Credit Worthy
1    Yes       Undergrad           7                           ?
2    No        Graduate            3                           ?
3    Yes       High School         2                           ?
…    …         …                   …                           …

[Figure: Training Set --> Learn Classifier --> Model; Test Set]
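The learn-then-apply flow can be illustrated with a minimal sketch in Python. It is not the decision tree from the slide: it assumes a much simpler one-split model (a "decision stump") on the number of years at the present address, learned from the four visible training rows. The names `learn_stump` and `predict` are hypothetical helpers introduced only for this illustration.

```python
from collections import Counter

# Visible training rows: (employed, education, years at address, credit worthy)
TRAIN = [
    ("Yes", "Graduate", 5, "Yes"),
    ("Yes", "High School", 2, "No"),
    ("No", "Undergrad", 1, "No"),
    ("Yes", "High School", 10, "Yes"),
]
# Visible test rows: the class is unknown
TEST = [
    ("Yes", "Undergrad", 7),
    ("No", "Graduate", 3),
    ("Yes", "High School", 2),
]

def majority(labels, default):
    """Most common label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def learn_stump(rows, column=2):
    """Pick the threshold on a numeric column with the fewest training errors.

    Assumes the column holds at least two distinct values.
    """
    values = sorted({r[column] for r in rows})
    overall = majority([r[-1] for r in rows], None)
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # midpoint between neighbouring values
        below = majority([r[-1] for r in rows if r[column] <= threshold], overall)
        above = majority([r[-1] for r in rows if r[column] > threshold], overall)
        errors = sum(
            (below if r[column] <= threshold else above) != r[-1] for r in rows
        )
        if best is None or errors < best[0]:
            best = (errors, threshold, below, above)
    return best[1:]  # (threshold, label_below, label_above)

def predict(model, row, column=2):
    """Apply the learned stump to one row."""
    threshold, below, above = model
    return below if row[column] <= threshold else above

model = learn_stump(TRAIN)
predictions = [predict(model, row) for row in TEST]
print(model, predictions)
```

On these four rows the stump settles on a threshold of 3.5 years, so the three test rows are labelled Yes, No and No. A real classifier would use all attributes and far more training data.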
Fraud Detection
– Goal: Predict fraudulent cases in credit card
transactions.
– Approach:
Use credit card transactions and the information
on the account holder as attributes.
– When does a customer buy, what do they buy, how
often do they pay on time, etc.
Label past transactions as fraud or fair
transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit
card transactions on an account.
01/17/2018 Introduction to Data Mining, 2nd Edition 21
Classification: Application 2
Late
Data Size:
• 72 million stars, 20 million galaxies
• Object Catalog: 9 GB
• Image Database: 150 GB
Use of K-means to partition Sea Surface Temperature (SST) and Net Primary Production (NPP) into clusters that reflect the Northern and Southern Hemispheres.
[Figure: cluster map, longitude vs. latitude; legend: Land Cluster 1, Land Cluster 2, Ice or No NPP, Sea Cluster 1, Sea Cluster 2]
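For readers unfamiliar with K-means, here is a minimal sketch of the basic algorithm (Lloyd's iterations) in plain Python. It is not the analysis behind the figure: the points are made-up two-dimensional values standing in for per-location feature vectors, and seeding the centroids with the first k points is a simplifying assumption.

```python
import math

def kmeans(points, k, iterations=100):
    """Basic K-means on a list of equal-length numeric tuples."""
    centroids = [tuple(p) for p in points[:k]]  # naive seeding (assumption)
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment

# Hypothetical data: two well-separated groups of points.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 1.5), (9.5, 8.0)]
centroids, assignment = kmeans(points, k=2)
print(centroids, assignment)
```

In practice the result depends on the initial centroids, so implementations usually run several random restarts and keep the best partition.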
Clustering: Application 1
Market Segmentation:
– Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
– Approach:
Collect different attributes of customers based on
their geographical and lifestyle related information.
Find clusters of similar customers.
Measure the clustering quality by observing buying
patterns of customers in the same cluster vs. those
from different clusters.
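One possible way to make the same-cluster vs. different-cluster comparison concrete is sketched below. It assumes each customer's buying pattern has already been summarised as a numeric vector and uses Euclidean distance as the similarity measure; both are assumptions, and the customer values and the helper name `mean_pairwise_distances` are hypothetical.

```python
import math
from itertools import combinations

def mean_pairwise_distances(points, labels):
    """Return (mean distance within clusters, mean distance across clusters).

    Assumes there is at least one same-cluster pair and one cross-cluster pair.
    """
    within, across = [], []
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        (within if lp == lq else across).append(math.dist(p, q))
    return sum(within) / len(within), sum(across) / len(across)

# Hypothetical buying patterns: (visits per month, average basket value)
customers = [(2.0, 10.0), (3.0, 12.0), (20.0, 80.0), (22.0, 85.0)]
labels = [0, 0, 1, 1]  # cluster assigned to each customer

within, across = mean_pairwise_distances(customers, labels)
print(within, across)
```

A clustering looks reasonable under this measure when the within-cluster mean is clearly smaller than the across-cluster mean.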
Document Clustering:
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
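The strength of such rules is commonly summarised with support and confidence. These measures are not defined in this section, so the sketch below is an illustration under standard definitions: support is the fraction of transactions containing all items of the rule, and confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side. The function names are chosen for this example only.

```python
# The five transactions from the table above.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """Support of the whole rule divided by support of its left-hand side."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

for lhs, rhs in [({"Milk"}, {"Coke"}), ({"Diaper", "Milk"}, {"Beer"})]:
    print(sorted(lhs), "-->", sorted(rhs),
          "support:", support(lhs | rhs, TRANSACTIONS),
          "confidence:", round(confidence(lhs, rhs, TRANSACTIONS), 2))
```

For this table, {Milk} --> {Coke} holds in 3 of 5 transactions (support 0.6) and in 3 of the 4 transactions containing Milk (confidence 0.75); {Diaper, Milk} --> {Beer} has support 0.4 and confidence 2/3.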
Market-basket analysis
– Rules are used for sales promotion, shelf
management, and inventory management
Medical Informatics
– Rules are used to find combinations of patient
symptoms and test results associated with certain
diseases
Association Analysis: Applications
Scalability
High Dimensionality
Non-traditional Analysis