http://chem-eng.utoronto.ca/~datamining/ 1
Data Mining Steps
1 • Problem Definition
2 • Data Preparation
3 • Data Exploration
4 • Modeling
5 • Evaluation
6 • Deployment
1. Problem Definition
Understanding the project objectives
and requirements from a business
perspective, converting this knowledge
into a data mining problem definition,
and a preliminary plan designed to
achieve the objectives.
Source: http://www.crisp-dm.org/Process/index.htm
2. Data Preparation
[Figure: Data ETL diagram. Labels: DSN, Data, Text, Modeling Data.]
3. Data Exploration
• Univariate Analysis
– Statistics: Frequency, Mean, Min, Max, ...
– Charts: Bar, Line, Pie, ...
• Bivariate Analysis
– Statistics: Correlation, Z test, ...
– Charts: Combination Charts
Data Exploration - Univariate Analysis
• Categorical
– Count, Frequency
– Bar and Pie Charts
• Numerical
– Count, Mean, StDev
– Histogram, Box Plot
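The univariate statistics listed above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample values are made up, and the standard deviation used here is the sample standard deviation, which the slides do not specify.

```python
from collections import Counter
from statistics import mean, stdev


def categorical_summary(values):
    """Return {category: (count, relative frequency)} for a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {category: (n, n / total) for category, n in counts.items()}


def numerical_summary(values):
    """Return count, mean and sample standard deviation for a numerical variable."""
    return {"count": len(values), "mean": mean(values), "stdev": stdev(values)}


# Made-up illustrative data, not taken from the slides
housing = ["own", "own", "rent", "for free", "own"]
age = [23, 35, 41, 29, 52]

housing_summary = categorical_summary(housing)
age_summary = numerical_summary(age)
print(housing_summary)
print(age_summary)
```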
Univariate Analysis - Categorical
housing     Count   Frequency
for free       96   10.67%
own           641   71.22%
[Figures: pie chart and bar chart of Housing by category; pie labels include 11% and 18%.]
Missing Value
[Figure: bar chart "Education", Frequency (0 to 2,500,000) by category; an 83% label is shown.]
Invalid Values
[Figure: bar chart "Invalid doc_type_id", Frequency (0 to 1,400,000) by doc_type_id value.]
Univariate Analysis - Numeric
Age
Count 900 Average 35.25 StDev 11.20
Missing and Invalid Values and Outliers
[Figure: chart of Months in Business.]
Box Plot
[Figure: box plot with outliers marked.]
Univariate Analysis - Policies
• Categorical variable: Encoding
• Numeric variable: Binning
Missing Value Policies
• Fill in missing values manually based on our
domain knowledge
• Ignore the records with missing data
• Fill them in automatically with:
– A global constant (e.g., “?”)
– The variable mean
– Inference-based methods such as Bayes’ rule,
decision tree, or EM algorithm
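As a rough illustration of the simpler policies above (ignoring records, a global constant, the variable mean), here is a minimal Python sketch. The function names and sample data are hypothetical, and missing values are assumed to be represented as None.

```python
from statistics import mean


def drop_missing(records):
    """Ignore (drop) every record that has at least one missing field."""
    return [r for r in records if all(v is not None for v in r.values())]


def fill_with_constant(values, constant="?"):
    """Replace each missing value with a global constant."""
    return [constant if v is None else v for v in values]


def fill_with_mean(values):
    """Replace each missing value with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]


# Made-up illustrative data; None marks a missing value
ages = [25, None, 35, 45, None]
filled_ages = fill_with_mean(ages)
print(filled_ages)
```

The inference-based methods in the last bullet need a fitted model and are not covered by this sketch.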
Managing Outliers
• Data points inconsistent with the majority of the data
• Types of outliers
– Valid: a CEO’s salary
– Noisy: an age of 200, or other widely deviating points
• Removal methods
– Box plot
– Clustering
– Curve-fitting
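The box plot method can be sketched as below. The slides do not state the rule, so this assumes the conventional choice of flagging points more than 1.5 interquartile ranges beyond the quartiles; the function name and the sample ages are made up.

```python
from statistics import quantiles


def box_plot_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR].

    k = 1.5 is the usual box plot convention, assumed here.
    """
    q1, _, q3 = quantiles(values, n=4)  # quartiles (Python 3.8+)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up ages with one noisy entry
ages = [22, 25, 27, 30, 31, 33, 35, 38, 41, 200]
outliers = box_plot_outliers(ages)
print(outliers)
```

Whether a flagged point is removed or kept still depends on whether it is a valid or a noisy outlier.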
Encoding Categorical Variables
• Encoding is the process of transforming
categorical variables into numerical
counterparts.
• Encoding methods:
– Binary method
– Ordinal method
– Target-based encoding
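The slides name the three methods without defining them, so the sketch below rests on common readings that are assumptions: the binary method as one 0/1 indicator per category, the ordinal method as a rank in a caller-supplied order, and target-based encoding as the mean of a 0/1 target within each category. All names and data are hypothetical.

```python
def binary_encode(values):
    """One 0/1 indicator per category for each value."""
    categories = sorted(set(values))
    return [{c: int(v == c) for c in categories} for v in values]


def ordinal_encode(values, order):
    """Map each category to its 1-based rank in the given order."""
    rank = {c: i for i, c in enumerate(order, start=1)}
    return [rank[v] for v in values]


def target_encode(values, targets):
    """Replace each category with the mean target value of that category."""
    sums, counts = {}, {}
    for v, t in zip(values, targets):
        sums[v] = sums.get(v, 0) + t
        counts[v] = counts.get(v, 0) + 1
    return [sums[v] / counts[v] for v in values]


# Made-up illustrative data
housing = ["own", "rent", "own", "for free"]
default = [0, 1, 1, 0]

print(binary_encode(housing))
print(ordinal_encode(["low", "high", "medium"], ["low", "medium", "high"]))
print(target_encode(housing, default))
```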
Encoding
Binning Numerical Variables
• Binning is the process of transforming
numerical variables into categorical
counterparts.
• Binning methods:
– Equal Width
– Equal Frequency
– Entropy Based
Binning
• Variable: 0, 4, 12, 16, 16, 18, 24, 26, 28
• Equi-width binning:
– Bin 1 [-, 10): 0, 4
– Bin 2 [10, 20): 12, 16, 16, 18
– Bin 3 [20, +): 24, 26, 28
• Equi-frequency binning:
– Bin 1 [-, 14): 0, 4, 12
– Bin 2 [14, 21): 16, 16, 18
– Bin 3 [21, +): 24, 26, 28
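The two binnings in this example can be reproduced with a short Python sketch. The function names are hypothetical, and the range 0 to 30 passed to the equal-width function is an assumption chosen to match the bin edges 10 and 20 used on the slide.

```python
def equal_width_bins(values, n_bins, low, high):
    """Split [low, high) into n_bins intervals of equal width.

    Values below low or at/above high fall into the first or last bin,
    matching the open-ended [-, ...) and [..., +) bins on the slide.
    """
    width = (high - low) / n_bins
    bins = [[] for _ in range(n_bins)]
    for v in sorted(values):
        idx = int((v - low) // width)
        idx = min(max(idx, 0), n_bins - 1)
        bins[idx].append(v)
    return bins


def equal_frequency_bins(values, n_bins):
    """Split the sorted values into n_bins groups of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


data = [0, 4, 12, 16, 16, 18, 24, 26, 28]
width_bins = equal_width_bins(data, 3, low=0, high=30)
freq_bins = equal_frequency_bins(data, 3)
print(width_bins)
print(freq_bins)
```

Note that this simple equal-frequency split can place tied values in different bins; it happens not to do so for this data.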
Binning
[Figure: binned Months in Business.]
Data Exploration – Bivariate Analysis
• Numeric & Numeric
– Correlation
– Scatter Plot
• Categorical & Numeric
– z-test, t-test, ANOVA
– Combination Chart
• Categorical & Categorical
– Chi2 test
– Combination Chart
Numeric & Numeric
[Figure: scatter plot of Total Balance ($0 to $120,000) against Months in Business (0 to 2000). Correlation = 0.114]
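The slide does not say which correlation coefficient is reported; assuming it is the Pearson coefficient, a minimal Python sketch follows. The function name and the sample values are made up and are not the data behind the scatter plot.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up illustrative data
months = [12, 24, 60, 120, 240]
balance = [5000, 7000, 6500, 9000, 12000]
r = pearson_correlation(months, balance)
print(round(r, 3))
```

A value near 0, such as the 0.114 on the slide, indicates a weak linear relationship.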
Categorical & Numeric
• Z test
• t test
• F test
• ANOVA
Categorical & Numeric - Z, t, F Tests
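\[ Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{S_1^2}{N_1} + \dfrac{S_2^2}{N_2}}} \]
\[ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S^2 \left( \dfrac{1}{N_1} + \dfrac{1}{N_2} \right)}} \]
\[ F = \frac{S_1^2}{S_2^2} \]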
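The three statistics for comparing a numeric variable across two groups can be sketched in Python as follows. The slide does not define S^2 in the t statistic; the pooled sample variance used here is an assumption, as are the function names and the small made-up samples.

```python
from math import sqrt
from statistics import mean, variance


def z_statistic(a, b):
    """Z: difference of means over the unpooled standard error."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def t_statistic(a, b):
    """t: difference of means over the pooled standard error (assumed pooling)."""
    n1, n2 = len(a), len(b)
    pooled = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    return (mean(a) - mean(b)) / sqrt(pooled * (1 / n1 + 1 / n2))


def f_statistic(a, b):
    """F: ratio of the two sample variances."""
    return variance(a) / variance(b)


# Made-up samples for two groups
group_1 = [2, 4, 6]
group_2 = [1, 2, 3]
print(z_statistic(group_1, group_2))
print(t_statistic(group_1, group_2))
print(f_statistic(group_1, group_2))
```

With equal group sizes, as here, the Z and t statistics coincide; they differ when the groups have different sizes and variances.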
Analysis of Variance (ANOVA)
Categorical & Categorical
                  Default
                  Y       N
Corporation  Y    366    2786
             N    191    4777
Categorical & Categorical
                  Default
                  Y       N
Corporation  Y    4.5%   34.3%
             N    2.4%   58.8%
Categorical & Categorical
[Figure: bar chart of the share of records (0% to 60%) by Default (Y, N) and Corporation (Y, N).]
Categorical & Categorical
\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - e_{ij})^2}{e_{ij}} \]
\[ e_{ij} = \frac{n_{i.} \, n_{.j}}{n} \]
\[ df = (r - 1)(c - 1) \]
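The chi-square statistic can be computed directly from a table of counts. The sketch below applies the formulas above to the Corporation by Default counts shown on the earlier slide; the function name is hypothetical.

```python
def chi_square(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # e_ij = n_i. * n_.j / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Rows: Corporation Y, N; columns: Default Y, N
counts = [[366, 2786], [191, 4777]]
chi2, df = chi_square(counts)
print(chi2, df)
```

The statistic is then compared with the chi-square distribution with df degrees of freedom to judge whether the two variables are related.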
Data Exploration - MVP
[Figure: combination chart of Months in Business and Default%.]
Summary
• Data exploration covers all activities needed to
get familiar with the data, identify data quality
problems, and discover first insights into the data.
• Univariate analysis can show variable
distribution, missing values, invalid values and
outliers.
• Bivariate analysis can discover relationships
between variables.
• The combination chart (variable & target) is the
most valuable type of plot.