Data Exploration

Dr. Saed Sayad


University of Toronto
2010
saed.sayad@utoronto.ca

http://chem-eng.utoronto.ca/~datamining/ 1
Data Mining Steps
1. Problem Definition
2. Data Preparation
3. Data Exploration
4. Modeling
5. Evaluation
6. Deployment

1. Problem Definition
Understanding the project objectives
and requirements from a business
perspective, converting this knowledge
into a data mining problem definition,
and a preliminary plan designed to
achieve the objectives.

Source: http://www.crisp-dm.org/Process/index.htm

2. Data Preparation

[Diagram: source data (DSN, text) passes through Data ETL to produce the Modeling Data.]

3. Data Exploration
Data Exploration
• Univariate Analysis
– Frequency, Mean, Min, Max, ...
– Charts: Bar, Line, Pie, ...
• Bivariate Analysis
– Correlation, Z test, ...
– Combination Charts

Data Exploration - Univariate Analysis
Univariate
• Categorical
– Count, Frequency
– Bar and Pie Charts
• Numerical
– Count, Mean, StDev
– Histogram, Box Plot

Univariate Analysis - Categorical
housing     Count   Frequency
for free       96      10.67%
own           641      71.22%
rent          163      18.11%

[Pie chart: Housing, with shares for free 11%, own 71%, rent 18%.]
[Bar chart: Housing, counts for "for free", "own" and "rent" (y-axis 0 to 700).]
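
The count and frequency figures on this slide can be reproduced with a short sketch. The `housing` list is rebuilt here from the three counts shown (96, 641, 163); the function name `frequency_table` is an illustrative assumption, not part of the original material.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: (count, percent)} for a categorical variable."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, round(100 * n / total, 2)) for cat, n in counts.items()}

# Rebuild the housing variable from the counts shown on the slide.
housing = ["for free"] * 96 + ["own"] * 641 + ["rent"] * 163

for category, (count, percent) in sorted(frequency_table(housing).items()):
    print(f"{category:<10}{count:>6}{percent:>8.2f}%")
```

The same dictionary can feed a bar or pie chart in any plotting library.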


Missing Values

[Bar chart: Education, frequency of each value (y-axis 0 to 2,500,000); 83% of the values are missing (blank).]

Invalid Values

[Bar chart: doc_type_id, frequency of each value including invalid ones such as NULL (y-axis 0 to 1,400,000).]

Univariate Analysis - Numeric

Age

Count      900      Average    35.25     StDev      11.20
Min        19       Median     33        Variance   125.37
Maximum    75       Mode       27        CV         32%
Range      56       Skewness   1.09
Missing    0        Kurtosis   0.88
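
A minimal sketch of how these summary statistics can be computed with the Python standard library. The `ages` sample is hypothetical, and the skewness and kurtosis use plain moment-based formulas, which is an assumption: the tool behind the slide may apply small-sample corrections and give slightly different values.

```python
import statistics

def describe(values):
    """Summary statistics for a non-constant numeric variable (missing values as None)."""
    observed = [v for v in values if v is not None]
    n = len(observed)
    mean = statistics.mean(observed)
    stdev = statistics.stdev(observed)  # sample standard deviation
    # Central moments used for moment-based skewness and excess kurtosis.
    m2 = sum((x - mean) ** 2 for x in observed) / n
    m3 = sum((x - mean) ** 3 for x in observed) / n
    m4 = sum((x - mean) ** 4 for x in observed) / n
    return {
        "count": n,
        "missing": len(values) - n,
        "min": min(observed),
        "max": max(observed),
        "range": max(observed) - min(observed),
        "mean": mean,
        "median": statistics.median(observed),
        "mode": statistics.mode(observed),
        "stdev": stdev,
        "variance": statistics.variance(observed),
        "cv": stdev / mean,  # coefficient of variation
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3,
    }

ages = [19, 22, 27, 27, 31, 33, 35, 41, 52, 75]  # hypothetical sample
for name, value in describe(ages).items():
    print(f"{name:<10}{value}")
```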

Missing and Invalid Values and Outliers
Months in Business

Box Plot

Outliers

Univariate Analysis - Policies
Variable
• Categorical: Missing Values, Invalid Values, Encoding
• Numeric: Missing Values, Invalid Values & Outliers, Binning

Missing Value Policies
• Fill in missing values manually, based on domain knowledge
• Ignore the records with missing data
• Fill them in automatically with:
– A global constant (e.g., “?”)
– The variable mean
– Inference-based methods such as Bayes’ rule,
decision tree, or EM algorithm
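
The simpler policies in this list can be sketched in a few lines. This assumes missing values are represented as `None`; the function names are illustrative, and the inference-based methods are not shown.

```python
import statistics

def drop_missing(records):
    """Ignore records (dicts) that contain any missing value."""
    return [r for r in records if None not in r.values()]

def fill_with_constant(values, constant="?"):
    """Replace missing values with a global constant."""
    return [constant if v is None else v for v in values]

def fill_with_mean(values):
    """Replace missing values with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

print(fill_with_constant(["own", None, "rent"]))
print(fill_with_mean([10, None, 20]))
print(drop_missing([{"age": 30, "housing": None}, {"age": 41, "housing": "own"}]))
```

Mean imputation keeps the variable mean unchanged but shrinks its variance, which is one reason the inference-based methods are sometimes preferred.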

Managing Outliers
• Outliers are data points inconsistent with the majority of the data
• Types of outliers
– Valid: a CEO’s salary
– Noisy: a recorded age of 200, or other widely deviating points
• Detection and removal methods
– Box plot
– Clustering
– Curve-fitting
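
As an illustration of the box plot method, the sketch below flags points outside the whiskers. It assumes the common 1.5 × IQR whisker rule, which the slides do not state explicitly; the data and the function name are hypothetical.

```python
import statistics

def box_plot_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

ages = [23, 25, 27, 29, 31, 33, 35, 38, 41, 200]  # hypothetical, with one noisy value
print(box_plot_outliers(ages))
```

Whether a flagged point is removed still depends on whether it is valid or noisy, as described above.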

Encoding Categorical Variables
• Encoding is the process of transforming
categorical variables into numerical
counterparts.

• Encoding methods:
– Binary method
– Ordinal method
– Target-based method

Encoding

Housing (for free, own, rent)

• Binary method:
– for free: 1, 0, 0
– own: 0, 1, 0
– rent: 0, 0, 1

• Ordinal method:
– own: 1
– for free: 3
– rent: 5
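
Both encodings of the housing variable can be written as a short sketch. The ordinal codes are the ones shown on the slide; the function names are illustrative assumptions.

```python
def binary_encode(value, categories):
    """Binary (one-hot) method: one 0/1 indicator per category."""
    return [1 if value == category else 0 for category in categories]

def ordinal_encode(value, codes):
    """Ordinal method: map each category to a chosen numeric code."""
    return codes[value]

categories = ["for free", "own", "rent"]
ordinal_codes = {"own": 1, "for free": 3, "rent": 5}  # codes from the slide

for housing in categories:
    print(housing, binary_encode(housing, categories), ordinal_encode(housing, ordinal_codes))
```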

Binning Numerical Variables
• Binning is the process of transforming
numerical variables into categorical
counterparts.

• Binning methods:
– Equal Width
– Equal Frequency
– Entropy Based

Binning
• Variable: 0, 4, 12, 16, 16, 18, 24, 26, 28
• Equi-width binning:
– Bin 1 [-, 10): 0, 4
– Bin 2 [10, 20): 12, 16, 16, 18
– Bin 3 [20, +): 24, 26, 28
• Equi-frequency binning:
– Bin 1 [-, 14): 0, 4, 12
– Bin 2 [14, 21): 16, 16, 18
– Bin 3 [21, +): 24, 26, 28
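
A minimal sketch of both binning methods applied to the variable on this slide. One assumption to note: the equal-width function derives its edges from (max - min) / k, so its cut points (about 9.33 and 18.67) differ from the rounded edges 10 and 20 on the slide, although the values fall into the same bins.

```python
def equal_width_bins(values, k):
    """Split values into k bins that each span the same numeric width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    bins = [[] for _ in range(k)]
    for v in values:
        # The maximum value belongs to the last bin; a zero width puts all values in bin 0.
        index = 0 if width == 0 else min(int((v - lo) / width), k - 1)
        bins[index].append(v)
    return bins

def equal_frequency_bins(values, k):
    """Split sorted values into k bins holding (nearly) the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k:(i + 1) * n // k] for i in range(k)]

variable = [0, 4, 12, 16, 16, 18, 24, 26, 28]
print(equal_width_bins(variable, 3))
print(equal_frequency_bins(variable, 3))
```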
Binning
Months in Business

Data Exploration – Bivariate Analysis
Bivariate
• Numeric & Numeric
– Correlation
– Scatter Plot
• Categorical & Numeric
– z-test, t-test, ANOVA
– Combination Chart
• Categorical & Categorical
– Chi2 test
– Combination Chart

Numeric & Numeric

[Scatter plot: Total Balance ($0 to $120,000) against Months in Business (0 to 2000). Correlation = 0.114.]
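
The correlation reported for this plot is presumably the Pearson coefficient; that reading is an assumption, since the slide does not name it. A minimal sketch with hypothetical data:

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)

# Hypothetical values, not the data behind the slide.
months_in_business = [12, 48, 120, 240, 600]
total_balance = [15000, 9000, 42000, 18000, 30000]
print(round(pearson_correlation(months_in_business, total_balance), 3))
```

A value near 0, such as the 0.114 on the slide, indicates a weak linear relationship; it does not rule out a nonlinear one.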

Categorical & Numeric

Default   Total Balance Average   Total Balance Variance
N         $22,994                 $3,250
Y         $26,874                 $3,872

Is there a significant difference between the average balances of the two groups?

Is there a significant difference between the balance variances of the two groups?

Categorical & Numeric

Z test t test

F test ANOVA

Categorical & Numeric - Z, t, F Tests

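$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{S_1^2}{N_1} + \dfrac{S_2^2}{N_2}}}$$

$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S^2\left(\dfrac{1}{N_1} + \dfrac{1}{N_2}\right)}}$$

$$F = \frac{S_1^2}{S_2^2}$$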
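
The three statistics can be computed directly from two samples, as in the sketch below. It assumes that S² in the t formula is the pooled variance, which is the usual reading but is not spelled out on the slide; the samples are hypothetical.

```python
import math
import statistics

def z_statistic(a, b):
    """Z = (mean1 - mean2) / sqrt(S1^2/N1 + S2^2/N2)."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    standard_error = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

def t_statistic(a, b):
    """t = (mean1 - mean2) / sqrt(S^2 (1/N1 + 1/N2)), with S^2 the pooled variance."""
    n_a, n_b = len(a), len(b)
    pooled = ((n_a - 1) * statistics.variance(a)
              + (n_b - 1) * statistics.variance(b)) / (n_a + n_b - 2)
    standard_error = math.sqrt(pooled * (1 / n_a + 1 / n_b))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error

def f_statistic(a, b):
    """F = S1^2 / S2^2, the ratio of the two sample variances."""
    return statistics.variance(a) / statistics.variance(b)

group_1 = [1, 2, 3, 4, 5]    # hypothetical sample
group_2 = [2, 4, 6, 8, 10]   # hypothetical sample
print(z_statistic(group_1, group_2), t_statistic(group_1, group_2), f_statistic(group_1, group_2))
```

Each statistic is then compared with its reference distribution (normal, t, or F) to obtain a p-value.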

Analysis of Variance (ANOVA)

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square       F             P
Between Groups        SSB              dfB                  MSB = SSB/dfB     F = MSB/MSW   P(F)
Within Groups         SSW              dfW                  MSW = SSW/dfW
Total                 SST              dfT
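
The quantities in the ANOVA table can be computed as in this sketch for a one-way layout. The groups are hypothetical, and the p-value P(F) is left out because it requires the F distribution, which the standard library does not provide.

```python
def one_way_anova(groups):
    """Return the ANOVA table quantities for a list of numeric groups."""
    all_values = [v for group in groups for v in group]
    grand_mean = sum(all_values) / len(all_values)
    # Between-groups: squared distance of each group mean from the grand mean.
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-groups: squared distance of each value from its own group mean.
    ssw = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_b = len(groups) - 1
    df_w = len(all_values) - len(groups)
    msb, msw = ssb / df_b, ssw / df_w
    return {"SSB": ssb, "SSW": ssw, "SST": ssb + ssw,
            "dfB": df_b, "dfW": df_w, "dfT": df_b + df_w,
            "MSB": msb, "MSW": msw, "F": msb / msw}

groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical data
for name, value in one_way_anova(groups).items():
    print(f"{name:<4}{value}")
```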

Categorical & Categorical

                 Default Y   Default N
Corporation Y          366        2786
Corporation N          191        4777

Is the rate of default different between the two types of businesses?

Categorical & Categorical

                 Default Y   Default N
Corporation Y         4.5%       34.3%
Corporation N         2.4%       58.8%
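
These percentages are each cell count divided by the grand total of the count table (366, 2786, 191, 4777). The sketch below reproduces them and also computes the default rate within each business type, which is an added illustration rather than something shown on the slide.

```python
# Rows: Corporation Y, Corporation N. Columns: Default Y, Default N.
counts = [[366, 2786],
          [191, 4777]]

total = sum(sum(row) for row in counts)

# Share of all records falling in each cell, as on the slide.
cell_percent = [[round(100 * cell / total, 1) for cell in row] for row in counts]

# Default rate within each business type (row-wise share of Default Y).
default_rate = [row[0] / sum(row) for row in counts]

print(cell_percent)
print([f"{rate:.1%}" for rate in default_rate])
```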

Categorical & Categorical

[Bar chart: percentage of records (0% to 60%) by Default (Y, N) and Corporation (Y, N).]

Categorical & Categorical

$$\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - e_{ij})^2}{e_{ij}}$$

$$e_{ij} = \frac{n_{i.}\, n_{.j}}{n}$$

$$df = (r - 1)(c - 1)$$
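
A minimal sketch of the chi-square statistic, applied to the Corporation by Default counts from the earlier slide. The function name is illustrative, and the p-value lookup against the chi-square distribution is not included.

```python
def chi_square(table):
    """Return (chi2, df) for an r x c contingency table of observed counts."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / n  # e_ij
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df

# Rows: Corporation Y, Corporation N. Columns: Default Y, Default N.
observed = [[366, 2786],
            [191, 4777]]
chi2, df = chi_square(observed)
print(f"chi2 = {chi2:.2f}, df = {df}")
```

With one degree of freedom, a statistic above roughly 3.84 is significant at the 5% level.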

Data Exploration - MVP
[Combination chart: Months in Business and Default%.]

Summary
• Data exploration covers all activities needed to
become familiar with the data, identify data quality
problems, and discover first insights into the data.
• Univariate analysis can show variable
distribution, missing values, invalid values and
outliers.
• Bivariate analysis can discover relationships
between variables.
• The combination chart (variable & target) is the
most valuable type of plot.