Log Linear Models and Logistic Regression Springer Texts in Statistics

Data Exploration
Dr. Saed Sayad

University of Toronto
2010
saed.sayad@utoronto.ca
http://chem-eng.utoronto.ca/~datamining/
Data Mining Steps
1 • Problem Definition
2 • Data Preparation
3 • Data Exploration
4 • Modeling
5 • Evaluation
6 • Deployment
1. Problem Definition
Understanding the project objectives
and requirements from a business
perspective, converting this knowledge
into a data mining problem definition,
and a preliminary plan designed to
achieve the objectives.
Source: http://www.crisp-dm.org/Process/index.htm
2. Data Preparation
[Diagram: Data ETL, from DSN data and text sources to the Modeling Data]
3. Data Exploration
• Univariate Analysis
– Frequency, Mean, Min, Max, ...
– Bar, Line, Pie, ... Charts
• Bivariate Analysis
– Correlation, Z test, ...
– Combination Charts
Data Exploration - Univariate Analysis
• Categorical
– Count, Frequency
– Bar and Pie Charts
• Numerical
– Count, Mean, StDev
– Histogram, Box Plot
Univariate Analysis - Categorical
housing     Count   Frequency
for free       96      10.67%
own           641      71.22%
rent          163      18.11%

[Pie chart: Housing (for free 11%, own 71%, rent 18%)]
[Bar chart: Housing count by category (for free, own, rent); y-axis 0 to 700]
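The count and frequency columns above can be reproduced with a few lines of Python. This is a minimal sketch, not the author's tooling: the `housing` list is rebuilt from the counts on the slide, and `frequency_table` is a hypothetical helper name.

```python
from collections import Counter

# Rebuild the housing column from the counts reported on the slide.
housing = ["for free"] * 96 + ["own"] * 641 + ["rent"] * 163


def frequency_table(values):
    """Return {category: (count, percent of all records)}."""
    counts = Counter(values)
    total = sum(counts.values())
    return {
        category: (n, round(100 * n / total, 2))
        for category, n in sorted(counts.items())
    }


table = frequency_table(housing)
for category, (n, pct) in table.items():
    print(f"{category:<10}{n:>6}{pct:>9.2f}%")
```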

Missing Values
[Bar chart: frequency of Education values (x-axis labels include 1, 4, K, AN, BL, Missing Value); y-axis 0 to 2,500,000; annotation: 83%]
Invalid Values
[Bar chart: frequency of invalid doc_type_id values (x-axis labels include LL, X, Z, 3, NU); y-axis 0 to 1,400,000]
Univariate Analysis - Numeric
Age
Count     900    Average    35.25    StDev      11.20
Min       19     Median     33       Variance   125.37
Max       75     Mode       27       CV         32%
Range     56     Skewness   1.09
Missing   0      Kurtosis   0.88
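The statistics in this table can be computed as sketched below. Several things here are assumptions rather than the slide's method: the `ages` sample is hypothetical, `describe` is an illustrative name, and skewness and kurtosis use the population-moment definitions (kurtosis reported as excess kurtosis). Other software may apply small-sample corrections and return slightly different values.

```python
import statistics


def describe(values):
    """Univariate summary of a numeric variable (None marks a missing value)."""
    observed = [x for x in values if x is not None]
    n = len(observed)
    mean = statistics.fmean(observed)
    stdev = statistics.stdev(observed)  # sample standard deviation (n - 1)
    # Central moments, population form (assumed definition).
    m2 = sum((x - mean) ** 2 for x in observed) / n
    m3 = sum((x - mean) ** 3 for x in observed) / n
    m4 = sum((x - mean) ** 4 for x in observed) / n
    return {
        "count": n,
        "missing": len(values) - n,
        "min": min(observed),
        "max": max(observed),
        "range": max(observed) - min(observed),
        "mean": mean,
        "median": statistics.median(observed),
        "mode": statistics.mode(observed),
        "stdev": stdev,
        "variance": statistics.variance(observed),
        "cv": stdev / mean,               # coefficient of variation
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3,     # excess kurtosis
    }


ages = [19, 23, 27, 27, 31, 33, 36, 42, 51, 75]  # hypothetical sample
summary = describe(ages)
for name, value in summary.items():
    print(f"{name:<10}{value}")
```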
Missing and Invalid Values and Outliers
[Box plot: Months in Business, with outliers marked]
Univariate Analysis - Policies
Variable:   Categorical       Numeric
            Missing Values    Missing Values
            Invalid Values    Invalid Values & Outliers
            Encoding          Binning
Missing Value Policies
• Fill in missing values manually based on our
domain knowledge
• Ignore the records with missing data
• Fill them in automatically with:
– A global constant (e.g., “?”)
– The variable mean
– Inference-based methods such as Bayes’ rule,
decision tree, or EM algorithm
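The simpler policies in this list (ignoring records, a global constant, the variable mean) can be sketched as follows. This is an illustrative sketch only: missing values are assumed to be stored as `None`, and the function names are hypothetical. Inference-based filling is not shown.

```python
import statistics


def drop_missing(records, field):
    """Ignore the records whose `field` is missing."""
    return [r for r in records if r.get(field) is not None]


def fill_constant(values, constant="?"):
    """Replace each missing value with a global constant."""
    return [constant if v is None else v for v in values]


def fill_mean(values):
    """Replace each missing value with the mean of the observed values."""
    mean = statistics.fmean(v for v in values if v is not None)
    return [mean if v is None else v for v in values]


# Hypothetical records for illustration.
records = [
    {"age": 31, "housing": "own"},
    {"age": None, "housing": "rent"},
    {"age": 45, "housing": None},
]
print(drop_missing(records, "age"))
print(fill_constant([r["housing"] for r in records]))
print(fill_mean([r["age"] for r in records]))
```

Mean filling keeps every record but shrinks the variable's variance, so it is worth comparing the distribution before and after.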
Managing Outliers
• Data points inconsistent with the majority of data
• Types of outliers
– Valid: CEO’s salary
– Noisy: a recorded age of 200, or other widely deviating points
• Detection and removal methods
– Box plot
– Clustering
– Curve-fitting
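The box plot method can be sketched with the conventional 1.5 × IQR fences. The multiplier, the quartile method (Python's default "exclusive" method), the sample data, and the name `boxplot_outliers` are all assumptions for illustration; the slides do not specify them.

```python
import statistics


def boxplot_outliers(values, k=1.5):
    """Return the points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high]


ages = [25, 31, 33, 38, 41, 47, 52, 200]  # hypothetical sample with one noisy age
flagged = boxplot_outliers(ages)
print(flagged)
```

The function only flags points. Whether a flagged point is noise to remove or a valid extreme (such as a CEO's salary) is a domain decision.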
Encoding Categorical Variables
• Encoding is the process of transforming
categorical variables into numerical
counterparts.
• Encoding methods:
– Binary method
– Ordinal method
– Target-based encoding
Encoding
Housing (for free, own, rent)
• Binary method:
– for free: 1, 0, 0
– own: 0, 1, 0
– rent: 0, 0, 1
• Ordinal method:
– own: 1
– for free: 3
– rent: 5
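The three encoding methods can be sketched as below. The binary vectors and ordinal codes are the ones on this slide. Target-based encoding is not defined in the slides; here it is assumed to mean replacing each category with the mean of the target within that category, and the sample data for it is hypothetical.

```python
CATEGORIES = ["for free", "own", "rent"]
ORDINAL_CODES = {"own": 1, "for free": 3, "rent": 5}  # codes from the slide


def binary_encode(value, categories=CATEGORIES):
    """One indicator per category: 1 for the matching category, else 0."""
    return [1 if value == c else 0 for c in categories]


def ordinal_encode(value, codes=ORDINAL_CODES):
    """Map a category to its assigned numeric code."""
    return codes[value]


def target_encode(values, targets):
    """Map each category to the mean target value observed for it (assumed definition)."""
    sums, counts = {}, {}
    for v, t in zip(values, targets):
        sums[v] = sums.get(v, 0) + t
        counts[v] = counts.get(v, 0) + 1
    return {v: sums[v] / counts[v] for v in sums}


# Hypothetical housing values with a 0/1 default flag as the target.
housing = ["own", "own", "rent", "rent", "for free", "own"]
default = [0, 1, 1, 1, 0, 0]

print(binary_encode("own"))
print(ordinal_encode("rent"))
print(target_encode(housing, default))
```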
Binning Numerical Variables
• Binning is the process of transforming
numerical variables into categorical
counterparts.
• Binning methods:
– Equal Width
– Equal Frequency
– Entropy Based
Binning
• Variable: 0, 4, 12, 16, 16, 18, 24, 26, 28
• Equi-width binning:
– Bin 1, [-, 10): 0, 4
– Bin 2, [10, 20): 12, 16, 16, 18
– Bin 3, [20, +): 24, 26, 28
• Equi-frequency binning:
– Bin 1, [-, 14): 0, 4, 12
– Bin 2, [14, 21): 16, 16, 18
– Bin 3, [21, +): 24, 26, 28
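Both binnings of this variable can be reproduced with the sketch below. The function names are hypothetical, the bin width of 10 is read off the slide's intervals, and placing each equal-frequency cut point midway between neighbouring bins is an assumption that happens to match the slide's 14 and 21.

```python
def equal_width_bins(values, width, n_bins, start=0):
    """Assign values to bins of fixed width; the first and last bins are open-ended."""
    bins = [[] for _ in range(n_bins)]
    for v in values:
        index = int((v - start) // width)
        index = max(0, min(index, n_bins - 1))
        bins[index].append(v)
    return bins


def equal_frequency_bins(values, n_bins):
    """Split the sorted values into bins holding (nearly) the same number of points."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for b in range(n_bins):
        stop = start + size + (1 if b < extra else 0)
        bins.append(ordered[start:stop])
        start = stop
    return bins


def cut_points(bins):
    """Midpoint between the last value of one bin and the first value of the next."""
    return [(left[-1] + right[0]) / 2 for left, right in zip(bins, bins[1:])]


variable = [0, 4, 12, 16, 16, 18, 24, 26, 28]
width_bins = equal_width_bins(variable, width=10, n_bins=3)
freq_bins = equal_frequency_bins(variable, n_bins=3)
print(width_bins)
print(freq_bins, cut_points(freq_bins))
```

With tied values, this simple equal-frequency split can place equal values in different bins, so the cut points should be checked on real data.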
Binning
[Chart: binning of Months in Business]
Data Exploration – Bivariate Analysis
• Numeric & Numeric
– Correlation
– Scatter Plot
• Categorical & Numeric
– z-test, t-test, ANOVA
– Combination Chart
• Categorical & Categorical
– Chi2 test
– Combination Chart
Numeric & Numeric
[Scatter plot: Total Balance ($0 to $120,000) against Months in Business (0 to 2000); Correlation = 0.114]
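The correlation reported for this plot is presumably the Pearson coefficient, which can be computed as sketched here. The data below is hypothetical (the slide's records are not available), and `pearson_r` is an illustrative name.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical sample: months in business and total balance.
months = [12, 48, 60, 120, 240, 360]
balance = [18000, 25000, 9000, 31000, 22000, 27000]
print(round(pearson_r(months, balance), 3))
```

A value near 0, such as the 0.114 on the slide, indicates only a weak linear relationship; it does not rule out a non-linear one.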
Categorical & Numeric
Default   Total Balance Average   Total Balance Variance
N         $22,994                 $3,250
Y         $26,874                 $3,872

Is there a significant difference between the average balances of the two groups?
Is there a significant difference between the balance variances of the two groups?
Categorical & Numeric
Z test, t test, F test, ANOVA
Categorical & Numeric - Z, t, F Tests
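\[ Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{S_1^2}{N_1} + \dfrac{S_2^2}{N_2}}} \]

\[ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{S^2 \left( \dfrac{1}{N_1} + \dfrac{1}{N_2} \right)}} \]

\[ F = \frac{S_1^2}{S_2^2} \]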
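The three statistics on this slide can be computed from two samples as sketched below. The slides do not define S² in the t formula; it is assumed here to be the pooled variance, ((N1 − 1)·S1² + (N2 − 1)·S2²) / (N1 + N2 − 2). The sample balances and function names are hypothetical, and turning a statistic into a p-value would need the corresponding distribution, which is not shown.

```python
import math
import statistics


def z_statistic(a, b):
    """Difference of means over the unpooled standard error."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se


def t_statistic(a, b):
    """Difference of means over the pooled standard error (pooled variance assumed)."""
    n_a, n_b = len(a), len(b)
    pooled = ((n_a - 1) * statistics.variance(a)
              + (n_b - 1) * statistics.variance(b)) / (n_a + n_b - 2)
    se = math.sqrt(pooled * (1 / n_a + 1 / n_b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se


def f_statistic(a, b):
    """Ratio of the two sample variances."""
    return statistics.variance(a) / statistics.variance(b)


# Hypothetical balances for the non-default (N) and default (Y) groups.
balance_n = [21000, 24500, 19800, 23100, 26000, 22400]
balance_y = [27500, 25100, 29900, 24800, 28300]
print(z_statistic(balance_n, balance_y))
print(t_statistic(balance_n, balance_y))
print(f_statistic(balance_n, balance_y))
```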
Analysis of Variance (ANOVA)
Source of Variation   Sum of Squares   Degrees of Freedom   Mean Square       F             P
Between Groups        SSB              dfB                  MSB = SSB/dfB     F = MSB/MSW   P(F)
Within Groups         SSW              dfW                  MSW = SSW/dfW
Total                 SST              dfT
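The entries of the ANOVA table can be filled in as sketched below for a one-way layout. The groups are hypothetical and `one_way_anova` is an illustrative name. P(F) is left out because it requires the F distribution, which the Python standard library does not provide.

```python
def one_way_anova(groups):
    """Return the one-way ANOVA table entries for a list of numeric groups."""
    all_values = [x for g in groups for x in g]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_b, df_w = k - 1, n - k
    msb, msw = ssb / df_b, ssw / df_w
    return {
        "SSB": ssb, "SSW": ssw, "SST": ssb + ssw,
        "dfB": df_b, "dfW": df_w, "dfT": n - 1,
        "MSB": msb, "MSW": msw, "F": msb / msw,
    }


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical data
anova = one_way_anova(groups)
for name, value in anova.items():
    print(f"{name:<4}{value}")
```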
Categorical & Categorical
                Default Y   Default N
Corporation Y         366        2786
Corporation N         191        4777
Is the rate of default different between the two types of businesses?
                Default Y   Default N
Corporation Y        4.5%       34.3%
Corporation N        2.4%       58.8%

[Bar chart: share of all records by Default (Y, N) and Corporation (Y, N); y-axis 0% to 60%]
\[ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(n_{ij} - e_{ij})^2}{e_{ij}} \]

\[ e_{ij} = \frac{n_{i.} \, n_{.j}}{n} \]

\[ df = (r - 1)(c - 1) \]
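The chi-square statistic can be applied to the Corporation by Default counts from the earlier slide as sketched below. The function name `chi_square` is hypothetical. The 5% critical value of 3.841 for one degree of freedom is a standard table value, not something given in the slides.

```python
def chi_square(table):
    """Chi-square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # e_ij = n_i. * n_.j / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df


# Rows: Corporation Y, Corporation N. Columns: Default Y, Default N.
observed = [[366, 2786], [191, 4777]]
chi2, df = chi_square(observed)
print(f"chi2 = {chi2:.2f}, df = {df}")
print("differs at the 5% level" if chi2 > 3.841 else "no evidence of a difference")
```

For these counts the statistic comes out at roughly 182, far above 3.841, which suggests the default rate does differ between the two business types.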
Data Exploration - MVP
[Combination chart: Months in Business and Default, showing Default% across Months in Business]
Summary
• Data exploration covers all activities needed to
become familiar with the data, identify data quality
problems, and discover first insights into the data.
• Univariate analysis can show variable
distribution, missing values, invalid values and
outliers.
• Bivariate analysis can discover relationships
between variables.
• The combination chart (variable & target) is the
most valuable type of plot.
