Classification Tree Model Development
Submitted by: Gaurav Khokhani
INDEX
1) Objective
2) Data Summary
3) Exploratory Data Analysis
4) Hypothesis Testing & Validation
5) CART Model
6) CHAID Model
7) Findings & Summary
OBJECTIVE
To explore the data and build classification trees using the CHAID and CART techniques.
DATA SUMMARY
The HR employee attrition data records attrition along with various factors that may or may not affect it.
To understand the data, we first import it into R.
Importing Data in R
attrition <- read.csv(file.choose(), header = TRUE)  # imports the attrition data into R
Summary of Data
summary(attrition)  # gives a basic summary of the data
Numeric variables
Variable                   Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
Age                        18.00     30.00     36.00     36.92     43.00     60.00
DailyRate                  102.0     465.0     802.0     802.5     1157.0    1499.0
DistanceFromHome           1.000     2.000     7.000     9.193     14.000    29.000
Education                  1.000     2.000     3.000     2.913     4.000     5.000
EmployeeCount              1         1         1         1         1         1
EmployeeNumber             1.0       735.8     1470.5    1470.5    2205.2    2940.0
EnvironmentSatisfaction    1.000     2.000     3.000     2.722     4.000     4.000
HourlyRate                 30.00     48.00     66.00     65.89     84.00     100.00
JobInvolvement             1.00      2.00      3.00      2.73      3.00      4.00
JobLevel                   1.000     1.000     2.000     2.064     3.000     5.000
JobSatisfaction            1.000     2.000     3.000     2.729     4.000     4.000
MonthlyIncome              1009      2911      4919      6503      8380      19999
MonthlyRate                2094      8045      14236     14313     20462     26999
NumCompaniesWorked         0.000     1.000     2.000     2.693     4.000     9.000
PercentSalaryHike          11.00     12.00     14.00     15.21     18.00     25.00
PerformanceRating          3.000     3.000     3.000     3.154     3.000     4.000
RelationshipSatisfaction   1.000     2.000     3.000     2.712     4.000     4.000
StandardHours              80        80        80        80        80        80
StockOptionLevel           0.0000    0.0000    1.0000    0.7939    1.0000    3.0000
TotalWorkingYears          0.00      6.00      10.00     11.28     15.00     40.00
TrainingTimesLastYear      0.000     2.000     3.000     2.799     3.000     6.000
WorkLifeBalance            1.000     2.000     3.000     2.761     3.000     4.000
YearsAtCompany             0.000     3.000     5.000     7.008     9.000     40.000
YearsInCurrentRole         0.000     2.000     3.000     4.229     7.000     18.000
YearsSinceLastPromotion    0.000     0.000     1.000     2.188     3.000     15.000
YearsWithCurrManager       0.000     2.000     3.000     4.123     7.000     17.000

Categorical variables (level: count)
Attrition        No: 2466, Yes: 474
BusinessTravel   Non-Travel: 300, Travel_Frequently: 554, Travel_Rarely: 2086
Department       Human Resources: 126, Research & Development: 1922, Sales: 892
EducationField   Human Resources: 54, Life Sciences: 1212, Marketing: 318, Medical: 928, Other: 164, Technical Degree: 264
Gender           Female: 1176, Male: 1764
JobRole          Sales Executive: 652, Research Scientist: 584, Laboratory Technician: 518, Manufacturing Director: 290, Healthcare Representative: 262, Manager: 204, (Other): 430
MaritalStatus    Divorced: 654, Married: 1346, Single: 940
Over18           Y: 2940
OverTime         No: 2108, Yes: 832
Data Summary using R
- We were able to observe the min, mean and median of all numeric variables
- Categorical variables were converted to factors
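The factor conversion noted above can be sketched as follows. This is a minimal illustration on a made-up three-row data frame: the rows are invented, and only the column names come from the summary. It assumes the file was read with character columns, which is the read.csv default from R 4.0 onward.

```r
# Toy stand-in for the attrition data; the rows are invented for illustration.
toy <- data.frame(
  Age = c(41, 49, 37),
  Attrition = c("Yes", "No", "Yes"),
  Department = c("Sales", "Research & Development", "Sales"),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor so summary() reports level counts.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

summary(toy)
```

Numeric columns such as Age are left untouched, so summary() still reports their quartiles.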
Exploratory Data Analysis
Relation between Attrition and Gender
Relation between Attrition and Working Years
Relation between Attrition and Monthly Income
Relation between Attrition and Department
Relation between Attrition and Marital Status
Attrition vs Gender
Attrition vs Monthly Income
boxplot(MonthlyIncome ~ Attrition, data = attrition, main = "Boxplot by Monthly Income", ylab = "Monthly Income")
The plot shows that employees with a lower monthly income are more prone to attrition than those with a higher income.
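The reading of the boxplot can be backed by a number: the median income of each attrition group, which is the centre line of each box. The sketch below uses invented incomes purely to show the call; with the real data the same tapply() would be run on the MonthlyIncome and Attrition columns.

```r
# Invented toy values; only the column names follow the dataset.
toy <- data.frame(
  Attrition = c("No", "No", "No", "Yes", "Yes", "Yes"),
  MonthlyIncome = c(5200, 8100, 6400, 2300, 3100, 2900)
)

# Median income per attrition group, the centre line of each box in the plot.
med <- tapply(toy$MonthlyIncome, toy$Attrition, median)
print(med)
```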
Department vs Attrition
tabdep <- with(attrition, table(Department, Attrition))
barplot(tabdep, beside = TRUE, legend = TRUE, main = "Barplot of Department", ylab = "Attrition")
By headcount, attrition is highest in the Research & Development department.
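Raw counts favour the largest department (Research & Development holds 1922 of the 2940 rows in the summary), so a row-wise attrition rate is a useful complement to the bar plot. The sketch below uses invented counts and shortened department labels, not figures from this report, to show how prop.table() separates "most leavers" from "highest share of leavers".

```r
# Invented counts for illustration only; they are not taken from the report.
tab <- matrix(c(80, 20,
                450, 50,
                170, 30),
              nrow = 3, byrow = TRUE,
              dimnames = list(Department = c("HR", "R&D", "Sales"),
                              Attrition = c("No", "Yes")))

# Row-wise proportions: the share of each department that left.
rate <- prop.table(tab, margin = 1)[, "Yes"]
print(round(rate, 2))
```

In this toy table R&D has the most leavers by count while HR has the highest rate, which is why both views are worth checking.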
Marital Status vs Attrition
tabRelation <- with(attrition, table(MaritalStatus, Attrition))
barplot(tabRelation, beside = TRUE, legend = TRUE, main = "Barplot of Marital Status", ylab = "Attrition")
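Correlation Coefficients of Numeric Variables
library(corrplot)
X <- anew[which(anew$x > 0), ]            # keep rows where x > 0
correlations <- cor(X[, 25:36])           # correlation matrix of columns 25 to 36
corrplot(correlations, method = "pie")    # draw the matrix as pie glyphs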
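Selecting columns by position (25:36) breaks if the column order changes. One possible alternative, shown here as a sketch on invented rows rather than the report's method, is to pick the numeric columns by type before calling cor().

```r
# Invented toy rows; column names follow the dataset.
toy <- data.frame(
  Age = c(25, 32, 41, 50, 58),
  TotalWorkingYears = c(2, 8, 15, 25, 35),
  Gender = c("Male", "Female", "Male", "Female", "Male")
)

# Keep only numeric columns, then compute the Pearson correlation matrix.
num_cols <- vapply(toy, is.numeric, logical(1))
correlations <- cor(toy[, num_cols, drop = FALSE])
print(round(correlations, 2))
```

On the real data, constant columns such as EmployeeCount and StandardHours have zero variance and would yield NA correlations, so they are best dropped first.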
Hypothesis Testing

1) Department
Ho: Attrition is not dependent on department
Ha: Attrition is dependent on department
Code: chisq.test(tabdep, correct = TRUE)
Chi-square output: data: tabdep, X-squared = 21.592, df = 2, p-value = 2.048e-05
Result: Reject Ho

2) Gender
Ho: Attrition is not dependent on gender
Ha: Attrition is dependent on gender
Code: chisq.test(tabGender, correct = TRUE)
Chi-square output: data: tabGender, X-squared = 2.3896, df = 1, p-value = 0.1221
Result: Do not reject Ho

3) Income
Ho: Attrition is not dependent on income
Ha: Attrition is dependent on income
Code: mytable1 <- table(Income, Attrition)
      chisq.test(mytable1, correct = TRUE)
Chi-square output: (not reported)
Result: Reject Ho

4) Education
Ho: Attrition is not dependent on education
Ha: Attrition is dependent on education
Code: tabeducation <- table(Education, Attrition)
      chisq.test(tabeducation, correct = TRUE)
Chi-square output: data: tabeducation, X-squared = 6.1479, df = 4, p-value = 0.1884
Result: Do not reject Ho

5) Job Involvement
Ho: Attrition is not dependent on job involvement
Ha: Attrition is dependent on job involvement
Code: tabjbi <- table(JobInvolvement, Attrition)
      chisq.test(tabjbi, correct = TRUE)
Chi-square output: data: tabjbi, X-squared = 56.984, df = 3, p-value = 2.59e-12
Result: Reject Ho

6) Job Satisfaction
Ho: Attrition is not dependent on job satisfaction
Ha: Attrition is dependent on job satisfaction
Code: tabsat <- table(JobSatisfaction, Attrition)
      chisq.test(tabsat, correct = TRUE)
Chi-square output: data: tabsat, X-squared = 35.01, df = 3, p-value = 1.212e-07
Result: Reject Ho

7) Marital Status
Ho: Attrition is not dependent on marital status
Ha: Attrition is dependent on marital status
Code: mytable3 <- table(MaritalStatus, Attrition)
      chisq.test(mytable3, correct = TRUE)
Chi-square output: (not reported)
Result: Reject Ho
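The link between each reported chi-square statistic and its p-value can be checked directly with pchisq(). The sketch below recomputes the p-values for three of the tests from the X-squared and df values printed above. The 5% significance level is an assumption, since the report does not state the threshold it used.

```r
# Chi-square statistics and degrees of freedom as reported in the tests above.
stats <- data.frame(
  test = c("Department", "Gender", "Education"),
  x2 = c(21.592, 2.3896, 6.1479),
  df = c(2, 1, 4)
)

# Upper-tail probability of the chi-square distribution gives the p-value.
stats$p_value <- pchisq(stats$x2, df = stats$df, lower.tail = FALSE)

# Assumed decision rule: reject Ho when the p-value is below 0.05.
stats$reject_Ho <- stats$p_value < 0.05
print(stats)
```

Under that assumed threshold, only the Department test of these three rejects Ho, in line with the results listed above.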
CART ANALYSIS
CART (Classification and Regression Trees) is a simple yet powerful analytic tool.
It helps determine the most important variables in a dataset, ranked by explanatory power.
[Figure: tree before pruning]
[Figure: tree after pruning]
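The report does not show the code behind the two trees, so the following is only a sketch of one common way to grow and prune a CART tree in R, using the rpart package. It is an assumption that rpart was used here. The kyphosis data bundled with rpart stands in for the attrition data, and the control settings are illustrative.

```r
library(rpart)  # recommended package shipped with standard R distributions

set.seed(1)  # the cross-validated errors in the cp table are random

# kyphosis is a small example dataset bundled with rpart, used as a stand-in.
full_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class",
                   control = rpart.control(cp = 0, minsplit = 5))

# Pick the complexity parameter with the lowest cross-validated error.
cp_table <- full_tree$cptable
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune back to that complexity; the result is never larger than the full tree.
pruned_tree <- prune(full_tree, cp = best_cp)
print(pruned_tree)
```

Growing with cp = 0 deliberately overfits, and pruning then trades a little training accuracy for a smaller tree that is easier to read.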
CHAID ANALYSIS
CHAID (Chi-square Automatic Interaction Detection) analysis determines how variables best combine to explain the outcome of a given dependent variable. The model can be used for market penetration studies, for predicting and interpreting responses, and for a multitude of other research problems.
library(corrplot)
library(caret)
library(CHAID)
# Read strings as factors so that Attrition can be recoded numerically below
data_new <- read.csv(file.choose(), sep = ",", header = TRUE, stringsAsFactors = TRUE)
# Recode the target: 1 = "No", 2 = "Yes" (factor levels are in alphabetical order)
x <- as.numeric(data_new$Attrition)
data_new <- cbind(data_new, x)
data_new$x <- as.factor(data_new$x)
# CHAID requires categorical predictors, so convert each one to a factor
data_new$Gender <- as.factor(data_new$Gender)
data_new$MonthlyIncome <- as.factor(data_new$MonthlyIncome)
data_new$Department <- as.factor(data_new$Department)
data_new$Education <- as.factor(data_new$Education)
data_new$MaritalStatus <- as.factor(data_new$MaritalStatus)
data_new$JobSatisfaction <- as.factor(data_new$JobSatisfaction)
data_new$WorkLifeBalance <- as.factor(data_new$WorkLifeBalance)
data_new$Age <- as.factor(data_new$Age)
data_new$BusinessTravel <- as.factor(data_new$BusinessTravel)
data_new$EnvironmentSatisfaction <- as.factor(data_new$EnvironmentSatisfaction)
data_new$JobRole <- as.factor(data_new$JobRole)
data_new$OverTime <- as.factor(data_new$OverTime)
data_new$PerformanceRating <- as.factor(data_new$PerformanceRating)
data_new$YearsAtCompany <- as.factor(data_new$YearsAtCompany)
# Minimum node sizes, plus significance levels for merging categories (alpha2) and splitting nodes (alpha4)
ctrl <- chaid_control(minbucket = 100, minsplit = 10, alpha2 = 0.05, alpha4 = 0.05)
chaid.tree1 <- chaid(x ~ Gender + MonthlyIncome + Department + Education + MaritalStatus + JobSatisfaction + Age + BusinessTravel + EnvironmentSatisfaction + JobRole + OverTime + PerformanceRating + YearsAtCompany, data = data_new, control = ctrl)
print(chaid.tree1)
plot(chaid.tree1)
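CHAID works on categorical predictors, and the code above turns every distinct MonthlyIncome value into its own factor level. One possible alternative, which is not what the report does, is to bin income into a few bands first. The sketch below uses the quartile boundaries printed in the data summary as break points. The incomes and the band labels are invented for illustration.

```r
# Quartile boundaries of MonthlyIncome as printed in the data summary.
breaks <- c(1009, 2911, 4919, 8380, 19999)

# Invented incomes for illustration; real code would use data_new$MonthlyIncome.
income <- c(1009, 2500, 4000, 6000, 12000, 19999)

# Four ordered bands instead of one factor level per distinct income.
income_band <- cut(income, breaks = breaks, include.lowest = TRUE,
                   labels = c("Low", "Lower-mid", "Upper-mid", "High"),
                   ordered_result = TRUE)
print(table(income_band))
```

Setting include.lowest = TRUE keeps the minimum income (1009) inside the first band instead of turning it into NA.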
THANK YOU