
Assignment on Classification Tree Model Development
Submitted by: Gaurav Khokhani

INDEX
1) Objective
2) Data Summary
3) Exploratory Data Analysis
4) Hypothesis Testing & Validation
5) CART Model
6) CHAID Model
7) Findings & Summary

OBJECTIVE
To explore the HR employee attrition data and build classification trees using the CHAID and CART techniques.

DATA SUMMARY
The HR employee attrition data records attrition together with various factors that may or may not affect it.
To understand the data, we first import it into R.
Importing Data in R
attrition <- read.csv(file.choose(), header = TRUE)   # imports the attrition data into R
Summary of Data
summary(attrition)   # gives a basic summary of the data
Numeric variables
Variable                  Min.      1st Qu.   Median    Mean      3rd Qu.   Max.
Age                       18.00     30.00     36.00     36.92     43.00     60.00
DailyRate                 102.0     465.0     802.0     802.5     1157.0    1499.0
DistanceFromHome          1.000     2.000     7.000     9.193     14.000    29.000
Education                 1.000     2.000     3.000     2.913     4.000     5.000
EmployeeCount             1         1         1         1         1         1
EmployeeNumber            1.0       735.8     1470.5    1470.5    2205.2    2940.0
EnvironmentSatisfaction   1.000     2.000     3.000     2.722     4.000     4.000
HourlyRate                30.00     48.00     66.00     65.89     84.00     100.00
JobInvolvement            1.00      2.00      3.00      2.73      3.00      4.00
JobLevel                  1.000     1.000     2.000     2.064     3.000     5.000
JobSatisfaction           1.000     2.000     3.000     2.729     4.000     4.000
MonthlyIncome             1009      2911      4919      6503      8380      19999
MonthlyRate               2094      8045      14236     14313     20462     26999
NumCompaniesWorked        0.000     1.000     2.000     2.693     4.000     9.000
PercentSalaryHike         11.00     12.00     14.00     15.21     18.00     25.00
PerformanceRating         3.000     3.000     3.000     3.154     3.000     4.000
RelationshipSatisfaction  1.000     2.000     3.000     2.712     4.000     4.000
StandardHours             80        80        80        80        80        80
StockOptionLevel          0.0000    0.0000    1.0000    0.7939    1.0000    3.0000
TotalWorkingYears         0.00      6.00      10.00     11.28     15.00     40.00
TrainingTimesLastYear     0.000     2.000     3.000     2.799     3.000     6.000
WorkLifeBalance           1.000     2.000     3.000     2.761     3.000     4.000
YearsAtCompany            0.000     3.000     5.000     7.008     9.000     40.000
YearsInCurrentRole        0.000     2.000     3.000     4.229     7.000     18.000
YearsSinceLastPromotion   0.000     0.000     1.000     2.188     3.000     15.000
YearsWithCurrManager      0.000     2.000     3.000     4.123     7.000     17.000

Categorical variables (count per level)
Attrition: No 2466, Yes 474
BusinessTravel: Non-Travel 300, Travel_Frequently 554, Travel_Rarely 2086
Department: Human Resources 126, Research & Development 1922, Sales 892
EducationField: Human Resources 54, Life Sciences 1212, Marketing 318, Medical 928, Other 164, Technical Degree 264
Gender: Female 1176, Male 1764
JobRole: Sales Executive 652, Research Scientist 584, Laboratory Technician 518, Manufacturing Director 290, Healthcare Representative 262, Manager 204, (Other) 430
MaritalStatus: Divorced 654, Married 1346, Single 940
Over18: Y 2940
OverTime: No 2108, Yes 832

Data Summary using R
Summary code: syntax
-Was able to observe the min, mean, median of all numeric variables
-Categorical variables converted as factors
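A minimal sketch of how categorical columns end up as factors on import, using a small made-up data frame (the values are invented for illustration and are not taken from the attrition file). Since R 4.0, read.csv() keeps text columns as character unless stringsAsFactors = TRUE is passed, so summary() reports counts per level only after conversion. The names toy, path and attr_toy are hypothetical.

```r
# Toy stand-in for the attrition file (values are invented for illustration)
toy <- data.frame(
  Age       = c(41, 49, 37, 33, 27, 32),
  Attrition = c("Yes", "No", "Yes", "No", "No", "No"),
  Gender    = c("Female", "Male", "Male", "Female", "Male", "Male")
)

# Write and re-read it the way a CSV export would be imported
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)
attr_toy <- read.csv(path, header = TRUE, stringsAsFactors = TRUE)

# Text columns are now factors, so summary() reports counts per level
str(attr_toy)
summary(attr_toy)

# Counts per level of the target variable
attrition_counts <- table(attr_toy$Attrition)
print(attrition_counts)
```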

Exploratory Data Analysis
Relation between Attrition and Gender
Relation between Attrition and Working Years
Relation between Attrition and Monthly Income
Relation between Attrition and Departments
Relation between Attrition and Marital Status

1) Attrition Vs Gender

As per the graph, males are more prone to attrition than females.
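The data summary shows more males (1764) than females (1176), so raw counts and within-group rates can tell different stories. A minimal sketch of computing both, on hypothetical records (invented for illustration, not the attrition data); the names toy, tab_gender and rate_gender are assumptions.

```r
# Hypothetical records (invented for illustration; not the attrition data)
toy <- data.frame(
  Gender    = factor(c("Male", "Male", "Male", "Male", "Male", "Male",
                       "Female", "Female", "Female", "Female")),
  Attrition = factor(c("Yes", "Yes", "No", "No", "No", "No",
                       "Yes", "No", "No", "No"))
)

# Cross-tabulate: rows are Gender, columns are Attrition
tab_gender <- table(toy$Gender, toy$Attrition)

# Row-wise proportions give the attrition rate within each gender
rate_gender <- prop.table(tab_gender, margin = 1)
print(round(rate_gender, 3))

# Grouped bar chart of counts, one group of bars per Attrition value
barplot(tab_gender, beside = TRUE, legend.text = TRUE,
        main = "Attrition by Gender", ylab = "Employees")
```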

Total Working Years Vs Attrition
boxplot(TotalWorkingYears~Attrition,main="Boxplot by Total Working Years",ylab="Years of Experience")
As per the plot, employees with less experience are more prone to attrition than those with more experience.
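The visual reading of a boxplot can be backed by the group medians it draws. A minimal sketch on hypothetical values (invented for illustration, not the attrition data); toy, med_years and bp are assumed names.

```r
# Hypothetical values (invented for illustration; not the attrition data)
toy <- data.frame(
  TotalWorkingYears = c(1, 3, 4, 6, 8, 10, 12, 15, 20, 25),
  Attrition = factor(c("Yes", "Yes", "Yes", "No", "Yes",
                       "No", "No", "No", "No", "No"))
)

# Median experience within each Attrition group
med_years <- tapply(toy$TotalWorkingYears, toy$Attrition, median)
print(med_years)

# The statistics boxplot() would draw, computed without plotting
bp <- boxplot(TotalWorkingYears ~ Attrition, data = toy, plot = FALSE)
# Rows: lower whisker, lower hinge, median, upper hinge, upper whisker
print(bp$stats)
```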

Attrition Vs Monthly Income
boxplot(MonthlyIncome~Attrition,main="Boxplot by Monthly Income",ylab="Monthly Income")
As per the plot, employees with a lower monthly salary are more prone to attrition than those with a higher salary.

Department Vs Attrition
tabdep=table(Department,Attrition)
barplot(tabdep,beside=T,legend=T,main="Barplot of Department",ylab="Attrition")
Attrition by number is highest in the R&D department.

Marital Status Vs Attrition
tabRelation=table(MaritalStatus,Attrition)
barplot(tabRelation,beside=T,legend=T,main="Barplot of Marital Status",ylab="Attrition")

Attrition by number is highest for employees whose marital status is single.

Correlation Coefficient of Numeric Variables
library(corrplot)
X <- anew[which(anew$x > 0), ]
correlations <- cor(X[, 25:36])
corrplot(correlations, method="pie")
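A hedged alternative to hard-coded column positions: select the numeric columns by type and drop constant ones before calling cor(). The data summary shows EmployeeCount and StandardHours take a single value, and cor() returns NA for such columns. The data below are invented for illustration; toy, num_cols and varying are assumed names.

```r
# Hypothetical data (invented for illustration; not the attrition data)
toy <- data.frame(
  Age               = c(25, 32, 41, 29, 50, 38, 45, 27),
  TotalWorkingYears = c(2, 8, 18, 5, 28, 14, 22, 3),
  StandardHours     = rep(80, 8),              # constant column
  Department        = rep(c("Sales", "R&D"), 4)
)

# Keep numeric columns only
num_cols <- toy[, sapply(toy, is.numeric), drop = FALSE]

# Drop zero-variance columns, for which cor() would return NA
varying <- sapply(num_cols, function(col) sd(col) > 0)
num_cols <- num_cols[, varying, drop = FALSE]

correlations <- cor(num_cols)
print(round(correlations, 2))
```

The resulting matrix can be passed to corrplot() in the same way as above.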

Hypothesis Testing

1) Department
Ho: Attrition is not dependent on department
Ha: Attrition is dependent on department
Code: chisq.test(tabdep,correct=T)
Chi Sq output: Pearson's Chi-squared test
data: tabdep
X-squared = 21.592, df = 2, p-value = 2.048e-05
Result: Reject Ho

2) Gender
Ho: Attrition is not dependent on Gender
Ha: Attrition is dependent on Gender
Code: chisq.test(tabGender,correct=T)
Chi Sq output: Pearson's Chi-squared test with Yates' continuity correction
data: tabGender
X-squared = 2.3896, df = 1, p-value = 0.1221
Result: Do not reject Ho

3) Income
Ho: Attrition is not dependent on Income
Ha: Attrition is dependent on Income
Code: mytable1=table(Income,Attrition)
chisq.test(mytable1,correct=T)
Result: Reject Ho

4) Education
Ho: Attrition is not dependent on Education
Ha: Attrition is dependent on Education
Code: tabeducation<-table(Education,Attrition)
chisq.test(tabeducation,correct=T)
Chi Sq output: Pearson's Chi-squared test
data: tabeducation
X-squared = 6.1479, df = 4, p-value = 0.1884
Result: Do not reject Ho

5) Job Involvement
Ho: Attrition is not dependent on Job Involvement
Ha: Attrition is dependent on Job Involvement
Code: tabjbi<-table(JobInvolvement,Attrition)
chisq.test(tabjbi,correct=T)
Chi Sq output: Pearson's Chi-squared test
data: tabjbi
X-squared = 56.984, df = 3, p-value = 2.59e-12
Result: Reject Ho

6) Job Satisfaction
Ho: Attrition is not dependent on Job Satisfaction
Ha: Attrition is dependent on Job Satisfaction
Code: tabsat<-table(JobSatisfaction,Attrition)
chisq.test(tabsat,correct=T)
Chi Sq output: Pearson's Chi-squared test
data: tabsat
X-squared = 35.01, df = 3, p-value = 1.212e-07
Result: Reject Ho

7) Marital Status
Ho: Attrition is not dependent on Marital Status
Ha: Attrition is dependent on Marital Status
Code: mytable3=table(MaritalStatus,Attrition)
chisq.test(mytable3,correct=T)
Result: Reject Ho
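To show what chisq.test() computes, here is a minimal sketch that rebuilds the Pearson statistic by hand on a hypothetical 2 x 2 cross tab (the counts are invented for illustration and are not from the attrition data). The names tab, expected, x_sq, res and res_yates are assumptions.

```r
# Hypothetical 2 x 2 cross tab (counts invented for illustration)
tab <- matrix(c(90, 60,
                10, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(Attrition = c("No", "Yes"),
                              Group     = c("A", "B")))

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson statistic, degrees of freedom and p-value
x_sq <- sum((tab - expected)^2 / expected)
df   <- (nrow(tab) - 1) * (ncol(tab) - 1)
p    <- pchisq(x_sq, df = df, lower.tail = FALSE)

# chisq.test() without the continuity correction gives the same numbers
res <- chisq.test(tab, correct = FALSE)
print(c(manual = x_sq, builtin = unname(res$statistic)))
print(c(manual = p, builtin = res$p.value))

# correct = TRUE applies Yates' correction, which is used only for 2 x 2 tables
res_yates <- chisq.test(tab, correct = TRUE)
print(unname(res_yates$statistic))
```

This is why only the Gender test, a 2 x 2 table, reports Yates' continuity correction even though correct=T is passed everywhere.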

CART ANALYSIS
CART (Classification and Regression Trees) is a simple yet powerful analytic tool.
It helps determine the most important variables (based on explanatory power) in a particular dataset.
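One common way CART measures explanatory power is the drop in Gini impurity produced by a split (Gini is the default splitting criterion in rpart for classification). A minimal sketch, assuming made-up data; the helper names gini and split_gini are hypothetical and not part of any package.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Weighted impurity after splitting the labels by a logical condition
split_gini <- function(labels, goes_left) {
  n <- length(labels)
  left  <- labels[goes_left]
  right <- labels[!goes_left]
  (length(left) / n) * gini(left) + (length(right) / n) * gini(right)
}

# Hypothetical data (invented for illustration)
attrition <- c("Yes", "Yes", "Yes", "No", "No", "No", "No", "No")
income    <- c(2000, 2500, 3000, 3500, 6000, 7000, 8000, 9000)

# Impurity before splitting and after two candidate thresholds
before     <- gini(attrition)
after_3200 <- split_gini(attrition, income < 3200)
after_5000 <- split_gini(attrition, income < 5000)
print(c(before = before, after_3200 = after_3200, after_5000 = after_5000))
```

The threshold with the lower weighted impurity separates the classes better, so a tree builder would prefer it.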

CART Source Code in R

library(rpart)
library(rpart.plot)
library(tree)
library(rattle)
library(RColorBrewer)
library(CHAID)
# Read the data; text columns become factors
a<-read.csv(file.choose(),stringsAsFactors=TRUE)
str(a)
# Shuffle the rows reproducibly before splitting into training and test rows
set.seed(9850)
g<-runif(nrow(a))
anew<-a[order(g),]
str(anew)
# Fit the classification tree on the first 2000 shuffled rows
training_data<-rpart(Attrition~.,data=anew[1:2000,],method="class")
rpart.plot(training_data,extra=101,fallen.leaves = T)
# Predict the remaining rows and cross-tabulate actual vs predicted
p1<-predict(training_data,anew[2001:2940,],type="class")
table(anew$Attrition[2001:2940],p1)
# Prune at the CP with the lowest cross-validated error (stored in $cptable)
pfit<- prune(training_data, cp=training_data$cptable[which.min(training_data$cptable[,"xerror"]),"CP"])
printcp(pfit)
fancyRpartPlot(pfit,uniform=TRUE, main="Pruned Classification Tree")
plotcp(pfit)
plot(pfit,uniform = TRUE)
text(pfit, use.n=TRUE, all=TRUE, cex=.8,pretty = 0)
rpart.plot(pfit,extra=101)

Before Pruning

After Pruning
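The grow, prune and evaluate steps can be tried end to end without the attrition file. The sketch below uses the built-in iris data as a stand-in (an assumption made only so the example is self-contained); the split sizes and the names shuffled, train, test, best_cp and conf_mat are likewise assumptions.

```r
library(rpart)

# Built-in iris data stands in for the attrition file (illustration only)
set.seed(9850)
shuffled <- iris[sample(nrow(iris)), ]
train <- shuffled[1:100, ]
test  <- shuffled[101:150, ]

# Grow a classification tree on the training rows
fit <- rpart(Species ~ ., data = train, method = "class")

# The complexity table is stored in fit$cptable; take the CP with the
# lowest cross-validated error (xerror) and prune back to it
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pfit <- prune(fit, cp = best_cp)

# Confusion matrix and accuracy on the held-out rows
pred <- predict(pfit, newdata = test, type = "class")
conf_mat <- table(actual = test$Species, predicted = pred)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
print(conf_mat)
print(accuracy)
```

Evaluating on rows the tree never saw gives a fairer estimate of accuracy than scoring the training rows.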

CHAID ANALYSIS
CHAID (Chi-square Automatic Interaction Detection) analysis determines how variables best combine to explain the outcome in a given dependent variable. The model can be used for market penetration studies, for predicting and interpreting responses, and for a multitude of other research problems.
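The core idea can be sketched with base R alone: test each categorical predictor against the target with a chi-square test and split on the one with the smallest p-value. This is a simplified illustration on invented data, not the CHAID package's algorithm, which also merges similar categories and adjusts the p-values. The names toy, predictors, p_values and split_on are assumptions.

```r
# Hypothetical categorical data (invented for illustration)
toy <- data.frame(
  Attrition = factor(rep(c("Yes", "No"), times = c(20, 40))),
  OverTime  = factor(c(rep("Yes", 15), rep("No", 5),
                       rep("Yes", 10), rep("No", 30))),
  Gender    = factor(c(rep("Male", 10), rep("Female", 10),
                       rep("Male", 20), rep("Female", 20)))
)

# Chi-square p-value of each candidate predictor against the target
predictors <- c("OverTime", "Gender")
p_values <- sapply(predictors, function(v) {
  chisq.test(table(toy[[v]], toy$Attrition), correct = FALSE)$p.value
})
print(p_values)

# Split on the predictor with the smallest p-value,
# provided it falls below the chosen significance level
alpha <- 0.05
best <- names(which.min(p_values))
split_on <- if (p_values[[best]] < alpha) best else NA
print(split_on)
```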

library(corrplot)

CHAID Source Code in R

library(caret)
library(CHAID)
# Read the data; text columns become factors
data_new<- read.csv(file.choose(), sep = ",", header = T, stringsAsFactors = T)
# Numeric code of the Attrition factor (1 = No, 2 = Yes), kept as factor x
x<-as.numeric(data_new$Attrition)
data_new<-cbind(data_new,x)
data_new$x = as.factor(data_new$x)
# chaid() needs categorical predictors, so convert each one to a factor
data_new$Gender = as.factor(data_new$Gender)
data_new$MonthlyIncome = as.factor(data_new$MonthlyIncome)
data_new$Department = as.factor(data_new$Department)
data_new$Education = as.factor(data_new$Education)
data_new$MaritalStatus = as.factor(data_new$MaritalStatus)
data_new$JobSatisfaction = as.factor(data_new$JobSatisfaction)
data_new$WorkLifeBalance = as.factor(data_new$WorkLifeBalance)
data_new$Age = as.factor(data_new$Age)
data_new$BusinessTravel = as.factor(data_new$BusinessTravel)
data_new$EnvironmentSatisfaction = as.factor(data_new$EnvironmentSatisfaction)
data_new$JobRole = as.factor(data_new$JobRole)
data_new$OverTime = as.factor(data_new$OverTime)
data_new$PerformanceRating = as.factor(data_new$PerformanceRating)
data_new$YearsAtCompany = as.factor(data_new$YearsAtCompany)
# Tree growth settings: minimum node sizes and significance levels
ctrl <- chaid_control(minbucket = 100, minsplit = 10, alpha2=.05, alpha4 = .05)
chaid.tree1 <-chaid(x~ Gender + MonthlyIncome + Department + Education + MaritalStatus + JobSatisfaction + Age+ BusinessTravel + EnvironmentSatisfaction + JobRole + OverTime + PerformanceRating + YearsAtCompany, data=data_new, control = ctrl)
print(chaid.tree1)
plot(chaid.tree1)
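Calling as.factor() on a continuous variable such as MonthlyIncome creates one level per distinct value. A possible alternative, sketched here as an assumption rather than the method used above, is to bin the variable into a few ordered bands with cut() before passing it to the tree. The incomes and the quartile choice below are invented for illustration.

```r
# Hypothetical incomes (invented for illustration)
monthly_income <- c(1009, 2100, 2911, 3500, 4919, 5200,
                    6503, 8380, 12000, 15000, 18000, 19999)

# as.factor() on a continuous variable gives one level per distinct value
n_levels_raw <- nlevels(as.factor(monthly_income))

# Quartile-based bands give four ordered levels instead
breaks <- quantile(monthly_income, probs = seq(0, 1, by = 0.25))
income_band <- cut(monthly_income, breaks = breaks,
                   include.lowest = TRUE, ordered_result = TRUE,
                   labels = c("Q1", "Q2", "Q3", "Q4"))

print(n_levels_raw)
print(table(income_band))
```

Fewer, larger categories leave enough observations in each cell for the chi-square tests to be meaningful.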

THANK
YOU
