
DATA MINING (STA555)

PROJECT REPORT

TITLE OF PROJECT: IDENTIFYING THE DETERMINANTS OF LEAVING WORK PREMATURELY
Contents

1.0 Introduction…………………………………………………….…………............... 2

2.0 Import Excel data to SAS………………………………………………….............. 3

3.0 Create new project………………………………………………………………..... 6

4.0 Insert data into project……………………………………………………………... 8

5.0 Data exploration………………………………………………………………......... 10

6.0 Decision tree…………………………………………………………...................... 16

7.0 Logistic regression……………………………………………………………......... 23

8.0 Neural network………………………………………………………………........... 33

9.0 Best model comparison………………………………………………………….…. 39

10.0 Output explanation for best model…………………………………………….…… 42

11.0 Conclusion……………………………………………………………..................... 54

1
1.0 Introduction

1.1 Problem statement

A worker may leave work prematurely because of many factors, such as satisfaction level, last evaluation, number of projects, work accidents, average monthly working hours, time spent in the company, promotion within the last five years, department (sales) and salary. In our study, the prediction task is to determine whether a worker will leave work prematurely based on these factors.

1.2 Objectives

The research objectives are:

 To develop and compare three predictive models which are Logistic Regression, Neural
Network, and Decision Tree Model.

 To find the best predictive model for predicting the status of employees leaving work
prematurely.

1.3 Scope and limitation

There are some limitations of this study that need to be discussed. We had a limited source of data for our research, since we used secondary data collected by other researchers. There are nine input variables and one target. However, we did not need to filter the data or impute missing values, since the data contains no missing values or outliers.

2
2.0 Import Excel Data to SAS

1. Open SAS 9.3. Then, go to the File tab and click 'Import Data'.

2. Choose ‘Microsoft Excel Workbook’ then click Next.

3
3. Choose the table that we want to import and click Next.

4. Then, under library selection, select SASUSER and name the member with ‘HR_DATA’ and
click Next.

4
5. Browse to where the file should be saved and click Finish.

6. Then a message shows that the data has been successfully imported.

5
3.0 Create New Project

1. Open SAS Enterprise Miner Station 14.1

2. Click New Project.

3. Name the project ‘PROJECT DM’ and browse the SAS server directory then click next.

6
4. Then, click Finish.

7
4.0 Insert Data Into Project

1. Right click on data source and choose Create Data Source.

2. Select SAS Table then click Next.

8
3. Browse HR_DATA in sasuser and click Next.

4. Then, click Next until Finish.

9
5.0 Data Exploration

Before we begin model building and prediction, we must explore the data and, where necessary, modify and correct the data source. The raw data may not be ready for model building: it may contain too many missing values, outliers, or too many categories for a nominal measurement. Some model types, such as Neural Networks and Logistic Regression, cannot handle missing values, so some manipulation and modification may have to be applied to the data source. To deal with missing values, we must impute or delete the affected records. We must also regroup nominal variables that have too many levels and look for outliers. A variable is rejected when it has too many missing values or too many categories for a nominal measurement. Below are the steps for exploring and manipulating our data.
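As a side note before the point-and-click steps, the checks just described (missing values per column, and the number of levels of each nominal variable) can be sketched in plain Python; the sample rows and values here are hypothetical, not the real HR data:

```python
# Sketch of the pre-modelling checks described above, on a small
# hypothetical sample of the HR data (values invented for illustration).
rows = [
    {"satisfaction_level": 0.38, "sales": "sales",     "salary": "low"},
    {"satisfaction_level": 0.80, "sales": "technical", "salary": "medium"},
    {"satisfaction_level": None, "sales": "support",   "salary": "low"},
]

columns = rows[0].keys()

# 1) Count missing values per column (candidates for imputation/filtering).
missing = {c: sum(1 for r in rows if r[c] is None) for c in columns}

# 2) Count distinct levels of each nominal variable (too many levels may
#    force a regroup, or the variable being rejected).
levels = {c: len({r[c] for r in rows if isinstance(r[c], str)})
          for c in ("sales", "salary")}

print(missing)   # satisfaction_level has 1 missing value in this sample
print(levels)    # sales has 3 levels, salary has 2
```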

1. Right click and click ‘Create Diagram’ to create a new diagram

2. Enter ‘EXPLORE’ as the Diagram Name then click OK.

10
3. Drag HR_DATA from data sources into workspace, right click and select edit variable.

4. Look at the histogram charts. Identify problems such as missing values, too many categories in a nominal variable, and typing errors.

11
5. Since there are no missing values, no variables with too many categories, no typing errors and no outliers, there is no need to impute or filter the data.

6. Click Explore tab, drag StatExplore to the diagram and connect the data to StatExplore.

7. Run and see the results. The results show the worth of each variable.

12
8. Click Explore tab, drag Multiplot node to the diagram and connect the data to Multiplot.

9. Run and see the results. The results show the train graphs for each variable.

10. Click Sample tab, drag Sample node to the diagram and connect data to the Sample node.
After that, click Explore tab and drag Graph Explore node to the diagram. Connect the
Sample node to the Graph Explore node.

13
11. Then, run and see the results.

12. Click Sample tab and drag Data Partition node.

14
13. Click Data Partition node and under Data Set Allocations, change the Training to 70,
Validation to 30 and Test to 0.

15
6.0 Decision Tree

1. Select the Model tab. Drag five Decision Tree nodes to the diagram and connect each to the Data Partition node.
Name each decision tree as:
 DT_Gini
 DT_Entropy
 DT_Logworth
 DT_Chaid
 DT_Cart

2. Click DT_GINI node and view the properties. At the properties bar, make sure the nominal
target criterion is changed to ‘Gini’.

16
3. Click DT_ENTROPY node and view the properties. At the properties bar, make sure the
nominal target criterion is changed to ‘Entropy’.

4. Click DT_LOGWORTH node and view the properties. At the properties bar, make sure the
nominal target criterion is changed to ‘ProbChisq’.

17
5. Click DT_CHAID node and view the properties. At the properties bar, change:
 Nominal target criterion to ‘ProbChisq’.
 Significance Level to 0.05.
 Maximum Branch to 5
 Leaf Size to 1
 Split Size to 2
 Method to largest
 Assessment Measure to Decision
 Time of Bonferroni Adjustment to After

18
6. Click DT_CART node and view the properties. At the properties bar, change:
 Nominal target criterion to ‘Gini’
 Missing values to Largest Branch
 Number of Surrogate Rules to 5
 Exhaustive to 2000000000
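The nominal target criteria configured above (Gini, Entropy, ProbChisq) are different ways of scoring candidate splits. As a sketch, the Gini and entropy impurity of a leaf can be computed from its class counts; the counts below are hypothetical:

```python
import math

# Impurity measures behind the Gini and Entropy split criteria.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

# counts of (left=0, left=1) in a hypothetical leaf
print(gini([50, 50]))      # 0.5  (maximally impure two-class leaf)
print(entropy([50, 50]))   # 1.0
print(gini([100, 0]))      # 0.0  (pure leaf)
```

A split is preferred when it lowers the weighted impurity of the child leaves; the ProbChisq (logworth) criterion instead scores the significance of a chi-square test on the split.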

19
7. Then, drag a Model Comparison node from the Assess tab. Connect all decision tree nodes to the
Model Comparison node. Right click on the Model Comparison node and click Run.

8. Then, we obtain the result.

20
9. From the Fit Statistics results, find the best model using Microsoft Excel: copy the data from
the Fit Statistics table and paste it into Excel.

10. Find the gap (valid - train) for the average square error (ASE), misclassification rate (MR) and
ROC index.

11. After finding the gaps, identify the presence of underfitting and overfitting. There is no
underfit model, as no model has a negative ASE or MR gap, or a positive ROC gap.

12. Overfitting is identified by examining the absolute gap between the train and valid results,
choosing the model that yields the largest gaps overall. Since DT_CHAID has the largest gap for
ASE, MR and ROC index, DT_CHAID is the overfit model.

13. To find the best decision tree model, we eliminate the overfit model, then look for the lowest
valid ASE and valid MR and the largest valid ROC index.

14. Since DT_CART has the lowest valid ASE and valid MR and the largest valid ROC index,
DT_CART is the best model for decision tree.
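The gap screening in steps 10-14 can be sketched as follows; the fit statistics here are hypothetical stand-ins for the numbers copied into Excel, chosen only to reproduce the decision logic:

```python
# Sketch of the gap (valid - train) screening: the model with the largest
# gaps is flagged as overfit, then the best model is picked from the rest
# by its validation statistics. All numbers below are hypothetical.
models = {
    "DT_Gini":  {"train_ase": 0.030, "valid_ase": 0.034, "valid_mr": 0.045},
    "DT_Chaid": {"train_ase": 0.010, "valid_ase": 0.031, "valid_mr": 0.040},
    "DT_Cart":  {"train_ase": 0.028, "valid_ase": 0.030, "valid_mr": 0.038},
}

gaps = {m: s["valid_ase"] - s["train_ase"] for m, s in models.items()}
overfit = max(gaps, key=gaps.get)                  # largest gap -> overfit

candidates = {m: s for m, s in models.items() if m != overfit}
best = min(candidates, key=lambda m: candidates[m]["valid_mr"])

print(overfit, best)   # DT_Chaid DT_Cart
```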

21
The DT_CART model is better at predicting the employees that did not leave work
prematurely (negative target), since its specificity is higher than its sensitivity.
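The sensitivity and specificity behind this remark come from the validation confusion matrix; a sketch with hypothetical counts:

```python
# Sensitivity and specificity from a hypothetical validation confusion
# matrix, with left=1 (leaves prematurely) as the positive class.
tp, fn = 80, 20    # actual leavers: predicted 1 / predicted 0
tn, fp = 290, 10   # actual stayers: predicted 0 / predicted 1

sensitivity = tp / (tp + fn)   # true positive rate (leavers caught)
specificity = tn / (tn + fp)   # true negative rate (stayers recognised)

print(sensitivity, specificity)   # here specificity > sensitivity
```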

22
7.0 Logistic Regression

1. Select the Model tab. Drag seven Logistic Regression nodes to the diagram and connect each to the
Data Partition node. Name each logistic regression as:

 Reg_Main
 Reg_Poly
 Reg_Int
 Reg_Main_Poly
 Reg_Main_Int
 Reg_Poly_Int
 Reg_Main_Poly_Int

23
2. Click on the Reg_Main node; under the Equation table, Main Effect should be Yes and the
others No.

3. Click on the Reg_Poly node; under the Equation table, Polynomial Terms should be Yes and
the others No.

4. Click on the Reg_Int node; under the Equation table, Two-Factor Interactions should be Yes
and the others No.

24
5. Click on the Reg_Main_Poly node; under the Equation table, Main Effect and Polynomial
Terms should be Yes and the others No.

6. Click on the Reg_Main_Int node; under the Equation table, Main Effect and Two-Factor
Interactions should be Yes and the others No.


7. Click on the Reg_Poly_Int node; under the Equation table, Two-Factor Interactions and
Polynomial Terms should be Yes and the others No.

25
8. Click on the Reg_Main_Poly_Int node; under the Equation table, Main Effect, Two-Factor
Interactions and Polynomial Terms should all be Yes.
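The three Equation settings toggled in steps 2-8 control which terms enter the regression. A sketch of how one observation's terms expand; the two interval inputs and their values are hypothetical:

```python
from itertools import combinations

# Sketch of the three kinds of equation terms: main effects, two-factor
# interactions, and (squared) polynomial terms, for two hypothetical
# interval inputs.
x = {"satisfaction_level": 0.4, "last_evaluation": 0.7}

main = dict(x)                                          # main effects
inter = {f"{a}*{b}": x[a] * x[b]
         for a, b in combinations(x, 2)}                # two-factor interactions
poly = {f"{a}^2": v ** 2 for a, v in x.items()}         # polynomial terms (degree 2)

# Reg_Main_Poly_Int uses the union of all three groups of terms.
design_row = {**main, **inter, **poly}
print(design_row)
```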

9. Click on Assess tab and drag Model Comparison node to the diagram and connect all logistic
regression nodes to the Model Comparison (2).

26
10. Right click on Model Comparison (2) and click run. Then, see the results.

11. Copy the data from the Fit Statistics table and paste it into Microsoft Excel, then use it to
find the best model.

27
12. Find the gap (valid-train) for average square error (ASE), mean square error (MSE),
misclassification rate (MR), and ROC index.

13. After finding the gaps, identify the presence of underfitting and overfitting. There is no
underfit model, as no model has a negative ASE, MSE or MR gap, or a positive ROC gap.

14. Overfitting is identified by examining the absolute gap between the train and valid results,
choosing the model that yields the largest gaps overall. Since Reg_Poly_Int has the largest gap
for the majority of measures (ASE, MSE and ROC index), Reg_Poly_Int is the overfit model.

15. To find the best model, we eliminate the overfit model, then look for the lowest valid ASE,
valid MSE and valid MR, and the largest valid ROC index.

16. Since Reg_Main_Poly_Int has the lowest valid ASE, valid MSE and valid MR, and the largest
valid ROC index, Reg_Main_Poly_Int is the best model.

28
17. Since Reg_Main_Poly_Int is the best model so far, we need to compare it with three
selection-method models. Therefore, select the Model tab, drag another three Logistic Regression
nodes to the diagram and connect them to the Data Partition node. Name each logistic regression as:

 Reg_Main_Poly_Int_Forward
 Reg_Main_Poly_Int_Backward
 Reg_Main_Poly_Int_Stepwise

18. Model selection for Reg_Main_Poly_Int is none.

19. Model selection for Reg_Main_Poly_Int_Forward is forward.

20. Model selection for Reg_Main_Poly_Int_Backward is backward.

21. Model selection for Reg_Main_Poly_Int_Stepwise is stepwise.
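A minimal sketch of how forward selection works (Backward and Stepwise are analogous: backward starts from the full model and drops terms, stepwise allows both moves). The score function below is a toy stand-in for the node's actual significance-based criterion, and the candidate terms are hypothetical:

```python
# Forward selection sketch: repeatedly add the candidate term that most
# improves the score, stopping when no addition helps.
def forward_select(candidates, score):
    chosen = []
    best = score(chosen)
    improved = True
    while improved:
        improved = False
        for term in [c for c in candidates if c not in chosen]:
            trial = score(chosen + [term])
            if trial > best:
                best, chosen, improved = trial, chosen + [term], True
    return chosen

# Toy score: rewards two useful terms, penalises model size slightly.
useful = {"satisfaction_level": 0.30, "time_spend_company": 0.10}
score = lambda terms: sum(useful.get(t, 0.0) for t in terms) - 0.02 * len(terms)

print(forward_select(["satisfaction_level", "salary", "time_spend_company"], score))
```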

29
22. Click on Assess tab and drag Model Comparison node to the diagram and connect all logistic
regression nodes to the Model Comparison (3).

23. Right click on Model Comparison (3) and click run. Then, see the results.

30
24. Copy the data from the Fit Statistics table and paste it into Microsoft Excel, then use it to
find the best model.

25. Find the gap (valid-train) for average square error (ASE), mean square error (MSE),
misclassification rate (MR), and ROC index.

26. After finding the gaps, identify the presence of underfitting and overfitting. There is no
underfit model, as no model has a negative ASE, MSE or MR gap, or a positive ROC gap.

27. Overfitting is identified by examining the absolute gap between the train and valid results,
choosing the model that yields the largest gaps overall. Here there is no overfit model, since no
single model has the largest gap for a majority of the measures.

28. To find the best model, we look for the lowest valid ASE, valid MSE and valid MR, and the
largest valid ROC index.

29. Since Reg_Main_Poly_Int has the lowest valid ASE and valid MSE, and the largest valid ROC
index, Reg_Main_Poly_Int is the best model for logistic regression.

31
The Reg_Main_Poly_Int model is better at predicting the employees that did not leave work
prematurely (negative target), since its specificity is higher than its sensitivity.

32
8.0 Neural Network
1. Select the Model tab. Drag three Neural Network nodes to the diagram and connect them to the data
partition. Name each neural network as:

 NN_2
 NN_5
 NN_7

2. Drag a Variable Selection node (under the Explore tab) to the diagram and connect it to the
data. Then, from the Sample tab, drag a Data Partition node and connect it to the Variable Selection node.

3. Click Data Partition (2) and under Data Set Allocations, change the Training to 70, Validation
to 30 and Test to 0.

4. Select the Model tab. Drag another three Neural Network nodes to the diagram and connect them to
Data Partition (2). Name each neural network as:

 VS_NN_2
 VS_NN_5
 VS_NN_7

33
5. For the NN_2 and VS_NN_2 nodes, go to the properties panel and, under Network, change the Number
of Hidden Units to 2.

6. For the NN_5 and VS_NN_5 nodes, go to the properties panel and, under Network, change the Number
of Hidden Units to 5.

34
7. For the NN_7 and VS_NN_7 nodes, go to the properties panel and, under Network, change the Number
of Hidden Units to 7.
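What the Number of Hidden Units property controls can be sketched as a one-hidden-layer network: n hidden units feeding one output that gives P(left=1). The weights below are hypothetical (SAS estimates them during training), and the tanh/logistic activations are common defaults rather than the node's exact settings:

```python
import math

# Sketch: a single hidden layer of tanh units feeding one logistic output.
def mlp_forward(inputs, hidden_w, output_w):
    # hidden_w: one (input weights..., bias) row per hidden unit
    hidden = [math.tanh(sum(w * x for w, x in zip(ws, inputs)) + b)
              for *ws, b in hidden_w]
    z = sum(w * h for w, h in zip(output_w[:-1], hidden)) + output_w[-1]
    return 1.0 / (1.0 + math.exp(-z))      # P(left = 1)

# NN_2: two hidden units over two (standardized) inputs.
hidden_w = [(0.8, -1.2, 0.1),
            (-0.5, 0.9, 0.0)]
output_w = (1.5, -1.0, -0.2)               # hidden-unit weights + output bias

p = mlp_forward((0.4, 0.7), hidden_w, output_w)
print(p)   # a probability strictly between 0 and 1
```

NN_5 and NN_7 differ only in the number of rows in hidden_w; more hidden units give a more flexible (and more easily overfit) model.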

8. Go to Assess tab, drag Model Comparison node and connect all Neural Network nodes to the
Model Comparison node.

35
9. Next, run and see the results.

10. From the Fit Statistics results, find the best model using Microsoft Excel: copy the data from
the Fit Statistics table and paste it into Excel.

36
11. Find the gap (valid-train) for misclassification rate (MR), average square error (ASE), mean
square error (MSE) and ROC index.

12. After finding the gaps, identify the presence of underfitting and overfitting. There is no
underfit model, as no model has a negative ASE, MSE or MR gap, or a positive ROC gap.

13. Overfitting is identified by examining the absolute gap between the train and valid results,
choosing the model that yields the largest gaps overall. Since NN_5 has the largest gap for the
majority of measures (ASE, MSE and MR), NN_5 is the overfit model.

14. To find the best neural network model, we eliminate the overfit model, then look for the lowest
valid ASE, valid MSE and valid MR, and the largest valid ROC index.

15. Since NN_7 has the lowest valid ASE, valid MSE and valid MR, NN_7 is the best model for
neural network.

37
The NN_7 model is better at predicting the employees that did not leave work prematurely
(negative target), since its specificity is higher than its sensitivity.

38
9.0 Best Model Comparison

1. Since DT_CART is the best decision tree model, Reg_Main_Poly_Int the best logistic regression
model and NN_7 the best neural network model, we now compare these three models to choose the best
model for this study.

2. Drag a Model Comparison node to the diagram, then connect DT_CART, Reg_Main_Poly_Int
and NN_7 to the Model Comparison node which is Model Comparison (5).

39
3. Right click Model Comparison (5) and click Run. After that, see the results.

4. Copy the data from the Fit Statistics results into Microsoft Excel and compute the gap
(valid - train) for the misclassification rate (MR), average square error (ASE), and ROC index.

40
5. After finding the gaps, identify the presence of underfitting and overfitting. There is no
underfit model, as no model has a negative ASE or MR gap, or a positive ROC gap.

6. Overfitting is identified by examining the absolute gap between the train and valid results,
choosing the model that yields the largest gaps overall. Since NN_7 has the largest gap for ASE, MR
and ROC, NN_7 is the overfit model.

7. To find the best model, we eliminate the overfit model, then look for the lowest valid ASE and
valid MR and the largest valid ROC index. Since DT_CART has the lowest valid ASE and valid MR and
the largest valid ROC index, DT_CART is the best model for predicting the status of employees
leaving work prematurely.
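The ROC index used throughout these comparisons is the area under the ROC curve. On a small set of hypothetical validation scores it can be computed directly as the probability that a randomly chosen leaver gets a higher predicted score than a randomly chosen stayer (ties counting half):

```python
# ROC index (AUC) as a rank statistic over hypothetical validation scores.
def roc_index(scores_pos, scores_neg):
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

leavers = [0.9, 0.8, 0.7]   # predicted P(left=1) for actual leavers
stayers = [0.6, 0.4, 0.2]   # predicted P(left=1) for actual stayers

print(roc_index(leavers, stayers))   # 1.0 -- perfect separation here
```

An index of 0.5 means the model ranks no better than chance, which is why a larger valid ROC index is preferred when choosing the best model.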

41
10.0 Output Explanation for DT_CART

Output 1

From output 1, we know that the most important variable is Satisfaction Level. There are 9 important variables, ranked by the value of the 'Importance' column. The average_montly_hours variable is used 11 times as a split in the decision tree model. The satisfaction_level and time_spend_company variables are each used 10 times as splits. The last_evaluation variable is used 9 times, the number_project variable 5 times, and the Work_accident variable 2 times. The other three variables are each used once as a split.

42
Output 2

From output 2, we know that there are 9 variables involved in building the decision tree and there are 29 rules, represented by the number of leaves. The depth of this decision tree is 6.

43
Output 3

*------------------------------------------------------------*
Node = 10
*------------------------------------------------------------*
if satisfaction_level < 0.115
AND number_project >= 2.5 or MISSING
then
Tree Node Identifier = 10
Number of Observations = 626
Predicted: left=1 = 1.00
Predicted: left=0 = 0.00

*------------------------------------------------------------*
Node = 13
*------------------------------------------------------------*
if time_spend_company < 4.5 or MISSING
AND satisfaction_level >= 0.465 or MISSING
AND average_montly_hours >= 290.5
then
Tree Node Identifier = 13
Number of Observations = 6
Predicted: left=1 = 1.00
Predicted: left=0 = 0.00

*------------------------------------------------------------*
Node = 14
*------------------------------------------------------------*
if time_spend_company >= 4.5
AND satisfaction_level >= 0.465 or MISSING
AND last_evaluation < 0.805
then
Tree Node Identifier = 14
Number of Observations = 550
Predicted: left=1 = 0.04
Predicted: left=0 = 0.96

*------------------------------------------------------------*
Node = 19
*------------------------------------------------------------*
if satisfaction_level < 0.465
AND number_project < 2.5
AND average_montly_hours >= 279
then
Tree Node Identifier = 19
Number of Observations = 5
Predicted: left=1 = 0.60
Predicted: left=0 = 0.40

*------------------------------------------------------------*
Node = 21
*------------------------------------------------------------*
if satisfaction_level < 0.465 AND satisfaction_level >= 0.115 or MISSING
AND number_project >= 6.5
then
Tree Node Identifier = 21
Number of Observations = 12
Predicted: left=1 = 1.00
Predicted: left=0 = 0.00

*------------------------------------------------------------*
Node = 28
*------------------------------------------------------------*
if satisfaction_level < 0.465
AND number_project < 2.5
AND last_evaluation < 0.575 or MISSING
AND average_montly_hours < 125.5
then

Tree Node Identifier = 28
Number of Observations = 16
Predicted: left=1 = 0.00
Predicted: left=0 = 1.00

*------------------------------------------------------------*
Node = 31
*------------------------------------------------------------*
if satisfaction_level < 0.465
AND sales IS ONE OF: SALES, PRODUCT_MNG or MISSING
AND number_project < 2.5
AND last_evaluation >= 0.575
AND average_montly_hours < 162 or MISSING
then
Tree Node Identifier = 31
Number of Observations = 21
Predicted: left=1 = 0.00
Predicted: left=0 = 1.00

*------------------------------------------------------------*
Node = 32
*------------------------------------------------------------*
if satisfaction_level < 0.465
AND number_project < 2.5
AND average_montly_hours < 241 AND average_montly_hours >= 162 or MISSING
then
Tree Node Identifier = 32
Number of Observations = 74
Predicted: left=1 = 0.01
Predicted: left=0 = 0.99

*------------------------------------------------------------*
Node = 34
*------------------------------------------------------------*
if satisfaction_level < 0.465 AND satisfaction_level >= 0.115 or MISSING
AND number_project < 6.5 AND number_project >= 2.5 or MISSING
AND average_montly_hours < 289 or MISSING
then
Tree Node Identifier = 34
Number of Observations = 1095
Predicted: left=1 = 0.06
Predicted: left=0 = 0.94

*------------------------------------------------------------*
Node = 35
*------------------------------------------------------------*
if satisfaction_level < 0.465 AND satisfaction_level >= 0.115 or MISSING
AND number_project < 6.5 AND number_project >= 2.5 or MISSING
AND average_montly_hours >= 289
then
Tree Node Identifier = 35
Number of Observations = 7
Predicted: left=1 = 1.00
Predicted: left=0 = 0.00

*------------------------------------------------------------*
Node = 38
*------------------------------------------------------------*
if time_spend_company < 4.5 or MISSING
AND satisfaction_level >= 0.465 or MISSING
AND sales IS ONE OF: TECHNICAL, SUPPORT, IT or MISSING
AND number_project >= 5.5
AND average_montly_hours < 290.5 or MISSING
then
Tree Node Identifier = 38
Number of Observations = 47
Predicted: left=1 = 0.15
Predicted: left=0 = 0.85

*------------------------------------------------------------*

Node = 45
*------------------------------------------------------------*
if time_spend_company >= 4.5
AND satisfaction_level >= 0.465 or MISSING
AND last_evaluation >= 0.995
AND average_montly_hours < 216.5
then
Tree Node Identifier = 45
Number of Observations = 5
Predicted: left=1 = 0.80
Predicted: left=0 = 0.20

*------------------------------------------------------------*
Node = 47
*------------------------------------------------------------*
if time_spend_company >= 6.5
AND satisfaction_level >= 0.465 or MISSING
AND last_evaluation >= 0.805 or MISSING
AND average_montly_hours >= 216.5 or MISSING
then
Tree Node Identifier = 47
Number of Observations = 43
Predicted: left=1 = 0.00
Predicted: left=0 = 1.00

*------------------------------------------------------------*
Node = 48
*------------------------------------------------------------*
if satisfaction_level < 0.465
AND number_project < 2.5
AND last_evaluation < 0.445
AND average_montly_hours < 162 AND average_montly_hours >= 125.5 or MISSING
then
Tree Node Identifier = 48
Number of Observations = 10
Predicted: left=1 = 0.00
Predicted: left=0 = 1.00

*------------------------------------------------------------*
Node = 49
*------------------------------------------------------------*
if satisfaction_level < 0.465
AND number_project < 2.5
AND last_evaluation < 0.575 AND last_evaluation >= 0.445 or MISSING
AND average_montly_hours < 162 AND average_montly_hours >= 125.5 or MISSING
then
Tree Node Identifier = 49
Number of Observations = 1094
Predicted: left=1 = 0.99
Predicted: left=0 = 0.01

*------------------------------------------------------------*
Node = 50
*------------------------------------------------------------*
if satisfaction_level < 0.32 or MISSING
AND sales IS ONE OF: TECHNICAL
AND number_project < 2.5
AND last_evaluation >= 0.575
AND average_montly_hours < 162 or MISSING
then
Tree Node Identifier = 50
Number of Observations = 5
Predicted: left=1 = 0.00
Predicted: left=0 = 1.00

*------------------------------------------------------------*
Node = 51
*------------------------------------------------------------*
if satisfaction_level < 0.465 AND satisfaction_level >= 0.32
AND sales IS ONE OF: TECHNICAL
AND number_project < 2.5
AND last_evaluation >= 0.575
AND average_montly_hours < 162 or MISSING
then
Tree Node Identifier = 51
Number of Observations = 5
Predicted: left=1 = 0.40
Predicted: left=0 = 0.60

*------------------------------------------------------------*
Node = 54
*------------------------------------------------------------*
if satisfaction_level < 0.465
AND number_project < 2.5
AND last_evaluation < 0.585
AND average_montly_hours < 279 AND average_montly_hours >= 241
then
Tree Node Identifier = 54
Number of Observations = 6
Predicted: left=1 = 0.50
Predicted: left=0 = 0.50

*------------------------------------------------------------*
Node = 55
*------------------------------------------------------------*
if satisfaction_level < 0.465
AND number_project < 2.5
AND last_evaluation >= 0.585 or MISSING
AND average_montly_hours < 279 AND average_montly_hours >= 241
then
Tree Node Identifier = 55
Number of Observations = 12
Predicted: left=1 = 0.00
Predicted: left=0 = 1.00

*------------------------------------------------------------*
Node = 58
*------------------------------------------------------------*
if time_spend_company < 3.5 or MISSING
AND satisfaction_level >= 0.465 or MISSING
AND number_project < 2.5
AND average_montly_hours < 290.5 or MISSING
then
Tree Node Identifier = 58
Number of Observations = 292
Predicted: left=1 = 0.04
Predicted: left=0 = 0.96

*------------------------------------------------------------*
Node = 59
*------------------------------------------------------------*
if time_spend_company < 3.5 or MISSING
AND satisfaction_level >= 0.465 or MISSING
AND number_project < 5.5 AND number_project >= 2.5 or MISSING
AND average_montly_hours < 290.5 or MISSING
then
Tree Node Identifier = 59
Number of Observations = 4814
Predicted: left=1 = 0.00
Predicted: left=0 = 1.00

*------------------------------------------------------------*
Node = 60
*------------------------------------------------------------*
if time_spend_company < 4.5 AND time_spend_company >= 3.5
AND satisfaction_level >= 0.465 or MISSING
AND sales IS ONE OF: HR, TECHNICAL
AND number_project < 5.5 or MISSING
AND average_montly_hours < 290.5 or MISSING
then

Tree Node Identifier = 60
Number of Observations = 222
Predicted: left=1 = 0.08
Predicted: left=0 = 0.92

*------------------------------------------------------------*
Node = 61
*------------------------------------------------------------*
if time_spend_company < 4.5 AND time_spend_company >= 3.5
AND satisfaction_level >= 0.465 or MISSING
AND sales IS ONE OF: SALES, ACCOUNTING, SUPPORT, IT, PRODUCT_MNG, MARKETING, MANAGEMENT, RANDD or MISSING
AND number_project < 5.5 or MISSING
AND average_montly_hours < 290.5 or MISSING
then
Tree Node Identifier = 61
Number of Observations = 685
Predicted: left=1 = 0.02
Predicted: left=0 = 0.98

*------------------------------------------------------------*
Node = 64
*------------------------------------------------------------*
if time_spend_company < 2.5
AND satisfaction_level >= 0.465 or MISSING
AND sales IS ONE OF: SALES, PRODUCT_MNG, RANDD
AND number_project >= 5.5
AND average_montly_hours < 290.5 or MISSING
then
Tree Node Identifier = 64
Number of Observations = 11
Predicted: left=1 = 0.18
Predicted: left=0 = 0.82

*------------------------------------------------------------*
Node = 65
*------------------------------------------------------------*
if time_spend_company < 4.5 AND time_spend_company >= 2.5 or MISSING
AND satisfaction_level >= 0.465 or MISSING
AND sales IS ONE OF: SALES, PRODUCT_MNG, RANDD
AND number_project >= 5.5
AND average_montly_hours < 290.5 or MISSING
then
Tree Node Identifier = 65
Number of Observations = 34
Predicted: left=1 = 0.00
Predicted: left=0 = 1.00

*------------------------------------------------------------*
Node = 68
*------------------------------------------------------------*
if time_spend_company >= 4.5
AND satisfaction_level >= 0.465 or MISSING
AND number_project < 2.5
AND last_evaluation < 0.995 AND last_evaluation >= 0.805 or MISSING
AND average_montly_hours < 216.5
then
Tree Node Identifier = 68
Number of Observations = 14
Predicted: left=1 = 0.29
Predicted: left=0 = 0.71

*------------------------------------------------------------*
Node = 69
*------------------------------------------------------------*
if time_spend_company >= 4.5
AND satisfaction_level >= 0.465 or MISSING
AND number_project >= 2.5 or MISSING
AND last_evaluation < 0.995 AND last_evaluation >= 0.805 or MISSING
AND average_montly_hours < 216.5
then

Tree Node Identifier = 69
Number of Observations = 149
Predicted: left=1 = 0.03
Predicted: left=0 = 0.97

*------------------------------------------------------------*
Node = 70
*------------------------------------------------------------*
if time_spend_company < 6.5 AND time_spend_company >= 4.5 or MISSING
AND satisfaction_level < 0.705 AND satisfaction_level >= 0.465
AND last_evaluation >= 0.805 or MISSING
AND average_montly_hours >= 216.5 or MISSING
then
Tree Node Identifier = 70
Number of Observations = 32
Predicted: left=1 = 0.25
Predicted: left=0 = 0.75

*------------------------------------------------------------*
Node = 71
*------------------------------------------------------------*
if time_spend_company < 6.5 AND time_spend_company >= 4.5 or MISSING
AND satisfaction_level >= 0.705 or MISSING
AND last_evaluation >= 0.805 or MISSING
AND average_montly_hours >= 216.5 or MISSING
then
Tree Node Identifier = 71
Number of Observations = 606
Predicted: left=1 = 0.95
Predicted: left=0 = 0.05

From output 3, there are 29 rules in the decision tree model. The distribution of the rules is as
follows:

 There are 8 rules for predicting the employees that leave work prematurely (Y=1)
 There are 20 rules for predicting the employees that do not leave work prematurely (Y=0)
 There is 1 rule that cannot be used for predicting the target Y

There are 10498 observations used to grow the tree, which is the size of the training data set.
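Each English rule in output 3 translates directly into a predicate on the input variables. For example, Node 10 (which predicts left=1 with probability 1.00) can be sketched as a function; the example inputs are invented:

```python
# Node 10 from output 3: satisfaction_level < 0.115
#                        AND number_project >= 2.5 or MISSING
def node_10(satisfaction_level, number_project):
    missing = number_project is None   # "or MISSING" lets a missing value pass
    return (satisfaction_level is not None
            and satisfaction_level < 0.115
            and (missing or number_project >= 2.5))

print(node_10(0.09, 4))     # True  -> predicted to leave prematurely
print(node_10(0.50, 4))     # False -> rule does not fire
```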

The profile for predicting the worker that left (Y=1):


 if satisfaction_level < 0.465
AND number_project < 2.5
AND last_evaluation < 0.575 or MISSING
AND average_montly_hours < 125.5

 if time_spend_company >= 4.5


AND satisfaction_level >= 0.465 or MISSING
AND last_evaluation < 0.805

49
 if satisfaction_level < 0.465
AND sales IS ONE OF: SALES, PRODUCT_MNG or MISSING
AND number_project < 2.5
AND last_evaluation >= 0.575
AND average_montly_hours < 162 or MISSING

 if satisfaction_level < 0.465


AND number_project < 2.5
AND average_montly_hours < 241 AND average_montly_hours >= 162 or MISSING

 if satisfaction_level < 0.465 AND satisfaction_level >= 0.115 or MISSING


AND number_project < 6.5 AND number_project >= 2.5 or MISSING
AND average_montly_hours < 289 or MISSING

 if time_spend_company < 4.5 or MISSING


AND satisfaction_level >= 0.465 or MISSING
AND sales IS ONE OF: TECHNICAL, SUPPORT, IT or MISSING
AND number_project >= 5.5
AND average_montly_hours < 290.5 or MISSING

 if time_spend_company >= 6.5


AND satisfaction_level >= 0.465 or MISSING
AND last_evaluation >= 0.805 or MISSING
AND average_montly_hours >= 216.5 or MISSING

 if satisfaction_level < 0.465


AND number_project < 2.5
AND last_evaluation < 0.445
AND average_montly_hours < 162 AND average_montly_hours >= 125.5 or MISSING

 if satisfaction_level < 0.32 or MISSING


AND sales IS ONE OF: TECHNICAL
AND number_project < 2.5
AND last_evaluation >= 0.575
AND average_montly_hours < 162 or MISSING

 if satisfaction_level < 0.465 AND satisfaction_level >= 0.32


AND sales IS ONE OF: TECHNICAL
AND number_project < 2.5
AND last_evaluation >= 0.575
AND average_montly_hours < 162 or MISSING

 if satisfaction_level < 0.465


AND number_project < 2.5
AND last_evaluation >= 0.585 or MISSING
AND average_montly_hours < 279 AND average_montly_hours >= 241

50
 if time_spend_company < 3.5 or MISSING
AND satisfaction_level >= 0.465 or MISSING
AND number_project < 2.5
AND average_montly_hours < 290.5 or MISSING

 if time_spend_company < 3.5 or MISSING


AND satisfaction_level >= 0.465 or MISSING
AND number_project < 5.5 AND number_project >= 2.5 or MISSING
AND average_montly_hours < 290.5 or MISSING

 if time_spend_company < 4.5 AND time_spend_company >= 3.5


AND satisfaction_level >= 0.465 or MISSING
AND sales IS ONE OF: HR, TECHNICAL
AND number_project < 5.5 or MISSING
AND average_montly_hours < 290.5 or MISSING

 if time_spend_company < 4.5 AND time_spend_company >= 3.5


AND satisfaction_level >= 0.465 or MISSING
AND sales IS ONE OF: SALES, ACCOUNTING, SUPPORT, IT, PRODUCT_MNG, MARKETING,
MANAGEMENT, RANDD or MISSING
AND number_project < 5.5 or MISSING
AND average_montly_hours < 290.5 or MISSING

 if time_spend_company < 2.5
AND satisfaction_level >= 0.465 or MISSING
AND sales IS ONE OF: SALES, PRODUCT_MNG, RANDD
AND number_project >= 5.5
AND average_montly_hours < 290.5 or MISSING

 if time_spend_company < 4.5 AND time_spend_company >= 2.5 or MISSING
AND satisfaction_level >= 0.465 or MISSING
AND sales IS ONE OF: SALES, PRODUCT_MNG, RANDD
AND number_project >= 5.5
AND average_montly_hours < 290.5 or MISSING

 if time_spend_company >= 4.5
AND satisfaction_level >= 0.465 or MISSING
AND number_project < 2.5
AND last_evaluation < 0.995 AND last_evaluation >= 0.805 or MISSING
AND average_montly_hours < 216.5
 if time_spend_company >= 4.5
AND satisfaction_level >= 0.465 or MISSING
AND number_project >= 2.5 or MISSING
AND last_evaluation < 0.995 AND last_evaluation >= 0.805 or MISSING
AND average_montly_hours < 216.5

 if time_spend_company < 6.5 AND time_spend_company >= 4.5 or MISSING
AND satisfaction_level < 0.705 AND satisfaction_level >= 0.465
AND last_evaluation >= 0.805 or MISSING
AND average_montly_hours >= 216.5 or MISSING

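Each exported leaf rule above is simply a conjunction of range checks, where a condition flagged "or MISSING" also accepts a missing value. As an illustration only (the function names below are hypothetical; SAS Enterprise Miner generates its own score code), the first Y=1 leaf could be read in Python as:

```python
# Illustrative translation of the first Y=1 leaf rule into Python.
# Names are hypothetical; this only shows how to read the exported rule.

def in_range(value, low=None, high=None):
    """True if low <= value < high; a missing value (None) always passes,
    mirroring the 'or MISSING' flag in the exported rules."""
    if value is None:
        return True
    if low is not None and value < low:
        return False
    if high is not None and value >= high:
        return False
    return True

def y1_leaf_1(satisfaction_level, number_project, average_montly_hours):
    """0.115 <= satisfaction_level < 0.465, 2.5 <= number_project < 6.5,
    average_montly_hours < 289 -- each condition also accepting MISSING."""
    return (in_range(satisfaction_level, 0.115, 0.465)
            and in_range(number_project, 2.5, 6.5)
            and in_range(average_montly_hours, high=289))

print(y1_leaf_1(0.30, 4, 200))      # matches the leaf -> True
print(y1_leaf_1(0.80, 4, 200))      # satisfaction_level too high -> False
print(y1_leaf_1(None, None, None))  # all inputs missing -> True
```

A worker falling in this leaf is predicted to leave prematurely (Y=1).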
The profile for predicting a worker who does not leave (Y=0):
 if satisfaction_level < 0.115
AND number_project >= 2.5 or MISSING

 if time_spend_company < 4.5 or MISSING
AND satisfaction_level >= 0.465 or MISSING
AND average_montly_hours >= 290.5

 if satisfaction_level < 0.465
AND number_project < 2.5
AND average_montly_hours >= 279

 if satisfaction_level < 0.465 AND satisfaction_level >= 0.115 or MISSING
AND number_project >= 6.5

 if satisfaction_level < 0.465 AND satisfaction_level >= 0.115 or MISSING
AND number_project < 6.5 AND number_project >= 2.5 or MISSING
AND average_montly_hours >= 289

 if time_spend_company >= 4.5
AND satisfaction_level >= 0.465 or MISSING
AND last_evaluation >= 0.995
AND average_montly_hours < 216.5

 if satisfaction_level < 0.465
AND number_project < 2.5
AND last_evaluation < 0.575 AND last_evaluation >= 0.445 or MISSING
AND average_montly_hours < 162 AND average_montly_hours >= 125.5 or MISSING

 if time_spend_company < 6.5 AND time_spend_company >= 4.5 or MISSING
AND satisfaction_level >= 0.705 or MISSING
AND last_evaluation >= 0.805 or MISSING
AND average_montly_hours >= 216.5 or MISSING

OUTPUT 4

The plot shows the average squared error (ASE) corresponding to each subtree as the data is
sequentially split. The optimal subtree is the one with the smallest ASE on the validation
data set; here, the optimal subtree has 29 leaves.
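The selection logic can be sketched numerically. The (leaves, ASE) pairs below are invented for illustration, not taken from this output; the pruning step keeps the subtree with the smallest validation ASE, preferring fewer leaves on ties:

```python
# Hypothetical (number_of_leaves, validation ASE) pairs for candidate subtrees;
# the real values come from the SAS Enterprise Miner subtree assessment plot.
subtrees = [(5, 0.085), (10, 0.052), (20, 0.031), (29, 0.024),
            (40, 0.024), (55, 0.026)]

# Smallest ASE wins; ties go to the simpler subtree (fewer leaves).
best_leaves, best_ase = min(subtrees, key=lambda t: (t[1], t[0]))
print(best_leaves, best_ase)  # -> 29 0.024
```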

OUTPUT 5

The plot shows the misclassification rate corresponding to each subtree as the data is
sequentially split. The optimal subtree is the one with the smallest misclassification rate
on the validation data set; here, it also has 29 leaves.
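For reference, the misclassification rate plotted here is simply the share of validation cases whose predicted class differs from the actual class; a minimal sketch with made-up labels:

```python
# Made-up actual vs. predicted 'left' labels for ten validation cases.
actual    = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

# Misclassification rate = (# mismatches) / (# cases).
misclassification_rate = sum(a != p for a, p in zip(actual, predicted)) / len(actual)
print(misclassification_rate)  # -> 0.2
```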

11.0 Conclusion

First, among the five decision tree models, DT_GINI, DT_ENTROPY, DT_LOGWORTH,
DT_CHAID and DT_CART, the results from SAS Enterprise Miner show that the best decision
tree model is DT_CART.

Second, among the seven logistic regression models, Reg_Main, Reg_Poly, Reg_Int,
Reg_Main_Poly, Reg_Main_Int, Reg_Poly_Int and Reg_Main_Poly_Int, the best model is
Reg_Main_Poly_Int. We then re-fitted Reg_Main_Poly_Int using the forward, backward and
stepwise selection methods. Among the four resulting models, Reg_Main_Poly_Int,
Reg_Main_Poly_Int_Forward, Reg_Main_Poly_Int_Backward and Reg_Main_Poly_Int_Stepwise,
the best logistic regression model remains Reg_Main_Poly_Int.

Third, among the six neural network models, NN_2, NN_5, NN_7, VS_NN_2,
VS_NN_5 and VS_NN_7, the best neural network model is NN_7.

Lastly, comparing DT_CART, Reg_Main_Poly_Int and NN_7, we found that the best model
for predicting which employees leave work prematurely is DT_CART.
