PROJECT REPORT
1.0 Introduction
Workers may leave work prematurely because of many factors, such as satisfaction level, last evaluation, number of projects, work accidents, average monthly working hours, time spent in the company, promotions within the last five years, department (sales) and salary. In our study, the prediction task is to determine whether a worker will leave work prematurely based on these factors.
1.2 Objectives
To develop and compare three predictive models which are Logistic Regression, Neural
Network, and Decision Tree Model.
To find the best predictive model for predicting the status of employees leaving work
prematurely.
There are some limitations of this study that need to be discussed. We had a limited source of data for our research, since we used secondary data collected by other researchers. There are nine input variables and one target. However, we did not need to filter the data or impute missing values, since there are no missing values or outliers.
2.0 Import Excel Data to SAS
1. Open SAS 9.3. Then, go to the File tab and click 'Import Data'.
3. Choose the table that we want to import and click Next.
4. Then, under the library selection, select SASUSER, name the member 'HR_DATA' and click Next.
5. Browse to where the file should be saved and click Finish.
6. The template then shows that the data has been successfully imported.
3.0 Create New Project
3. Name the project 'PROJECT DM', browse to the SAS server directory, then click Next.
4. Then, click Finish.
4.0 Insert Data Into Project
3. Browse to HR_DATA in SASUSER and click Next.
5.0 Data Exploration
Before we begin model building and prediction, we must explore the data and, where necessary, modify and correct the data source. The raw data may not be perfect and may not be usable for model building: a data source may have too many missing values, outliers, or too many categories for a nominal measurement. Some models, such as Neural Network and Logistic Regression, cannot handle missing values, so manipulation and modification must be applied to the data source. To deal with missing values, we must impute or delete the affected data. We must also regroup nominal variables that have too many levels, and check for outliers. A variable is rejected when it has too many missing values or too many categories for a nominal measurement. Below are the steps for exploring and manipulating our data.
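These checks can also be sketched outside SAS. The snippet below is a conceptual illustration in Python (not part of the SAS workflow), using a small made-up sample of the HR data; the column names follow the report but the values are hypothetical.

```python
import pandas as pd

# Hypothetical sample of the HR data; the real table is HR_DATA in SAS.
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11],
    "number_project": [2, 5, 7],
    "sales": ["sales", "technical", "hr"],
    "left": [1, 0, 1],
})

# Missing values per column; a nonzero count would call for imputing
# or filtering before Neural Network / Logistic Regression modelling.
missing = df.isna().sum()

# Number of levels of a nominal variable; too many levels suggests regrouping.
levels = df["sales"].nunique()

print(missing.sum(), levels)
```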
3. Drag HR_DATA from Data Sources into the workspace, right click and select Edit Variables.
4. Look at the histogram charts. Identify problems such as missing values, too many categories in a nominal variable, and typing errors.
5. Since there are no missing values, no variables with too many categories, and no typing errors or outliers, there is no need to impute or filter the data.
6. Click Explore tab, drag StatExplore to the diagram and connect the data to StatExplore.
7. Run and see the results. The results show the worth of each variable.
8. Click Explore tab, drag Multiplot node to the diagram and connect the data to Multiplot.
9. Run and see the results. The results show the train graphs for each variable.
10. Click Sample tab, drag Sample node to the diagram and connect data to the Sample node.
After that, click Explore tab and drag Graph Explore node to the diagram. Connect the
Sample node to the Graph Explore node.
11. Then, run and see the results.
13. Click the Data Partition node and, under Data Set Allocations, change Training to 70, Validation to 30 and Test to 0.
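The 70/30/0 allocation in the Data Partition node amounts to a random split of the rows. A minimal sketch of the idea in Python (not SAS; the row count and seed are made up):

```python
import numpy as np

rng = np.random.default_rng(seed=12345)

n = 100  # hypothetical number of rows in HR_DATA
idx = rng.permutation(n)

# 70% training, 30% validation, 0% test, as in Data Set Allocations.
n_train = int(0.7 * n)
train_idx, valid_idx = idx[:n_train], idx[n_train:]

print(len(train_idx), len(valid_idx))  # 70 30
```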
6.0 Decision Tree
1. Select the Model tab. Drag 5 Decision Tree nodes to the diagram and connect them to the Data Partition node. Name each decision tree as:
DT_Gini
DT_Entropy
DT_Logworth
DT_Chaid
DT_Cart
2. Click the DT_GINI node and view the properties. In the properties panel, make sure the Nominal Target Criterion is changed to 'Gini'.
3. Click the DT_ENTROPY node and view the properties. In the properties panel, make sure the Nominal Target Criterion is changed to 'Entropy'.
4. Click the DT_LOGWORTH node and view the properties. In the properties panel, make sure the Nominal Target Criterion is changed to 'ProbChisq'.
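The three criteria above measure split quality differently: 'Gini' and 'Entropy' are node impurity measures, while 'ProbChisq' uses the logworth of a chi-square test p-value. A minimal sketch of the two impurity measures, assuming a two-class target:

```python
import math

def gini(p):
    """Gini impurity of a node with class proportions p."""
    return 1.0 - sum(q * q for q in p)

def entropy(p):
    """Entropy impurity (in bits) of a node with class proportions p."""
    return sum(-q * math.log2(q) for q in p if q > 0)

# A pure node has zero impurity; a 50/50 node is maximally impure.
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # 0.0 0.0
print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # 0.5 1.0
```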
5. Click the DT_CHAID node and view the properties. In the properties panel, change:
Nominal target criterion to ‘ProbChisq’.
Significance Level to 0.05.
Maximum Branch to 5
Leaf Size to 1
Split Size to 2
Method to largest
Assessment Measure to Decision
Time of Bonferroni Adjustment to After
6. Click the DT_CART node and view the properties. In the properties panel, change:
Nominal target criterion to ‘Gini’
Missing values to Largest Branch
Number of Surrogate Rules to 5
Exhaustive to 2000000000
7. Then, drag a Model Comparison node from the Assess tab. Connect all decision tree nodes to the Model Comparison node. Right click on the Model Comparison node and click Run.
9. From the Fit Statistics results, use Microsoft Excel to calculate the values needed to find the best model. Copy the data from the Fit Statistics table and paste it into Excel.
10. Find the gap (valid-train) for the average squared error (ASE), misclassification rate (MR) and ROC index.
11. After finding the gaps, identify the presence of underfitting and overfitting. There is no underfit model, since no model has a negative ASE gap or MR gap, and no model has a positive ROC gap.
12. Overfitting is identified by examining the absolute gap between the train and valid results: choose the model that yields the largest gaps in general. Since DT_CHAID has the largest gap for the ASE, MR and ROC index, DT_CHAID is the overfit model.
13. To find the best decision tree model, we first eliminate the overfit model. Then we find the lowest valid ASE and valid MR, and the largest valid ROC index.
14. Since DT_CART has the lowest valid ASE and valid MR and the largest valid ROC index, DT_CART is the best model for decision tree.
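The selection procedure in these steps amounts to a small computation over the fit statistics. The sketch below uses made-up numbers (the real ones come from the Model Comparison node) and, for brevity, ranks overfitting by the ASE gap alone:

```python
# Hypothetical (train, valid) fit statistics for three of the models.
stats = {
    "DT_Gini":  {"ASE": (0.020, 0.024), "MR": (0.025, 0.030)},
    "DT_Chaid": {"ASE": (0.010, 0.030), "MR": (0.012, 0.035)},
    "DT_Cart":  {"ASE": (0.018, 0.020), "MR": (0.022, 0.024)},
}

# Gap = valid - train for every statistic of every model.
gaps = {m: {k: v[1] - v[0] for k, v in s.items()} for m, s in stats.items()}

# The model with the largest gap is flagged as overfit ...
overfit = max(gaps, key=lambda m: gaps[m]["ASE"])

# ... and the best model is the lowest validation ASE among the rest.
rest = {m: s for m, s in stats.items() if m != overfit}
best = min(rest, key=lambda m: rest[m]["ASE"][1])

print(overfit, best)  # DT_Chaid DT_Cart
```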
The DT_CART model is better at predicting the employees that do not leave work prematurely (the negative target), since its specificity is higher than its sensitivity.
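Sensitivity and specificity are read off the model's confusion matrix, treating 'left' as the positive class. A minimal sketch with made-up counts:

```python
def sensitivity(tp, fn):
    """True-positive rate: leavers correctly predicted as leaving."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True-negative rate: stayers correctly predicted as staying."""
    return tn / (tn + fp)

# Hypothetical confusion-matrix counts for illustration only.
tp, fn, tn, fp = 800, 200, 3300, 200

# Specificity above sensitivity: better at predicting the negative target.
print(round(sensitivity(tp, fn), 2), round(specificity(tn, fp), 3))  # 0.8 0.943
```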
7.0 Logistic Regression
1. Select the Model tab. Drag 7 Logistic Regression nodes to the diagram and connect them to the Data Partition node. Name each logistic regression node as:
Reg_Main
Reg_Poly
Reg_Int
Reg_Main_Poly
Reg_Main_Int
Reg_Poly_Int
Reg_Main_Poly_Int
2. Click on the Reg_Main node; under the Equation table, Main Effects should be Yes and the others No.
3. Click on the Reg_Poly node; under the Equation table, Polynomial Terms should be Yes and the others No.
4. Click on the Reg_Int node; under the Equation table, Two-Factor Interactions should be Yes and the others No.
5. Click on the Reg_Main_Poly node; under the Equation table, Main Effects and Polynomial Terms should be Yes and the others No.
6. Click on the Reg_Main_Int node; under the Equation table, Main Effects and Two-Factor Interactions should be Yes and the others No.
7. Click on the Reg_Poly_Int node; under the Equation table, Two-Factor Interactions and Polynomial Terms should be Yes and the others No.
8. Click on the Reg_Main_Poly_Int node; under the Equation table, Main Effects, Two-Factor Interactions and Polynomial Terms should all be Yes.
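The Main Effects, Polynomial Terms and Two-Factor Interactions switches control which terms enter the regression equation. The helper below is a rough illustration of that expansion (a hypothetical function, not the SAS implementation; polynomial terms are shown as squares only):

```python
from itertools import combinations

def design_terms(x, main=True, poly=True, inter=True):
    """Expand a row of inputs into main-effect, squared and
    two-factor interaction terms, as in Reg_Main_Poly_Int."""
    terms = {}
    if main:
        terms.update(x)
    if poly:
        terms.update({f"{k}^2": v * v for k, v in x.items()})
    if inter:
        terms.update({f"{a}*{b}": x[a] * x[b]
                      for a, b in combinations(sorted(x), 2)})
    return terms

# Two inputs expand into 2 main effects + 2 squares + 1 interaction.
row = {"satisfaction_level": 0.5, "last_evaluation": 0.8}
print(sorted(design_terms(row)))
```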
9. Click the Assess tab, drag a Model Comparison node to the diagram, and connect all logistic regression nodes to Model Comparison (2).
10. Right click on Model Comparison (2) and click Run. Then, see the results.
11. From the Fit Statistics results, use Microsoft Excel to calculate the values needed to find the best model. Copy the data from the Fit Statistics table and paste it into Excel.
12. Find the gap (valid-train) for the average squared error (ASE), mean squared error (MSE), misclassification rate (MR), and ROC index.
13. After finding the gaps, identify the presence of underfitting and overfitting. There is no underfit model, since no model has a negative ASE, MSE or MR gap, and no model has a positive ROC gap.
14. Overfitting is identified by examining the absolute gap between the train and valid results: choose the model that yields the largest gaps in general. Since Reg_Poly_Int has the largest gap for the majority of measures (ASE, MSE and ROC index), Reg_Poly_Int is the overfit model.
15. To find the best model, we first eliminate the overfit model. Then we find the lowest valid ASE, valid MSE and valid MR, and the largest valid ROC index.
16. Since Reg_Main_Poly_Int has the lowest valid ASE, valid MSE and valid MR and the largest valid ROC index, Reg_Main_Poly_Int is the best model.
17. Since Reg_Main_Poly_Int is the best model, we need to compare it with three selection-method models. Select the Model tab, drag another 3 Logistic Regression nodes to the diagram and connect them to the Data Partition node. Name each logistic regression node as:
Reg_Main_Poly_Int_Forward
Reg_Main_Poly_Int_Backward
Reg_Main_Poly_Int_Stepwise
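Forward, backward and stepwise selection are greedy searches over the candidate effects. The sketch below shows forward selection only, with a made-up score table standing in for a validation fit statistic (lower is better):

```python
def forward_select(candidates, score):
    """Greedy forward selection: repeatedly add the effect that most
    improves the score; stop when no candidate improves it."""
    chosen, best = [], score([])
    while True:
        trials = {e: score(chosen + [e]) for e in candidates if e not in chosen}
        if not trials:
            break
        effect, value = min(trials.items(), key=lambda kv: kv[1])
        if value >= best:
            break
        chosen.append(effect)
        best = value
    return chosen

# Toy validation misclassification rates for each effect subset.
table = {(): 0.30, ("a",): 0.20, ("b",): 0.25, ("c",): 0.29,
         ("a", "b"): 0.18, ("a", "c"): 0.21, ("a", "b", "c"): 0.19}
score = lambda sel: table[tuple(sorted(sel))]

print(forward_select(["a", "b", "c"], score))  # ['a', 'b']
```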
22. Click the Assess tab, drag a Model Comparison node to the diagram, and connect all logistic regression nodes to Model Comparison (3).
23. Right click on Model Comparison (3) and click Run. Then, see the results.
24. From the Fit Statistics results, use Microsoft Excel to calculate the values needed to find the best model. Copy the data from the Fit Statistics table and paste it into Excel.
25. Find the gap (valid-train) for the average squared error (ASE), mean squared error (MSE), misclassification rate (MR), and ROC index.
26. After finding the gaps, identify the presence of underfitting and overfitting. There is no underfit model, since no model has a negative ASE, MSE or MR gap, and no model has a positive ROC gap.
27. Overfitting is identified by examining the absolute gap between the train and valid results: choose the model that yields the largest gaps in general. There is no overfit model, since no model has the largest gap for a majority of measures.
28. To find the best model, we find the lowest valid ASE, valid MSE and valid MR, and the largest valid ROC index.
29. Since Reg_Main_Poly_Int has the lowest valid ASE and valid MSE and the largest valid ROC index, Reg_Main_Poly_Int is the best model for logistic regression.
The Reg_Main_Poly_Int model is better at predicting the employees that do not leave work prematurely (the negative target), since its specificity is higher than its sensitivity.
8.0 Neural Network
1. Select the Model tab. Drag 3 Neural Network nodes to the diagram and connect them to the Data Partition node. Name each neural network node as:
NN_2
NN_5
NN_7
2. Drag a Variable Selection node, found under the Explore tab, to the diagram and connect it to the data. Click the Sample tab, drag a Data Partition node, and connect it to the Variable Selection node.
3. Click Data Partition (2) and, under Data Set Allocations, change Training to 70, Validation to 30 and Test to 0.
4. Select the Model tab. Drag another 3 Neural Network nodes to the diagram and connect them to Data Partition (2). Name each neural network node as:
VS_NN_2
VS_NN_5
VS_NN_7
5. For the NN_2 and VS_NN_2 nodes, go to the properties, select Network and change the Number of Hidden Units to 2.
6. For the NN_5 and VS_NN_5 nodes, go to the properties, select Network and change the Number of Hidden Units to 5.
7. For the NN_7 and VS_NN_7 nodes, go to the properties, select Network and change the Number of Hidden Units to 7.
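The Number of Hidden Units property controls model flexibility: for a single hidden layer, the parameter count grows linearly with the number of hidden units, which is why larger networks overfit more easily. A quick sketch (assuming nine inputs and one output, and ignoring SAS's exact network architecture):

```python
def mlp_params(n_inputs, n_hidden, n_outputs=1):
    """Weights plus biases in a one-hidden-layer network."""
    hidden = n_hidden * (n_inputs + 1)   # weights + bias per hidden unit
    output = n_outputs * (n_hidden + 1)  # weights + bias per output unit
    return hidden + output

# Parameter counts for the three hidden-unit settings used above.
for h in (2, 5, 7):
    print(h, mlp_params(9, h))
```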
8. Go to Assess tab, drag Model Comparison node and connect all Neural Network nodes to the
Model Comparison node.
9. Next, run and see the results.
10. From the Fit Statistics results, use Microsoft Excel to calculate the values needed to find the best model. Copy the data from the Fit Statistics table and paste it into Excel.
11. Find the gap (valid-train) for the misclassification rate (MR), average squared error (ASE), mean squared error (MSE) and ROC index.
12. After finding the gaps, identify the presence of underfitting and overfitting. There is no underfit model, since no model has a negative ASE, MSE or MR gap, and no model has a positive ROC gap.
13. Overfitting is identified by examining the absolute gap between the train and valid results: choose the model that yields the largest gaps in general. Since NN_5 has the largest gap for the majority of measures (ASE, MSE and MR), NN_5 is the overfit model.
14. To find the best neural network model, we first eliminate the overfit model. Then we find the lowest valid ASE, valid MSE and valid MR, and the largest valid ROC index.
15. Since NN_7 has the lowest valid ASE, valid MSE and valid MR, NN_7 is the best model for neural network.
The NN_7 model is better at predicting the employees that do not leave work prematurely (the negative target), since its specificity is higher than its sensitivity.
9.0 Best Model Comparison
1. Since DT_CART is the best model for decision tree, Reg_Main_Poly_Int is the best model for logistic regression and NN_7 is the best model for neural network, we will compare these three models to choose the best model for this study.
2. Drag a Model Comparison node to the diagram, then connect DT_CART, Reg_Main_Poly_Int and NN_7 to the Model Comparison node, which is Model Comparison (5).
3. Right click Model Comparison (5) and click Run. After that, see the results.
4. From the Fit Statistics results, use Microsoft Excel to calculate the values needed to find the best model. Copy the data into Excel and find the gap (valid-train) for the misclassification rate (MR), average squared error (ASE) and ROC index.
5. After finding the gaps, identify the presence of underfitting and overfitting. There is no underfit model, since no model has a negative ASE or MR gap, and no model has a positive ROC gap.
6. Overfitting is identified by examining the absolute gap between the train and valid results: choose the model that yields the largest gaps in general. Since NN_7 has the largest gap for the ASE, MR and ROC, NN_7 is the overfit model.
7. To find the best model, we first eliminate the overfit model. Then we find the lowest valid ASE and valid MR and the largest valid ROC index. Since DT_CART has the lowest valid ASE and valid MR and the largest valid ROC index, DT_CART is the best model to predict the status of employees leaving work prematurely.
10.0 Output Explanation for DT_CART
Output 1
From Output 1, we know that the most important variable is satisfaction_level. Nine variables are used in the decision tree model as splits. The satisfaction_level and time_spend_company variables are each used 10 times in the decision tree model as splits. The last_evaluation variable is used 9 times, the number_project variable 5 times and the Work_accident variable 2 times. The other three variables are used once each in the decision tree model as splits.
Output 2
From Output 2, we know that there are 9 variables involved in building the decision tree and 29 rules, represented by the number of leaves. The depth of this decision tree is 6.
Output 3
*------------------------------------------------------------*
Node = 10
*------------------------------------------------------------*
if satisfaction_level < 0.115
AND number_project >= 2.5 or MISSING
then
Tree Node Identifier = 10
Number of Observations = 626
Predicted: left=1 = 1.00
Predicted: left=0 = 0.00
*------------------------------------------------------------*
Node = 13
*------------------------------------------------------------*
if time_spend_company < 4.5 or MISSING
AND satisfaction_level >= 0.465 or MISSING
AND average_montly_hours >= 290.5
then
Tree Node Identifier = 13
Number of Observations = 6
Predicted: left=1 = 1.00
Predicted: left=0 = 0.00
*------------------------------------------------------------*
Node = 14
*------------------------------------------------------------*
if time_spend_company >= 4.5
AND satisfaction_level >= 0.465 or MISSING
AND last_evaluation < 0.805
then
Tree Node Identifier = 14
Number of Observations = 550
Predicted: left=1 = 0.04
Predicted: left=0 = 0.96
*------------------------------------------------------------*
Node = 19
*------------------------------------------------------------*
if satisfaction_level < 0.465
AND number_project < 2.5
AND average_montly_hours >= 279
then
Tree Node Identifier = 19
Number of Observations = 5
Predicted: left=1 = 0.60
Predicted: left=0 = 0.40
*------------------------------------------------------------*
Node = 21
*------------------------------------------------------------*
if satisfaction_level < 0.465 AND satisfaction_level >= 0.115 or MISSING
AND number_project >= 6.5
then
Tree Node Identifier = 21
Number of Observations = 12
Predicted: left=1 = 1.00
Predicted: left=0 = 0.00
*------------------------------------------------------------*
Node = 28
*------------------------------------------------------------*
if satisfaction_level < 0.465
AND number_project < 2.5
AND last_evaluation < 0.575 or MISSING
AND average_montly_hours < 125.5
then
Tree Node Identifier = 28
Number of Observations = 16
Predicted: left=1 = 0.00
Predicted: left=0 = 1.00
*------------------------------------------------------------*
Node = 31
*------------------------------------------------------------*
if satisfaction_level < 0.465
AND sales IS ONE OF: SALES, PRODUCT_MNG or MISSING
AND number_project < 2.5
AND last_evaluation >= 0.575
AND average_montly_hours < 162 or MISSING
then
Tree Node Identifier = 31
Number of Observations = 21
Predicted: left=1 = 0.00
Predicted: left=0 = 1.00
*------------------------------------------------------------*
Node = 32
*------------------------------------------------------------*
if satisfaction_level < 0.465
AND number_project < 2.5
AND average_montly_hours < 241 AND average_montly_hours >= 162 or MISSING
then
Tree Node Identifier = 32
Number of Observations = 74
Predicted: left=1 = 0.01
Predicted: left=0 = 0.99
*------------------------------------------------------------*
Node = 34
*------------------------------------------------------------*
if satisfaction_level < 0.465 AND satisfaction_level >= 0.115 or MISSING
AND number_project < 6.5 AND number_project >= 2.5 or MISSING
AND average_montly_hours < 289 or MISSING
then
Tree Node Identifier = 34
Number of Observations = 1095
Predicted: left=1 = 0.06
Predicted: left=0 = 0.94
*------------------------------------------------------------*
Node = 35
*------------------------------------------------------------*
if satisfaction_level < 0.465 AND satisfaction_level >= 0.115 or MISSING
AND number_project < 6.5 AND number_project >= 2.5 or MISSING
AND average_montly_hours >= 289
then
Tree Node Identifier = 35
Number of Observations = 7
Predicted: left=1 = 1.00
Predicted: left=0 = 0.00
*------------------------------------------------------------*
Node = 38
*------------------------------------------------------------*
if time_spend_company < 4.5 or MISSING
AND satisfaction_level >= 0.465 or MISSING
AND sales IS ONE OF: TECHNICAL, SUPPORT, IT or MISSING
AND number_project >= 5.5
AND average_montly_hours < 290.5 or MISSING
then
Tree Node Identifier = 38
Number of Observations = 47
Predicted: left=1 = 0.15
Predicted: left=0 = 0.85
*------------------------------------------------------------*
Node = 45
*------------------------------------------------------------*
if time_spend_company >= 4.5
AND satisfaction_level >= 0.465 or MISSING
AND last_evaluation >= 0.995
AND average_montly_hours < 216.5
then
Tree Node Identifier = 45
Number of Observations = 5
Predicted: left=1 = 0.80
Predicted: left=0 = 0.20
*------------------------------------------------------------*
Node = 47
*------------------------------------------------------------*
if time_spend_company >= 6.5
AND satisfaction_level >= 0.465 or MISSING
AND last_evaluation >= 0.805 or MISSING
AND average_montly_hours >= 216.5 or MISSING
then
Tree Node Identifier = 47
Number of Observations = 43
Predicted: left=1 = 0.00
Predicted: left=0 = 1.00
*------------------------------------------------------------*
Node = 48
*------------------------------------------------------------*
if satisfaction_level < 0.465
AND number_project < 2.5
AND last_evaluation < 0.445
AND average_montly_hours < 162 AND average_montly_hours >= 125.5 or MISSING
then
Tree Node Identifier = 48
Number of Observations = 10
Predicted: left=1 = 0.00
Predicted: left=0 = 1.00
*------------------------------------------------------------*
Node = 49
*------------------------------------------------------------*
if satisfaction_level < 0.465
AND number_project < 2.5
AND last_evaluation < 0.575 AND last_evaluation >= 0.445 or MISSING
AND average_montly_hours < 162 AND average_montly_hours >= 125.5 or MISSING
then
Tree Node Identifier = 49
Number of Observations = 1094
Predicted: left=1 = 0.99
Predicted: left=0 = 0.01
*------------------------------------------------------------*
Node = 50
*------------------------------------------------------------*
if satisfaction_level < 0.32 or MISSING
AND sales IS ONE OF: TECHNICAL
AND number_project < 2.5
AND last_evaluation >= 0.575
AND average_montly_hours < 162 or MISSING
then
Tree Node Identifier = 50
Number of Observations = 5
Predicted: left=1 = 0.00
Predicted: left=0 = 1.00
*------------------------------------------------------------*
Node = 51
*------------------------------------------------------------*
if satisfaction_level < 0.465 AND satisfaction_level >= 0.32
AND sales IS ONE OF: TECHNICAL
AND number_project < 2.5
AND last_evaluation >= 0.575
AND average_montly_hours < 162 or MISSING
then
Tree Node Identifier = 51
Number of Observations = 5
Predicted: left=1 = 0.40
Predicted: left=0 = 0.60
*------------------------------------------------------------*
Node = 54
*------------------------------------------------------------*
if satisfaction_level < 0.465
AND number_project < 2.5
AND last_evaluation < 0.585
AND average_montly_hours < 279 AND average_montly_hours >= 241
then
Tree Node Identifier = 54
Number of Observations = 6
Predicted: left=1 = 0.50
Predicted: left=0 = 0.50
*------------------------------------------------------------*
Node = 55
*------------------------------------------------------------*
if satisfaction_level < 0.465
AND number_project < 2.5
AND last_evaluation >= 0.585 or MISSING
AND average_montly_hours < 279 AND average_montly_hours >= 241
then
Tree Node Identifier = 55
Number of Observations = 12
Predicted: left=1 = 0.00
Predicted: left=0 = 1.00
*------------------------------------------------------------*
Node = 58
*------------------------------------------------------------*
if time_spend_company < 3.5 or MISSING
AND satisfaction_level >= 0.465 or MISSING
AND number_project < 2.5
AND average_montly_hours < 290.5 or MISSING
then
Tree Node Identifier = 58
Number of Observations = 292
Predicted: left=1 = 0.04
Predicted: left=0 = 0.96
*------------------------------------------------------------*
Node = 59
*------------------------------------------------------------*
if time_spend_company < 3.5 or MISSING
AND satisfaction_level >= 0.465 or MISSING
AND number_project < 5.5 AND number_project >= 2.5 or MISSING
AND average_montly_hours < 290.5 or MISSING
then
Tree Node Identifier = 59
Number of Observations = 4814
Predicted: left=1 = 0.00
Predicted: left=0 = 1.00
*------------------------------------------------------------*
Node = 60
*------------------------------------------------------------*
if time_spend_company < 4.5 AND time_spend_company >= 3.5
AND satisfaction_level >= 0.465 or MISSING
AND sales IS ONE OF: HR, TECHNICAL
AND number_project < 5.5 or MISSING
AND average_montly_hours < 290.5 or MISSING
then
Tree Node Identifier = 60
Number of Observations = 222
Predicted: left=1 = 0.08
Predicted: left=0 = 0.92
*------------------------------------------------------------*
Node = 61
*------------------------------------------------------------*
if time_spend_company < 4.5 AND time_spend_company >= 3.5
AND satisfaction_level >= 0.465 or MISSING
AND sales IS ONE OF: SALES, ACCOUNTING, SUPPORT, IT, PRODUCT_MNG, MARKETING, MANAGEMENT, RANDD or MISSING
AND number_project < 5.5 or MISSING
AND average_montly_hours < 290.5 or MISSING
then
Tree Node Identifier = 61
Number of Observations = 685
Predicted: left=1 = 0.02
Predicted: left=0 = 0.98
*------------------------------------------------------------*
Node = 64
*------------------------------------------------------------*
if time_spend_company < 2.5
AND satisfaction_level >= 0.465 or MISSING
AND sales IS ONE OF: SALES, PRODUCT_MNG, RANDD
AND number_project >= 5.5
AND average_montly_hours < 290.5 or MISSING
then
Tree Node Identifier = 64
Number of Observations = 11
Predicted: left=1 = 0.18
Predicted: left=0 = 0.82
*------------------------------------------------------------*
Node = 65
*------------------------------------------------------------*
if time_spend_company < 4.5 AND time_spend_company >= 2.5 or MISSING
AND satisfaction_level >= 0.465 or MISSING
AND sales IS ONE OF: SALES, PRODUCT_MNG, RANDD
AND number_project >= 5.5
AND average_montly_hours < 290.5 or MISSING
then
Tree Node Identifier = 65
Number of Observations = 34
Predicted: left=1 = 0.00
Predicted: left=0 = 1.00
*------------------------------------------------------------*
Node = 68
*------------------------------------------------------------*
if time_spend_company >= 4.5
AND satisfaction_level >= 0.465 or MISSING
AND number_project < 2.5
AND last_evaluation < 0.995 AND last_evaluation >= 0.805 or MISSING
AND average_montly_hours < 216.5
then
Tree Node Identifier = 68
Number of Observations = 14
Predicted: left=1 = 0.29
Predicted: left=0 = 0.71
*------------------------------------------------------------*
Node = 69
*------------------------------------------------------------*
if time_spend_company >= 4.5
AND satisfaction_level >= 0.465 or MISSING
AND number_project >= 2.5 or MISSING
AND last_evaluation < 0.995 AND last_evaluation >= 0.805 or MISSING
AND average_montly_hours < 216.5
then
Tree Node Identifier = 69
Number of Observations = 149
Predicted: left=1 = 0.03
Predicted: left=0 = 0.97
*------------------------------------------------------------*
Node = 70
*------------------------------------------------------------*
if time_spend_company < 6.5 AND time_spend_company >= 4.5 or MISSING
AND satisfaction_level < 0.705 AND satisfaction_level >= 0.465
AND last_evaluation >= 0.805 or MISSING
AND average_montly_hours >= 216.5 or MISSING
then
Tree Node Identifier = 70
Number of Observations = 32
Predicted: left=1 = 0.25
Predicted: left=0 = 0.75
*------------------------------------------------------------*
Node = 71
*------------------------------------------------------------*
if time_spend_company < 6.5 AND time_spend_company >= 4.5 or MISSING
AND satisfaction_level >= 0.705 or MISSING
AND last_evaluation >= 0.805 or MISSING
AND average_montly_hours >= 216.5 or MISSING
then
Tree Node Identifier = 71
Number of Observations = 606
Predicted: left=1 = 0.95
Predicted: left=0 = 0.05
From Output 3, there are 29 rules in the decision tree model. The distribution of the rules is as follows:
There are 8 rules for predicting the employees that leave work prematurely (Y=1)
There are 20 rules for predicting the employees that do not leave work prematurely (Y=0)
There is 1 rule that cannot be used for predicting the target Y
There are 10498 observations used to grow the tree, which is the size of the training data set.
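Each English rule in the listing translates directly into a conditional check. Below is a sketch of the Node 10 rule (a hypothetical function name; a missing number_project still satisfies the ">= 2.5 or MISSING" branch, as in the listing):

```python
def node_10_rule(satisfaction_level, number_project):
    """Node 10 of DT_CART: predict left=1 for very low satisfaction
    combined with number_project >= 2.5 (or missing)."""
    if satisfaction_level is None or satisfaction_level >= 0.115:
        return False
    # "number_project >= 2.5 or MISSING": missing follows this branch.
    return number_project is None or number_project >= 2.5

print(node_10_rule(0.09, 6), node_10_rule(0.80, 6))  # True False
```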
Examples of profiles for predicting the workers that do not leave (Y=0):
if satisfaction_level < 0.465
AND sales IS ONE OF: SALES, PRODUCT_MNG or MISSING
AND number_project < 2.5
AND last_evaluation >= 0.575
AND average_montly_hours < 162 or MISSING

if time_spend_company < 3.5 or MISSING
AND satisfaction_level >= 0.465 or MISSING
AND number_project < 2.5
AND average_montly_hours < 290.5 or MISSING

if time_spend_company < 6.5 AND time_spend_company >= 4.5 or MISSING
AND satisfaction_level < 0.705 AND satisfaction_level >= 0.465
AND last_evaluation >= 0.805 or MISSING
AND average_montly_hours >= 216.5 or MISSING

An example of a profile for predicting the workers that leave (Y=1):
if satisfaction_level < 0.115
AND number_project >= 2.5 or MISSING
Output 4
The plot shows the ASE corresponding to each subtree as the data is sequentially split. The performance of the tree is assessed from the optimal subtree, which is the number of leaves that gives the smallest average squared error on the validation data set. Number of leaves: 29.
Output 5
The plot shows the misclassification rate corresponding to each subtree as the data is sequentially split. The performance of the tree is assessed from the optimal subtree, which is the number of leaves that gives the smallest misclassification rate on the validation data set. Number of leaves: 29.
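The average squared error plotted in these outputs is the mean squared difference between the 0/1 target and the model's predicted probability. A minimal sketch with made-up values:

```python
def average_squared_error(actual, predicted_prob):
    """ASE: mean of (target - predicted probability) squared."""
    pairs = zip(actual, predicted_prob)
    return sum((a - p) ** 2 for a, p in pairs) / len(actual)

# Hypothetical 0/1 targets and predicted probabilities of leaving.
y = [1, 0, 0, 1]
p = [0.9, 0.1, 0.2, 0.6]
print(round(average_squared_error(y, p), 3))  # 0.055
```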
11.0 Conclusion
First, among the five decision tree models (DT_GINI, DT_ENTROPY, DT_LOGWORTH, DT_CHAID and DT_CART), we found the best model. Based on the results of SAS Enterprise Miner, the best model for decision tree is DT_CART.
Second, among the seven logistic regression models (Reg_Main, Reg_Poly, Reg_Int, Reg_Main_Poly, Reg_Main_Int, Reg_Poly_Int and Reg_Main_Poly_Int), the best model is Reg_Main_Poly_Int. We then compared the Reg_Main_Poly_Int model again using other selection methods: forward, backward and stepwise. Among the four resulting logistic regression models (Reg_Main_Poly_Int, Reg_Main_Poly_Int_Forward, Reg_Main_Poly_Int_Backward and Reg_Main_Poly_Int_Stepwise), the best model for logistic regression is Reg_Main_Poly_Int.
Third, among the six neural network models (NN_2, NN_5, NN_7, VS_NN_2, VS_NN_5 and VS_NN_7), the best model for neural network is NN_7.
Lastly, among DT_CART, Reg_Main_Poly_Int and NN_7, we found that the best model to predict the employees that leave work prematurely is DT_CART.