Objective
To find the logistic regression equation for classifying high- and low-priority items. The dependent variable is High_Priority.
Methodology
Since the dependent variable is a dichotomous categorical variable and the independent variables are both categorical and metric, logistic regression is used to identify high-priority items. The result of the logistic regression was also compared with a classification tree.
1) Clean and prepare the data
2) Perform enter-method logistic regression
3) Calculate chance accuracy
4) Perform stepwise logistic regression (backward Wald method)
5) Validate the model's classification accuracy against chance accuracy
Of the above variables, the following were entered as categorical variables:
Item_Fat_Content, Item_Type, Outlet_Size, Outlet_Location_Type, Outlet_Type and
Outlet_Identifier.
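The enter-method fit described above can be sketched in Python with statsmodels; this is a minimal illustration assuming a BigMart-style DataFrame, with synthetic stand-in data since the original file is not reproduced here (only a subset of the report's column names is used):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data; the real dataset would be loaded from file instead
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "High_Priority": rng.integers(0, 2, n),
    "Item_MRP": rng.uniform(30, 270, n),
    "Item_Weight": rng.uniform(4, 22, n),
    "Outlet_Type": rng.choice(["Grocery", "Supermarket1", "Supermarket2"], n),
    "Outlet_Size": rng.choice(["Small", "Medium", "High"], n),
})

# Enter method: all predictors in one block; C() applies dummy coding
# to the categorical variables, as SPSS does
model = smf.logit(
    "High_Priority ~ Item_MRP + Item_Weight + C(Outlet_Type) + C(Outlet_Size)",
    data=df,
).fit(disp=0)
print(model.summary())
```

The summary table printed here plays the same role as SPSS's "Variables in the Equation" output: coefficients, standard errors, and per-variable significance values.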
OUTPUT INTERPRETATION
We received a warning indicating that collinearity is present among the variables. Since logistic regression, like other regression methods, does not tolerate collinearity, we stopped the analysis at this point. Further investigation revealed that Outlet_Location_Type, Outlet_Type and Outlet_Identifier are correlated.
After a few iterations with each variable, only Outlet_Type was retained for the next iteration.
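The kind of collinearity described here typically arises when one categorical variable is fully determined by another (each outlet identifier belongs to exactly one outlet type, for instance), so their dummy columns are linearly dependent. A quick crosstab check, with hypothetical identifiers, makes this visible:

```python
import pandas as pd

# Toy illustration: each Outlet_Identifier maps to exactly one Outlet_Type,
# so the two variables' dummy columns are perfectly collinear
# (identifier and type labels here are hypothetical)
df = pd.DataFrame({
    "Outlet_Identifier": ["OUT1", "OUT1", "OUT2", "OUT2", "OUT3", "OUT3"],
    "Outlet_Type": ["Grocery", "Grocery", "Super1", "Super1", "Super1", "Super1"],
})

# If every identifier falls in a single type column, each crosstab row has
# exactly one non-zero cell: Outlet_Type is redundant given the identifier
ct = pd.crosstab(df["Outlet_Identifier"], df["Outlet_Type"])
print(ct)
print("perfectly nested:", (ct.gt(0).sum(axis=1) == 1).all())
```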
OUTPUT INTERPRETATION AFTER REMOVING COLLINEARITY
No warning appeared, which indicates that collinearity is no longer present in the data.
BLOCK 0: Output
Classification Tablea,b
Predicted
Hi_priority
Percentage
Observed 0 1 Correct
1 0 3877 100.0
This table shows the coding as per categories of independent variables.
Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1      224.859      8    .000
The Hosmer and Lemeshow test fails: its significance value should be greater than 0.05, but here it is .000. Since all other assumptions held, we proceeded with the analysis and compared the results with classification trees.
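For readers outside SPSS, the Hosmer and Lemeshow statistic can be reproduced by hand: sort cases by predicted probability, split them into deciles, and compare observed versus expected counts in each group with a chi-square statistic on g - 2 degrees of freedom (df = 8 for ten groups, matching the tables here). A sketch on synthetic, roughly calibrated data:

```python
import numpy as np
from scipy import stats

def hosmer_lemeshow(y, p, g=10):
    """Hosmer-Lemeshow goodness-of-fit test over g probability groups.
    Returns (chi-square, df, p-value); Sig. > 0.05 suggests adequate fit."""
    order = np.argsort(p)
    y, p = np.asarray(y)[order], np.asarray(p)[order]
    chi2 = 0.0
    for idx in np.array_split(np.arange(len(p)), g):
        n = len(idx)
        obs1, exp1 = y[idx].sum(), p[idx].sum()        # observed/expected 1s
        obs0, exp0 = n - obs1, n - exp1                # observed/expected 0s
        chi2 += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    df = g - 2
    return chi2, df, 1 - stats.chi2.cdf(chi2, df)

# Well-calibrated synthetic predictions should usually pass the test
rng = np.random.default_rng(1)
p = rng.uniform(0.05, 0.95, 2000)
y = rng.binomial(1, p)
chi2, df, sig = hosmer_lemeshow(y, p)
print(round(chi2, 2), df, round(sig, 3))
```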
Variables in the Equation
95% C.I.for EXP(B)
B S.E. Wald df Sig. Exp(B) Lower Upper
Step 1 a Item_MRP .031 .001 696.469 1 .000 1.031 1.029 1.034
Outlet_Age -.010 .022 .215 1 .643 .990 .948 1.033
Outlet_Type_New(1) .632 .263 5.762 1 .016 1.882 1.123 3.153
Item_Weight -.006 .010 .361 1 .548 .994 .974 1.014
Item_Visibility .258 1.045 .061 1 .805 1.294 .167 10.030
Item_type_New 13.070 15 .597
Item_type_New(1) -.209 .399 .275 1 .600 .811 .371 1.773
Item_type_New(2) .676 .494 1.873 1 .171 1.965 .747 5.171
Item_type_New(3) -.823 .527 2.437 1 .119 .439 .156 1.234
Item_type_New(4) -.076 .404 .036 1 .850 .927 .420 2.045
Item_type_New(5) -.202 .404 .250 1 .617 .817 .370 1.803
Item_type_New(6) -.098 .395 .061 1 .804 .907 .418 1.968
Item_type_New(7) -.065 .391 .027 1 .868 .937 .436 2.015
Item_type_New(8) -.316 .478 .437 1 .508 .729 .286 1.860
Item_type_New(9) -.276 .414 .444 1 .505 .759 .337 1.708
Item_type_New(10) -.213 .403 .278 1 .598 .808 .367 1.781
Item_type_New(11) -.194 .426 .209 1 .648 .823 .357 1.896
Item_type_New(12) .071 .507 .020 1 .888 1.074 .397 2.902
Item_type_New(13) -.481 .700 .471 1 .493 .618 .157 2.440
Item_type_New(14) -.052 .393 .017 1 .895 .950 .440 2.052
Item_type_New(15) -.215 .420 .263 1 .608 .806 .354 1.836
Outlet_Size_New .449 2 .799
Outlet_Size_New(1) -.154 .320 .233 1 .630 .857 .458 1.604
Outlet_Size_New(2) .037 .135 .074 1 .785 1.038 .796 1.353
Fat_content_new(1) .112 .108 1.088 1 .297 1.119 .906 1.382
Constant -2.044 .448 20.825 1 .000 .129
Many of these variables are not significant, so they must be removed to form an accurate equation.
Model Summary
The Nagelkerke R² of 0.397 indicates a good model fit.

Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1      224.859      8    .000
2      224.134      8    .000
3      223.111      8    .000
4      228.720      8    .000
5      225.398      8    .000
6      233.094      8    .000
The Hosmer and Lemeshow test again fails at every step (Sig. < 0.05), but with the other assumptions satisfied we proceeded and compared the results with classification trees.
Classification Table a
                                    Predicted
                               Hi_priority     Percentage
Observed                         0       1       Correct
Step 1 Hi_priority 0 391 382 50.6
1 150 3727 96.1
Overall Percentage 88.6
Step 2 Hi_priority 0 390 383 50.5
1 150 3727 96.1
Overall Percentage 88.5
Step 3 Hi_priority 0 392 381 50.7
1 153 3724 96.1
Overall Percentage 88.5
Step 4 Hi_priority 0 388 385 50.2
1 158 3719 95.9
Overall Percentage 88.3
Step 5 Hi_priority 0 390 383 50.5
1 159 3718 95.9
Overall Percentage 88.3
Step 6 Hi_priority 0 393 380 50.8
1 159 3718 95.9
Overall Percentage 88.4
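Each step's classification table is a confusion matrix, and the overall percentage is just the share of correctly classified cases. Checking the step-6 row from the table above:

```python
# Step-6 classification table read as a confusion matrix
tn, fp = 393, 380    # observed 0: predicted 0, predicted 1
fn, tp = 159, 3718   # observed 1: predicted 0, predicted 1

total = tn + fp + fn + tp
overall = 100 * (tn + tp) / total   # correctly classified / all cases
print(round(overall, 1))            # → 88.4, matching the table
```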
Of all the variables, only three significant ones were found: Item_MRP, Outlet_Age and Outlet_Type.
However, some cases were excluded because of missing values in Outlet_Size and Item_Weight. Since those two variables are not significant, it is better to drop them and use every case, which increases the sample size and improves the prediction.
The classification plot below shows that cases with a predicted probability of success above 0.5 are classified as high priority, and most of them are classified correctly.
Step number: 6
800 + +
| 1|
| 1|
F | 1|
R 600 + 1+
E | 1|
Q | 11|
U | 11|
E 400 + 11+
N | 111|
C | 111|
Y | 1111|
200 + 11111+
| 111111|
| 1 1111 1111111|
| 11 1101011 11 1 1111111111111111111111111111111111|
Predicted ---------+---------+---------+---------+---------+---------+---------+---------+---------+----------
Prob: 0 .1 .2 .3 .4 .5 .6 .7 .8 .9 1
Group: 0000000000000000000000000000000000000000000000000011111111111111111111111111111111111111111111111111
Independent Variables used: Outlet type, Outlet Age and Item MRP
Dependent Variable: High_priority
The full sample is used in this regression, and the absence of warnings indicates no collinearity.
Chance Accuracy
Classification Table a,b
                                    Predicted
                               Hi_priority     Percentage
Observed                         0       1       Correct
Step 0   Hi_priority    0        0    2131            .0
                        1        0    6392         100.0
         Overall Percentage                         75.0
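The chance accuracy benchmark follows directly from the Block 0 counts above: the proportional chance accuracy is p² + (1 - p)², and the customary criterion multiplies it by 1.25.

```python
# Proportional chance accuracy from the Block 0 classification counts
n0, n1 = 2131, 6392
total = n0 + n1
p1 = n1 / total                   # ~0.75: share in the majority class
chance = p1 ** 2 + (1 - p1) ** 2  # proportional chance accuracy (~62.5%)
criterion = 1.25 * chance         # the usual 1.25x benchmark (~78.1%)
print(round(100 * p1, 1), round(100 * criterion, 1))  # → 75.0 78.1
```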
Hosmer and Lemeshow Test
Step   Chi-square   df   Sig.
1      216.668      8    .000
The Hosmer and Lemeshow test fails here as well (Sig. < 0.05), so, as before, we proceeded and compared the results with classification trees.
Classification Table a
                                    Predicted
                               Hi_priority     Percentage
Observed                         0       1       Correct
Since the model's classification accuracy of 90% exceeds the chance accuracy criterion of 78.1%, the model can be used for prediction.
Variables in the Equation
95% C.I.for EXP(B)
B S.E. Wald df Sig. Exp(B) Lower Upper
Step 1 a   Item_MRP              .031    .001   1097.603   1   .000   1.032   1.030   1.033
           Outlet_Age           -.014    .006      5.088   1   .024    .986    .974    .998
           Outlet_Type_New                      1351.281   3   .000
           Outlet_Type_New(1)  -8.777    .269   1066.242   1   .000    .000    .000    .000
Y = 0.496 + 0.031*Item_MRP - 0.014*Outlet_Age - 8.777*Outlet_Type_New(1) - 2.054*Outlet_Type_New(2) - 2.629*Outlet_Type_New(3)

P(success) = 1 / (1 + e^(-Y))
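The fitted equation can be turned into a scoring function: compute the logit Y, pass it through the logistic link, and classify as high priority when the probability exceeds 0.5. The Outlet_Type_New(k) arguments below are the 0/1 dummies from the SPSS coding; the example inputs are hypothetical.

```python
import math

def p_high_priority(item_mrp, outlet_age, ot1=0, ot2=0, ot3=0):
    """Probability of High_Priority from the fitted logit equation.
    ot1..ot3 are the 0/1 Outlet_Type_New dummies (reference category: all 0)."""
    y = (0.496 + 0.031 * item_mrp - 0.014 * outlet_age
         - 8.777 * ot1 - 2.054 * ot2 - 2.629 * ot3)
    return 1 / (1 + math.exp(-y))  # logistic link

# Hypothetical item: MRP 200 in a 15-year-old reference-category outlet
p = p_high_priority(item_mrp=200, outlet_age=15)
print(round(p, 3), "-> high priority" if p > 0.5 else "-> low priority")
```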
CART
Model Summary
Number of Nodes 18
Depth 3
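The CART comparison can be sketched with scikit-learn's DecisionTreeClassifier, capping the tree at the reported depth of 3. The data below is a synthetic stand-in (two numeric features loosely mimicking Item_MRP and Outlet_Age, with a toy target), not the report's dataset:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in features and target
rng = np.random.default_rng(3)
X = np.column_stack([rng.uniform(30, 270, 1000),   # Item_MRP-like
                     rng.integers(0, 30, 1000)])   # Outlet_Age-like
y = (X[:, 0] > 150).astype(int)                    # toy high-priority rule

# max_depth=3 caps growth at the depth reported in the Model Summary
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print("nodes:", tree.tree_.node_count, "depth:", tree.get_depth())
print("training accuracy:", round(tree.score(X, y), 3))
```

SPSS's CART also prunes by minimum node sizes; only the depth cap is mirrored here.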
Risk
Classification