
Advanced Data Analysis

Identify High Priority Items


Table of Contents

Objective
Methodology
Data Cleaning and Preparation
Perform Enter Method Logistic Regression
Perform Stepwise Logistic Regression
Perform Stepwise Logistic Regression Using Only Significant Variables
Logistic Regression Equation
Classification Tree (Exhaustive CHAID)

Objective
To find the logistic regression equation for classifying items as high or low priority. The
dependent variable is Hi_priority.

Methodology
Since the dependent variable is a dichotomous categorical variable and the independent
variables are both categorical and metric, logistic regression is used to identify high-priority
items. The result of the logistic regression is also compared with a classification tree.
1) Clean and prepare the data
2) Perform enter-method logistic regression
3) Calculate the chance accuracy
4) Perform stepwise logistic regression (Backward: Wald method)
5) Validate the model's classification accuracy against the chance accuracy

Data Cleaning and Preparation

The variable "Item_Fat_Content" contained four categories: LF, Low Fat, Regular and reg. It
is assumed that LF means the same as Low Fat, and reg the same as Regular, so the four
categories were collapsed into the two actual categories. Several records had missing values
for Item_Weight and Outlet_Size. A new variable, Outlet_Age, was also derived from the
establishment year, with 2010 as the base year:
Outlet_Age = 2010 - Outlet_Establishment_Year
Finally, categorical variables were converted to numerical codes, as sketched below.
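A minimal pandas sketch of these cleaning steps (the file name sales_data.csv and the exact
column names are assumptions for illustration, not from the report):

import pandas as pd

# Load the raw data (file name assumed for illustration)
df = pd.read_csv("sales_data.csv")

# Collapse the four fat-content labels into the two actual categories
df["Item_Fat_Content"] = df["Item_Fat_Content"].replace(
    {"LF": "Low Fat", "reg": "Regular"})

# Derive outlet age with 2010 as the base year
df["Outlet_Age"] = 2010 - df["Outlet_Establishment_Year"]

# Convert the categorical variables to numerical codes
# (missing values, e.g. in Outlet_Size, become code -1)
for col in ["Item_Fat_Content", "Item_Type", "Outlet_Size",
            "Outlet_Location_Type", "Outlet_Type", "Outlet_Identifier"]:
    df[col] = df[col].astype("category").cat.codes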

Perform Enter Method Logistic Regression

The following variables were entered as the dependent and independent variables.
Dependent Variable: Hi_priority
Independent Variables: Item_Weight, Item_Fat_Content, Item_Visibility, Item_Type,
Item_MRP, Outlet_Size, Outlet_Location_Type, Outlet_Type, Outlet_Age, Outlet_Identifier

Of these, the following were entered as categorical variables:
Item_Fat_Content, Item_Type, Outlet_Size, Outlet_Location_Type, Outlet_Type and
Outlet_Identifier. A sketch of an equivalent fit follows.
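The analysis itself was run in SPSS; a rough statsmodels equivalent of the enter-method fit
(column names, including the Hi_priority target, are assumptions) might look like this:

import pandas as pd
import statsmodels.api as sm

categorical = ["Item_Fat_Content", "Item_Type", "Outlet_Size",
               "Outlet_Location_Type", "Outlet_Type", "Outlet_Identifier"]
metric = ["Item_Weight", "Item_Visibility", "Item_MRP", "Outlet_Age"]

# Indicator (dummy) coding of the categorical predictors, as SPSS does
X = pd.get_dummies(df[metric + categorical], columns=categorical,
                   drop_first=True).astype(float)
X = sm.add_constant(X)
y = df["Hi_priority"]

# Enter method: every predictor enters the model in a single step
enter_fit = sm.Logit(y, X, missing="drop").fit()
print(enter_fit.summary())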

OUTPUT INTERPRETATION

We received a warning indicating that collinearity is present among the variables. Since
logistic regression, like other regression models, does not tolerate collinear predictors, we
stopped the analysis at this point. Further investigation revealed that Outlet_Location_Type,
Outlet_Type and Outlet_Identifier are correlated.

After a few iterations dropping each of these variables in turn, only Outlet_Type was
considered relevant and retained for the next iteration. A collinearity check is sketched below.
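SPSS raises the warning itself; one way to reproduce the collinearity check outside SPSS is
with variance inflation factors (a sketch reusing the design matrix X from above; the VIF > 10
cutoff is a common rule of thumb, not from the report):

from statsmodels.stats.outliers_influence import variance_inflation_factor

X_complete = X.dropna()   # VIF cannot be computed on rows with missing values
vif = {col: variance_inflation_factor(X_complete.values, i)
       for i, col in enumerate(X_complete.columns)}
for col, v in sorted(vif.items(), key=lambda kv: -kv[1]):
    if v > 10:                       # rule-of-thumb collinearity flag
        print(f"{col}: VIF = {v:.1f}")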

OUTPUT INTERPRETATION AFTER REMOVING COLLINEARITY

No warning appeared, indicating that the collinearity has been removed.

BLOCK 0: Output
Classification Table (a,b)

                                        Predicted
                                   Hi_priority       Percentage
        Observed                   0         1       Correct
Step 0  Hi_priority          0     0       773       .0
                             1     0      3877       100.0
        Overall Percentage                           83.4

a. Constant is included in the model.
b. The cut value is .500

The chance accuracy is (773/4650)² + (3877/4650)² ≈ 72%.

For the model to be valid, its overall classification accuracy should therefore exceed
1.25 × 72% ≈ 90.3%, as verified below.
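Both numbers follow directly from the Block 0 counts; a quick check:

# Proportional chance criterion from the Block 0 class counts
n0, n1 = 773, 3877
n = n0 + n1                                     # 4650 cases
chance = (n0 / n) ** 2 + (n1 / n) ** 2
print(f"chance accuracy: {chance:.3f}")         # ≈ 0.723
print(f"validity cutoff: {1.25 * chance:.3f}")  # ≈ 0.903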

This table shows the coding of the categories of the independent variables. Indicator coding
is used: each non-reference category is coded 1.000 on its own parameter, and the last
category of each variable (coded .000 on every parameter) is the reference.

Categorical Variables Codings

Variable          Category               Frequency   Parameter
Item_type_New     Baking Goods             536       (1)
                  Breads                   204       (2)
                  Breakfast                 89       (3)
                  Canned                   539       (4)
                  Dairy                    566       (5)
                  Frozen Foods             718       (6)
                  Fruits and Vegetables   1019       (7)
                  Hard Drinks              183       (8)
                  Health and Hygiene       430       (9)
                  Household                759       (10)
                  Meat                     337       (11)
                  Others                   137       (12)
                  Seafood                   51       (13)
                  Snack Foods              988       (14)
                  Soft Drinks              374       (15)
                  Starchy Foods            130       (reference)
Outlet_Type_New   Grocery Store            555       (1)
                  Supermarket Type1       5577       (2)
                  Supermarket Type2        928       (reference)
Fat_content_new   Low Fat                 4566       (1)
                  Regular                 2494       (reference)

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      2919.502 (a)        .238                   .401

a. Estimation terminated at iteration number 6 because parameter estimates
   changed by less than .001.

The Nagelkerke R² of 0.401 indicates a good model fit.

Hosmer and Lemeshow Test

Step   Chi-square   df   Sig.
1      224.859      8    .000

The Hosmer and Lemeshow test fails: a well-fitting model requires a significance value above
0.05, and here p < .001. Since all other assumptions held, we proceeded with the analysis and
cross-checked the results against a classification tree. (A sketch of how this statistic is
computed follows.)
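SPSS reports the statistic directly; for reference, a sketch of how the Hosmer and Lemeshow
statistic is computed (g = 10 probability deciles is the usual choice):

import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y_true, p_hat, g=10):
    """Chi-square over observed vs. expected events in g probability groups."""
    d = pd.DataFrame({"y": y_true, "p": p_hat})
    d["group"] = pd.qcut(d["p"], g, duplicates="drop")
    n = d.groupby("group", observed=True)["y"].count()      # cases per group
    obs1 = d.groupby("group", observed=True)["y"].sum()     # observed events
    exp1 = d.groupby("group", observed=True)["p"].sum()     # expected events
    stat = (((obs1 - exp1) ** 2 / exp1)
            + ((obs1 - exp1) ** 2 / (n - exp1))).sum()
    return stat, chi2.sf(stat, len(n) - 2)   # statistic and p-value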

Variables in the Equation

                                                                95% C.I. for Exp(B)
                           B       S.E.    Wald      df  Sig.  Exp(B)   Lower   Upper
Step 1  Item_MRP           .031    .001    696.469   1   .000  1.031    1.029   1.034
        Outlet_Age         -.010   .022    .215      1   .643  .990     .948    1.033
        Outlet_Type_New(1) .632    .263    5.762     1   .016  1.882    1.123   3.153
        Item_Weight        -.006   .010    .361      1   .548  .994     .974    1.014
        Item_Visibility    .258    1.045   .061      1   .805  1.294    .167    10.030
        Item_type_New                      13.070    15  .597
        Item_type_New(1)   -.209   .399    .275      1   .600  .811     .371    1.773
        Item_type_New(2)   .676    .494    1.873     1   .171  1.965    .747    5.171
        Item_type_New(3)   -.823   .527    2.437     1   .119  .439     .156    1.234
        Item_type_New(4)   -.076   .404    .036      1   .850  .927     .420    2.045
        Item_type_New(5)   -.202   .404    .250      1   .617  .817     .370    1.803
        Item_type_New(6)   -.098   .395    .061      1   .804  .907     .418    1.968
        Item_type_New(7)   -.065   .391    .027      1   .868  .937     .436    2.015
        Item_type_New(8)   -.316   .478    .437      1   .508  .729     .286    1.860
        Item_type_New(9)   -.276   .414    .444      1   .505  .759     .337    1.708
        Item_type_New(10)  -.213   .403    .278      1   .598  .808     .367    1.781
        Item_type_New(11)  -.194   .426    .209      1   .648  .823     .357    1.896
        Item_type_New(12)  .071    .507    .020      1   .888  1.074    .397    2.902
        Item_type_New(13)  -.481   .700    .471      1   .493  .618     .157    2.440
        Item_type_New(14)  -.052   .393    .017      1   .895  .950     .440    2.052
        Item_type_New(15)  -.215   .420    .263      1   .608  .806     .354    1.836
        Outlet_Size_New                    .449      2   .799
        Outlet_Size_New(1) -.154   .320    .233      1   .630  .857     .458    1.604
        Outlet_Size_New(2) .037    .135    .074      1   .785  1.038    .796    1.353
        Fat_content_new(1) .112    .108    1.088     1   .297  1.119    .906    1.382
        Constant           -2.044  .448    20.825    1   .000  .129

Many of the variables are not significant (Sig. > .05), so they must be removed to obtain an
accurate equation.

Perform Stepwise Logistic Regression

The method used was "Backward: Wald". A simplified sketch of the procedure follows.
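SPSS's Backward: Wald procedure removes, at each step, the least significant remaining
predictor. A hand-rolled sketch (it drops individual dummy columns rather than whole
categorical blocks, which is a simplification; the 0.10 removal threshold is SPSS's default
POUT; y and X are from the enter-method sketch above):

import statsmodels.api as sm

def backward_wald(y, X, p_remove=0.10):
    """Refit, dropping the predictor with the largest Wald p-value,
    until every remaining predictor satisfies p < p_remove."""
    cols = [c for c in X.columns if c != "const"]
    while True:
        fit = sm.Logit(y, sm.add_constant(X[cols]), missing="drop").fit(disp=0)
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] < p_remove:
            return fit
        cols.remove(worst)

stepwise_fit = backward_wald(y, X)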
OUTPUT INTERPRETATION

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      2919.502 (a)        .238                   .401
2      2919.563 (a)        .238                   .401
3      2920.011 (a)        .238                   .401
4      2933.696 (a)        .236                   .397
5      2934.071 (a)        .236                   .397
6      2934.802 (a)        .236                   .397

a. Estimation terminated at iteration number 6 because parameter estimates
   changed by less than .001.

The Nagelkerke R² of 0.397 still indicates a good model fit.

Hosmer and Lemeshow Test

Step   Chi-square   df   Sig.
1      224.859      8    .000
2      224.134      8    .000
3      223.111      8    .000
4      228.720      8    .000
5      225.398      8    .000
6      233.094      8    .000

The Hosmer and Lemeshow test fails at every step (significance should be above 0.05). As
before, we proceeded with the analysis and compared the results with the classification tree.

Classification Table (a)

                                   Predicted
                             Hi_priority       Percentage
        Observed             0        1        Correct
Step 1  Hi_priority    0     391      382      50.6
                       1     150     3727      96.1
        Overall Percentage                     88.6
Step 2  Hi_priority    0     390      383      50.5
                       1     150     3727      96.1
        Overall Percentage                     88.5
Step 3  Hi_priority    0     392      381      50.7
                       1     153     3724      96.1
        Overall Percentage                     88.5
Step 4  Hi_priority    0     388      385      50.2
                       1     158     3719      95.9
        Overall Percentage                     88.3
Step 5  Hi_priority    0     390      383      50.5
                       1     159     3718      95.9
        Overall Percentage                     88.3
Step 6  Hi_priority    0     393      380      50.8
                       1     159     3718      95.9
        Overall Percentage                     88.4

a. The cut value is .500

Six steps were performed; the sixth and final step is taken as the output model.


Accuracy improved from the 72% chance accuracy to 88.4%, but this still falls slightly short
of the 90.3% threshold required for the model to be considered valid.

Of all the variables, only three significant variables remain in the final equation: Item_MRP,
Outlet_Age and Outlet_Type.

However, some records were excluded from this run because of missing Outlet_Size and
Item_Weight values. Since those two variables turned out to be irrelevant, it is better to rerun
the model on all records: the larger sample size should improve the prediction.

The plot below shows that for most observed high-priority items the predicted probability of
membership in group 1 exceeds the 0.5 cut value, i.e. they are classified correctly.

Step number: 6

[Plot: Observed Groups and Predicted Probabilities. Frequency of cases plotted against the
predicted probability of membership in group 1; the cut value is .50 and each symbol
represents 50 cases. Group-1 cases cluster at predicted probabilities above .5, peaking near
1.0, while group-0 cases are spread on both sides of the cut value, with about half below it.]
Perform Stepwise Logistic Regression Using Only Significant Variables

Independent Variables: Outlet_Type, Outlet_Age and Item_MRP
Dependent Variable: Hi_priority

The full sample is used in this regression (see the sketch below), and no warnings were
raised, which suggests there is no collinearity.
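A sketch of this rerun (again with assumed column names); because Item_Weight and
Outlet_Size are no longer predictors, no rows are lost to missing values:

import pandas as pd
import statsmodels.api as sm

# Keep only the three retained predictors; dummy-code Outlet_Type
X = pd.get_dummies(df[["Item_MRP", "Outlet_Age", "Outlet_Type"]],
                   columns=["Outlet_Type"], drop_first=True).astype(float)
final_fit = sm.Logit(df["Hi_priority"], sm.add_constant(X)).fit()
print(final_fit.summary())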

Chance Accuracy
Classification Table (a,b)

                                        Predicted
                                   Hi_priority       Percentage
        Observed                   0         1       Correct
Step 0  Hi_priority          0     0      2131       .0
                             1     0      6392       100.0
        Overall Percentage                           75.0

a. Constant is included in the model.
b. The cut value is .500

The chance accuracy is (2131/8523)² + (6392/8523)² = 62.5%.

For the model to be valid, its overall classification accuracy should therefore exceed
1.25 × 62.5% = 78.1%.

Variables not in the Equation

                                          Score      df   Sig.
Step 0  Variables  Item_MRP               972.713    1    .000
                   Outlet_Age             123.673    1    .000
                   Outlet_Type_New        3259.201   3    .000
                   Outlet_Type_New(1)     3183.377   1    .000
                   Outlet_Type_New(2)     726.382    1    .000
                   Outlet_Type_New(3)     11.939     1    .001
        Overall Statistics                4223.772   5    .000

Omnibus Tests of Model Coefficients

                 Chi-square   df   Sig.
Step 1  Step     4909.133     5    .000
        Block    4909.133     5    .000
        Model    4909.133     5    .000

Model Summary

Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      4676.981 (a)        .438                   .648

a. Estimation terminated at iteration number 7 because parameter estimates
   changed by less than .001.

A Nagelkerke R² of 0.648 is a very good value for a logistic regression.

Hosmer and Lemeshow Test

Step   Chi-square   df   Sig.
1      216.668      8    .000

Once again the Hosmer and Lemeshow test fails (significance should be above 0.05). As
before, we proceeded with the analysis and compared the results with the classification tree.

Classification Table (a)

                                   Predicted
                             Hi_priority       Percentage
        Observed             0        1        Correct
Step 1  Hi_priority    0     1559     572      73.2
                       1     283     6109      95.6
        Overall Percentage                     90.0

a. The cut value is .500

Since the 90% classification accuracy exceeds the 78.1% threshold, our model can be used
for prediction.

Variables in the Equation

                                                                95% C.I. for Exp(B)
                           B       S.E.   Wald       df  Sig.  Exp(B)   Lower   Upper
Step 1  Item_MRP           .031    .001   1097.603   1   .000  1.032    1.030   1.033
        Outlet_Age         -.014   .006   5.088      1   .024  .986     .974    .998
        Outlet_Type_New                   1351.281   3   .000
        Outlet_Type_New(1) -8.777  .269   1066.242   1   .000  .000     .000    .000
        Outlet_Type_New(2) -2.054  .201   104.146    1   .000  .128     .086    .190
        Outlet_Type_New(3) -2.629  .253   108.117    1   .000  .072     .044    .118
        Constant           .496    .243   4.172      1   .041  1.643

Logistic Regression Equation

Since all the variables remaining in the equation are significant, we can now write the
logistic regression equation:

P(s) = 1 / (1 + e^(-y)), where y = a + b1*x1 + b2*x2 + ...    ... Equation (1)

y = 0.496 + 0.031 * Item_MRP - 0.014 * Outlet_Age - 8.777 * Outlet_Type_New(1)
    - 2.054 * Outlet_Type_New(2) - 2.629 * Outlet_Type_New(3)

Categorical Variables Codings

                                                 Parameter coding
                                    Frequency    (1)      (2)      (3)
Outlet_Type_New  Grocery Store         1083      1.000    .000     .000
                 Supermarket Type1     5577      .000     1.000    .000
                 Supermarket Type2      928      .000     .000     1.000
                 Supermarket Type3      935      .000     .000     .000

Now, substituting this value of y into Equation (1) gives the probability of success:

P(s) = 1 / (1 + e^-(0.496 + 0.031 * Item_MRP - 0.014 * Outlet_Age
        - 8.777 * Outlet_Type_New(1) - 2.054 * Outlet_Type_New(2)
        - 2.629 * Outlet_Type_New(3)))
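As a worked example, the fitted equation can be evaluated directly (coefficients copied from
the table above; the sample item below is hypothetical):

import math

def p_high_priority(item_mrp, outlet_age, outlet_type):
    """Probability of Hi_priority = 1; Supermarket Type3 is the reference."""
    d1, d2, d3 = {"Grocery Store":     (1, 0, 0),
                  "Supermarket Type1": (0, 1, 0),
                  "Supermarket Type2": (0, 0, 1),
                  "Supermarket Type3": (0, 0, 0)}[outlet_type]
    y = (0.496 + 0.031 * item_mrp - 0.014 * outlet_age
         - 8.777 * d1 - 2.054 * d2 - 2.629 * d3)
    return 1 / (1 + math.exp(-y))

# e.g. an item with MRP 150 in a 12-year-old Supermarket Type1 outlet
print(p_high_priority(150, 12, "Supermarket Type1"))   # ≈ 0.95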

Classification Tree (Exhaustive CHAID)

Model Summary

Specifications  Growing Method                 EXHAUSTIVE CHAID
                Dependent Variable             Hi_priority
                Independent Variables          Item_Weight, Item_Visibility, Item_MRP,
                                               Outlet_Age, Outlet_Type_New
                Validation                     Split Sample
                Maximum Tree Depth             3
                Minimum Cases in Parent Node   100
                Minimum Cases in Child Node    50
Results         Independent Variables Included Outlet_Type_New, Item_MRP
                Number of Nodes                18
                Number of Terminal Nodes       12
                Depth                          3
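Exhaustive CHAID is not available in scikit-learn; a CART-style tree with the same depth
and node-size limits is a rough stand-in for reproducing this run (column names assumed as
before):

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cols = ["Item_Weight", "Item_Visibility", "Item_MRP", "Outlet_Age", "Outlet_Type"]
data = df.dropna(subset=cols)            # this tree cannot handle missing values

# Split-sample validation, mirroring the SPSS specification
X_tr, X_te, y_tr, y_te = train_test_split(
    data[cols], data["Hi_priority"], test_size=0.5, random_state=0)

tree = DecisionTreeClassifier(max_depth=3,
                              min_samples_split=100,   # minimum cases in parent node
                              min_samples_leaf=50)     # minimum cases in child node
tree.fit(X_tr, y_tr)
print(f"test accuracy: {tree.score(X_te, y_te):.3f}")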

Risk

Sample     Estimate   Std. Error
Training   .101       .005
Test       .103       .005

Growing Method: EXHAUSTIVE CHAID
Dependent Variable: Hi_priority

Classification

                                    Predicted
Sample     Observed           0        1        Percent Correct
Training   0                  786      299      72.4%
           1                  134     3057      95.8%
           Overall Percentage 21.5%    78.5%    89.9%
Test       0                  752      294      71.9%
           1                  144     3057      95.5%
           Overall Percentage 21.1%    78.9%    89.7%

Growing Method: EXHAUSTIVE CHAID
Dependent Variable: Hi_priority

The 89.7% test-sample accuracy of the classification tree is very close to the 90%
classification accuracy achieved by the logistic regression. This cross-validation supports the
logistic regression model despite the failed Hosmer and Lemeshow test, so the logistic
regression model can be used for prediction.

