
Industrial Statistics

MS3001 Advanced Marketing Research


Faculty of Science, University of Colombo

Application of Multivariate Statistical Methods in Marketing Research


Session 3: Logistic Regression

December 25, 2013

Illustration
Who will win the next elections, the NDA or the UPA? Are factors such as location, religion, caste, education and past voting behavior decisive?

I am concerned about attrition among my customers. Can you help me predict who is likely to drop off in the next six months? What makes them attrite? What makes my loyal customers stay with me?

What determines contraceptive practice amongst ever-married women in rural areas? Their religion? Caste? Education? Awareness of various methods?

Logistic Regression


Logistic Regression
Useful in predicting an event:
- Victory or loss
- Will vote or will not vote
- Adoption or rejection

Or one of multiple events or categories:
- Accept, Reject, Defer
- Hindu, Muslim, Christian
- Doctorate, Postgraduate, Undergraduate

The interest is in knowing the probability of occurrence of an event or category.
- A single non-metric dependent variable (binary or multichotomous)
- Several (more than two) metric (interval or ratio) or non-metric (nominal) independent variables (predictors)
- Does not assume (or require) normally distributed data
- Often requires very large samples
- Results in an equation (or set of equations) from which probabilities can be computed and classifications made


Background to the course: What is Logistic Regression?


Logistic regression in a nutshell:
It is a multiple regression with an outcome (or dependent) variable that is categorical and dichotomous, and explanatory variables that can be either continuous or categorical.
In other words, the interest is in predicting which of two possible events is going to happen, given certain other information.
For example, in Political Science, logistic regression could be used to analyse the factors that determine whether or not an individual participates in a general election.


Why can't we use Simple Linear Regression?


Let us remember what we have learnt about Simple Linear Regression:
We used it when we had reasons (a theory) to assume causality between two variables: X → Y. Example:
X = Investment in R&D; Y = New products introduced


Simple Linear Regression


This sort of regression analysis provides us with useful information:
For example, for a certain confidence level (say 95%):
- How much the explained variable (Y) changes as a result of a change in the explanatory variable (X)
- With a regression we can predict the value of Y given the value of X


Simple Linear Regression: How is the impact of X on Y estimated?


- We assumed a linear relation between the two variables
- We introduced u, the unobserved factors affecting Y, which we are not going to account for in our model
- Then we postulated the following relation:

Yi = α + β·Xi + ui


Simple Linear Regression: How is the impact of X on Y estimated?


We made some assumptions about u (basically, that the ui are independently and identically distributed with zero mean and constant variance).

Then we estimated the parameters of the model (generally using Ordinary Least Squares).

Simple Linear Regression provides the best-fit line, i.e. the straight line that best describes the relationship between the two variables.
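To make the estimation step concrete, below is a minimal sketch of an OLS fit in Python. The data points are hypothetical and purely illustrative; the point is only that the fit returns estimates of the intercept α and slope β.

    # A minimal OLS sketch in Python. The data are hypothetical, standing in
    # for X = investment in R&D and Y = new products introduced.
    import numpy as np

    x = np.array([100, 250, 400, 550, 700])   # hypothetical R&D investment
    y = np.array([4, 12, 21, 27, 35])         # hypothetical new products

    # np.polyfit with degree 1 performs an ordinary least squares fit
    # and returns the slope and intercept of the best-fit line.
    beta, alpha = np.polyfit(x, y, 1)
    print(f"Best-fit line: Y = {alpha:.3f} + {beta:.3f} * X")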


Our example: R&D and New Products


How does investment in R&D affect the number of new products developed? We can postulate the following relation:

# of new products = α + β · Investment in R&D + u


Let us look at the scatter plot:

[Scatter plot: NEWPROD (number of new products, 0 to 50) against RD (investment in R&D, 0 to 800)]


Our example: Investment in R&D and the introduction of new products


It makes sense to assume a linear relation between X and Y in this case. The estimate for β is 0.049. This tells us that in order to increase the number of new products by one unit, we need to invest a little more than 20 monetary units in R&D. If a company invests 1000 in R&D, we would predict it to develop around 49 new products (a quick arithmetic check follows the plot below).
[Scatter plot with fitted regression line: NEWPROD against RD]
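The arithmetic behind those two statements, using only the estimated slope of 0.049 (the intercept is treated as negligible, which is what the prediction of around 49 products implies):

    # Quick check of the slide's claims, using the estimated slope of 0.049.
    # The intercept is assumed to be roughly zero, as the prediction of ~49 implies.
    beta = 0.049

    print(1 / beta)      # ~20.4: R&D needed for one additional new product
    print(beta * 1000)   # 49.0: predicted new products for an investment of 1000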


Another example: Failing or Passing an exam


Let us define a variable Outcome

We can reasonably assume that failing or passing an exam depends on the number of hours we spend studying. Note that in this case the dependent variable takes only two possible values. We will call it a dichotomous variable.

Outcome = 0 if the individual fails the exam
Outcome = 1 if the individual passes the exam


Regression analysis with dichotomous dependent variables


We will then be interested in inference about the probability of passing the exam.
Were we to use linear regression, we would postulate:

Prob(Outcome = 1) = α + β · Quantity of hours of study + u

As we are concerned with modelling the probability of the event occurring, this is a probability model. As we model the relation between the quantity of hours of study and the probability of passing the exam as linear, this is a linear model. We will call this model a Linear Probability Model (LPM).


Linear Probability Models (LPM)


Our dataset contains information about 14 students. Our statistical software (SPSS) will happily perform a linear regression of Outcome on the quantity of study hours (a Python sketch of the same fit follows the table below).

Student id   Outcome   Quantity of Study Hours
1            0         3
2            1         34
3            0         17
4            0         6
5            0         12
6            1         15
7            1         26
8            1         29
9            0         14
10           1         58
11           0         2
12           1         31
13           1         26
14           0         11
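For readers working in Python rather than SPSS, a sketch of the same fit (using the statsmodels package, assuming it is installed) is shown below; its intercept and slope should match the SPSS coefficients reported on the next slide.

    # A sketch of the linear probability model fitted in Python rather than SPSS,
    # using the 14 observations from the table above.
    import numpy as np
    import statsmodels.api as sm

    hours = np.array([3, 34, 17, 6, 12, 15, 26, 29, 14, 58, 2, 31, 26, 11])
    outcome = np.array([0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0])

    X = sm.add_constant(hours)     # adds the intercept term
    lpm = sm.OLS(outcome, X).fit()
    print(lpm.params)              # intercept and slope for hours of study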


Linear Probability Models (LPM): What is wrong with them?


Let us do a scatter plot and insert the regression line:
- The probability of Outcome = 1 can take values between 0 and 1
- But we do not observe probabilities, only the actual event happening
- A straight line will predict values between negative and positive infinity, outside the [0,1] interval!

[Scatter plot with fitted regression line: OUTCOME (-0.2 to 1.2) against HSTUDY (hours of study, 0 to 60)]


What is wrong with LPM?


Coefficients (Dependent Variable: OUTCOME)

Model            Unstandardized Coefficients        Sig.
                 B            Std. Error
1  (Constant)    -0.031861    0.161591              0.846994
   HSTUDY         0.026219    0.006483              0.001627

Above is the SPSS output for the linear regression of Outcome on hours of study. The results suggest that an increase of 1 hour of studying increases the probability of passing the exam, on average, by approx. 0.026, or 2.6%. So what would the model predict if we studied 100 hours for the exam?
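Plugging 100 hours into the fitted equation makes the problem explicit (a one-line computation using the coefficients above):

    # Predicted "probability" of passing after 100 hours of study,
    # using the LPM coefficients from the SPSS output above.
    intercept, slope = -0.031861, 0.026219
    print(intercept + slope * 100)   # ~2.59 -- far above 1, not a valid probability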


Linear Probability Models (LPM): What is wrong with them?


Basically, the linear relation we had postulated before between X and Y is not appropriate when our dependent variable is dichotomous. Predictions for the probability of the event occurring would lie outside the [0,1] interval, which is unacceptable.


Non Linear Probability Models


We want to be able to model the probability of the event occurring with an explanatory variable X, but we want the predicted probability to remain within the [0,1] bounds.
There should also be a threshold above which the probability hardly increases in reaction to changes in the explanatory variable.

Many functions meet these requirements (non-linearity and being bounded within the [0,1] interval). We will focus on the logistic function. The logistic curve relates the explanatory variable X to the probability of the event occurring. In our example, it will relate the number of study hours to the probability of passing the exam.
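A minimal sketch of the logistic function itself, showing that its output can never leave the (0, 1) interval, however extreme the input:

    # The logistic (sigmoid) function: F(z) = 1 / (1 + exp(-z)).
    # Its output always lies strictly between 0 and 1.
    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    print(logistic(-10), logistic(0), logistic(10))   # ~0.000045, 0.5, ~0.999955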


The Logit Model


A Logit Model states that:
Prob(Yi = 1) = F(α + β·Xi)
Prob(Yi = 0) = 1 - F(α + β·Xi)

where F(·) is the Logistic Function, F(z) = 1 / (1 + e^(-z)). So the probability of the event occurring is a logistic function of the independent variables.
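As an illustration, here is a sketch of the logit model fitted (with statsmodels, assuming it is available) to the same 14 students used for the LPM. With only 14 observations the estimates are illustrative at best, but every predicted probability now stays inside [0, 1], even for 100 hours of study.

    # A sketch of the logit model, Prob(pass) = F(alpha + beta * hours),
    # fitted to the same 14 students used for the linear probability model.
    import numpy as np
    import statsmodels.api as sm

    hours = np.array([3, 34, 17, 6, 12, 15, 26, 29, 14, 58, 2, 31, 26, 11])
    outcome = np.array([0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0])

    logit = sm.Logit(outcome, sm.add_constant(hours)).fit()

    # Predicted probabilities of passing after 10, 25 and 100 hours of study;
    # unlike the LPM, all predictions lie inside the [0, 1] interval.
    new_hours = sm.add_constant(np.array([10.0, 25.0, 100.0]))
    print(logit.predict(new_hours))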

