
Logistic Regression Digest
An eBook
Understand, build and use logistic regression models for common business problems with RapidMiner
Bala Deshpande, Ph.D., MBA

Decision Tree Digest How to build and use decision trees for business analytics an eBook by SimaFore

Table of Contents
Chapter 1: Basics of Logistic Regression models
Chapter 2: Seven steps to building a Logistic Regression Model
Chapter 3: Applying a logistic regression model built with RapidMiner

SimaFore LLC


Chapter 1: Basics of Logistic Regression models


It can be argued that the most important step in a business analytics process is establishing a
clear business objective. Once this is done, selecting the right technique becomes a matter of
simple logic. At a very high level there are fundamentally two main classes of techniques: those
that evolved purely from statistics (such as regression) and those that emerged from a blend of
stats, computer science and mathematics (such as classification trees).
This chapter is about logistic regression and how it compares to its twin - linear regression, and
when it makes sense to use it. In chapter 2, we discuss the mechanics of logistic regression and
its implementation using RapidMiner for a simple business analytics application. Finally in
chapter 3 we discuss how to apply the model to new data.
A simple explanation of Logistic Regression
Recall that linear regression is the process of finding a straight line that passes through a bunch
of points with the objective of being able to use the equation of the line as a model for
prediction. The key assumptions here are that both the predictor and target variables are
continuous as seen in this chart below. Intuitively, one can state that when X increases, Y
increases along the slope of the line.


What happens if the target variable is not continuous? When the target (Y) variable is discrete,
the straight line is no longer a fit, as seen in this chart. Intuitively we can still state that
when X (say, advertising spend) increases, Y (say, response or no response to a mailing
campaign) also increases. However, there is no gradual transition: the Y value abruptly jumps
from one binary outcome to the other. Thus the straight line is a poor fit for this data.

On the other hand, take a look at the S-shaped curve below. This is certainly a better fit for the
data shown. If we then know the equation to this "sigmoid" curve, we can use it as effectively
as we used the straight line in the case of linear regression.


Logistic regression is thus the process of obtaining an appropriate sigmoid curve to fit the data
when the target variable is discrete.
Key facts to keep in mind

Logistic Regression is the equivalent of linear regression to use when the target (or
dependent) variable is discrete i.e. not continuous
Logistic Regression is ideally suited for business analytics applications where the target
variable is a binary decision (fail-pass, response-no response, etc)
The predictors can be either continuous or categorical

In chapter 2, we discuss the mechanics of logistic regression and also the process of
implementing a simple analysis using RapidMiner.


Chapter 2: Seven steps to building a Logistic Regression Model


In chapter 1, we gave a brief introduction to logistic regression and indicated when it might be
appropriate to use it in business analytics settings. Probably the best definition of logistic
regression is this: "A mathematical modeling approach in which the best-fitting, yet
least-restrictive model is desired to describe the relationship between several independent
explanatory variables and a dependent dichotomous response variable".
In this chapter we get into the details of how the model equation is developed and then show
how to set up a simple analysis using RapidMiner.
How does logistic regression find the sigmoid curve?
A straight line can be depicted by only two parameters: the slope (m) and the intercept (c). The
way in which X's and Y's are related to each other can be simply specified by m and c. However
an S-shaped curve is a much more complex shape and representing it parametrically is not as
easy. So how does one find a mathematical means to relate the X's to the Y's?
It turns out that if we transform the Y's to the logarithm of the odds of Y, then the transformed
target variable is linearly related to the X's. In most cases where we need to use logistic
regression, the Y is usually a YES-NO type of response. This is usually interpreted as the
probability of an event happening (Y=1) or not happening (Y=0).

If Y is an event (response, pass, etc.),
and p is the probability of the event happening (Y=1),
then (1-p) is the probability of the event not happening (Y=0),
and p/(1-p) is the odds of the event happening.
It turns out that log[p/(1-p)] is linear in the predictors, X.

We can write the model as

log[p/(1-p)] = mX + c          (Eq. 1)

From the data given, we know the X and can estimate the p for each value of X. After this, the
problem is essentially similar to linear regression. (To see the sigmoid curve, the fitted line
needs to be transformed back from the log-odds scale to the probability scale.)
The logistic regression model from Eq. 1 ultimately delivers the probability of Y happening (i.e.
Y=1), given specific value(s) of X.
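
The idea behind Eq. 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slope and intercept values, the X values and the helper names (logit, sigmoid, fit_line) are hypothetical and do not come from the eBook's data. The sketch generates probabilities that follow an S-shaped curve, converts them to log-odds, and shows that an ordinary straight-line fit on the log-odds recovers m and c. Practical implementations typically estimate m and c by maximum likelihood from the individual 0/1 outcomes rather than by this least-squares shortcut.

```python
import math


def logit(p):
    """Log of the odds p/(1-p); defined only for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def sigmoid(z):
    """Inverse of the logit: maps a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


# Hypothetical illustration values, not taken from the eBook's data.
TRUE_M, TRUE_C = 0.8, -2.0
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]

# Probabilities that follow the sigmoid curve (S-shaped in x).
ps = [sigmoid(TRUE_M * x + TRUE_C) for x in xs]

# The same points on the log-odds scale fall on a straight line in x.
log_odds = [logit(p) for p in ps]
m, c = fit_line(xs, log_odds)
print("slope:", round(m, 6), "intercept:", round(c, 6))
```

Because the probabilities here were generated without noise, the fitted slope and intercept match the assumed values up to floating-point error. With real data the probabilities are not observed directly, which is why the fitting is left to the software.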


7 steps to a simple logistic regression model in RapidMiner


The data we used comes from an example here for a credit scoring exercise. The objective is to
predict DEFAULT (Y or N) based on two predictors: Loan age (business usage) and number of
days of delinquency. There are 100 samples.
Step 1: Load spreadsheet into RapidMiner. Use the process described here. Remember to set
the DEFAULT column as "Label"
Step 2: Split data into train and test samples using the Split Validation operator as shown here
Step 3: Add the Logistic Regression operator in the "training" window of the split validation
operator
Step 4: Add Apply Model operator in the "testing" window of split validation operator in
a similar manner as discussed here. Just use default parameter values.
Step 5: Add Performance evaluation operator in the "testing" window of split validation
operator as discussed here.
Step 6: Connect all ports as shown below

Step 7: Run the model and view results. In particular check for the Kernel Model which shows
the coefficients for the two predictors and the intercept. Also check the confusion matrix
for Accuracy, Sensitivity, and Specificity and finally view the ROC curves and check AUC.
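
For readers who want to see what the measures in Step 7 mean numerically, the following Python sketch computes accuracy, sensitivity, specificity and AUC from a small set of labels and confidence scores. It is not RapidMiner's implementation; the labels, scores, the 0.5 cut-off and all function names are hypothetical values chosen for illustration, and the AUC is computed by the simple pairwise-ranking definition.

```python
def confusion_counts(actual, predicted, positive="Y"):
    """Count true/false positives and negatives for a binary label."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def summarize(tp, fp, tn, fn):
    """Accuracy, sensitivity (true positive rate), specificity (true negative rate)."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def auc(actual, scores, positive="Y"):
    """Share of (positive, negative) pairs where the positive case scores higher.

    Ties count as half a win. This equals the area under the ROC curve.
    """
    pos = [s for a, s in zip(actual, scores) if a == positive]
    neg = [s for a, s in zip(actual, scores) if a != positive]
    wins = 0.0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical test-sample results: true DEFAULT label and model confidence for "Y".
actual = ["Y", "Y", "Y", "N", "N", "N", "N", "Y"]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]

# Assumed cut-off of 0.5 to turn confidences into predicted labels.
predicted = ["Y" if s >= 0.5 else "N" for s in scores]

metrics = summarize(*confusion_counts(actual, predicted))
area = auc(actual, scores)
print(metrics, area)
```

Accuracy, sensitivity and specificity depend on the chosen cut-off, while AUC summarizes how well the scores rank positives above negatives across all cut-offs.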


The accuracy of the model based on the 30% testing sample is 83%. The ROC curve has an AUC
of 0.863, which is quite acceptable. The next step would be to review the kernel model and
prepare for deploying this model.


Chapter 3: Applying a logistic regression model built with RapidMiner


In this chapter we will briefly describe how to apply the results from a logistic regression
analysis with RapidMiner. Let us start by recapping the basic elements of logistic regression.
1. Logistic regression is the equivalent of linear regression that is used when the response
variable or label is binomial. A binomial response variable has two categories: Yes/No,
Accept/Not Accept, Default/Not Default and so on.
2. The logarithm of the odds of the response, Y, being a "Yes" is expressed as a function of
independent or predictor variables, X, and a constant term. That is, for example,
log (odds of Y = "Yes") = mX + c. This is also called the logit.
3. The logit gives the log of the odds of the "Yes" event; if we want the probability, we need to use
the transformed equation below:
p (of Y = "Yes") = 1 / [1 + exp(-mX - c)]
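
The step from the logit in item 2 to the probability in item 3 can be written out as a short derivation. The symbols p, m, X and c are the ones used in the text; nothing else is assumed beyond taking the natural logarithm.

```latex
\begin{align*}
\log\frac{p}{1-p} &= mX + c \\
\frac{p}{1-p} &= e^{mX + c} \\
p &= \frac{e^{mX + c}}{1 + e^{mX + c}} = \frac{1}{1 + e^{-(mX + c)}}
\end{align*}
```

The second line exponentiates both sides to recover the odds, and the third solves for p, giving the reciprocal form quoted in item 3.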
A simple example
Let us use a simple example of predicting if a customer will accept a bank's personal loan offer
as a function of their income.


When we run this simple dataset and build a logistic regression model, we see the following
results

RapidMiner's implementation of logistic regression differs from many other (more
conventional) approaches. The table on the left which shows the kernel model should not be
confused with the logit model described above. In other words, w[Income] does not directly
correspond to the slope "m" and Bias (offset) does not directly correspond to "c".
The easiest way to implement the results of the analysis is to use the process below which
applies the results of the logistic regression learner on the example data set.

When the analysis runs, simply click on the "Example Set" tab and the "Data View" radio
button. You will see that for each of the cases, there is a predicted result - Prediction (Personal
loan) and the confidence or probability that the loan acceptance is "No" and the corresponding
complementary probability of "Yes".


The main takeaway from this chapter is that, using RapidMiner it is easier to apply the
developed model to new data to obtain probability of response variable being in one of the two
categories, rather than trying to interpret the model parameters in the light of traditional
formulas, such as the logit.
If you liked this e-book tutorial on analytics, sign up for visTASC, "a visual thesaurus of analytics,
statistics and complex systems", for more like these. Sign up is FREE and allows you to search for
techniques for other common business problems.
