eBook
Understand, build and use logistic regression
models for common business problems with
RapidMiner
Bala Deshpande, Ph.D., MBA
Decision Tree Digest How to build and use decision trees for business analytics an eBook by SimaFore
Table of Contents
Chapter 1: Basics of Logistic Regression models
Chapter 2: Seven steps to building a Logistic Regression Model
Chapter 3: Applying a logistic regression model built with RapidMiner
SimaFore LLC
What happens if the target variable is not continuous? When the target (Y) variable is discrete,
a straight line is no longer a good fit, as seen in this chart. Intuitively we can still say that
when X (say, advertising spend) increases, Y (say, response or no response to a mailing
campaign) also increases. However, there is no gradual transition: the Y value jumps abruptly
from one binary outcome to the other. Thus the straight line is a poor fit for this data.
On the other hand, take a look at the S-shaped curve below. This is certainly a better fit for the
data shown. If we know the equation of this "sigmoid" curve, we can use it as effectively
as we used the straight line in the case of linear regression.
Logistic regression is thus the process of obtaining an appropriate sigmoid curve to fit the data
when the target variable is discrete.
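The contrast between the straight line and the sigmoid curve can be illustrated with a minimal sketch. The eight data points and the sigmoid parameters m and c below are hypothetical, chosen only for illustration; they do not come from the eBook's dataset. The sketch fits a least-squares line to binary data and shows that the line produces values below 0 and above 1, while the sigmoid always stays between 0 and 1.

```python
import math

# Hypothetical toy data: X could be advertising spend,
# Y is response (1) or no response (0).
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def sigmoid(z):
    """S-shaped curve that maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


slope, intercept = fit_line(xs, ys)

# Assumed (not fitted) sigmoid parameters, chosen so the curve crosses 0.5 at x = 4.5.
m, c = 2.0, -9.0

for x in [0, 4.5, 10]:
    line_value = slope * x + intercept
    curve_value = sigmoid(m * x + c)
    print(f"x={x}: line={line_value:.3f}, sigmoid={curve_value:.3f}")
```

At the extremes the line gives values that cannot be read as a binary outcome or a probability, which is the practical reason for preferring the S-shaped curve.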
Key facts to keep in mind
Logistic Regression is the equivalent of linear regression to use when the target (or
dependent) variable is discrete, i.e. not continuous.
Logistic Regression is ideally suited for business analytics applications where the target
variable is a binary decision (fail-pass, response-no response, etc.).
The predictors can be either continuous or categorical.
In Chapter 2, we discuss the mechanics of logistic regression and also the process of
implementing a simple analysis using RapidMiner.
log[p/(1-p)] = mX + c    (Eq. 1)
From the data given, we know X and can compute p for each value of X. After that, the
problem is essentially similar to linear regression. (To see the sigmoid curve, the
variables need to be transformed from the p-space to the Y-space.)
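The idea of computing p for each X and then treating the problem like linear regression can be sketched as follows. This is a minimal illustration under stated assumptions: the grouped counts are hypothetical, "log" is taken to be the natural logarithm, and the straight-line fit on the log-odds is a simplification. Statistical packages usually estimate the coefficients iteratively (for example by maximum likelihood) rather than this way.

```python
import math

# Hypothetical grouped data: for each X value, how many cases were observed
# and how many of them had Y = 1.
groups = [
    # (x, cases, responders)
    (1, 20, 2),
    (2, 20, 5),
    (3, 20, 10),
    (4, 20, 15),
    (5, 20, 18),
]


def log_odds(p):
    """Left-hand side of Eq. 1, assuming 'log' means the natural logarithm."""
    return math.log(p / (1.0 - p))


xs = [x for x, _, _ in groups]
# Observed proportion p for each X, then its log-odds.
# This breaks down if a group has p = 0 or p = 1, where the log-odds is undefined.
zs = [log_odds(responders / cases) for _, cases, responders in groups]

# Ordinary least squares for z = m * x + c.
n = len(xs)
x_mean = sum(xs) / n
z_mean = sum(zs) / n
sxz = sum((x - x_mean) * (z - z_mean) for x, z in zip(xs, zs))
sxx = sum((x - x_mean) ** 2 for x in xs)
m = sxz / sxx
c = z_mean - m * x_mean

print(f"m = {m:.3f}, c = {c:.3f}")
```

For these made-up counts the log-odds fall exactly on a straight line, so the fit is perfect; real data would scatter around the line.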
The logistic regression model from Eq. 1 ultimately delivers the probability of Y happening (i.e.
Y=1), given specific value(s) of X.
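To make the last point concrete, Eq. 1 can be solved for p, which gives p = 1 / (1 + exp(-(mX + c))). The sketch below applies that formula. The coefficients m and c are hypothetical values chosen for illustration, and "log" in Eq. 1 is assumed to be the natural logarithm.

```python
import math


def probability_from_eq1(x, m, c):
    """Solve Eq. 1 for p: log[p/(1-p)] = m*x + c  =>  p = 1 / (1 + exp(-(m*x + c)))."""
    return 1.0 / (1.0 + math.exp(-(m * x + c)))


# Hypothetical coefficients, for illustration only.
m, c = 0.8, -4.0

for x in [2, 5, 8]:
    p = probability_from_eq1(x, m, c)
    print(f"X = {x}: P(Y=1) = {p:.3f}")
```

With these assumed coefficients, mX + c is zero at X = 5, so the probability there is one half; smaller X gives a lower probability and larger X a higher one.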
Step 7: Run the model and view results. In particular, check the Kernel Model, which shows
the coefficients for the two predictors and the intercept. Also check the confusion matrix
for Accuracy, Sensitivity, and Specificity, and finally view the ROC curves and check the AUC.
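The three confusion-matrix measures named in Step 7 follow directly from the four cell counts. The sketch below shows the standard definitions; the counts are hypothetical and are not the results of the eBook's model.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity and specificity from confusion-matrix counts.

    tp: true positives, fp: false positives,
    tn: true negatives, fn: false negatives.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,      # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),      # share of actual positives that were caught
        "specificity": tn / (tn + fp),      # share of actual negatives that were caught
    }


# Hypothetical counts, for illustration only.
metrics = confusion_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```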
The accuracy of the model based on the 30% testing sample is 83%. The ROC curve has an AUC
of 0.863, which is quite acceptable. The next step would be to review the kernel model and
prepare for deploying this model.
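One way to read an AUC value is as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below computes AUC from that pairwise definition. The scores are hypothetical, and this is not how RapidMiner reports the figure internally; it is only meant to show what the number measures.

```python
def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs in which the positive case
    has the higher score; tied pairs count as half."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical model scores, for illustration only.
pos = [0.9, 0.8, 0.6, 0.4]   # cases whose true label is positive
neg = [0.7, 0.5, 0.3, 0.2]   # cases whose true label is negative

print(f"AUC = {auc(pos, neg):.4f}")
```

An AUC of 0.5 corresponds to scores that rank the two classes no better than chance, and 1.0 to scores that separate them perfectly.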
When we run this simple dataset and build a logistic regression model, we see the following
results.
When the analysis runs, simply click on the "Example Set" tab and the "Data View" radio
button. You will see that for each of the cases, there is a predicted result - Prediction (Personal
loan) - and the confidence (probability) that the loan acceptance is "No", along with the
complementary probability of "Yes".
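The relationship between the two confidence columns and the predicted label can be sketched as below. The function name, the dictionary keys and the confidence values are hypothetical, and the rule that the prediction is whichever class has the higher confidence is an assumption made for this illustration.

```python
def score_case(confidence_yes):
    """Return both class confidences and the predicted label for one case.

    Assumes a binary target, so the two confidences sum to 1, and assumes
    the predicted label is the class with the higher confidence.
    """
    confidence_no = 1.0 - confidence_yes
    prediction = "Yes" if confidence_yes > confidence_no else "No"
    return {
        "prediction": prediction,
        "confidence_yes": confidence_yes,
        "confidence_no": confidence_no,
    }


# Hypothetical confidences for three new cases.
for conf in [0.12, 0.47, 0.91]:
    row = score_case(conf)
    print(f"{row['prediction']:>3}  yes={row['confidence_yes']:.2f}  no={row['confidence_no']:.2f}")
```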
The main takeaway from this chapter is that, with RapidMiner, it is easier to apply the
developed model to new data and obtain the probability of the response variable being in one
of the two categories than to interpret the model parameters in light of traditional
formulas such as the logit.
If you liked this e-book tutorial on analytics, sign up for visTASC, "a visual thesaurus of analytics,
statistics and complex systems", for more like these. Sign-up is FREE and allows you to search for
techniques for other common business problems.