
Business Objective:

We want to build a model that will help Thera Bank, which has more liability customers
than asset customers, identify the customers with a higher probability of purchasing a
personal loan. The model is built on last year's campaign data, which covers 5,000
customers and has a 9.6% success rate.

Understanding the attributes:


Data Description:

ID                  Customer ID
Age                 Customer's age in years
Experience          Years of professional experience
Income              Annual income of the customer ($000)
ZIP Code            Home address ZIP code
Family              Family size of the customer
CCAvg               Avg. spending on credit cards per month ($000)
Education           Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage            Value of house mortgage, if any ($000)
Personal Loan       Did this customer accept the personal loan offered in the last campaign?
Securities Account  Does the customer have a securities account with the bank?
CD Account          Does the customer have a certificate of deposit (CD) account with the bank?
Online              Does the customer use internet banking facilities?
CreditCard          Does the customer use a credit card issued by the bank?
Structure of data (str(data)):

'data.frame': 5000 obs. of 14 variables:

ID                     int  1 2 3 4 5 6 7 8 9 10 ...
Age..in.years.         int  25 45 39 35 35 37 53 50 35 34 ...
Experience..in.years.  int  1 19 15 9 8 13 27 24 10 9 ...
Income..in.K.month.    int  49 34 11 100 45 29 72 22 81 180 ...
ZIP.Code               int  91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
Family.members         int  4 3 1 1 4 4 2 1 3 1 ...
CCAvg                  num  1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
Education              int  1 1 1 2 2 2 2 3 2 3 ...
Mortgage               int  0 0 0 0 0 155 0 0 104 0 ...
Personal.Loan          int  0 0 0 0 0 0 0 0 0 1 ...
Securities.Account     int  1 1 0 0 0 0 0 0 0 0 ...
CD.Account             int  0 0 0 0 0 0 0 0 0 0 ...
Online                 int  0 0 0 0 0 1 1 0 1 0 ...
CreditCard             int  0 0 0 0 1 0 0 1 0 0 ...

Here Personal.Loan is considered the dependent variable and all other attributes are
independent variables.

The data includes demographic information about the customer (Age, Income, Experience,
ZIP code, family members, Education), which captures customer behaviour, so we take
these columns into consideration.

Columns like Mortgage, Securities.Account, CD.Account, Online, and CreditCard help us
understand which facilities the customer avails with the bank; they reflect the customer's
engagement and satisfaction with the bank, which may encourage them to go for a personal
loan, so we consider these too.

We should not consider ID, as it is unique for each customer and does not help in model
building.

Exploratory Data Analysis:
summary(data):

                       Min.  1st Qu.  Median    Mean  3rd Qu.   Max.  NA's
ID                        1     1251    2500    2500     3750   5000
Age..in.years.           23       35      45   45.34       55     67
Experience..in.years.    -3       10      20    20.1       30     43
Income..in.K.month.       8       39      64   73.77       98    224
ZIP.Code               9307    91911   93437   93152    94608  96651
Family.members            1        1       2   2.397        3      4    18
CCAvg                     0      0.7     1.5   1.938      2.5     10
Education                 1        1       2   1.881        3      3
Mortgage                  0        0       0    56.5      101    635
Personal.Loan             0        0       0   0.096        0      1
Securities.Account        0        0       0  0.1044        0      1
CD.Account                0        0       0  0.0604        0      1
Online                    0        0       0  0.5968        1      1
CreditCard                0        0       0   0.294        1      1

We can conclude the following points from the above summary:

 Personal.Loan, Securities.Account, CD.Account, Online, and CreditCard all take only
0 or 1 as values; Mortgage, by contrast, ranges from 0 to 635.
 The Family.members column has 18 null values, for which we have to do a null-value
analysis (a quick check follows this list).
 Personal.Loan has a mean of 0.096, which reflects the 9.6% success rate of last
year's campaign.
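
Where the NAs sit can be confirmed with a quick base-R check (the full R code at the end
only runs any(is.na(data)); colSums is a standard alternative shown here for illustration):

colSums(is.na(data))   # all columns show 0 except Family.members, which shows 18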

Null Removal approximation:

From the data we can see that the customers with missing Family.members values are
mostly aged persons over 40. At this age a person is generally married and has children
too, so we assume a family size of 3 as an approximation; a minimal imputation sketch
follows.
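
Since Family.members is the only column with NAs, this targeted replacement is equivalent
to the blanket data[is.na(data)] <- 3 used in the full R code at the end:

data$Family.members[is.na(data$Family.members)] <- 3   # assumed family size of 3
any(is.na(data))   # FALSE once imputed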
Near Zero Variance:

                       ColNo   freqRatio  percentUnique  zeroVar    nzv
ID                         1    1.000000         100.00    FALSE  FALSE
Age..in.years.             2    1.013423           0.90    FALSE  FALSE
Experience..in.years.      3    1.040541           0.94    FALSE  FALSE
Income..in.K.month.        4    1.011905           3.24    FALSE  FALSE
ZIP.Code                   5    1.330709           9.34    FALSE  FALSE
Family.members             6    1.133127           0.08    FALSE  FALSE
CCAvg                      7    1.043290           2.16    FALSE  FALSE
Education                  8    1.396402           0.06    FALSE  FALSE
Mortgage                   9  203.647059           6.94    FALSE   TRUE
Personal.Loan             10    9.416667           0.04    FALSE  FALSE
Securities.Account        11    8.578544           0.04    FALSE  FALSE
CD.Account                12   15.556291           0.04    FALSE  FALSE
Online                    13    1.480159           0.04    FALSE  FALSE
CreditCard                14    2.401361           0.04    FALSE  FALSE

From the above output we infer that ID has all unique values, and although Mortgage is
flagged as near-zero variance, we do not eliminate any column on the basis of this check.
The table is produced with caret's nearZeroVar, as shown below.
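
This reproduces the diagnostics from the full R code at the end:

nsv <- nearZeroVar(data, saveMetrics = TRUE)   # caret's near-zero-variance metrics
nsv <- cbind("ColNo" = 1:ncol(data), nsv)
nsv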
Analysis of relationship of dependent and independent variables:

 Age vs Personal Loan

We can see that more people aged 30-40 have taken the loan, which is quite
self-explanatory: people mostly take loans at the young, career-settling stage that begins
around age 30.
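
These age/experience/income comparisons can be visualised as density plots faceted by
loan status, following the CCAvg plot in the full R code; the Age..in.years. column name
is taken from str(data):

library(ggplot2)
ggplot(data, aes(Age..in.years., fill = factor(Personal.Loan))) +
  geom_density() + facet_grid(~Personal.Loan)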

 Experience..in.years. vs Personal Loan

Here also we can infer that most people who took the loan are at the start of their career,
with 0-10 years of experience.

 Income..in.K.month. vs Personal Loan:

Here we can infer that customers with a monthly income below 100K are most unlikely to
take a personal loan under the current campaign.

 CCAvg vs Personal Loan

We can infer that customers with average monthly credit card spending above 2.5K were
persuaded by the campaign and are more likely to take a personal loan.

 Credit card vs Personal Loan:

            Personal.Loan
CreditCard          0         1
         0  90.453258  9.546742
         1  90.272109  9.727891

We cannot infer much from credit cards: customers with and without a bank credit card
show the same ~90% chance of not taking the loan.

 Certificate of deposit vs Personal Loan:

            Personal.Loan
CD.Account          0          1
         0  92.762878   7.237122
         1  53.642384  46.357616

We can infer that customers holding a certificate of deposit with the bank have a much
higher probability of taking the loan (46.4% vs 7.2%).

 Securities Account vs Personal Loan:

                    Personal.Loan
Securities.Account          0          1
                 0  90.620813   9.379187
                 1  88.505747  11.494253

Not much we can infer from this.
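
The row-percentage tables in this section come from prop.table on a two-way table, as in
the full R code at the end (shown here for Securities.Account; the other tables swap in
CreditCard or CD.Account):

prop.table(table(Securities.Account, Personal.Loan), 1) * 100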

Splitting the Train and Test data:

We have split the data using the commands below:

## Create Development and Validation Sample

set.seed(1234)
data$random <- runif(nrow(data), 0, 1);
cart <- data[order(data$random),]

#SEPARATE DATA BASED ON VALUE OF RANDOM COLUMN

cart.dev <- cart[which(cart$random <= 0.7),]


cart.val <- cart[which(cart$random > 0.7),]

We have divided the data randomly in a 70:30 ratio, giving 3516 training records and 1484
validation records respectively.

The success rate is 0.09499431 in cart.dev and 0.09838275 in cart.val.

The response distribution is therefore consistent across the train and validation samples;
the check is shown below.
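
The success rates are computed as in the full R code:

length(which(cart.dev$Personal.Loan == "1")) / nrow(cart.dev)   # 0.09499431
length(which(cart.val$Personal.Loan == "1")) / nrow(cart.val)   # 0.09838275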


CART MODEL BUILDING
Setting the control parameters as below:

cartParameters = rpart.control(minsplit=20, minbucket = 7, cp = 0, xval = 10)

Here we select minsplit = 20 (the minimum number of observations that must exist in a node
for a split to be attempted) and minbucket = 7 (about minsplit/3, the minimum number of
records allowed in a terminal node).

We start with cp = 0 and select the best cp later during pruning; xval = 10 requests
10-fold cross-validation, which produces the xerror column used for that selection.
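
The tree itself is grown on the development sample with rpart, as in the full R code at
the end:

cartModel <- rpart(formula = Personal.Loan ~ ., data = cart.dev,
                   method = "class", control = cartParameters)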


Pruning:

The best cp (the cp of the subtree with the minimum cross-validated error) is calculated
with the command below:

bestcp <- cartModel$cptable[which.min(cartModel$cptable[,"xerror"]), "CP"]

value of bestcp = 0.002994012

Pruning the tree with the command below:

ptree <- prune(cartModel, cp = bestcp)


Model Performance Measure:
We have predicted the class and score for the training data using the commands below:

cart.dev$predict.class <- predict(ptree, cart.dev, type="class")


cart.dev$predict.score <- predict(ptree, cart.dev)

Using the deciling and ranking method we get the performance below:

deciles   cnt  cnt_resp  cnt_non_resp  rrate  cum_resp  cum_non_resp  cum_perct_resp  cum_perct_non_resp     ks
     10   391       310            81  79.28       310            81           92.81                2.55  90.26
      9   611        24           587   3.93       334           668          100.00               20.99  79.01
      8  2514         0          2514   0.00       334          3182          100.00              100.00   0.00

      AUC        KS       GINI
0.9963015  0.926009  0.8878087
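
These metrics are computed from the predicted scores with the ROCR and ineq packages, as
in the full R code at the end:

pred <- prediction(cart.dev$predict.score[,2], cart.dev$Personal.Loan)
perf <- performance(pred, "tpr", "fpr")
KS   <- max(attr(perf, 'y.values')[[1]] - attr(perf, 'x.values')[[1]])
auc  <- as.numeric(performance(pred, "auc")@y.values)
gini <- ineq(cart.dev$predict.score[,2], type = "Gini")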

We get very high AUC, KS, and Gini values, which shows the model fits the training data well.
We can also validate this by confusion matrix:

             predict.class
Personal.Loan    0    1
            0 3170   12
            1   38  296

The model performed quite well, with a training accuracy of (3170 + 296) / 3516 ≈ 98.6%.

Model validation and performance measure on the hold-out sample:

We have predicted the class and scores for the hold-out sample using the commands below:

cart.val$predict.class <- predict(ptree, cart.val, type="class")


cart.val$predict.score <- predict(ptree, cart.val)

deciles   cnt  cnt_resp  cnt_non_resp  rrate  cum_resp  cum_non_resp  cum_perct_resp  cum_perct_non_resp     ks
     10   153       136            17  88.89       136            17           93.15                1.27  91.88
      9   278        10           268   3.60       146           285          100.00               21.30  78.70
      8  1053         0          1053   0.00       146          1338          100.00              100.00   0.00
We again get very high AUC, KS, and Gini values on the hold-out sample, which shows the model generalizes well.

We can also validate this with the confusion matrix:

             predict.class
Personal.Loan    0    1
            0 1330    8
            1   15  131

The hold-out accuracy is (1330 + 131) / 1484 ≈ 98.5%, in line with the training
performance.

Model deployment strategy:


Native Java/C++ model:

 Faster
 Limited availability of algorithm/data-science libraries

Hybrid approach (PMML):

 Compatibility across multiple platforms (a minimal export sketch follows this list)
 Not agile
 Not flexible in terms of deployment
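
A minimal export sketch for the PMML route, assuming the CRAN pmml and XML packages
(which support rpart trees; the file name is illustrative):

library(pmml)
library(XML)
saveXML(pmml(ptree), "ptree.pmml")   # pruned CART tree as a portable PMML document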

Python stack:

 PMML files are big
 Unit testing is tricky

API-powered model:

 Agile
 Scalable
 Can serve as both backend and front end
 Faster
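
A hypothetical sketch of the API route, assuming the CRAN plumber and jsonlite packages;
the endpoint name, file name, and port are made up for illustration:

# api.R -- wraps the pruned tree ptree in an HTTP scoring endpoint
#* Score a customer; the JSON body must carry the model's predictor columns
#* @post /score
function(req) {
  newdata <- as.data.frame(jsonlite::fromJSON(req$postBody))
  predict(ptree, newdata, type = "prob")[, 2]   # probability of taking the loan
}

# launched with: plumber::plumb("api.R")$run(port = 8000)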
R Code:
setwd("/Users/aman.mittal01/Desktop/R work")
data=read.csv("Bank_Personal_Loan_Modelling.csv",header = TRUE)
summary(data)
str(data)
names(data)

install.packages('gower', dependencies = TRUE)


install.packages('caret')
library(gower)
library(caret)

nsv <- nearZeroVar(data, saveMetrics=TRUE)


nsv <-cbind("ColNo"=1:ncol(data),nsv)
nsv

any(is.na(data))   # TRUE: the dataset contains missing values

data[is.na(data)] <- 3   # impute the 18 missing Family.members values with 3

data <- subset(data, select = c(-ID))   # drop ID, it carries no predictive signal


any(is.na(data))   # re-check for missing values: FALSE after imputation

attach(data)
library(ggplot2)

# density of CCAvg split by loan status
ggplot(data, aes(CCAvg, fill = factor(Personal.Loan))) + geom_density() + facet_grid(~Personal.Loan)

table(Personal.Loan,CCAvg)
prop.table(table(Securities.Account,Personal.Loan),1)*100

## Create Development and Validation Sample


set.seed(1234)
data$random <- runif(nrow(data), 0, 1);
cart <- data[order(data$random),]

#SEPARATE DATA BASED ON VALUE OF RANDOM COLUMN


cart.dev <- cart[which(cart$random <= 0.7),]
cart.val <- cart[which(cart$random > 0.7),]

#SHOWS ROWCOUNT FOR DEV AND VALIDATION SAMPLE


c(nrow(cart.dev), nrow(cart.val))
length(which(cart.dev$Personal.Loan=="1"))/nrow(cart.dev)
length(which(cart.val$Personal.Loan=="1"))/nrow(cart.val)

# remove the random variable


cart.dev = subset(cart.dev, select = -random)
cart.val = subset(cart.val, select = -random)
install.packages('rpart', dependencies = TRUE)
install.packages('rpart.plot')
library(rpart)
library(rpart.plot)

cartParameters = rpart.control(minsplit=20, minbucket = 7, cp = 0, xval = 10)


cartModel <- rpart(formula = Personal.Loan ~ ., data = cart.dev, method = "class", control =
cartParameters)
cartModel

## PRINTING CART MODEL PARAMETERS


install.packages('rattle', dependencies = TRUE)

library(rattle)
library(RColorBrewer)
fancyRpartPlot(cartModel)
printcp(cartModel)
plotcp(cartModel)

bestcp <- cartModel$cptable[which.min(cartModel$cptable[,"xerror"]), "CP"]

bestcp

ptree <- prune(cartModel, cp = bestcp)

fancyRpartPlot(ptree, uniform=TRUE, main="Pruned Classification Tree")

#Model Performance

cart.dev$predict.class <- predict(ptree, cart.dev, type="class")


cart.dev$predict.score <- predict(ptree, cart.dev)

View(cart.dev)

## deciling code: bucket each score into deciles 1-10 by its quantile
decile <- function(x){
  deciles <- vector(length=10)
  for (i in seq(0.1, 1, .1)){
    deciles[i*10] <- quantile(x, i, na.rm=T)   # decile cut-points
  }
  return (
    ifelse(x<deciles[1], 1,
    ifelse(x<deciles[2], 2,
    ifelse(x<deciles[3], 3,
    ifelse(x<deciles[4], 4,
    ifelse(x<deciles[5], 5,
    ifelse(x<deciles[6], 6,
    ifelse(x<deciles[7], 7,
    ifelse(x<deciles[8], 8,
    ifelse(x<deciles[9], 9, 10
  ))))))))))
}

## deciling
cart.dev$deciles <- decile(cart.dev$predict.score[,2])
View(cart.dev)

## Ranking code
install.packages("data.table", dependencies = TRUE)

library(data.table)
tmp_DT = data.table(cart.dev)
rank <- tmp_DT[, list(
cnt = length(Personal.Loan),
cnt_resp = length(which(Personal.Loan == '1')),
cnt_non_resp = length(which(Personal.Loan == '0'))) ,
by=deciles][order(-deciles)];

rank

rank$rrate <- round(rank$cnt_resp * 100 / rank$cnt,2);


rank$cum_resp <- cumsum(rank$cnt_resp)
rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
rank$cum_perct_resp <- round(rank$cum_resp * 100 / sum(rank$cnt_resp),2);
rank$cum_perct_non_resp <- round(rank$cum_non_resp * 100 / sum(rank$cnt_non_resp),2);
rank$ks <- abs(rank$cum_perct_resp - rank$cum_perct_non_resp);
View(rank)

install.packages("ROCR")
library(ROCR)
library(gplots)
pred <- prediction(cart.dev$predict.score[,2], cart.dev$Personal.Loan)
pred

perf <- performance(pred, "tpr", "fpr")

perf

plot(perf)

KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])


KS

auc <- performance(pred,"auc");

auc

auc <- as.numeric(auc@y.values)

install.packages("ineq")
library(ineq)
gini = ineq(cart.dev$predict.score[,2], type="Gini")

with(cart.dev, table(Personal.Loan, predict.class))


auc
KS
gini

####################################
##VALIDATION FOR HOLDOUT SAMPLE#####
####################################

cart.val$predict.class <- predict(ptree, cart.val, type="class")


cart.val$predict.score <- predict(ptree, cart.val)
head(cart.val)
View(cart.val)

## deciling
cart.val$deciles <- decile(cart.val$predict.score[,2])
View(cart.val)
tmp_DT = data.table(cart.val)
rank <- tmp_DT[, list(
cnt = length(Personal.Loan),
cnt_resp = length(which(Personal.Loan == "1")),
cnt_non_resp = length(which(Personal.Loan == "0"))) ,
by=deciles][order(-deciles)];
rank$rrate <- round(rank$cnt_resp * 100 / rank$cnt,2);
rank$cum_resp <- cumsum(rank$cnt_resp)
rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
rank$cum_perct_resp <- round(rank$cum_resp * 100 / sum(rank$cnt_resp),2);
rank$cum_perct_non_resp <- round(rank$cum_non_resp * 100 / sum(rank$cnt_non_resp),2);
rank$ks <- abs(rank$cum_perct_resp - rank$cum_perct_non_resp);
View(rank)

pred <- prediction(cart.val$predict.score[,2], cart.val$Personal.Loan)


perf <- performance(pred, "tpr", "fpr")
plot(perf)
KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
auc <- performance(pred,"auc");
auc <- as.numeric(auc@y.values)

gini = ineq(cart.val$predict.score[,2], type="Gini")

with(cart.val, table(Personal.Loan, predict.class))


auc
KS
gini
