
Business Objective:

We want to build a model that will help Thera Bank, which has more liability customers
than asset customers, identify the customers with a higher probability of purchasing a
personal loan. The model is built on last year's campaign data, which covers 5,000
customers and has a 9.6% success rate.

Understanding the attributes:


Data Description:

ID                  Customer ID
Age                 Customer's age in years
Experience          Years of professional experience
Income              Annual income of the customer ($000)
ZIP Code            Home address ZIP code
Family              Family size of the customer
CCAvg               Avg. spending on credit cards per month ($000)
Education           Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Mortgage            Value of house mortgage, if any ($000)
Personal Loan       Did this customer accept the personal loan offered in the last campaign?
Securities Account  Does the customer have a securities account with the bank?
CD Account          Does the customer have a certificate of deposit (CD) account with the bank?
Online              Does the customer use internet banking facilities?
CreditCard          Does the customer use a credit card issued by the bank?
Structure of data (str(data)):

'data.frame': 5000 obs. of 14 variables:

ID                     int  1 2 3 4 5 6 7 8 9 10 ...
Age..in.years.         int  25 45 39 35 35 37 53 50 35 34 ...
Experience..in.years.  int  1 19 15 9 8 13 27 24 10 9 ...
Income..in.K.month.    int  49 34 11 100 45 29 72 22 81 180 ...
ZIP.Code               int  91107 90089 94720 94112 91330 92121 91711 93943 90089 93023 ...
Family.members         int  4 3 1 1 4 4 2 1 3 1 ...
CCAvg                  num  1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
Education              int  1 1 1 2 2 2 2 3 2 3 ...
Mortgage               int  0 0 0 0 0 155 0 0 104 0 ...
Personal.Loan          int  0 0 0 0 0 0 0 0 0 1 ...
Securities.Account     int  1 1 0 0 0 0 0 0 0 0 ...
CD.Account             int  0 0 0 0 0 0 0 0 0 0 ...
Online                 int  0 0 0 0 0 1 1 0 1 0 ...
CreditCard             int  0 0 0 0 1 0 0 1 0 0 ...

Here Personal.Loan is considered the dependent variable and all other attributes are
independent variables.

The data includes demographic information about the customer (Age, Income, Experience,
ZIP code, family members, Education), which captures customer behaviour, so we take
these columns into consideration.

Columns like Mortgage, Securities.Account, CD.Account, Online, and CreditCard help us
understand which facilities the customer avails with the bank; they reflect the customer's
engagement and satisfaction with the bank, which may encourage them to go for a personal
loan, so we consider these too.

We should not consider ID, as it is unique for each customer and does not help in model
building.

Exploratory Data Analysis:
summary(data):

                       Min.  1st Qu.  Median    Mean  3rd Qu.   Max.  NA's
ID                        1     1251    2500    2500     3750   5000
Age..in.years.           23       35      45   45.34       55     67
Experience..in.years.    -3       10      20    20.1       30     43
Income..in.K.month.       8       39      64   73.77       98    224
ZIP.Code               9307    91911   93437   93152    94608  96651
Family.members            1        1       2   2.397        3      4    18
CCAvg                     0      0.7     1.5   1.938      2.5     10
Education                 1        1       2   1.881        3      3
Mortgage                  0        0       0    56.5      101    635
Personal.Loan             0        0       0   0.096        0      1
Securities.Account        0        0       0  0.1044        0      1
CD.Account                0        0       0  0.0604        0      1
Online                    0        0       0  0.5968        1      1
CreditCard                0        0       0   0.294        1      1

We can conclude the following points from the above summary:

 Personal.Loan, Securities.Account, CD.Account, Online, and CreditCard all take only
0 or 1 as values; Mortgage, by contrast, ranges from 0 to 635.
 The Family.members column has 18 null values, for which we have to do a null-value
analysis (a quick check follows this list).
 Personal.Loan has a mean of 0.096, which reflects the 9.6% success rate of last
year's campaign.
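
Where the NAs sit can be confirmed with a quick base-R check (the full R code at the end
only runs any(is.na(data)); colSums is a standard alternative shown here for illustration):

colSums(is.na(data))   # all columns show 0 except Family.members, which shows 18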

Null Removal approximation:

From the data we can see that the customers with missing Family.members values are
mostly aged persons over 40. At this age a person is generally married and has children
too, so we assume a family size of 3 as an approximation; a minimal imputation sketch
follows.
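
Since Family.members is the only column with NAs, this targeted replacement is equivalent
to the blanket data[is.na(data)] <- 3 used in the full R code at the end:

data$Family.members[is.na(data$Family.members)] <- 3   # assumed family size of 3
any(is.na(data))   # FALSE once imputed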
Near Zero Variance:

                       ColNo   freqRatio  percentUnique  zeroVar    nzv
ID                         1    1.000000         100.00    FALSE  FALSE
Age..in.years.             2    1.013423           0.90    FALSE  FALSE
Experience..in.years.      3    1.040541           0.94    FALSE  FALSE
Income..in.K.month.        4    1.011905           3.24    FALSE  FALSE
ZIP.Code                   5    1.330709           9.34    FALSE  FALSE
Family.members             6    1.133127           0.08    FALSE  FALSE
CCAvg                      7    1.043290           2.16    FALSE  FALSE
Education                  8    1.396402           0.06    FALSE  FALSE
Mortgage                   9  203.647059           6.94    FALSE   TRUE
Personal.Loan             10    9.416667           0.04    FALSE  FALSE
Securities.Account        11    8.578544           0.04    FALSE  FALSE
CD.Account                12   15.556291           0.04    FALSE  FALSE
Online                    13    1.480159           0.04    FALSE  FALSE
CreditCard                14    2.401361           0.04    FALSE  FALSE

From the above output we infer that ID has all unique values, and although Mortgage is
flagged as near-zero variance, we do not eliminate any column on the basis of this check.
The table is produced with caret's nearZeroVar, as shown below.
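
This reproduces the diagnostics from the full R code at the end:

nsv <- nearZeroVar(data, saveMetrics = TRUE)   # caret's near-zero-variance metrics
nsv <- cbind("ColNo" = 1:ncol(data), nsv)
nsv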
Analysis of relationship of dependent and independent variables:

 Age vs Personal Loan

We can see that more people aged 30-40 have taken the loan, which is quite
self-explanatory: people mostly take loans at the young, career-settling stage that begins
around age 30.
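
These age/experience/income comparisons can be visualised as density plots faceted by
loan status, following the CCAvg plot in the full R code; the Age..in.years. column name
is taken from str(data):

library(ggplot2)
ggplot(data, aes(Age..in.years., fill = factor(Personal.Loan))) +
  geom_density() + facet_grid(~Personal.Loan)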

 Experience..in.years. vs Personal Loan

Here also we can infer that most people who took the loan are at the start of their career,
with 0-10 years of experience.

 Income..in.K.month. vs Personal Loan:

Here we can infer that customers with a monthly income below 100K are most unlikely to
take a personal loan under the current campaign.

 CCAvg vs Personal Loan

We can infer that customers with average monthly credit card spending above 2.5K were
persuaded by the campaign and are more likely to take a personal loan.

 Credit card vs Personal Loan:

            Personal.Loan
CreditCard          0         1
         0  90.453258  9.546742
         1  90.272109  9.727891

We cannot infer much from credit cards: customers with and without a bank credit card
show the same ~90% chance of not taking the loan.

 Certificate of deposit vs Personal Loan:

            Personal.Loan
CD.Account          0          1
         0  92.762878   7.237122
         1  53.642384  46.357616

We can infer that customers holding a certificate of deposit with the bank have a much
higher probability of taking the loan (46.4% vs 7.2%).

 Securities Account vs Personal Loan:

                    Personal.Loan
Securities.Account          0          1
                 0  90.620813   9.379187
                 1  88.505747  11.494253

Not much we can infer from this.
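
The row-percentage tables in this section come from prop.table on a two-way table, as in
the full R code at the end (shown here for Securities.Account; the other tables swap in
CreditCard or CD.Account):

prop.table(table(Securities.Account, Personal.Loan), 1) * 100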

Splitting the Train and Test data:

We have split the data using the commands below:

## Create Development and Validation Sample

set.seed(1234)
data$random <- runif(nrow(data), 0, 1);
cart <- data[order(data$random),]

#SEPARATE DATA BASED ON VALUE OF RANDOM COLUMN

cart.dev <- cart[which(cart$random <= 0.7),]


cart.val <- cart[which(cart$random > 0.7),]

We have divided the data randomly in a 70:30 ratio, giving 3516 training records and 1484
validation records respectively.

The success rate is 0.09499431 in cart.dev and 0.09838275 in cart.val.

The response distribution is therefore consistent across the train and validation samples;
the check is shown below.
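
The success rates are computed as in the full R code:

length(which(cart.dev$Personal.Loan == "1")) / nrow(cart.dev)   # 0.09499431
length(which(cart.val$Personal.Loan == "1")) / nrow(cart.val)   # 0.09838275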


CART MODEL BUILDING
Setting the control parameters as below:

cartParameters = rpart.control(minsplit=20, minbucket = 7, cp = 0, xval = 10)

Here we select minsplit = 20 (the minimum number of observations that must exist in a node
for a split to be attempted) and minbucket = 7 (about minsplit/3, the minimum number of
records allowed in a terminal node).

We start with cp = 0 and select the best cp later during pruning; xval = 10 requests
10-fold cross-validation, which produces the xerror column used for that selection.
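
The tree itself is grown on the development sample with rpart, as in the full R code at
the end:

cartModel <- rpart(formula = Personal.Loan ~ ., data = cart.dev,
                   method = "class", control = cartParameters)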


Pruning:

The best cp (the cp of the subtree with the minimum cross-validated error) is calculated
with the command below:

bestcp <- cartModel$cptable[which.min(cartModel$cptable[,"xerror"]), "CP"]

value of bestcp = 0.002994012

Pruning the tree with the command below:

ptree <- prune(cartModel, cp = bestcp)


Model Performance Measure:
We have predicted the class and score for the training data using the commands below:

cart.dev$predict.class <- predict(ptree, cart.dev, type="class")


cart.dev$predict.score <- predict(ptree, cart.dev)

Using the deciling and ranking method we get the performance below:

deciles   cnt  cnt_resp  cnt_non_resp  rrate  cum_resp  cum_non_resp  cum_perct_resp  cum_perct_non_resp     ks
     10   391       310            81  79.28       310            81           92.81                2.55  90.26
      9   611        24           587   3.93       334           668          100.00               20.99  79.01
      8  2514         0          2514   0.00       334          3182          100.00              100.00   0.00

      AUC        KS       GINI
0.9963015  0.926009  0.8878087
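
These metrics are computed from the predicted scores with the ROCR and ineq packages, as
in the full R code at the end:

pred <- prediction(cart.dev$predict.score[,2], cart.dev$Personal.Loan)
perf <- performance(pred, "tpr", "fpr")
KS   <- max(attr(perf, 'y.values')[[1]] - attr(perf, 'x.values')[[1]])
auc  <- as.numeric(performance(pred, "auc")@y.values)
gini <- ineq(cart.dev$predict.score[,2], type = "Gini")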

We get very high AUC, KS, and Gini values, which shows the model fits the training data well.
We can also validate this by confusion matrix:

             predict.class
Personal.Loan    0    1
            0 3170   12
            1   38  296

The model performed quite well, with a training accuracy of (3170 + 296) / 3516 ≈ 98.6%.

Model validation and performance measure on the hold-out sample:

We have predicted the class and scores for the hold-out sample using the commands below:

cart.val$predict.class <- predict(ptree, cart.val, type="class")


cart.val$predict.score <- predict(ptree, cart.val)

deciles   cnt  cnt_resp  cnt_non_resp  rrate  cum_resp  cum_non_resp  cum_perct_resp  cum_perct_non_resp     ks
     10   153       136            17  88.89       136            17           93.15                1.27  91.88
      9   278        10           268   3.60       146           285          100.00               21.30  78.70
      8  1053         0          1053   0.00       146          1338          100.00              100.00   0.00
We again get very high AUC, KS, and Gini values on the hold-out sample, which shows the model generalizes well.

We can also validate this with the confusion matrix:

             predict.class
Personal.Loan    0    1
            0 1330    8
            1   15  131

The hold-out accuracy is (1330 + 131) / 1484 ≈ 98.5%, in line with the training
performance.

Model deployment strategy:


Native Java/C++ model:

 Faster
 Limited availability of algorithm/data-science libraries

Hybrid approach (PMML):

 Compatibility across multiple platforms (a minimal export sketch follows this list)
 Not agile
 Not flexible in terms of deployment
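
A minimal export sketch for the PMML route, assuming the CRAN pmml and XML packages
(which support rpart trees; the file name is illustrative):

library(pmml)
library(XML)
saveXML(pmml(ptree), "ptree.pmml")   # pruned CART tree as a portable PMML document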

Python stack:

 PMML files are big
 Unit testing is tricky

API-powered model:

 Agile
 Scalable
 Can serve as both backend and front end
 Faster
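
A hypothetical sketch of the API route, assuming the CRAN plumber and jsonlite packages;
the endpoint name, file name, and port are made up for illustration:

# api.R -- wraps the pruned tree ptree in an HTTP scoring endpoint
#* Score a customer; the JSON body must carry the model's predictor columns
#* @post /score
function(req) {
  newdata <- as.data.frame(jsonlite::fromJSON(req$postBody))
  predict(ptree, newdata, type = "prob")[, 2]   # probability of taking the loan
}

# launched with: plumber::plumb("api.R")$run(port = 8000)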
R Code:
setwd("/Users/aman.mittal01/Desktop/R work")
data=read.csv("Bank_Personal_Loan_Modelling.csv",header = TRUE)
summary(data)
str(data)
names(data)

install.packages('gower', dependencies = TRUE)


install.packages('caret')
library(gower)
library(caret)

nsv <- nearZeroVar(data, saveMetrics=TRUE)


nsv <-cbind("ColNo"=1:ncol(data),nsv)
nsv

any(is.na(data))   # TRUE: the dataset contains missing values

data[is.na(data)] <- 3   # impute the 18 missing Family.members values with 3

data <- subset(data, select = c(-ID))   # drop ID, it carries no predictive signal


any(is.na(data))   # re-check for missing values: FALSE after imputation

attach(data)
library(ggplot2)

# density of CCAvg split by loan status
ggplot(data, aes(CCAvg, fill = factor(Personal.Loan))) + geom_density() + facet_grid(~Personal.Loan)

table(Personal.Loan,CCAvg)
prop.table(table(Securities.Account,Personal.Loan),1)*100

## Create Development and Validation Sample


set.seed(1234)
data$random <- runif(nrow(data), 0, 1);
cart <- data[order(data$random),]

#SEPARATE DATA BASED ON VALUE OF RANDOM COLUMN


cart.dev <- cart[which(cart$random <= 0.7),]
cart.val <- cart[which(cart$random > 0.7),]

#SHOWS ROWCOUNT FOR DEV AND VALIDATION SAMPLE


c(nrow(cart.dev), nrow(cart.val))
length(which(cart.dev$Personal.Loan=="1"))/nrow(cart.dev)
length(which(cart.val$Personal.Loan=="1"))/nrow(cart.val)

# remove the random variable


cart.dev = subset(cart.dev, select = -random)
cart.val = subset(cart.val, select = -random)
install.packages('rpart', dependencies = TRUE)
install.packages('rpart.plot')
library(rpart)
library(rpart.plot)

cartParameters = rpart.control(minsplit=20, minbucket = 7, cp = 0, xval = 10)


cartModel <- rpart(formula = Personal.Loan ~ ., data = cart.dev, method = "class", control =
cartParameters)
cartModel

## PRINTING CART MODEL PARAMETERS


install.packages('rattle', dependencies = TRUE)

library(rattle)
library(RColorBrewer)
fancyRpartPlot(cartModel)
printcp(cartModel)
plotcp(cartModel)

bestcp <- cartModel$cptable[which.min(cartModel$cptable[,"xerror"]), "CP"]

bestcp

ptree <- prune(cartModel, cp = bestcp)

fancyRpartPlot(ptree, uniform=TRUE, main="Pruned Classification Tree")

#Model Performance

cart.dev$predict.class <- predict(ptree, cart.dev, type="class")


cart.dev$predict.score <- predict(ptree, cart.dev)

View(cart.dev)

## deciling code: bucket each score into deciles 1-10 by its quantile
decile <- function(x){
  deciles <- vector(length=10)
  for (i in seq(0.1, 1, .1)){
    deciles[i*10] <- quantile(x, i, na.rm=T)   # decile cut-points
  }
  return (
    ifelse(x<deciles[1], 1,
    ifelse(x<deciles[2], 2,
    ifelse(x<deciles[3], 3,
    ifelse(x<deciles[4], 4,
    ifelse(x<deciles[5], 5,
    ifelse(x<deciles[6], 6,
    ifelse(x<deciles[7], 7,
    ifelse(x<deciles[8], 8,
    ifelse(x<deciles[9], 9, 10
  ))))))))))
}

## deciling
cart.dev$deciles <- decile(cart.dev$predict.score[,2])
View(cart.dev)

## Ranking code
install.packages("data.table", dependencies = TRUE)

library(data.table)
tmp_DT = data.table(cart.dev)
rank <- tmp_DT[, list(
cnt = length(Personal.Loan),
cnt_resp = length(which(Personal.Loan == '1')),
cnt_non_resp = length(which(Personal.Loan == '0'))) ,
by=deciles][order(-deciles)];

rank

rank$rrate <- round(rank$cnt_resp * 100 / rank$cnt,2);


rank$cum_resp <- cumsum(rank$cnt_resp)
rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
rank$cum_perct_resp <- round(rank$cum_resp * 100 / sum(rank$cnt_resp),2);
rank$cum_perct_non_resp <- round(rank$cum_non_resp * 100 / sum(rank$cnt_non_resp),2);
rank$ks <- abs(rank$cum_perct_resp - rank$cum_perct_non_resp);
View(rank)

install.packages("ROCR")
library(ROCR)
library(gplots)
pred <- prediction(cart.dev$predict.score[,2], cart.dev$Personal.Loan)
pred

perf <- performance(pred, "tpr", "fpr")

perf

plot(perf)

KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])


KS

auc <- performance(pred,"auc");

auc

auc <- as.numeric(auc@y.values)

install.packages("ineq")
library(ineq)
gini = ineq(cart.dev$predict.score[,2], type="Gini")

with(cart.dev, table(Personal.Loan, predict.class))


auc
KS
gini

####################################
##VALIDATION FOR HOLDOUT SAMPLE#####
####################################

cart.val$predict.class <- predict(ptree, cart.val, type="class")


cart.val$predict.score <- predict(ptree, cart.val)
head(cart.val)
View(cart.val)

## deciling
cart.val$deciles <- decile(cart.val$predict.score[,2])
View(cart.val)
tmp_DT = data.table(cart.val)
rank <- tmp_DT[, list(
cnt = length(Personal.Loan),
cnt_resp = length(which(Personal.Loan == "1")),
cnt_non_resp = length(which(Personal.Loan == "0"))) ,
by=deciles][order(-deciles)];
rank$rrate <- round(rank$cnt_resp * 100 / rank$cnt,2);
rank$cum_resp <- cumsum(rank$cnt_resp)
rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
rank$cum_perct_resp <- round(rank$cum_resp * 100 / sum(rank$cnt_resp),2);
rank$cum_perct_non_resp <- round(rank$cum_non_resp * 100 / sum(rank$cnt_non_resp),2);
rank$ks <- abs(rank$cum_perct_resp - rank$cum_perct_non_resp);
View(rank)

pred <- prediction(cart.val$predict.score[,2], cart.val$Personal.Loan)


perf <- performance(pred, "tpr", "fpr")
plot(perf)
KS <- max(attr(perf, 'y.values')[[1]]-attr(perf, 'x.values')[[1]])
auc <- performance(pred,"auc");
auc <- as.numeric(auc@y.values)

gini = ineq(cart.val$predict.score[,2], type="Gini")

with(cart.val, table(Personal.Loan, predict.class))


auc
KS
gini
