
DATA MINING PROJECT

THERA BANK : LOAN CAMPAIGN

Bharath Reddy 8/5/19 GREAT LAKES : BABAI


Contents
1. PROJECT OBJECTIVE
2. ASSUMPTIONS
3. EXPLORATORY DATA ANALYSIS
3.1. Environment setup and data import
3.2. Variable identification
3.3. Missing value identification & treatment
3.4. Outlier identification & treatment
3.5. Multicollinearity Check
4. CLUSTERING
4.1. Hierarchical Clustering (agglomerative)
4.2. K-Means Clustering
5. BUILDING MODELS
5.1. CART MODEL
5.2. RANDOM FOREST MODEL
6. PERFORMANCE EVALUATION OF MODELS
6.1. ROC and KS charts for the CART and Random Forest models
6.2. Area Under the Curve (AUC)
6.3. Gini values
6.4. Concordance values
7. SUGGESTION TO THERA BANK
8. APPENDIX
1. PROJECT OBJECTIVE
Thera Bank - Loan Purchase Modeling. This case is about a bank (Thera Bank) which has a growing customer
base. The majority of these customers are liability customers (depositors) with deposits of varying sizes. The
number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in
expanding this base rapidly to bring in more loan business and in the process, earn more through the interest
on loans. In particular, the management wants to explore ways of converting its liability customers to personal
loan customers (while retaining them as depositors). A campaign that the bank ran last year for liability
customers showed a healthy conversion rate of over 9% success. This has encouraged the retail marketing
department to devise campaigns with better target marketing to increase the success ratio with a minimal
budget. The department wants to build a model that will help them identify the potential customers who have
a higher probability of purchasing the loan. This will increase the success ratio while at the same time reduce
the cost of the campaign. The dataset has data on 5000 customers. The data include customer demographic
information (age, income, etc.), the customer’s relationship with the bank (mortgage, securities account, etc.),
and the customer response to the last personal loan campaign (Personal Loan). Among these 5000 customers,
only 480 (= 9.6%) accepted the personal loan that was offered to them in the earlier campaign. You are
brought in as a consultant and your job is to build the best model which can classify the right customers who
have a higher probability of purchasing the loan. You are expected to do the following:

• EDA of the data available. Showcase the results using appropriate graphs
• Apply appropriate clustering on the data and interpret the output
• Build appropriate models on both the test and train data (CART & Random Forest). Interpret all the model
outputs and do the necessary modifications wherever eligible (such as pruning)
• Check the performance of all the models that you have built (test and train). Use all the model performance
measures you have learned so far. Share your remarks on which model performs the best.

2. ASSUMPTIONS
Only the explanations around the models are marked; the grading criteria do not include assessment of model
performance or accuracy of predictions.

3. EXPLORATORY DATA ANALYSIS


3.1. Environment setup and data import.
All the required libraries are loaded, the working directory containing the dataset is set, and the data is read into a variable.
Please refer to Appendix 8.1 for the source code.

3.2. Variable identification.


Information on the features/ attributes
The binary category has five variables as below:
• Personal Loan - Did this customer accept the personal loan offered in the last campaign? This is our target
variable
• Securities Account - Does the customer have a securities account with the bank?
• CD Account - Does the customer have a certificate of deposit (CD) account with the bank?
• Online - Does the customer use internet banking facilities?
• Credit Card - Does the customer use a credit card issued by the bank?
Interval variables are as below:
• Age - Age of the customer
• Experience - Years of experience
• Income - Annual income in dollars
• CCAvg - Average credit card spending
• Mortgage - Value of house mortgage
Ordinal Categorical Variables are:
• Family - Family size of the customer
• Education - education level of the customer
The nominal variables are:
• ID
• Zip Code
We change the names of the variables to make them easier to work with.
The variable ID does not add any useful information: there is no association between a person’s customer ID and the loan, and it does not support any general conclusion about future potential loan customers. We therefore drop this variable from the model.
Please refer to Appendix 8.2 for the source code.

3.3. Missing value identification & treatment.


There are 18 missing values, all of which come from the column Family Members.
We check whether these missing values are randomly spread, or whether removing them would introduce bias into the remaining dataset. We do this by tagging rows with missing values with an indicator variable and plotting those rows against the rows with complete values for each of the variables, to examine the spread of the missing data.
We see that it is safe to remove the rows with missing values from our analysis without any undue loss of information or bias in the dataset.
Please refer to Appendix 8.3 for the source code.

3.4. Outlier identification & treatment.


Only one outlier is detected, in Zip Code, but since it is nominal data we do not exclude it; we instead keep an eye on its impact on the analysis.

3.5. Multicollinearity Check


Multicollinearity among the covariates (independent variables) would distort the analysis and has to be investigated. Drawing a correlation plot (corrplot), we see that the variables Age and Experience are highly correlated, and hence for the purpose of our analysis we drop the variable Age from the dataset and all subsequent analysis.

Please refer to Appendix 8.4 for the source code.

4. CLUSTERING
There is no objective function in unsupervised learning to test the performance of the algorithm, so we perform the clustering on the full dataset directly; there is no need to split the data into training and test sets. For the purpose of this assignment we use two methods:
1: Hierarchical clustering (agglomerative).
2: K-means clustering.
4.1. Hierarchical Clustering (agglomerative)

STEP 1: We first remove the response variable (Personal Loan) and scale the rest of the data. We also remove the variable Zip Code, as a distance measure between zip codes would not be meaningful.
One way to check that the data is scaled properly is to verify that the mean and standard deviation of each scaled column are 0 and 1 respectively. Please see Appendix 8.5 for the source code.

STEP 2: Calculate the distance matrix on the scaled data. There are multiple ways to calculate the distance between observations; for our analysis we use Euclidean distance. We then run hierarchical clustering with the "average" linkage method and divide the data into two clusters: one corresponding to people who responded to the loan solicitation and the other to people who did not.

STEP 3: Draw a dendrogram. The height of the dendrogram indicates the order in which the clusters were joined; a common mistake when reading dendrograms is to assume that their shape gives a clue as to how many clusters exist. Here the dendrogram is not very readable because of the size of the dataset and because hierarchical clustering goes all the way down to individual data points when comparing distances. Because we know there are only two clusters, we proceed by cutting the dendrogram into two groups.
STEP 4: Cutting the tree into two classes and labelling the data with the cluster number, we compare the cluster labels with the actual response. Comparing the number of correct assignments to the overall number of data points, the clustering has an accuracy of 89.96%.
Please see Appendix 8.6 for the source code.

4.2. K-Means Clustering

We use the scaled data for K-means clustering. For reproducibility we start by setting a seed.

K-means uses Lloyd's algorithm, which essentially works on distances between observations. For that reason, categorical variables that merely indicate the presence or absence of an attribute would distort the K-means output and should be removed before running the algorithm, so we remove the categorical variables from the dataset.

The nstart parameter gives the number of random starting sets; we choose 10 for brevity and speed of execution.

NOTE: Ideally, methods such as NbClust are used to arrive at an appropriate number of clusters, but for our analysis we already know that we need to partition the data into two clusters, so these methods were not used (a sketch of the elbow method is shown below for reference).
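
As a reference only, here is a minimal sketch of the elbow method on the scaled data df3 (as built in Appendix 8.7); the range of candidate k values and the nstart setting are illustrative assumptions, not choices made in the analysis above.

# Elbow method sketch: total within-cluster sum of squares for k = 1..10
set.seed(1001)
wss <- sapply(1:10, function(k) kmeans(df3, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")
# The "elbow" in this plot would suggest a reasonable number of clusters.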

For visual inspection of the clusters we draw the cluster plot. It is not readable due to the large number of observations, but it does suggest that the algorithm has not been able to segregate the data into two clean sets.
The confusion matrix shows that the clustering correctly assigns 81.69% of the target variable.
Please see Appendix 8.7 for the source code.

Finally, we remove the cluster variable from the dataset:

df2$cluster <- NULL
5. BUILDING MODELS
5.1. CART MODEL
We now develop a CART model to build a decision tree, using the rpart and rpart.plot libraries. We develop the model on the training dataset and cross-check its accuracy by applying it to the test dataset.

The model has following parameters:


1: minsplit = 100 gives the minimum number of observations that must exist in a node for a split to be attempted.
2: minbucket = 10 gives the minimum number of observations in any leaf node.
3: cp = 0 is the complexity parameter; any split that does not improve the fit by at least this amount is not attempted. We start with a complexity parameter of zero and tune the model later.
4: xval = 10 gives the number of folds into which the training dataset is broken to perform cross-validation. The cross-validation error provides crucial information for tuning the model further.
Since CART performance does not depend on the scaling of the data or on the categorical nature of variables, we do not scale the data or remove the categorical variables.
Please see Appendix 8.8 for the source code.

CART has a tendency to overfit the data, so we tune the model by examining the cross-validation errors against the complexity parameter. We see that the cross-validation error keeps decreasing all the way to cp = 0, so we do not prune the model any further (a pruning sketch is shown below for reference). Examining the variable importance shows the contribution of each covariate to the model.

Please see Appendix 8.9 for the source code.
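
For reference, a hedged sketch of how pruning would be applied had the cross-validation error bottomed out before cp = 0; selecting the cp at the minimum xerror is a common convention, not a step performed in this report (Cart_model is the tree built in Appendix 8.8).

# Pruning sketch: choose the cp with the lowest cross-validation error and prune the tree
best_cp <- Cart_model$cptable[which.min(Cart_model$cptable[, "xerror"]), "CP"]
Pruned_model <- prune(Cart_model, cp = best_cp)
rpart.plot(Pruned_model)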


We check the accuracy of the model by adding the model-predicted class and the predicted probability to the training dataset (the target/response variable is excluded when generating the predictions).

We see that the model gives 98.05% accuracy on the training data. To check whether it has overfitted the training data, we then use the model to predict the response variable in the test dataset, where it gives 98.39% accuracy. We can say the model is robust and does not overfit the data used to develop it.

Please see Appendix 8.10 for the source code.

5.2. RANDOM FOREST MODEL


We build the random forest with the parameters below:
1: mtry = 5. This gives the number of independent variables randomly sampled as candidates at each split. We have 11 variables and choose a mid-range figure, since a high number would make the trees in the forest similar to one another and a low number would create trees with weak predictive power.
2: nodesize = 10 gives the minimum number of observations in a terminal node. Too low a figure would cause overfitting and too high a number would cause underfitting.
3: ntree = 501 gives the number of trees to grow; too low a number would cause each input row to be predicted too few times. An odd number is chosen because this is a classification problem, so ties in the voting among trees are avoided.
4: importance is set to TRUE to assess the importance of the predictors.
Please see Appendix 8.11 for the source code.
The plot shows the OOB error rate not decreasing much after about 150 trees. We therefore tune the random forest with 150 trees and a step factor of 1.5; the step factor tries various values for the number of variables sampled at each split (mtry) and chooses the one with the lowest out-of-bag error rate.
Please see Appendix 8.12 for the source code.
The tuned model gives an accuracy of 99.08% on the training dataset and 98.66% on the test dataset. We again have a model with high accuracy, and evaluating both models on the test dataset allays the fear of overfitting.
Please see Appendix 8.13 for the source code.

6. PERFORMANCE EVALUATION OF MODELS


We now explore various techniques to evaluate the performance of both the CART and Random Forest models. These techniques are:
1: Confusion matrix, which gives measures such as:
• Classification error rate: the sum of type 1 and type 2 errors as a percentage of the entire data.
• Sensitivity (recall, or true positive rate): the proportion of total positives that were correctly identified.
• Specificity (true negative rate): the proportion of total negatives that were correctly identified.
2: KS statistic: the maximum difference between the cumulative proportion of responders and the cumulative proportion of non-responders captured as the classification threshold varies.
3: AUC: the area under the curve formed by plotting the true positive rate against the false positive rate for every classification threshold.
4: Gini: twice the area under the curve less one (Gini = 2*AUC - 1).
5: Concordance/discordance ratios: the proportions of concordant and discordant pairs, where a concordant pair is a responder/non-responder pair in which the responder is assigned the higher predicted probability.
We will use all these measures on both the CART and Random Forest models, for both the training and test datasets, and compare the figures.
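
As an illustration of how these measures relate to one another, the sketch below computes them for a toy set of labels and predicted probabilities using ROCR (already loaded in Appendix 8.1); the toy vectors and the 0.5 threshold are assumptions for demonstration only.

# Toy example: computing the performance measures listed above
library(ROCR)
actual <- c(0, 0, 1, 1, 0, 1, 0, 1)                  # hypothetical true labels
prob   <- c(0.1, 0.3, 0.8, 0.6, 0.2, 0.9, 0.4, 0.7)  # hypothetical predicted probabilities
pred_class <- ifelse(prob > 0.5, 1, 0)               # classify at a 0.5 threshold
cm <- table(actual, pred_class)                      # confusion matrix
sensitivity <- cm["1", "1"] / sum(cm["1", ])         # true positive rate
specificity <- cm["0", "0"] / sum(cm["0", ])         # true negative rate
pred_obj <- prediction(prob, actual)                 # ROCR prediction object
perf <- performance(pred_obj, "tpr", "fpr")
ks   <- max(perf@y.values[[1]] - perf@x.values[[1]])      # KS statistic
auc  <- as.numeric(performance(pred_obj, "auc")@y.values) # area under the ROC curve
gini <- 2 * auc - 1                                       # Gini = 2*AUC - 1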

6.1. ROC and KS charts for the CART and Random Forest models
The Receiver Operating Characteristic (ROC) curve is a graphical plot that illustrates the diagnostic ability of a binary classifier as its discrimination threshold is varied. Looking at the ROC curves for the CART and Random Forest models, we can say the Random Forest performs better on the training dataset.

The KS measures are also better for the Random Forest model, which gives a maximum KS of 0.9911 versus 0.9400 for CART on the training data.

On the test data the ROC curves are likewise better for the Random Forest than for the CART model; the maximum KS is 0.9567 for the Random Forest compared with 0.9415 for CART.

Overall, looking at the ROC and KS figures, we can conclude that the Random Forest performed better than CART.

Please see Appendix 8.14 for source code.


6.2. Area Under the Curve (AUC)
We can clearly see that the area under the curve is larger for the Random Forest model on both the training and test datasets. We can conclude that the Random Forest performed better than CART.

The AUC values are as below:
MODEL           TRAIN DATASET   TEST DATASET
CART            0.9954651       0.9958384
RANDOM FOREST   0.9997411       0.9980149
Please see Appendix 8.15 for source code.
6.3. Gini values
We get the Gini values below, indicating the Random Forest to be a slightly better model.

MODEL           TRAIN DATASET   TEST DATASET
CART            0.8959566       0.8957577
RANDOM FOREST   0.8984477       0.9021382
Please see Appendix 8.16 for source code.
6.4. Concordance values
We get the concordance figures below for both models.

MODEL           TRAIN DATASET   TEST DATASET
CART            0.9928142       0.9934263
RANDOM FOREST   0.9997406       0.998002
For all of these the number of tied pairs was negligible. The concordance figures also indicate that the Random Forest is a better model than CART for the given dataset.
Please see Appendix 8.17 for the source code.
We can safely conclude that for the given dataset the Random Forest performs better than the CART model, and hence we use the Random Forest model to help Thera Bank target customers with better response rates, increasing the success ratio while reducing campaign costs.

7. SUGGESTION TO THERA BANK


To arrive at an appropriate customer segment to target for the loan campaign we follow the steps below.
STEP 1: Merge the test and training datasets. We are no longer building a model; we are using the scores provided by the model to make a recommendation to Thera Bank.

STEP 2: Break the dataset into deciles based on the probabilities provided by the Random Forest model. Once the dataset is split by these deciles, we calculate the response rate, cumulative response rate, non-response rate, cumulative non-response rate and KS figure for each decile.

STEP 3: Calculate the average response rate over the entire dataset.
We can suggest that Thera Bank direct its campaign efforts only to customers in the top probability bucket ([0.275, 1]). By targeting these 499 specific customers, Thera Bank would see a response rate of around 98%, compared with around 478/4982, i.e. 9.6%, if it targeted all customers.
Please see Appendix 8.18 for source code.

8. APPENDIX
8.1.
setwd("E:/BRK CLASSES/BABI/04 Data Mining/Week 6 Final Assigment")
library(corrplot)
library(tidyverse)
library(readxl)
library(caTools)
library(rpart)
library(rpart.plot)
library(ROCR)
library(tidyverse)
8.2.
df <- read_excel("Thera Bank_Personal_Loan_Modelling-dataset-1.xlsx", sheet = "Bank_Personal_Loan_Modelling")
# Change the column names to more meaningful, R-friendly names
colnames(df) <- c("ID", "Age", "Experience", "Income", "Zip_code", "family_members", "ccavg",
                  "Education", "Mortgage", "Personal_loan", "sec_acc", "cd_acc", "online", "credit_card")
names(df)
## [1] "ID" "Age" "Experience" "Income"
## [5] "Zip_code" "family_members" "ccavg" "Education"
## [9] "Mortgage" "Personal_loan" "sec_acc" "cd_acc"
## [13] "online" "credit_card"
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5000 obs. of 14 variables:
## $ ID : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : num 25 45 39 35 35 37 53 50 35 34 ...
## $ Experience : num 1 19 15 9 8 13 27 24 10 9 ...
## $ Income : num 49 34 11 100 45 29 72 22 81 180 ...
## $ Zip_code : num 91107 90089 94720 94112 91330 ...
## $ family_members: num 4 3 1 1 4 4 2 1 3 1 ...
## $ ccavg : num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
## $ Education : num 1 1 1 2 2 2 2 3 2 3 ...
## $ Mortgage : num 0 0 0 0 0 155 0 0 104 0 ...
## $ Personal_loan : num 0 0 0 0 0 0 0 0 0 1 ...
## $ sec_acc : num 1 1 0 0 0 0 0 0 0 0 ...
## $ cd_acc : num 0 0 0 0 0 0 0 0 0 0 ...
## $ online : num 0 0 0 0 0 1 1 0 1 0 ...
## $ credit_card : num 0 0 0 0 1 0 0 1 0 0 ...
df1 <- df %>% select(-ID)

8.3.
sum(is.na(df$Age))
## [1] 0
sum(is.na(df$Experience))
## [1] 0
sum(is.na(df$Income))
## [1] 0
sum(is.na(df$Zip_code))
## [1] 0
sum(is.na(df$family_members))
## [1] 18

df$missing <- ifelse(is.na(df$family_members), 0, 1)  # tag rows: 0 = family_members missing, 1 = present


str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5000 obs. of 15 variables:
## $ ID : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : num 25 45 39 35 35 37 53 50 35 34 ...
## $ Experience : num 1 19 15 9 8 13 27 24 10 9 ...
## $ Income : num 49 34 11 100 45 29 72 22 81 180 ...
## $ Zip_code : num 91107 90089 94720 94112 91330 ...
## $ family_members: num 4 3 1 1 4 4 2 1 3 1 ...
## $ ccavg : num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
## $ Education : num 1 1 1 2 2 2 2 3 2 3 ...
## $ Mortgage : num 0 0 0 0 0 155 0 0 104 0 ...
## $ Personal_loan : num 0 0 0 0 0 0 0 0 0 1 ...
## $ sec_acc : num 1 1 0 0 0 0 0 0 0 0 ...
## $ cd_acc : num 0 0 0 0 0 0 0 0 0 0 ...
## $ online : num 0 0 0 0 0 1 1 0 1 0 ...
## $ credit_card : num 0 0 0 0 1 0 0 1 0 0 ...
## $ missing : num 1 1 1 1 1 1 1 1 1 1 ...

df[!complete.cases(df), ]
## # A tibble: 18 x 15
## ID Age Experience Income Zip_code family_members ccavg Education
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 56 31 25 94015 NA 0.9 2
## 2 59 28 2 93 94065 NA 0.2 1
## 3 99 49 23 94 92374 NA 0.3 1
## 4 162 61 35 80 95053 NA 2.8 1
## 5 236 38 8 71 94720 NA 1.8 3
## 6 290 42 15 24 92121 NA 1 2
## 7 488 39 13 88 94117 NA 1.4 2
## 8 722 49 24 39 92717 NA 1.4 3
## 9 1461 40 16 85 92677 NA 0.2 3
## 10 1462 54 28 48 93022 NA 0.2 1
## 11 2400 62 36 41 90245 NA 1 3
## 12 2833 45 21 133 92056 NA 5.7 3
## 13 3702 58 33 95 90503 NA 2.6 1
## 14 4136 48 23 168 95929 NA 2.8 1
## 15 4139 47 22 114 95819 NA 0.6 1
## 16 4403 55 25 52 90095 NA 1.4 3
## 17 4404 50 24 112 92064 NA 0 1
## 18 4764 51 25 173 95051 NA 0.5 2
## # ... with 7 more variables: Mortgage <dbl>, Personal_loan <dbl>,
## # sec_acc <dbl>, cd_acc <dbl>, online <dbl>, credit_card <dbl>,
## # missing <dbl>

plot(df$Age, col = ifelse(df$missing == 0,'red','green'), pch = 19 )


plot(df$Experience, col = ifelse(df$missing == 0,'red','green'), pch = 19 )
plot(df$Income, col = ifelse(df$missing == 0,'red','green'), pch = 19 )
plot(df$ccavg, col = ifelse(df$missing == 0,'red','green'), pch = 19 )
ggplot(data = df)+
geom_point(aes(df$Personal_loan,df$Age,color=df$missing))
ggplot(data = df)+
geom_point(aes(df$Education,df$Experience,color=df$missing))
df1 <- df[rowSums(is.na(df)) == 0,]
view(df1)
nrow(df1)
## [1] 4982
df2 <- df1[,-1]

8.4.
numeric.list <- sapply(df2, is.numeric)
numeric.list

## Age Experience Income Zip_code family_members


## TRUE TRUE TRUE TRUE TRUE
## ccavg Education Mortgage Personal_loan sec_acc
## TRUE TRUE TRUE TRUE TRUE
## cd_acc online credit_card missing
## TRUE TRUE TRUE TRUE

sum(numeric.list)

## [1] 14

numeric.df <- df2[, numeric.list]


cor.mat <- cor(numeric.df)

## Warning in cor(numeric.df): the standard deviation is zero

corrplot(cor.mat, type = "lower", method = "number")


df2 <- df2 %>% select(-Age)

8.5.
str(df2)

## Classes 'tbl_df', 'tbl' and 'data.frame': 4982 obs. of 13 variables:


## $ Experience : num 1 19 15 9 8 13 27 24 10 9 ...
## $ Income : num 49 34 11 100 45 29 72 22 81 180 ...
## $ Zip_code : num 91107 90089 94720 94112 91330 ...
## $ family_members: num 4 3 1 1 4 4 2 1 3 1 ...
## $ ccavg : num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
## $ Education : num 1 1 1 2 2 2 2 3 2 3 ...
## $ Mortgage : num 0 0 0 0 0 155 0 0 104 0 ...
## $ Personal_loan : num 0 0 0 0 0 0 0 0 0 1 ...
## $ sec_acc : num 1 1 0 0 0 0 0 0 0 0 ...
## $ cd_acc : num 0 0 0 0 0 0 0 0 0 0 ...
## $ online : num 0 0 0 0 0 1 1 0 1 0 ...
## $ credit_card : num 0 0 0 0 1 0 0 1 0 0 ...
## $ missing : num 1 1 1 1 1 1 1 1 1 1 ...

df3 <- scale(df2[,-c(3,8)])  # scale everything except Zip_code (col 3) and Personal_loan (col 8)


apply(df3,2,mean)

## Experience Income family_members ccavg Education


## 1.191639e-16 1.163534e-16 1.331234e-16 -1.198875e-17 -1.496917e-17
## Mortgage sec_acc cd_acc online credit_card
## -1.734734e-17 -1.184744e-18 2.287257e-17 -3.374883e-17 -4.688800e-17
## missing
## NaN

apply(df3,2,sd)

## Experience Income family_members ccavg Education


## 1 1 1 1 1
## Mortgage sec_acc cd_acc online credit_card
## 1 1 1 1 1
## missing
## NA

8.6.
dist_matrix <- dist(df3,method = "euclidean")

cluster <- hclust(dist_matrix,method = "average")


plot(cluster)
rect.hclust(cluster,k = 2)
df2$cluster <- cutree(cluster,k = 2)
t <- table(df2$Personal_loan,df2$cluster)
t

##
## 1 2
## 0 4343 161
## 1 339 139

sum(diag(t))/sum(t)

## [1] 0.8996387

8.7.
df3 <- scale(df2[,-c(3,8,9,10,11,12)])  # drop Zip_code, Personal_loan and the binary indicator columns before k-means

attributes(df3)

seed = 1001

set.seed(seed)

cluster2 <- kmeans(df3,centers = 2,nstart = 10)

clusplot(df3,cluster2$cluster,color=TRUE,lines = TRUE,shade = TRUE,label=2)


df2$cluster <- cluster2$cluster

t <- table(df2$Personal_loan,df2$cluster)

##
## 1 2
## 0 3667 837
## 1 75 403

sum(diag(t))/sum(t)

## [1] 0.816941

8.8.
library(rpart)
library(rpart.plot)
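
The split of df2 into the train and test sets used below is not shown elsewhere in the appendix. A minimal sketch, assuming a 70/30 stratified split with caTools (loaded in 8.1); the split ratio and seed are assumptions and may differ from those used to produce the figures in this report.

# Hypothetical train/test split - ratio and seed are assumptions, not taken from the report
set.seed(1000)
split <- sample.split(df2$Personal_loan, SplitRatio = 0.7)
train <- df2[split, ]
test  <- df2[!split, ]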
r.ctrl = rpart.control(minsplit = 100, minbucket = 10, cp = 0, xval = 10)
Cart_model = rpart(Personal_loan ~ ., data = train, method = "class", control = r.ctrl)
rpart.plot(Cart_model)

8.9.
Cart_model$cptable

## CP nsplit rel error xerror xstd


## 1 0.30149254 0 1.0000000 1.0000000 0.05194591
## 2 0.12238806 2 0.3970149 0.4358209 0.03530584
## 3 0.03880597 3 0.2746269 0.2985075 0.02941973
## 4 0.01641791 4 0.2358209 0.2716418 0.02810188
## 5 0.00000000 6 0.2029851 0.2000000 0.02419808

Cart_model$variable.importance

## Education Income family_members ccavg cd_acc


## 209.219610 189.194949 128.258913 80.553585 36.823575
## Mortgage Experience Zip_code
## 19.752530 1.688228 1.233308

8.10.
train$prob_CART = predict(Cart_model, newdata = train[,-8])[,"1"]
train$predict_CART = predict(Cart_model, newdata = train[,-8], type = "class")

t <- table(train$Personal_loan,train$predict_CART)
t

##
## 0 1
## 0 3140 13
## 1 55 280

sum(diag(t))/sum(t)

## [1] 0.9805046

test$prob_CART = predict(Cart_model, newdata = test[,-8])[,"1"]


test$predict_CART = predict(Cart_model, newdata = test[,-8], type = "class")

t <- table(test$Personal_loan,test$predict_CART)
t

##
## 0 1
## 0 1348 3
## 1 21 122

sum(diag(t))/sum(t)

## [1] 0.9839357

8.11.
library(randomForest)

## Warning: package 'randomForest' was built under R version 3.5.3

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

##
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':


##
## combine

## The following object is masked from 'package:ggplot2':


##
## margin

set.seed(1000)
# Note: Personal_loan is numeric here, so randomForest fits a regression forest (hence the warning below);
# the tuned forest in Appendix 8.12 passes the response through as.factor() for classification.
RFmodel = randomForest(Personal_loan ~ ., data = train[,-c(13,14)], mtry = 5, nodesize = 10, ntree = 501, importance = TRUE)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?

print(RFmodel)

##
## Call:
## randomForest(formula = Personal_loan ~ ., data = train[, -c(13, 14)], mtry = 5, nodesize = 10, ntree = 501, importance = TRUE)
## Type of random forest: regression
## Number of trees: 501
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 0.01059858
## % Var explained: 87.79

plot(RFmodel)

8.12.
tRF <- tuneRF(x = train[,-c(8,13,14)],
y = as.factor(train$Personal_loan),
mtryStart = 5,
ntreeTry = 150,
stepFactor = 1.5,
improve = 0.001,
trace = TRUE,
plot = TRUE,
doBest = TRUE,
nodesize = 10,
importance = TRUE
)

## mtry = 5 OOB error = 1.26%


## Searching left ...
## mtry = 4 OOB error = 1.29%
## -0.02272727 0.001
## Searching right ...
## mtry = 7 OOB error = 1.38%
## -0.09090909 0.001
8.13.
train$predict_RF = predict(tRF, newdata = train[,-c(8,13,14)])
train$predict_RF

train$prob_RF = predict(tRF, newdata = train[,-c(8,13,14)], type = "prob")[,"1"]


train$prob_RF

t = table(train$Personal_loan, train$predict_RF)
t

##
## 0 1
## 0 3147 6
## 1 26 309

sum(diag(t))/sum(t)

## [1] 0.9908257

test$predict_RF = predict(tRF, newdata = test[,-c(8,13,14)])


test$prob_RF = predict(tRF, newdata = test[,-c(8,13,14)], type = "prob")[,"1"]

t = table(test$Personal_loan, test$predict_RF)
t

##
## 0 1
## 0 1347 4
## 1 16 127

sum(diag(t))/sum(t)

## [1] 0.9866131

8.14.

library(ROCR)
predobj_CART <- prediction(train$prob_CART,train$Personal_loan)
predobj_RF <- prediction(train$prob_RF,train$Personal_loan)

perf_CART <- performance(predobj_CART,"tpr","fpr")


perf_RF <- performance(predobj_RF,"tpr","fpr")

plot(perf_CART)
plot(perf_RF)

ks_CART = max(perf_CART@y.values[[1]] - perf_CART@x.values[[1]])


ks_RF = max(perf_RF@y.values[[1]] - perf_RF@x.values[[1]])
ks_CART

## [1] 0.9400571

ks_RF

## [1] 0.9911196

predobj_CART_test <- prediction(test$prob_CART,test$Personal_loan)


predobj_RF_test <- prediction(test$prob_RF,test$Personal_loan)

perf_CART_test <- performance(predobj_CART_test,"tpr","fpr")


perf_RF_test <- performance(predobj_RF_test,"tpr","fpr")

plot(perf_CART_test)

plot(perf_RF_test)
ks_CART_test =max(perf_CART_test@y.values[[1]] - perf_CART_test@x.values[[1]])
ks_RF_test = max(perf_RF_test@y.values[[1]] - perf_RF_test@x.values[[1]])
ks_CART_test
## [1] 0.9415248
ks_RF_test
## [1] 0.9567376
8.15.
auc_CART_train <- performance(predobj_CART,"auc")
auc_CART_train <- as.numeric(auc_CART_train@y.values)

auc_CART_test <- performance(predobj_CART_test,"auc")


auc_CART_test <- as.numeric(auc_CART_test@y.values)

auc_RF_train <- performance(predobj_RF,"auc")


auc_RF_train <- as.numeric(auc_RF_train@y.values)

auc_RF_test <- performance(predobj_RF_test,"auc")


auc_RF_test <- as.numeric(auc_RF_test@y.values)

auc_CART_train
## [1] 0.9954651
auc_CART_test
## [1] 0.9958384
auc_RF_train
## [1] 0.9997411
auc_RF_test
## [1] 0.9980149
8.16.
library(ineq)
## Warning: package 'ineq' was built under R version 3.5.2
gini_CART_train <- ineq(train$prob_CART,"gini")
gini_CART_test <- ineq(test$prob_CART,"gini")

gini_RF_train <- ineq(train$prob_RF,"gini")


gini_RF_test <- ineq(test$prob_RF,"gini")

gini_CART_test
## [1] 0.8959566
gini_CART_train
## [1] 0.8957577
gini_RF_test
## [1] 0.8984477
gini_RF_train
## [1] 0.9021382
8.17.
library(InformationValue)
## Warning: package 'InformationValue' was built under R version 3.5.3
Concordance(actuals = train$Personal_loan,predictedScores = train$prob_CART)
## $Concordance
## [1] 0.9928142
##
## $Discordance
## [1] 0.007185765
##
## $Tied
## [1] -4.163336e-17
##
## $Pairs
## [1] 1056255
Concordance(actuals = test$Personal_loan,predictedScores = test$prob_CART)
## $Concordance
## [1] 0.9934263
##
## $Discordance
## [1] 0.006573737
##
## $Tied
## [1] 3.469447e-17
##
## $Pairs
## [1] 193193
Concordance(actuals = train$Personal_loan,predictedScores = train$prob_RF)
## $Concordance
## [1] 0.9997406
##
## $Discordance
## [1] 0.0002594071
##
## $Tied
## [1] 3.843497e-17
##
## $Pairs
## [1] 1056255
Concordance(actuals = test$Personal_loan,predictedScores = test$prob_RF)
## $Concordance
## [1] 0.998002
##
## $Discordance
## [1] 0.001998002
##
## $Tied
## [1] 4.336809e-17
##
## $Pairs
## [1] 193193
8.18.
data <- rbind(test,train)
probs <- seq(0, 1, length = 11)
probs
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
qs <- quantile(data$prob_RF, probs)
qs
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.006 0.234 1.000
data$deciles = cut(data$prob_RF,unique(qs),include.lowest = TRUE,right = FALSE)
view(data)
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4982 obs. of 18 variables:
## $ Experience : num 19 9 8 24 5 18 18 11 20 35 ...
## $ Income : num 34 100 45 22 45 81 43 152 158 35 ...
## $ Zip_code : num 90089 94112 91330 93943 90277 ...
## $ family_members: num 3 1 4 1 3 4 2 2 1 1 ...
## $ ccavg : num 1.5 2.7 1 0.3 0.1 2.4 0.7 3.9 2.4 1.2 ...
## $ Education : num 1 2 2 3 2 1 1 1 1 3 ...
## $ Mortgage : num 0 0 0 0 0 0 163 159 0 122 ...
## $ Personal_loan : num 0 0 0 0 0 0 0 0 0 0 ...
## $ sec_acc : num 1 0 0 0 0 0 1 0 0 0 ...
## $ cd_acc : num 0 0 0 0 0 0 0 0 0 0 ...
## $ online : num 0 0 0 0 1 0 0 0 1 1 ...
## $ credit_card : num 0 0 1 1 0 0 0 1 1 0 ...
## $ missing : num 1 1 1 1 1 1 1 1 1 1 ...
## $ prob_CART : num 0 0 0 0 0 0 0 0 0 0 ...
## $ predict_CART : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ predict_RF : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ prob_RF : num 0 0.024 0 0 0 0 0 0.002 0.002 0 ...
## $ deciles : Factor w/ 3 levels "[0,0.006)","[0.006,0.234)",..: 1 2 1 1 1 1 1 1 1 1 ...
tl <- data %>% group_by(deciles) %>% summarize(count = n(),
resp = sum(Personal_loan),
non_resp = count-resp) %>% arrange(desc(resp))

tl <- tl %>% mutate(cum_resp_rate = cumsum(resp)/sum(resp),


cum_non_resp_rate = cumsum(non_resp)/sum(non_resp),
ks = cum_resp_rate-cum_non_resp_rate)
view(tl)
table(data$Personal_loan)
##
## 0 1
## 4504 478
478/(478+4504)
## [1] 0.0959454
