• EDA of the data available. Showcase the results using appropriate graphs
• Apply appropriate clustering on the data and interpret the output
• Build appropriate models on both the train and test data (CART & Random Forest). Interpret all the model
outputs and make the necessary modifications wherever applicable (such as pruning)
• Check the performance of all the models that you have built (test and train). Use all the model performance
measures you have learned so far. Share your remarks on which model performs the best.
2. ASSUMPTIONS
Only the explanations around the models are marked; the grading criteria do not include an assessment of
model performance or of the accuracy of predictions.
4. CLUSTERING
There is no objective function in unsupervised learning against which to test the performance of the algorithm,
so we perform our clustering on the main data directly and do not need to split the data into training and
testing sets. For the purpose of this assignment we use only two types of models:
1: Hierarchical clustering (agglomerative).
2: K-means clustering.
4.1. Hierarchical Clustering (agglomerative)
STEP 1: First we remove the response variable, Personal_loan, and scale the rest of the data. We also
remove the variable Zip_code, as a distance measure between zip codes would not be a meaningful number.
One way to check whether the data is scaled properly is to verify that the mean and standard deviation of
each column are 0 and 1 respectively. Please see Appendix 8.5 for source code.
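The scaling check can be sketched in a few lines. A minimal Python illustration (the report itself uses R's scale(); the column values below are made up, and the population rather than the sample standard deviation is used for brevity):

```python
# Standardize a column to mean 0, sd 1, then verify the result --
# a toy sketch of the scaling check described above. Values are
# invented; R's scale() divides by the sample sd (n-1) instead.
from statistics import mean, pstdev

def standardize(column):
    """Return the column scaled to mean 0 and (population) sd 1."""
    m, s = mean(column), pstdev(column)
    return [(x - m) / s for x in column]

income = [49, 34, 11, 100, 45]   # illustrative values
scaled = standardize(income)

# After scaling, the mean should be ~0 and the sd ~1.
print(abs(mean(scaled)) < 1e-9)    # True
print(abs(pstdev(scaled) - 1) < 1e-9)  # True
```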
STEP 2: Calculate the distance matrix on the scaled data. There are multiple methods to calculate the distance
between observations; for our analysis we use Euclidean distance. We build a hierarchical clustering with
linkage method “average” and divide the data into two clusters: one corresponding to people who responded
to the loan solicitation and the other to people who did not.
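The average-linkage idea behind this step can be sketched as follows. This is a toy Python illustration, not the appendix code: the point values and function names are invented, and R's hclust() is far more efficient than this naive version.

```python
# Naive agglomerative clustering with average linkage: repeatedly
# merge the two clusters whose average pairwise Euclidean distance
# is smallest, until k clusters remain. Toy data, for illustration.
from math import dist

def average_linkage(c1, c2):
    """Average pairwise Euclidean distance between two clusters."""
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerate(points, k):
    """Merge the two closest clusters until only k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the pair of clusters with the smallest average linkage
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: average_linkage(clusters[ij[0]],
                                                  clusters[ij[1]]))
        clusters[i] += clusters.pop(j)   # j > i, so i stays valid
    return clusters

pts = [(1, 1), (1.2, 0.9), (5, 5), (5.1, 4.8), (0.9, 1.1)]
print(agglomerate(pts, 2))  # one cluster near (1,1), one near (5,5)
```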
STEP 3: Draw a dendrogram. The height of the dendrogram indicates the order in which the clusters were
joined. A common mistake people make when reading dendrograms is to assume that the shape of the
dendrogram gives a clue as to how many clusters exist. We see that the dendrogram is not very readable
because of the size of the dataset, and because hierarchical clustering goes all the way down to individual
data points when comparing distances. Because we know that there are only 2 clusters, we proceed with
our analysis by cutting the dendrogram into two pieces.
STEP 4: Cutting the tree into 2 classes and labelling the data with the cluster number, we get the model's
labels for the data. Comparing the number of correct predictions to the overall number of data points, we
see that the model has an accuracy of 89.96%.
Please see Appendix 8.6 For Source Code.
4.2. K-means Clustering
We use the scaled data for K-means clustering. For reproducibility we start by setting a seed.
K-means uses Lloyd's algorithm, which is based on distances between observations and cluster centroids.
For this reason, categorical variables, which merely indicate the presence or absence of an attribute, would
distort the k-means output and hence should be removed from our dataset before running the k-means
algorithm. We remove the categorical variables from our dataset.
The nstart parameter gives the number of random initial centroid sets; we choose 10 for brevity and quickness of code execution.
NOTE: Ideally, NbClust or other methods are used to arrive at the appropriate number of clusters, but for
our analysis we already know that we need to partition the data into two clusters. For this reason, these
methods were not used.
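For illustration, Lloyd's algorithm itself can be sketched in a few lines of Python. The data and function name are invented; R's kmeans() adds the nstart restarts and a within-cluster sum-of-squares comparison on top of this loop.

```python
# Lloyd's algorithm: alternate (1) assigning each point to its nearest
# centroid and (2) moving each centroid to the mean of its points.
# Toy 2-D data; kmeans(nstart = 10) repeats this from 10 random starts.
from math import dist

def lloyd_kmeans(points, centers, iters=20):
    """Run the assignment/update loop for a fixed number of iterations."""
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        groups = [[] for _ in centers]
        for p in points:
            groups[min(range(len(centers)),
                       key=lambda i: dist(p, centers[i]))].append(p)
        # update step: recompute each centroid as its group's mean
        centers = [tuple(sum(c) / len(g) for c in zip(*g))
                   for g in groups if g]
    return centers, groups

pts = [(1, 2), (1.5, 1.8), (5, 8), (8, 8), (1, 0.6), (9, 11)]
centers, groups = lloyd_kmeans(pts, centers=[(1, 2), (5, 8)])
print(centers)  # one centroid near the low cluster, one near the high one
```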
For visual inspection of the clusters, we draw the cluster plot. It is not readable due to the large number of
observations, but it does give a clue that the algorithm has not been able to segregate the data into two
clean sets.
Drawing the confusion matrix shows the model was able to correctly predict 81.69% of the target variable.
Please see Appendix 8.7 for Source Code.
CART has a known problem of overfitting the data, and hence we tune the model by examining the cross-validation
errors and the complexity parameter. We see that the cross-validation error keeps reducing, and hence we do not
prune the model any further. Examining the variable importance gives us the importance of each covariate in the
model.
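The tuning logic — scan the cptable and keep the cp with the lowest cross-validated error (or, under the common 1-SE rule, the simplest tree within one standard error of it) — can be sketched as below. The cptable numbers are invented for illustration, not taken from our model.

```python
# Choosing a complexity parameter from an rpart-style cptable:
# rows of (cp, xerror, xstd), ordered from simplest tree down.
# The numbers here are made up for illustration.
cptable = [
    (0.30, 1.000, 0.045),
    (0.10, 0.420, 0.033),
    (0.01, 0.180, 0.022),
    (0.00, 0.170, 0.021),
]

best = min(cptable, key=lambda row: row[1])   # lowest cross-validated error
threshold = best[1] + best[2]                 # 1-SE rule cut-off
one_se = next(row for row in cptable if row[1] <= threshold)

print(best[0])    # cp with the minimum xerror
print(one_se[0])  # simplest tree within one SE of that minimum
```

In our case the cross-validation error keeps falling, so the minimum sits at the last (most complex) row and no further pruning is done.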
We see that the model gives a good 98.05% accuracy; however, to check whether the model has overfitted the
training data, we use it to predict the response variable in the test dataset. The model gives 98.39% accuracy
on the test data, so we can say it is robust and does not overfit the data used for its development.
6.1. KS Chart on training datasets for both CART and Random Forest models.
A Receiver Operating Characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a
binary classifier as its discrimination threshold is varied. Looking at the ROC for the CART and Random Forest
models, we can say the performance of Random Forest is better on the training dataset.
KS measures are also better for the Random Forest model: Random Forest gives a KS max measure of 0.9911 vs a
figure of 0.9400 for CART.
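The KS max quoted here is the largest gap between TPR and FPR across score thresholds. A self-contained Python sketch with toy scores and labels (not the report's data):

```python
# KS statistic = max over thresholds of (TPR - FPR), i.e. the largest
# vertical gap between the cumulative score distributions of the two
# classes. Scores and labels below are toy values.
def ks_statistic(scores, labels):
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    best = 0.0
    # sweep the threshold from the highest score downward
    for _, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        best = max(best, tp / pos - fp / neg)
    return best

scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1,    1,    0,    1,    0,    0,    0]
print(ks_statistic(scores, labels))  # 0.75
```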
On the test data, too, the ROC curves are better for Random Forest than for the CART model. The KS max measure
for Random Forest is 0.9567, compared to 0.9415 for the CART model.
Overall, looking at the ROC and KS figures, we can conclude that Random Forest performed better than CART.
The AUC values are as below:
MODEL TRAIN DATASET TEST DATASET
CART 0.9954651 0.9958384
RANDOM FOREST 0.9997411 0.9980149
Please see Appendix 8.15 for source code.
6.3. GINI VALUES: We get the Gini values below, indicating Random Forest to be the slightly
better model.
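For reference, the Gini coefficient reported by ineq::ineq(x, "gini") follows the Lorenz-curve formula, which a short Python sketch (toy inputs, not the model probabilities) makes explicit:

```python
# Gini coefficient of a vector, computed from the Lorenz curve as in
# R's ineq::ineq(x, "gini"). Inputs below are toy vectors.
def gini(values):
    x = sorted(values)
    n = len(x)
    # rank-weighted sum: 1*x_(1) + 2*x_(2) + ... + n*x_(n)
    ranked = sum((i + 1) * v for i, v in enumerate(x))
    return 2 * ranked / (n * sum(x)) - (n + 1) / n

print(gini([1, 1, 1, 1]))  # 0.0  -- perfectly equal values
print(gini([0, 0, 0, 1]))  # 0.75 -- highly concentrated values
```

A higher Gini on the predicted probabilities means the model concentrates its high scores on fewer observations, which is why it is used here as a separation measure.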
STEP 2: Break the dataset into 10 parts with probability buckets equal to the probabilities provided by the Random
Forest model. Once the dataset is broken into these deciles, we calculate response rates, cumulative response rates,
non-response rates, cumulative non-response rates and KS figures for each decile.
STEP 3: Calculate the average response rate over the entire data set.
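The decile logic of STEPs 2-3 can be sketched as a small gains table. This is a Python illustration with invented probabilities and responses; the report's actual cut points come from quantile() on the Random Forest probabilities.

```python
# Bucket observations by model probability, then compute the response
# rate per bucket and the cumulative share of responders captured --
# the decile/gains logic of STEPs 2-3. Data below are invented toys.
def gains_table(probs, responses, n_buckets=2):
    rows = sorted(zip(probs, responses), reverse=True)  # best scores first
    size = len(rows) // n_buckets
    total_resp = sum(responses)
    cum_resp = 0
    table = []
    for b in range(n_buckets):
        chunk = rows[b * size:(b + 1) * size]
        resp = sum(y for _, y in chunk)
        cum_resp += resp
        table.append({"bucket": b + 1,
                      "response_rate": resp / len(chunk),
                      "cum_response_share": cum_resp / total_resp})
    return table

probs =     [0.9, 0.8, 0.7, 0.2, 0.1, 0.05]
responses = [1,   1,   0,   0,   1,   0]
for row in gains_table(probs, responses):
    print(row)
```

Targeting only the top bucket, where the response rate is highest, is exactly the kind of recommendation made to Thera Bank below.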
We can suggest that Thera Bank direct their campaign efforts only to customers in the [0.275,1] decile. By
targeting these 499 specific customers, Thera Bank would see around a 98% response rate, compared with
targeting all the customers, where the response rate would be around 478/4504, i.e. 9.6%.
Please see Appendix 8.18 for source code.
8. APPENDIX
8.1.
setwd("E:/BRK CLASSES/BABI/04 Data Mining/Week 6 Final Assigment")
library(corrplot)
library(tidyverse)
library(readxl)
library(caTools)
library(rpart)
library(rpart.plot)
library(ROCR)
8.2.
df <- read_excel("Thera Bank_Personal_Loan_Modelling-dataset-1.xlsx",sheet = "Bank_Personal_Loan_Modelling")
Changing the column names to more meaningful, R-friendly names:
colnames(df) <-
c("ID","Age","Experience","Income","Zip_code","family_members","ccavg","Education","Mortgage","Personal_loan","sec_acc","cd_acc","online","credit_card")
names(df)
## [1] "ID" "Age" "Experience" "Income"
## [5] "Zip_code" "family_members" "ccavg" "Education"
## [9] "Mortgage" "Personal_loan" "sec_acc" "cd_acc"
## [13] "online" "credit_card"
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5000 obs. of 14 variables:
## $ ID : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Age : num 25 45 39 35 35 37 53 50 35 34 ...
## $ Experience : num 1 19 15 9 8 13 27 24 10 9 ...
## $ Income : num 49 34 11 100 45 29 72 22 81 180 ...
## $ Zip_code : num 91107 90089 94720 94112 91330 ...
## $ family_members: num 4 3 1 1 4 4 2 1 3 1 ...
## $ ccavg : num 1.6 1.5 1 2.7 1 0.4 1.5 0.3 0.6 8.9 ...
## $ Education : num 1 1 1 2 2 2 2 3 2 3 ...
## $ Mortgage : num 0 0 0 0 0 155 0 0 104 0 ...
## $ Personal_loan : num 0 0 0 0 0 0 0 0 0 1 ...
## $ sec_acc : num 1 1 0 0 0 0 0 0 0 0 ...
## $ cd_acc : num 0 0 0 0 0 0 0 0 0 0 ...
## $ online : num 0 0 0 0 0 1 1 0 1 0 ...
## $ credit_card : num 0 0 0 0 1 0 0 1 0 0 ...
df1 <- df %>% select(-ID)
8.3.
sum(is.na(df$Age))
## [1] 0
sum(is.na(df$Experience))
## [1] 0
sum(is.na(df$Income))
## [1] 0
sum(is.na(df$Zip_code))
## [1] 0
sum(is.na(df$family_members))
## [1] 18
df[!complete.cases(df), ]
## # A tibble: 18 x 15
## ID Age Experience Income Zip_code family_members ccavg Education
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 56 31 25 94015 NA 0.9 2
## 2 59 28 2 93 94065 NA 0.2 1
## 3 99 49 23 94 92374 NA 0.3 1
## 4 162 61 35 80 95053 NA 2.8 1
## 5 236 38 8 71 94720 NA 1.8 3
## 6 290 42 15 24 92121 NA 1 2
## 7 488 39 13 88 94117 NA 1.4 2
## 8 722 49 24 39 92717 NA 1.4 3
## 9 1461 40 16 85 92677 NA 0.2 3
## 10 1462 54 28 48 93022 NA 0.2 1
## 11 2400 62 36 41 90245 NA 1 3
## 12 2833 45 21 133 92056 NA 5.7 3
## 13 3702 58 33 95 90503 NA 2.6 1
## 14 4136 48 23 168 95929 NA 2.8 1
## 15 4139 47 22 114 95819 NA 0.6 1
## 16 4403 55 25 52 90095 NA 1.4 3
## 17 4404 50 24 112 92064 NA 0 1
## 18 4764 51 25 173 95051 NA 0.5 2
## # ... with 7 more variables: Mortgage <dbl>, Personal_loan <dbl>,
## # sec_acc <dbl>, cd_acc <dbl>, online <dbl>, credit_card <dbl>,
## # missing <dbl>
8.4.
numeric.list <- sapply(df2, is.numeric)
numeric.list
sum(numeric.list)
## [1] 14
8.5.
str(df2)
apply(df3,2,sd)
8.6.
dist_matrix <- dist(df3, method = "euclidean")
# hierarchical clustering with average linkage, cut into 2 clusters
hcl <- hclust(dist_matrix, method = "average")
df2$hcluster <- cutree(hcl, k = 2)
t <- table(df2$Personal_loan, df2$hcluster)
t
##
## 1 2
## 0 4343 161
## 1 339 139
sum(diag(t))/sum(t)
## [1] 0.8996387
8.7.
df3 <- scale(df2[,-c(3,8,9,10,11,12)])
attributes(df3)
seed = 1001
set.seed(seed)
# k-means with 2 centers and 10 random starts (Lloyd's algorithm)
clust <- kmeans(df3, centers = 2, nstart = 10)
df2$cluster <- clust$cluster
t <- table(df2$Personal_loan,df2$cluster)
##
## 1 2
## 0 3667 837
## 1 75 403
sum(diag(t))/sum(t)
## [1] 0.816941
8.8.
library(rpart)
library(rpart.plot)
r.ctrl = rpart.control(minsplit = 100, minbucket = 10, cp = 0, xval = 10)
Cart_model = rpart(Personal_loan ~ ., data = train, method = "class", control = r.ctrl)
rpart.plot(Cart_model)
8.9.
Cart_model$cptable
Cart_model$variable.importance
8.10.
train$prob_CART = predict(Cart_model, newdata = train[,-8])[,"1"]
train$predict_CART = predict(Cart_model, newdata = train[,-8], type = "class")
t <- table(train$Personal_loan,train$predict_CART)
t
##
## 0 1
## 0 3140 13
## 1 55 280
sum(diag(t))/sum(t)
## [1] 0.9805046
test$prob_CART = predict(Cart_model, newdata = test[,-8])[,"1"]
test$predict_CART = predict(Cart_model, newdata = test[,-8], type = "class")
t <- table(test$Personal_loan,test$predict_CART)
t
##
## 0 1
## 0 1348 3
## 1 21 122
sum(diag(t))/sum(t)
## [1] 0.9839357
8.11.
library(randomForest)
## randomForest 4.6-14
##
## Attaching package: 'randomForest'
set.seed(1000)
RFmodel = randomForest(Personal_loan ~ ., data = train[,-c(13,14)], mtry = 5, nodesize = 10, ntree = 501, importance = TRUE)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
print(RFmodel)
##
## Call:
## randomForest(formula = Personal_loan ~ ., data = train[, -c(13, 14)], mtry = 5, nodesize = 10, ntree = 501, importance = TRUE)
## Type of random forest: regression
## Number of trees: 501
## No. of variables tried at each split: 5
##
## Mean of squared residuals: 0.01059858
## % Var explained: 87.79
plot(RFmodel)
8.12.
tRF <- tuneRF(x = train[,-c(8,13,14)],
y = as.factor(train$Personal_loan),
mtryStart = 5,
ntreeTry = 150,
stepFactor = 1.5,
improve = 0.001,
trace = TRUE,
plot = TRUE,
doBest = TRUE,
nodesize = 10,
importance = TRUE
)
train$prob_RF = predict(tRF, newdata = train, type = "prob")[,"1"]
train$predict_RF = predict(tRF, newdata = train)
t = table(train$Personal_loan, train$predict_RF)
t
##
## 0 1
## 0 3147 6
## 1 26 309
sum(diag(t))/sum(t)
## [1] 0.9908257
test$prob_RF = predict(tRF, newdata = test, type = "prob")[,"1"]
test$predict_RF = predict(tRF, newdata = test)
t = table(test$Personal_loan, test$predict_RF)
t
##
## 0 1
## 0 1347 4
## 1 16 127
sum(diag(t))/sum(t)
## [1] 0.9866131
8.14.
library(ROCR)
predobj_CART <- prediction(train$prob_CART,train$Personal_loan)
predobj_RF <- prediction(train$prob_RF,train$Personal_loan)
perf_CART <- performance(predobj_CART,"tpr","fpr")
perf_RF <- performance(predobj_RF,"tpr","fpr")
plot(perf_CART)
plot(perf_RF)
ks_CART = max(perf_CART@y.values[[1]] - perf_CART@x.values[[1]])
ks_CART
## [1] 0.9400571
ks_RF = max(perf_RF@y.values[[1]] - perf_RF@x.values[[1]])
ks_RF
## [1] 0.9911196
predobj_CART_test <- prediction(test$prob_CART,test$Personal_loan)
predobj_RF_test <- prediction(test$prob_RF,test$Personal_loan)
perf_CART_test <- performance(predobj_CART_test,"tpr","fpr")
perf_RF_test <- performance(predobj_RF_test,"tpr","fpr")
plot(perf_CART_test)
plot(perf_RF_test)
ks_CART_test =max(perf_CART_test@y.values[[1]] - perf_CART_test@x.values[[1]])
ks_RF_test = max(perf_RF_test@y.values[[1]] - perf_RF_test@x.values[[1]])
ks_CART_test
## [1] 0.9415248
ks_RF_test
## [1] 0.9567376
8.15.
auc_CART_train <- performance(predobj_CART,"auc")
auc_CART_train <- as.numeric(auc_CART_train@y.values)
auc_CART_train
## [1] 0.9954651
auc_CART_test <- as.numeric(performance(prediction(test$prob_CART,test$Personal_loan),"auc")@y.values)
auc_CART_test
## [1] 0.9958384
auc_RF_train <- as.numeric(performance(prediction(train$prob_RF,train$Personal_loan),"auc")@y.values)
auc_RF_train
## [1] 0.9997411
auc_RF_test <- as.numeric(performance(prediction(test$prob_RF,test$Personal_loan),"auc")@y.values)
auc_RF_test
## [1] 0.9980149
8.16.
library(ineq)
## Warning: package 'ineq' was built under R version 3.5.2
gini_CART_train <- ineq(train$prob_CART,"gini")
gini_CART_test <- ineq(test$prob_CART,"gini")
gini_CART_test
## [1] 0.8959566
gini_CART_train
## [1] 0.8957577
gini_RF_test <- ineq(test$prob_RF,"gini")
gini_RF_test
## [1] 0.8984477
gini_RF_train <- ineq(train$prob_RF,"gini")
gini_RF_train
## [1] 0.9021382
8.17.
library(InformationValue)
## Warning: package 'InformationValue' was built under R version 3.5.3
Concordance(actuals = train$Personal_loan,predictedScores = train$prob_CART)
## $Concordance
## [1] 0.9928142
##
## $Discordance
## [1] 0.007185765
##
## $Tied
## [1] -4.163336e-17
##
## $Pairs
## [1] 1056255
Concordance(actuals = test$Personal_loan,predictedScores = test$prob_CART)
## $Concordance
## [1] 0.9934263
##
## $Discordance
## [1] 0.006573737
##
## $Tied
## [1] 3.469447e-17
##
## $Pairs
## [1] 193193
Concordance(actuals = train$Personal_loan,predictedScores = train$prob_RF)
## $Concordance
## [1] 0.9997406
##
## $Discordance
## [1] 0.0002594071
##
## $Tied
## [1] 3.843497e-17
##
## $Pairs
## [1] 1056255
Concordance(actuals = test$Personal_loan,predictedScores = test$prob_RF)
## $Concordance
## [1] 0.998002
##
## $Discordance
## [1] 0.001998002
##
## $Tied
## [1] 4.336809e-17
##
## $Pairs
## [1] 193193
8.18.
data <- rbind(test,train)
probs = seq(0,1,length = 11)
probs
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
qs <- quantile(data$prob_RF,probs)
qs
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.006 0.234 1.000
data$deciles = cut(data$prob_RF,unique(qs),include.lowest = TRUE,right = FALSE)
view(data)
str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 4982 obs. of 18 variables:
## $ Experience : num 19 9 8 24 5 18 18 11 20 35 ...
## $ Income : num 34 100 45 22 45 81 43 152 158 35 ...
## $ Zip_code : num 90089 94112 91330 93943 90277 ...
## $ family_members: num 3 1 4 1 3 4 2 2 1 1 ...
## $ ccavg : num 1.5 2.7 1 0.3 0.1 2.4 0.7 3.9 2.4 1.2 ...
## $ Education : num 1 2 2 3 2 1 1 1 1 3 ...
## $ Mortgage : num 0 0 0 0 0 0 163 159 0 122 ...
## $ Personal_loan : num 0 0 0 0 0 0 0 0 0 0 ...
## $ sec_acc : num 1 0 0 0 0 0 1 0 0 0 ...
## $ cd_acc : num 0 0 0 0 0 0 0 0 0 0 ...
## $ online : num 0 0 0 0 1 0 0 0 1 1 ...
## $ credit_card : num 0 0 1 1 0 0 0 1 1 0 ...
## $ missing : num 1 1 1 1 1 1 1 1 1 1 ...
## $ prob_CART : num 0 0 0 0 0 0 0 0 0 0 ...
## $ predict_CART : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ predict_RF : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ prob_RF : num 0 0.024 0 0 0 0 0 0.002 0.002 0 ...
## $ deciles : Factor w/ 3 levels "[0,0.006)","[0.006,0.234)",..: 1 2 1 1 1 1 1 1 1 1 ...
tl <- data %>% group_by(deciles) %>% summarize(count = n(),
resp = sum(Personal_loan),
non_resp = count-resp) %>% arrange(desc(resp))