Submitted By:
Sumit Kumar
Introduction:-
The Taiwanese economy experienced tremendous growth during the 1990s, almost doubling in
value along with the other countries known as the Asian Tigers. The country's financial sector was
heavily involved in the growth of real estate during this period. However, in the early 2000s, this
growth slowed and banks in Taiwan turned towards consumer lending to continue the expansion. As
a result, credit requirements were loosened and consumers were encouraged to spend by
borrowing capital.
Back in 2005, credit card issuers in Taiwan faced a cash and credit card debt crisis, with delinquency
expected to peak in the third quarter of 2006 (Chou). In order to increase market share, card-issuing
banks in Taiwan over-issued cash and credit cards to unqualified applicants. At the same time, most
cardholders, irrespective of their repayment ability, overused credit cards for consumption purposes
and accumulated heavy cash and credit card debts. This crisis caused a blow to consumer financial
confidence and presented a big challenge for both banks and cardholders.
Based on the above, our main aim is to identify high-risk customers based on their credit history.
Objective:
This project aims to study different demographic and financial variables of credit card clients in
Taiwan from April 2005 to September 2005 as our predictors, and whether such clients default
as our outcome variable, to answer the following research question: "Do demographic
variables (sex, education, marriage, age) and financial variables (limit balance, repayment status,
amount of bill statement and amount of previous payment) have any impact on the probability of
default payment of credit card clients?". A secondary analysis will also investigate which of the
aforementioned variables are the strongest predictors of credit card default payment.
Data Source:-
Data Dictionary:-
Dataset Description:-
The dataset considered in this analysis is the “Taiwan-Customer defaults” dataset provided
under Capstone Project. This dataset contains payment data from April 2005 to September
2005, from an important bank (a cash and credit card issuer) in Taiwan, and the targets
were credit card holders of the bank. This dataset contains 30000 observations of 25
variables; where each observation corresponds to a particular credit card client. Among the
total 30000 observations, 6636 observations (22.12%) are cardholders with default
payment. The variables of interest in this dataset are demographic variables (gender,
education level, marriage status, and age) and financial variables (amount of given credit,
monthly repayment statuses, monthly amount of bill statements, and monthly amount of
previous payments).
From what is described in the points above, this dataset should be considered the result of a
retrospective observational study.
# Clearing the workspace
rm(list=ls())
gc()
Importing the Packages:-
library(DataExplorer)
library(data.table)
library(dplyr)
library(ggplot2)
library(randomForest)
library(readxl)
library(ROCR)
library(rpart)
library(rpart.plot)
library(usdm)
library(MASS)
library(caret)
library(InformationValue)
Data Import:-
df <- read_excel("E:/Great_Lakes/Capastone/Taiwan-Customer defaults.xls",
skip = 1)
Size of Data
[1] 30000 25
We will change the column name of our dependent variable to "payment_default" for ease
of coding and convert it into a categorical variable.
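A minimal sketch of that step (assuming the column in the Excel sheet is named `default payment next month`, as in the public release of this dataset; adjust if the header differs):

```r
library(dplyr)

# Rename the dependent variable and store it as a categorical (factor) column.
# "default payment next month" is the header in the public release of this
# dataset; change it here if the Excel export uses a different name.
df <- df %>%
  rename(payment_default = `default payment next month`) %>%
  mutate(payment_default = as.factor(payment_default))
```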
Descriptive Statistics
ID LIMIT_BAL SEX EDUCATION
Min. : 1 Min. : 10000 Min. :1.000 Min. :0.000
1st Qu.: 7501 1st Qu.: 50000 1st Qu.:1.000 1st Qu.:1.000
Median :15000 Median : 140000 Median :2.000 Median :2.000
Mean :15000 Mean : 167484 Mean :1.604 Mean :1.853
3rd Qu.:22500 3rd Qu.: 240000 3rd Qu.:2.000 3rd Qu.:2.000
Max. :30000 Max. :1000000 Max. :2.000 Max. :6.000
MARRIAGE AGE PAY_0 PAY_2
Min. :0.000 Min. :21.00 Min. :-2.0000 Min. :-2.0000
1st Qu.:1.000 1st Qu.:28.00 1st Qu.:-1.0000 1st Qu.:-1.0000
Median :2.000 Median :34.00 Median : 0.0000 Median : 0.0000
Mean :1.552 Mean :35.49 Mean :-0.0167 Mean :-0.1338
3rd Qu.:2.000 3rd Qu.:41.00 3rd Qu.: 0.0000 3rd Qu.: 0.0000
Max. :3.000 Max. :79.00 Max. : 8.0000 Max. : 8.0000
PAY_3 PAY_4 PAY_5 PAY_6
Min. :-2.0000 Min. :-2.0000 Min. :-2.0000 Min. :-2.0000
1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000
Median : 0.0000 Median : 0.0000 Median : 0.0000 Median : 0.0000
Mean :-0.1662 Mean :-0.2207 Mean :-0.2662 Mean :-0.2911
3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
Max. : 8.0000 Max. : 8.0000 Max. : 8.0000 Max. : 8.0000
BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4
Min. :-165580 Min. :-69777 Min. :-157264 Min. :-170000
1st Qu.: 3559 1st Qu.: 2985 1st Qu.: 2666 1st Qu.: 2327
Median : 22382 Median : 21200 Median : 20089 Median : 19052
Mean : 51223 Mean : 49179 Mean : 47013 Mean : 43263
3rd Qu.: 67091 3rd Qu.: 64006 3rd Qu.: 60165 3rd Qu.: 54506
Max. : 964511 Max. :983931 Max. :1664089 Max. : 891586
BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2
Min. :-81334 Min. :-339603 Min. : 0 Min. : 0
1st Qu.: 1763 1st Qu.: 1256 1st Qu.: 1000 1st Qu.: 833
Median : 18105 Median : 17071 Median : 2100 Median : 2009
Mean : 40311 Mean : 38872 Mean : 5664 Mean : 5921
3rd Qu.: 50191 3rd Qu.: 49198 3rd Qu.: 5006 3rd Qu.: 5000
Max. :927171 Max. : 961664 Max. :873552 Max. :1684259
PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
Min. : 0 Min. : 0 Min. : 0.0 Min. : 0.0
1st Qu.: 390 1st Qu.: 296 1st Qu.: 252.5 1st Qu.: 117.8
Median : 1800 Median : 1500 Median : 1500.0 Median : 1500.0
Mean : 5226 Mean : 4826 Mean : 4799.4 Mean : 5215.5
3rd Qu.: 4505 3rd Qu.: 4013 3rd Qu.: 4031.5 3rd Qu.: 4000.0
Max. :896040 Max. :621000 Max. :426529.0 Max. :528666.0
payment_default
0:23364
1: 6636
### We can observe some discrepancies immediately: PAY_0 to PAY_6 take values
### between -2 and 8. However, as per the data dictionary the values should be
### -1 for "pay duly" and 1-9 for a payment delay of 1, 2, ... 9 months and
### above. Could it be that the values should be shifted by +1 (so -2 becomes
### -1, ..., 8 becomes 9)? Even then the value 0 remains unaccounted for
### (-1 would become 0 under this transformation).
Univariate Analysis:-
DEMOGRAPHIC VARIABLES
### Let us take a closer look at some of the demographic variables: Sex,
### Education and Marriage. Before proceeding we will encode them as per the
### data dictionary for better readability, convert them to the correct data
### type for categorical variables, and then plot them.
df$SEX <- ifelse(df$SEX == 1, "Male", "Female")
df$EDUCATION <- ifelse(df$EDUCATION == 1, "Graduate School",
                ifelse(df$EDUCATION == 2, "University",
                ifelse(df$EDUCATION == 3, "High School",
                ifelse(df$EDUCATION == 4, "Others", "Unknown"))))
df$MARRIAGE <- ifelse(df$MARRIAGE == 1, "Married",
                ifelse(df$MARRIAGE == 2, "Single", "Others"))
names <- c("SEX", "EDUCATION", "MARRIAGE")
df[names] <- lapply(df[names], as.factor)
plot_bar(df[, c("SEX", "EDUCATION", "MARRIAGE")])
REPAYMENT STATUS
There are 6 monthly variables for the repayment status which need to be recoded as per the
data dictionary for better analysis, and then converted into categorical variables
for modelling. The variables have been coded as follows:
-2 to 0: "Paid Duly"
1: "1 month delay"
2: "2 month delay"
3: "3 month delay"
4: "4 month delay"
5: "5 month delay"
6: "6 month delay"
7: "7 month delay"
8: "8 month delay"
Anything more than 8 has been put as "9 month or more delay".
But before proceeding we will quickly check the number of counts under each category.
PAY_VAR <- lapply(df[, c("PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6")],
                  function(x) table(x))
print(PAY_VAR)
$PAY_0
x
-2 -1 0 1 2 3 4 5 6 7 8
2759 5686 14737 3688 2667 322 76 26 11 9 19
$PAY_2
x
-2 -1 0 1 2 3 4 5 6 7 8
3782 6050 15730 28 3927 326 99 25 12 20 1
$PAY_3
x
-2 -1 0 1 2 3 4 5 6 7 8
4085 5938 15764 4 3819 240 76 21 23 27 3
$PAY_4
x
-2 -1 0 1 2 3 4 5 6 7 8
4348 5687 16455 2 3159 180 69 35 5 58 2
$PAY_5
x
-2 -1 0 2 3 4 5 6 7 8
4546 5539 16947 2626 178 84 17 4 58 1
$PAY_6
x
-2 -1 0 2 3 4 5 6 7 8
4895 5740 16286 2766 184 49 13 19 46 2
As we see from the results above, the number of customers who have delayed payments by 5 or
more months is very low, so we group them under a single category ("5 month or more delay").
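The recoding itself does not appear in the extracted code; one way to implement the scheme above, including the 5-plus grouping (the helper `recode_pay` is ours, not from the original code):

```r
# Map a numeric PAY_* status to the labels used in this report:
# -2..0 -> "Paid Duly"; 1..4 -> "<n> month delay"; 5 and above are grouped.
recode_pay <- function(x) {
  ifelse(x <= 0, "Paid Duly",
  ifelse(x >= 5, "5 month or more delay",
         paste(x, "month delay")))
}

pay_cols <- c("PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6")
df[pay_cols] <- lapply(df[pay_cols], function(x) factor(recode_pay(x)))
```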
names<-c("PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6")
df[names]<-lapply(df[names],as.factor)
# Most people generate a bill quickly but pay it off in small amounts over many
# months (a bill generated in September may get paid off over September,
# October and November).
Bi-Variate Analysis :-
# EDUCATION vs MARRIAGE
ggplot(data=df , aes(EDUCATION)) + geom_bar() +
facet_grid(rows=vars(MARRIAGE)) +
ggtitle('Education Level vs Marriage')
# EDUCATION vs SEX
ggplot(data=df , aes(EDUCATION)) + geom_bar() + facet_grid(rows=vars(SEX)) +
ggtitle('Education Level vs Sex')
# Nothing unexpected: for both Sex and Marriage, Education seems to follow a
# similar distribution.
How do these features look when comparing against DEFAULT?
# EDUCATION vs DEFAULT
ggplot(data=df, aes(EDUCATION)) + geom_bar() +
facet_grid(rows=vars(payment_default)) +
ggtitle('Education profile - Default vs Non-Default')
# MARRIAGE vs DEFAULT
ggplot(data=df, aes(MARRIAGE)) + geom_bar() +
facet_grid(rows=vars(payment_default)) +
ggtitle('Marriage status - Default vs Non-Default')
# SEX vs DEFAULT
ggplot(data=df, aes(SEX)) + geom_bar() +
  facet_grid(rows=vars(payment_default)) +
  ggtitle('Sex profile - Default vs Non-Default')
# However, if a default has happened, then the defaulter is almost equally
# likely to be single or married.
We begin by creating a facet plot displaying the percentage of credit card defaults against
education level faceted by gender and marital status.
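The plotting code for this figure is missing from the extract; a sketch of how such a percentage plot could be built with dplyr and ggplot2 (both loaded above):

```r
library(dplyr)
library(ggplot2)

# Percentage of defaults per education level, split by gender and marital status.
default_rate <- df %>%
  group_by(SEX, MARRIAGE, EDUCATION) %>%
  summarise(pct_default = mean(payment_default == 1) * 100, .groups = "drop")

ggplot(default_rate, aes(x = EDUCATION, y = pct_default)) +
  geom_col() +
  facet_grid(rows = vars(MARRIAGE), cols = vars(SEX)) +
  labs(y = "% of defaults") +
  ggtitle("Default percentage by education, faceted by sex and marital status") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```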
# We may see that male credit card clients have a higher percentage of defaults compared to
female ones (across all education levels and marital statuses). Another trend made apparent in
this plot is that for married and single males and females, the proportion of default decreases
with the education level; this is different for the “Other” marital status, where this trend is nearly
reversed.
# Another interesting plot is a facet plot displaying the amount of given credit for defaulting and
non-defaulting credit card clients, faceted by gender and marital status.
We may see that defaulting credit card clients have a lower amount of given credit (across all
genders and marital statuses).
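A sketch of that facet plot (ggplot2 as loaded above; putting `payment_default` on the x-axis places the credit amounts of the two groups side by side):

```r
library(ggplot2)

# Amount of given credit for defaulting (1) vs non-defaulting (0) clients,
# faceted by gender and marital status.
ggplot(df, aes(x = payment_default, y = LIMIT_BAL)) +
  geom_boxplot() +
  facet_grid(rows = vars(MARRIAGE), cols = vars(SEX)) +
  labs(x = "Default payment", y = "Amount of given credit (NT dollars)")
```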
We may also take a look at the plots displaying the percentage of credit card defaults against the
monthly repayment statuses.
It seems pretty clear that credit card clients paying their bills with some delay have a higher
percentage of defaults compared to clients paying on time (across all months).
To conclude, we display the natural logarithm of the monthly amount of previous payments for
defaulting and non-defaulting credit card clients.
This plot shows that non-defaulting clients make larger previous payments (across all
months). We also see a lot of variation (due to the presence of a large number of zeros)
in the previous-payment amounts of defaulting clients (across all months).
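The corresponding code is also missing from the extract; with data.table (loaded above) for reshaping, one possible sketch:

```r
library(data.table)
library(ggplot2)

# Long format: one row per client-month, keeping the default flag.
pay_long <- melt(as.data.table(df),
                 id.vars = "payment_default",
                 measure.vars = paste0("PAY_AMT", 1:6),
                 variable.name = "month", value.name = "amount")

# log1p() keeps the many zero payments visible instead of dropping them.
ggplot(pay_long, aes(x = payment_default, y = log1p(amount))) +
  geom_boxplot() +
  facet_wrap(~ month) +
  labs(y = "log(1 + amount of previous payment)")
```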
Outlier treatment
Outlier <- data.frame(apply(df[, c("LIMIT_BAL", "BILL_AMT1", "BILL_AMT2",
                                   "BILL_AMT3", "BILL_AMT4", "BILL_AMT5",
                                   "BILL_AMT6", "PAY_AMT1", "PAY_AMT2",
                                   "PAY_AMT3", "PAY_AMT4", "PAY_AMT5",
                                   "PAY_AMT6")],
                            2, function(x) quantile(x, probs = seq(0, 1, by = 0.00001))))
head(Outlier)
tail(Outlier)
[Extreme rows of the quantile table; the 0.00% row equals the column minima and
the 100.00% row the column maxima:]

          LIMIT_BAL BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT5 BILL_AMT6
0.00%         10000   -165580    -69777   -157264   -170000    -81334   -339603
100.00%     1000000    964511    983931   1664089    891586    927171    961664

          PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
0.00%            0        0        0        0        0        0
100.00%     873552  1684259   896040   621000   426529   528666
We find two major outliers, one at each end of the distribution, which will be removed.
Imbalance_Check <- aggregate(ID ~ payment_default, df, length)
colnames(Imbalance_Check)[2] <- "Client_Count"
Imbalance_Check$Contribution <-
  (Imbalance_Check$Client_Count / sum(Imbalance_Check$Client_Count)) * 100
Imbalance_Check
  payment_default Client_Count Contribution
1               0        23362     77.87853
2               1         6636     22.12147
Collinearity Check
Let us plot the correlation among the variables and also their respective VIF
numeric_fields <- c("LIMIT_BAL", "BILL_AMT1", "BILL_AMT2", "BILL_AMT3",
                    "BILL_AMT4", "BILL_AMT5", "BILL_AMT6",
                    "PAY_AMT1", "PAY_AMT2", "PAY_AMT3",
                    "PAY_AMT4", "PAY_AMT5", "PAY_AMT6")
df_numeric <- subset(df, select = numeric_fields)
plot_correlation(df_numeric)
vif(df_numeric)
Variables VIF
LIMIT_BAL 1.171354
BILL_AMT1 16.60771
BILL_AMT2 28.42928
BILL_AMT3 20.221
BILL_AMT4 23.19036
BILL_AMT5 32.06416
BILL_AMT6 18.86295
PAY_AMT1 2.382365
PAY_AMT2 2.746476
PAY_AMT3 2.822173
PAY_AMT4 2.264827
PAY_AMT5 1.543725
PAY_AMT6 1.156365
There is high collinearity among the 6 variables corresponding to BILL AMOUNT, so they are not
suitable for modelling and will be excluded from our models. In their place we generate monthly
ratio variables: the amount of payment made divided by the bill amount of the corresponding
month.
# Ratio of payment made to the bill amount for each month; NaN (0/0) and
# Inf (a payment against a zero bill) are set to 0.
pay_ratio <- function(pay, bill) {
  r <- pay / bill
  ifelse(is.nan(r) | is.infinite(r), 0, round(r, 2))
}
df$PAY_RATIO_APR  <- pay_ratio(df$PAY_AMT1, df$BILL_AMT1)
df$PAY_RATIO_MAY  <- pay_ratio(df$PAY_AMT2, df$BILL_AMT2)
df$PAY_RATIO_JUNE <- pay_ratio(df$PAY_AMT3, df$BILL_AMT3)
df$PAY_RATIO_JULY <- pay_ratio(df$PAY_AMT4, df$BILL_AMT4)
df$PAY_RATIO_AUG  <- pay_ratio(df$PAY_AMT5, df$BILL_AMT5)
df$PAY_RATIO_SEPT <- pay_ratio(df$PAY_AMT6, df$BILL_AMT6)
numeric_fields <- c("LIMIT_BAL", "BILL_AMT1", "BILL_AMT2", "BILL_AMT3",
                    "BILL_AMT4", "BILL_AMT5", "BILL_AMT6",
                    "PAY_RATIO_APR", "PAY_RATIO_MAY", "PAY_RATIO_JUNE",
                    "PAY_RATIO_JULY", "PAY_RATIO_AUG", "PAY_RATIO_SEPT",
                    "PAY_AMT1", "PAY_AMT2", "PAY_AMT3",
                    "PAY_AMT4", "PAY_AMT5", "PAY_AMT6")
df_numeric <- subset(df, select = numeric_fields)
plot_correlation(df_numeric)
We can see that collinearity among the new ratio variables is much lower, so they will be used
for modelling.
Train and Test Data
set.seed(1234)  # for a reproducible split
train_df <- sample_frac(df, 0.75)
test_df <- subset(df, !(df$ID %in% train_df$ID))
Imbalance check of the dependent variable in the Train and Test data
Imbalance_Check_train_df <- aggregate(ID ~ payment_default, train_df, length)
colnames(Imbalance_Check_train_df)[2] <- "Client_Count"
Imbalance_Check_train_df$Contribution <-
  (Imbalance_Check_train_df$Client_Count /
     sum(Imbalance_Check_train_df$Client_Count)) * 100
Imbalance_Check_train_df
Train Data :-
  payment_default Client_Count Contribution
1               0        17529     77.91359
2               1         4969     22.08641
Imbalance_Check_test_df <- aggregate(ID ~ payment_default, test_df, length)
colnames(Imbalance_Check_test_df)[2] <- "Client_Count"
Imbalance_Check_test_df$Contribution <-
  (Imbalance_Check_test_df$Client_Count /
     sum(Imbalance_Check_test_df$Client_Count)) * 100
Imbalance_Check_test_df
Test Data :-
  payment_default Client_Count Contribution
1               0         5833     77.77333
2               1         1667     22.22667
INFORMATION VALUE
Let us check the Information Value of the categorical variables to understand if there are any that
could be omitted.
SEX <- data.frame("SEX" = IV(train_df$SEX, train_df$payment_default))
EDUCATION <- data.frame("EDUCATION" = IV(train_df$EDUCATION, train_df$payment_default))
MARRIAGE <- data.frame("MARRIAGE" = IV(train_df$MARRIAGE, train_df$payment_default))
PAY_0 <- data.frame("PAY_0" = IV(train_df$PAY_0, train_df$payment_default))
PAY_2 <- data.frame("PAY_2" = IV(train_df$PAY_2, train_df$payment_default))
PAY_3 <- data.frame("PAY_3" = IV(train_df$PAY_3, train_df$payment_default))
PAY_4 <- data.frame("PAY_4" = IV(train_df$PAY_4, train_df$payment_default))
PAY_5 <- data.frame("PAY_5" = IV(train_df$PAY_5, train_df$payment_default))
PAY_6 <- data.frame("PAY_6" = IV(train_df$PAY_6, train_df$payment_default))
Iv <- cbind(SEX, EDUCATION, MARRIAGE, PAY_0, PAY_2, PAY_3, PAY_4, PAY_5, PAY_6)
print(Iv)
Information Value :-
The above results suggest that the variables "SEX" and "MARRIAGE" are very weak predictors;
hence we will not use them for modelling.
In our primary analysis, we use logistic regression to assess the effect that demographic and
financial variables have on the default payment. Our null hypothesis is that demographic and
financial variables have no effect on the default payment; the alternative hypothesis is that
such an effect exists.
LOGISTIC REGRESSION
mod_1 <- glm(payment_default ~ . - (ID + BILL_AMT1 + BILL_AMT2 + BILL_AMT3 +
                                    BILL_AMT4 + BILL_AMT5 + BILL_AMT6 +
                                    SEX + MARRIAGE),
             train_df, family = binomial)
summary(mod_1)
Call:
glm(formula = payment_default ~ . - (ID + BILL_AMT1 + BILL_AMT2 +
BILL_AMT3 + BILL_AMT4 + BILL_AMT5 + BILL_AMT6 + SEX + MARRIAGE),
family = binomial, data = train_df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3032 -0.5851 -0.5260 -0.3412 3.6329
Confusion Matrix
predict_1 <- predict(mod_1, train_df, type = 'response')
prob_1 <- ifelse(predict_1 > 0.5, 1, 0)
confusion_matrix <- table(prob_1, train_df$payment_default)
print(confusion_matrix)
prob_1 0 1
0 16667 3198
1 827 1806
Model Accuracy
Accuracy<-sum(diag(confusion_matrix))/sum(confusion_matrix)
print(Accuracy*100)
[1] 82.10952
We will use the Step AIC methodology to find the best subset of variables.
step_AIC<-stepAIC(mod_1,direction='backward')
Start: AIC=19782.58
payment_default ~ (ID + LIMIT_BAL + SEX + EDUCATION + MARRIAGE +
AGE + PAY_0 + PAY_2 + PAY_3 + PAY_4 + PAY_5 + PAY_6 + BILL_AMT1 +
BILL_AMT2 + BILL_AMT3 + BILL_AMT4 + BILL_AMT5 + BILL_AMT6 +
PAY_AMT1 + PAY_AMT2 + PAY_AMT3 + PAY_AMT4 + PAY_AMT5 + PAY_AMT6 +
PAY_RATIO_APR + PAY_RATIO_MAY + PAY_RATIO_JUNE + PAY_RATIO_JULY +
PAY_RATIO_AUG + PAY_RATIO_SEPT) - (ID + BILL_AMT1 + BILL_AMT2 +
BILL_AMT3 + BILL_AMT4 + BILL_AMT5 + BILL_AMT6 + SEX + MARRIAGE)
Df Deviance AIC
- PAY_RATIO_AUG 1 19691 19781
- PAY_RATIO_MAY 1 19691 19781
- PAY_RATIO_SEPT 1 19691 19781
- PAY_RATIO_JUNE 1 19691 19781
- PAY_RATIO_JULY 1 19691 19781
<none> 19691 19783
- PAY_AMT5 1 19693 19783
- PAY_RATIO_APR 1 19695 19785
- PAY_AMT3 1 19695 19785
- PAY_AMT6 1 19696 19786
- PAY_AMT4 1 19697 19787
- PAY_5 4 19703 19787
- PAY_AMT2 1 19701 19791
- PAY_4 4 19712 19796
- PAY_3 4 19714 19798
- AGE 1 19711 19801
- PAY_2 5 19720 19802
- EDUCATION 4 19719 19803
- PAY_6 4 19719 19803
- PAY_AMT1 1 19714 19804
- LIMIT_BAL 1 19734 19824
- PAY_0 5 20785 20867
Step: AIC=19780.58
payment_default ~ LIMIT_BAL + EDUCATION + AGE + PAY_0 + PAY_2 +
PAY_3 + PAY_4 + PAY_5 + PAY_6 + PAY_AMT1 + PAY_AMT2 + PAY_AMT3 +
PAY_AMT4 + PAY_AMT5 + PAY_AMT6 + PAY_RATIO_APR + PAY_RATIO_MAY +
PAY_RATIO_JUNE + PAY_RATIO_JULY + PAY_RATIO_SEPT
Df Deviance AIC
- PAY_RATIO_MAY 1 19691 19779
- PAY_RATIO_SEPT 1 19691 19779
- PAY_RATIO_JUNE 1 19691 19779
- PAY_RATIO_JULY 1 19691 19779
<none> 19691 19781
- PAY_AMT5 1 19693 19781
- PAY_RATIO_APR 1 19695 19783
- PAY_AMT3 1 19695 19783
- PAY_AMT6 1 19696 19784
- PAY_AMT4 1 19697 19785
- PAY_5 4 19703 19785
- PAY_AMT2 1 19701 19789
- PAY_4 4 19712 19794
- PAY_3 4 19714 19796
- AGE 1 19711 19799
- PAY_2 5 19720 19800
- EDUCATION 4 19719 19801
- PAY_6 4 19719 19801
- PAY_AMT1 1 19714 19802
- LIMIT_BAL 1 19734 19822
- PAY_0 5 20785 20865
Step: AIC=19778.73
payment_default ~ LIMIT_BAL + EDUCATION + AGE + PAY_0 + PAY_2 +
PAY_3 + PAY_4 + PAY_5 + PAY_6 + PAY_AMT1 + PAY_AMT2 + PAY_AMT3 +
PAY_AMT4 + PAY_AMT5 + PAY_AMT6 + PAY_RATIO_APR + PAY_RATIO_JUNE +
PAY_RATIO_JULY + PAY_RATIO_SEPT
Df Deviance AIC
- PAY_RATIO_SEPT 1 19691 19777
- PAY_RATIO_JUNE 1 19691 19777
- PAY_RATIO_JULY 1 19692 19778
<none> 19691 19779
- PAY_AMT5 1 19693 19779
- PAY_RATIO_APR 1 19695 19781
- PAY_AMT3 1 19695 19781
- PAY_AMT6 1 19696 19782
- PAY_AMT4 1 19697 19783
- PAY_5 4 19703 19783
- PAY_AMT2 1 19701 19787
- PAY_4 4 19712 19792
- PAY_3 4 19714 19794
- AGE 1 19711 19797
- PAY_2 5 19720 19798
- EDUCATION 4 19719 19799
- PAY_6 4 19719 19799
- PAY_AMT1 1 19714 19800
- LIMIT_BAL 1 19734 19820
- PAY_0 5 20786 20864
Step: AIC=19777.2
payment_default ~ LIMIT_BAL + EDUCATION + AGE + PAY_0 + PAY_2 +
PAY_3 + PAY_4 + PAY_5 + PAY_6 + PAY_AMT1 + PAY_AMT2 + PAY_AMT3 +
PAY_AMT4 + PAY_AMT5 + PAY_AMT6 + PAY_RATIO_APR + PAY_RATIO_JUNE +
PAY_RATIO_JULY
Df Deviance AIC
- PAY_RATIO_JUNE 1 19692 19776
- PAY_RATIO_JULY 1 19692 19776
<none> 19691 19777
- PAY_AMT5 1 19694 19778
- PAY_RATIO_APR 1 19695 19779
- PAY_AMT3 1 19695 19779
- PAY_AMT6 1 19696 19780
- PAY_AMT4 1 19697 19781
- PAY_5 4 19704 19782
- PAY_AMT2 1 19701 19785
- PAY_4 4 19712 19790
- PAY_3 4 19714 19792
- AGE 1 19711 19795
- PAY_2 5 19720 19796
- EDUCATION 4 19720 19798
- PAY_6 4 19720 19798
- PAY_AMT1 1 19714 19798
- LIMIT_BAL 1 19735 19819
- PAY_0 5 20786 20862
Step: AIC=19775.73
payment_default ~ LIMIT_BAL + EDUCATION + AGE + PAY_0 + PAY_2 +
PAY_3 + PAY_4 + PAY_5 + PAY_6 + PAY_AMT1 + PAY_AMT2 + PAY_AMT3 +
PAY_AMT4 + PAY_AMT5 + PAY_AMT6 + PAY_RATIO_APR + PAY_RATIO_JULY
Df Deviance AIC
- PAY_RATIO_JULY 1 19693 19775
<none> 19692 19776
- PAY_AMT5 1 19694 19776
- PAY_RATIO_APR 1 19696 19778
- PAY_AMT3 1 19696 19778
- PAY_AMT6 1 19697 19779
- PAY_AMT4 1 19698 19780
- PAY_5 4 19704 19780
- PAY_AMT2 1 19702 19784
- PAY_4 4 19713 19789
- PAY_3 4 19715 19791
- AGE 1 19712 19794
- PAY_2 5 19721 19795
- EDUCATION 4 19720 19796
- PAY_6 4 19720 19796
- PAY_AMT1 1 19715 19797
- LIMIT_BAL 1 19735 19817
- PAY_0 5 20787 20861
Step: AIC=19774.86
payment_default ~ LIMIT_BAL + EDUCATION + AGE + PAY_0 + PAY_2 +
PAY_3 + PAY_4 + PAY_5 + PAY_6 + PAY_AMT1 + PAY_AMT2 + PAY_AMT3 +
PAY_AMT4 + PAY_AMT5 + PAY_AMT6 + PAY_RATIO_APR
Df Deviance AIC
<none> 19693 19775
- PAY_AMT5 1 19696 19776
- PAY_RATIO_APR 1 19697 19777
- PAY_AMT3 1 19697 19777
- PAY_AMT6 1 19698 19778
- PAY_AMT4 1 19699 19779
- PAY_5 4 19706 19780
- PAY_AMT2 1 19703 19783
- PAY_4 4 19714 19788
- PAY_3 4 19716 19790
- AGE 1 19713 19793
- PAY_2 5 19722 19794
- EDUCATION 4 19721 19795
- PAY_6 4 19721 19795
- PAY_AMT1 1 19716 19796
- LIMIT_BAL 1 19737 19817
- PAY_0 5 20788 20860
# Refit the model with the variable subset selected by stepAIC
mod_2 <- glm(formula(step_AIC), data = train_df, family = "binomial")
summary(mod_2)
Call:
glm(formula = payment_default ~ LIMIT_BAL + EDUCATION + PAY_0 +
PAY_2 + PAY_3 + PAY_4 + PAY_5 + PAY_6 + PAY_AMT1 + PAY_AMT2 +
PAY_AMT4 + PAY_AMT5 + PAY_AMT6 + PAY_RATIO_APR, family = "binomial",
data = train_df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2738 -0.5833 -0.5285 -0.3460 3.6815
Predicting the model probabilities along with the Confusion Matrix and Accuracy
predict_2<-predict(mod_2,train_df,type='response')
prob_2<-ifelse(predict_2>0.5,1,0)
#Confusion Matrix
confusion_matrix<-table(prob_2, train_df$payment_default)
print(confusion_matrix)
#Model Accuracy
Accuracy<-sum(diag(confusion_matrix))/sum(confusion_matrix)
print(Accuracy*100)
[1] 82.10508
ROCR Curve
[ROC curve of mod_2 on the training data]
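The code behind this curve is not in the extract; with the ROCR package loaded above, it would look roughly like this (`predict_2` being the fitted probabilities on the training data):

```r
library(ROCR)

# Build the ROCR prediction object from scores and true labels,
# then the true-positive-rate vs false-positive-rate curve.
pred_obj <- prediction(predict_2, train_df$payment_default)
roc_perf <- performance(pred_obj, "tpr", "fpr")
plot(roc_perf, colorize = TRUE)

# AUC as a single-number summary of the curve
auc <- performance(pred_obj, "auc")@y.values[[1]]
print(auc)
```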
Let's check the accuracy of the model with the data set aside for validation
predict_2 <- predict(mod_2, test_df, type = 'response')
prob_2 <- ifelse(predict_2 > 0.5, 1, 0)
confusion_matrix <- table(prob_2, test_df$payment_default)
Accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(Accuracy * 100)
[1] 81.86667
There isn't any significant change in the accuracy on the validation data, so we can assume the
model generalises well. But can we raise the accuracy using some other technique?
Let us try the Decision Tree approach.
DECISION TREE
set.seed(1234)
mymod <- rpart(payment_default ~ PAY_AMT1 + PAY_AMT2 + PAY_AMT3 + PAY_AMT4 +
                 PAY_AMT5 + PAY_AMT6 + LIMIT_BAL + EDUCATION +
                 PAY_0 + PAY_2 + PAY_3 + PAY_4 + PAY_5 + PAY_6 +
                 BILL_AMT1 + BILL_AMT2 + BILL_AMT3 + BILL_AMT4 + BILL_AMT5 +
                 BILL_AMT6 + AGE + SEX + MARRIAGE,
               data = train_df, method = "class",
               control = rpart.control(cp = 0.0001, minsplit = 30,
                                       minbucket = 30 * 2,
                                       maxsurrogate = 5, usesurrogate = 2,
                                       xval = 10, maxdepth = 30))
printcp(mymod)
Classification tree:
rpart(formula = payment_default ~ PAY_AMT1 + PAY_AMT2 + PAY_AMT3 +
PAY_AMT4 + PAY_AMT5 + PAY_AMT6 + LIMIT_BAL + EDUCATION +
PAY_0 + PAY_2 + PAY_3 + PAY_4 + PAY_5 + PAY_6 + BILL_AMT1 +
BILL_AMT2 + BILL_AMT3 + BILL_AMT4 + BILL_AMT5 + BILL_AMT6 +
AGE + SEX + MARRIAGE, data = train_df, method = "class",
control = rpart.control(cp = 1e-04, minsplit = 30, minbucket = 30 *
2, maxsurrogate = 5, usesurrogate = 2, xval = 10, maxdepth = 30))
n= 22498
Let's have a look at the variable importance and drop the variables that are not of much
importance.
mymod$variable.importance
set.seed(1234)
mymod <- rpart(payment_default ~ PAY_AMT1 + PAY_AMT2 + PAY_AMT3 + PAY_AMT4 +
                 PAY_AMT5 + PAY_AMT6 + LIMIT_BAL +
                 PAY_0 + PAY_2 + PAY_3 + PAY_4 + PAY_5 + PAY_6 +
                 BILL_AMT1 + BILL_AMT2 + BILL_AMT3 + BILL_AMT4 + BILL_AMT5 +
                 BILL_AMT6 + AGE,
               data = train_df, method = "class",
               control = rpart.control(cp = 0.0001, minsplit = 30,
                                       minbucket = 30 * 2,
                                       maxsurrogate = 5, usesurrogate = 2,
                                       xval = 10, maxdepth = 30))
printcp(mymod)
Classification tree:
rpart(formula = payment_default ~ PAY_AMT1 + PAY_AMT2 + PAY_AMT3 +
PAY_AMT4 + PAY_AMT5 + PAY_AMT6 + LIMIT_BAL + PAY_0 + PAY_2 +
PAY_3 + PAY_4 + PAY_5 + PAY_6 + BILL_AMT1 + BILL_AMT2 + BILL_AMT3 +
BILL_AMT4 + BILL_AMT5 + BILL_AMT6 + AGE, data = train_df,
method = "class", control = rpart.control(cp = 1e-04, minsplit = 30,
minbucket = 30 * 2, maxsurrogate = 5, usesurrogate = 2,
xval = 10, maxdepth = 30))
n= 22498
We will prune the tree based on the best complexity parameter that has the least error.
set.seed(1234)
bestcp <- mymod$cptable[which.min(mymod$cptable[, "xerror"]), "CP"]
# Prune to the best cp and check the accuracy on the training data
mymod_pruned <- prune(mymod, cp = bestcp)
predict_tree <- predict(mymod_pruned, train_df, type = "class")
confusion_matrix <- table(predict_tree, train_df$payment_default)
Accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(Accuracy * 100)
[1] 81.96729
[Pruned classification tree plotted with rpart.plot. The visible splits involve
the repayment statuses (PAY_3, PAY_5, PAY_6), the payment amounts (PAY_AMT1,
PAY_AMT3-PAY_AMT6), the bill amounts (BILL_AMT1, BILL_AMT3, BILL_AMT6),
LIMIT_BAL and AGE.]
Let's check the accuracy of the model with the data set aside for validation
Actual
Predicted 0 1
0 5624 1111
1 244 521
[1] 81.93333
The accuracy achieved on the data set aside for validation is also similar, hence our model fits
the data well. Overall, we see some improvement in accuracy using the
Decision Tree approach. Let's try a Random Forest to see whether any further improvement can
be brought to the model.
RANDOM FOREST
seed <- 7
tune_vars <- c("PAY_AMT1", "PAY_AMT2", "PAY_AMT3", "PAY_AMT4", "PAY_AMT5",
               "PAY_AMT6", "LIMIT_BAL", "EDUCATION", "PAY_0", "PAY_2", "PAY_3",
               "PAY_4", "PAY_5", "PAY_6", "BILL_AMT1", "BILL_AMT2", "BILL_AMT3",
               "BILL_AMT4", "BILL_AMT5", "BILL_AMT6", "SEX", "AGE", "MARRIAGE")
mtry <- floor(sqrt(ncol(train_df[, tune_vars])))
metric <- "Accuracy"
control <- trainControl(method = "repeatedcv", number = 10, repeats = 1,
                        search = "grid")
tunegrid <- expand.grid(.mtry = c(1:mtry))
modellist <- list()
for (ntree in c(500, 1000, 1500, 2000, 2500)) {
  set.seed(seed)
  fit <- train(payment_default ~ . - (ID + PAY_RATIO_APR + PAY_RATIO_MAY +
                                      PAY_RATIO_JUNE + PAY_RATIO_JULY +
                                      PAY_RATIO_AUG + PAY_RATIO_SEPT),
               data = train_df, method = "rf", metric = metric,
               tuneGrid = tunegrid, trControl = control, ntree = ntree)
  key <- toString(ntree)
  modellist[[key]] <- fit
}
# compare results
results <- resamples(modellist)
summary(results)
dotplot(results)
Call:
summary.resamples(object = results)
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
500 0.8061361 0.8175556 0.8193333 0.8204727 0.8266667 0.8305914 0
1000 0.8065807 0.8172222 0.8204444 0.8203838 0.8267778 0.8297021 0
1500 0.8065807 0.8170000 0.8211111 0.8207839 0.8258889 0.8314807 0
2000 0.8070253 0.8170000 0.8211111 0.8206061 0.8263333 0.8310360 0
2500 0.8065807 0.8171111 0.8208889 0.8204283 0.8260000 0.8301467 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
500 0.3156109 0.3621014 0.3732189 0.3708866 0.3904972 0.4014915 0
1000 0.3177782 0.3617246 0.3746494 0.3710045 0.3896442 0.4025187 0
1500 0.3189702 0.3604446 0.3777197 0.3721434 0.3928142 0.4001422 0
2000 0.3211289 0.3609974 0.3772465 0.3722848 0.3927817 0.4025196 0
2500 0.3201581 0.3618386 0.3760520 0.3712406 0.3903861 0.3983949 0
# Note: 'imrove' is not a randomForest argument and is silently ignored.
rf_random <- randomForest(payment_default ~ . - (ID + PAY_RATIO_APR +
                            PAY_RATIO_MAY + PAY_RATIO_JUNE + PAY_RATIO_JULY +
                            PAY_RATIO_AUG + PAY_RATIO_SEPT),
                          data = train_df, imrove = 0.1, ntree = 500,
                          mtry = mtry, importance = TRUE, nodesize = 30)
print(rf_random)
print(rf_random)
Call:
randomForest(formula = payment_default ~ . - (ID + PAY_RATIO_APR +
PAY_RATIO_MAY + PAY_RATIO_JUNE + PAY_RATIO_JULY + PAY_RATIO_AUG +
PAY_RATIO_SEPT), data = train_df, imrove = 0.1, ntree = 500, mtry =
mtry, importance = TRUE, nodesize = 30)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
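The extract ends with the fitted forest; a natural final step, mirroring the evaluation of the earlier models, is to score the hold-out data (a sketch, not from the original code):

```r
# Confusion matrix and accuracy of the random forest on the validation data.
rf_pred <- predict(rf_random, newdata = test_df)
confusion_matrix <- table(rf_pred, test_df$payment_default)
print(confusion_matrix)
Accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
print(Accuracy * 100)
```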