Capastone Project Taiwan Customer Default

Capstone Project
Taiwan: Customer Defaults
Submitted By:
 Sumit Kumar
Introduction:-
The Taiwanese economy experienced tremendous growth during the 1990’s, almost doubling in
value along with the other countries known as the Asian Tigers. The country’s financial sector was
heavily involved in the growth of real estate during this period. However, in the early 2000’s, this
growth slowed and banks in Taiwan turned towards consumer lending to continue the expansion. As
a result, credit requirements were loosened and consumers were encouraged to spend by
borrowing capital.
Back in 2005, credit card issuers in Taiwan faced a cash and credit card debt crisis, with delinquency
expected to peak in the third quarter of 2006 (Chou). In order to increase market share, card-issuing
banks in Taiwan over-issued cash and credit cards to unqualified applicants. At the same time, most
cardholders, irrespective of their repayment ability, overused credit cards for consumption purposes
and accumulated heavy cash and credit card debts. This crisis caused a blow to consumer financial
confidence and presented a big challenge for both banks and cardholders.
Based on above our main aim is to Identify High risk customers based their credit history
Objective:
This project aims to study different demographic and financial variables of credit card clients in
Taiwan from April 2005 to September 2005 as our predictors, and the fact that such clients default
or not as our outcome variable to answer the following research question : “Do demographic
variables (sex, education, marriage, age) and financial variables (limit balance, repayment status,
amount of bill statement and amount of previous payment) have any impact on the probability of
default payment of credit card clients ?”. A secondary analysis will also investigate which one of the
aforementioned variables are the strongest predictors of credit card default payment.
Data Source:-
Data Dictionary:-
There are 25 variables:

 ID: ID of each client
 LIMIT_BAL: Amount of given credit in NT dollars (includes individual
and family/supplementary credit)
 SEX: Gender (1=male, 2=female)
 EDUCATION: (1=graduate school, 2=university, 3=high school,
4=others)
 MARRIAGE: Marital status (1=married, 2=single, 3=others)
 AGE: Age in years
 PAY_0: Repayment status in September, 2005 (-1=pay duly, 1=payment
delay for one month, 2=payment delay for two months, … 8=payment
delay for eight months, 9=payment delay for nine months and above)
 PAY_2: Repayment status in August, 2005 (scale same as above)
 PAY_3: Repayment status in July, 2005 (scale same as above)
 PAY_4: Repayment status in June, 2005 (scale same as above)
 PAY_5: Repayment status in May, 2005 (scale same as above)
 PAY_6: Repayment status in April, 2005 (scale same as above)
 BILL_AMT1: Amount of bill statement in September, 2005 (NT dollar)
 BILL_AMT2: Amount of bill statement in August, 2005 (NT dollar)
 BILL_AMT3: Amount of bill statement in July, 2005 (NT dollar)
 BILL_AMT4: Amount of bill statement in June, 2005 (NT dollar)
 BILL_AMT5: Amount of bill statement in May, 2005 (NT dollar)
 BILL_AMT6: Amount of bill statement in April, 2005 (NT dollar)
 PAY_AMT1: Amount of previous payment in September, 2005 (NT
dollar)
 PAY_AMT2: Amount of previous payment in August, 2005 (NT dollar)
 PAY_AMT3: Amount of previous payment in July, 2005 (NT dollar)
 PAY_AMT4: Amount of previous payment in June, 2005 (NT dollar)
 PAY_AMT5: Amount of previous payment in May, 2005 (NT dollar)
 PAY_AMT6: Amount of previous payment in April, 2005 (NT dollar)
 default.payment.next.month: Default payment in June, 2005 (1=yes,
0=no)
Dataset Description:-
The dataset considered in this analysis is the “Taiwan-Customer defaults” dataset provided
under Capstone Project. This dataset contains payment data from April 2005 to September
2005, from an important bank (a cash and credit card issuer) in Taiwan, and the targets
were credit card holders of the bank. This dataset contains 30000 observations of 25
variables; where each observation corresponds to a particular credit card client. Among the
total 30000 observations, 6636 observations (22.12%) are cardholders with default
payment. The variables of interest in this dataset are demographic variables (gender,
education level, marriage status, and age) and financial variables (amount of given credit,
monthly repayment statuses, monthly amount of bill statements, and monthly amount of
previous payments).
From what is described in the points above, it seems pretty clear that this dataset should be
considered the result of an observational retrospective study.
# Clearing Variabe
rm(list=ls())
gc()
Importing the Packages:-
library(DataExplorer)
library(data.table)
library(dplyr)
library(randomForest)
library(ROCR)
library(rpart)
library(rpart.plot)
library(usdm)
library(MASS)
library(caret)
library(InformationValue)
Data Import:-
df <- read_excel("E:/Great_Lakes/Capastone/Taiwan-Customer defaults.xls",
skip = 1)
Size of Data
[1] 30000 25
The data has thirty thousand rows and twenty-five columns
Structure of the data

str(df)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 30000 obs. of 25 variables:

$ ID : num 1 2 3 4 5 6 7 8 9 10 ...
$ LIMIT_BAL : num 20000 120000 90000 50000 50000 50000
500000 100000 140000 20000 ...
$ SEX : num 2 2 2 2 1 1 1 2 2 1 ...
$ EDUCATION : num 2 2 2 2 2 1 1 2 3 3 ...
$ MARRIAGE : num 1 2 2 1 1 2 2 2 1 2 ...
$ AGE : num 24 26 34 37 57 37 29 23 28 35 ...
$ PAY_0 : num 2 -1 0 0 -1 0 0 0 0 -2 ...
$ PAY_2 : num 2 2 0 0 0 0 0 -1 0 -2 ...
$ PAY_3 : num -1 0 0 0 -1 0 0 -1 2 -2 ...
$ PAY_4 : num -1 0 0 0 0 0 0 0 0 -2 ...
$ PAY_5 : num -2 0 0 0 0 0 0 0 0 -1 ...
$ PAY_6 : num -2 2 0 0 0 0 0 -1 0 -1 ...
$ BILL_AMT1 : num 3913 2682 29239 46990 8617 ...
$ BILL_AMT2 : num 3102 1725 14027 48233 5670 ...
$ BILL_AMT3 : num 689 2682 13559 49291 35835 ...
$ BILL_AMT4 : num 0 3272 14331 28314 20940 ...
$ BILL_AMT5 : num 0 3455 14948 28959 19146 ...
$ BILL_AMT6 : num 0 3261 15549 29547 19131 ...
$ PAY_AMT1 : num 0 0 1518 2000 2000 ...
$ PAY_AMT2 : num 689 1000 1500 2019 36681 ...
$ PAY_AMT3 : num 0 1000 1000 1200 10000 657 38000 0 432
0 ...
$ PAY_AMT4 : num 0 1000 1000 1100 9000 ...
$ PAY_AMT5 : num 0 0 1000 1069 689 ...
$ PAY_AMT6 : num 0 2000 5000 1000 679 ...
$ default payment next month: num 1 1 0 0 0 0 0 0 0 0 ...
We will change the column name of our dependent variable to "payment_default" for the ease
of coding and convert it into a categorical variable.
setnames(df, "default payment next month", "payment_default")

df$payment_default<-as.factor(df$payment_default)
Descriptive Statistics
ID LIMIT_BAL SEX EDUCATION
Min. : 1 Min. : 10000 Min. :1.000 Min. :0.000
1st Qu.: 7501 1st Qu.: 50000 1st Qu.:1.000 1st Qu.:1.000
Median :15000 Median : 140000 Median :2.000 Median :2.000
Mean :15000 Mean : 167484 Mean :1.604 Mean :1.853
3rd Qu.:22500 3rd Qu.: 240000 3rd Qu.:2.000 3rd Qu.:2.000
Max. :30000 Max. :1000000 Max. :2.000 Max. :6.000
MARRIAGE AGE PAY_0 PAY_2
Min. :0.000 Min. :21.00 Min. :-2.0000 Min. :-2.0000
1st Qu.:1.000 1st Qu.:28.00 1st Qu.:-1.0000 1st Qu.:-1.0000
Median :2.000 Median :34.00 Median : 0.0000 Median : 0.0000
Mean :1.552 Mean :35.49 Mean :-0.0167 Mean :-0.1338
3rd Qu.:2.000 3rd Qu.:41.00 3rd Qu.: 0.0000 3rd Qu.: 0.0000
Max. :3.000 Max. :79.00 Max. : 8.0000 Max. : 8.0000
PAY_3 PAY_4 PAY_5 PAY_6
Min. :-2.0000 Min. :-2.0000 Min. :-2.0000 Min. :-2.0000
1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000 1st Qu.:-1.0000
Median : 0.0000 Median : 0.0000 Median : 0.0000 Median : 0.0000
Mean :-0.1662 Mean :-0.2207 Mean :-0.2662 Mean :-0.2911
3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000
Max. : 8.0000 Max. : 8.0000 Max. : 8.0000 Max. : 8.0000
BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4
Min. :-165580 Min. :-69777 Min. :-157264 Min. :-170000
1st Qu.: 3559 1st Qu.: 2985 1st Qu.: 2666 1st Qu.: 2327
Median : 22382 Median : 21200 Median : 20089 Median : 19052
Mean : 51223 Mean : 49179 Mean : 47013 Mean : 43263
3rd Qu.: 67091 3rd Qu.: 64006 3rd Qu.: 60165 3rd Qu.: 54506
Max. : 964511 Max. :983931 Max. :1664089 Max. : 891586
BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2
Min. :-81334 Min. :-339603 Min. : 0 Min. : 0
1st Qu.: 1763 1st Qu.: 1256 1st Qu.: 1000 1st Qu.: 833
Median : 18105 Median : 17071 Median : 2100 Median : 2009
Mean : 40311 Mean : 38872 Mean : 5664 Mean : 5921
3rd Qu.: 50191 3rd Qu.: 49198 3rd Qu.: 5006 3rd Qu.: 5000
Max. :927171 Max. : 961664 Max. :873552 Max. :1684259
PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6
Min. : 0 Min. : 0 Min. : 0.0 Min. : 0.0
1st Qu.: 390 1st Qu.: 296 1st Qu.: 252.5 1st Qu.: 117.8
Median : 1800 Median : 1500 Median : 1500.0 Median : 1500.0
Mean : 5226 Mean : 4826 Mean : 4799.4 Mean : 5215.5
3rd Qu.: 4505 3rd Qu.: 4013 3rd Qu.: 4031.5 3rd Qu.: 4000.0
Max. :896040 Max. :621000 Max. :426529.0 Max. :528666.0
payment_default
0:23364
1: 6636
### We can observe some discrepancies immediately:
### EDUCATION is supposed to be an integer between 1 and 6 (both inclusive).

However, we note 0 is present as well. These should be changed to value 6 (as
that stands for 'unknown')
### MARRIAGE is supposed to be an integer between 1 and 3 (both inclusive).

However, 0 is present as well. These should be changed to value 3 (as that
stands for "others")
### PAY_0 to PAY_6 are seen to be between -2 to 8. However, as per the data
dictionary the values should be -1 for pay duly and 1 - 9 for payment delay of
1 month, 2 month... 9 months and above. Could it be that the values should be
shifted by +1 (so -2 becomes -1... 8 becomes 9)? That still doesn't account
for the value of 0 though (-1 will become 0 based on this transformation).
### Continuous features (BILL_AMT1-BILL_AMT6, PAY_AMT1-PAY_AMT6) -There is a

huge spread in each of the features. Also, there are negative values
(corresponding to people who overpay on their credit card bills?).
Univariate Analysis:-
DEMOGRAPHIC VARIABLES
### Let us take a closer look at some the demographic variables like Sex,
Education and Marriage. However, before proceeding we will encode them as per
the data dictionary for better understanding and then plot them. We would also
need to convert them into the correct data type for categorical variables.
### SEX: 1: Male and 2: Female
### EDUCATION: 1: Graduate School, 2: University, 3: High School , 4:

"Others", 5 and 6 have been clubbed together as "Unknown"
### MARRIAGE: 1: Married, 2: Single; 3 and 0 have been clubbed together as
"Others"
df$SEX<-ifelse(df$SEX==1,"Male","Female")
df$EDUCATION<-ifelse(df$EDUCATION==1,"Graduate
School",ifelse(df$EDUCATION==2,"University",
(ifelse(df$EDUCATION==3,"High
School",ifelse(df$EDUCATION==4,"Others","Unknown")))))
df$MARRIAGE<-
ifelse(df$MARRIAGE==1,"Married",ifelse(df$MARRIAGE==2,"Single","Others"))
names<-c("SEX","EDUCATION","MARRIAGE")
df[names]<-lapply(df[names],as.factor)
plot_bar(df[,c("SEX","EDUCATION","MARRIAGE")])
BAR PLOT OF DEMOGRAPHIC VARIABLES
REPAYMENT STATUS
There are 6 monthly variables for the repayment status which needs to be coded properly as per the
data dicitionary for better analysis. Further they needed to be converted into categorical variables
for the modelling.The variables have been coded as following:
-2 to 0: "Paid Duly"
1: "1 month delay"
2: "2 month delay"
3: "3 month delay"
4: "4 month delay"
5: "5 month delay"
6: "6 month delay"
7: "7 month delay"
8: "8 month delay"
Anything more than 8 has been put as "9 month or more delay".
But before proceeding we will quickly check the number of counts under each category.
PAY_VAR<-
lapply(df[,c("PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6")],
function(x) table(x))
print(PAY_VAR)
$PAY_0
x
-2 -1 0 1 2 3 4 5 6 7 8
2759 5686 14737 3688 2667 322 76 26 11 9 19
$PAY_2
x
-2 -1 0 1 2 3 4 5 6 7 8
3782 6050 15730 28 3927 326 99 25 12 20 1
$PAY_3
x
-2 -1 0 1 2 3 4 5 6 7 8
4085 5938 15764 4 3819 240 76 21 23 27 3
$PAY_4
x
-2 -1 0 1 2 3 4 5 6 7 8
4348 5687 16455 2 3159 180 69 35 5 58 2
$PAY_5
x
-2 -1 0 2 3 4 5 6 7 8
4546 5539 16947 2626 178 84 17 4 58 1
$PAY_6
x
-2 -1 0 2 3 4 5 6 7 8
4895 5740 16286 2766 184 49 13 19 46 2
As we see from the results above the number of customers who have delayed payments by 5 or
more months is very low so we would group them under a single category.
df$PAY_0<-ifelse(df$PAY_0 <1 ,"Paid Duly",ifelse(df$PAY_0==1,"1 month delay",

ifelse(df$PAY_0==2,"2 month delay",ifelse(df$PAY_0==3,"3 month
delay",
ifelse(df$PAY_0==4,"4 month delay","5 month or more delay")))))

delay",

delay",

delay",

delay",

delay",
names<-c("PAY_0","PAY_2","PAY_3","PAY_4","PAY_5","PAY_6")
df[names]<-lapply(df[names],as.factor)
BAR PLOT OF PAYMENT DATA
Analysis of the Continuous Variables
Let us begin by plotting them.

# All of the Variables looks skewed and has some outliers which require treatment
.However We would not treat the variable “Age” and instead break it down into buckets.
# Most people generate a bill rapidly , but pay very small amounts but over many months
(Bill Amount generated in September ,may get paid off over September, October and
November).
Bi-Variate Analysis :-
# EDUCATION vs MARRIAGE
ggplot(data=df , aes(EDUCATION)) + geom_bar() +
facet_grid(rows=vars(MARRIAGE)) +
ggtitle('Education Level vs Marriage')
# EDUCATION vs SEX
ggplot(data=df , aes(EDUCATION)) + geom_bar() + facet_grid(rows=vars(SEX)) +
ggtitle('Education Level vs Sex')
# Nothing unexpected. It seems like for both Sex and Marriage, Education seems to follow a
similar distribution.
How do these features look when comparing against DEFAULT?
# EDUCATION vs DEFAULT
ggplot(data=df, aes(EDUCATION)) + geom_bar() +
facet_grid(rows=vars(payment_default)) +
ggtitle('Education profile - Default vs Non-Default')
# MARRIAGE vs DEFAULT
ggplot(data=df, aes(MARRIAGE)) + geom_bar() +
ggtitle('Marriage status - Default vs Non-Default')
# SEX vs DEFAULT
ggplot(data=df, aes(MARRIAGE)) + geom_bar() +
ggtitle('SEX status - Default vs Non-Default')
# Surprisingly ,Single people are less likely to default
# However, if a default has happened , then the defaulter is almost equally likely to be singlr
or married.
# Let’s take a look at default status vs 2 categorical variable at a time now .
# Default vs Marriage and Education

ggplot(data=df, aes(payment_default)) + geom_bar() +
facet_grid(rows=vars(EDUCATION), cols=vars(MARRIAGE)) +
ggtitle('Default vs Marriage + Education')
# Default vs Marriage and Sex

ggplot(data=df, aes(payment_default)) + geom_bar() +
facet_grid(rows=vars(MARRIAGE), cols=vars(SEX)) +
ggtitle('Default vs Marriage + Sex')
# Nothing too concrete... Single Males seem to be most reliable (although this could be because
they are more represented in the data).
Exploratory Analysis
We begin by creating a facet plot displaying the percentage of credit card defaults against
education level faceted by gender and marital status.
# We may see that male credit card clients have a higher percentage of defaults compared to
female ones (across all education levels and marital statuses). Another trend made apparent in
this plot is that for married and single males and females, the proportion of default decreases
with the education level; this is different for the “Other” marital status, where this trend is nearly
reversed.
# Another interesting plot is a facet plot displaying the amount of given credit for defaulting and
non defaulting credit card clients faceted by by gender and marriage status.
We may see that defaulting credit card clients have less amount of given credit (across all
genders and marital statuses).
We may also take a look at the plots displaying the percentage of credit card defaults against the
monthly repayment statuses.
It seems pretty clear that credit card clients paying their bills with some delay have a higher
percentage of defaults compared to clients paying on time (across all months).
To conclude, we display the natural logarithm of the monthly amount of previous payments or
defaulting and non defaulting credit card clients.
This plot shows us that non defaulting clients pay larger amounts of previous payment (across all
months). We may also see that we have a lot of variation (due to the presence of a large number
of zeros) in the amounts of previous payment for defaulting clients (across all months).
Missing Value Check

There are no missing values in the data so no imputation is needed for any variable.
Outlier treatment
Outlier<-
data.frame(apply(df[,c("LIMIT_BAL","BILL_AMT1","BILL_AMT2","BILL_AMT3","BILL_A
MT4","BILL_AMT5","BILL_AMT6",
"PAY_AMT1","PAY_AMT2","PAY_AMT3","PAY_AMT4","PAY_AMT5","PAY_AMT6")],
2, function(x) quantile(x, probs = seq(0, 1, by= 0.00001))))
head(Outlier)
tail(Outlier)
LIMI BILL BILL BILL BILL BILL BILL PAY PAY PAY PAY PAY PAY
T_B _AM _AM _AM _AM _AM _AM _AM _AM _AM _AM _AM _AM
AL T1 T2 T3 T4 T5 T6 T1 T2 T3 T4 T5 T6
0. 100 - - - - - - 0 0 0 0 0 0
00 00 165 697 157 170 813 339
% 580 77 264 000 34 603
0. 100 - - - - - - 0 0 0 0 0 0
00 00 162 691 128 143 753 300
% 398 01.7 538 401 45.6 439
0. 100 - - - - - - 0 0 0 0 0 0
00 00 159 684 998 116 693 261
% 216 26.5 11.1 802 57.2 274
0. 100 - - - - - - 0 0 0 0 0 0
00 00 156 677 710 902 633 222
% 034 51.2 84.7 03.3 68.8 110
0. 100 - - - - - - 0 0 0 0 0 0
00 00 127 606 584 781 596 197
% 046 92.2 30.8 01.3 99.3 434
0. 100 - - - - - - 0 0 0 0 0 0
01 00 851 504 538 732 571 180
% 47.5 39.7 17.3 51.3 89.9 005
LIMI BILL BILL BILL BILL BILL BILL PAY PAY PAY PAY PAY PAY
T_B _AM _AM _AM _AM _AM _AM _A _A _A _A _A _A
AL T1 T2 T3 T4 T5 T6 MT1 MT2 MT3 MT4 MT5 MT6
10 800 699 707 774 667 705 634 499 122 698 512 403 485
0.0 000 942. 770. 116. 785. 315. 297. 179. 127 655 950. 032 076.
0% 7 1 6 4 3 6 6 7 1 2
10 800 728 729 822 691 776 673 502 122 812 522 412 510
0.0 000 067. 491. 701. 234. 254. 688. 672. 476 895. 518. 007. 318
0% 4 5 5 1 9 1 1 0 4 9 4
10 820 768 767 936 725 833 726 541 127 889 538 418 527
0.0 006 590. 973. 010. 341. 906. 123. 866. 281 742. 110. 844. 295.
0% 2 3 6 7 2 9 3 3 9 1 2 3
10 880 833 839 117 780 864 804 652 140 891 565 421 527
0.0 004 897. 959. 870 756. 994. 637. 428. 996 841. 740 405. 752.
0% 2 2 3 5 5 2 2 2 9 8 2
10 940 899 911 142 836 896 883 762 154 893 593 423 528
0.0 002 204. 945. 139 171. 082. 150. 990. 711 941 370 967. 209.
0% 1 1 6 2 7 6 1 0 4 1
10 100 964 983 166 891 927 961 873 168 896 621 426 528
0.0 000 511 931 408 586 171 664 552 425 040 000 529 666
0% 0 9 9
We can find two major outliers one and at each end which would be removed.
df<-subset(df, !(df$LIMIT_BAL> quantile(df$LIMIT_BAL, 0.99999) |

df$BILL_AMT1< quantile(df$BILL_AMT1, 0.00001)))
Imbalance check for the Dependent Variable:-
Imbalance_Check<-aggregate(ID~payment_default,df,length)
colnames(Imbalance_Check)[2]<-"Client_Count"
Imbalance_Check$Contribution<-
(Imbalance_Check$Client_Count/sum(Imbalance_Check$Client_Count))*100
Imbalance_Check
payment_default Client_Count Contribution
0 23362 77.87853
1 6636 22.12147
There is no significant imbalance in the dependent variable.
Collinearity Check
Let us plot the correlation among the variables and also their respective VIF
numeric_fields<-c("LIMIT_BAL","BILL_AMT1","BILL_AMT2","BILL_AMT3",
"BILL_AMT4","BILL_AMT5","BILL_AMT6",
"PAY_AMT1","PAY_AMT2","PAY_AMT3","PAY_AMT4","PAY_AMT5","PAY_AMT6")
df_numeric<-subset(df, select=numeric_fields)
plot_correlation(df_numeric)
vif(df_numeric)
Variables VIF
LIMIT_BAL 1.171354
BILL_AMT1 16.60771
BILL_AMT2 28.42928
BILL_AMT3 20.221
BILL_AMT4 23.19036
BILL_AMT5 32.06416
BILL_AMT6 18.86295
PAY_AMT1 2.382365
PAY_AMT2 2.746476
PAY_AMT3 2.822173
PAY_AMT4 2.264827
PAY_AMT5 1.543725
PAY_AMT6 1.156365
There is high collinearity among the 6 variables corresponding to BILL AMOUNT and will not be
suitable for modelling. We would be excluding them from our models as well. Instead of using
them we would generate monthly ratio variables for the amount of payment made for the bill
amounts corresponding to each month.
df$PAY_RATIO_APR<-ifelse(is.nan(df$PAY_AMT1/df$BILL_AMT1),0,
ifelse(is.infinite(df$PAY_AMT1/df$BILL_AMT1),0,round(df$PAY_AMT1/df$BILL_AMT1,
2)))
df$PAY_RATIO_MAY<-ifelse(is.nan(df$PAY_AMT2/df$BILL_AMT2),0,
2)))
df$PAY_RATIO_JUNE<-ifelse(is.nan(df$PAY_AMT3/df$BILL_AMT3),0,
2)))
df$PAY_RATIO_JULY<-ifelse(is.nan(df$PAY_AMT4/df$BILL_AMT4),0,
2)))
df$PAY_RATIO_AUG<-ifelse(is.nan(df$PAY_AMT5/df$BILL_AMT5),0,
2)))
df$PAY_RATIO_SEPT<-ifelse(is.nan(df$PAY_AMT6/df$BILL_AMT6),0,
2)))
numeric_fields<-
c("LIMIT_BAL","BILL_AMT1","BILL_AMT2","BILL_AMT3","BILL_AMT4","BILL_AMT5",
"BILL_AMT6","PAY_RATIO_APR","PAY_RATIO_MAY","PAY_RATIO_JUNE",
"PAY_RATIO_JULY","PAY_RATIO_AUG","PAY_RATIO_SEPT",
"PAY_AMT1","PAY_AMT2","PAY_AMT3","PAY_AMT4","PAY_AMT5","PAY_AMT6")
df_numeric<-subset(df, select=numeric_fields)
plot_correlation(df_numeric)
We can see that the collinearity is very less among the new variables and would be used for
modelling.
Train and Test Data
train_df<-sample_frac(df,0.75)
test_df<-subset(df,!(df$ID %in% train_df$ID))
Imbalance Chek in the dependent Variable for the Train and Test data
Imbalance_Check_train_df<-aggregate(ID~payment_default,train_df,length)
colnames(Imbalance_Check_train_df)[2]<-"Client_Count"
Imbalance_Check_train_df$Contribution<-
(Imbalance_Check_train_df$Client_Count/sum(Imbalance_Check_train_df$Client_Cou
nt))*100
Imbalance_Check_train_df
Train Data :-
0 17529 77.91359
1 4969 22.08641
Imbalance_Check_test_df<-aggregate(ID~payment_default,test_df,length)
colnames(Imbalance_Check_test_df)[2]<-"Client_Count"
Imbalance_Check_test_df$Contribution<-
(Imbalance_Check_test_df$Client_Count/sum(Imbalance_Check_test_df$Client_Count
))*100
Imbalance_Check_test_df
Test Data :-
0 5833 77.77333
1 1667 22.22667
No major imbalance observed, so we would proceed to modelling.
INFORMATION VALUE
Let us check the Information Value of the categorical variables to understand if there are any that
could be omitted.
SEX<-data.frame("SEX"=IV(train_df$SEX,train_df$payment_default))
EDUCATION<-
data.frame("EDUCATION"=IV(train_df$EDUCATION,train_df$payment_default))
MARRIAGE<-
data.frame("MARRIAGE"=IV(train_df$MARRIAGE,train_df$payment_default))
PAY_0<-data.frame("PAY_0"=IV(train_df$PAY_0,train_df$payment_default))
Iv<-cbind(SEX,EDUCATION,MARRIAGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6)
print(Iv)
Information Value :-
SEX EDUCATION MARRIAGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6

0.009084013 0.03630159 0.00658 0.87105 0.55186 0.41304 0.36082 0.34261 0.29103
The above results suggest that the variables "SEX" and "MARRIAGE" are very weak predictors.
Hence we would not use them for Modeling
In our primary analysis, we use logistic regression to assess the effect demographic and financial
variables have on the default payment. Our null hypothesis is that there is no effect of
demographic and financial variables on the default payment; obviously, the alternative
hypothesis is that there is such an effect.
LOGISTIC REGRESSION
mod_1<-glm(payment_default~.-(ID+BILL_AMT1+BILL_AMT2+BILL_AMT3+
BILL_AMT4+BILL_AMT5+BILL_AMT6+SEX+MARRIAGE),
train_df, family=binomial)
summary(mod_1)
Call:
glm(formula = payment_default ~ . - (ID + BILL_AMT1 + BILL_AMT2 +
BILL_AMT3 + BILL_AMT4 + BILL_AMT5 + BILL_AMT6 + SEX + MARRIAGE),
family = binomial, data = train_df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3032 -0.5851 -0.5260 -0.3412 3.6329
Coefficients: (1 not defined because of singularities)

Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.131e+00 1.195e+02 -0.076 0.939078
LIMIT_BAL -1.141e-06 1.750e-07 -6.516 7.20e-11 ***
EDUCATIONHigh School -9.197e-02 5.736e-02 -1.603 0.108837
EDUCATIONOthers -8.004e-01 4.023e-01 -1.990 0.046640 *
EDUCATIONUniversity -6.886e-03 4.187e-02 -0.164 0.869367
EDUCATIONUnknown -1.015e+00 2.476e-01 -4.098 4.16e-05 ***
AGE 8.989e-03 2.003e-03 4.487 7.23e-06 ***
PAY_02 month delay 1.306e+00 6.650e-02 19.643 < 2e-16 ***
PAY_03 month delay 1.228e+00 1.672e-01 7.346 2.05e-13 ***
PAY_04 month delay 1.107e+00 3.346e-01 3.307 0.000942 ***
PAY_05 month or more delay 5.777e-01 4.527e-01 1.276 0.201951
PAY_0Paid Duly -7.572e-01 5.691e-02 -13.305 < 2e-16 ***
PAY_22 month delay 8.108e-01 6.505e-01 1.246 0.212609
PAY_25 month or more delay 1.477e+00 8.565e-01 1.725 0.084529 .
PAY_2Paid Duly 5.229e-01 6.505e-01 0.804 0.421547
PAY_32 month delay 8.358e+00 1.195e+02 0.070 0.944225
PAY_35 month or more delay 8.464e+00 1.195e+02 0.071 0.943521
PAY_3Paid Duly 8.055e+00 1.195e+02 0.067 0.946244
PAY_42 month delay 2.251e-01 7.608e-02 2.959 0.003088 **
PAY_43 month delay -9.472e-03 2.511e-01 -0.038 0.969907
PAY_45 month or more delay -1.885e+00 6.788e-01 -2.777 0.005479 **
PAY_4Paid Duly NA NA NA NA
PAY_55 month or more delay 1.206e+00 8.050e-01 1.498 0.134116
PAY_5Paid Duly -2.496e-01 8.425e-02 -2.963 0.003045 **
PAY_63 month delay 6.185e-01 2.639e-01 2.344 0.019071 *
PAY_6Paid Duly -3.029e-01 7.300e-02 -4.150 3.33e-05 ***
PAY_AMT1 -9.924e-06 2.367e-06 -4.193 2.75e-05 ***
PAY_AMT2 -5.463e-06 1.913e-06 -2.856 0.004294 **
PAY_AMT3 -3.270e-06 1.743e-06 -1.876 0.060652 .
PAY_AMT4 -4.428e-06 1.930e-06 -2.294 0.021786 *
PAY_AMT5 -2.751e-06 1.733e-06 -1.587 0.112455
PAY_AMT6 -3.312e-06 1.540e-06 -2.150 0.031542 *
PAY_RATIO_APR -1.301e-04 9.460e-05 -1.375 0.169045
PAY_RATIO_MAY 9.012e-05 2.464e-04 0.366 0.714567
PAY_RATIO_JUNE 5.886e-05 1.399e-04 0.421 0.674022
PAY_RATIO_JULY 1.465e-04 1.707e-04 0.858 0.390644
PAY_RATIO_AUG -2.239e-06 3.050e-04 -0.007 0.994144
PAY_RATIO_SEPT 1.845e-04 3.428e-04 0.538 0.590352
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 23846 on 22497 degrees of freedom

Residual deviance: 19691 on 22452 degrees of freedom
AIC: 19783
Number of Fisher Scoring iterations: 9
Predicting the probabilities obtained from the model

predict_1<-predict(mod_1, train_df, type='response')
prob_1<-ifelse( predict_1 > 0.5,1,0)
Confusion Matrix
confusion_matrix<-table(prob_1, train_df$payment_default)
print(confusion_matrix)
prob_1 0 1
0 16667 3198
1 827 1806
Model Accuracy
Accuracy<-sum(diag(confusion_matrix))/sum(confusion_matrix)
print(Accuracy*100)
[1] 82.10952
STEP AIC CRITERION
We will use the Step AIC methodology to find the best subset of variables.
step_AIC<-stepAIC(mod_1,direction='backward')
Start: AIC=19782.58
payment_default ~ (ID + LIMIT_BAL + SEX + EDUCATION + MARRIAGE +
AGE + PAY_0 + PAY_2 + PAY_3 + PAY_4 + PAY_5 + PAY_6 + BILL_AMT1 +
BILL_AMT2 + BILL_AMT3 + BILL_AMT4 + BILL_AMT5 + BILL_AMT6 +
PAY_AMT1 + PAY_AMT2 + PAY_AMT3 + PAY_AMT4 + PAY_AMT5 + PAY_AMT6 +
PAY_RATIO_APR + PAY_RATIO_MAY + PAY_RATIO_JUNE + PAY_RATIO_JULY +
PAY_RATIO_AUG + PAY_RATIO_SEPT) - (ID + BILL_AMT1 + BILL_AMT2 +
BILL_AMT3 + BILL_AMT4 + BILL_AMT5 + BILL_AMT6 + SEX + MARRIAGE)
Df Deviance AIC
- PAY_RATIO_AUG 1 19691 19781
- PAY_RATIO_MAY 1 19691 19781
- PAY_RATIO_SEPT 1 19691 19781
- PAY_RATIO_JUNE 1 19691 19781
- PAY_RATIO_JULY 1 19691 19781
<none> 19691 19783
- PAY_AMT5 1 19693 19783
- PAY_RATIO_APR 1 19695 19785
- PAY_AMT3 1 19695 19785
- PAY_AMT6 1 19696 19786
- PAY_AMT4 1 19697 19787
- PAY_5 4 19703 19787
- PAY_AMT2 1 19701 19791
- PAY_4 4 19712 19796
- PAY_3 4 19714 19798
- AGE 1 19711 19801
- PAY_2 5 19720 19802
- EDUCATION 4 19719 19803
- PAY_6 4 19719 19803
- PAY_AMT1 1 19714 19804
- LIMIT_BAL 1 19734 19824
- PAY_0 5 20785 20867
Step: AIC=19780.58
payment_default ~ LIMIT_BAL + EDUCATION + AGE + PAY_0 + PAY_2 +
PAY_3 + PAY_4 + PAY_5 + PAY_6 + PAY_AMT1 + PAY_AMT2 + PAY_AMT3 +
PAY_AMT4 + PAY_AMT5 + PAY_AMT6 + PAY_RATIO_APR + PAY_RATIO_MAY +
PAY_RATIO_JUNE + PAY_RATIO_JULY + PAY_RATIO_SEPT
Df Deviance AIC
- PAY_RATIO_MAY 1 19691 19779
<none> 19691 19781
- PAY_AMT5 1 19693 19781
- PAY_RATIO_APR 1 19695 19783
- PAY_AMT3 1 19695 19783
- PAY_AMT6 1 19696 19784
- PAY_AMT4 1 19697 19785
- PAY_5 4 19703 19785
- PAY_AMT2 1 19701 19789
- PAY_4 4 19712 19794
- PAY_3 4 19714 19796
- AGE 1 19711 19799
- PAY_2 5 19720 19800
- EDUCATION 4 19719 19801
- PAY_6 4 19719 19801
- PAY_AMT1 1 19714 19802
- LIMIT_BAL 1 19734 19822
- PAY_0 5 20785 20865
Step: AIC=19778.73
PAY_AMT4 + PAY_AMT5 + PAY_AMT6 + PAY_RATIO_APR + PAY_RATIO_JUNE +
PAY_RATIO_JULY + PAY_RATIO_SEPT
Df Deviance AIC
<none> 19691 19779
- PAY_AMT5 1 19693 19779
- PAY_RATIO_APR 1 19695 19781
- PAY_AMT3 1 19695 19781
- PAY_AMT6 1 19696 19782
- PAY_AMT4 1 19697 19783
- PAY_5 4 19703 19783
- PAY_AMT2 1 19701 19787
- PAY_4 4 19712 19792
- PAY_3 4 19714 19794
- AGE 1 19711 19797
- PAY_2 5 19720 19798
- EDUCATION 4 19719 19799
- PAY_6 4 19719 19799
- PAY_AMT1 1 19714 19800
- LIMIT_BAL 1 19734 19820
- PAY_0 5 20786 20864
Step: AIC=19777.2
PAY_AMT4 + PAY_AMT5 + PAY_AMT6 + PAY_RATIO_APR + PAY_RATIO_JUNE +
PAY_RATIO_JULY
Df Deviance AIC
<none> 19691 19777
- PAY_AMT5 1 19694 19778
- PAY_RATIO_APR 1 19695 19779
- PAY_AMT3 1 19695 19779
- PAY_AMT6 1 19696 19780
- PAY_AMT4 1 19697 19781
- PAY_5 4 19704 19782
- PAY_AMT2 1 19701 19785
- PAY_4 4 19712 19790
- PAY_3 4 19714 19792
- AGE 1 19711 19795
- PAY_2 5 19720 19796
- EDUCATION 4 19720 19798
- PAY_6 4 19720 19798
- PAY_AMT1 1 19714 19798
- LIMIT_BAL 1 19735 19819
- PAY_0 5 20786 20862
Step: AIC=19775.73
PAY_AMT4 + PAY_AMT5 + PAY_AMT6 + PAY_RATIO_APR + PAY_RATIO_JULY
Df Deviance AIC
<none> 19692 19776
- PAY_AMT5 1 19694 19776
- PAY_RATIO_APR 1 19696 19778
- PAY_AMT3 1 19696 19778
- PAY_AMT6 1 19697 19779
- PAY_AMT4 1 19698 19780
- PAY_5 4 19704 19780
- PAY_AMT2 1 19702 19784
- PAY_4 4 19713 19789
- PAY_3 4 19715 19791
- AGE 1 19712 19794
- PAY_2 5 19721 19795
- EDUCATION 4 19720 19796
- PAY_6 4 19720 19796
- PAY_AMT1 1 19715 19797
- LIMIT_BAL 1 19735 19817
- PAY_0 5 20787 20861
Step: AIC=19774.86
PAY_AMT4 + PAY_AMT5 + PAY_AMT6 + PAY_RATIO_APR
Df Deviance AIC
<none> 19693 19775
- PAY_AMT5 1 19696 19776
- PAY_RATIO_APR 1 19697 19777
- PAY_AMT3 1 19697 19777
- PAY_AMT6 1 19698 19778
- PAY_AMT4 1 19699 19779
- PAY_5 4 19706 19780
- PAY_AMT2 1 19703 19783
- PAY_4 4 19714 19788
- PAY_3 4 19716 19790
- AGE 1 19713 19793
- PAY_2 5 19722 19794
- EDUCATION 4 19721 19795
- PAY_6 4 19721 19795
- PAY_AMT1 1 19716 19796
- LIMIT_BAL 1 19737 19817
- PAY_0 5 20788 20860
Re-running the model with the best subset obtained.
mod_2<-glm(payment_default ~ LIMIT_BAL + EDUCATION + PAY_0 +

PAY_2 + PAY_3 + PAY_4 + PAY_5 + PAY_6 + PAY_AMT1 + PAY_AMT2 +
PAY_AMT4 + PAY_AMT5 + PAY_AMT6 + PAY_RATIO_APR, train_df,
family='binomial')
summary(mod_2)
Call:
glm(formula = payment_default ~ LIMIT_BAL + EDUCATION + PAY_0 +
PAY_2 + PAY_3 + PAY_4 + PAY_5 + PAY_6 + PAY_AMT1 + PAY_AMT2 +
PAY_AMT4 + PAY_AMT5 + PAY_AMT6 + PAY_RATIO_APR, family = "binomial",
data = train_df)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2738 -0.5833 -0.5285 -0.3460 3.6815
Coefficients: (1 not defined because of singularities)

Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.901e+00 1.195e+02 -0.075 0.940605
LIMIT_BAL -1.042e-06 1.709e-07 -6.097 1.08e-09 ***
EDUCATIONHigh School -2.286e-02 5.513e-02 -0.415 0.678398
EDUCATIONOthers -7.996e-01 4.013e-01 -1.993 0.046309 *
EDUCATIONUniversity 7.109e-03 4.171e-02 0.170 0.864674
EDUCATIONUnknown -9.843e-01 2.471e-01 -3.983 6.80e-05 ***
PAY_02 month delay 1.303e+00 6.642e-02 19.613 < 2e-16 ***
PAY_03 month delay 1.232e+00 1.670e-01 7.380 1.58e-13 ***
PAY_04 month delay 1.106e+00 3.344e-01 3.308 0.000941 ***
PAY_0Paid Duly -7.606e-01 5.689e-02 -13.370 < 2e-16 ***
PAY_25 month or more delay 1.456e+00 8.574e-01 1.698 0.089565 .
PAY_2Paid Duly 5.118e-01 6.525e-01 0.784 0.432819
PAY_35 month or more delay 8.511e+00 1.195e+02 0.071 0.943208
PAY_3Paid Duly 8.104e+00 1.195e+02 0.068 0.945920
PAY_42 month delay 2.422e-01 7.576e-02 3.197 0.001388 **
PAY_45 month or more delay -1.885e+00 6.806e-01 -2.770 0.005607 **
PAY_4Paid Duly NA NA NA NA
PAY_55 month or more delay 1.188e+00 8.041e-01 1.478 0.139411
PAY_5Paid Duly -2.379e-01 8.402e-02 -2.832 0.004630 **
PAY_63 month delay 6.076e-01 2.640e-01 2.302 0.021332 *
PAY_6Paid Duly -3.009e-01 7.295e-02 -4.125 3.71e-05 ***
PAY_AMT1 -1.047e-05 2.379e-06 -4.401 1.08e-05 ***
PAY_AMT2 -5.813e-06 1.922e-06 -3.025 0.002489 **
PAY_AMT4 -4.675e-06 1.925e-06 -2.429 0.015143 *
PAY_AMT5 -3.010e-06 1.736e-06 -1.734 0.082953 .
PAY_AMT6 -3.577e-06 1.547e-06 -2.312 0.020773 *
PAY_RATIO_APR -1.250e-04 8.315e-05 -1.503 0.132727
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 23846 on 22497 degrees of freedom

Residual deviance: 19717 on 22459 degrees of freedom
AIC: 19795
Number of Fisher Scoring iterations: 9
Predicting the model probabilities along with the Confusion Matrix and Accuracy
predict_2<-predict(mod_2,train_df,type='response')
prob_2<-ifelse(predict_2>0.5,1,0)
#Confusion Matrix
confusion_matrix<-table(prob_2, train_df$payment_default)
print(confusion_matrix)
#Model Accuracy
Accuracy<-sum(diag(confusion_matrix))/sum(confusion_matrix)
print(Accuracy*100)
[1] 82.10508
ROCR Curve
#The ROCR Curve

pred1 <- prediction(predict(mod_2),train_df$payment_default)
perf1 <- performance(pred1,"tpr","fpr")
plot(perf1)
True positive rate
0.8
0.4
0.0
0.0 0.2 0.4 0.6 0.8 1.0
False positive rate
Let's check the accuracy of the model with the data set aside for validation
predict_2<-predict(mod_2,test_df,type='response')
prob_2<-ifelse(predict_2>0.5,1,0)
#Confusion matrix for test data

confusion_matrix_test<-table(prob_2,test_df$payment_default)
#Accuarcy of the model

Accuracy_test<-sum(diag(confusion_matrix_test))/sum(confusion_matrix_test)
print(Accuracy_test * 100)
[1] 81.86667
There isn't any significant change in accuracy of the model so we can assume our model
performs well. But can we raise the accuracy of the model using some other technique?
So let us try the Decision Tree approach
DECISION TREE
set.seed(1234)
mymod<-rpart(payment_default ~
PAY_AMT1+PAY_AMT2+PAY_AMT3+PAY_AMT4+PAY_AMT5+PAY_AMT6+LIMIT_BAL+
EDUCATION+PAY_0+PAY_2+PAY_3+PAY_4+PAY_5+PAY_6+
BILL_AMT1+BILL_AMT2+BILL_AMT3+BILL_AMT4+BILL_AMT5+BILL_AMT6+AGE+SEX+MARRIAGE,
data= train_df, method="class",
control = rpart.control(cp = 0.0001,minsplit = 30, minbucket =
30*2,
maxsurrogate = 5, usesurrogate = 2,xval=10, maxdepth = 30))
printcp(mymod)
Classification tree:
rpart(formula = payment_default ~ PAY_AMT1 + PAY_AMT2 + PAY_AMT3 +
PAY_AMT4 + PAY_AMT5 + PAY_AMT6 + LIMIT_BAL + EDUCATION +
PAY_0 + PAY_2 + PAY_3 + PAY_4 + PAY_5 + PAY_6 + BILL_AMT1 +
BILL_AMT2 + BILL_AMT3 + BILL_AMT4 + BILL_AMT5 + BILL_AMT6 +
AGE + SEX + MARRIAGE, data = train_df, method = "class",
control = rpart.control(cp = 1e-04, minsplit = 30, minbucket = 30 *
2, maxsurrogate = 5, usesurrogate = 2, xval = 10, maxdepth = 30))
Variables actually used in tree construction:

[1] AGE BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT6 LIMIT_BAL
PAY_0
[9] PAY_2 PAY_3 PAY_5 PAY_6 PAY_AMT1 PAY_AMT3 PAY_AMT4
PAY_AMT5
[17] PAY_AMT6 SEX
Root node error: 5004/22498 = 0.22242
n= 22498
CP nsplit rel error xerror xstd

1 0.18924860 0 1.00000 1.00000 0.012466
2 0.00379696 1 0.81075 0.81075 0.011524
3 0.00369704 2 0.80695 0.81615 0.011554
4 0.00229816 4 0.79956 0.81155 0.011528
5 0.00179856 6 0.79496 0.80635 0.011500
6 0.00099920 7 0.79317 0.80855 0.011512
7 0.00086597 12 0.78797 0.80895 0.011514
8 0.00039968 15 0.78537 0.81535 0.011549
9 0.00026645 23 0.78217 0.81795 0.011564
10 0.00010000 29 0.78058 0.82694 0.011613
Let's have a look at the variable importance and drop the variables that are not of much
importance.
mymod$variable.importance
PAY_0 PAY_2 PAY_5 PAY_AMT3 PAY_4 BILL_AMT4

PAY_6
1217.888225 278.903517 106.698302 92.500536 79.350640 75.679284
60.738221
BILL_AMT5 BILL_AMT1 PAY_3 BILL_AMT3 BILL_AMT6 PAY_AMT1
BILL_AMT2
58.286214 55.271230 54.153286 50.186956 48.270895 40.076963
27.227377
PAY_AMT5 PAY_AMT6 PAY_AMT4 PAY_AMT2 AGE LIMIT_BAL
SEX
19.867970 12.941451 12.838334 12.422891 8.967786 6.747299
3.225280
Let's re-run the model after dropping the variables "EDUCATION","SEX","MARRIAGE"
set.seed(1234)
mymod<-rpart(payment_default ~
PAY_AMT1+PAY_AMT2+PAY_AMT3+PAY_AMT4+PAY_AMT5+PAY_AMT6+LIMIT_BAL+
PAY_0+PAY_2+PAY_3+PAY_4+PAY_5+PAY_6+BILL_AMT1+BILL_AMT2+
BILL_AMT3+BILL_AMT4+BILL_AMT5+BILL_AMT6+AGE,
data= train_df, method="class",
control = rpart.control(cp = 0.0001,minsplit = 30, minbucket =
30*2,
maxsurrogate = 5, usesurrogate = 2, xval=10, maxdepth = 30))
printcp(mymod)
Classification tree:
rpart(formula = payment_default ~ PAY_AMT1 + PAY_AMT2 + PAY_AMT3 +
PAY_AMT4 + PAY_AMT5 + PAY_AMT6 + LIMIT_BAL + PAY_0 + PAY_2 +
PAY_3 + PAY_4 + PAY_5 + PAY_6 + BILL_AMT1 + BILL_AMT2 + BILL_AMT3 +
BILL_AMT4 + BILL_AMT5 + BILL_AMT6 + AGE, data = train_df,
method = "class", control = rpart.control(cp = 1e-04, minsplit = 30,
minbucket = 30 * 2, maxsurrogate = 5, usesurrogate = 2,
xval = 10, maxdepth = 30))
Variables actually used in tree construction:

[1] AGE BILL_AMT1 BILL_AMT2 BILL_AMT3 BILL_AMT4 BILL_AMT6 LIMIT_BAL
PAY_0
[9] PAY_2 PAY_3 PAY_5 PAY_6 PAY_AMT1 PAY_AMT3 PAY_AMT4
PAY_AMT5
[17] PAY_AMT6
Root node error: 5004/22498 = 0.22242
n= 22498
CP nsplit rel error xerror xstd

1 0.18924860 0 1.00000 1.00000 0.012466
2 0.00379696 1 0.81075 0.81075 0.011524
3 0.00369704 2 0.80695 0.81615 0.011554
4 0.00159872 4 0.79956 0.81195 0.011531
5 0.00099920 6 0.79636 0.81195 0.011531
6 0.00086597 12 0.79017 0.81115 0.011526
7 0.00059952 15 0.78757 0.81315 0.011537
8 0.00039968 16 0.78697 0.81775 0.011563
9 0.00026645 23 0.78417 0.81515 0.011548
10 0.00010000 29 0.78257 0.82114 0.011581
We will prune the tree based on the best complexity parameter that has the least error.
set.seed(1234)
bestcp <- mymod$cptable[which.min(mymod$cptable[,"xerror"]),"CP"]
# Prune the tree using the best cp.

tree.pruned <- prune(mymod, cp = bestcp)
Let us evaluate and the model accuracy
# confusion matrix (training data)

conf.matrix <-
table("Predicted"=predict(tree.pruned,type="class"),"Actual"=train_df$payment_
default)
print(conf.matrix)
Accuracy_dt<-sum(diag(conf.matrix))/sum((conf.matrix))
print(Accuracy_dt * 100)
Actual
Predicted 0 1
0 16785 3348
1 709 1656
[1] 81.96729
Let's try plotting the tree

rpart.plot(mymod,fallen.leaves = F ,extra=3)
0
5004 / 22498
yes PAY_0 = 1 month delay,Paid Duly no
0 1
3348 / 20133 709 / 2365
PAY_2 = 1 month delay,Paid Duly BILL_AMT1 < 587
0 0 0 1
2611 / 18400 737 / 1733 31 / 81 659 / 2284
PAY_AMT3 >= 665 PAY_6 = 4 month delay,5 month or more delay,Paid Duly PAY_6 = 4 month delay,Paid Duly
0 0 0 1 1 1
1518 / 13348 1093 / 5052 478 / 1252 222 / 481 454 / 1388 205 / 896
PAY_5 = Paid Duly BILL_AMT1 >= 530 LIMIT_BAL >= 265e+3 PAY_AMT5 >= 5 PAY_AMT5 >= 10e+3
0 0 0 0 0 0 0 1 1 1
1354 / 12810 164 / 538 548 / 3062 545 / 1990 30 / 129 448 / 1123 114 / 233 103 / 248 36 / 76 418 / 1312
PAY_3 = Paid Duly PAY_AMT4 >= 1 PAY_AMT6 >= 777 PAY_AMT3 < 906 BILL_AMT6 < 17e+3 PAY_AMT1 < 995
0 0 0 0 0 0 0 1 0 1 1 1
91 / 388 73 / 150 165 / 777 380 / 1213 226 / 644 222 / 479 22 / 60 81 / 173 53 / 109 47 / 139 142 / 384 276 / 928
PAY_AMT3 >= 3751 AGE < 53 PAY_AMT3 >= 1602 BILL_AMT3 >= 333 PAY_AMT3 >= 2452 PAY_AMT1 >= 1
0 1 0 1 0 0 0 1 0 1 0 1
24 / 60 41 / 90 337 / 1133 37 / 80 99 / 333 127 / 311 177 / 404 30 / 75 42 / 89 34 / 84 29 / 60 111 / 324
BILL_AMT2 >= 22e+3 BILL_AMT4 < 17e+3 BILL_AMT4 >= 19e+3
0 0 0 0 1 1
48 / 139 79 / 172 89 / 228 88 / 176 80 / 207 31 / 117
BILL_AMT6 < 12e+3 BILL_AMT6 < 19e+3 PAY_AMT3 < 1281
0 1 0 1 0 1
33 / 93 33 / 79 34 / 79 43 / 97 37 / 80 37 / 127
Let's check the accuracy of the model with the data set aside for validation
conf.matrix_test <- table("Predicted"=predict(tree.pruned,test_df,

type="class"),"Actual"=test_df$payment_default)
print(conf.matrix_test)
Accuracy_test<-sum(diag(conf.matrix_test))/sum((conf.matrix_test))
print(Accuracy_test * 100)
Actual
Predicted 0 1
0 5624 1111
1 244 521
[1] 81.93333
The accuracy achieved using the data set aside for validation is also similar hence our model fits
the data well. Overall, we see some improvement in the accuracy of the model using the
Decision Tree approach. Let's try random forest to understand if any further imrovement can be
brought to the model.
RANDOM FOREST
seed<-7
mtry <-
floor(sqrt(ncol(train_df[,c("PAY_AMT1","PAY_AMT2","PAY_AMT3","PAY_AMT4","PAY_A
MT5",
"PAY_AMT6","LIMIT_BAL","EDUCATION","PAY_0","PAY_2","PAY_3",
"PAY_4","PAY_5","PAY_6","BILL_AMT1","BILL_AMT2","BILL_AMT3",
"BILL_AMT4","BILL_AMT5","BILL_AMT6","SEX","AGE","MARRIAGE")])))
metric <- "Accuracy"
control <- trainControl(method="repeatedcv", number=10, repeats=1,
search="grid")
tunegrid <- expand.grid(.mtry=c(1:mtry))
modellist <- list()

for (ntree in c(500,1000, 1500, 2000, 2500)) {
set.seed(seed)
fit <- train(payment_default~.-(ID+PAY_RATIO_APR+PAY_RATIO_MAY+PAY_RATIO_JUNE+
PAY_RATIO_JULY+PAY_RATIO_AUG+PAY_RATIO_SEPT),
data=train_df, method="rf", metric=metric,
tuneGrid=tunegrid, trControl=control, ntree=ntree)
key <- toString(ntree)
modellist[[key]] <- fit
}
# compare results
results <- resamples(modellist)
summary(results)
dotplot(results)
Call:
summary.resamples(object = results)
Models: 500, 1000, 1500, 2000, 2500

Number of resamples: 10
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
500 0.8061361 0.8175556 0.8193333 0.8204727 0.8266667 0.8305914 0
1000 0.8065807 0.8172222 0.8204444 0.8203838 0.8267778 0.8297021 0
1500 0.8065807 0.8170000 0.8211111 0.8207839 0.8258889 0.8314807 0
2000 0.8070253 0.8170000 0.8211111 0.8206061 0.8263333 0.8310360 0
2500 0.8065807 0.8171111 0.8208889 0.8204283 0.8260000 0.8301467 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
500 0.3156109 0.3621014 0.3732189 0.3708866 0.3904972 0.4014915 0
1000 0.3177782 0.3617246 0.3746494 0.3710045 0.3896442 0.4025187 0
1500 0.3189702 0.3604446 0.3777197 0.3721434 0.3928142 0.4001422 0
2000 0.3211289 0.3609974 0.3772465 0.3722848 0.3927817 0.4025196 0
2500 0.3201581 0.3618386 0.3760520 0.3712406 0.3903861 0.3983949 0
rf_random <- randomForest(payment_default~.-
(ID+PAY_RATIO_APR+PAY_RATIO_MAY+PAY_RATIO_JUNE+
PAY_RATIO_JULY+PAY_RATIO_AUG+PAY_RATIO_SEPT),
data=train_df,imrove=0.1,ntree=500,mtry= mtry,
importance=TRUE,nodesize=30)
print(rf_random)
Call:
randomForest(formula = payment_default ~ . - (ID + PAY_RATIO_APR +
PAY_RATIO_MAY + PAY_RATIO_JUNE + PAY_RATIO_JULY + PAY_RATIO_AUG +
PAY_RATIO_SEPT), data = train_df, imrove = 0.1, ntree = 500, mtry =
mtry, importance = TRUE, nodesize = 30)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4
OOB estimate of error rate: 17.83%

Confusion matrix:
0 1 class.error
0 16663 866 0.04940385
1 3146 1823 0.63312538

Capastone Project Taiwan Customer Default

Caricato da

Informazioni sul documento

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

Capastone Project Taiwan Customer Default

Caricato da

Copyright:

Formati disponibili

Capstone Project

Taiwan: Customer Defaults

There are 25 variables:

The data has thirty thousand rows and twenty-five columns

Structure of the data

Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 30000 obs. of 25 variables:

setnames(df, "default payment next month", "payment_default")

### EDUCATION is supposed to be an integer between 1 and 6 (both inclusive).

### MARRIAGE is supposed to be an integer between 1 and 3 (both inclusive).

### Continuous features (BILL_AMT1-BILL_AMT6, PAY_AMT1-PAY_AMT6) -There is a

### SEX: 1: Male and 2: Female

### EDUCATION: 1: Graduate School, 2: University, 3: High School , 4:

BAR PLOT OF DEMOGRAPHIC VARIABLES

df$PAY_0<-ifelse(df$PAY_0 <1 ,"Paid Duly",ifelse(df$PAY_0==1,"1 month delay",

df$PAY_2<-ifelse(df$PAY_2 <1 ,"Paid Duly",ifelse(df$PAY_2==1,"1 month delay",

df$PAY_3<-ifelse(df$PAY_3 <1 ,"Paid Duly",ifelse(df$PAY_3==1,"1 month delay",

df$PAY_4<-ifelse(df$PAY_4 <1 ,"Paid Duly",ifelse(df$PAY_4==1,"1 month delay",

df$PAY_5<-ifelse(df$PAY_5 <1 ,"Paid Duly",ifelse(df$PAY_5==1,"1 month delay",

df$PAY_6<-ifelse(df$PAY_6 <1 ,"Paid Duly",ifelse(df$PAY_6==1,"1 month delay",

BAR PLOT OF PAYMENT DATA

Analysis of the Continuous Variables

Let us begin by plotting them.

# Surprisingly ,Single people are less likely to default

# Let’s take a look at default status vs 2 categorical variable at a time now .

# Default vs Marriage and Education

# Default vs Marriage and Sex

Missing Value Check

df<-subset(df, !(df$LIMIT_BAL> quantile(df$LIMIT_BAL, 0.99999) |

Imbalance check for the Dependent Variable:-

payment_default Client_Count Contribution

There is no significant imbalance in the dependent variable.

payment_default Client_Count Contribution

No major imbalance observed, so we would proceed to modelling.

SEX EDUCATION MARRIAGE PAY_0 PAY_2 PAY_3 PAY_4 PAY_5 PAY_6

Coefficients: (1 not defined because of singularities)

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 23846 on 22497 degrees of freedom

Number of Fisher Scoring iterations: 9

Predicting the probabilities obtained from the model

STEP AIC CRITERION

Re-running the model with the best subset obtained.

mod_2<-glm(payment_default ~ LIMIT_BAL + EDUCATION + PAY_0 +

Coefficients: (1 not defined because of singularities)

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 23846 on 22497 degrees of freedom

Number of Fisher Scoring iterations: 9

#The ROCR Curve

0.0 0.2 0.4 0.6 0.8 1.0

False positive rate

#Confusion matrix for test data

#Accuarcy of the model

Variables actually used in tree construction:

Root node error: 5004/22498 = 0.22242

CP nsplit rel error xerror xstd

PAY_0 PAY_2 PAY_5 PAY_AMT3 PAY_4 BILL_AMT4

Let's re-run the model after dropping the variables "EDUCATION","SEX","MARRIAGE"

Variables actually used in tree construction:

Root node error: 5004/22498 = 0.22242

CP nsplit rel error xerror xstd

# Prune the tree using the best cp.

Let us evaluate and the model accuracy

# confusion matrix (training data)

Let's try plotting the tree

yes PAY_0 = 1 month delay,Paid Duly no