Credit Risk Modelling (EDA & Classification) - Kaggle

7/28/2019 Credit Risk Modelling [EDA & Classification] | Kaggle
https://www.kaggle.com/ionaskel/credit-risk-modelling-eda-classification/report
Credit Risk Modelling [EDA & Classification]
IonasKel
September 30, 2018
Introduction (https://www.kaggle.com/ionaskel/credit-risk-modelling-eda-classification#introduction)
Load data and Libraries (https://www.kaggle.com/ionaskel/credit-risk-modelling-eda-classification#load-data-and-libraries)
Feature Selection & Engineering (https://www.kaggle.com/ionaskel/credit-risk-modelling-eda-classification#feature-selection-engineering)
Exploratory Data Analysis (https://www.kaggle.com/ionaskel/credit-risk-modelling-eda-classification#exploratory-data-analysis)
Data modelling (https://www.kaggle.com/ionaskel/credit-risk-modelling-eda-classification#data-modelling)
Conclusion (https://www.kaggle.com/ionaskel/credit-risk-modelling-eda-classification#conclusion)
Rmarkdown script using data from Lending Club Loan Data
Introduction
The analysis of credit risk and the decision making for granting loans is one of the most important operations for financial institutions. By taking into account past results, we need to train a model to accurately predict future outcomes.
Load data and Libraries

Load libraries we are going to use.
library(tidyverse)
library(ggthemes)
library(corrplot)
library(GGally)
library(DT)
library(caret)
Load the data available for analysis. The dataset is taken from a bank's records about the status of loan defaults and the profile of customers.
# Set the blank spaces to NA's
loan = read_csv("../input/loan.csv" , na = "")
colnames(loan)
##  [1] "id"                          "member_id"
##  [3] "loan_amnt"                   "funded_amnt"
##  [5] "funded_amnt_inv"             "term"
##  [7] "int_rate"                    "installment"
##  [9] "grade"                       "sub_grade"
## [11] "emp_title"                   "emp_length"
## [13] "home_ownership"              "annual_inc"
## [15] "verification_status"         "issue_d"
## [17] "loan_status"                 "pymnt_plan"
## [19] "url"                         "desc"
## [21] "purpose"                     "title"
## [23] "zip_code"                    "addr_state"
## [25] "dti"                         "delinq_2yrs"
## [27] "earliest_cr_line"            "inq_last_6mths"
## [29] "mths_since_last_delinq"      "mths_since_last_record"
## [31] "open_acc"                    "pub_rec"
## [33] "revol_bal"                   "revol_util"
## [35] "total_acc"                   "initial_list_status"
## [37] "out_prncp"                   "out_prncp_inv"
## [39] "total_pymnt"                 "total_pymnt_inv"
## [41] "total_rec_prncp"             "total_rec_int"
## [43] "total_rec_late_fee"          "recoveries"
## [45] "collection_recovery_fee"     "last_pymnt_d"
## [47] "last_pymnt_amnt"             "next_pymnt_d"
## [49] "last_credit_pull_d"          "collections_12_mths_ex_med"
## [51] "mths_since_last_major_derog" "policy_code"
## [53] "application_type"            "annual_inc_joint"
## [55] "dti_joint"                   "verification_status_joint"
## [57] "acc_now_delinq"              "tot_coll_amt"
## [59] "tot_cur_bal"                 "open_acc_6m"
## [61] "open_il_6m"                  "open_il_12m"
## [63] "open_il_24m"                 "mths_since_rcnt_il"
## [65] "total_bal_il"                "il_util"
## [67] "open_rv_12m"                 "open_rv_24m"
## [69] "max_bal_bc"                  "all_util"
## [71] "total_rev_hi_lim"            "inq_fi"
## [73] "total_cu_tl"                 "inq_last_12m"
Feature Selection & Engineering

The dataset contains information on annual income, loan grade, employment length and home ownership, all of which affect the borrower's probability of default. The columns we are going to use are:
loan_status : Variable with multiple levels (e.g. Charged off, Current, Default, Fully Paid …)
loan_amnt : Total amount of loan taken
int_rate : Loan interest rate
grade : Loan grade assigned by Lending Club (A to G)
emp_length : Duration of employment
home_ownership : Type of ownership of house
annual_inc : Total annual income
term : 36-month or 60-month period
# Select only the columns mentioned above.
loan = loan %>%
  select(loan_status , loan_amnt , int_rate , grade , emp_length , home_ownership ,
         annual_inc , term)
loan
## # A tibble: 887,379 x 8
##    loan_status loan_amnt int_rate grade emp_length home_ownership
##    <chr>           <dbl>    <dbl> <chr> <chr>      <chr>
##  1 Fully Paid       5000     10.6 B     10+ years  RENT
##  2 Charged Off      2500     15.3 C     < 1 year   RENT
##  3 Fully Paid       2400     16.0 C     10+ years  RENT
##  4 Fully Paid      10000     13.5 C     10+ years  RENT
##  5 Current          3000     12.7 B     1 year     RENT
##  6 Fully Paid       5000      7.9 A     3 years    RENT
##  7 Current          7000     16.0 C     8 years    RENT
##  8 Fully Paid       3000     18.6 E     9 years    RENT
##  9 Charged Off      5600     21.3 F     4 years    OWN
## 10 Charged Off      5375     12.7 B     < 1 year   RENT
## # ... with 887,369 more rows, and 2 more variables: annual_inc <dbl>,
## #   term <chr>
Missing Values:
sapply(loan , function(x) sum(is.na(x)))
##    loan_status      loan_amnt       int_rate          grade     emp_length
##              0              0              0              0              0
## home_ownership     annual_inc           term
##              0              4              0
# Remove the 4 rows with missing annual income, 49 rows where home ownership is
# 'NONE' or 'ANY' and rows where emp_length is 'n/a'.
loan = loan %>%
  filter(!is.na(annual_inc) ,
         !(home_ownership %in% c('NONE' , 'ANY')) ,
         emp_length != 'n/a')
Exploratory Data Analysis

loan_status :
loan %>%
count(loan_status) %>%
ggplot(aes(x = reorder(loan_status , desc(n)) ,
y = n , fill = n)) +
geom_col() +
coord_flip() +
labs(x = 'Loan Status' , y = 'Count')
We want to convert this variable to a binary one (1 for default and 0 for non-default), but it has 10 different levels. Loans with status Current, Late payments or In grace period have no final outcome yet and need to be removed. Therefore, we create a new variable called loan_outcome where
loan_outcome -> 1 if loan_status = 'Charged Off' or 'Default'
loan_outcome -> 0 if loan_status = 'Fully Paid'
loan = loan %>%
  mutate(loan_outcome = ifelse(loan_status %in% c('Charged Off' , 'Default') ,
                               1,
                               ifelse(loan_status == 'Fully Paid' , 0 , 'No info')
                               ))
barplot(table(loan$loan_outcome) , col = 'lightblue')
We will create a new dataset which contains only rows with 0 or 1 in the loan_outcome feature for better modelling.
# Create the new dataset by filtering 0's and 1's in the loan_outcome column and
# remove loan_status column for the modelling
loan2 = loan %>%
  select(-loan_status) %>%
  filter(loan_outcome %in% c(0 , 1))
Our new dataset contains 244179 rows.
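The nested ifelse above mixes the numbers 1 and 0 with the string 'No info', so R stores loan_outcome as a character column until it is converted later. One possible alternative, sketched here on a small made-up data frame (the `toy` rows are hypothetical and not taken from the dataset), uses NA for loans without a final outcome so the column stays numeric:

```r
# Hypothetical toy data: a few loan statuses, not taken from the real dataset.
toy <- data.frame(
  loan_status = c("Fully Paid", "Charged Off", "Current", "Default", "Late (31-120 days)"),
  stringsAsFactors = FALSE
)

# 1 = default, 0 = fully paid, NA = no final outcome yet.
toy$loan_outcome <- ifelse(
  toy$loan_status %in% c("Charged Off", "Default"), 1,
  ifelse(toy$loan_status == "Fully Paid", 0, NA)
)

# Keep only rows with a known outcome and drop the original status column.
toy2 <- toy[!is.na(toy$loan_outcome), "loan_outcome", drop = FALSE]

print(toy2)
str(toy2$loan_outcome)  # numeric, so no as.numeric() conversion is needed later
```

With this variant the later as.numeric() call becomes unnecessary, at the cost of deviating slightly from the kernel's own code.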
Let's observe how useful these variables would be for credit risk modelling. It is known that the better the grade, the lower the interest rate. We can visualise this nicely with boxplots.
ggplot(loan2 , aes(x = grade , y = int_rate , fill = grade)) +
  geom_boxplot() +
  theme_igray() +
  labs(y = 'Interest Rate' , x = 'Grade')
We assume that grade is a good predictor of the volume of non-performing loans. But how many loans in each grade did not perform?
table(loan2$grade , factor(loan2$loan_outcome , c(0 , 1) , c('Fully Paid' , 'Default')))
##
##     Fully Paid Default
##   A      38268    2472
##   B      64185    9095
##   C      50823   12252
##   D      28874   10202
##   E      12473    6162
##   F       4581    2890
##   G       1110     792
ggplot(loan2 , aes(x = grade , y = ..count.. , fill = factor(loan_outcome , c(1 , 0) , c('Default' , 'Fully Paid')))) +
  geom_bar() +
  theme(legend.title = element_blank())
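The counts in the table above are easier to compare as default rates. The short sketch below copies the counts from that table into two vectors (the vector names are chosen here for illustration) and computes the share of defaulted loans per grade, which rises from roughly 6% for grade A to about 42% for grade G:

```r
# Counts copied from the grade-by-outcome table above.
fully_paid <- c(A = 38268, B = 64185, C = 50823, D = 28874, E = 12473, F = 4581, G = 1110)
default    <- c(A = 2472,  B = 9095,  C = 12252, D = 10202, E = 6162,  F = 2890, G = 792)

# Share of loans in each grade that ended in default.
default_rate <- default / (fully_paid + default)

print(round(100 * default_rate, 1))  # default rate per grade, in percent
```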
Now let's try to find out what impact the annual income of the borrower has on the other variables.
ggplot(loan2[sample(nrow(loan2) , 10000) , ] , aes(x = annual_inc , y = loan_amnt , color = int_rate)) +
  geom_point(alpha = 0.5 , size = 1.5) +
  geom_smooth(se = F , color = 'darkred' , method = 'loess') +
  xlim(c(0 , 300000)) +
  labs(x = 'Annual Income' , y = 'Loan Amount' , color = 'Interest Rate')
As expected, the larger the annual income, the larger the amount requested by the borrower.
Data modelling
Modelling Process:
We created the binary loan_outcome which will be our response variable.

We exclude some independent variables in order to make the model simpler.
We split the dataset into a training set (75%) and a testing set (25%) for validation.
We train a model to predict the probability of default.
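The steps listed above can be sketched end to end on simulated data. Everything in this block is an assumption made for illustration: the data frame `sim`, its two predictors and the coefficients used to generate the outcome are invented, and a fixed seed is added so that the random split can be reproduced.

```r
set.seed(42)  # fixed seed so the random split is reproducible

# Simulated stand-in for the modelling data (hypothetical columns and coefficients).
n <- 2000
sim <- data.frame(
  int_rate   = runif(n, 5, 25),
  annual_inc = runif(n, 20000, 150000)
)
true_p <- plogis(-3 + 0.12 * sim$int_rate - 0.00001 * sim$annual_inc)
sim$loan_outcome <- rbinom(n, size = 1, prob = true_p)

# 75% / 25% split into training and testing sets.
idx   <- sample(nrow(sim), size = floor(0.75 * nrow(sim)))
train <- sim[idx, ]
test  <- sim[-idx, ]

# Logistic regression and predicted default probabilities on the test set.
fit <- glm(loan_outcome ~ ., data = train, family = binomial(link = "logit"))
test_prob <- predict(fit, newdata = test, type = "response")

print(dim(train))
print(dim(test))
print(summary(test_prob))
```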
Because of the binary response variable we can use logistic regression. Rather
than modelling the response Y directly, logistic regression models the probability
that Y belongs to a particular category, in our case the probability of a non-
performing loan. This probability can be computed by the logistic function,
P = exp(b0 + b1X1 + … + bNXN) / [ 1 + exp(b0 + b1X1 + … + bNXN) ]
where
P is the probability of default
b0 , b1 , … , bN are the coefficient estimates
N is the number of independent variables
X1 , … , XN are the independent variables
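As a small worked example of the formula above, the sketch below evaluates the logistic function for one borrower. The coefficients and predictor values are hypothetical and are not the fitted estimates; with these numbers the linear predictor is -1.8 and the resulting probability is about 0.14.

```r
# Logistic function: maps a linear predictor to a probability in (0, 1).
logistic <- function(eta) exp(eta) / (1 + exp(eta))

# Hypothetical coefficients for a model with two predictors (not the fitted values).
b0 <- -3.0
b1 <- 0.12      # coefficient of x1
b2 <- -0.00001  # coefficient of x2

x1 <- 15      # e.g. interest rate in percent
x2 <- 60000   # e.g. annual income

eta <- b0 + b1 * x1 + b2 * x2   # linear predictor b0 + b1*X1 + b2*X2
p   <- logistic(eta)            # probability of default

print(p)
print(all.equal(p, plogis(eta)))  # base R's plogis() is the same function
```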
# Split dataset
loan2$loan_outcome = as.numeric(loan2$loan_outcome)
idx = sample(dim(loan2)[1] , 0.75*dim(loan2)[1] , replace = F)
trainset = loan2[idx , ]
testset = loan2[-idx , ]
# Fit logistic regression
glm.model = glm(loan_outcome ~ . , trainset , family = binomial(link = 'logit'))
summary(glm.model)
##
## Call:
## glm(formula = loan_outcome ~ ., family = binomial(link = "logit"),
##     data = trainset)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.4008  -0.6693  -0.5173  -0.3400   8.4904
##
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)         -3.478e+00  4.987e-02 -69.752  < 2e-16 ***
## loan_amnt            1.413e-05  9.694e-07  14.576  < 2e-16 ***
## int_rate             1.362e-01  4.693e-03  29.017  < 2e-16 ***
## gradeB               1.402e-01  3.327e-02   4.213 2.52e-05 ***
## gradeC               1.611e-01  4.282e-02   3.762 0.000169 ***
## gradeD               1.178e-01  5.466e-02   2.155 0.031159 *
## gradeE              -2.819e-02  6.770e-02  -0.416 0.677082
## gradeF              -2.548e-01  8.360e-02  -3.048 0.002303 **
## gradeG              -3.342e-01  1.020e-01  -3.276 0.001053 **
## emp_length1 year    -7.567e-02  3.199e-02  -2.366 0.018000 *
## emp_length10+ years -8.850e-02  2.439e-02  -3.629 0.000284 ***
## emp_length2 years   -9.606e-02  2.942e-02  -3.265 0.001095 **
## emp_length3 years          ...        ...     ... 0.041061 *
## emp_length4 years          ...        ...     ... 0.023888 *
## emp_length5 years          ...        ...     ... 0.331941
## emp_length6 years          ...        ...     ... 0.681278
## emp_length7 years          ...        ...     ... 0.071440 .
## emp_length8 years          ...        ...     ... 0.399682
## emp_length9 years          ...        ...     ... 0.966458
## home_ownershipOTHER  4.291e-01  2.478e-01   1.732 0.083343 .
## home_ownershipOWN    6.730e-02  2.391e-02   2.815 0.004881 **
## home_ownershipRENT   2.059e-01  1.404e-02  14.668  < 2e-16 ***
## annual_inc          -6.441e-06  2.122e-07 -30.351  < 2e-16 ***
## term60 months        3.242e-01  1.681e-02  19.284  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 172580  on 183133  degrees of freedom
## Residual deviance: 159710  on 183110  degrees of freedom
## AIC: 159758
##
## Number of Fisher Scoring iterations: 6
The coefficients of the following features are positive:
1. Loan Amount
2. Interest Rate
3. Home Ownership - Other, Own and Rent (relative to Mortgage)
4. Term (60 months relative to 36 months)
5. Grades B, C and D (relative to grade A): the better the grade, the more difficult it is to default
This means the probability of defaulting on the given credit increases with these factors. For example, the larger the amount of the loan, the higher the risk of default.
The coefficients of the following features are negative:
1. Annual Income
2. Employment length of 1 year, 2 years and 10+ years (relative to less than 1 year): borrowers with 10+ years of experience are more likely to pay their debt
3. Grades F and G (relative to grade A, once the interest rate is already in the model)
This means that the probability of defaulting decreases as these factors increase or apply. Several of the other employment lengths show no significant difference from the baseline.
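Because logistic regression coefficients act on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. The sketch below copies four estimates from the summary output above (rounded to four decimals; the vector name `coefs` is chosen here for illustration). With these values, each extra percentage point of interest multiplies the odds of default by about 1.15, and a 60-month term multiplies them by about 1.38 compared with a 36-month term.

```r
# Estimates copied from the summary output above.
coefs <- c(
  int_rate              = 0.1362,
  `term60 months`       = 0.3242,
  home_ownershipRENT    = 0.2059,
  `emp_length10+ years` = -0.0885
)

# exp(coefficient) = multiplicative change in the odds of default
# for a one-unit increase (or for belonging to that category).
odds_ratio <- exp(coefs)

print(round(odds_ratio, 3))
```

Values above 1 raise the odds of default and values below 1 lower them, holding the other predictors fixed.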
# Prediction on test set
preds = predict(glm.model , testset , type = 'response')
# Density of probabilities
ggplot(data.frame(preds) , aes(preds)) +
  geom_density(fill = 'lightblue' , alpha = 0.4) +
  labs(x = 'Predicted Probabilities on test set')
Now let's see how accuracy, sensitivity and specificity change with the classification threshold. We can use a threshold of 50% for the posterior probability of default in order to assign an observation to the default class. However, if we are concerned about incorrectly predicting the default status of individuals who do default, then we can consider lowering this threshold. So we will consider these three metrics for threshold levels from 1% up to 50%.
k = 0
accuracy = c()
sensitivity = c()
specificity = c()
for(i in seq(from = 0.01 , to = 0.5 , by = 0.01)){
  k = k + 1
  preds_binomial = ifelse(preds > i , 1 , 0)
  # Rows of confmat are the actual classes, columns are the predicted classes
  confmat = table(testset$loan_outcome , preds_binomial)
  accuracy[k] = sum(diag(confmat)) / sum(confmat)
  # Both ratios below divide by column totals, i.e. by the number of loans predicted in that class
  sensitivity[k] = confmat[1 , 1] / sum(confmat[ , 1])
  specificity[k] = confmat[2 , 2] / sum(confmat[ , 2])
}
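In the loop above the table has actual classes in rows and predicted classes in columns, so dividing by column totals measures the share of correct predictions within each predicted class (usually called negative and positive predictive value). Sensitivity and specificity are conventionally computed over the actual classes instead. The following is a minimal sketch of that row-wise version on simulated data; the vectors `actual` and `prob` are invented for illustration, and the factor levels are fixed so the table stays 2 x 2 even when a threshold assigns every loan to one class.

```r
set.seed(1)

# Simulated actual outcomes and predicted probabilities (hypothetical data).
actual <- rbinom(1000, size = 1, prob = 0.2)
prob   <- plogis(-1.5 + 1.2 * actual + rnorm(1000))

thresholds <- seq(from = 0.01, to = 0.5, by = 0.01)

metrics <- t(sapply(thresholds, function(th) {
  # Fixed factor levels keep the table 2 x 2 for every threshold.
  predicted <- factor(as.integer(prob > th), levels = c(0, 1))
  cm <- table(Actual = factor(actual, levels = c(0, 1)), Predicted = predicted)
  c(threshold   = th,
    accuracy    = sum(diag(cm)) / sum(cm),
    sensitivity = cm["1", "1"] / sum(cm["1", ]),  # true positives / actual defaults
    specificity = cm["0", "0"] / sum(cm["0", ]))  # true negatives / actual non-defaults
}))

print(head(metrics))
```

Under these definitions sensitivity can only fall and specificity can only rise as the threshold increases, which makes the trade-off between the two easier to read from a plot.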
If we plot our results we get this visualization.
threshold = seq(from = 0.01 , to = 0.5 , by = 0.01)
data = data.frame(threshold , accuracy , sensitivity , specificity)
head(data)
##   threshold  accuracy sensitivity specificity
## 1      0.01 0.1799492   0.9402985   0.1791138
## 2      0.02 0.1817348   0.9615385   0.1794029
## 3      0.03 0.1862724   0.9606625   0.1800964
## 4      0.04 0.1988697   0.9647779   0.1821256
## 5      0.05 0.2252109   0.9647563   0.1865055
## 6      0.06 0.2614956   0.9571610   0.1924878
# Gather accuracy , sensitivity and specificity in one column
ggplot(gather(data , key = 'Metric' , value = 'Value' , 2:4) ,
       aes(x = threshold , y = Value , color = Metric)) +
  geom_line(size = 1.5)
A threshold of 25% - 30% seems ideal, because a further increase of the cut-off percentage does not have a significant impact on the accuracy of the model. The confusion matrix for a cut-off point of 30% is:
preds.for.30 = ifelse(preds > 0.3 , 1 , 0)
confusion_matrix_30 = table(Predicted = preds.for.30 , Actual = testset$loan_outcome)
confusion_matrix_30
##          Actual
## Predicted     0     1
##         0 44853  7834
##         1  5266  3092
## [1] "Accuracy : 0.7854"
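The confusion matrix above is enough to work out the main metrics by hand. The sketch below rebuilds it as a plain matrix (the object name `cm` is chosen here for illustration) and computes accuracy together with sensitivity and specificity over the actual classes. It reproduces the reported accuracy of 0.7854 and gives a sensitivity of about 0.28 and a specificity of about 0.89, so at this cut-off roughly 28% of the actual defaults are flagged.

```r
# Confusion matrix copied from the output above (rows = predicted, columns = actual).
cm <- matrix(c(44853, 5266, 7834, 3092), nrow = 2,
             dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))

accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # share of actual defaults that were flagged
specificity <- cm["0", "0"] / sum(cm[, "0"])  # share of repaid loans that were not flagged

print(round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4))
```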
The ROC (Receiver Operating Characteristic) curve is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds.
library(pROC)
# Area Under Curve

auc(roc(testset$loan_outcome , preds))
## Area under the curve: 0.6957
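One way to read the area under the curve is as the probability that a randomly chosen defaulted loan receives a higher predicted probability than a randomly chosen repaid loan. The sketch below illustrates this with the rank-based (Mann-Whitney) formula in base R on simulated data. The function `auc_rank` and the vectors `actual` and `score` are hypothetical names introduced for this illustration and are not part of pROC.

```r
set.seed(7)

# Simulated outcomes and scores (hypothetical data).
actual <- rbinom(500, size = 1, prob = 0.2)
score  <- plogis(-1.5 + 1.0 * actual + rnorm(500))

# AUC via the rank-sum formulation: the probability that a random
# positive case is scored higher than a random negative case.
auc_rank <- function(actual, score) {
  r  <- rank(score)          # average ranks, so ties count as one half
  n1 <- sum(actual == 1)     # number of positives (defaults)
  n0 <- sum(actual == 0)     # number of negatives (repaid loans)
  (sum(r[actual == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

print(auc_rank(actual, score))
```

A value of 0.5 corresponds to random ranking and 1 to perfect separation, which puts the 0.6957 reported above in the modest range.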
# Plot ROC curve
plot.roc(testset$loan_outcome , preds , main = "Confidence interval of a threshold" , percent = TRUE ,
         ci = TRUE , of = "thresholds" , thresholds = "best" , print.thres = "best" , col = 'blue')
This kernel has been released under the Apache 2.0 open source license.
Code
---
title: "Credit Risk Modelling [EDA & Classification]"
author: "IonasKel"
date: "September 30, 2018"
output:
  html_document:
    fig_height: 8
    fig_width: 12
    highlight: tango
    toc: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE , warning = FALSE , message = FALSE)
```

# Introduction

The analysis of credit risk and the decision making for granting loans is one of the most important operations for financial institutions. By taking into account past results, we need to train a model to accurately predict future outcomes.

# Load data and Libraries

Load libraries we are going to use.

```{r libraries}
library(tidyverse)
library(ggthemes)
library(corrplot)
library(GGally)
library(DT)
library(caret)
```

Load the data available for analysis. The dataset is taken from a bank's records about the status of loan defaults and the profile of customers.

```{r load data}
# Set the blank spaces to NA's
loan = read_csv("../input/loan.csv" , na = "")

```
```{r columns names}
colnames(loan)

```

# Feature Selection & Engineering

The dataset contains information on annual income, loan grade, employment length and home ownership, all of which affect the borrower's probability of default. The columns we are going to use are:

* **loan_status** : Variable with multiple levels (e.g. Charged off, Current, Default, Fully Paid ...)
* **loan_amnt** : Total amount of loan taken
* **int_rate** : Loan interest rate
* **grade** : Loan grade assigned by Lending Club (A to G)
* **emp_length** : Duration of employment
* **home_ownership** : Type of ownership of house
* **annual_inc** : Total annual income
* **term** : 36-month or 60-month period

```{r select columns}
# Select only the columns mentioned above.
loan = loan %>%
  select(loan_status , loan_amnt , int_rate , grade , emp_length , home_ownership ,
         annual_inc , term)
loan

```

Missing Values:
```{r NAs}
sapply(loan , function(x) sum(is.na(x)))

# Remove the 4 rows with missing annual income, 49 rows where home ownership is 'NONE' or 'ANY' and rows where emp_length is 'n/a'.

loan = loan %>%
  filter(!is.na(annual_inc) ,
         !(home_ownership %in% c('NONE' , 'ANY')) ,
         emp_length != 'n/a')

```

# Exploratory Data Analysis

* **loan_status** :

```{r loan_status}
loan %>%
  count(loan_status) %>%
  ggplot(aes(x = reorder(loan_status , desc(n)) , y = n , fill = n)) +
  geom_col() +
  coord_flip() +
  labs(x = 'Loan Status' , y = 'Count')

```

We want to convert this variable to a binary one (1 for default and 0 for non-default), but it has 10 different levels. Loans with status Current, Late payments or In grace period have no final outcome yet and need to be removed. Therefore, we create a new variable called loan_outcome where

loan_outcome -> 1 if loan_status = 'Charged Off' or 'Default'
loan_outcome -> 0 if loan_status = 'Fully Paid'

```{r loan_outcome}
loan = loan %>%
  mutate(loan_outcome = ifelse(loan_status %in% c('Charged Off' , 'Default') ,
                               1,
                               ifelse(loan_status == 'Fully Paid' , 0 , 'No info')
                               ))

barplot(table(loan$loan_outcome) , col = 'lightblue')

```

We will create a new dataset which contains only rows with 0 or 1 in the loan_outcome feature for better modelling.

```{r loan2}
# Create the new dataset by filtering 0's and 1's in the loan_outcome column and remove loan_status column for the modelling
loan2 = loan %>%
  select(-loan_status) %>%
  filter(loan_outcome %in% c(0 , 1))

```

Our new dataset contains **`r nrow(loan2)` rows**.

Let's observe how useful these variables would be for credit risk modelling. It is known that the better the grade, the lower the interest rate. We can visualise this nicely with boxplots.

```{r grade_boxplot}
ggplot(loan2 , aes(x = grade , y = int_rate , fill = grade)) +
  geom_boxplot() +
  theme_igray() +
  labs(y = 'Interest Rate' , x = 'Grade')

```

We assume that grade is a good predictor of the volume of non-performing loans. But how many loans in each grade did not perform?

```{r grade_barplot}
table(loan2$grade , factor(loan2$loan_outcome , c(0 , 1) , c('Fully Paid' , 'Default')))

ggplot(loan2 , aes(x = grade , y = ..count.. , fill = factor(loan_outcome , c(1 , 0) , c('Default' , 'Fully Paid')))) +
  geom_bar() +
  theme(legend.title = element_blank())

```

Now let's try to find out what impact the annual income of the borrower has on the other variables.

```{r ann_inc vs loan_amnt}
ggplot(loan2[sample(nrow(loan2) , 10000) , ] , aes(x = annual_inc , y = loan_amnt , color = int_rate)) +
  geom_point(alpha = 0.5 , size = 1.5) +
  geom_smooth(se = F , color = 'darkred' , method = 'loess') +
  xlim(c(0 , 300000)) +
  labs(x = 'Annual Income' , y = 'Loan Amount' , color = 'Interest Rate')

```

As expected, the larger the annual income, the larger the amount requested by the borrower.

# Data modelling

Modelling Process:

* We created the binary loan_outcome which will be our response variable.
* We exclude some independent variables in order to make the model simpler.
* We split the dataset into a training set (75%) and a testing set (25%) for validation.
* We train a model to predict the probability of default.

Because of the binary response variable we can use logistic regression. Rather than modelling the response Y directly, logistic regression models the probability that Y belongs to a particular category, in our case the probability of a non-performing loan. This probability can be computed by the logistic function,

P = exp(b0 + b1X1 + ... + bNXN) / [ 1 + exp(b0 + b1X1 + ... + bNXN) ]

where

* P is the probability of default
* b0 , b1 , ... , bN are the coefficient estimates
* N is the number of independent variables
* X1 , ... , XN are the independent variables

```{r log_regr}
# Split dataset
loan2$loan_outcome = as.numeric(loan2$loan_outcome)
idx = sample(dim(loan2)[1] , 0.75*dim(loan2)[1] , replace = F)
trainset = loan2[idx , ]
testset = loan2[-idx , ]

# Fit logistic regression
glm.model = glm(loan_outcome ~ . , trainset , family = binomial(link = 'logit'))
summary(glm.model)

```

The coefficients of the following features are **positive**:

1) Loan Amount
2) Interest Rate
3) Home Ownership - Other, Own and Rent (relative to Mortgage)
4) Term (60 months relative to 36 months)
5) Grades B, C and D (relative to grade A): the better the grade, the more difficult it is to default

This means the probability of defaulting on the given credit increases with these factors. For example, the larger the amount of the loan, the higher the risk of default.

The coefficients of the following features are **negative**:

1) Annual Income
2) Employment length of 1 year, 2 years and 10+ years (relative to less than 1 year): borrowers with 10+ years of experience are more likely to pay their debt
3) Grades F and G (relative to grade A, once the interest rate is already in the model)

This means that the probability of defaulting decreases as these factors increase or apply. Several of the other employment lengths show no significant difference from the baseline.

```{r pred}
# Prediction on test set
preds = predict(glm.model , testset , type = 'response')

# Density of probabilities
ggplot(data.frame(preds) , aes(preds)) +
  geom_density(fill = 'lightblue' , alpha = 0.4) +
  labs(x = 'Predicted Probabilities on test set')

```

Now let's see how accuracy, sensitivity and specificity change with the classification threshold. We can use a threshold of 50% for the posterior probability of default in order to assign an observation to the default class. However, if we are concerned about incorrectly predicting the default status of individuals who do default, then we can consider lowering this threshold. So we will consider these three metrics for threshold levels from 1% up to 50%.

```{r acc}
k = 0
accuracy = c()
sensitivity = c()
specificity = c()
for(i in seq(from = 0.01 , to = 0.5 , by = 0.01)){
  k = k + 1
  preds_binomial = ifelse(preds > i , 1 , 0)
  confmat = table(testset$loan_outcome , preds_binomial)
  accuracy[k] = sum(diag(confmat)) / sum(confmat)
  sensitivity[k] = confmat[1 , 1] / sum(confmat[ , 1])
  specificity[k] = confmat[2 , 2] / sum(confmat[ , 2])
}
```

```{r remove , echo = FALSE}
rm(confmat , k , i , preds_binomial)

```

If we plot our results we get this visualization.

```{r threshold}
threshold = seq(from = 0.01 , to = 0.5 , by = 0.01)

data = data.frame(threshold , accuracy , sensitivity , specificity)
head(data)

# Gather accuracy , sensitivity and specificity in one column
ggplot(gather(data , key = 'Metric' , value = 'Value' , 2:4) ,
       aes(x = threshold , y = Value , color = Metric)) +
  geom_line(size = 1.5)
```
```{r , echo = FALSE}
rm(data)

```

A threshold of 25% - 30% seems ideal, because a further increase of the cut-off percentage does not have a significant impact on the accuracy of the model. The confusion matrix for a cut-off point of 30% is:

```{r cutoff.30%}
preds.for.30 = ifelse(preds > 0.3 , 1 , 0)
confusion_matrix_30 = table(Predicted = preds.for.30 , Actual = testset$loan_outcome)
confusion_matrix_30

```
```{r acc2 , echo = FALSE}
paste('Accuracy :' , round(sum(diag(confusion_matrix_30)) / sum(confusion_matrix_30) , 4))

```

```{r , echo = FALSE}
rm(preds.for.30)

```

The *ROC (Receiver Operating Characteristic) curve* is a popular graphic for simultaneously displaying the two types of errors for all possible thresholds.

```{r roc}
library(pROC)

# Area Under Curve
auc(roc(testset$loan_outcome , preds))

# Plot ROC curve
plot.roc(testset$loan_outcome , preds , main = "Confidence interval of a threshold" , percent = TRUE ,
         ci = TRUE , of = "thresholds" , thresholds = "best" , print.thres = "best" , col = 'blue')

```

# Conclusion

A logistic regression model was used to predict the loan status. Different cut off's were used to d
Data
Data Sources
Lending Club Loan Data
  loan.csv (148 columns)
  database.sqlite
    loan (887k x 75)
  LCDataDictionary.xlsx
Lending Club Loan Data
Analyze Lending Club's issued loans
Last Updated: 3 years ago (Version 1)
About this Dataset
These files contain complete loan data for all loans issued through 2007-2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the "present" contains complete loan data for all loans issued through the previous completed calendar quarter. Additional features include credit scores, number of finance inquiries, address including zip codes and state, and collections, among others. The file is a matrix of about 890 thousand observations and 75 variables. A data dictionary is provided in a separate file.
Run Info
Succeeded: True
Run Time: 71.8 seconds
Exit Code: 0
Queue Time: 0 seconds
Docker Image Name: kaggle/rstats (Dockerfile)
Output Size: 0
Timeout Exceeded: False
Used All Space: False
Failure Message:
Log
Time Line # Log Message
4.4s 1 processing file: script.Rmd
4.5s 2 ordinary text without R code
4.5s 3 label: setup (with options)
List of 1
$ include: logi FALSE
4.7s 6 label: libraries
6.3s 7 ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
6.3s 8 ✔ ggplot2 3.0.0.9000 ✔ purrr 0.2.5
✔ tibble 1.4.2 ✔ dplyr 0.7.6
✔ tidyr 0.8.1 ✔ stringr 1.3.1
✔ readr 1.2.0 ✔ forcats 0.3.0
6.7s 9 ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
6.8s 10 corrplot 0.84 loaded
7.1s 11 Attaching package: 'GGally'
7.2s 12 The following object is masked from 'package:dplyr':
nasa
7.2s 13 Loading required package: lattice
10.3s 14 Attaching package: 'caret'
10.4s 16 The following object is masked from 'package:purrr':
lift
10.4s 17 label: load data
10.6s 18 Parsed with column specification:
cols(
  .default = col_double(),
  term = col_character(),
  grade = col_character(),
  sub_grade = col_character(),
  emp_title = col_character(),
  emp_length = col_character(),
  home_ownership = col_character(),
  verification_status = col_character(),
  issue_d = col_character(),
  loan_status = col_character(),
  pymnt_plan = col_character(),
  url = col_character(),
  desc = col_character(),
  purpose = col_character(),
  title = col_character(),
  zip_code = col_character(),
  addr_state = col_character(),
  earliest_cr_line = col_character(),
  initial_list_status = col_character(),
  last_pymnt_d = col_character(),
  next_pymnt_d = col_character()
  # ... with 23 more columns
)
See spec(...) for full column specifications.
39.7s 19 Warning: 2970434 parsing failures.
  row              col           expected   actual                file
42536     tot_coll_amt 1/0/T/F/TRUE/FALSE      0.0 '../input/loan.csv'
42536      tot_cur_bal 1/0/T/F/TRUE/FALSE 114834.0 '../input/loan.csv'
42536 total_rev_hi_lim 1/0/T/F/TRUE/FALSE  59900.0 '../input/loan.csv'
42537     tot_coll_amt 1/0/T/F/TRUE/FALSE      0.0 '../input/loan.csv'
42537      tot_cur_bal 1/0/T/F/TRUE/FALSE  14123.0 '../input/loan.csv'
..... ................ .................. ........ ...................
See problems(...) for more details.
39.8s 21 label: columns names
label: select columns
40.4s 23 label: NAs
40.5s 25 label: loan_status
42.4s 27 label: loan_outcome
44.0s 29 label: loan2
44.2s 30 inline R code fragments
44.2s 31 label: grade_boxplot
47.0s 33 label: grade_barplot
48.6s 35 label: ann_inc vs loan_amnt
51.2s 37 label: log_regr
54.9s 39 label: pred
55.7s 41 label: acc
60.5s 43 label: remove (with options)
List of 1
$ echo: logi FALSE
label: threshold
61.3s 45 label: unnamed-chunk-1 (with options)
List of 1
$ echo: logi FALSE
label: cutoff.30%
61.4s 47 label: acc2 (with options)
List of 1
$ echo: logi FALSE
label: unnamed-chunk-2 (with options)
List of 1
$ echo: logi FALSE
label: roc
61.5s 48 Type 'citation("pROC")' for a citation.
61.5s 49 Attaching package: 'pROC'
The following objects are masked from 'package:stats':
cov, smooth, var
70.1s 50 inline R code fragments
70.8s 52 output file: /kaggle/working/script.knit.md
71.0s 53 /usr/local/bin/pandoc +RTS -K512m -RTS /kaggle/working/script.utf8.md --to html4 --from markdown+autolink_bare_uris+ascii_identifiers+tex_math_single_backslash --output /kaggle/working/__results__.html --smart --email-obfuscation none --standalone --section-divs --table-of-contents --toc-depth 3 --template /usr/local/lib/R/site-library/rmarkdown/rmd/h/default.html --highlight-style tango --variable 'theme:bootstrap' --include-in-header /tmp/RtmpJ6nuxr/rmarkdown-str169ed6e46.html --mathjax --variable 'mathjax-url:https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'
71.3s 54 Output created: __results__.html
71.3s 55 There were 41 warnings (use warnings() to see them)
71.3s 58 Complete. Exited with code 0.
Comments (15)
Guilherme Arauj… • Posted on Latest Version • 10 months ago  1
Great workflow, IonasKel.
Ionas Kelepo… Kernel Author • Posted on Latest Version • 10 months ago  0
I appreciate it very much Guilherme.
snuow • Posted on Version 20 • 10 months ago  1
nice work!
Mitchell O'Brien • Posted on Version 20 • 10 months ago  1
Fantastic kernel!
Ionas Kelepo… Kernel Author • Posted on Version 20 • 10 months ago  0
Thank you Mitchell!
Xavier • Posted on Version 20 • 10 months ago  1
Great analysis, thanks for sharing!
Thank you Xavier!
sailaja • Posted on Version 20 • 10 months ago  1
hi, this is sailaja. i understood about what u explained but I have a doubt Whether is this same for risk
prediction for mortgage loans. can u pls reply
I believe the same features as predictors, especially annual income, loan amount and grade,
would work well on predicting mortgage loans too. Although, you 'll need to construct a
new model to get the right estimates of the coefficients.
Pavlo Fesenko • Posted on Latest Version • 2 days ago  0
Hi Ionas @ionaskel ! Upvoted your kernel. I was just wondering why you have selected those particular
features for the modelling?
Check out also my kernel where I try to predict good loans among high risk / high gain loans. If you like it, I
would greatly appreciate your upvote or comments/remarks. 😊
And thank you for upvoting this comment! Trying to become a Discussion Master! ✌
Rodrigo Landab… • Posted on Latest Version • 19 days ago  0
Thank you very much.
Terry Lv • Posted on Latest Version • 8 months ago  0
Nice work, thanks for sharing!

I have a question that logistic regression require all variables to be numeric, I don't seem to see other
variables in your code to be numeric. can u pls reply
Nbagne • Posted on Latest Version • 9 months ago  0
great work!
Sairam V • Posted on Latest Version • 10 months ago  0
Nice work
Ionas Kelepo… Kernel Author • Posted on Latest Version • 10 months ago  0
Thank you Sairam!
