https://www.kaggle.com/ionaskel/credit-risk-modelling-eda-classification/report 1/21
7/28/2019 Credit Risk Modelling [EDA & Classification] | Kaggle
IonasKel
Introduction (https://www.kaggle.com/ionaskel/credit-risk-modelling-eda-classification#introduction)
Load data and Libraries (https://www.kaggle.com/ionaskel/credit-risk-modelling-eda-classification#load-data-and-libraries)
Feature Selection & Engineering (https://www.kaggle.com/ionaskel/credit-risk-modelling-eda-classification#feature-selection-engineering)
Exploratory Data Analysis (https://www.kaggle.com/ionaskel/credit-risk-modelling-eda-classification#exploratory-data-analysis)
Data modelling (https://www.kaggle.com/ionaskel/credit-risk-modelling-eda-
Introduction
The analysis of credit risk and the decision-making for granting loans is one of the most important operations for financial institutions. By taking into account past results, we need to train a model to accurately predict future outcomes.
Load data and Libraries
Load the data available for analysis. The dataset is taken from a bank's records about the status of loan defaults and the profile of customers.
# Set the blank spaces to NA's
loan = read_csv("../input/loan.csv" , na = "")
colnames(loan)
##  [1] "id"                          "member_id"
##  [3] "loan_amnt"                   "funded_amnt"
##  [5] "funded_amnt_inv"             "term"
##  [7] "int_rate"                    "installment"
##  [9] "grade"                       "sub_grade"
## [11] "emp_title"                   "emp_length"
## [13] "home_ownership"              "annual_inc"
## [15] "verification_status"         "issue_d"
## [17] "loan_status"                 "pymnt_plan"
## [19] "url"                         "desc"
## [21] "purpose"                     "title"
## [23] "zip_code"                    "addr_state"
## [25] "dti"                         "delinq_2yrs"
## [27] "earliest_cr_line"            "inq_last_6mths"
## [29] "mths_since_last_delinq"      "mths_since_last_record"
## [31] "open_acc"                    "pub_rec"
## [33] "revol_bal"                   "revol_util"
## [35] "total_acc"                   "initial_list_status"
## [37] "out_prncp"                   "out_prncp_inv"
## [39] "total_pymnt"                 "total_pymnt_inv"
## [41] "total_rec_prncp"             "total_rec_int"
## [43] "total_rec_late_fee"          "recoveries"
## [45] "collection_recovery_fee"     "last_pymnt_d"
## [47] "last_pymnt_amnt"             "next_pymnt_d"
## [49] "last_credit_pull_d"          "collections_12_mths_ex_med"
## [51] "mths_since_last_major_derog" "policy_code"
## [53] "application_type"            "annual_inc_joint"
## [55] "dti_joint"                   "verification_status_joint"
## [57] "acc_now_delinq"              "tot_coll_amt"
## [59] "tot_cur_bal"                 "open_acc_6m"
## [61] "open_il_6m"                  "open_il_12m"
## [63] "open_il_24m"                 "mths_since_rcnt_il"
## [65] "total_bal_il"                "il_util"
## [67] "open_rv_12m"                 "open_rv_24m"
## [69] "max_bal_bc"                  "all_util"
## [71] "total_rev_hi_lim"            "inq_fi"
## [73] "total_cu_tl"                 "inq_last_12m"
## # A tibble: 887,379 x 8
##   loan_status loan_amnt int_rate grade emp_length home_ownership
##   <chr>           <dbl>    <dbl> <chr> <chr>      <chr>
## 1 Fully Paid       5000     10.6 B     10+ years  RENT
## 2 Charged Off      2500     15.3 C     < 1 year   RENT
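Missing Values:
sapply(loan , function(x) sum(is.na(x)))
# Remove rows with missing annual income, rows where home ownership is 'NONE' or 'ANY',
# and rows where employment length is 'n/a'
loan = loan %>%
  filter(!is.na(annual_inc) ,
         !(home_ownership %in% c('NONE' , 'ANY')) ,
         emp_length != 'n/a')
Exploratory Data Analysis
loan_status:
loan %>%
  count(loan_status) %>%
  ggplot(aes(x = reorder(loan_status , desc(n)) , y = n , fill = n)) +
  geom_col() +
  coord_flip() +
  labs(x = 'Loan Status' , y = 'Count')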
We want to convert this variable to binary (1 for default and 0 for non-default), but it has 10 different levels. Loans with status Current, Late payments, or In grace period need to be removed. Therefore, we create a new variable called loan_outcome where
loan_outcome -> 1 if loan_status = 'Charged Off' or 'Default'
loan_outcome -> 0 if loan_status = 'Fully Paid'
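The recoding into a binary outcome can be sketched on a toy vector of statuses. The toy values and the helper name `to_outcome` are illustrative assumptions, not part of the original kernel; unlike the kernel, which marks the remaining statuses as 'No info', this sketch uses NA so that the result stays numeric.

```r
# Toy loan statuses (illustrative values only)
status <- c("Fully Paid", "Charged Off", "Current", "Default", "In Grace Period")

# Hypothetical helper: 1 = default, 0 = fully paid, NA = outcome not yet known
to_outcome <- function(loan_status) {
  ifelse(loan_status %in% c("Charged Off", "Default"), 1,
         ifelse(loan_status == "Fully Paid", 0, NA))
}

outcome <- to_outcome(status)

# Keep only the loans whose outcome is known
known <- !is.na(outcome)
data.frame(status = status[known], loan_outcome = outcome[known])
```

Dropping the NA rows plays the same role as filtering for 0 and 1 in the kernel: loans that are still running carry no information about the final outcome.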
Let's observe how useful these variables would be for credit risk modelling. It is known that the better the grade, the lower the interest rate. We can nicely visualise this with boxplots.
ggplot(loan2 , aes(x = grade , y = int_rate , fill = grade)) +
  geom_boxplot() +
  theme_igray() +
  labs(y = 'Interest Rate' , x = 'Grade')
##
## Fully Paid Default
## A 38268 2472
## B 64185 9095
## C 50823 12252
## D 28874 10202
## E 12473 6162
## F 4581 2890
## G 1110 792
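The counts in the table can be turned into a default rate per grade. This is a small sketch that simply retypes the counts shown above; the variable names are illustrative.

```r
# Counts copied from the grade-by-outcome table above
grade      <- c("A", "B", "C", "D", "E", "F", "G")
fully_paid <- c(38268, 64185, 50823, 28874, 12473, 4581, 1110)
default    <- c(2472, 9095, 12252, 10202, 6162, 2890, 792)

# Share of loans in each grade that ended in default
default_rate <- default / (fully_paid + default)
names(default_rate) <- grade
round(default_rate, 3)
```

The rate rises steadily from about 6% for grade A to about 42% for grade G, which supports using grade as a predictor.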
Now let’s try to find out what impact the annual income of the borrower has on
the other variables.
# Plot a random sample of 10,000 loans to keep the scatter plot readable
ggplot(loan2[sample(nrow(loan2) , 10000) , ] , aes(x = annual_inc , y = loan_amnt , color = int_rate)) +
  geom_point(alpha = 0.5 , size = 1.5) +
  geom_smooth(se = F , color = 'darkred' , method = 'loess') +
  xlim(c(0 , 300000)) +
  labs(x = 'Annual Income' , y = 'Loan Amount' , color = 'Interest Rate')
As expected, the larger the annual income, the larger the amount requested by the borrower.
Data modelling
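Modelling Process:
* We created the binary loan_outcome, which will be our response variable.
* We exclude some independent variables in order to make the model simpler.
* We split the dataset into a training set (75%) and a testing set (25%) for validation.
* We train a model to predict the probability of default.
Because the response variable is binary, we can use logistic regression. Rather than modelling the response Y directly, logistic regression models the probability that Y belongs to a particular category, in our case the probability of a non-performing loan. This probability can be computed by the logistic function,
P = exp(b0 + b1X1 + ... + bNXN) / [ 1 + exp(b0 + b1X1 + ... + bNXN) ]
where
* P is the probability of default
* b0 , b1 , ... , bN are the coefficient estimates
* N is the number of independent variables
* X1 , ... , XN are the independent variables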
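# Split dataset
loan2$loan_outcome = as.numeric(loan2$loan_outcome)
idx = sample(dim(loan2)[1] , 0.75*dim(loan2)[1] , replace = F)
trainset = loan2[idx , ]
testset = loan2[-idx , ]
# Fit logistic regression
glm.model = glm(loan_outcome ~ . , trainset , family = binomial(link = 'logit'))
summary(glm.model)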
##
## Call:
## glm(formula = loan_outcome ~ ., family = binomial(link = "logit"),
##     data = trainset)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.4008  -0.6693  -0.5173  -0.3400   8.4904
##
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)         -3.478e+00  4.987e-02 -69.752  < 2e-16 ***
## loan_amnt            1.413e-05  9.694e-07  14.576  < 2e-16 ***
## int_rate             1.362e-01  4.693e-03  29.017  < 2e-16 ***
## gradeB               1.402e-01  3.327e-02   4.213 2.52e-05 ***
## gradeC               1.611e-01  4.282e-02   3.762 0.000169 ***
## gradeD               1.178e-01  5.466e-02   2.155 0.031159 *
## gradeE              -2.819e-02  6.770e-02  -0.416 0.677082
## gradeF              -2.548e-01  8.360e-02  -3.048 0.002303 **
## gradeG              -3.342e-01  1.020e-01  -3.276 0.001053 **
## emp_length1 year    -7.567e-02  3.199e-02  -2.366 0.018000 *
## emp_length10+ years -8.850e-02  2.439e-02  -3.629 0.000284 ***
## emp_length2 years   -9.606e-02  2.942e-02  -3.265 0.001095 **
## emp_length3 years   -6.216e-02  3.043e-02  -2.043 0.041061 *
## emp_length4 years   -7.339e-02  3.249e-02  -2.259 0.023888 *
## emp_length5 years   -3.031e-02  3.124e-02  -0.970 0.331941
## emp_length6 years   -1.353e-02  3.294e-02  -0.411 0.681278
## emp_length7 years   -6.067e-02  3.366e-02  -1.803 0.071440 .
## emp_length8 years   -2.980e-02  3.539e-02  -0.842 0.399682
## emp_length9 years   -1.592e-03  3.785e-02  -0.042 0.966458
## home_ownershipOTHER  4.291e-01  2.478e-01   1.732 0.083343 .
## home_ownershipOWN    6.730e-02  2.391e-02   2.815 0.004881 **
## home_ownershipRENT   2.059e-01  1.404e-02  14.668  < 2e-16 ***
## annual_inc          -6.441e-06  2.122e-07 -30.351  < 2e-16 ***
## term60 months        3.242e-01  1.681e-02  19.284  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 172580  on 183133  degrees of freedom
## Residual deviance: 159710  on 183110  degrees of freedom
## AIC: 159758
##
## Number of Fisher Scoring iterations: 6
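To see how the logistic function turns these estimates into a probability, here is a minimal sketch that plugs a few coefficients from the summary above into the formula. The borrower (a loan of 10,000 at a 10% rate with an annual income of 60,000, in the baseline categories of every categorical variable) is a made-up example, and the names `b` and `logistic` are illustrative.

```r
# Coefficient estimates copied from the model summary above
b <- c(intercept = -3.478, loan_amnt = 1.413e-05, int_rate = 1.362e-01,
       annual_inc = -6.441e-06, term60 = 3.242e-01)

# Logistic function: maps a linear predictor to a probability in (0, 1).
# 1 / (1 + exp(-eta)) is algebraically the same as exp(eta) / (1 + exp(eta)).
logistic <- function(eta) 1 / (1 + exp(-eta))

# Hypothetical borrower in the baseline categories, so no dummy coefficient applies
eta_36 <- b["intercept"] + b["loan_amnt"] * 10000 + b["int_rate"] * 10 +
  b["annual_inc"] * 60000

# Same borrower with a 60-month term instead of 36 months
eta_60 <- eta_36 + b["term60"]

p_36 <- unname(logistic(eta_36))
p_60 <- unname(logistic(eta_60))
round(c(p_36 = p_36, p_60 = p_60), 3)
```

Under these assumptions the predicted probability of default is roughly 9% for the 36-month loan and roughly 12% for the 60-month loan, which illustrates the positive term coefficient.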
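The coefficients of the following features are positive:
1. Loan Amount
2. Interest Rate
3. Home Ownership - Other, Own and Rent (relative to the baseline level, Mortgage; the Other coefficient is only significant at the 10% level)
4. Term (60 months relative to 36 months)
5. Grades B, C and D (relative to grade A)
This means the probability of defaulting on the given credit varies directly with these factors. For example, the larger the amount of the loan, the higher the risk of default.
The coefficients of the following features are negative:
1. Annual Income
2. Employment length: borrowers with 10+ years of experience are more likely to pay their debt than those with less than 1 year
3. Grades F and G (relative to grade A, with the interest rate also in the model)
4. The coefficients for 5, 6, 8 and 9 years of employment are not significantly different from the baseline (less than 1 year)
This means that the probability of defaulting varies inversely with the factors mentioned above.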
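The sign of a coefficient is easier to read as an odds ratio. The following sketch exponentiates a few estimates copied from the model summary earlier in this report; the vector names are illustrative.

```r
# Selected estimates copied from the model summary earlier in this report
coefs <- c(int_rate = 1.362e-01, home_ownershipRENT = 2.059e-01,
           term60 = 3.242e-01, emp_length10plus = -8.850e-02)

# exp(coefficient) is the multiplicative change in the odds of default
# for a one-unit increase (or for a category versus its baseline)
odds_ratio <- exp(coefs)
round(odds_ratio, 3)

# For a small per-unit coefficient, scale the change first:
# effect of +10,000 in annual income (coefficient -6.441e-06 per unit)
income_effect <- exp(-6.441e-06 * 10000)
round(income_effect, 3)
```

Values above 1 raise the odds of default and values below 1 lower them; for example, a 60-month term multiplies the odds by about 1.38 compared with a 36-month term, holding the other variables fixed.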
# Prediction on test set
preds = predict(glm.model , testset , type = 'response')
# Density of probabilities
ggplot(data.frame(preds) , aes(preds)) +
  geom_density(fill = 'lightblue' , alpha = 0.4) +
  labs(x = 'Predicted Probabilities on test set')
But now let's see how the accuracy, sensitivity and specificity change with the threshold. We can use a threshold of 50% for the posterior probability of default in order to assign an observation to the default class. However, if we are concerned about incorrectly predicting the default status for individuals who default, then we can consider lowering this threshold. So we will consider these three metrics for threshold levels from 1% up to 50%.
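k = 0
accuracy = c()
sensitivity = c()
specificity = c()
for(i in seq(from = 0.01 , to = 0.5 , by = 0.01)){
  k = k + 1
  preds_binomial = ifelse(preds > i , 1 , 0)
  # Fix both factor levels so the table is always 2 x 2 (rows = actual, columns = predicted)
  confmat = table(factor(testset$loan_outcome , levels = c(0 , 1)) ,
                  factor(preds_binomial , levels = c(0 , 1)))
  accuracy[k] = sum(diag(confmat)) / sum(confmat)
  # Sensitivity: share of actual defaults (row 2) that are predicted as default
  sensitivity[k] = confmat[2 , 2] / sum(confmat[2 , ])
  # Specificity: share of actual non-defaults (row 1) that are predicted as non-default
  specificity[k] = confmat[1 , 1] / sum(confmat[1 , ])
}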
A threshold of 25% - 30% seems ideal because a further increase of the cut-off percentage does not have a significant impact on the accuracy of the model. The confusion matrix for a cut-off point of 30% is:
## Actual
## Predicted 0 1
## 0 44853 7834
## 1 5266 3092
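The three metrics can be recomputed directly from this confusion matrix. The sketch below retypes the four counts shown above (rows are predicted classes, columns are actual classes); the object name `cm` is illustrative.

```r
# Confusion matrix at the 30% cut-off, copied from the output above.
# matrix() fills column by column: first the Actual = 0 column, then Actual = 1.
cm <- matrix(c(44853, 5266, 7834, 3092), nrow = 2,
             dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))

# Accuracy: share of all test loans classified correctly
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: share of actual defaults that are predicted as default
sensitivity <- cm["1", "1"] / sum(cm[, "1"])

# Specificity: share of actual non-defaults that are predicted as non-default
specificity <- cm["0", "0"] / sum(cm[, "0"])

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 4)
```

At this cut-off the accuracy is about 0.785, but only about 28% of the actual defaults are caught, while about 89% of the non-defaults are classified correctly. Accuracy alone therefore hides how many defaults the model misses.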
library(pROC)
# Area Under Curve
auc(roc(testset$loan_outcome , preds))
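The area under the ROC curve has a simple interpretation: it is the probability that a randomly chosen default receives a higher predicted probability than a randomly chosen non-default. The sketch below computes it from ranks in base R on toy data; the function name `auc_rank` and the toy values are illustrative assumptions and are not how pROC is implemented.

```r
# Rank-based AUC: probability that a randomly chosen default (label 1)
# receives a higher score than a randomly chosen non-default (label 0)
auc_rank <- function(labels, scores) {
  r  <- rank(scores)        # average ranks, so tied scores count as half
  n1 <- sum(labels == 1)    # number of defaults
  n0 <- sum(labels == 0)    # number of non-defaults
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy data (illustrative values only)
labels <- c(0, 0, 1, 1)
scores <- c(0.10, 0.40, 0.35, 0.80)
auc_rank(labels, scores)
```

In the toy data three of the four default/non-default pairs are ordered correctly, so the result is 0.75. A value of 0.5 corresponds to random ordering and 1 to perfect separation.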
This kernel has been released under the Apache 2.0 open source license.
Code
---
title: "Credit Risk Modelling [EDA & Classification]"
author: "IonasKel"
date: "September 30, 2018"
output:
  html_document:
    fig_height: 8
    fig_width: 12
    highlight: tango
    toc: yes
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE , warning = FALSE , message = FALSE)
```

# Introduction

The analysis of credit risk and the decision-making for granting loans is one of the most important operations for financial institutions. By taking into account past results, we need to train a model to accurately predict future outcomes.

# Load data and Libraries

Load the libraries we are going to use.

```{r libraries}
library(tidyverse)
library(ggthemes)
library(corrplot)
library(GGally)
library(DT)
library(caret)
```

Load the data available for analysis. The dataset is taken from a bank's records about the status of loan defaults and the profile of customers.

```{r load data}
# Set the blank spaces to NA's
loan = read_csv("../input/loan.csv" , na = "")

```
```{r columns names}
colnames(loan)

```



# Feature Selection & Engineering

The dataset contains information on age, annual income, grade of employee, home ownership that a

* **loan_status** : Variable with multiple levels (e.g. Charged off, Current, Default, Fully Pai
* **loan_amnt** : Total amount of loan taken
* **int_rate** : Loan interest rate
* **grade** : Loan grade (A to G)
* **emp_length** : Duration of employment
* **home_ownership** : Type of ownership of house
* **annual_inc** : Total annual income
* **term** : 36-month or 60-month period


```{r select columns}
# Select only the columns mentioned above.
loan = loan %>%
  select(loan_status , loan_amnt , int_rate , grade , emp_length , home_ownership ,
         annual_inc , term)
loan

```


Missing Values:
```{r NAs}
sapply(loan , function(x) sum(is.na(x)))

# Remove the 4 rows with missing annual income, 49 rows where home ownership is 'NONE' or 'ANY' and

loan = loan %>%
  filter(!is.na(annual_inc) ,
         !(home_ownership %in% c('NONE' , 'ANY')) ,
         emp_length != 'n/a')

```


# Exploratory Data Analysis

* **loan_status** :

```{r loan_status}
loan %>%
  count(loan_status) %>%
  ggplot(aes(x = reorder(loan_status , desc(n)) , y = n , fill = n)) +
  geom_col() +
  coord_flip() +
  labs(x = 'Loan Status' , y = 'Count')

```

We want to convert this variable to binary (1 for default and 0 for non-default), but it has 10 different levels. Loans with status Current, Late payments, or In grace period need to be removed. Therefore, we create a new variable called loan_outcome where

loan_outcome -> 1 if loan_status = 'Charged Off' or 'Default'
loan_outcome -> 0 if loan_status = 'Fully Paid'

```{r loan_outcome}
loan = loan %>%
  mutate(loan_outcome = ifelse(loan_status %in% c('Charged Off' , 'Default') ,
                               1,
                               ifelse(loan_status == 'Fully Paid' , 0 , 'No info')
                               ))

barplot(table(loan$loan_outcome) , col = 'lightblue')

```


We will create a new dataset which contains only rows with 0 or 1 in loan_outcome feature for bette

```{r loan2}
# Create the new dataset by filtering 0's and 1's in the loan_outcome column and remove loan_status
loan2 = loan %>%
  select(-loan_status) %>%
  filter(loan_outcome %in% c(0 , 1))

```


Our new dataset contains **`r nrow(loan2)` rows**.

Let's observe how useful these variables would be for credit risk modelling. It is known that the better the grade, the lower the interest rate. We can nicely visualise this with boxplots.

```{r grade_boxplot}
ggplot(loan2 , aes(x = grade , y = int_rate , fill = grade)) +
  geom_boxplot() +
  theme_igray() +
  labs(y = 'Interest Rate' , x = 'Grade')

```

We assume that grade is a great predictor for the volume of non-performing loans. But how many of t

```{r grade_barplot}
table(loan2$grade , factor(loan2$loan_outcome , c(0 , 1) , c('Fully Paid' , 'Default')))

ggplot(loan2 , aes(x = grade , y = ..count.. , fill = factor(loan_outcome , c(1 , 0) , c('Default'
  geom_bar() +
  theme(legend.title = element_blank())

```


Now let's try to find out what impact the annual income of the borrower has on the other variables.

```{r ann_inc vs loan_amnt}
# Plot a random sample of 10,000 loans to keep the scatter plot readable
ggplot(loan2[sample(nrow(loan2) , 10000) , ] , aes(x = annual_inc , y = loan_amnt , color = int_rate)) +
  geom_point(alpha = 0.5 , size = 1.5) +
  geom_smooth(se = F , color = 'darkred' , method = 'loess') +
  xlim(c(0 , 300000)) +
  labs(x = 'Annual Income' , y = 'Loan Amount' , color = 'Interest Rate')

```

As expected, the larger the annual income, the larger the amount requested by the borrower.

# Data modelling

Modelling Process:

* We created the binary loan_outcome, which will be our response variable.
* We exclude some independent variables in order to make the model simpler.
* We split the dataset into a training set (75%) and a testing set (25%) for validation.
* We train a model to predict the probability of default.

Because the response variable is binary, we can use logistic regression. Rather than modelling the response Y directly, logistic regression models the probability that Y belongs to a particular category, in our case the probability of a non-performing loan. This probability can be computed by the logistic function,

P = exp(b0 + b1X1 + ... + bNXN) / [ 1 + exp(b0 + b1X1 + ... + bNXN) ]

where

* P is the probability of default
* b0 , b1 , ... , bN are the coefficient estimates
* N is the number of independent variables
* X1 , ... , XN are the independent variables


```{r log_regr}
# Split dataset
loan2$loan_outcome = as.numeric(loan2$loan_outcome)
idx = sample(dim(loan2)[1] , 0.75*dim(loan2)[1] , replace = F)
trainset = loan2[idx , ]
testset = loan2[-idx , ]

# Fit logistic regression
glm.model = glm(loan_outcome ~ . , trainset , family = binomial(link = 'logit'))
summary(glm.model)

```

The coefficients of the following features are **positive**:

1) Loan Amount
2) Interest Rate
3) Home Ownership - Other, Own and Rent (relative to the baseline level, Mortgage; the Other coefficient is only significant at the 10% level)
4) Term (60 months relative to 36 months)
5) Grades B, C and D (relative to grade A)

This means the probability of defaulting on the given credit varies directly with these factors. For example, the larger the amount of the loan, the higher the risk of default.


The coefficients of the following features are **negative**:

1) Annual Income
2) Employment length: borrowers with 10+ years of experience are more likely to pay their debt than those with less than 1 year
3) Grades F and G (relative to grade A, with the interest rate also in the model)
4) The coefficients for 5, 6, 8 and 9 years of employment are not significantly different from the baseline (less than 1 year)

This means that the probability of defaulting varies inversely with the factors mentioned above.


```{r pred}
# Prediction on test set
preds = predict(glm.model , testset , type = 'response')

# Density of probabilities
ggplot(data.frame(preds) , aes(preds)) +
  geom_density(fill = 'lightblue' , alpha = 0.4) +
  labs(x = 'Predicted Probabilities on test set')


```

But now let's see how the accuracy, sensitivity and specificity change with the threshold. We can use a threshold of 50% for the posterior probability of default in order to assign an observation to the default class. However, if we are concerned about incorrectly predicting the default status for individuals who default, then we can consider lowering this threshold. So we will consider these three metrics for threshold levels from 1% up to 50%.

```{r acc}
k = 0
accuracy = c()
sensitivity = c()
specificity = c()
for(i in seq(from = 0.01 , to = 0.5 , by = 0.01)){
  k = k + 1
  preds_binomial = ifelse(preds > i , 1 , 0)
  # Fix both factor levels so the table is always 2 x 2 (rows = actual, columns = predicted)
  confmat = table(factor(testset$loan_outcome , levels = c(0 , 1)) ,
                  factor(preds_binomial , levels = c(0 , 1)))
  accuracy[k] = sum(diag(confmat)) / sum(confmat)
  # Sensitivity: share of actual defaults (row 2) that are predicted as default
  sensitivity[k] = confmat[2 , 2] / sum(confmat[2 , ])
  # Specificity: share of actual non-defaults (row 1) that are predicted as non-default
  specificity[k] = confmat[1 , 1] / sum(confmat[1 , ])
}
```

```{r remove , echo = FALSE}
rm(confmat , k , i , preds_binomial)

```


If we plot our results we get this visualization.

```{r threshold}
threshold = seq(from = 0.01 , to = 0.5 , by = 0.01)

data = data.frame(threshold , accuracy , sensitivity , specificity)
head(data)

# Gather accuracy , sensitivity and specificity in one column
ggplot(gather(data , key = 'Metric' , value = 'Value' , 2:4) ,
       aes(x = threshold , y = Value , color = Metric)) +
  geom_line(size = 1.5)
```
```{r , echo = FALSE}
rm(data)

```

A threshold of 25% - 30% seems ideal because a further increase of the cut-off percentage does not have a significant impact on the accuracy of the model. The confusion matrix for a cut-off point of 30% is:

```{r cutoff.30%}
preds.for.30 = ifelse(preds > 0.3 , 1 , 0)
confusion_matrix_30 = table(Predicted = preds.for.30 , Actual = testset$loan_outcome)
confusion_matrix_30

```
```{r acc2 , echo = FALSE}
paste('Accuracy :' , round(sum(diag(confusion_matrix_30)) / sum(confusion_matrix_30) , 4))

```

```{r , echo = FALSE}
rm(preds.for.30)

```

The *ROC (Receiver Operating Characteristic) curve* is a popular graphic for simultaneously displa

```{r roc}
library(pROC)

# Area Under Curve
auc(roc(testset$loan_outcome , preds))

# Plot ROC curve
plot.roc(testset$loan_outcome , preds , main = "Confidence interval of a threshold" , percent = TRU
         ci = TRUE , of = "thresholds" , thresholds = "best" , print.thres = "best" , col = 'blue')

```


# Conclusion

A logistic regression model was used to predict the loan status. Different cut off's were used to d

Data
Files: loan (887k x 75), LCDataDictionary.xlsx
These files contain complete loan data for all loans issued through 2007-2015, including the current loan status (Current, Late, Fully Paid, etc.) and latest payment information. The file containing loan data through the "present" contains complete loan data for all loans issued through the previous completed calendar quarter. Additional features include credit scores, number of finance inquiries, address including zip codes and state, and collections, among others. The file is a matrix of about 890 thousand observations and 75 variables. A data dictionary is provided in a separate file.
Run Info
4.5s   label: setup (with options: include = FALSE)
4.7s   label: libraries
6.3s   ── Attaching packages ── tidyverse 1.2.1 ──
       ✔ ggplot2 3.0.0.9000   ✔ purrr   0.2.5
       ✔ tibble  1.4.2        ✔ dplyr   0.7.6
       ✔ tidyr   0.8.1        ✔ stringr 1.3.1
       ✔ readr   1.2.0        ✔ forcats 0.3.0
6.7s   ── Conflicts ── tidyverse_conflicts() ──
       ✖ dplyr::filter() masks stats::filter()
       ✖ dplyr::lag()    masks stats::lag()
6.8s   corrplot 0.84 loaded
7.1s   Attaching package: 'GGally'
10.4s  label: load data
10.6s  Parsed with column specification:
       cols(
         .default = col_double(),
         term = col_character(),
         grade = col_character(),
         sub_grade = col_character(),
         emp_title = col_character(),
         emp_length = col_character(),
         home_ownership = col_character(),
         verification_status = col_character(),
         issue_d = col_character(),
         loan_status = col_character(),
         pymnt_plan = col_character(),
         url = col_character(),
         desc = col_character(),
         purpose = col_character(),
         title = col_character(),
         zip_code = col_character(),
         addr_state = col_character(),
         earliest_cr_line = col_character(),
         initial_list_status = col_character(),
         last_pymnt_d = col_character(),
         next_pymnt_d = col_character()
         # ... with 23 more columns
       )
       See spec(...) for full column specifications.
39.7s  Warning: 2970434 parsing failures.
         row col              expected           actual   file
       42536 tot_coll_amt     1/0/T/F/TRUE/FALSE 0.0      '../input/loan.csv'
       42536 tot_cur_bal      1/0/T/F/TRUE/FALSE 114834.0 '../input/loan.csv'
       42536 total_rev_hi_lim 1/0/T/F/TRUE/FALSE 59900.0  '../input/loan.csv'
       42537 tot_coll_amt     1/0/T/F/TRUE/FALSE 0.0      '../input/loan.csv'
       42537 tot_cur_bal      1/0/T/F/TRUE/FALSE 14123.0  '../input/loan.csv'
       ..... ................ .................. ........ ...................
       See problems(...) for more details.
39.8s  label: columns names
       label: select columns
40.4s  label: NAs
40.5s  label: loan_status
42.4s  label: loan_outcome
44.0s  label: loan2
44.2s  label: grade_boxplot
47.0s  label: grade_barplot
48.6s  label: ann_inc vs loan_amnt
51.2s  label: log_regr
54.9s  label: pred
55.7s  label: acc
60.5s  label: remove (with options: echo = FALSE)
       label: threshold
61.3s  label: unnamed-chunk-1 (with options: echo = FALSE)
       label: cutoff.30%
61.4s  label: acc2 (with options: echo = FALSE)
       label: unnamed-chunk-2 (with options: echo = FALSE)
       label: roc
61.5s  Type 'citation("pROC")' for a citation.
       Attaching package: 'pROC'
70.8s  output file: /kaggle/working/script.knit.md
Comments (15)
nice work!
Fantastic kernel!
hi, this is sailaja. i understood about what u explained but I have a doubt Whether is this same for risk
prediction for mortgage loans. can u pls reply
I believe the same features as predictors, especially annual income, loan amount and grade,
would work well on predicting mortgage loans too. Although, you 'll need to construct a
new model to get the right estimates of the coefficients.
Hi Ionas @ionaskel ! Upvoted your kernel. I was just wondering why you have selected those particular
features for the modelling?
Check out also my kernel where I try to predict good loans among high risk / high gain loans. If you like it, I
would greatly appreciate your upvote or comments/remarks. 😊
And thank you for upvoting this comment! Trying to become a Discussion Master! ✌
great work!
Nice work