08.09.2019
─
Janani Prakash
PGPBABI-Online
GreatLearning, Great Lakes Institute of Management
Project Objective
Modelling
Logistic Regression
Analysis
Source Code
1. Project Objective
The objective of the project is to build an India credit risk (default) model using the given
training dataset and validate it on the holdout dataset. The logistic regression framework is
used to develop the credit default model.
The dataset contains the following key variables: Net worth next year, Total assets, Net worth,
Total income, Total expenses, Profit after tax, PBDITA, PBT (Profit Before Tax), Cash profit,
PBDITA as % of total income, PBT as % of total income, Cash profit as % of total income,
PAT as % of net worth, Sales, Total capital, Reserves and funds, Borrowings, Current
liabilities & provisions, Capital employed, Net fixed assets, Investments, Net working capital,
Debt to equity ratio (times), Cash to current liabilities (times), Total liabilities.
In addition to the above variables, there are other financial parameters that describe the
financial strength of the organization, taking the total tally of variables to 51.
> dim(train)
[1] 3541 52
The training dataset contains 3541 observations and 52 variables.
> names(train)
[1] "Num" "Networth Next Year"
[3] "Total assets" "Net worth"
[5] "Total income" "Change in stock"
[7] "Total expenses" "Profit after tax"
[9] "PBDITA" "PBT"
[11] "Cash profit" "PBDITA as % of total income"
[13] "PBT as % of total income" "PAT as % of total income"
[15] "Cash profit as % of total income" "PAT as % of net worth"
[17] "Sales" "Income from financial services"
[19] "Other income" "Total capital"
[21] "Reserves and funds" "Deposits (accepted by commercial banks)"
[23] "Borrowings" "Current liabilities & provisions"
[25] "Deferred tax liability" "Shareholders funds"
[27] "Cumulative retained profits" "Capital employed"
[29] "TOL/TNW" "Total term liabilities / tangible net worth"
[31] "Contingent liabilities / Net worth (%)" "Contingent liabilities"
[33] "Net fixed assets" "Investments"
[35] "Current assets" "Net working capital"
[37] "Quick ratio (times)" "Current ratio (times)"
[39] "Debt to equity ratio (times)" "Cash to current liabilities (times)"
[41] "Cash to average cost of sales per day" "Creditors turnover"
[43] "Debtors turnover" "Finished goods turnover"
[45] "WIP turnover" "Raw material turnover"
[47] "Shares outstanding" "Equity face value"
[49] "EPS" "Adjusted EPS"
[51] "Total liabilities" "PE on BSE"
> dim(test)
[1] 715 52
> names(test)
[1] "Num" "Default - 1"
[3] "Total assets" "Net worth"
[5] "Total income" "Change in stock"
[7] "Total expenses" "Profit after tax"
[9] "PBDITA" "PBT"
[11] "Cash profit" "PBDITA as % of total income"
[13] "PBT as % of total income" "PAT as % of total income"
[15] "Cash profit as % of total income" "PAT as % of net worth"
[17] "Sales" "Income from financial services"
[19] "Other income" "Total capital"
[21] "Reserves and funds" "Deposits (accepted by commercial banks)"
[23] "Borrowings" "Current liabilities & provisions"
[25] "Deferred tax liability" "Shareholders funds"
[27] "Cumulative retained profits" "Capital employed"
[29] "TOL/TNW" "Total term liabilities / tangible net worth"
[31] "Contingent liabilities / Net worth (%)" "Contingent liabilities"
[33] "Net fixed assets" "Investments"
[35] "Current assets" "Net working capital"
[37] "Quick ratio (times)" "Current ratio (times)"
[39] "Debt to equity ratio (times)" "Cash to current liabilities (times)"
[41] "Cash to average cost of sales per day" "Creditors turnover"
[43] "Debtors turnover" "Finished goods turnover"
[45] "WIP turnover" "Raw material turnover"
[47] "Shares outstanding" "Equity face value"
[49] "EPS" "Adjusted EPS"
[51] "Total liabilities" "PE on BSE"
The training dataset does not have a default variable, so one is created from the
observations of the 'Networth Next Year' variable. Firms expected to have a
negative net worth next year are likely to default, so negative observations of
'Networth Next Year' are coded as '1' in the Default variable and positive
observations as '0'.
>newtrain$Default <- ifelse(newtrain$`Networth Next Year` < 0, 1, 0)
Companies with Total assets of 3 or less are removed from further analysis.
>newtrain <- newtrain[newtrain$`Total assets` > 3, ]
This plot shows that the training dataset has 6.8% missing observations and
1.9% missing columns.
>plot_intro(newtrain)
The variables of type character are converted to type numeric, and the missing
observations in each column are replaced with the median of that column for the
whole training dataset.
>for(i in 1:ncol(newtrain)){
newtrain[,i] <- as.numeric(newtrain[,i])
newtrain[is.na(newtrain[,i]), i] <- median(newtrain[,i], na.rm = TRUE)
}
The training dataset has missing observations as well as missing columns. The
missing observations are replaced with the median of their column, and the
missing columns are removed from the dataset.
>newtrain <- newtrain[,-22]
On running the plot again, it shows that the training dataset does not have any
missing observations or columns.
>plot_intro(newtrain)
Similarly, the testing dataset also has variables of type character.
> newtest<-as.data.frame(newtest)
> for(i in 1:length(newtest)){
+ print(paste(colnames(newtest[i]),class(newtest[,i])))}
[1] "Num numeric"
[1] "Default - 1 numeric"
[1] "Total assets numeric"
[1] "Net worth numeric"
[1] "Total income numeric"
[1] "Change in stock numeric"
[1] "Total expenses numeric"
[1] "Profit after tax numeric"
[1] "PBDITA numeric"
[1] "PBT numeric"
[1] "Cash profit numeric"
[1] "PBDITA as % of total income numeric"
[1] "PBT as % of total income numeric"
[1] "PAT as % of total income numeric"
[1] "Cash profit as % of total income numeric"
[1] "PAT as % of net worth numeric"
[1] "Sales numeric"
[1] "Income from financial services numeric"
[1] "Other income numeric"
[1] "Total capital numeric"
[1] "Reserves and funds numeric"
[1] "Deposits (accepted by commercial banks) logical"
[1] "Borrowings numeric"
[1] "Current liabilities & provisions numeric"
[1] "Deferred tax liability numeric"
[1] "Shareholders funds numeric"
[1] "Cumulative retained profits numeric"
[1] "Capital employed numeric"
[1] "TOL/TNW numeric"
[1] "Total term liabilities / tangible net worth numeric"
[1] "Contingent liabilities / Net worth (%) numeric"
[1] "Contingent liabilities numeric"
[1] "Net fixed assets numeric"
[1] "Investments numeric"
[1] "Current assets numeric"
[1] "Net working capital numeric"
[1] "Quick ratio (times) numeric"
[1] "Current ratio (times) numeric"
[1] "Debt to equity ratio (times) numeric"
[1] "Cash to current liabilities (times) numeric"
[1] "Cash to average cost of sales per day numeric"
[1] "Creditors turnover character"
[1] "Debtors turnover character"
[1] "Finished goods turnover character"
[1] "WIP turnover character"
[1] "Raw material turnover character"
[1] "Shares outstanding character"
[1] "Equity face value character"
[1] "EPS numeric"
[1] "Adjusted EPS numeric"
[1] "Total liabilities numeric"
[1] "PE on BSE character"
This plot shows that the testing dataset has 7% missing observations and 1.9%
missing columns.
> plot_intro(newtest)
In the following code, the variables of type character are converted to type
numeric, and the missing observations in each column are replaced with the
median of that column.
> for(i in 1:ncol(newtest)){
+ newtest[,i] <- as.numeric(newtest[,i])
+ newtest[is.na(newtest[,i]), i] <- median(newtest[,i], na.rm = TRUE)
+}
The missing columns are removed from the dataset.
>newtest <- newtest[,-22]
This plot shows that the dataset does not contain any missing observations or
missing columns.
>plot_intro(newtest)
3.3. Outlier treatment
The outliers in the dataset are treated by replacing observations below the 1st
percentile with the value of the 1st percentile, and observations above the 99th
percentile with the value of the 99th percentile. This outlier treatment is
applied to every column in the dataset.
The quantile function identifies the 1st- and 99th-percentile cut-offs, and the
squish function caps the identified outliers at those values.
>for(i in 2:ncol(newtrain)){
q <- quantile(newtrain[,i], c(0.01, 0.99))
newtrain[,i] <- squish(newtrain[,i], q)
}
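As a sanity check of the capping step, `squish(x, range)` from the scales package simply forces values below the lower bound up to it and values above the upper bound down to it. A base-R sketch of the same behaviour, on synthetic values rather than the project data:

```r
# Base-R equivalent of scales::squish on finite values:
# cap everything outside [r[1], r[2]] to the nearest bound.
cap <- function(x, r) pmin(pmax(x, r[1]), r[2])

x <- c(-100, 1:10, 1000)          # synthetic column with two extreme outliers
q <- quantile(x, c(0.01, 0.99))   # 1st and 99th percentile cut-offs
capped <- cap(x, q)

range(x)       # raw extremes
range(capped)  # now bounded by the percentile cut-offs
```

The non-outlying values are left untouched; only the two extremes move.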
Redundant variables are removed from the Training and Testing dataset.
>newtrain <- newtrain[,-c(1,2)]
>newtest <- newtest[,-1]
3.4. Univariate and Multivariate Analysis
The variables can be explored further and analysed using univariate and
multivariate analysis.
> plot_str(newtrain)
> plot_intro(newtrain)
> plot_missing(newtrain)
> plot_histogram(newtrain)
> plot_qq(newtrain)
> plot_bar(newtrain)
> plot_correlation(newtrain)
The liquidity ratio (NWC2TA) is derived by dividing Net working capital by Total assets.
> newtrain$NWC2TA <- newtrain$`Net working capital`/newtrain$`Total assets`
Other ratios are also created by dividing several variables by Total assets; the
contribution of these ratios to the model is assessed later.
> newtrain$Networth2Totalassets <- newtrain$`Net worth`/newtrain$`Total assets`
> newtrain$Totalincome2Totalassets<- newtrain$`Total income`/newtrain$`Total assets`
> newtrain$Totalexpenses2Totalassets <- newtrain$`Total expenses`/newtrain$`Total assets`
> newtrain$Profitaftertax2Totalassets <- newtrain$`Profit after tax`/newtrain$`Total assets`
> newtrain$PBT2Totalassets <- newtrain$PBT/newtrain$`Total assets`
> newtrain$Sales2Totalassets <- newtrain$Sales/newtrain$`Total assets`
> newtrain$Currentliabilitiesprovisions2Totalassets <- newtrain$`Current liabilities & provisions`/newtrain$`Total assets`
> newtrain$Capitalemployed2Totalassets <- newtrain$`Capital employed`/newtrain$`Total assets`
> newtrain$Netfixedassets2Totalassets <- newtrain$`Net fixed assets`/newtrain$`Total assets`
> newtrain$Investments2Totalassets <- newtrain$Investments/newtrain$`Total assets`
> newtrain$Totalliabilities2Totalassets <- newtrain$`Total liabilities`/newtrain$`Total assets`
The most important variables identified from the previous logistic regression
model are used as predictors in this model, with the Default variable as the
response.
Among the most important variables, those with positive estimates are Total
assets, Current ratio and Sales/Total assets, and those with negative estimates
are Cash profit, PAT as % of net worth, Current liabilities & provisions and
Capital employed.
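The fitting call itself is not reproduced at this point in the report. A minimal sketch of the approach, where `toy`, `cash_profit` and `sales_to_assets` are synthetic stand-ins for `newtrain` and the selected financial ratios:

```r
# Sketch only: synthetic data stands in for the project's newtrain
set.seed(42)
toy <- data.frame(cash_profit = rnorm(200), sales_to_assets = rnorm(200))
toy$Default <- rbinom(200, 1, plogis(-1 + 1.5 * toy$sales_to_assets - toy$cash_profit))

# Logistic regression: Default as response, selected ratios as predictors
toyLOGIT <- glm(Default ~ cash_profit + sales_to_assets,
                data = toy, family = binomial(link = "logit"))
summary(toyLOGIT)                              # sign and significance of estimates
pred <- predict(toyLOGIT, type = "response")   # predicted default probabilities
```

In the report, `trainLOGIT` plays the role of `toyLOGIT`, fitted on `newtrain` with the predictors listed above.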
The confusion matrix shows that there are 35 Type I errors and 107 Type II errors.
> accuracy.logit<-sum(diag(tab.logit))/sum(tab.logit)
> accuracy.logit
[1] 0.9591719
The accuracy of the model on the training data is 95.92%.
Call:
roc.default(response = newtrain$Default, predictor = PredLOGIT)
The same logistic regression model is used to predict the testing dataset.
> PredLOGIT <- predict.glm(trainLOGIT, newdata=newtest, type="response")
> tab.logit<-confusion.matrix(newtest$`Default - 1`,PredLOGIT,threshold = 0.5)
> tab.logit
obs
pred 0 1
0 639 20
1 22 34
attr(,"class")
[1] "confusion.matrix"
The confusion matrix shows that there are 22 Type I errors and 20 Type II errors.
> accuracy.logit<-sum(diag(tab.logit))/sum(tab.logit)
> accuracy.logit
[1] 0.9412587
The accuracy of the model on the testing data is 94.13%.
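Beyond overall accuracy, sensitivity and specificity can be read off the same printed matrix (rows are predictions, columns are observations). A short check using those counts:

```r
# Rebuild the printed confusion matrix (rows = predicted, cols = observed)
tab <- matrix(c(639, 22, 20, 34), nrow = 2,
              dimnames = list(pred = c("0", "1"), obs = c("0", "1")))

accuracy    <- sum(diag(tab)) / sum(tab)         # (639 + 34) / 715
sensitivity <- tab["1", "1"] / sum(tab[, "1"])   # defaults caught: 34 / 54
specificity <- tab["0", "0"] / sum(tab[, "0"])   # non-defaults kept: 639 / 661
round(c(accuracy, sensitivity, specificity), 4)
```

The sensitivity (about 63%) is much lower than the accuracy because most firms do not default, so accuracy alone overstates how well the model catches defaults.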
> roc.logit<-roc(newtest$`Default - 1`,PredLOGIT )
Setting levels: control = 0, case = 1
Setting direction: controls < cases
> roc.logit
Call:
roc.default(response = newtest$`Default - 1`, predictor = PredLOGIT)
5.2. Deciling
The training dataset is then divided into 10 deciles based on the probability of default.
> View(rank)
The decile ranks are shown above, sorted in descending order. The 10th decile
has the maximum number of defaults (cnt_resp).
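The ranking logic can be sketched on synthetic probabilities (the report's own deciling code appears in the Source Code section); `pred` and `default` below are hypothetical stand-ins for the model's predicted probabilities and the observed Default flag:

```r
# Synthetic stand-ins for predicted probabilities and observed defaults
set.seed(1)
pred    <- runif(1000)
default <- rbinom(1000, 1, pred)

# Assign each observation to a decile of the predicted probability
rnk <- cut(pred, breaks = quantile(pred, probs = seq(0, 1, 0.1)),
           include.lowest = TRUE, labels = 1:10)

# Defaults per decile, highest-probability decile first (cnt_resp analogue)
cnt_resp <- tapply(default, rnk, sum)
rev(cnt_resp)
```

With a well-ranked model, the counts fall away sharply from the top decile, which is the pattern the report observes.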
The testing dataset is then divided into 10 deciles based on the probability of default.
> newtest$pred = predict(trainLOGIT, newtest, type="response")
> View(rank)
The decile ranks are shown above, sorted in descending order. The 10th decile
has the maximum number of defaults (cnt_resp).
The mean is taken per decile for both the training and testing datasets so that
the predicted and observed default rates can be compared.
> mean.obs.train = aggregate(Default ~ rank, data = newtrain, mean)
> mean.pred.train = aggregate(pred ~ rank, data = newtrain, mean)
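These per-decile means can then be plotted against each other to check calibration. A self-contained sketch with synthetic values standing in for the report's `pred` and `Default` columns:

```r
# Synthetic stand-ins for predicted probabilities and observed defaults
set.seed(2)
pred    <- runif(500)
default <- rbinom(500, 1, pred)
rnk     <- cut(pred, quantile(pred, seq(0, 1, 0.1)),
               include.lowest = TRUE, labels = 1:10)

mean.obs  <- tapply(default, rnk, mean)  # observed default rate per decile
mean.pred <- tapply(pred, rnk, mean)     # mean predicted probability per decile

# A well-calibrated model hugs the 45-degree line
plot(mean.pred, mean.obs, xlab = "Mean predicted", ylab = "Mean observed")
abline(0, 1, lty = 2)
```

Points far from the diagonal flag deciles where the predicted probability of default systematically over- or under-states the observed rate.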
6. Source Code
---
title: "Untitled"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and
MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
When you click the **Knit** button a document will be generated that includes both content as well as the
output of any embedded R code chunks within the document. You can embed an R code chunk like this:
```{r cars}
summary(cars)
```
## Including Plots
```{r pressure, echo=FALSE}
plot(pressure)
```
Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that
generated the plot.
```{r}
##install.packages("DataExplorer")
library(DataExplorer)
library(readxl)
dim(train)
names(train)
head(train)
dim(test)
names(test)
head(test)
```
```{r}
#summary
#summary(newtrain)
newtrain<-as.data.frame(newtrain)
for(i in 1:length(newtrain)){
print(paste(colnames(newtrain[i]),class(newtrain[,i])))}
plot_intro(newtrain)
for(i in 1:ncol(newtrain)){
newtrain[,i] <- as.numeric(newtrain[,i])
newtrain[is.na(newtrain[,i]), i] <- median(newtrain[,i], na.rm = TRUE)
}
newtest<-as.data.frame(newtest)
for(i in 1:length(newtest)){
print(paste(colnames(newtest[i]),class(newtest[,i])))}
plot_intro(newtest)
for(i in 1:ncol(newtest)){
newtest[,i] <- as.numeric(newtest[,i])
newtest[is.na(newtest[,i]), i] <- median(newtest[,i], na.rm = TRUE)
}
plot_intro(newtest)
#identify outlier
boxplot(newtrain)
library(scales)
for(i in 2:ncol(newtrain)){
q <- quantile(newtrain[,i], c(0.01, 0.99))
newtrain[,i] <- squish(newtrain[,i], q)
}
summary(newtrain)
```
```{r}
plot_str(newtrain)
plot_intro(newtrain)
plot_missing(newtrain)
plot_histogram(newtrain)
plot_density(newtrain)
plot_qq(newtrain)
plot_bar(newtrain)
plot_correlation(newtrain)
```
```{r}
plot_str(newtest)
plot_intro(newtest)
plot_missing(newtest)
plot_histogram(newtest)
plot_density(newtest)
plot_qq(newtest)
plot_bar(newtest)
plot_correlation(newtest)
```
```{r}
#Liquidity
newtrain$NWC2TA <- newtrain$`Net working capital`/newtrain$`Total assets`
#Leverage
newtrain$Totalliabilities2Totalassets <- newtrain$`Total liabilities`/newtrain$`Total assets`
```
```{r multicollinearity}
#vif(newtrain)
#for(i in 1:length(newtrain)){
## print(paste(colnames(newtrain[i]),class(newtrain[,i])))}
```
```{r}
summary(trainLOGIT)
```
```{r}
##install.packages("SDMTools")
library(SDMTools)
##install.packages("pROC")
library(pROC)
PredLOGIT <- predict(trainLOGIT, type = "response")
tab.logit<-confusion.matrix(newtrain$Default,PredLOGIT,threshold = 0.5)
tab.logit
accuracy.logit<-sum(diag(tab.logit))/sum(tab.logit)
accuracy.logit
roc.logit<-roc(newtrain$Default,PredLOGIT)
roc.logit
plot(roc.logit)
PredLOGIT <- predict.glm(trainLOGIT, newdata=newtest, type="response")
tab.logit<-confusion.matrix(newtest$`Default - 1`,PredLOGIT,threshold = 0.5)
tab.logit
accuracy.logit<-sum(diag(tab.logit))/sum(tab.logit)
accuracy.logit
roc.logit<-roc(newtest$`Default - 1`,PredLOGIT)
roc.logit
plot(roc.logit)
```
```{r}
decile <- function(x){
deciles <- vector(length=10)
for (i in seq(0.1,1,.1)) deciles[i*10] <- quantile(x, i, na.rm = TRUE)
return (
ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10
))))))))))
}
library(data.table)
newtrain$pred <- predict(trainLOGIT, newtrain, type="response")
newtrain$deciles <- decile(newtrain$pred)
tmp_DT = data.table(newtrain)
rank <- tmp_DT[, list(cnt=length(Default),
cnt_resp=sum(Default==1),
cnt_non_resp=sum(Default==0)
), by=deciles][order(-deciles)]
View(rank)
```
```{r}
decile <- function(x){
deciles <- vector(length=10)
for (i in seq(0.1,1,.1)) deciles[i*10] <- quantile(x, i, na.rm = TRUE)
return (
ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10
))))))))))
}
newtest$pred <- predict(trainLOGIT, newtest, type="response")
newtest$deciles <- decile(newtest$pred)
tmp_DT = data.table(newtest)
rank <- tmp_DT[, list(cnt=length(`Default - 1`),
cnt_resp=sum(`Default - 1`==1),
cnt_non_resp=sum(`Default - 1`==0)
), by=deciles][order(-deciles)]
newtestRank<-rank
View(rank)
```
```{r}
cut_ptrain = with(newtrain, cut(pred, breaks = quantile(pred, probs = seq(0, 1, 0.1)), include.lowest = TRUE))
cut_ptest = with(newtest, cut(pred, breaks = quantile(pred, probs = seq(0, 1, 0.1)), include.lowest = TRUE))
levels(cut_ptrain)
levels(cut_ptest)
par(mfrow=c(1,2))
title(main="Validation Sample")
```