08.09.2019
─
Janani Prakash
PGPBABI-Online
GreatLearning, Great Lakes Institute of Management
Project Objective
Modelling
Logistic Regression
Analysis
Source Code
1. Project Objective
The objective of the project is to build an India credit risk (default) model using the given
training dataset and validate it on the holdout dataset. The logistic regression framework is
used to develop the credit default model.
The dataset contains the following key variables: Net worth next year, Total assets, Net worth,
Total income, Total expenses, Profit after tax, PBDITA, PBT (Profit Before Tax), Cash profit,
PBDITA as % of total income, PBT as % of total income, Cash profit as % of total income,
PAT as % of net worth, Sales, Total capital, Reserves and funds, Borrowings, Current
liabilities & provisions, Capital employed, Net fixed assets, Investments, Net working capital,
Debt to equity ratio (times), Cash to current liabilities (times), Total liabilities.
In addition to the above variables, there are other financial parameters that describe the
financial strength of the organization, taking the total tally of variables to 51.
> dim(train)
[1] 3541 52
The training dataset contains 3541 observations and 52 variables.
> names(train)
[1] "Num" "Networth Next Year"
[3] "Total assets" "Net worth"
[5] "Total income" "Change in stock"
[7] "Total expenses" "Profit after tax"
[9] "PBDITA" "PBT"
[11] "Cash profit" "PBDITA as % of total income"
[13] "PBT as % of total income" "PAT as % of total income"
[15] "Cash profit as % of total income" "PAT as % of net worth"
[17] "Sales" "Income from financial services"
[19] "Other income" "Total capital"
[21] "Reserves and funds" "Deposits (accepted by commercial banks)"
[23] "Borrowings" "Current liabilities & provisions"
[25] "Deferred tax liability" "Shareholders funds"
[27] "Cumulative retained profits" "Capital employed"
[29] "TOL/TNW" "Total term liabilities / tangible net worth"
[31] "Contingent liabilities / Net worth (%)" "Contingent liabilities"
[33] "Net fixed assets" "Investments"
[35] "Current assets" "Net working capital"
[37] "Quick ratio (times)" "Current ratio (times)"
[39] "Debt to equity ratio (times)" "Cash to current liabilities (times)"
[41] "Cash to average cost of sales per day" "Creditors turnover"
[43] "Debtors turnover" "Finished goods turnover"
[45] "WIP turnover" "Raw material turnover"
[47] "Shares outstanding" "Equity face value"
[49] "EPS" "Adjusted EPS"
[51] "Total liabilities" "PE on BSE"
> dim(test)
[1] 715 52
> names(test)
[1] "Num" "Default - 1"
[3] "Total assets" "Net worth"
[5] "Total income" "Change in stock"
[7] "Total expenses" "Profit after tax"
[9] "PBDITA" "PBT"
[11] "Cash profit" "PBDITA as % of total income"
[13] "PBT as % of total income" "PAT as % of total income"
[15] "Cash profit as % of total income" "PAT as % of net worth"
[17] "Sales" "Income from financial services"
[19] "Other income" "Total capital"
[21] "Reserves and funds" "Deposits (accepted by commercial banks)"
[23] "Borrowings" "Current liabilities & provisions"
[25] "Deferred tax liability" "Shareholders funds"
[27] "Cumulative retained profits" "Capital employed"
[29] "TOL/TNW" "Total term liabilities / tangible net worth"
[31] "Contingent liabilities / Net worth (%)" "Contingent liabilities"
[33] "Net fixed assets" "Investments"
[35] "Current assets" "Net working capital"
[37] "Quick ratio (times)" "Current ratio (times)"
[39] "Debt to equity ratio (times)" "Cash to current liabilities (times)"
[41] "Cash to average cost of sales per day" "Creditors turnover"
[43] "Debtors turnover" "Finished goods turnover"
[45] "WIP turnover" "Raw material turnover"
[47] "Shares outstanding" "Equity face value"
[49] "EPS" "Adjusted EPS"
[51] "Total liabilities" "PE on BSE"
The training dataset does not have a default variable, so one is created from the
observations of the 'Networth Next Year' variable. Firms expected to have a
negative net worth next year are likely to default, so negative observations of
'Networth Next Year' are coded as '1' in the Default variable and positive
observations as '0'.
>newtrain$Default <- ifelse(newtrain$`Networth Next Year` < 0, 1, 0)
Companies with Total assets of 3 or less are removed from further analysis.
>newtrain <- newtrain[newtrain$`Total assets` > 3, ]
This plot shows that the training dataset has 6.8% missing observations and
1.9% missing columns.
>plot_intro(newtrain)
The variables of type character are converted to type numeric, and the missing
observations in each column are replaced with the median of that column for the
whole training dataset.
>for(i in 1:ncol(newtrain)){
newtrain[,i] <- as.numeric(newtrain[,i])
newtrain[is.na(newtrain[,i]), i] <- median(newtrain[,i], na.rm = TRUE)
}
The training dataset has missing observations as well as missing columns. The
missing observations are replaced with the median of their column, and the
missing columns are removed from the dataset.
>newtrain <- newtrain[,-22]
On running the plot again, it shows that the training dataset does not have any
missing observations or columns.
>plot_intro(newtrain)
Similarly, the testing dataset also has variables of type character.
> newtest<-as.data.frame(newtest)
> for(i in 1:length(newtest)){
+ print(paste(colnames(newtest[i]),class(newtest[,i])))}
[1] "Num numeric"
[1] "Default - 1 numeric"
[1] "Total assets numeric"
[1] "Net worth numeric"
[1] "Total income numeric"
[1] "Change in stock numeric"
[1] "Total expenses numeric"
[1] "Profit after tax numeric"
[1] "PBDITA numeric"
[1] "PBT numeric"
[1] "Cash profit numeric"
[1] "PBDITA as % of total income numeric"
[1] "PBT as % of total income numeric"
[1] "PAT as % of total income numeric"
[1] "Cash profit as % of total income numeric"
[1] "PAT as % of net worth numeric"
[1] "Sales numeric"
[1] "Income from financial services numeric"
[1] "Other income numeric"
[1] "Total capital numeric"
[1] "Reserves and funds numeric"
[1] "Deposits (accepted by commercial banks) logical"
[1] "Borrowings numeric"
[1] "Current liabilities & provisions numeric"
[1] "Deferred tax liability numeric"
[1] "Shareholders funds numeric"
[1] "Cumulative retained profits numeric"
[1] "Capital employed numeric"
[1] "TOL/TNW numeric"
[1] "Total term liabilities / tangible net worth numeric"
[1] "Contingent liabilities / Net worth (%) numeric"
[1] "Contingent liabilities numeric"
[1] "Net fixed assets numeric"
[1] "Investments numeric"
[1] "Current assets numeric"
[1] "Net working capital numeric"
[1] "Quick ratio (times) numeric"
[1] "Current ratio (times) numeric"
[1] "Debt to equity ratio (times) numeric"
[1] "Cash to current liabilities (times) numeric"
[1] "Cash to average cost of sales per day numeric"
[1] "Creditors turnover character"
[1] "Debtors turnover character"
[1] "Finished goods turnover character"
[1] "WIP turnover character"
[1] "Raw material turnover character"
[1] "Shares outstanding character"
[1] "Equity face value character"
[1] "EPS numeric"
[1] "Adjusted EPS numeric"
[1] "Total liabilities numeric"
[1] "PE on BSE character"
This plot shows that the testing dataset has 7% missing observations and 1.9%
missing columns.
> plot_intro(newtest)
In the following code, the variables of type character are converted to type
numeric, and the missing observations in each column are replaced with the
median of that column.
> for(i in 1:ncol(newtest)){
+ newtest[,i] <- as.numeric(newtest[,i])
+ newtest[is.na(newtest[,i]), i] <- median(newtest[,i], na.rm = TRUE)
+}
The missing columns are removed from the dataset.
>newtest <- newtest[,-22]
This plot shows that the dataset does not contain any missing observations or
missing columns.
>plot_intro(newtest)
3.3. Outlier treatment
The outliers in the dataset are treated by replacing observations below the 1st
percentile with the value of the 1st percentile, and observations above the 99th
percentile with the value of the 99th percentile. This outlier treatment is
applied to every column in the dataset.
The quantile function identifies the 1st- and 99th-percentile cut-offs, and the
squish function caps the identified outliers at those values.
>for(i in 2:ncol(newtrain)){
q <- quantile(newtrain[,i], c(0.01, 0.99))
newtrain[,i] <- squish(newtrain[,i], q)
}
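As a sanity check of the capping step, `squish(x, range)` from the scales package simply forces values below the lower bound up to it and values above the upper bound down to it. A base-R sketch of the same behaviour, on synthetic values rather than the project data:

```r
# Base-R equivalent of scales::squish on finite values:
# cap everything outside [r[1], r[2]] to the nearest bound.
cap <- function(x, r) pmin(pmax(x, r[1]), r[2])

x <- c(-100, 1:10, 1000)          # synthetic column with two extreme outliers
q <- quantile(x, c(0.01, 0.99))   # 1st and 99th percentile cut-offs
capped <- cap(x, q)

range(x)       # raw extremes
range(capped)  # now bounded by the percentile cut-offs
```

The non-outlying values are left untouched; only the two extremes move.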
Redundant variables are removed from the Training and Testing dataset.
>newtrain <- newtrain[,-c(1,2)]
>newtest <- newtest[,-1]
3.4. Univariate and Multivariate Analysis
The variables can be explored further and analysed using univariate and
multivariate analysis.
> plot_str(newtrain)
> plot_intro(newtrain)
> plot_missing(newtrain)
> plot_histogram(newtrain)
> plot_qq(newtrain)
> plot_bar(newtrain)
> plot_correlation(newtrain)
The liquidity ratio (NWC2TA) is derived by dividing Net working capital by Total assets.
> newtrain$NWC2TA <- newtrain$`Net working capital`/newtrain$`Total assets`
Other ratios are also created by dividing several variables by Total assets; the
contribution of these ratios to the model is assessed later.
> newtrain$Networth2Totalassets <- newtrain$`Net worth`/newtrain$`Total assets`
> newtrain$Totalincome2Totalassets<- newtrain$`Total income`/newtrain$`Total assets`
> newtrain$Totalexpenses2Totalassets <- newtrain$`Total expenses`/newtrain$`Total assets`
> newtrain$Profitaftertax2Totalassets <- newtrain$`Profit after tax`/newtrain$`Total assets`
> newtrain$PBT2Totalassets <- newtrain$PBT/newtrain$`Total assets`
> newtrain$Sales2Totalassets <- newtrain$Sales/newtrain$`Total assets`
> newtrain$Currentliabilitiesprovisions2Totalassets <- newtrain$`Current liabilities & provisions`/newtrain$`Total assets`
> newtrain$Capitalemployed2Totalassets <- newtrain$`Capital employed`/newtrain$`Total assets`
> newtrain$Netfixedassets2Totalassets <- newtrain$`Net fixed assets`/newtrain$`Total assets`
> newtrain$Investments2Totalassets <- newtrain$Investments/newtrain$`Total assets`
> newtrain$Totalliabilities2Totalassets <- newtrain$`Total liabilities`/newtrain$`Total assets`
The most important variables identified from the previous logistic regression
model are used as predictors in this model, with the Default variable as the
response.
Among the most important variables, those with positive estimates are Total
assets, Current ratio and Sales/Total assets, and those with negative estimates
are Cash profit, PAT as % of net worth, Current liabilities & provisions and
Capital employed.
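The fitting call itself is not reproduced at this point in the report. A minimal sketch of the approach, where `toy`, `cash_profit` and `sales_to_assets` are synthetic stand-ins for `newtrain` and the selected financial ratios:

```r
# Sketch only: synthetic data stands in for the project's newtrain
set.seed(42)
toy <- data.frame(cash_profit = rnorm(200), sales_to_assets = rnorm(200))
toy$Default <- rbinom(200, 1, plogis(-1 + 1.5 * toy$sales_to_assets - toy$cash_profit))

# Logistic regression: Default as response, selected ratios as predictors
toyLOGIT <- glm(Default ~ cash_profit + sales_to_assets,
                data = toy, family = binomial(link = "logit"))
summary(toyLOGIT)                              # sign and significance of estimates
pred <- predict(toyLOGIT, type = "response")   # predicted default probabilities
```

In the report, `trainLOGIT` plays the role of `toyLOGIT`, fitted on `newtrain` with the predictors listed above.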
The confusion matrix shows that there are 35 Type I errors and 107 Type II errors.
> accuracy.logit<-sum(diag(tab.logit))/sum(tab.logit)
> accuracy.logit
[1] 0.9591719
The accuracy of the model on the training data is 95.92%.
Call:
roc.default(response = newtrain$Default, predictor = PredLOGIT)
The same logistic regression model is used to predict the testing dataset.
> PredLOGIT <- predict.glm(trainLOGIT, newdata=newtest, type="response")
> tab.logit<-confusion.matrix(newtest$`Default - 1`,PredLOGIT,threshold = 0.5)
> tab.logit
obs
pred 0 1
0 639 20
1 22 34
attr(,"class")
[1] "confusion.matrix"
The confusion matrix shows that there are 22 Type I errors and 20 Type II errors.
> accuracy.logit<-sum(diag(tab.logit))/sum(tab.logit)
> accuracy.logit
[1] 0.9412587
The accuracy of the model on the testing data is 94.13%.
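Beyond overall accuracy, sensitivity and specificity can be read off the same printed matrix (rows are predictions, columns are observations). A short check using those counts:

```r
# Rebuild the printed confusion matrix (rows = predicted, cols = observed)
tab <- matrix(c(639, 22, 20, 34), nrow = 2,
              dimnames = list(pred = c("0", "1"), obs = c("0", "1")))

accuracy    <- sum(diag(tab)) / sum(tab)         # (639 + 34) / 715
sensitivity <- tab["1", "1"] / sum(tab[, "1"])   # defaults caught: 34 / 54
specificity <- tab["0", "0"] / sum(tab[, "0"])   # non-defaults kept: 639 / 661
round(c(accuracy, sensitivity, specificity), 4)
```

The sensitivity (about 63%) is much lower than the accuracy because most firms do not default, so accuracy alone overstates how well the model catches defaults.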
> roc.logit<-roc(newtest$`Default - 1`,PredLOGIT )
Setting levels: control = 0, case = 1
Setting direction: controls < cases
> roc.logit
Call:
roc.default(response = newtest$`Default - 1`, predictor = PredLOGIT)
5.2. Deciling
The training dataset is then divided into 10 deciles based on the probability of default.
> View(rank)
The decile ranks are shown above, sorted in descending order. The 10th decile
has the maximum number of defaults (cnt_resp).
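The ranking logic can be sketched on synthetic probabilities (the report's own deciling code appears in the Source Code section); `pred` and `default` below are hypothetical stand-ins for the model's predicted probabilities and the observed Default flag:

```r
# Synthetic stand-ins for predicted probabilities and observed defaults
set.seed(1)
pred    <- runif(1000)
default <- rbinom(1000, 1, pred)

# Assign each observation to a decile of the predicted probability
rnk <- cut(pred, breaks = quantile(pred, probs = seq(0, 1, 0.1)),
           include.lowest = TRUE, labels = 1:10)

# Defaults per decile, highest-probability decile first (cnt_resp analogue)
cnt_resp <- tapply(default, rnk, sum)
rev(cnt_resp)
```

With a well-ranked model, the counts fall away sharply from the top decile, which is the pattern the report observes.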
The testing dataset is then divided into 10 deciles based on the probability of default.
> newtest$pred = predict(trainLOGIT, newtest, type="response")
> View(rank)
The decile ranks are shown above, sorted in descending order. The 10th decile
has the maximum number of defaults (cnt_resp).
The mean is taken per decile for both the training and testing datasets so that
the predicted and observed default rates can be compared.
> mean.obs.train = aggregate(Default ~ rank, data = newtrain, mean)
> mean.pred.train = aggregate(pred ~ rank, data = newtrain, mean)
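These per-decile means can then be plotted against each other to check calibration. A self-contained sketch with synthetic values standing in for the report's `pred` and `Default` columns:

```r
# Synthetic stand-ins for predicted probabilities and observed defaults
set.seed(2)
pred    <- runif(500)
default <- rbinom(500, 1, pred)
rnk     <- cut(pred, quantile(pred, seq(0, 1, 0.1)),
               include.lowest = TRUE, labels = 1:10)

mean.obs  <- tapply(default, rnk, mean)  # observed default rate per decile
mean.pred <- tapply(pred, rnk, mean)     # mean predicted probability per decile

# A well-calibrated model hugs the 45-degree line
plot(mean.pred, mean.obs, xlab = "Mean predicted", ylab = "Mean observed")
abline(0, 1, lty = 2)
```

Points far from the diagonal flag deciles where the predicted probability of default systematically over- or under-states the observed rate.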
6. Source Code
---
title: "Untitled"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## R Markdown
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and
MS Word documents. For more details on using R Markdown see <http://rmarkdown.rstudio.com>.
When you click the **Knit** button a document will be generated that includes both content as well as the
output of any embedded R code chunks within the document. You can embed an R code chunk like this:
```{r cars}
summary(cars)
```
## Including Plots
```{r pressure, echo=FALSE}
plot(pressure)
```
Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that
generated the plot.
```{r}
##install.packages("DataExplorer")
library(DataExplorer)
library(readxl)
dim(train)
names(train)
head(train)
dim(test)
names(test)
head(test)
```
```{r}
#summary
#summary(newtrain)
newtrain<-as.data.frame(newtrain)
for(i in 1:length(newtrain)){
print(paste(colnames(newtrain[i]),class(newtrain[,i])))}
plot_intro(newtrain)
for(i in 1:ncol(newtrain)){
newtrain[,i] <- as.numeric(newtrain[,i])
newtrain[is.na(newtrain[,i]), i] <- median(newtrain[,i], na.rm = TRUE)
}
newtest<-as.data.frame(newtest)
for(i in 1:length(newtest)){
print(paste(colnames(newtest[i]),class(newtest[,i])))}
plot_intro(newtest)
for(i in 1:ncol(newtest)){
newtest[,i] <- as.numeric(newtest[,i])
newtest[is.na(newtest[,i]), i] <- median(newtest[,i], na.rm = TRUE)
}
plot_intro(newtest)
#identify outlier
boxplot(newtrain)
library(scales)
for(i in 2:ncol(newtrain)){
q <- quantile(newtrain[,i], c(0.01, 0.99))
newtrain[,i] <- squish(newtrain[,i], q)
}
summary(newtrain)
```
```{r}
plot_str(newtrain)
plot_intro(newtrain)
plot_missing(newtrain)
plot_histogram(newtrain)
plot_density(newtrain)
plot_qq(newtrain)
plot_bar(newtrain)
plot_correlation(newtrain)
```
```{r}
plot_str(newtest)
plot_intro(newtest)
plot_missing(newtest)
plot_histogram(newtest)
plot_density(newtest)
plot_qq(newtest)
plot_bar(newtest)
plot_correlation(newtest)
```
```{r}
#Liquidity
newtrain$NWC2TA <- newtrain$`Net working capital`/newtrain$`Total assets`
#Leverage
newtrain$Totalliabilities2Totalassets <- newtrain$`Total liabilities`/newtrain$`Total assets`
```
```{r multicollinearity}
#vif(newtrain)
#for(i in 1:length(newtrain)){
## print(paste(colnames(newtrain[i]),class(newtrain[,i])))}
```
```{r}
summary(trainLOGIT)
```
```{r}
##install.packages("SDMTools")
library(SDMTools)
##install.packages("pROC")
library(pROC)
PredLOGIT <- predict(trainLOGIT, type = "response")
tab.logit<-confusion.matrix(newtrain$Default,PredLOGIT,threshold = 0.5)
tab.logit
accuracy.logit<-sum(diag(tab.logit))/sum(tab.logit)
accuracy.logit
roc.logit<-roc(newtrain$Default,PredLOGIT)
roc.logit
plot(roc.logit)
PredLOGIT <- predict.glm(trainLOGIT, newdata=newtest, type="response")
tab.logit<-confusion.matrix(newtest$`Default - 1`,PredLOGIT,threshold = 0.5)
tab.logit
accuracy.logit<-sum(diag(tab.logit))/sum(tab.logit)
accuracy.logit
roc.logit<-roc(newtest$`Default - 1`,PredLOGIT)
roc.logit
plot(roc.logit)
```
```{r}
decile <- function(x){
deciles <- vector(length=10)
for (i in seq(0.1,1,.1)) deciles[i*10] <- quantile(x, i, na.rm = TRUE)
return (
ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10
))))))))))
}
library(data.table)
newtrain$pred <- predict(trainLOGIT, newtrain, type="response")
newtrain$deciles <- decile(newtrain$pred)
tmp_DT = data.table(newtrain)
rank <- tmp_DT[, list(cnt=length(Default),
cnt_resp=sum(Default==1),
cnt_non_resp=sum(Default==0)
), by=deciles][order(-deciles)]
View(rank)
```
```{r}
decile <- function(x){
deciles <- vector(length=10)
for (i in seq(0.1,1,.1)) deciles[i*10] <- quantile(x, i, na.rm = TRUE)
return (
ifelse(x<deciles[1], 1,
ifelse(x<deciles[2], 2,
ifelse(x<deciles[3], 3,
ifelse(x<deciles[4], 4,
ifelse(x<deciles[5], 5,
ifelse(x<deciles[6], 6,
ifelse(x<deciles[7], 7,
ifelse(x<deciles[8], 8,
ifelse(x<deciles[9], 9, 10
))))))))))
}
newtest$pred <- predict(trainLOGIT, newtest, type="response")
newtest$deciles <- decile(newtest$pred)
tmp_DT = data.table(newtest)
rank <- tmp_DT[, list(cnt=length(`Default - 1`),
cnt_resp=sum(`Default - 1`==1),
cnt_non_resp=sum(`Default - 1`==0)
), by=deciles][order(-deciles)]
newtestRank<-rank
View(rank)
```
```{r}
cut_ptrain = with(newtrain, cut(pred, breaks = quantile(pred, probs = seq(0, 1, 0.1)), include.lowest = TRUE))
cut_ptest = with(newtest, cut(pred, breaks = quantile(pred, probs = seq(0, 1, 0.1)), include.lowest = TRUE))
levels(cut_ptrain)
levels(cut_ptest)
par(mfrow=c(1,2))
title(main="Validation Sample")
```