
Credit Risk Analysis

08.09.2019

Janani Prakash
PGPBABI-Online
GreatLearning, Great Lakes Institute of Management
Project Objective

Directory and dataset creation

Install necessary Packages and Invoke Libraries
Set up working Directory
Import and Read the Dataset

Exploratory Data Analysis

Importing Dataset
Missing value treatment
Outlier treatment
Univariate and Multivariate Analysis
Variable Creation

Modelling
Logistic Regression
Analysis

Model Performance and Measure

Model performance on Train and Test data
Deciling

Source Code
1. Project Objective
The objective of the project is to create an India credit risk (default) model using the given
training dataset and validate it on the holdout dataset. A logistic regression framework is to
be used to develop the credit default model.

The data provided in the raw-data file comprises financial data.

The major data points or variables are:

Net worth next year, Total assets, Net worth, Total income, Total expenses, Profit after tax,
PBDITA, PBT (Profit Before Tax), Cash profit, PBDITA as % of total income, PBT as % of
total income, Cash profit as % of total income, PAT as % of net worth, Sales, Total capital,
Reserves and funds, Borrowings, Current liabilities & provisions, Capital employed, Net
fixed assets, Investments, Net working capital, Debt to equity ratio (times), Cash to current
liabilities (times), Total liabilities.

In addition to the above variables, there are other financial parameters that describe the
financial strength of the organisation, taking the total tally of variables to 51.

The process below is to be followed:

1. Exploratory Data Analysis(EDA)


a. Outlier treatment has to be done
b. Missing value treatment has to be done
c. New variables for Profitability, leverage and liquidity have to be created
d. Univariate and bivariate analysis has to be done
2. Modelling
a. Logistic Regression Model has to be built on important variables
b. Coefficients of important variables have to be analysed
3. Model Performance Measures
a. The accuracy of the model has to be measured on the training and
holdout datasets
b. The data has to be sorted in descending order of the probability of
default and then divided into 10 deciles based on that probability

2. Directory and dataset creation


2.1.1. Install necessary Packages and Invoke Libraries
The necessary packages were installed and the associated libraries were
invoked. Having all the package calls in one place increases code
readability.
Please refer Appendix A for Source Code.
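A minimal sketch of this setup, based on the libraries invoked in the source code (Appendix A):

# Install once, then invoke the libraries used throughout the project
# install.packages(c("DataExplorer", "readxl", "scales", "SDMTools", "pROC", "data.table"))
library(DataExplorer)  # EDA helpers: plot_intro, plot_histogram, plot_correlation, ...
library(readxl)        # read_excel for .xlsx files
library(scales)        # squish (outlier capping) and percent
library(SDMTools)      # confusion.matrix
library(pROC)          # roc and AUC
library(data.table)    # decile ranking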

2.1.2. Set up working Directory


Setting the working directory at the start of the R session makes importing
and exporting data and code files easier. The working directory is the
folder on the PC that holds the data, code and other files related to the
project.
Please refer Appendix A for Source Code.
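For example (the folder path is an assumption, inferred from the import paths used below):

# Assumed project folder; replace with the actual location of the data files
setwd("C:/Users/Janani Prakash/Desktop/R")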

2.1.3. Import and Read the Dataset


The given dataset is in .xlsx format. Hence, the command 'read_excel' from
the readxl package is used for importing the file.
Please refer Appendix A for Source Code.

3. Exploratory Data Analysis

3.1. Importing Dataset


There are two datasets, training and testing, with similar variables. Each
consists of organisation-level financial details such as Net worth next year,
Total assets, Net worth, Total income, Total expenses, Profit after tax and
the remaining indicators listed in the Project Objective.

The datasets are imported for further analysis:

>train <- read_excel("C:/Users/Janani Prakash/Desktop/R/raw.xlsx")


>test <- read_excel("C:/Users/Janani Prakash/Desktop/R/valid.xlsx")

> dim(train)
[1] 3541 52
The training dataset contains 3541 observations and 52 variables.

> names(train)
[1] "Num" "Networth Next Year"
[3] "Total assets" "Net worth"
[5] "Total income" "Change in stock"
[7] "Total expenses" "Profit after tax"
[9] "PBDITA" "PBT"
[11] "Cash profit" "PBDITA as % of total income"
[13] "PBT as % of total income" "PAT as % of total income"
[15] "Cash profit as % of total income" "PAT as % of net worth"
[17] "Sales" "Income from financial services"
[19] "Other income" "Total capital"
[21] "Reserves and funds" "Deposits (accepted by commercial
banks)"
[23] "Borrowings" "Current liabilities & provisions"
[25] "Deferred tax liability" "Shareholders funds"
[27] "Cumulative retained profits" "Capital employed"
[29] "TOL/TNW" "Total term liabilities / tangible net worth"
[31] "Contingent liabilities / Net worth (%)" "Contingent liabilities"
[33] "Net fixed assets" "Investments"
[35] "Current assets" "Net working capital"
[37] "Quick ratio (times)" "Current ratio (times)"
[39] "Debt to equity ratio (times)" "Cash to current liabilities (times)"
[41] "Cash to average cost of sales per day" "Creditors turnover"
[43] "Debtors turnover" "Finished goods turnover"
[45] "WIP turnover" "Raw material turnover"
[47] "Shares outstanding" "Equity face value"
[49] "EPS" "Adjusted EPS"
[51] "Total liabilities" "PE on BSE"

The names of the 52 variables are displayed above.

> dim(test)
[1] 715 52
> names(test)
[1] "Num" "Default - 1"
[3] "Total assets" "Net worth"
[5] "Total income" "Change in stock"
[7] "Total expenses" "Profit after tax"
[9] "PBDITA" "PBT"
[11] "Cash profit" "PBDITA as % of total income"
[13] "PBT as % of total income" "PAT as % of total income"
[15] "Cash profit as % of total income" "PAT as % of net worth"
[17] "Sales" "Income from financial services"
[19] "Other income" "Total capital"
[21] "Reserves and funds" "Deposits (accepted by commercial
banks)"
[23] "Borrowings" "Current liabilities & provisions"
[25] "Deferred tax liability" "Shareholders funds"
[27] "Cumulative retained profits" "Capital employed"
[29] "TOL/TNW" "Total term liabilities / tangible net worth"
[31] "Contingent liabilities / Net worth (%)" "Contingent liabilities"
[33] "Net fixed assets" "Investments"
[35] "Current assets" "Net working capital"
[37] "Quick ratio (times)" "Current ratio (times)"
[39] "Debt to equity ratio (times)" "Cash to current liabilities (times)"
[41] "Cash to average cost of sales per day" "Creditors turnover"
[43] "Debtors turnover" "Finished goods turnover"
[45] "WIP turnover" "Raw material turnover"
[47] "Shares outstanding" "Equity face value"
[49] "EPS" "Adjusted EPS"
[51] "Total liabilities" "PE on BSE"

Working copies of both datasets are made for further processing.


>newtrain <- train
>newtest <- test

The training dataset does not have a default variable, so one is created from
the 'Networth Next Year' variable. Firms expected to have negative net worth
next year are treated as likely to default: negative observations in
'Networth Next Year' become '1' in the Default variable, and non-negative
observations become '0'.
>newtrain$Default <- ifelse(newtrain$'Networth Next Year' < 0 ,1,0)

Companies with total assets of 3 or less are removed from further analysis.
>newtrain <- newtrain[!newtrain$`Total assets` <= 3, ]

3.2. Missing value treatment


The dataset should contain only numeric values for performing logistic
regression, but the output below shows that a few variables are of class
'character' (and one of class 'logical').
> newtrain<-as.data.frame(newtrain)
> for(i in 1:length(newtrain)){
+ print(paste(colnames(newtrain[i]),class(newtrain[,i])))}
[1] "Num numeric"
[1] "Networth Next Year numeric"
[1] "Total assets numeric"
[1] "Net worth numeric"
[1] "Total income numeric"
[1] "Change in stock numeric"
[1] "Total expenses numeric"
[1] "Profit after tax numeric"
[1] "PBDITA numeric"
[1] "PBT numeric"
[1] "Cash profit numeric"
[1] "PBDITA as % of total income numeric"
[1] "PBT as % of total income numeric"
[1] "PAT as % of total income numeric"
[1] "Cash profit as % of total income numeric"
[1] "PAT as % of net worth numeric"
[1] "Sales numeric"
[1] "Income from financial services numeric"
[1] "Other income numeric"
[1] "Total capital numeric"
[1] "Reserves and funds numeric"
[1] "Deposits (accepted by commercial banks) logical"
[1] "Borrowings numeric"
[1] "Current liabilities & provisions numeric"
[1] "Deferred tax liability numeric"
[1] "Shareholders funds numeric"
[1] "Cumulative retained profits numeric"
[1] "Capital employed numeric"
[1] "TOL/TNW numeric"
[1] "Total term liabilities / tangible net worth numeric"
[1] "Contingent liabilities / Net worth (%) numeric"
[1] "Contingent liabilities numeric"
[1] "Net fixed assets numeric"
[1] "Investments numeric"
[1] "Current assets numeric"
[1] "Net working capital numeric"
[1] "Quick ratio (times) numeric"
[1] "Current ratio (times) numeric"
[1] "Debt to equity ratio (times) numeric"
[1] "Cash to current liabilities (times) numeric"
[1] "Cash to average cost of sales per day numeric"
[1] "Creditors turnover character"
[1] "Debtors turnover character"
[1] "Finished goods turnover character"
[1] "WIP turnover character"
[1] "Raw material turnover character"
[1] "Shares outstanding character"
[1] "Equity face value character"
[1] "EPS numeric"
[1] "Adjusted EPS numeric"
[1] "Total liabilities numeric"
[1] "PE on BSE character"
[1] "Default numeric"

This plot shows that the training dataset has 6.8% missing observations and
1.9% missing columns.
>plot_intro(newtrain)

The character variables are converted to numeric, and the missing
observations in each column are replaced with the median of that column
across the whole training dataset. (Non-numeric strings are coerced to NA by
as.numeric and are then filled by the median.)
>for(i in 1:ncol(newtrain)){
newtrain[,i] <- as.numeric(newtrain[,i])
newtrain[is.na(newtrain[,i]), i] <- median(newtrain[,i], na.rm = TRUE)
}
The column that is entirely missing ('Deposits (accepted by commercial
banks)', column 22) is removed from the dataset.
>newtrain <- newtrain[,-22]
On running the plot again, it shows that the training dataset does not have any
missing observations or columns.
>plot_intro(newtrain)

Similarly, the testing dataset also has variables of type character.
> newtest<-as.data.frame(newtest)
> for(i in 1:length(newtest)){
+ print(paste(colnames(newtest[i]),class(newtest[,i])))}
[1] "Num numeric"
[1] "Default - 1 numeric"
[1] "Total assets numeric"
[1] "Net worth numeric"
[1] "Total income numeric"
[1] "Change in stock numeric"
[1] "Total expenses numeric"
[1] "Profit after tax numeric"
[1] "PBDITA numeric"
[1] "PBT numeric"
[1] "Cash profit numeric"
[1] "PBDITA as % of total income numeric"
[1] "PBT as % of total income numeric"
[1] "PAT as % of total income numeric"
[1] "Cash profit as % of total income numeric"
[1] "PAT as % of net worth numeric"
[1] "Sales numeric"
[1] "Income from financial services numeric"
[1] "Other income numeric"
[1] "Total capital numeric"
[1] "Reserves and funds numeric"
[1] "Deposits (accepted by commercial banks) logical"
[1] "Borrowings numeric"
[1] "Current liabilities & provisions numeric"
[1] "Deferred tax liability numeric"
[1] "Shareholders funds numeric"
[1] "Cumulative retained profits numeric"
[1] "Capital employed numeric"
[1] "TOL/TNW numeric"
[1] "Total term liabilities / tangible net worth numeric"
[1] "Contingent liabilities / Net worth (%) numeric"
[1] "Contingent liabilities numeric"
[1] "Net fixed assets numeric"
[1] "Investments numeric"
[1] "Current assets numeric"
[1] "Net working capital numeric"
[1] "Quick ratio (times) numeric"
[1] "Current ratio (times) numeric"
[1] "Debt to equity ratio (times) numeric"
[1] "Cash to current liabilities (times) numeric"
[1] "Cash to average cost of sales per day numeric"
[1] "Creditors turnover character"
[1] "Debtors turnover character"
[1] "Finished goods turnover character"
[1] "WIP turnover character"
[1] "Raw material turnover character"
[1] "Shares outstanding character"
[1] "Equity face value character"
[1] "EPS numeric"
[1] "Adjusted EPS numeric"
[1] "Total liabilities numeric"
[​1] "PE on BSE character"

This plot shows that the testing dataset has 7% missing observations and 1.9%
missing columns.
> plot_intro(newtest)
In the following code, the variables of type character are converted to
numeric, and the missing observations in each column are replaced with the
median of that column.
> for(i in 1:ncol(newtest)){
+ newtest[,i] <- as.numeric(newtest[,i])
+ newtest[is.na(newtest[,i]), i] <- median(newtest[,i], na.rm = TRUE)
+}
The fully missing column ('Deposits (accepted by commercial banks)') is then removed from the dataset.
>newtest <- newtest[,-22]
This plot shows that the dataset does not contain any missing observations or
missing columns.
>plot_intro(newtest)
3.3. Outlier treatment
The outliers in the dataset are treated by replacing observations below the
1st percentile with the value of the 1st percentile and observations above
the 99th percentile with the value of the 99th percentile. This treatment is
applied to every column in the dataset.

The quantile function computes the 1st and 99th percentile bounds, and the
squish function (from the scales package) caps the identified outliers at
these bounds.
>for(i in 2:ncol(newtrain)){
q <- quantile(newtrain[,i], c(0.01, 0.99))
newtrain[,i] <- squish(newtrain[,i], q)
}
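A toy illustration of the capping behaviour (hypothetical values, not project data; assumes the scales library is loaded):

x <- c(-100, 1, 5, 9, 1000)
q <- quantile(x, c(0.01, 0.99))
squish(x, q)  # -100 and 1000 are pulled in to the 1st and 99th percentile bounds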

Redundant variables ('Num' and 'Networth Next Year' in the training dataset, 'Num' in the testing dataset) are removed.
>newtrain <- newtrain[,-c(1,2)]
>newtest <- newtest[,-1]
3.4. Univariate and Multivariate Analysis
The variables are explored further using univariate and multivariate
analysis.

> plot_str(newtrain)

> plot_intro(newtrain)
> plot_missing(newtrain)
> plot_histogram(newtrain)
> plot_qq(newtrain)
> plot_bar(newtrain)
> plot_correlation(newtrain)

3.5. Variable Creation


New variables are created as required by the problem statement: one ratio
each for profitability, liquidity and leverage.

The profitability ratio is derived by dividing Profit after tax by Sales. A
price per share variable is also created as EPS multiplied by PE on BSE.

> newtrain$Profitability <- newtrain$`Profit after tax`/newtrain$Sales
> newtrain$PriceperShare <- newtrain$EPS*newtrain$`PE on BSE`

The liquidity ratio is derived by dividing Net Working Capital by Total Assets
> newtrain$NWC2TA <- newtrain$`Net working capital`/newtrain$`Total assets`

> newtrain$TotalEquity <- newtrain$`Total liabilities`/newtrain$`Debt to equity ratio (times)`

The leverage ratio (equity multiplier) is derived by dividing Total assets by TotalEquity.
> newtrain$EquityMultiplier <- newtrain$`Total assets`/newtrain$TotalEquity

Other ratios are also created by dividing several variables by Total assets;
their contribution to the model is assessed later.
> newtrain$Networth2Totalassets <- newtrain$`Net worth`/newtrain$`Total assets`
> newtrain$Totalincome2Totalassets <- newtrain$`Total income`/newtrain$`Total assets`
> newtrain$Totalexpenses2Totalassets <- newtrain$`Total expenses`/newtrain$`Total assets`
> newtrain$Profitaftertax2Totalassets <- newtrain$`Profit after tax`/newtrain$`Total assets`
> newtrain$PBT2Totalassets <- newtrain$PBT/newtrain$`Total assets`
> newtrain$Sales2Totalassets <- newtrain$Sales/newtrain$`Total assets`
> newtrain$Currentliabilitiesprovisions2Totalassets <- newtrain$`Current liabilities & provisions`/newtrain$`Total assets`
> newtrain$Capitalemployed2Totalassets <- newtrain$`Capital employed`/newtrain$`Total assets`
> newtrain$Netfixedassets2Totalassets <- newtrain$`Net fixed assets`/newtrain$`Total assets`
> newtrain$Investments2Totalassets <- newtrain$Investments/newtrain$`Total assets`
> newtrain$Totalliabilities2Totalassets <- newtrain$`Total liabilities`/newtrain$`Total assets`

Similar variables are created for the holdout dataset as well.

> newtest$Profitability <- newtest$`Profit after tax`/newtest$Sales
> newtest$PriceperShare <- newtest$EPS*newtest$`PE on BSE`
> #Liquidity
> newtest$NWC2TA <- newtest$`Net working capital`/newtest$`Total assets`
> #Leverage
> newtest$TotalEquity <- newtest$`Total liabilities`/newtest$`Debt to equity ratio (times)`
> newtest[is.infinite(newtest[,54]), 54] <- 0
> newtest$EquityMultiplier <- newtest$`Total assets`/newtest$TotalEquity
> newtest[is.infinite(newtest[,55]), 55] <- 0

A zero debt-to-equity ratio makes the division infinite, so infinite values in the two new columns (54 and 55) are set to 0.

> newtest$Networth2Totalassets <- newtest$`Net worth`/newtest$`Total assets`
> newtest$Totalincome2Totalassets <- newtest$`Total income`/newtest$`Total assets`
> newtest$Totalexpenses2Totalassets <- newtest$`Total expenses`/newtest$`Total assets`
> newtest$Profitaftertax2Totalassets <- newtest$`Profit after tax`/newtest$`Total assets`
> newtest$PBT2Totalassets <- newtest$PBT/newtest$`Total assets`
> newtest$Sales2Totalassets <- newtest$Sales/newtest$`Total assets`
> newtest$Currentliabilitiesprovisions2Totalassets <- newtest$`Current liabilities & provisions`/newtest$`Total assets`
> newtest$Capitalemployed2Totalassets <- newtest$`Capital employed`/newtest$`Total assets`
> newtest$Netfixedassets2Totalassets <- newtest$`Net fixed assets`/newtest$`Total assets`
> newtest$Investments2Totalassets <- newtest$Investments/newtest$`Total assets`
> newtest$Totalliabilities2Totalassets <- newtest$`Total liabilities`/newtest$`Total assets`
4. Modelling
4.1. Logistic Regression
The Logistic regression model is used for this dataset. Initially, all the variables
are used as the predictors with the Default variable as the response variable.
> trainLOGIT <- glm(Default~., data = newtrain, family = binomial)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(trainLOGIT)
The result of the logistic regression shows that a few variables are
significant and contribute most to the model.

The most important variables identified from the first logistic regression
model are then used as predictors in a second model, again with the Default
variable as the response.

>trainLOGIT <- glm(Default ~ `Total assets` + `Total income` + `Change in stock` +
+ `Total expenses` + `Profit after tax` + PBDITA + `Cash profit` +
+ `PBDITA as % of total income` + `PBT as % of total income` +
+ `PAT as % of total income` + `Cash profit as % of total income` +
+ `PAT as % of net worth` + `Total capital` + `Reserves and funds` + Borrowings +
+ `Current liabilities & provisions` + `Capital employed` +
+ `Total term liabilities / tangible net worth` + `Contingent liabilities` +
+ `Current ratio (times)` + Investments + `Finished goods turnover` + `TOL/TNW` +
+ `PE on BSE` + `Net fixed assets` + `Debt to equity ratio (times)` +
+ `Cash to average cost of sales per day` + PriceperShare + NWC2TA +
+ Networth2Totalassets + Sales2Totalassets + Capitalemployed2Totalassets +
+ Investments2Totalassets, data = newtrain, family = binomial)
>summary(trainLOGIT)
4.2. Analysis
The model has an AIC value of 875.4 and predicts both the training and
testing datasets with roughly 95% accuracy (shown later in the document).
A few of the most important variables are Total assets, Cash profit, PAT as
% of net worth, Reserves and funds, Current liabilities & provisions, Capital
employed, Net working capital/Total assets and Net worth/Total assets. These
variables have very small Pr(>|z|) values.

Among the most important variables, those with positive estimates include
Total assets, Current ratio and Sales/Total assets, while those with negative
estimates include Cash profit, PAT as % of net worth, Current liabilities &
provisions and Capital employed.
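A quick way to verify these observations from the fitted model (a sketch built on the summary output of trainLOGIT):

coefs <- summary(trainLOGIT)$coefficients
# sort by p-value and list the ten most significant terms with their signs
head(coefs[order(coefs[, "Pr(>|z|)"]), c("Estimate", "Pr(>|z|)")], 10)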

5. Model Performance and Measure


5.1. Model performance on Train and Test data
The fitted logistic regression model is first used to predict on the training
dataset; a probability threshold of 0.5 converts the predicted probabilities
into classes.
> PredLOGIT <- predict.glm(trainLOGIT, newdata=newtrain, type="response")
> tab.logit<-confusion.matrix(newtrain$Default,PredLOGIT,threshold = 0.5)
> tab.logit
obs
pred 0 1
0 3227 107
1 35 109
attr(,"class")
[1] "confusion.matrix"

The confusion matrix shows 35 Type I errors (false positives) and 107 Type II errors (false negatives).
> accuracy.logit<-sum(diag(tab.logit))/sum(tab.logit)
> accuracy.logit
[1] 0.9591719

The accuracy of the model is (3227 + 109) / 3478 ≈ 95.92%.


> roc.logit<-roc(newtrain$Default,PredLOGIT )
Setting levels: control = 0, case = 1
Setting direction: controls < cases
> roc.logit

Call:
roc.default(response = newtrain$Default, predictor = PredLOGIT)

Data: PredLOGIT in 3262 controls (newtrain$Default 0) < 216 cases (newtrain$Default 1).
Area under the curve: 0.9423
> plot(roc.logit)

The same logistic regression model is then used to predict on the holdout (testing) dataset.
> PredLOGIT <- predict.glm(trainLOGIT, newdata=newtest, type="response")
> tab.logit<-confusion.matrix(newtest$`Default - 1`,PredLOGIT,threshold = 0.5)
> tab.logit
obs
pred 0 1
0 639 20
1 22 34
attr(,"class")
[1] "confusion.matrix"

The confusion matrix shows 22 Type I errors (false positives) and 20 Type II errors (false negatives).
> accuracy.logit<-sum(diag(tab.logit))/sum(tab.logit)
> accuracy.logit
[1] 0.9412587
The accuracy of the model is (639 + 34) / 715 ≈ 94.13%.
> roc.logit<-roc(newtest$`Default - 1`,PredLOGIT)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
> roc.logit
Call:
roc.default(response = newtest$`Default - 1`, predictor = PredLOGIT)

Data: PredLOGIT in 661 controls (newtest$`Default - 1` 0) < 54 cases (newtest$`Default - 1` 1).
Area under the curve: 0.941
> plot(roc.logit)

5.2. Deciling
The training dataset is then divided into 10 deciles based on the probability of default.

> newtrain$pred = predict(trainLOGIT, newtrain, type="response")

> decile <- function(x)
+ {
+ deciles <- vector(length=10)
+ for (i in seq(0.1,1,.1))
+ {
+ deciles[i*10] <- quantile(x, i, na.rm=T)
+ }
+ return (
+ ifelse(x<deciles[1], 1,
+ ifelse(x<deciles[2], 2,
+ ifelse(x<deciles[3], 3,
+ ifelse(x<deciles[4], 4,
+ ifelse(x<deciles[5], 5,
+ ifelse(x<deciles[6], 6,
+ ifelse(x<deciles[7], 7,
+ ifelse(x<deciles[8], 8,
+ ifelse(x<deciles[9], 9, 10
+ ))))))))))
+}
> newtrain$deciles <- decile(newtrain$pred)

> tmp_DT = data.table(newtrain)


After the deciles are created, they are then ranked.
> rank <- tmp_DT[, list(cnt=length(Default),
+ cnt_resp=sum(Default==1),
+ cnt_non_resp=sum(Default==0)
+ ), by=deciles][order(-deciles)]

> rank$rrate <- round(rank$cnt_resp / rank$cnt,4);


> rank$cum_resp <- cumsum(rank$cnt_resp)
> rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
> rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),4);
> rank$cum_rel_non_resp <- round(rank$cum_non_resp / sum(rank$cnt_non_resp),4);
> rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100;
> rank$rrate <- percent(rank$rrate)
> rank$cum_rel_resp <- percent(rank$cum_rel_resp)
> rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
> newtrainRank <- rank

> View(rank)

The decile ranks produced by View(rank) are sorted in descending order. The
10th decile (highest predicted probability of default) has the maximum number
of defaults, captured by cnt_resp; the ks column measures the separation
between the cumulative response and non-response rates.

The testing dataset is then divided into 10 deciles based on the probability of default.
> newtest$pred = predict(trainLOGIT, newtest, type="response")

> decile <- function(x)
+ {
+ deciles <- vector(length=10)
+ for (i in seq(0.1,1,.1))
+ {
+ deciles[i*10] <- quantile(x, i, na.rm=T)
+ }
+ return (
+ ifelse(x<deciles[1], 1,
+ ifelse(x<deciles[2], 2,
+ ifelse(x<deciles[3], 3,
+ ifelse(x<deciles[4], 4,
+ ifelse(x<deciles[5], 5,
+ ifelse(x<deciles[6], 6,
+ ifelse(x<deciles[7], 7,
+ ifelse(x<deciles[8], 8,
+ ifelse(x<deciles[9], 9, 10
+ ))))))))))
+}
> newtest$deciles <- decile(newtest$pred)

> tmp_DT = data.table(newtest)


The deciles are then ranked.
> rank <- tmp_DT[, list(cnt=length(`Default - 1`),
+ cnt_resp=sum(`Default - 1`==1),
+ cnt_non_resp=sum(`Default - 1`==0)
+ ), by=deciles][order(-deciles)]

> rank$rrate <- round(rank$cnt_resp / rank$cnt,4);


> rank$cum_resp <- cumsum(rank$cnt_resp)
> rank$cum_non_resp <- cumsum(rank$cnt_non_resp)
> rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),4);
> rank$cum_rel_non_resp <- round(rank$cum_non_resp / sum(rank$cnt_non_resp),4);
> rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100;
> rank$rrate <- percent(rank$rrate)
> rank$cum_rel_resp <- percent(rank$cum_rel_resp)
> rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)
> newtestRank<-rank

> View(rank)
As before, the decile ranks are sorted in descending order, and the 10th
decile has the maximum number of defaults (cnt_resp).

The mean observed default rate and the mean predicted probability are
computed per decile rank for both the training and testing datasets, so that
predicted and observed values can be compared.
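The per-observation rank used in the aggregate calls below is created by cutting the predicted probabilities into ten quantile bins (condensed from the 'Decile Comparison' chunk in the source code):

> cut_ptrain = with(newtrain, cut(pred, breaks = quantile(pred, prob=seq(0,1,0.1)), include.lowest = T))
> newtrain$rank = factor(cut_ptrain, labels = 1:10)
> cut_ptest = with(newtest, cut(pred, breaks = quantile(pred, prob=seq(0,1,0.1)), include.lowest = T))
> newtest$rank = factor(cut_ptest, labels = 1:10)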
> mean.obs.train = aggregate(Default ~ rank, data = newtrain, mean)
> mean.pred.train = aggregate(pred ~ rank, data = newtrain, mean)

> mean.obs.val = aggregate(`Default - 1` ~ rank, data = newtest, mean)
> mean.pred.val = aggregate(pred ~ rank, data = newtest, mean)

# plot the mean vs deciles
> par(mfrow=c(1,2))
> plot(mean.obs.train[,2], type="b", col="black", ylim=c(0,0.8), xlab="Decile", ylab="Prob")
> lines(mean.pred.train[,2], type="b", col="red", lty=2)
> title(main="Training Sample")

> plot(mean.obs.val[,2], type="b", col="black", ylim=c(0,0.8), xlab="Decile", ylab="Prob")
> lines(mean.pred.val[,2], type="b", col="red", lty=2)
> title(main="Validation Sample")
The plots show that the mean predicted probabilities closely track the
observed default rates across deciles in both the training and validation
samples, consistent with the roughly 95% accuracy reported earlier.

6. Source Code

---

title: "Untitled"

output: html_document

---

```{r setup, include=FALSE}

knitr::opts_chunk$set(echo = TRUE)

```


```{r}

install.packages("DataExplorer")

library(DataExplorer)

library(readxl)

train <- read_excel("C:/Users/Janani Prakash/Desktop/R/raw.xlsx")

test <- read_excel("C:/Users/Janani Prakash/Desktop/R/valid.xlsx")

dim(train)

names(train)

head(train)

dim(test)

names(test)
head(test)

```

```{r Missing value and Outlier Treatment}

newtrain <- train

newtest <- test

newtrain$Default <- ifelse(newtrain$'Networth Next Year' < 0 ,1,0)

##remove very small companies

newtrain <- newtrain[!newtrain$`Total assets` <= 3, ]

#summary

#summary(newtrain)

#replace missing values of train with median

newtrain<-as.data.frame(newtrain)

for(i in 1:length(newtrain)){

print(paste(colnames(newtrain[i]),class(newtrain[,i])))}

plot_intro(newtrain)

for(i in 1:ncol(newtrain)){

newtrain[,i] <- as.numeric(newtrain[,i])

newtrain[is.na(newtrain[,i]), i] <- median(newtrain[,i], na.rm = TRUE)

}

#replace missing values of test with median

newtest<-as.data.frame(newtest)

for(i in 1:length(newtest)){

print(paste(colnames(newtest[i]),class(newtest[,i])))}

plot_intro(newtest)

for(i in 1:ncol(newtest)){

newtest[,i] <- as.numeric(newtest[,i])

newtest[is.na(newtest[,i]), i] <- median(newtest[,i], na.rm = TRUE)

}

#removing missing column

newtrain <- newtrain[,-22]

newtest <- newtest[,-22]

plot_intro(newtest)

#identify outlier

boxplot(newtrain)

library(scales)

# cap outliers at the 1st and 99th percentiles, as described in section 3.3
for(i in 2:ncol(newtrain)){

q <- quantile(newtrain[,i], c(0.01, 0.99))

newtrain[,i] <- squish(newtrain[,i], q)

}

summary(newtrain)

newtrain <- newtrain[,-c(1,2)]

newtest <- newtest[,-1]

```

```{r Univariate and Bivariate for Train}

plot_str(newtrain)

plot_intro(newtrain)

plot_missing(newtrain)

plot_histogram(newtrain)

plot_density(newtrain)

plot_qq(newtrain)

plot_bar(newtrain)

plot_correlation(newtrain)
```

```{r Univariate and Bivariate for Test}

plot_str(newtest)

plot_intro(newtest)

plot_missing(newtest)

plot_histogram(newtest)

plot_density(newtest)

plot_qq(newtest)

plot_bar(newtest)

plot_correlation(newtest)

```

```{r new variables}

newtrain$Profitability <- newtrain$`Profit after tax`/newtrain$Sales

newtrain$PriceperShare <- newtrain$EPS*newtrain$`PE on BSE`

#Liquidity

newtrain$NWC2TA <- newtrain$`Net working capital`/newtrain$`Total assets`

#leverage

newtrain$TotalEquity <- newtrain$`Total liabilities`/newtrain$`Debt to equity ratio (times)`

newtrain[is.infinite(newtrain[,54]), 54] <- 0

newtrain$EquityMultiplier <- newtrain$`Total assets`/newtrain$TotalEquity

newtrain[is.infinite(newtrain[,55]), 55] <- 0

newtrain$Networth2Totalassets <- newtrain$`Net worth`/newtrain$`Total assets`

newtrain$Totalincome2Totalassets<- newtrain$`Total income`/newtrain$`Total assets`

newtrain$Totalexpenses2Totalassets <-newtrain$`Total expenses`/newtrain$`Total assets`

newtrain$Profitaftertax2Totalassets <-newtrain$`Profit after tax`/newtrain$`Total assets`

newtrain$PBT2Totalassets <-newtrain$PBT/newtrain$`Total assets`


newtrain$Sales2Totalassets <-newtrain$Sales/newtrain$`Total assets`

newtrain$Currentliabilitiesprovisions2Totalassets <- newtrain$`Current liabilities & provisions`/newtrain$`Total assets`

newtrain$Capitalemployed2Totalassets <-newtrain$`Capital employed`/newtrain$`Total assets`

newtrain$Netfixedassets2Totalassets <-newtrain$`Net fixed assets`/newtrain$`Total assets`

newtrain$Investments2Totalassets <-newtrain$Investments/newtrain$`Total assets`

newtrain$Totalliabilities2Totalassets <-newtrain$`Total liabilities`/newtrain$`Total assets`

newtest$Profitability <- newtest$`Profit after tax`/newtest$Sales

newtest$PriceperShare <- newtest$EPS*newtest$`PE on BSE`

#Liquidity

newtest$NWC2TA <- newtest$`Net working capital`/newtest$`Total assets`

#leverage

newtest$TotalEquity <- newtest$`Total liabilities`/newtest$`Debt to equity ratio (times)`

newtest[is.infinite(newtest[,54]), 54] <- 0

newtest$EquityMultiplier <- newtest$`Total assets`/newtest$TotalEquity

newtest[is.infinite(newtest[,55]), 55] <- 0

newtest$Networth2Totalassets <- newtest$`Net worth`/newtest$`Total assets`

newtest$Totalincome2Totalassets<- newtest$`Total income`/newtest$`Total assets`

newtest$Totalexpenses2Totalassets <-newtest$`Total expenses`/newtest$`Total assets`

newtest$Profitaftertax2Totalassets <-newtest$`Profit after tax`/newtest$`Total assets`

newtest$PBT2Totalassets <-newtest$PBT/newtest$`Total assets`

newtest$Sales2Totalassets <-newtest$Sales/newtest$`Total assets`

newtest$Currentliabilitiesprovisions2Totalassets <- newtest$`Current liabilities & provisions`/newtest$`Total assets`

newtest$Capitalemployed2Totalassets <-newtest$`Capital employed`/newtest$`Total assets`

newtest$Netfixedassets2Totalassets <-newtest$`Net fixed assets`/newtest$`Total assets`

newtest$Investments2Totalassets <-newtest$Investments/newtest$`Total assets`

newtest$Totalliabilities2Totalassets <-newtest$`Total liabilities`/newtest$`Total assets`

```
```{r multicollinearity}

#vif(newtrain)

#for(i in 1:length(newtrain)){

## print(paste(colnames(newtrain[i]),class(newtrain[,i])))}
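# Note: vif() applies to a fitted model rather than a raw data frame.
# A working check would be (assumes the car package, which is not loaded above):
# car::vif(trainLOGIT)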

```

```{r Logistic Regression}

trainLOGIT<- glm(Default~.-Profitability,data = newtrain, family=binomial)

summary(trainLOGIT)

trainLOGIT <- glm(Default ~ `Total assets` + `Total income` + `Change in stock` +
  `Total expenses` + `Profit after tax` + PBDITA + `Cash profit` +
  `PBDITA as % of total income` + `PBT as % of total income` +
  `PAT as % of total income` + `Cash profit as % of total income` +
  `PAT as % of net worth` + `Total capital` + `Reserves and funds` + Borrowings +
  `Current liabilities & provisions` + `Capital employed` +
  `Total term liabilities / tangible net worth` + `Contingent liabilities` +
  `Current ratio (times)` + Investments + `Finished goods turnover` + `TOL/TNW` +
  `PE on BSE` + `Net fixed assets` + `Debt to equity ratio (times)` +
  `Cash to average cost of sales per day` + PriceperShare + NWC2TA +
  Networth2Totalassets + Sales2Totalassets + Capitalemployed2Totalassets +
  Investments2Totalassets, data = newtrain, family = binomial)

summary(trainLOGIT)

```

```{r Model validation Train and Test}

##install.packages("SDMTools")

library(SDMTools)

##install.packages("pROC")

library(pROC)

PredLOGIT <- predict.glm(trainLOGIT, newdata=newtrain, type="response")

tab.logit<-confusion.matrix(newtrain$Default,PredLOGIT,threshold = 0.5)

tab.logit

accuracy.logit<-sum(diag(tab.logit))/sum(tab.logit)
accuracy.logit

roc.logit<-roc(newtrain$Default,PredLOGIT )

roc.logit

plot(roc.logit)

PredLOGIT <- predict.glm(trainLOGIT, newdata=newtest, type="response")

tab.logit<-confusion.matrix(newtest$`Default - 1`,PredLOGIT,threshold = 0.5)

tab.logit

accuracy.logit<-sum(diag(tab.logit))/sum(tab.logit)

accuracy.logit

roc.logit<-roc(newtest$`Default - 1`,PredLOGIT )

roc.logit

plot(roc.logit)

```

```{r Train data Deciling}

newtrain$pred = predict(trainLOGIT, newtrain, type="response")

# assign each observation to a decile based on its predicted probability
decile <- function(x){
  deciles <- vector(length=10)
  for (i in seq(0.1,1,.1)){
    deciles[i*10] <- quantile(x, i, na.rm=T)
  }
  return(
    ifelse(x<deciles[1], 1,
    ifelse(x<deciles[2], 2,
    ifelse(x<deciles[3], 3,
    ifelse(x<deciles[4], 4,
    ifelse(x<deciles[5], 5,
    ifelse(x<deciles[6], 6,
    ifelse(x<deciles[7], 7,
    ifelse(x<deciles[8], 8,
    ifelse(x<deciles[9], 9, 10))))))))))
}

newtrain$deciles <- decile(newtrain$pred)

tmp_DT = data.table(newtrain)

rank <- tmp_DT[, list(cnt=length(Default),

cnt_resp=sum(Default==1),

cnt_non_resp=sum(Default==0)

), by=deciles][order(-deciles)]

rank$rrate <- round(rank$cnt_resp / rank$cnt,4);

rank$cum_resp <- cumsum(rank$cnt_resp)

rank$cum_non_resp <- cumsum(rank$cnt_non_resp)

rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),4);

rank$cum_rel_non_resp <- round(rank$cum_non_resp / sum(rank$cnt_non_resp),4);

rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100;

rank$rrate <- percent(rank$rrate)

rank$cum_rel_resp <- percent(rank$cum_rel_resp)

rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)

newtrainRank <- rank

View(rank)

```

```{r Test data Deciling}

newtest$pred = predict(trainLOGIT, newtest, type="response")

decile <- function(x){
  deciles <- vector(length=10)
  for (i in seq(0.1,1,.1)){
    deciles[i*10] <- quantile(x, i, na.rm=T)
  }
  return(
    ifelse(x<deciles[1], 1,
    ifelse(x<deciles[2], 2,
    ifelse(x<deciles[3], 3,
    ifelse(x<deciles[4], 4,
    ifelse(x<deciles[5], 5,
    ifelse(x<deciles[6], 6,
    ifelse(x<deciles[7], 7,
    ifelse(x<deciles[8], 8,
    ifelse(x<deciles[9], 9, 10))))))))))
}

newtest$deciles <- decile(newtest$pred)

tmp_DT = data.table(newtest)

rank <- tmp_DT[, list(cnt=length(`Default - 1`),

cnt_resp=sum(`Default - 1`==1),

cnt_non_resp=sum(`Default - 1`==0)

), by=deciles][order(-deciles)]

rank$rrate <- round(rank$cnt_resp / rank$cnt,4);

rank$cum_resp <- cumsum(rank$cnt_resp)

rank$cum_non_resp <- cumsum(rank$cnt_non_resp)

rank$cum_rel_resp <- round(rank$cum_resp / sum(rank$cnt_resp),4);

rank$cum_rel_non_resp <- round(rank$cum_non_resp / sum(rank$cnt_non_resp),4);

rank$ks <- abs(rank$cum_rel_resp - rank$cum_rel_non_resp) * 100;

rank$rrate <- percent(rank$rrate)


rank$cum_rel_resp <- percent(rank$cum_rel_resp)

rank$cum_rel_non_resp <- percent(rank$cum_rel_non_resp)

newtestRank<-rank

View(rank)

```

```{r Decile Comparison}

# cut_p returns the cut interval for each observation

cut_ptrain = with(newtrain,

cut(pred, breaks = quantile(pred, prob=seq(0,1,0.1)), include.lowest = T))

cut_ptest = with(newtest,

cut(pred, breaks = quantile(pred, prob=seq(0,1,0.1)), include.lowest = T))

levels(cut_ptrain)

levels(cut_ptest)

newtrain$rank = factor(cut_ptrain, labels = 1:10)

newtest$rank = factor(cut_ptest, labels = 1:10)

# get aggregated data

mean.obs.train = aggregate(Default ~ rank, data = newtrain, mean)

mean.pred.train = aggregate(pred ~ rank, data = newtrain, mean)

mean.obs.val = aggregate( `Default - 1`~ rank, data = newtest, mean)

mean.pred.val = aggregate(pred ~ rank, data = newtest, mean)

# plot the mean vs deciles

par(mfrow=c(1,2))

plot(mean.obs.train[,2], type="b", col="black", ylim=c(0,0.8), xlab="Decile", ylab="Prob")

lines(mean.pred.train[,2], type="b", col="red", lty=2)


title(main="Training Sample")

plot(mean.obs.val[,2], type="b", col="black", ylim=c(0,0.8), xlab="Decile", ylab="Prob")

lines(mean.pred.val[,2], type="b", col="red", lty=2)

title(main="Validation Sample")

```
