Talking Points - Answer - 2017030

Robert Gordon University
CMM723 – Statistics for Business Analysis 2018
Student ID 2017030
Answer
Table of Contents
Introduction:
Data Analysis:
1. Recoding of Numerical variables into categorical variables:
2. Splitting the dataset into Training and Testing data set:
3. Summarization and association:
Summary:
3.1. Descriptive Statistics:
3.2. Frequency Distribution:
3.3. Histogram plots:
3.4. Box plots:
3.5. Correlation Coefficient:
4. Is there any difference between the creditability of female customers and male customers?
4.1 T-test:
4.2 Chi-Squared Test and Contingency Table:
4.3 Hypotheses:
5. Logistic Regression Model:
6. Assessment of validity of regression model and its variables of training data on testing data set:
7. Cost Profit Analysis:
Attachments
Internet References:

Introduction:
The data set describes the creditability of a sample of 1000 customers of the given bank. On the
basis of each customer's creditability and the other recorded parameters, the bank decides whether
or not to extend credit facilities.
The cost-profit analysis is carried out with the help of this data. Most of the variables are
categorical, either nominal or ordinal: nominal variables are used to “name” or label a series of
values, while ordinal scales also provide information about the order of the values. For the
purpose of the analysis, I also transformed the numerical variables into categorical variables.
The JASP software tool (version 0.8.6.0) is used for this analysis. However, I used MS Excel (the
format in which the data set is given) for those calculations that could not be carried out in
JASP.
Data Analysis:
1. Recoding of Numerical variables into categorical variables:

Firstly, the three numerical variables of the data set are recoded into categorical variables. Credit
amount is transformed into a categorical variable (credit amount < $2000 = “1”, $2000 <= credit
amount < $4000 = “2”, $4000 <= credit amount < $6000 = “3”, credit amount >= $6000 = “4”). Age is
transformed into an age group (age < 25 = “1”, 25 <= age < 50 = “2”, 50 <= age <= 75 = “3”). Duration
of credit in months is transformed in the same way (months < 20 = “1”, 20 <= months < 40 = “2”,
40 <= months < 60 = “3”, 60 <= months < 80 = “4”).
The recoding is not possible in the JASP software. Therefore, I used MS Excel.
Refer 01.dataset.xls
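The recoding rules above can also be written as a short script. The following is only a sketch in Python, not the Excel procedure used in this report; the function name `recode` and the constant names are hypothetical, and the cut points are the ones listed above. Values beyond the last stated bound (for example an age above 75) simply fall into the highest category.

```python
def recode(value, upper_bounds):
    """Return the 1-based category of value.

    upper_bounds is an ascending list of exclusive upper limits;
    values at or above the last limit fall into the final category.
    """
    for level, bound in enumerate(upper_bounds, start=1):
        if value < bound:
            return level
    return len(upper_bounds) + 1


# Cut points taken from the recoding rules in this section.
CREDIT_AMOUNT_BOUNDS = [2000, 4000, 6000]  # categories 1 to 4
AGE_BOUNDS = [25, 50]                      # categories 1 to 3
DURATION_BOUNDS = [20, 40, 60]             # categories 1 to 4

# Hypothetical customer: $2500 credit, 25 years old, 18 months duration.
example = (
    recode(2500, CREDIT_AMOUNT_BOUNDS),
    recode(25, AGE_BOUNDS),
    recode(18, DURATION_BOUNDS),
)
print(example)
```

Using exclusive upper limits keeps the boundaries unambiguous: a value exactly on a cut point always moves to the higher category.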
2. Splitting the dataset into Training and Testing data set:

Using random sampling in MS Excel, I drew 80% of the samples (800) for the training data set and
the remaining 20% (200) for the testing data set. This operation is also not possible in the JASP
tool. Therefore, MS Excel was used in this case as well.
Refer 02A. testing dataset.xls in MS Excel and 02B.testing dataset. JASP file
Refer 03B. training dataset.xls in MS Excel and 03B.training dataset. JASP file
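For illustration, the same 80/20 random split could be sketched in Python as follows. This is not the Excel procedure used here; the function name `split_rows`, the seed value and the use of row numbers as stand-ins for the records are assumptions made for the example.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=2018):
    """Shuffle a copy of rows and cut it into training and testing parts."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Row numbers 1..1000 stand in for the 1000 records of the data set.
rows = list(range(1, 1001))
train, test = split_rows(rows)
print(len(train), len(test))
```

Every record lands in exactly one of the two sets, so the testing data stays unseen while the model is fitted.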
3. Summarization and association:
Summary:
Refer 03B. training dataset JASP file
In the training data, the majority of the respondents (approx. 70%, n = 554) were categorized as
credit worthy, while more than a quarter (approx. 30%, n = 246) were categorized as not credit
worthy.
Looking at the payment status of the previous credit in the table below, I observe that the
majority (52.5%, n = 420) had no previous credits or had paid back all previous credits. However,
5% (n = 40) had a problematic running account, while 4% (n = 32) were hesitant in paying their
previous credits.
Frequencies for Payment Status of Previous Credit

Payment Status of Previous Credit                      Frequency   Percent   Valid Percent   Cumulative Percent
Hesitant payment of previous credits                          32     4.000           4.000                4.000
Problematic running account                                   40     5.000           5.000                9.000
No previous credits / paid back all previous credits         420    52.500          52.500               61.500
No problems with current credits at this bank                 73     9.125           9.125               70.625
Paid back previous credits at this bank                      235    29.375          29.375              100.000
Total                                                        800   100.000
Further narrative discussion of the Descriptive Statistics, Frequencies, graphical presentation of
associations and Pearson’s Correlation is given in the following paragraphs.
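The percent and cumulative percent columns of the frequency table for Payment Status of Previous Credit follow directly from the counts. A minimal Python sketch, using the five frequencies reported in that table (the variable names are illustrative only):

```python
from itertools import accumulate

# Frequencies reported for Payment Status of Previous Credit (training data).
counts = {
    "Hesitant payment of previous credits": 32,
    "Problematic running account": 40,
    "No previous credits / paid back all previous credits": 420,
    "No problems with current credits at this bank": 73,
    "Paid back previous credits at this bank": 235,
}

total = sum(counts.values())
percents = {label: 100 * n / total for label, n in counts.items()}
cumulative = list(accumulate(percents.values()))  # running sum of the percents

for (label, pct), cum in zip(percents.items(), cumulative):
    print(f"{label:<55}{counts[label]:>5}{pct:>9.3f}{cum:>9.3f}")
```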
3.1. Descriptive Statistics:
For the training data, the mean values of Creditability, Account Balance, Payment Status of
Previous Credit and Purpose are 0.693, 2.539, 2.556 and 2.804 respectively. The standard deviations
of these categorical variables are 0.462, 1.252, 1.097 and 2.749 respectively. Creditability,
Account Balance, Payment Status of Previous Credit and Purpose range from 0 to 1, 1 to 4, 0 to 4
and 0 to 10 respectively.
The mean values of Value Savings/Stocks, Length of current employment, Instalment per cent and
Sex & Marital Status are 2.07, 3.395, 2.958 and 2.672 respectively. The standard deviations of
these categorical variables are 1.562, 1.207, 1.113 and 0.713 respectively. Value Savings/Stocks,
Length of current employment, Instalment per cent and Sex & Marital Status range from 1 to 5,
1 to 5, 1 to 4 and 1 to 4 respectively.
The mean values of Guarantors, Duration in Current Address, Most valuable available asset,
Concurrent Credits and Type of apartment are 1.144, 2.886, 2.377, 2.674 and 1.921 respectively.
The standard deviations of these categorical variables are 0.475, 1.093, 1.062, 0.706 and 0.539
respectively. Guarantors, Duration in Current Address, Most valuable available asset, Concurrent
Credits and Type of apartment range from 1 to 3, 1 to 4, 1 to 4, 1 to 3 and 1 to 3 respectively.
The mean values of Number of Credits at this bank, Occupation, Number of dependents, Telephone
and Foreign Worker are 1.413, 2.875, 1.156, 1.389 and 1.036 respectively. The standard deviations
of these five categorical variables are 0.583, 0.661, 0.363, 0.448 and 0.187 respectively. Number
of Credits at this bank, Occupation, Number of dependents, Telephone and Foreign Worker range
from 1 to 4, 1 to 4, 1 to 2, 1 to 2 and 1 to 2 respectively.
The mean values of Duration of monthly credit, Amount of Credit and Age group are 1.524, 1.964 and
1.915 respectively. The standard deviations of these categorical variables are 0.638, 1.059 and
0.548 respectively. Duration of monthly credit, Amount of Credit and Age group range from 1 to 3,
1 to 4 and 1 to 3 respectively.
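As a worked check of these figures, the mean and standard deviation of Creditability can be recomputed from the counts given in the Summary (554 credit worthy, 246 not credit worthy). The Python sketch below assumes the sample standard deviation (divisor n - 1); for a 0/1 variable the mean is simply the proportion coded 1.

```python
import math

# Creditability in the training data: 1 = credit worthy, 0 = not credit worthy.
creditability = [1] * 554 + [0] * 246

n = len(creditability)
mean = sum(creditability) / n
# Sample variance with divisor n - 1 (assumed here).
variance = sum((x - mean) ** 2 for x in creditability) / (n - 1)
std_dev = math.sqrt(variance)

print(f"mean = {mean:.4f}, standard deviation = {std_dev:.3f}")
```

The result agrees with the reported mean of 0.693 and standard deviation of 0.462 after rounding.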
3.2. Frequency Distribution:
Out of 800 people, 30.75% are not credit worthy and the remaining 69.25% are credit worthy.
The largest share, 38%, have a balance of $200 or more or have held a checking account for at
least 1 year, followed by 28.375% who have no balance or debit and 27.875% who have no running
account.
The largest share, 52%, have no previous or pending credits, followed by 30.125% who paid back
previous credits at this bank. Only 4.25% and 4.875% show hesitant payment of previous credits
and a problematic running account respectively.
The largest share, 27.875%, need credit for purchasing items and furniture, followed by 24.375%
who require credit for other purposes. It is notable that the percentage of people who need credit
for purchasing used cars (9.875%) is also considerable.
Among the 800 people, most (61.375%) have no available savings, followed by 17.625% who have more
than $1000 in savings.
Only 6.125% are currently unemployed. 34.25% have been employed for 1 to 4 years, followed by
25.625% who have been employed for more than 7 years.
46.5% of people have an instalment per cent below 20%.
53.875% of people are married or widowed males. Only 9.25% are females.
90.75% of people have no guarantor for their credit.
42.5% have lived at their current address for more than 7 years, followed by 29.625% who have
lived there for only 2 to 4 years.
More than 33% have a savings contract with a building society or life insurance.
A very high percentage, 81.25%, run no further credits.
More than 70% live in an owner-occupied flat, the most frequent category, and only 10.875% live in
a rented flat, the least frequent category.
More than 62% have only one credit at this bank and only 0.75% have six or more credits at this
bank.
Most of the people asking for credit at this bank are skilled workers, skilled employees or minor
civil servants. Only 2.625% are unemployed or unskilled with no permanent residence.
84.375% of people have 3 or more dependents.
61.125% of people use a telephone whereas 38.875% do not.
96.375% of people are foreign workers whereas 3.625% are not.
The duration of credit is less than 20 months for 55.5% of people, followed by at least 20 but
less than 40 months for 36.625%.
A credit amount of less than $2000 is the most frequent (42.875%), followed by a credit amount of
at least $2000 but less than $4000 (32.875%).
The majority of people, almost 70%, belong to the age group 25 to 50 years.
3.3. Histogram plots:

3.4. Box plots:

[Box plots: Age (years), Credit amount and Duration of Credit (month), with a guide to reading a boxplot]
3.5. Correlation Coefficient:
*** p < 0.001, **p < 0.01, * p < 0.05
Statistically significant associations with the response variable Creditability are found for the
following variables:
1) Account Balance (r = 0.358, p-value < 0.001): Moderate positive correlation
2) Payment Status of previous credits (r = 0.249, p-value < 0.001): Weak positive correlation
3) Value savings/stocks (r = 0.174, p-value < 0.001): Weak positive correlation
4) Length of current employment (r = 0.124, p-value < 0.001): Weak positive correlation
5) Most valuable available assets (r = -0.13, p-value < 0.001): Weak negative correlation
6) Concurrent Credits (r = 0.141, p-value < 0.001): Weak positive correlation
7) Duration of monthly Credits (r = -0.175, p-value < 0.001): Weak negative correlation
8) Credit amount (r = -0.112, p-value = 0.001): Weak negative correlation
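The significance of a Pearson correlation can be checked from r and the sample size alone, using t = r * sqrt((n - 2) / (1 - r^2)). The sketch below is illustrative only: it assumes n = 800 (the training data) and replaces the exact t-distribution by a normal approximation, which is reasonable for a sample of this size but will not reproduce the JASP p-values to the last digit.

```python
import math


def correlation_t(r, n):
    """t-statistic for testing a Pearson correlation r against zero."""
    return r * math.sqrt((n - 2) / (1 - r * r))


def two_sided_p_normal(t):
    """Two-sided p-value using a normal approximation to the t-distribution."""
    return math.erfc(abs(t) / math.sqrt(2))


n = 800  # assumed sample size (training data)
for name, r in [("Account Balance", 0.358), ("Credit amount", -0.112)]:
    t = correlation_t(r, n)
    print(f"{name}: t = {t:.2f}, approximate p = {two_sided_p_normal(t):.4f}")
```

With a sample this large even weak correlations such as r = -0.112 reach significance, which is why the strength labels (weak, moderate) matter alongside the p-values.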

4. Is there any difference between the creditability of female customers and male customers?
Refer 04B. dataset for t-test JASP file
4.1 T-test:
This test checks the equality of the means of a numerical variable (here, Creditability) across
the levels of a categorical variable (here, sex). For the calculation, I recoded levels 1, 2 and 3
of Sex & Marital Status to “Males” and level 4 to “Females”. This recoding is not possible in the
JASP software. Therefore, I carried it out in MS Excel.
The independent samples t-test gives a t-value of 0.62 with 998 degrees of freedom and a p-value of
0.535. The mean creditability is 0.728 for the 92 females and 0.697 for the 908 males.
4.2 Chi-Squared Test and Contingency Table:

I ran a Chi-Square test of association to check whether there is any significant association between
gender of the person and creditability of the person. This was tested at 5% level of significance. The
results are presented below.
Contingency Tables

                                                  Creditability
Sex & Marital Status                              Not credit-worthy   Credit-worthy     Total
Male- divorced / living apart    Count                         20.0            30.0      50.0
                                 % within row                 40.0%           60.0%    100.0%
Male- single                     Count                        109.0           201.0     310.0
                                 % within row                 35.2%           64.8%    100.0%
Male- married / widowed          Count                        146.0           402.0     548.0
                                 % within row                 26.6%           73.4%    100.0%
Female                           Count                         25.0            67.0      92.0
                                 % within row                 27.2%           72.8%    100.0%
Total                            Count                        300.0           700.0    1000.0
                                 % within row                 30.0%           70.0%    100.0%

Chi-Squared Tests

       Value    df    p
Χ²     9.605     3    0.022
N      1000
The p-value is 0.022, which is below the 5% level of significance. I therefore reject the null
hypothesis and conclude that there is a significant association between Sex & Marital Status and
the creditability of the person.
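The chi-squared statistic can be recomputed from the counts in the contingency table. The Python sketch below uses only those eight counts; the closed-form p-value is specific to 3 degrees of freedom, and the variable names are illustrative.

```python
import math

# Observed counts (not credit-worthy, credit-worthy) per Sex & Marital Status row.
observed = {
    "Male- divorced / living apart": (20, 30),
    "Male- single": (109, 201),
    "Male- married / widowed": (146, 402),
    "Female": (25, 67),
}

grand_total = sum(sum(row) for row in observed.values())
column_totals = [sum(row[j] for row in observed.values()) for j in range(2)]

chi2 = 0.0
for row in observed.values():
    row_total = sum(row)
    for j, count in enumerate(row):
        expected = row_total * column_totals[j] / grand_total
        chi2 += (count - expected) ** 2 / expected

df = (len(observed) - 1) * (2 - 1)
# Upper tail probability of the chi-squared distribution, valid for df = 3 only.
p_value = math.erfc(math.sqrt(chi2 / 2)) + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2)

print(f"chi-squared = {chi2:.3f}, df = {df}, p = {p_value:.3f}")
```

The largest contributions come from the two male groups with the lowest credit-worthy share (divorced / living apart, and single), not from the female row.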
4.3 Hypotheses:
Null hypothesis (H0): The difference between the mean creditability of males and females is 0.
Alternative hypothesis (H1): The mean creditability of males and females differs.
Test applied 1: Independent samples t-test assuming equal variances
Level of significance: 5%
Calculated degrees of freedom: 998
Calculated t-statistic: (0.620)
Two-tailed p-value: 0.535
Interpretation: 0.5352 > 0.05. Therefore, the null hypothesis is not rejected at the 5% level of
significance. There is no evidence at this level that the mean creditability of males and females
differs.
Decision Making: The mean creditability of males and females is not significantly different.
Test applied 2: Two-sample t-test assuming unequal variances
Level of significance: 5%
Calculated degrees of freedom: 111.4
Calculated t-statistic: (0.634)
Two-tailed p-value: 0.527
Interpretation: 0.5272 > 0.05. Therefore, the null hypothesis is not rejected at the 5% level of
significance. There is no evidence at this level that the mean creditability of males and females
differs.
Decision Making: The mean creditability of males and females is not significantly different.
Conclusion: According to both types of t-test, the mean creditability of males does not differ
significantly from that of females. The chi-squared test, which compares all four Sex & Marital
Status groups, does show a significant association: the share of credit-worthy customers among
females (72.8%) is higher than among men who are single (64.8%) or divorced / living apart
(60.0%).
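Both t-statistics can be reproduced from the contingency table counts alone, because Creditability is a 0/1 variable whose sample variance follows from its mean. The Python sketch below is an illustration under that assumption; the counts for males (633 credit-worthy out of 908) are the sum of the three male rows, and the helper name `binary_group` is hypothetical. The p-values are not computed, since they need the t-distribution.

```python
import math


def binary_group(successes, n):
    """Mean and sample variance (divisor n - 1) of a 0/1 variable."""
    mean = successes / n
    variance = mean * (1 - mean) * n / (n - 1)
    return mean, variance, n


mean_f, var_f, n_f = binary_group(67, 92)     # females
mean_m, var_m, n_m = binary_group(633, 908)   # males (rows 1 to 3 combined)
difference = mean_f - mean_m

# Test 1: equal variances (pooled variance, n_f + n_m - 2 degrees of freedom).
pooled_var = ((n_f - 1) * var_f + (n_m - 1) * var_m) / (n_f + n_m - 2)
t_pooled = difference / math.sqrt(pooled_var * (1 / n_f + 1 / n_m))
df_pooled = n_f + n_m - 2

# Test 2: unequal variances (Welch), with Welch-Satterthwaite degrees of freedom.
a, b = var_f / n_f, var_m / n_m
t_welch = difference / math.sqrt(a + b)
df_welch = (a + b) ** 2 / (a ** 2 / (n_f - 1) + b ** 2 / (n_m - 1))

print(f"pooled: t = {t_pooled:.3f}, df = {df_pooled}")
print(f"Welch:  t = {t_welch:.3f}, df = {df_welch:.1f}")
```

The two tests differ mainly in their degrees of freedom, because the female group is about ten times smaller than the male group.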
5. Logistic Regression Model:

Refer 03B. training dataset JASP file
Training data
Using the training dataset, a logistic model was fitted to predict the creditability of a customer.
Factors such as duration of the credit, credit amount, instalment per cent and age of the person
were considered in developing the model. Results are given below.
Area Under Curve (AUC): As a validation check, the AUC should be more than 0.7 in both the
training and the validation samples, and there should be no significant difference between the AUC
scores of the two samples. An AUC of more than 0.8 is considered an excellent score. The calculated
AUC of the model is 0.805, meaning the model discriminates well between credit-worthy and not
credit-worthy customers.
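To make the AUC figure concrete: it is the probability that a randomly chosen credit-worthy customer receives a higher predicted score than a randomly chosen not credit-worthy one. The following Python sketch computes it by comparing all such pairs; the five scores and labels are invented purely for illustration and are not taken from the data set.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities and true Creditability labels.
scores = [0.9, 0.8, 0.4, 0.6, 0.3]
labels = [1, 1, 1, 0, 0]
print(round(auc(scores, labels), 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why thresholds such as 0.7 and 0.8 are used as rules of thumb.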
Coefficients                          Wald Chi-square
(Intercept)                           11.33423872
Account Balance                       56.25
Payment Status of Previous Credit     16.24837921
Purpose                               1.525951557
Value Savings/Stocks                  9.460284665
Length of current employment          2.963627624
Instalment per cent                   14.87755102
Sex & Marital Status                  2.743164063
Guarantors                            2.078280811
Duration in Current address           0.003460208
Most valuable available asset         4.364376042
Concurrent Credits                    8.150104058
Type of apartment                     1.205222117
No of Credits at this Bank            1.289571962
Occupation                            0.011531012
No of dependents                      0.516601563
Telephone                             3.058274405
Foreign Worker                        2.698979592
Duration_of_monthly_Credit            4.794589774
Credit_Amount                         3.398412098
Age_group                             2.891921223
The logistic regression model takes “Creditability” as the dependent variable and all other
variables as independent variables. The model indicates that the most significant factors
influencing Creditability are Account Balance, Payment Status of Previous Credit and Instalment
per cent (Wald statistics = 56.25, 16.25 and 14.88 respectively). The remaining factors have a
weaker impact on the dependent variable Creditability. Variables such as Duration in current
address (Wald statistic = 0.003), Occupation (Wald statistic = 0.0115) and Number of dependents
(Wald statistic = 0.5) could easily be omitted from the model.
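The Wald statistics in the table can be converted into approximate p-values. The Python sketch below assumes that each predictor enters the model with a single coefficient, so that each Wald statistic is compared with a chi-squared distribution with 1 degree of freedom; this is an assumption about how the model was fitted, not something stated in the JASP output reproduced here.

```python
import math

# Wald chi-square statistics from the training data table (intercept omitted).
wald = {
    "Account Balance": 56.25,
    "Payment Status of Previous Credit": 16.24837921,
    "Purpose": 1.525951557,
    "Value Savings/Stocks": 9.460284665,
    "Length of current employment": 2.963627624,
    "Instalment per cent": 14.87755102,
    "Sex & Marital Status": 2.743164063,
    "Guarantors": 2.078280811,
    "Duration in Current address": 0.003460208,
    "Most valuable available asset": 4.364376042,
    "Concurrent Credits": 8.150104058,
    "Type of apartment": 1.205222117,
    "No of Credits at this Bank": 1.289571962,
    "Occupation": 0.011531012,
    "No of dependents": 0.516601563,
    "Telephone": 3.058274405,
    "Foreign Worker": 2.698979592,
    "Duration_of_monthly_Credit": 4.794589774,
    "Credit_Amount": 3.398412098,
    "Age_group": 2.891921223,
}


def wald_p_value(statistic):
    """Upper tail of the chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


p_values = {name: wald_p_value(w) for name, w in wald.items()}
below_0001 = sorted(name for name, p in p_values.items() if p < 0.001)
below_005 = sorted(name for name, p in p_values.items() if p < 0.05)

for name in sorted(p_values, key=p_values.get):
    print(f"{name:<36}{p_values[name]:.4f}")
```

Under this assumption, the three factors named above are exactly those with p < 0.001, while a 5% threshold (Wald statistic above 3.841) would also admit further predictors such as Value Savings/Stocks and Concurrent Credits.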
The Akaike information criterion (AIC) is an estimator of the relative quality of statistical
models for a given set of data. Given a collection of models for the data, AIC estimates the
quality of each model relative to each of the other models. Thus, AIC provides a means for model
selection. The AIC value for the logistic regression model of the bank is 814.369. Since AIC is a
relative measure, this value serves as a baseline against which alternative models for the same
data can be compared. The significant p-value (p < 0.001) indicates that the model fits the data
significantly better than a model without predictors.
The coefficient for the duration of credit was found to be -0.038; this shows that an increase in
the duration of credit lowers the chance of being credit worthy. In short, a longer duration
decreases the chances of credit worthiness of a person. The credit amount is not significant in
the model. The coefficient for the instalment per cent is -0.351; this shows that as the
instalment per cent decreases, so do the chances of not being credit worthy. That is, low
instalment rates tend to increase the credit worthiness. Lastly, the coefficient for age is 0.031;
this implies that a one-unit increase in age increases the odds of credit worthiness by about 3%.
Overall, the results of the logistic regression indicate a significant association of
creditability with duration of credit, instalment per cent and age of the person.
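A logistic regression coefficient b is easiest to read as an odds ratio exp(b): the factor by which the odds of the outcome coded 1 are multiplied when the predictor increases by one unit. The Python sketch below applies this to the three coefficients quoted above, assuming that the model predicts Creditability = 1 (credit worthy).

```python
import math

# Coefficients quoted in the text; Creditability = 1 is assumed to be the modelled outcome.
coefficients = {
    "Duration of credit": -0.038,
    "Instalment per cent": -0.351,
    "Age": 0.031,
}

odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}
percent_change = {name: 100 * (ratio - 1) for name, ratio in odds_ratios.items()}

for name in coefficients:
    print(f"{name:<22}odds ratio = {odds_ratios[name]:.3f} "
          f"({percent_change[name]:+.1f}% change in odds per unit)")
```

An odds ratio below 1 (duration, instalment per cent) means the odds of being credit worthy fall as the predictor rises; an odds ratio above 1 (age) means they rise. The change applies to the odds, not directly to the probability.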
Predicted - residuals plot
Squared Pearson residuals plot
6. Assessment of validity of regression model and its variables of training data on testing data set:
Testing Data: Refer 03B. testing dataset JASP File
The logistic regression on the testing data indicates that the model is also well fitted (p-value
< 0.001), with AIC value = 215.79. However, only Account Balance is found significant in the
logistic model of the testing data, with a p-value less than 0.001. In this model, the Wald
chi-square statistics also show that Purpose (Wald statistic = 0.49), Duration in current address
(Wald statistic = 0.005), Age group (Wald statistic = 0.69), Foreign Worker (Wald statistic =
0.0002) and Occupation (Wald statistic = 0.001) are unnecessary predictors in the model. These
variables should be eliminated from the logistic model. Therefore, it can be said that the
logistic regression on the testing data does not completely validate the model fitted on the
training data.
Coefficients                          Wald Chi-square
(Intercept)                           0.000225
Account Balance                       13.77036
Payment Status of Previous Credit     5.653374
Purpose                               0.49
Value Savings/Stocks                  7.716049
Length of current employment          2.626635
Instalment per cent                   0.756144
Sex & Marital Status                  3.114915
Guarantors                            2.528843
Duration in Current address           0.005262
Most valuable available asset         0.671978
Concurrent Credits                    0.979275
Type of apartment                     1.367808
No of Credits at this Bank            3.356839
Occupation                            0.001072
No of dependents                      0.206612
Telephone                             0.409489
Foreign Worker                        0.000168
Duration_of_monthly_Credit            0.046172
Credit_Amount                         4.548889
Age_group                             0.690305
Predicted - residuals plot
Squared Pearson residuals plot
Using the testing dataset, I found that, unlike the training dataset where 3 out of the four
independent variables considered (duration of credit, credit amount, instalment per cent and age)
were significant, none of these independent variables was statistically significant in the model.
7. Cost Profit Analysis:
Refer 01.dataset.xls
When a bank rejects an applicant with a good credit risk who is likely to repay the loan, it
results in a loss of business; likewise, when a bank accepts an applicant with a bad credit risk,
it results in a financial loss.
Decisions that cause a loss are called wrong decisions and decisions that bring a profit are
called correct decisions.
The total credit amount in the analysis is $3,271,248 (see Dataset.xls, tab dataset, cell F1003).
Using the logistic model fitted on the training data, I find the credit risk probabilities for the
whole data set (1000 samples).
• If the credit risk probability is greater than or equal to 0.5, the case is considered risky and
labelled “1”.
• If the credit risk probability is less than 0.5, the case is considered non-risky and therefore
labelled “0”.
• If the levels of “Creditability” and “Credit risk” are equal (1 or 0 for both), the decision is
considered wrong; otherwise it is considered correct.
Further, the revenue for correct decisions is accounted as 135% and for wrong decisions as 0%. The
total revenue is found to be $1,256,545 (see Dataset.xls, tab dataset, cell AK1003).
The deficit is calculated as $3,271,248 - $1,256,545 = $2,014,703. The bank would face a loss if
it does not review its creditability procedure. Note that cost-profit analysis is almost
impossible in the JASP software. Therefore, I carried it out in MS Excel.
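The decision rule and revenue calculation described above can be sketched in Python as follows. This is an illustration, not the Excel workbook: the function names and the three example rows are hypothetical, and it assumes that the 135% is applied to the credit amount of each correctly decided case. Only the two totals at the end are figures from this report.

```python
def decision(creditability, risk_probability, threshold=0.5):
    """Label a case 'wrong' when Creditability equals the credit risk label."""
    credit_risk = 1 if risk_probability >= threshold else 0
    return "wrong" if credit_risk == creditability else "correct"


def total_revenue(rows, rate=1.35):
    """Revenue: rate times the credit amount for correct decisions, 0 otherwise."""
    return sum(
        rate * amount
        for amount, creditability, probability in rows
        if decision(creditability, probability) == "correct"
    )


# Hypothetical rows: (credit amount, Creditability, credit risk probability).
example_rows = [(1000, 1, 0.2), (2000, 0, 0.7), (1500, 1, 0.6)]
example_revenue = total_revenue(example_rows)

# Totals reported in this section.
total_credit = 3_271_248
reported_revenue = 1_256_545
deficit = total_credit - reported_revenue
print(example_revenue, deficit)
```

In the example, the third row is a credit-worthy customer flagged as risky, so it counts as a wrong decision and contributes no revenue.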
Attachments
 01.dataset.xls
 02A. testing dataset.xls
 02B.testing dataset. JASP file
 03B. training dataset.xls
 03B.training dataset. JASP file
 04A. dataset for t-test
 04A. dataset for t-test JASP file
Internet References:
Heeren, T. and D'Agostino, R., 1987. Robustness of the two independent samples t-test when applied to ordinal scaled data. Statistics in Medicine, 6(1), pp.79-90.
Lee Rodgers, J. and Nicewander, W.A., 1988. Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1), pp.59-66.
Ray, S.C. and Das, A., 2010. Distribution of cost and profit efficiency: Evidence from Indian banking. European Journal of Operational Research, 201(1), pp.297-307.
Google search for statistical terms and meaning
