
Robert Gordon University

CMM723 – Statistics for Business Analysis 2018

Student ID 2017030

Answer
CMM723 – Statistics for Business Analysis 2018: Answer by 2017030 1

Table of Contents

Introduction:
Data Analysis:
1. Recoding of Numerical variables into categorical variables:
2. Splitting the dataset into Training and Testing data set:
3. Summarization and association:
   Summary:
   3.1. Descriptive Statistics:
   3.2. Frequency Distribution:
   3.3. Histogram plots:
   3.4. Box plots:
   3.5. Correlation Coefficient:
4. Is there any difference between the creditability of female customers and male customers?
   4.1 T-test:
   4.2 Chi-Squared Test and Contingency Table:
   4.3 Hypotheses:
5. Logistic Regression Model:
6. Assessment of validity of regression model and its variables of training data on testing data set:
7. Cost Profit Analysis:
Attachments
Internet References:



Introduction:
The data set describes the creditability of 1000 sampled customers of the given bank. Based on each
customer's creditability and the other recorded attributes, the bank decides whether or not to extend
credit facilities.

The cost-profit analysis is carried out with this data. Most of the variables are categorical, either
nominal (used to name or label a series of values) or ordinal (carrying information about the order of
the values). For the purpose of the analysis, I also transformed the numerical variables into
categorical variables.

JASP (version 0.8.6.0) is the main software tool used in this analysis. However, I used MS Excel (the
format in which the data set is supplied) for the calculations that JASP could not perform.

Data Analysis:

1. Recoding of Numerical variables into categorical variables:


First, the three numerical variables of the data set are recoded into categorical variables. Credit amount
is transformed into a categorical variable (credit amount < $2000 = “1”, $2000 <= credit amount < $4000
= “2”, $4000 <= credit amount < $6000 = “3”, credit amount >= $6000 = “4”). Age is transformed into an
age group (age < 25 = “1”, 25 <= age < 50 = “2”, 50 <= age <= 75 = “3”). Duration of credit in months is
recoded in the same way (months < 20 = “1”, 20 <= months < 40 = “2”, 40 <= months < 60 = “3”,
60 <= months < 80 = “4”).

This recoding is not possible in JASP. Therefore, I carried it out in MS Excel.

Refer 01.dataset.xls
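
The recoding rules above can be sketched in a few lines of Python. This is only an illustration of the binning logic, not the Excel formulas actually used; the function name `recode`, the sample applicant, and the choice to place an age of exactly 25 in group 2 are assumptions.

```python
from bisect import bisect_right

# Cut points taken from the recoding rules in the text.
# Each cut is the inclusive lower bound of the next category.
CREDIT_AMOUNT_CUTS = [2000, 4000, 6000]  # categories 1..4
AGE_CUTS = [25, 50]                      # categories 1..3
DURATION_CUTS = [20, 40, 60]             # categories 1..4


def recode(value, cuts):
    """Return the 1-based category that value falls into."""
    return bisect_right(cuts, value) + 1


# Hypothetical applicant, for illustration only.
applicant = {"credit_amount": 4000, "age": 25, "duration_months": 19}
codes = (
    recode(applicant["credit_amount"], CREDIT_AMOUNT_CUTS),
    recode(applicant["age"], AGE_CUTS),
    recode(applicant["duration_months"], DURATION_CUTS),
)
print(codes)
```

Using `bisect_right` keeps every lower bound inclusive, which matches the `<=` side of each rule.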

2. Splitting the dataset into Training and Testing data set:


Using random sampling in MS Excel, I drew 80% of the records (800 samples) for the training data set
and the remaining 20% (200 samples) for the testing data set. This operation is also not possible in
JASP, so MS Excel was used here as well.

Refer 02A. testing dataset.xls in MS Excel and 02B.testing dataset. JASP file

Refer 03B. training dataset.xls in MS Excel and 03B.training dataset. JASP file
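
For readers who prefer a scripted split, the same 80/20 random partition might look like the sketch below. The seed value and the function name `split_rows` are assumptions; the actual split was drawn in MS Excel and will not match this one row for row.

```python
import random


def split_rows(rows, train_fraction=0.8, seed=723):
    """Shuffle a copy of rows and split it into (training, testing) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Row identifiers 1..1000 stand in for the 1000 records of the data set.
training, testing = split_rows(range(1, 1001))
print(len(training), len(testing))
```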

3. Summarization and association:

Summary:
Refer 03B. training dataset JASP file

The majority of the respondents in the training data (approx. 70%, n = 554) were categorized as credit
worthy, whereas more than a quarter (approx. 30%, n = 246) were categorized as not credit worthy.

Looking at the payment status of previous credit in the table below, the majority (52.5%, n = 420)
either had no previous credits or had paid back all previous credits. However, about 5% (n = 40) had a
problematic running account, while 4% (n = 32) were hesitant in paying previous credits.

Frequencies for Payment Status of Previous Credit

Payment Status of Previous Credit                     Frequency   Percent   Valid Percent   Cumulative Percent
Hesitant payment of previous credits                         32     4.000           4.000                4.000
Problematic running account                                  40     5.000           5.000                9.000
No previous credits / paid back all previous credits        420    52.500          52.500               61.500
No problems with current credits at this bank                73     9.125           9.125               70.625
Paid back previous credits at this bank                     235    29.375          29.375              100.000
Total                                                       800   100.000
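
The percent and cumulative percent columns follow directly from the frequencies. A minimal sketch of that arithmetic, using the counts from the table (the variable names are illustrative):

```python
# Frequencies copied from the table above.
counts = {
    "Hesitant payment of previous credits": 32,
    "Problematic running account": 40,
    "No previous credits / paid back all previous credits": 420,
    "No problems with current credits at this bank": 73,
    "Paid back previous credits at this bank": 235,
}

total = sum(counts.values())
cumulative = 0.0
rows = []
for label, n in counts.items():
    percent = 100 * n / total
    cumulative += percent  # running total gives the cumulative percent
    rows.append((label, n, percent, cumulative))
    print(f"{label:<55}{n:>5}{percent:>9.3f}{cumulative:>9.3f}")
```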

Further narrative discussion of Descriptive Statistics, Frequencies, graphical presentation of
associations and Pearson’s Correlation is given in the following paragraphs.

3.1. Descriptive Statistics:

The descriptive statistics of Creditability, Account balance, Payment Status of Previous Credit and
Purpose in the training data show mean values of 0.693, 2.539, 2.556 and 2.804 respectively.
The standard deviations of these categorical variables are 0.462, 1.252, 1.097 and 2.749 respectively.
Creditability, Account balance, Payment Status of Previous Credit and Purpose range from 0 to 1,
1 to 4, 0 to 4 and 0 to 10 respectively.
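
As a cross-check, the mean and standard deviation of Creditability can be reproduced from the two class counts given in the Summary (554 credit worthy, 246 not). This sketch assumes the reported standard deviation is the sample (n - 1) version.

```python
import statistics

# 1 = credit worthy, 0 = not credit worthy, rebuilt from the counts in the Summary.
creditability = [1] * 554 + [0] * 246

mean = statistics.mean(creditability)
sd = statistics.stdev(creditability)  # sample standard deviation (n - 1 denominator)
print(round(mean, 4), round(sd, 3))
```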

The descriptive statistics of Value Savings/Stocks, Length of current employment, Instalment per cent
and Sex & Marital Status in the training data show mean values of 2.07, 3.395, 2.958 and 2.672
respectively. The standard deviations of these categorical variables are 1.562, 1.207, 1.113 and 0.713
respectively. Value Savings/Stocks, Length of current employment, Instalment per cent and Sex &
Marital Status range from 1 to 5, 1 to 5, 1 to 4 and 1 to 4 respectively.

The descriptive statistics of Guarantors, Duration in Current Address, Most valuable available asset,
Concurrent Credits and Type of apartment in the training data show mean values of 1.144,
2.886, 2.377, 2.674 and 1.921 respectively. The standard deviations of these categorical variables are
0.475, 1.093, 1.062, 0.706 and 0.539 respectively. Guarantors, Duration in Current Address,
Most valuable available asset, Concurrent Credits and Type of apartment range from 1 to 3, 1 to 4, 1 to
4, 1 to 3 and 1 to 3 respectively.

The descriptive statistics of Number of Credits at this bank, Occupation, Number of dependents,
Telephone and Foreign Workers in the training data show mean values of 1.413, 2.875, 1.156,
1.389 and 1.036 respectively. The standard deviations of these five categorical variables are 0.583,
0.661, 0.363, 0.448 and 0.187 respectively. Number of Credits at this bank, Occupation, Number
of dependents, Telephone and Foreign Workers range from 1 to 4, 1 to 4, 1 to 2, 1 to 2 and 1 to 2
respectively.

The descriptive statistics of Duration of monthly credit, Amount of Credit and Age group in the training
data show mean values of 1.524, 1.964 and 1.915 respectively. The standard deviations of
these categorical variables are 0.638, 1.059 and 0.548 respectively. Duration of monthly credit,
Amount of Credit and Age group range from 1 to 3, 1 to 4 and 1 to 3 respectively.

3.2. Frequency Distribution:

Of the 800 people, 30.75% are not credit worthy and the remaining 69.25% are credit worthy.

Of the 800 people, 28.375% have no balance or a debit balance, and 27.875% have no
running account. The highest percentage, 38%, have checked out $200 for at least 1 year.

The highest percentage, 52%, have no previous or pending credits, followed by 30.125%
who have paid back previous credits at this bank. Only 4.25% and 4.875% fall under hesitant
payment of previous credits and problematic running account respectively.

The highest percentage, 27.875%, need credit for purchasing items and furniture, followed by
24.375% who require credit for other purposes. The percentage of people
who need credit for purchasing used cars (9.875%) is also notable.

Among the 800 people, most (61.375%) have no available savings, followed by 17.625%
who have more than $1000 in savings.

Only 6.125% of people are currently unemployed. 34.25% have been employed for 1 to 4 years, followed
by 25.625% who have been employed for more than 7 years.

A substantial 46.5% of people have an instalment per cent of less than 20%.

53.875% of people are single or widowed males. Only 9.25% are female.

90.75% of people have no guarantor for their credit.



42.5% of people have lived at their current address for more than 7 years, followed by 29.625%
who have lived there for only 2 to 4 years.

More than 33% of people have a savings contract with a building society or life insurance.

A very high percentage, 81.25%, run no further credits.

More than 70% of people live in an owner-occupied flat, the most frequent category, and only 10.875%
live in a rented flat, the least frequent category.

More than 62% of people have only one credit at this bank, and only 0.75% have six or more credits
at this bank.

The people asking for credit at this bank are mostly skilled workers, skilled employees or
minor civil servants. Only 2.625% are unemployed or unskilled labourers with no permanent residence.

84.375% of people have 3 or more dependents.

61.125% of people use a telephone, whereas 38.875% do not.

96.375% of people are foreign workers, whereas 3.625% are not.

For 55.5% of people the duration of credit is less than 40 months, followed by 36.625% whose
duration of credit is greater than 40 months but less than 60 months.

The most frequent credit amount is less than $2000 (42.875%),
followed by $2000 or more but less than $4000 (32.875%).

The majority of people, almost 70%, belong to the age group 25 to 50 years.

3.3. Histogram plots:



3.4. Box plots:


[Figure: box plots of Age (years), Credit amount and Duration of Credit (month), with a "Reading a Boxplot" guide]

3.5. Correlation Coefficient:

*** p < 0.001, **p < 0.01, * p < 0.05

Statistically significant associations with the response variable Creditability are found for the
following variables:

1) Account Balance (r = 0.358, p-value <0.001): Moderate positive correlation

2) Payment Status of previous credits (r = 0.249, p-value <0.001): Weak positive correlation

3) Value savings/stocks (r = 0.174, p-value <0.001): Weak positive correlation

4) Length of current employment (r =0.124, p-value<0.001): Weak positive correlation

5) Most valuable available assets (r = -0.13, p-value<0.001): Weak negative correlation

6) Concurrent Credits (r = 0.141, p-value <0.001): Weak positive correlation

7) Duration of monthly Credits (r = -0.175, p-value <0.001): Weak negative correlation

8) Credit amount (r = -0.112, p-value = 0.001): Weak negative correlation.
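
The significance of a Pearson correlation can be checked by converting r into a t statistic with n - 2 degrees of freedom. The sketch below does this for two of the reported coefficients. It assumes n = 800 (the training data) and approximates the p-value with the normal distribution, which is reasonable only because n is large; JASP's exact p-values may differ slightly.

```python
import math


def correlation_test(r, n):
    """Return the t statistic for H0: rho = 0 and an approximate two-sided p-value."""
    t = r * math.sqrt((n - 2) / (1 - r * r))
    p_approx = math.erfc(abs(t) / math.sqrt(2))  # normal approximation for large n
    return t, p_approx


# Coefficients taken from the list above; n = 800 is an assumption.
results = {}
for name, r in [("Account Balance", 0.358), ("Credit amount", -0.112)]:
    results[name] = correlation_test(r, n=800)
    print(name, round(results[name][0], 2), f"{results[name][1]:.4f}")
```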



4. Is there any difference between the creditability of female customers and male customers?
Refer 04B. dataset for t-test JASP file

4.1 T-test:
This test checks the equality of the means of a numerical variable (here, Creditability) across the
levels of a categorical variable (here, sex, derived from Sex & Marital Status). For the calculation,
I recoded levels 1, 2 and 3 as “Males” and level 4 as “Females”. This recoding is not possible in JASP,
so I carried it out in MS Excel.

The independent samples t-test produces a t-value of 0.62 with 998 degrees of freedom and a p-value of
0.535. The mean creditability is 0.728 for the 92 females and 0.697 for the 908 males.
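
The pooled (equal-variance) t statistic can be reproduced from the counts in the contingency table of section 4.2, since Creditability is a 0/1 variable. The sketch below is a hand-rolled version of the textbook formula, not JASP's implementation; the function name `pooled_t` is illustrative.

```python
import math
import statistics

# Creditability (1 = credit-worthy) rebuilt from the contingency table counts:
# females 67 of 92 credit-worthy; males (three male rows combined) 633 of 908.
females = [1] * 67 + [0] * 25
males = [1] * 633 + [0] * 275


def pooled_t(a, b):
    """Independent samples t statistic assuming equal variances, with its df."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(a) - statistics.mean(b)) / se, df


t_stat, df = pooled_t(females, males)
print(round(statistics.mean(females), 3), round(statistics.mean(males), 3))
print(round(t_stat, 3), df)
```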


4.2 Chi-Squared Test and Contingency Table:


I ran a Chi-Square test of association to check whether there is any significant association between
gender of the person and creditability of the person. This was tested at 5% level of significance. The
results are presented below.

Contingency Tables

                                                      Creditability
Sex & Marital Status                              Not credit-worthy   Credit-worthy    Total
Male - divorced / living apart    Count                        20.0            30.0     50.0
                                  % within row                40.0%           60.0%   100.0%
Male - single                     Count                       109.0           201.0    310.0
                                  % within row                35.2%           64.8%   100.0%
Male - married / widowed          Count                       146.0           402.0    548.0
                                  % within row                26.6%           73.4%   100.0%
Female                            Count                        25.0            67.0     92.0
                                  % within row                27.2%           72.8%   100.0%
Total                             Count                       300.0           700.0   1000.0
                                  % within row                30.0%           70.0%   100.0%

Chi-Squared Tests

        Value    df       p
Χ²      9.605     3   0.022
N        1000

The p-value is 0.022, which is below the 5% level of significance. We therefore reject the null
hypothesis and conclude that there is a significant association between Sex & Marital Status and the
creditability of the person.
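
The chi-squared statistic can be recomputed from the observed counts in the contingency table. The sketch below uses the standard expected-count formula; the p-value helper `chi2_sf_df3` is a closed form that is valid only for 3 degrees of freedom and is introduced here purely for illustration.

```python
import math

# Observed counts from the contingency table: [not credit-worthy, credit-worthy].
observed = [
    [20, 30],    # Male - divorced / living apart
    [109, 201],  # Male - single
    [146, 402],  # Male - married / widowed
    [25, 67],    # Female
]


def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for row, row_total in zip(table, row_totals):
        for obs, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            stat += (obs - expected) ** 2 / expected
    return stat, (len(table) - 1) * (len(col_totals) - 1)


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-squared variable with exactly 3 df."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


stat, df = chi_squared(observed)
p_value = chi2_sf_df3(stat)
print(round(stat, 3), df, round(p_value, 3))
```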

4.3 Hypotheses:

Null hypothesis (H0): The difference between the average creditability of males and females is 0.

Alternative hypothesis (H1): The average creditability of males and females differs.

Test applied 1: Independent sample t-test assuming equal variances

Level of significance: 5%, Calculated degrees of freedom: 998

Calculated t-statistic: (0.620)

Two-tailed p-value: 0.535

Interpretation: 0.5352 > 0.05. Therefore, the null hypothesis is not rejected at the 5% level of
significance. There is no evidence that the average creditability of males and females differs.

Decision Making: The average creditability of males and females can be treated as equal.

Test applied 2: Two-sample t-test assuming un-equal variances

Level of significance: 5%

Calculated degrees of freedom: 111.4

Calculated t-statistic: (0.634)

Two-tailed p-value: 0.527

Interpretation: 0.5272 > 0.05. Therefore, the null hypothesis is not rejected at the 5% level of
significance. There is no evidence that the average creditability of males and females differs.

Decision Making: The average creditability of males and females can be treated as equal.
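
The unequal-variance (Welch) version of the test can be sketched in the same way from the contingency table counts. The degrees of freedom come from the Welch-Satterthwaite formula; the function name `welch_t` is illustrative and this is not JASP's own code.

```python
import math
import statistics

# Creditability (1 = credit-worthy) rebuilt from the contingency table counts.
females = [1] * 67 + [0] * 25
males = [1] * 633 + [0] * 275


def welch_t(a, b):
    """Two-sample t statistic without the equal-variance assumption, with its df."""
    var_a = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    var_b = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    )
    return t, df


welch_stat, welch_df = welch_t(females, males)
print(round(welch_stat, 3), round(welch_df, 1))
```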

Conclusion: Both t-tests find no significant difference between the average creditability of
males and that of females. The chi-squared test, which compares all four Sex & Marital Status
categories, does find a significant association with creditability. As the contingency table shows,
females (72.8% credit-worthy) appear more credit worthy than men who are single (64.8%) or
divorced / living apart (60.0%).

5. Logistic Regression Model:


Refer 03B. training dataset JASP file

Training data

Using the training dataset, a logistic model was fitted to predict the creditability of a customer. Factors
such as duration of the credit, credit amount, instalment percent and age of the person were considered
in developing the model. Results are given below:

Area Under Curve (AUC): As a validation check, the AUC should be more than 0.7 in both the training
and validation samples, and there should be no significant difference between the AUC scores of the
two samples. An AUC of more than 0.8 is considered excellent. The AUC of the model is 0.805, which
indicates that the model discriminates well between credit worthy and not credit worthy customers.
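
To make the AUC concrete: it is the probability that a randomly chosen credit worthy case receives a higher predicted probability than a randomly chosen not credit worthy case. The sketch below computes it by comparing all such pairs. The labels and scores are hypothetical and are not taken from the bank data.

```python
def auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes (1 = credit worthy) and predicted probabilities.
labels = [1, 1, 1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.55, 0.6, 0.7, 0.3, 0.4, 0.35]
example_auc = auc(labels, scores)
print(example_auc)
```

The pairwise comparison is quadratic in the sample size, which is fine for 800 or 200 cases but would be replaced by a rank-based formula on large data.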

Coefficients                          Wald Chi-square
(Intercept)                           11.33423872
Account Balance                       56.25
Payment Status of Previous Credit     16.24837921
Purpose                               1.525951557
Value Savings/Stocks                  9.460284665
Length of current employment          2.963627624
Instalment per cent                   14.87755102
Sex & Marital Status                  2.743164063
Guarantors                            2.078280811
Duration in Current address           0.003460208
Most valuable available asset         4.364376042
Concurrent Credits                    8.150104058
Type of apartment                     1.205222117
No of Credits at this Bank            1.289571962
Occupation                            0.011531012
No of dependents                      0.516601563
Telephone                             3.058274405
Foreign Worker                        2.698979592
Duration_of_monthly_Credit            4.794589774
Credit_Amount                         3.398412098
Age_group                             2.891921223

The logistic regression model takes “Creditability” as the dependent variable and all other
variables as independent variables. The model shows that the factors with the strongest influence on
Creditability are Account balance, Payment Status of Previous Credit and Instalment per cent, the
only predictors whose Wald statistics are significant at the 0.1% level. The rest of the factors do
not reach that level. Variables such as Duration in current address (Wald statistic = 0.003),
Occupation (Wald statistic = 0.0115) and number of dependents (Wald statistic = 0.5) could easily be
omitted from the model.
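
A Wald chi-square statistic can be turned into a p-value with the chi-squared distribution. The sketch below assumes each predictor enters the model with a single coefficient, so each statistic has one degree of freedom; the helper name `wald_p_value` is illustrative.

```python
import math


def wald_p_value(wald):
    """Upper-tail probability of a chi-squared statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(wald / 2))


# A few Wald statistics copied from the training table above.
training_wald = {
    "Account Balance": 56.25,
    "Instalment per cent": 14.87755102,
    "Duration in Current address": 0.003460208,
    "Occupation": 0.011531012,
}
p_values = {name: wald_p_value(w) for name, w in training_wald.items()}
for name, p in p_values.items():
    print(f"{name:<30}{p:.4g}")
```

For reference, with one degree of freedom a statistic of about 3.84 corresponds to p = 0.05 and about 10.83 to p = 0.001.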

The Akaike information criterion (AIC) is an estimator of the relative quality of statistical models for a
given set of data. Given a collection of models for the data, AIC estimates the quality of each model
relative to each of the other models. Thus, AIC provides a means for model selection. The AIC value
for the Logistic Regression Model of the bank is 814.369. On its own this value has no absolute
interpretation; it becomes informative when compared with the AIC of alternative models fitted to the
same data, where a lower value is better. The significant p-value (p < 0.001) indicates that the model
fits the data significantly better than a model without predictors.

The coefficient for the duration of credit was found to be -0.038; because it is negative, an increase
in the duration of credit lowers the chance of being credit worthy. In short, shorter
durations increase the chances of credit worthiness of a person. The credit amount is not significant
in the model. The coefficient for the instalment percent is -0.351; this shows that as the
instalment percent decreases, so does the chance of not being credit worthy. That is, lower instalment
rates tend to increase credit worthiness. Lastly, the coefficient for age is 0.031; this implies that
each one-unit increase in age multiplies the odds of credit worthiness by about 1.03, an increase of
roughly 3% in the odds rather than in the probability. Overall, the results of the logistic
regression indicate a significant association between creditability and the duration of credit, the
instalment percent and the age of the person.
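
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below converts the three coefficients quoted above, assuming the model predicts Creditability = 1 (credit worthy).

```python
import math

# Coefficients quoted in the text; the outcome is assumed to be Creditability = 1.
coefficients = {
    "Duration of credit": -0.038,
    "Instalment per cent": -0.351,
    "Age": 0.031,
}

# exp(coefficient) is the factor by which the odds change per one-unit increase.
odds_ratios = {name: math.exp(b) for name, b in coefficients.items()}
for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.3f} ({direction} the odds)")
```

An odds ratio below 1 means the odds of being credit worthy fall as the predictor rises; above 1 means they rise.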

[Figures: predicted - residuals plot; squared Pearson residuals plot]

6. Assessment of validity of regression model and its variables of training data on testing data set:
Testing Data: Refer 03B. testing dataset JASP File

The logistic regression on the testing data indicates that the model also fits well (p-value < 0.001),
with an AIC value of 215.79. However, only Account balance is significant in the logistic model
of the testing data, with a p-value of less than 0.001. In this logistic regression model, the Wald
Chi-square statistics also show that Purpose (Wald statistic = 0.49), Duration in current address
(Wald statistic = 0.005), Age group (Wald statistic = 0.69), Foreign workers (Wald statistic = 0.0001)
and Occupation (Wald statistic = 0.001) are unnecessary predictors in the model. These
variables should be eliminated from the logistic model. Therefore, the logistic regression model
fitted on the training data is not completely validated by the logistic regression on the
testing data.
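
The comparison between the two fits can be made explicit by checking which predictors exceed a common threshold in each table. The sketch below uses a subset of the Wald statistics from the training and testing tables (rounded to three decimals) and the 0.1% critical value of the chi-squared distribution with one degree of freedom; both the one-degree-of-freedom assumption and the choice of threshold are illustrative.

```python
CRITICAL_0_001 = 10.828  # chi-squared critical value, 1 df, 0.1% level

# (predictor, training Wald, testing Wald), rounded from the two tables.
wald = [
    ("Account Balance", 56.250, 13.770),
    ("Payment Status of Previous Credit", 16.248, 5.653),
    ("Value Savings/Stocks", 9.460, 7.716),
    ("Instalment per cent", 14.878, 0.756),
    ("Concurrent Credits", 8.150, 0.979),
    ("Duration_of_monthly_Credit", 4.795, 0.046),
    ("Credit_Amount", 3.398, 4.549),
    ("Age_group", 2.892, 0.690),
]

sig_train = {name for name, train, _ in wald if train > CRITICAL_0_001}
sig_test = {name for name, _, test in wald if test > CRITICAL_0_001}

print("Significant in training:", sorted(sig_train))
print("Significant in testing: ", sorted(sig_test))
print("Not replicated:         ", sorted(sig_train - sig_test))
```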

Coefficients                          Wald Chi-square
(Intercept)                           0.000225
Account Balance                       13.77036
Payment Status of Previous Credit     5.653374
Purpose                               0.49
Value Savings/Stocks                  7.716049
Length of current employment          2.626635
Instalment per cent                   0.756144
Sex & Marital Status                  3.114915
Guarantors                            2.528843
Duration in Current address           0.005262
Most valuable available asset         0.671978
Concurrent Credits                    0.979275
Type of apartment                     1.367808
No of Credits at this Bank            3.356839
Occupation                            0.001072
No of dependents                      0.206612
Telephone                             0.409489
Foreign Worker                        0.000168
Duration_of_monthly_Credit            0.046172
Credit_Amount                         4.548889
Age_group                             0.690305

[Figures: predicted - residuals plot; squared Pearson residuals plot]

Using the testing data set, I found that, unlike the training data set, where 3 out of the four
independent variables were significant, none of the independent variables was statistically
significant in the model.

7. Cost Profit Analysis:

Refer 01.dataset.xls

When a bank rejects an applicant with a good credit risk, who is likely to repay the loan, it loses
business; when a bank accepts an applicant with a bad credit risk, it also suffers a financial loss.

The decisions that lead to loss and to profit are called wrong and correct decisions respectively.
The total credit amount in the analysis is $3,271,248 (see Dataset.xls, tab dataset, cell F1003).

Using the logistic model fitted on the training data, I computed the credit risk probabilities for the
whole data set (1000 samples).

• If the credit risk probability is greater than or equal to 0.5, the case is considered risky and
labelled “1”.
• If the credit risk probability is less than 0.5, the case is considered non-risky and therefore
labelled “0”.
• If the levels of “Creditability” and “Credit risk” are equal (1 or 0 in both cases), the decision is
considered wrong; otherwise it is considered correct.

Further, the revenue is accounted as 135% for correct decisions and 0% for wrong decisions. The total
revenue is $1,256,545 (see Dataset.xls, tab dataset, cell AK1003). The deficit is calculated as
$2,014,703. The bank would face a loss if it does not review its creditability procedure. Note that
cost-profit analysis is almost impossible in JASP. Therefore, I carried it out in MS Excel.
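
The decision rule and revenue accounting described above can be sketched as follows. The three records are hypothetical and only illustrate the rule; the function names are illustrative as well. The final lines reproduce the deficit from the two totals reported in the text.

```python
def decision_is_correct(creditability, risk_probability, threshold=0.5):
    """A decision is correct when Creditability and the risk label differ."""
    risky = 1 if risk_probability >= threshold else 0
    return creditability != risky


def revenue(amount, correct, rate=1.35):
    """135% of the credit amount for a correct decision, nothing for a wrong one."""
    return amount * rate if correct else 0.0


# Hypothetical records: (creditability, credit amount, predicted risk probability).
records = [(1, 1000, 0.2), (0, 2000, 0.7), (1, 1500, 0.6)]
example_revenue = sum(revenue(a, decision_is_correct(c, p)) for c, a, p in records)
print(example_revenue)

# Totals reported in the text.
total_credit = 3_271_248
reported_revenue = 1_256_545
deficit = total_credit - reported_revenue
print(deficit)
```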

Attachments
• 01.dataset.xls
• 02A. testing dataset.xls
• 02B.testing dataset. JASP file
• 03B. training dataset.xls
• 03B.training dataset. JASP file
• 04A. dataset for t-test
• 04A. dataset for t-test JASP file

Internet References:
Heeren, T. and D'Agostino, R., 1987. Robustness of the two independent samples t‐test when applied to ordinal scaled
data. Statistics in medicine, 6(1), pp.79-90.

Lee Rodgers, J. and Nicewander, W.A., 1988. Thirteen ways to look at the correlation coefficient. The American
Statistician, 42(1), pp.59-66.

Ray, S.C. and Das, A., 2010. Distribution of cost and profit efficiency: Evidence from Indian banking. European Journal of
Operational Research, 201(1), pp.297-307.

Google search for statistical terms and meaning
