Loan Default Prediction
by
Muhammad Manqaad Faheem and Muhammad Jazib Hussain
All rights reserved. Reproduction in whole or in part in any form requires the prior
written permission of Muhammad Manqaad Faheem and Muhammad Jazib Hussain,
or designated representative.
DECLARATION
CERTIFICATE OF APPROVAL
It is certified that the project titled "Loan Default Prediction", carried out by
Muhammad Manqaad Faheem, Reg. No. SEU-XF15-110, and Muhammad Jazib
Hussain, Reg. No. SEU-XF15-112, under the supervision of Dr. Usama Khalid at The
University of Lahore, Islamabad, is fully adequate, in scope and in quality, as a final
year project for the degree of BS in Software Engineering.
Supervisor: -------------------------
Dr. Syed Usama Khalid
Assistant Professor
Dept. of CS & IT
The University of Lahore, Islamabad
HOD: -------------------------
Dr. Syed M. Jawad Hussain
Head of Department
Dept. of CS & IT
The University of Lahore, Islamabad
ACKNOWLEDGMENT
This page is intended to thank your supervisor, co-supervisor and all those (students,
teachers, TA/SA or any third party) who directly helped you in the completion of
the project/thesis.
ABSTRACT
The abstract is the most important part of a project report: it will be read ten or
twenty times more than any other words in the report. So, to make a positive
impression, or just to convey information, this is where to really pay attention to
writing. The purpose of an abstract is not just to tell the reader what was done: it is to
tell him/her what was done in the simplest, most informative way possible. Making
the abstract understandable for a non-technical person should be the first priority.
Discussed below are the basic components of an abstract in any discipline; each
should be handled in a separate paragraph.
The first paragraph should be about the motivation/problem statement: Why do you
care about the problem? What practical, scientific or theoretical gap is your
research/project filling?
TABLE OF CONTENTS
4.3 Implementation procedure
4.3.1 Details about hardware
4.3.2 Details about software/algorithms
4.3.3 Details about control etc.
4.4 Verification of functionalities
4.5 Details about simulation/mathematical modeling
4.6 Summary
Chapter 5
SYSTEM TESTING
5.1 Objective Testing
5.2 Usability Testing
5.3 Software Performance Testing
5.4 Compatibility Testing
5.5 Load Testing
5.6 Security Testing
5.7 Installation Testing
5.8 Test Cases
Chapter 6
RESULTS AND CONCLUSION
6.1 Presentation of the findings
6.1.1 Hardware results
6.1.2 Software results
6.2 Discussion of the findings
6.2.1 Comparison with initial goal
6.2.2 Reasoning for shortcomings
6.3 Limitations
6.4 Recommendations
6.5 Summary
Chapter 7
FUTURE WORK
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS
Chapter 1
INTRODUCTION
1.1 Overview
In the present era, hardly any business can prosper without the help of banks, and in
providing loans to businesses, banks expose themselves to credit risk. In short, banks
face the uncertainty of whether a borrower is going to pay back the borrowed amount
within the fixed period or not. To resolve this issue, a data mining classification
algorithm is taken into consideration. Through this algorithm the system can build a
classification model using the relevant personal information and past consumption
data of loan applicants, and find out the characteristics of risky customers. The
technique is called supervised learning: the previous six months of data are examined
by the system, and the system gives as output whether the user is going to default or
not in the next month.
1.3 Purpose of the research/project
Banks play a vital role in a market economy. The success or failure of an organization
largely depends on its ability to evaluate credit risk. Before giving a credit loan to a
borrower, the bank decides whether the borrower is bad (a defaulter) or good (a
non-defaulter). Predicting borrower status, i.e. whether the borrower will be a
defaulter or a non-defaulter in the future, is a challenging task for any organization or
bank. Basically, loan defaulter prediction is a binary classification problem: the loan
amount and the customer's history govern his creditworthiness for receiving a loan,
and the problem is to classify the borrower as a defaulter or a non-defaulter.
However, developing such a model is a very challenging task due to the increasing
demand for loans. A prototype of the model is described in this report which can be
used by organizations for making the correct decision to approve or reject a
customer's loan request. Loan prediction is very helpful for bank employees as well
as for the applicant. The Loan Prediction System can automatically calculate the
weight of each feature taking part in loan processing, and on new test data the same
features are processed with respect to their associated weights. A time limit can be
set for the applicant to check whether his/her loan can be sanctioned or not.
Chapter 2 LITERATURE REVIEW: In this chapter a detailed literature review is
given. Related technology, related projects and related studies are discussed.
Chapter 5 SYSTEM TESTING: This chapter is all about testing. Various tests
were performed on the system, including objective testing, usability testing, software
performance testing, compatibility testing, load testing and security testing.
Chapter 7 FUTURE WORK: In this chapter we discuss what more can be done
from the point at which we ended our project.
1.6 Summary
Banks play a vital role in boosting the economy of any country by providing loans to
businesses. In doing so, banks expose themselves to the credit risk problem, because
it is hard to figure out whether a borrower is going to default or not. To resolve this
issue, a system called the loan predictor was developed. This loan predictor uses a
data mining technique called supervised learning, in which the system examines the
customer demographics and the payment behavior over the previous six months to
determine which customers will default on their loan next month. We use machine
learning to train on the data set and predict whether the customer will default on the
loan or not. In this system there are two major actors: one is the Financial Analyst
and the other is the Bank Employee. The Financial Analyst is responsible for
analyzing financial data with the already trained system. The Bank Employee adds
the customer's financial data to be further examined by the Financial Analyst.
Chapter 2
LITERATURE REVIEW
This chapter will include all of your work before starting the core of your report:
what you studied, and why you studied that particular article/paper or book.
Linear regression is an algorithm that helps the user model the relationship between
two variables, and this is done by fitting a linear equation to observed data. One of
the variables, usually on the x-axis, is considered the explanatory variable, and the
other (on the y-axis) is considered the dependent variable, as shown in the diagram.
The dependent variable is the variable whose value one needs to forecast, whereas
the explanatory variable is the variable that explains the other; it is also called the
independent variable and is denoted by X. There are basically two applications of
linear regression: one is testing whether there is a statistically significant relationship
between two variables or not, and the other is establishing a predictive relationship
between the two variables. The first application is used to forecast unobserved
values, for example what a price will be. The second application, as explained
earlier, is used to show the statistical relationship between two variables, for
example between an increase in the sin tax and the use of cigarette packs. Linear
regression can be explained as the line of best fit.
Y = A + BX
or, equivalently,
Y = B0 + B1·X
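As an illustration of fitting this line, here is a minimal sketch on made-up data (not
taken from the project):

```python
import numpy as np

# Illustrative data: X is the explanatory variable, Y the dependent one.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# polyfit with degree 1 returns the slope B and intercept A of Y = A + B*X.
B, A = np.polyfit(X, Y, 1)
print(f"intercept A = {A:.3f}, slope B = {B:.3f}")

# The fitted line can then forecast unobserved values of Y.
print("forecast at X=6:", A + B * 6)
```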
In the equation above, A is the intercept: if A increases, so does the value of Y for
every X. B is the slope or gradient: as B increases, the line rotates upward. Since this
project is basically based on a classification problem, some problems of this type
cannot be tackled by an algorithm like linear regression. In linear regression, the
processed output of the data set is divided into two sets, and this is done by setting a
midpoint value, as shown in the figure given below.
In the graph shown above, E(Y) is the midpoint, in other words the threshold of the
classifier. In this case the threshold is about 12.5: any value below it is considered
negative, corresponding to No, whereas values above the threshold are considered
positive. But this type of regression model is not applicable to all scenarios, which is
why the linear regression algorithm is not used in the project under development.
One of the reasons linear regression is not used for the system under discussion is
that it gives values larger than one (i.e. > 1) and smaller than zero (i.e. < 0), whereas
here we need the values in the form of 0 and 1, so that they can be assigned to default
and no-default; the values should be between 0 and 1. The second reason for not
using this algorithm is that there exists another scenario which leads to further
classification, namely multi-class classification. The scenario is demonstrated in the
graph given below.
Figure 2.5 (Geographical Recognition)
Applications of KNN include geographic recognition, climate forecasting and stock
market forecasting. Financial applications of KNN are shown in the figure that
follows.
Decision Tree:
Over-fitting
Decision tree learners can create overly complex tree structures that do not
generalize the data well.
Variance
Sometimes the tree becomes unstable; this mostly happens when a small
change in the data results in a completely different tree.
Biased Trees
The algorithm sometimes creates biased trees when some classes dominate,
which is why it is recommended to balance the data before processing it with
this algorithm.
K Nearest Neighbor
Data Sparsity
False Intuition
Large Data Storage Requirement
Low Computation Efficiency
Random Forest
Random forest is not as good for regression problems as it is for
classification problems, because it sometimes over-fits the regression data,
usually when the data is noisy.
It can be considered a black-box approach for statistical data, because in
most cases one has no control over what the model does internally.
Logistic Regression:
Limited Outcome Variables
Independent Variables Required
Over-fitting the Variables
2.5 Summary
In this chapter the literature review of the project is given. The first thing discussed
is the related technology: linear regression, how it works, and its advantages and
disadvantages. After that, the related projects are discussed; a few projects are
covered according to the algorithms used in this project. The third thing is the studies
that have been done on the technologies used in the project, and the last is the
limitations of the models used in this project.
Chapter 3
In this chapter, you will discuss in detail all the tools used in your work. This
includes hardware, software and simulation tools, or anything else which aided your
project. If multiple hardware/software tools are used, use subheadings and go into
detail on each one of them.
HDD: 256GB
personal computer on which this system was tested.
SYSTEM MANUFACTURER: Hewlett-Packard
SYSTEM MODEL: Hewlett-Packard Folio 9470m Ultrabook
HDD: 500GB
The specifications of the primary personal computer's operating system are shown in
table 3.3.
DEVELOPER/MANUFACTURER: Microsoft
OS BUILD: 17134.165
DEVELOPER/MANUFACTURER: Microsoft
EDITION: Windows 10 Home
VERSION: 1803
OS BUILD: 17134.165
PROCESSOR: 800 MHz Intel Pentium III or equivalent
MEMORY: 512 MB
3.2.1.4 PyCharm
Table 3.6 lists the recommended system requirements for using PyCharm 2017.1.3.
OPERATING SYSTEM: Windows 8 or higher
PROCESSOR: 800 MHz Intel Pentium III or equivalent
DISK SPACE: 2 GB (SSD recommended)
requirements for using PyCharm 2017.1.3 are listed below in table 3.7.
OPERATING SYSTEM: Windows 8 or higher
PROCESSOR: Any Intel or AMD x86-64 processor
MEMORY: 4 GB
MEMORY: 1, 2 GB
Table 3.10 lists the recommended system requirements for Python 2.7 or 3.6.
OPERATING SYSTEM: Windows 8 or higher
PROCESSOR: Any Intel or AMD x86-64 processor
3.3 Summary
In this chapter the detailed minimum and recommended system requirements of the
tools are given, along with a comparison of the hardware used against the
recommended hardware settings of the different software tools. The specifications of
the system on which this project was developed are discussed in detail, along with
the specifications of the system on which it was tested. The tables above give a brief
comparison between the minimum and recommended requirements of the tools and
systems.
Chapter 4
METHODOLOGIES
In this chapter, the first thing that will be discussed is the design of the project. Since
it is a software project, the design will be demonstrated by diagrams such as the
system sequence diagram, the use case diagram, etc. Right after that, the algorithms
used in this project will be discussed, along with the hardware, the analysis
procedures and the implementation procedure.
going to default or not next month.
Use case: FINANCIAL ANALYST - Obtain Results
The second feature that will be provided to the end user is the financial credibility
check, in which the credibility of the borrower is checked. In this process the
previous data of the client will be examined by the developed system, which will
then show as output whether the client is going to default or not.
4.1.2 Algorithms
The algorithms used in the development of the project under discussion are as
follows:
Logistic Regression
Decision Tree
K Nearest Neighbor
Random Forest
4.3.2.1 Algorithms
17
4.3.2.1.1 Logistic Regression
The logistic regression hypothesis is defined as hθ(x) = g(θᵀx), where
g(z) = 1 / (1 + e^(−z)) is the sigmoid (logistic) function.
From the figure given above, one can observe that the gradient here is quite similar
to the linear regression gradient, because the graph demonstrates the logistic linear
equation. The difference between linear regression and logistic regression is that
logistic regression has a different formula for h(x), so with logistic regression one
can also demonstrate non-linear and complex equations; this is done by using
high-order polynomials. The coefficients of the logistic function are estimated by
using one of the following two descents:
Gradient Descent
Stochastic Gradient Descent
A minimal sketch of such an estimation is given below.
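The following is an illustrative sketch of estimating the logistic coefficients by batch
gradient descent on toy data; it is not the project's own training code:

```python
import numpy as np

def sigmoid(z):
    # Logistic function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=1000):
    # Estimate coefficients theta by batch gradient descent.
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend intercept column
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)
        theta -= lr * grad
    return theta

# Toy data: a single feature separating class 0 from class 1.
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
theta = fit_logistic(X, y)
print("theta:", theta)  # predicted probability = sigmoid(theta[0] + theta[1] * x)
```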
4.3.2.1.2 Decision Tree
The decision tree model is also called CART:
C : Classification
A : And
R : Regression
T : Tree
The tree is a flow-chart-like structure where each internal node represents a test,
each branch represents the outcome of a test, each leaf or terminal node holds a class
label, and the top-most node is called the root node, as shown in the figure given
below.
This is used to explicitly and visually represent decisions and decision making, and
these features were the reason for introducing this algorithm. The decision tree is
easy to interpret, visualize and understand. One of its notable features is that it
implicitly performs variable screening and feature selection. Since our data set also
contains categorical data, this algorithm is a good fit, because it can handle
categorical data and non-linear relationships between the parameters.
The disadvantages of the decision tree are as follows:
Over-fitting
Decision tree learners can create overly complex tree structures that do not
generalize the data well.
Variance
Sometimes the tree becomes unstable; this mostly happens when a small change
in the data results in a completely different tree.
Biased Trees
The algorithm sometimes creates biased trees when some classes dominate, which is
why it is recommended to balance the data before processing it with this algorithm.
The decision tree is drawn upside down, with its root at the top and the leaves at the
bottom of the tree.
There are two types of decision trees: the classification tree and the regression tree.
Classification Tree:
This type of decision tree is used when the dependent variable is categorical. In
classification trees the value assigned by a terminal node, or class, is usually the
mode of the training values falling in that region. The splitting process does not stop
until the stopping criteria are reached, which results in a fully grown tree.
Regression Tree:
This type of tree is used when and only when the dependent variable is continuous.
The value assigned by a terminal node is usually the mean or average response of the
training data falling in that region. The splitting process does not stop until the
stopping criteria are reached, which results in a fully grown tree.
Classification Tree & Regression Tree :
The fully grown trees are mostly ends up over fitting the data which leads to poor
accuracy on the unseen data. The scenario is tackled by using the technique called
PRUNING. Following figure demonstrate the example of the titanic and possibility of
a person‘s survival.
Growing a tree consists of the following steps (a minimal training sketch follows the
list):
Choosing which features of the dataset to split on
The condition for splitting
Knowing when to stop splitting
Pruning
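A minimal sketch of growing a depth-limited tree with scikit-learn's
DecisionTreeClassifier follows; the feature values are illustrative, not the project's
data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy feature matrix (e.g. credit limit, age) and default labels.
X = np.array([[20000, 24], [120000, 26], [90000, 34],
              [50000, 37], [50000, 57], [500000, 29]])
y = np.array([1, 1, 0, 0, 0, 0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# max_depth and min_samples_leaf limit tree growth, a simple form of pre-pruning.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=1, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```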
4.3.2.1.3 K-Nearest Neighbor
K-Nearest Neighbor is used in pattern recognition. In KNN, objects are classified
based on the closest training examples in the feature space. KNN is also considered a
type of instance-based learning, or lazy learning; in this type of learning all the
computation is delayed until the classification is done. KNN is one of the
fundamental, basic techniques, used where the machine has no prior or very little
knowledge about the distribution of the data. In KNN, K is the number of nearest
neighbors used by the classifier to predict the output. Let's take an example from the
well-known series Game of Thrones.
In this example we try to determine whether an unknown person is Dothraki or
Westerosi, as shown in the figure given below. The people of the Dothraki clan have
muscular mass, whereas the Westerosi clan has wealth, treasures and riches, so here
the muscular mass, wealth, treasure and riches are the variables.
Figure 4.8 (KNN Game of Thrones example)
Since 4 of the neighbors are from the Dothraki clan, the prediction will be that the
unknown person is from the Dothraki clan, as shown in the figure given below.
The crux of KNN is to ask who my neighbors are and which class they belong to; to
avoid a draw in the votes, K should be odd. The following distance metrics can be
used in KNN:
Euclidean distance
Hamming distance
Manhattan distance
Minkowski distance
Chebyshev distance
Details about these types of distances are demonstrated in the figure given below; a
small sketch of the most common ones follows.
Figure 4.10 (KNN Canberra and Euclidean Distance Formulas)
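The listed metrics can be sketched in a few lines of Python (an illustration, not
project code):

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def chebyshev(a, b):
    return np.max(np.abs(a - b))

def minkowski(a, b, p):
    # p=1 gives Manhattan distance, p=2 gives Euclidean distance.
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b), manhattan(a, b), chebyshev(a, b), minkowski(a, b, 2))
```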
There is a very basic need in KNN to determine the value of K, which becomes hard
in high dimensions; the following figure demonstrates the issues that arise with
high-dimensional data.
Figure 4.12 (KNN disadvantages)
It is hard to decide in KNN which distance technique and which attributes should be
used to get the best results.
The computation cost is high.
4.3.2.1.4 Random Forest
Random forest is one of the best, most powerful and most frequently used algorithms
in supervised machine learning. Random forest is capable of solving both regression
and classification problems. As the name suggests, the algorithm creates a forest of a
number of decision trees. As the number of trees increases, so does the robustness of
the prediction, which in return increases the accuracy over a single decision tree. The
multiple decision trees can be built using algorithms such as:
GINI approach
Information gain
Other decision tree algorithms
The working of the random forest can be explained as growing multiple trees instead
of a single tree in the CART model. In a classification problem, to classify a new
object based on its attributes, each tree gives a classification and we say that the tree
votes for that class. The forest chooses the classification having the most votes over
all the trees, as shown in the figure given below; a minimal sketch follows.
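A minimal sketch of this voting ensemble with scikit-learn's RandomForestClassifier,
on synthetic stand-in data rather than the project's data set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data set (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# n_estimators is the number of trees voting; criterion="gini" selects the GINI
# approach, while criterion="entropy" would select information gain instead.
forest = RandomForestClassifier(n_estimators=100, criterion="gini", random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```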
4.3.3 Details about control etc.
4.4.1 Login
The first function that the user performs is the login function. This function provides
security for the bank's data; in other words, it prevents unauthorized access. The
system displays the following window.
Rows that appear twice in the data set are also dropped. These are a few key points
about data cleaning. This step is not elaborated because it was not involved in the
project under discussion: the data set was taken from Kaggle and was already clean.
The data set is shown in the figure given below (a sketch of typical cleaning steps
follows the figure caption).
Figure 4.15 (Cleaned Data Sets)
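A sketch of typical cleaning steps in pandas; the file name is hypothetical, since the
actual data set arrived already cleaned from Kaggle:

```python
import pandas as pd

# Hypothetical file name standing in for the Kaggle data set.
df = pd.read_csv("UCI_Credit_Card.csv")

# Typical cleaning steps: drop exact duplicate rows and rows with missing values.
df = df.drop_duplicates()
df = df.dropna()
print(df.shape)
```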
Feature extraction is one part of dimensionality reduction; the other part of
dimensionality reduction is feature selection. Feature selection consists of the
following approaches:
Wrappers
Filters
Embedded
A + B + C = AD
In the equation given above, let's say C = 0, so the equation can be written as:
A + B = AD
So one can say that feature selection can be defined as the selection of the relevant
data and the dropping of the data that is irrelevant. Ten columns were selected for the
training model. The status of the loan borrowers tells us about the current state of the
loan payment or repayment.
Figure 4.16 (Feature Selection)
The pandas corr function was used for finding the correlation of all 25 columns with
the output column, i.e. Default; this is shown in the figure given below, and a small
sketch of the step follows.
After dropping the irrelevant columns, we end up with eleven columns of features, as
shown in the figure given below.
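A sketch of the correlation step in pandas; the file name and target column name are
assumptions based on the Kaggle credit-default data set, not confirmed by the report:

```python
import pandas as pd

df = pd.read_csv("UCI_Credit_Card.csv")  # hypothetical file name

# Correlation of every column with the output column; the target column name
# is an assumption, not taken from the report.
corr = df.corr()["default.payment.next.month"]
print(corr.sort_values(ascending=False))

# Weakly correlated columns can then be dropped.
df = df.drop(columns=["ID"])  # illustrative example of an irrelevant column
```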
After the feature extraction, the extracted feature values will be entered by the user
into the designed interface, as shown in the figure given below.
In machine learning, while training the system with supervised learning, we pass
labeled input to the model and the model in return gives the predicted output. For
example, in an image classifier we may pass labeled images as input, as shown in the
figure.
Figure 4.18 (Categorical Data)
Most algorithms do not support this type of input, so it is converted into numeric
data. Data that is not in numeric form is called categorical data, and the conversion
of categorical data into numeric data is called one-hot encoding. The categorical data
in the project under discussion is shown in the figure given below.
There are four columns which contain categorical data, as shown in the figure given
above. Conversion of the education column's categorical data is shown in the figure
given below.
Conversion of the gender column's categorical data is shown in the figure given
below.
Conversion of the marriage status column's categorical data is shown in the figure
given below.
Conversion of the default column's categorical data is shown in the figure given.
After the one-hot encoding, we end up with 13 columns, as shown in the figure given
below; a sketch of this step follows.
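A sketch of one-hot encoding with pandas get_dummies; the file name and the
categorical column names (SEX, EDUCATION, MARRIAGE) are assumptions, not
confirmed by the report:

```python
import pandas as pd

df = pd.read_csv("UCI_Credit_Card.csv")  # hypothetical file name

# One-hot encode the categorical columns; column names are assumptions.
df = pd.get_dummies(df, columns=["SEX", "EDUCATION", "MARRIAGE"])
print(df.columns.tolist())
```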
The output column of the data set that is fed to the system while training should be
balanced, otherwise it affects the training of the system. In the case of the system
under development there was an imbalance, which had the shape shown in the figure
given below.
Figure 4.24 (Imbalanced Classes Graphical Presentation)
The numeric description is shown in the figure given below.
To overcome this problem, SMOTE was used. After the balancing is done, the
numeric shape is shown in the figure given below.
Balancing is done in Python, using the PyCharm IDE, as shown in the figure given
below; a minimal sketch follows.
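A minimal sketch of balancing with SMOTE from imbalanced-learn, on synthetic
stand-in data rather than the project's data set:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced data standing in for the loan data set.
X, y = make_classification(n_samples=1000, weights=[0.78, 0.22], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class samples until the classes are balanced.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))
```

Note that SMOTE should only be applied to the training split, never to the test data,
to avoid leaking synthetic samples into evaluation.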
4.4.6 Scaling Data
Before feeding the data to the system, one of the most important steps is data scaling.
In data scaling, the ranges of the features are equalized. Let's take an example.
In the figure given above, if we consider the first column, the range of the features is
14.4 to 56.5, whereas in the case of the second column the range is 4.5 to 8.41. This
kind of range difference can cause bad training of the system, therefore one has to
scale the data before feeding it to the system. In the case of the system developed,
scaling is done using the Robust Scaler, and its output is demonstrated in figure 4.29
below; a minimal sketch of this step follows.
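A minimal sketch of applying scikit-learn's RobustScaler (the values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Two features with very different ranges (values illustrative).
X = np.array([[56.5, 4.50],
              [30.2, 6.30],
              [14.4, 8.41]])

# RobustScaler centers on the median and scales by the interquartile range,
# which makes it less sensitive to outliers than standard scaling.
X_scaled = RobustScaler().fit_transform(X)
print(X_scaled)
```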
Figure 4.29 (scaled data)
The train/test split divides the data into two parts, training data and test data. The
training set contains the known output, which the model learns from; the output is
then removed from the testing data set. A minimal sketch of this process is given
below.
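A minimal sketch of such a split with scikit-learn, on synthetic stand-in data rather
than the project's own listing:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data

# Hold out 20% of the rows as the test set; shuffling (the default) randomizes
# which rows land in each split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)
```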
The reason for this process is to avoid over-fitting and under-fitting; the graphical
demonstration is shown in the figure given below.
The process has drawbacks if it is not randomized, and the randomization is
described in the figure given below.
The figure shown above can be described as follows. Let's say there are 20 examples,
divided into 5 sets of 4 examples each; these 5 sets are called K folds and the process
is called cross-validation. The 4 held-out examples are first placed at the right-most
side, as shown in the first slot of the figure, and then the allocation of the split
examples changes while training, according to the figure shown above. A minimal
sketch of this process follows.
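A minimal sketch of 5-fold cross-validation with scikit-learn, mirroring the
20-examples-in-5-folds description above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=20, random_state=0)  # 20 examples, as above

# cv=5 splits the data into 5 folds of 4 examples; each fold takes one turn as
# the held-out set while the other 4 folds are used for training.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```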
4.4.8 Models
The following models were used to predict whether the user is going to default on
the loan that has been sanctioned to him/her in the next month:
• Logistic Regression
• Decision Tree
• K Nearest Neighbor
• Random Forest
efficient, so it would not be false to say that both are somewhat similar. However, the
predicted output of logistic regression, unlike linear regression, is transformed by
using a non-linear function; this function is called the logistic function, sigmoid
function or logit function. Unlike linear regression, it provides output between 0 and
1. Following is the input of the logistic regression:
After applying the logistic regression model to the input, the accuracy of the output
is shown in the figure given below.
The ROC curve is another output of applying the model to the input; a sketch of
computing both metrics follows.
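A sketch of computing accuracy and ROC outputs for a fitted logistic regression
model, on synthetic stand-in data rather than the project's pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy on the held-out test set.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# ROC curve points and area under the curve, from predicted probabilities.
proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, proba)
print("AUC:", roc_auc_score(y_test, proba))
```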
Figure 4.36 (Logistic Regression Graphical Representation)
Decision Tree
Another algorithm that is frequently used for classification problems in supervised
learning is the decision tree. The decision tree model is also called CART:
C : Classification
A : And
R : Regression
T : Tree
The tree is a flow-chart-like structure where each internal node represents a test,
each branch represents the outcome of a test, each leaf or terminal node holds a class
label, and the top-most node is called the root node. Following is the input of the
decision tree:
Figure 4.37 (Decision Tree Input)
After applying the decision tree model to the input, the accuracy of the output is
shown in the figure given below.
The ROC curve is another output of applying the model to the input.
Figure 4.39 (Decision Tree ROC)
As explained earlier, random forest is one of the best, most powerful and most
frequently used algorithms in supervised machine learning. It is capable of solving
both regression and classification problems. As the name suggests, the algorithm
creates a forest of a number of decision trees. As the number of trees increases, so
does the robustness of the prediction, which in return increases the accuracy over a
single decision tree. The multiple decision trees can be built using algorithms such
as:
GINI approach
Information gain
Other decision tree algorithms
Following is the input of the random forest:
After applying the random forest model to the input, the accuracy of the output is
shown in the figure given below.
The ROC curve is another output of applying the model to the input.
Figure 4.42 (Random Forest ROC)
K-Nearest Neighbor is used in pattern recognition. In KNN, objects are classified
based on the closest training examples in the feature space. KNN is also considered a
type of instance-based learning, or lazy learning; in this type of learning all the
computation is delayed until the classification is done. KNN is one of the
fundamental, basic techniques, used where the machine has no prior or very little
knowledge about the distribution of the data. In KNN, K is the number of nearest
neighbors used by the classifier to predict the output. Following is the input of the K
nearest neighbor model:
After applying the K nearest neighbor model to the input, the accuracy of the output
is shown in the figure given below.
Figure 4.44 (KNN Accuracy)
The ROC curve is another output of applying the model to the input.
4.5 Details about simulation / mathematical modeling
Mathematical equations were used, but no modeling was required during the whole
procedure of the development.
4.6 Summary
In this chapter, the first thing explained is the diagrams regarding the project, like the
data flow diagram and the use case diagram. After that, the algorithms used in this
project are discussed; these include logistic regression, decision tree, KNN and
random forest. Thirdly, the implementation process is discussed, explaining how the
development of the project progressed. Fourthly, the verification of the functionality
implemented in this project is covered.
Chapter 5
SYSTEM TESTING
Testing of the software ensures whether the required functionality has been
developed or not. Testing was completed in different phases, at the completion of
every unit, before launching the next phase. Therefore all of the functionality of the
system was tested, leaving little chance of errors remaining in the system.
available on any OS, e.g. Google Chrome, Safari etc. As our project is a desktop
application, the only platform it can run on is Windows. The reason for making the
system support the Windows platform is that other platforms are not as easy to use.
Test Case #: 1
Test Case #: 2
Test Case #: 3
Test Case #: 4
Test Case #: 5
Test Case #: 6
Test Case #: 7
Test Case #: 8
Test Case #: 9
Test Case #: 10
Test Case #: 11
Test Case #: 12
Test Case #: 13
Test Case #: 14
Test Case #: 15
Test Case #: 16
Chapter 6
RESULTS AND CONCLUSION
In this chapter, you will explain all the results you achieved after completing all that
you explained in the previous chapter. Try to find a balance while explaining your
results: neither make your project/work look worthless in case you were unable to
achieve the goals identified, nor claim to have solved all the problems in the world
by the results you have achieved. Take a step-by-step approach, as identified in the
section headings below.
6.1 Presentation of the findings
6.1.1 Hardware results
As explained earlier, hardware was used, but its purpose was to produce software
results rather than hardware ones; so one can say that there were no hardware results,
and all the results were software results, which are explained in the section below.
6.1.2 Software results
The system passes through 8 steps before completion. The findings of this system are
discussed with the figures given below.
6.1.2.1 Data Cleaning
Since the data sets were taken from Kaggle and were already clean, one can say that
the output of this functionality is the data set we started working on, as shown in the
figure given below.
6.1.2.2 Feature Extraction
The output of the feature extraction is shown in the figure below.
Figure 6.4 (Data Balancing output)
6.1.2.5 Scaling Data
Once scaling is done, the result generated is shown in the figure given below.
Accuracy:
Figure 6.7 (Decision Tree ROC Output)
6.1.2.9 Random Forest
ROC:
Accuracy:
6.2 Discussion of the findings
6.2.1 Comparison with initial GOAL
The comparison between the initial goals and the goals achieved at the end of the
project is explained below.
Initial Goal:
The initial goal was to get cleaned data, because in machine learning the cleaning of
data is extremely important; without it one cannot proceed.
End Goal:
Initial Goal:
The initial goal was to extract the features and to drop the irrelevant data.
End Goal:
Initial Goal:
The initial goal was to convert the categorical data to numerical data.
End Goal:
Initial Goal:
End Goal:
Initial Goal:
The initial goal was to get well-scaled data.
End Goal:
Initial Goal:
The functionality described above has great importance for perfect training.
End Goal:
Since there was no functionality that we could not achieve in this project, one can
say that there are no shortcomings.
6.3 Limitations
The project under discussion had no limitations in the case where the random forest
model is used.
6.4 Recommendations
The recommendation is that one should not use any other approach to develop this
project, although an extension can be made: this project should be integrated with a
bank management system.
6.5 Summary
In this chapter, the first thing discussed is the software results; under this point every
result of each step is discussed. After that, the findings of the project are discussed
along with the diagrams. Thirdly, the comparison with the initial goals is made, with
the shortcomings right after it. Then the limitations are explained, and in the end the
recommendations are elaborated.
Chapter 7
FUTURE WORK
The future work that can be done on this project is to integrate it with a bank
management system. Other work that can be done is time series analysis using
several years of loan data, for the sake of predicting when a client is going to default.
Further analysis can be done on predicting the approximate interest rate a loan
applicant can expect for his profile if his loan is approved. This can be useful for
loan applicants, since some banks approve loans but give very high interest rates to
the customer. It would give customers a rough insight into the interest rates they
should be getting for their profile, and make sure they don't end up paying much
more in interest to the bank. An application can be built which takes various inputs
from the user, like employment length, salary, age, marital status, SSN, address, loan
amount, loan duration etc., and gives a prediction of whether their loan application
will be approved by the banks or not, along with an approximate interest rate.
REFERENCES
APPENDICES
Appendix – A