Loan Default Prediction
by
Muhammad Manqaad Faheem and Muhammad Jazib Hussain
All rights reserved. Reproduction in whole or in part in any form requires the prior
written permission of Muhammad Manqaad Faheem and Muhammad Jazib Hussain,
or designated representative.
DECLARATION
CERTIFICATE OF APPROVAL
It is certified that the project titled "Loan Default Prediction", carried out by
Muhammad Manqaad Faheem, Reg. No. SEU-XF15-110, and Muhammad Jazib
Hussain, Reg. No. SEU-XF15-112, under the supervision of Dr. Usama Khalid at The
University of Lahore, Islamabad, is fully adequate, in scope and in quality, as a final
year project for the degree of BS in Software Engineering.
Supervisor: -------------------------
Dr. Syed Usama Khalid
Assistant Professor
Dept. of CS & IT
The University of Lahore, Islamabad
HOD: -------------------------
Dr. Syed M. Jawad Hussain
Head of Department
Dept. of CS & IT
The University of Lahore, Islamabad
ACKNOWLEDGMENT
This page is intended to thank your supervisor, co-supervisor and all those (students,
teachers, TA/SA or any third party) who directly helped you in the completion of
the project/thesis.
ABSTRACT
The abstract is the most important part of a project report: it will be read ten or
twenty times more than any other words in the report. So, to make a positive
impression, or just to convey information, this is where to really pay attention to
writing. The purpose of an abstract is not just to tell the reader what was done: it is to
tell him/her what was done in the simplest, most informative way possible. Making
the abstract understandable for a non-technical person should be the first priority.
Discussed below are the basic components of an abstract in any discipline; each
should be handled in a separate paragraph.
The first paragraph should be about the motivation/problem statement: Why do you
care about the problem? What practical, scientific or theoretical gap is your
research/project filling?
TABLE OF CONTENTS
4.3 Implementation procedure
4.3.1 Details about hardware
4.3.2 Details about software/algorithms
4.3.3 Details about control etc.
4.4 Verification of functionalities
4.5 Details about simulation/mathematical modeling
4.6 Summary
Chapter 5
SYSTEM TESTING
5.1 Objective Testing
5.2 Usability Testing
5.3 Software Performance Testing
5.4 Compatibility Testing
5.5 Load Testing
5.6 Security Testing
5.7 Installation Testing
5.8 Test Cases
Chapter 6
RESULTS AND CONCLUSION
6.1 Presentation of the findings
6.1.1 Hardware results
6.1.2 Software results
6.2 Discussion of the findings
6.2.1 Comparison with initial goal
6.2.2 Reasoning for shortcomings
6.3 Limitations
6.4 Recommendations
6.5 Summary
Chapter 7
FUTURE WORK
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS
Chapter 1
INTRODUCTION
1.1 Overview
In the present era, hardly any business can prosper without the help of banks, and in
providing loans to businesses, banks expose themselves to credit risk. In short, banks
face the uncertainty of whether a borrower is going to pay back the borrowed amount
within the fixed period or not. To resolve this issue, a data mining classification
algorithm is taken into consideration. Through this algorithm the system can build a
classification model using the relevant personal information and past consumption
data of loan applicants, and find out the characteristics of risky customers. The
technique is called supervised learning: the previous six months of data are examined
by the system, and the system gives as output whether the user is going to default or
not in the next month.
1.3 Purpose of the research/project
Banks play a vital role in a market economy. The success or failure of an organization
largely depends on its ability to evaluate credit risk. Before giving a credit loan to a
borrower, the bank decides whether the borrower is bad (a defaulter) or good (a
non-defaulter). Predicting borrower status, i.e. whether the borrower will be a
defaulter or a non-defaulter in the future, is a challenging task for any organization or
bank. Basically, loan defaulter prediction is a binary classification problem: the loan
amount and the customer's history govern his creditworthiness for receiving a loan,
and the problem is to classify the borrower as a defaulter or a non-defaulter.
However, developing such a model is a very challenging task due to the increasing
demand for loans. A prototype of the model is described in this report which can be
used by organizations for making the correct decision to approve or reject a
customer's loan request. Loan prediction is very helpful for bank employees as well
as for the applicant. The Loan Prediction System can automatically calculate the
weight of each feature taking part in loan processing, and on new test data the same
features are processed with respect to their associated weights. A time limit can be
set for the applicant to check whether his/her loan can be sanctioned or not.
Chapter 2 LITERATURE REVIEW: In this chapter a detailed literature review is
given. Related technology, related projects and related studies are discussed.
Chapter 5 SYSTEM TESTING: This chapter is all about testing. Various tests
were performed on the system, including objective testing, usability testing, software
performance testing, compatibility testing, load testing and security testing.
Chapter 7 FUTURE WORK: In this chapter we discuss what more can be done
from the point at which we ended our project.
1.6 Summary
Banks play a vital role in boosting the economy of any country by providing loans to
businesses. In doing so, banks expose themselves to the credit risk problem, because
it is hard to figure out whether a borrower is going to default or not. To resolve this
issue, a system called the loan predictor was developed. This loan predictor uses a
data mining technique called supervised learning, in which the system examines the
customer demographics and the payment behavior over the previous six months to
determine which customers will default on their loan next month. We use machine
learning to train on the data set and predict whether the customer will default on the
loan or not. In this system there are two major actors: one is the Financial Analyst
and the other is the Bank Employee. The Financial Analyst is responsible for
analyzing financial data with the already trained system. The Bank Employee adds
the customer's financial data to be further examined by the Financial Analyst.
Chapter 2
LITERATURE REVIEW
This chapter will include all of your work before starting the core of your report:
what you studied, and why you studied that particular article/paper or book.
Linear regression is an algorithm that helps the user model the relationship between
two variables, and this is done by fitting a linear equation to observed data. One of
the variables, usually on the x-axis, is considered the explanatory variable, and the
other (on the y-axis) is considered the dependent variable, as shown in the diagram.
The dependent variable is the variable whose value one needs to forecast, whereas
the explanatory variable is the variable that explains the other; it is also called the
independent variable and is denoted by X. There are basically two applications of
linear regression: one is testing whether there is a statistically significant relationship
between two variables or not, and the other is establishing a predictive relationship
between the two variables. The first application is used to forecast unobserved
values, for example what a price will be. The second application, as explained
earlier, is used to show the statistical relationship between two variables, for
example between an increase in the sin tax and the use of cigarette packs. Linear
regression can be explained as the line of best fit.
Y = A + BX
or, equivalently,
Y = B0 + B1·X
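As an illustration of fitting this line, here is a minimal sketch on made-up data (not
taken from the project):

```python
import numpy as np

# Illustrative data: X is the explanatory variable, Y the dependent one.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# polyfit with degree 1 returns the slope B and intercept A of Y = A + B*X.
B, A = np.polyfit(X, Y, 1)
print(f"intercept A = {A:.3f}, slope B = {B:.3f}")

# The fitted line can then forecast unobserved values of Y.
print("forecast at X=6:", A + B * 6)
```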
In the equation above, A is the intercept: if A increases, so does the value of Y for
every X. B is the slope or gradient: as B increases, the line rotates upward. Since this
project is basically based on a classification problem, some problems of this type
cannot be tackled by an algorithm like linear regression. In linear regression, the
processed output of the data set is divided into two sets, and this is done by setting a
midpoint value, as shown in the figure given below.
In the graph shown above, E(Y) is the midpoint, in other words the threshold of the
classifier. In this case the threshold is about 12.5: any value below it is considered
negative, corresponding to No, whereas values above the threshold are considered
positive. But this type of regression model is not applicable to all scenarios, which is
why the linear regression algorithm is not used in the project under development.
One of the reasons linear regression is not used for the system under discussion is
that it gives values larger than one (i.e. > 1) and smaller than zero (i.e. < 0), whereas
here we need the values in the form of 0 and 1, so that they can be assigned to default
and no-default; the values should be between 0 and 1. The second reason for not
using this algorithm is that there exists another scenario which leads to further
classification, namely multi-class classification. The scenario is demonstrated in the
graph given below.
Figure 2.5 (Geographical Recognition)
Applications of KNN include geographic recognition, climate forecasting and stock
market forecasting. Financial applications of KNN are shown in the figure that
follows.
Decision Tree:
Over-fitting
Decision tree learners can create overly complex tree structures that do not
generalize the data well.
Variance
Sometimes the tree becomes unstable; this mostly happens when a small
change in the data results in a completely different tree.
Biased Trees
The algorithm sometimes creates biased trees when some classes dominate,
which is why it is recommended to balance the data before processing it with
this algorithm.
K Nearest Neighbor
Data Sparsity
False Intuition
Large Data Storage Requirement
Low Computation Efficiency
Random Forest
Random forest is not as good for regression problems as it is for
classification problems, because it sometimes over-fits the regression data,
usually when the data is noisy.
It can be considered a black-box approach for statistical data, because in
most cases one has no control over what the model does internally.
Logistic Regression:
Limited Outcome Variables
Independent Variables Required
Over-fitting the Variables
2.5 Summary
In this chapter the literature review of the project is given. The first thing discussed
is the related technology: linear regression, how it works, and its advantages and
disadvantages. After that, the related projects are discussed; a few projects are
covered according to the algorithms used in this project. The third thing is the studies
that have been done on the technologies used in the project, and the last is the
limitations of the models used in this project.
Chapter 3
In this chapter, you will discuss in detail all the tools used in your work. This
includes hardware, software and simulation tools, or anything else which aided your
project. If multiple hardware/software tools are used, use subheadings and go into
detail on each one of them.
HDD: 256GB
personal computer on which this system was tested.
SYSTEM MANUFACTURER: Hewlett-Packard
SYSTEM MODEL: Hewlett-Packard Folio 9470m Ultrabook
HDD: 500GB
The specifications of the primary personal computer's operating system are shown in
table 3.3.
DEVELOPER/MANUFACTURER: Microsoft
OS BUILD: 17134.165
DEVELOPER/MANUFACTURER: Microsoft
EDITION: Windows 10 Home
VERSION: 1803
OS BUILD: 17134.165
PROCESSOR: 800 MHz Intel Pentium III or equivalent
MEMORY: 512 MB
3.2.1.4 PyCharm
Table 3.6 lists the recommended system requirements for using PyCharm 2017.1.3.
OPERATING SYSTEM: Windows 8 or higher
PROCESSOR: 800 MHz Intel Pentium III or equivalent
DISK SPACE: 2 GB (SSD recommended)
requirements for using PyCharm 2017.1.3 are listed below in table 3.7.
OPERATING SYSTEM: Windows 8 or higher
PROCESSOR: Any Intel or AMD x86-64 processor
MEMORY: 4 GB
MEMORY: 1, 2 GB
Table 3.10 lists the recommended system requirements for Python 2.7 or 3.6.
OPERATING SYSTEM: Windows 8 or higher
PROCESSOR: Any Intel or AMD x86-64 processor
3.3 Summary
In this chapter the detailed minimum and recommended system requirements of the
tools are given, along with a comparison of the hardware used against the
recommended hardware settings of the different software tools. The specifications of
the system on which this project was developed are discussed in detail, along with
the specifications of the system on which it was tested. The tables above give a brief
comparison between the minimum and recommended requirements of the tools and
systems.
Chapter 4
METHODOLOGIES
In this chapter, the first thing that will be discussed is the design of the project. Since
it is a software project, the design will be demonstrated by diagrams such as the
system sequence diagram, the use case diagram, etc. Right after that, the algorithms
used in this project will be discussed, along with the hardware, the analysis
procedures and the implementation procedure.
going to default or not next month.
Use case: FINANCIAL ANALYST - Obtain Results
The second feature that will be provided to the end user is the financial credibility
check, in which the credibility of the borrower is checked. In this process the
previous data of the client will be examined by the developed system, which will
then show as output whether the client is going to default or not.
4.1.2 Algorithms
The algorithms used in the development of the project under discussion are as
follows:
Logistic Regression
Decision Tree
K Nearest Neighbor
Random Forest
4.3.2.1 Algorithms
17
4.3.2.1.1 Logistic Regression
The logistic regression hypothesis is defined as hθ(x) = g(θᵀx), where
g(z) = 1 / (1 + e^(−z)) is the sigmoid (logistic) function.
From the figure given above, one can observe that the gradient here is quite similar
to the linear regression gradient, because the graph demonstrates the logistic linear
equation. The difference between linear regression and logistic regression is that
logistic regression has a different formula for h(x), so with logistic regression one
can also demonstrate non-linear and complex equations; this is done by using
high-order polynomials. The coefficients of the logistic function are estimated by
using one of the following two descents:
Gradient Descent
Stochastic Gradient Descent
A minimal sketch of such an estimation is given below.
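The following is an illustrative sketch of estimating the logistic coefficients by batch
gradient descent on toy data; it is not the project's own training code:

```python
import numpy as np

def sigmoid(z):
    # Logistic function g(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=1000):
    # Estimate coefficients theta by batch gradient descent.
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend intercept column
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ theta) - y) / len(y)
        theta -= lr * grad
    return theta

# Toy data: a single feature separating class 0 from class 1.
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])
theta = fit_logistic(X, y)
print("theta:", theta)  # predicted probability = sigmoid(theta[0] + theta[1] * x)
```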
4.3.2.1.2 Decision Tree
The decision tree model is also called CART:
C : Classification
A : And
R : Regression
T : Tree
The tree is a flow-chart-like structure where each internal node represents a test,
each branch represents the outcome of a test, each leaf or terminal node holds a class
label, and the top-most node is called the root node, as shown in the figure given
below.
This is used to explicitly and visually represent decisions and decision making, and
these features were the reason for introducing this algorithm. The decision tree is
easy to interpret, visualize and understand. One of its notable features is that it
implicitly performs variable screening and feature selection. Since our data set also
contains categorical data, this algorithm is a good fit, because it can handle
categorical data and non-linear relationships between the parameters.
The disadvantages of the decision tree are as follows:
Over-fitting
Decision tree learners can create overly complex tree structures that do not
generalize the data well.
Variance
Sometimes the tree becomes unstable; this mostly happens when a small change
in the data results in a completely different tree.
Biased Trees
The algorithm sometimes creates biased trees when some classes dominate, which is
why it is recommended to balance the data before processing it with this algorithm.
The decision tree is drawn upside down, with its root at the top and the leaves at the
bottom of the tree.
There are two types of decision trees: the classification tree and the regression tree.
Classification Tree:
This type of decision tree is used when the dependent variable is categorical. In
classification trees the value assigned by a terminal node, or class, is usually the
mode of the training values falling in that region. The splitting process does not stop
until the stopping criteria are reached, which results in a fully grown tree.
Regression Tree:
This type of tree is used when and only when the dependent variable is continuous.
The value assigned by a terminal node is usually the mean or average response of the
training data falling in that region. The splitting process does not stop until the
stopping criteria are reached, which results in a fully grown tree.
Classification Tree & Regression Tree :
The fully grown trees are mostly ends up over fitting the data which leads to poor
accuracy on the unseen data. The scenario is tackled by using the technique called
PRUNING. Following figure demonstrate the example of the titanic and possibility of
a person‘s survival.
Growing a tree consists of the following steps (a minimal training sketch follows the
list):
Choosing which features of the dataset to split on
The condition for splitting
Knowing when to stop splitting
Pruning
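A minimal sketch of growing a depth-limited tree with scikit-learn's
DecisionTreeClassifier follows; the feature values are illustrative, not the project's
data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy feature matrix (e.g. credit limit, age) and default labels.
X = np.array([[20000, 24], [120000, 26], [90000, 34],
              [50000, 37], [50000, 57], [500000, 29]])
y = np.array([1, 1, 0, 0, 0, 0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# max_depth and min_samples_leaf limit tree growth, a simple form of pre-pruning.
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=1, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```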
4.3.2.1.3 K-Nearest Neighbor
K-Nearest Neighbor is used in pattern recognition. In KNN, objects are classified
based on the closest training examples in the feature space. KNN is also considered a
type of instance-based learning, or lazy learning; in this type of learning all the
computation is delayed until the classification is done. KNN is one of the
fundamental, basic techniques, used where the machine has no prior or very little
knowledge about the distribution of the data. In KNN, K is the number of nearest
neighbors used by the classifier to predict the output. Let's take an example from the
well-known series Game of Thrones.
In this example we try to determine whether an unknown person is Dothraki or
Westerosi, as shown in the figure given below. The people of the Dothraki clan have
muscular mass, whereas the Westerosi clan has wealth, treasures and riches, so here
the muscular mass, wealth, treasure and riches are the variables.
Figure 4.8 (KNN Game of Thrones example)
Since 4 of the neighbors are from the Dothraki clan, the prediction will be that the
unknown person is from the Dothraki clan, as shown in the figure given below.
The crux of KNN is to ask who my neighbors are and which class they belong to; to
avoid a draw in the votes, K should be odd. The following distance metrics can be
used in KNN:
Euclidean distance
Hamming distance
Manhattan distance
Minkowski distance
Chebyshev distance
Details about these types of distances are demonstrated in the figure given below; a
small sketch of the most common ones follows.
Figure 4.10 (KNN Canberra and Euclidean Distance Formulas)
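The listed metrics can be sketched in a few lines of Python (an illustration, not
project code):

```python
import numpy as np

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    return np.sum(np.abs(a - b))

def chebyshev(a, b):
    return np.max(np.abs(a - b))

def minkowski(a, b, p):
    # p=1 gives Manhattan distance, p=2 gives Euclidean distance.
    return np.sum(np.abs(a - b) ** p) ** (1.0 / p)

a, b = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(euclidean(a, b), manhattan(a, b), chebyshev(a, b), minkowski(a, b, 2))
```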
There is a very basic need in KNN to determine the value of K, which becomes hard
in high dimensions; the following figure demonstrates the issues that arise with
high-dimensional data.
Figure 4.12 (KNN disadvantages)
It is hard to decide in KNN which distance technique and which attributes should be
used to get the best results.
The computation cost is high.
4.3.2.1.4 Random Forest
Random forest is one of the best, most powerful and most frequently used algorithms
in supervised machine learning. Random forest is capable of solving both regression
and classification problems. As the name suggests, the algorithm creates a forest of a
number of decision trees. As the number of trees increases, so does the robustness of
the prediction, which in return increases the accuracy over a single decision tree. The
multiple decision trees can be built using algorithms such as:
GINI approach
Information gain
Other decision tree algorithms
The working of the random forest can be explained as growing multiple trees instead
of a single tree in the CART model. In a classification problem, to classify a new
object based on its attributes, each tree gives a classification and we say that the tree
votes for that class. The forest chooses the classification having the most votes over
all the trees, as shown in the figure given below; a minimal sketch follows.
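A minimal sketch of this voting ensemble with scikit-learn's RandomForestClassifier,
on synthetic stand-in data rather than the project's data set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data set (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# n_estimators is the number of trees voting; criterion="gini" selects the GINI
# approach, while criterion="entropy" would select information gain instead.
forest = RandomForestClassifier(n_estimators=100, criterion="gini", random_state=0)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```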
4.3.3 Details about control etc.
4.4.1 Login
The first function that the user performs is the login function. This function provides
security for the bank's data; in other words, it prevents unauthorized access. The
system displays the following window.
Rows that appear twice in the data set are also dropped. These are a few key points
about data cleaning. This step is not elaborated because it was not involved in the
project under discussion: the data set was taken from Kaggle and was already clean.
The data set is shown in the figure given below (a sketch of typical cleaning steps
follows the figure caption).
Figure 4.15 (Cleaned Data Sets)
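A sketch of typical cleaning steps in pandas; the file name is hypothetical, since the
actual data set arrived already cleaned from Kaggle:

```python
import pandas as pd

# Hypothetical file name standing in for the Kaggle data set.
df = pd.read_csv("UCI_Credit_Card.csv")

# Typical cleaning steps: drop exact duplicate rows and rows with missing values.
df = df.drop_duplicates()
df = df.dropna()
print(df.shape)
```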
Feature extraction is one part of dimensionality reduction; the other part of
dimensionality reduction is feature selection. Feature selection consists of the
following approaches:
Wrappers
Filters
Embedded
A + B + C = AD
In the equation given above, let's say C = 0, so the equation can be written as:
A + B = AD
So one can say that feature selection can be defined as the selection of the relevant
data and the dropping of the data that is irrelevant. Ten columns were selected for the
training model. The status of the loan borrowers tells us about the current state of the
loan payment or repayment.
Figure 4.16 (Feature Selection)
The pandas corr function was used for finding the correlation of all 25 columns with
the output column, i.e. Default; this is shown in the figure given below, and a small
sketch of the step follows.
After dropping the irrelevant columns, we end up with eleven columns of features, as
shown in the figure given below.
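A sketch of the correlation step in pandas; the file name and target column name are
assumptions based on the Kaggle credit-default data set, not confirmed by the report:

```python
import pandas as pd

df = pd.read_csv("UCI_Credit_Card.csv")  # hypothetical file name

# Correlation of every column with the output column; the target column name
# is an assumption, not taken from the report.
corr = df.corr()["default.payment.next.month"]
print(corr.sort_values(ascending=False))

# Weakly correlated columns can then be dropped.
df = df.drop(columns=["ID"])  # illustrative example of an irrelevant column
```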
After the feature extraction, the extracted feature values will be entered by the user
into the designed interface, as shown in the figure given below.
In machine learning, while training the system with supervised learning, we pass
labeled input to the model and the model in return gives the predicted output. For
example, in an image classifier we may pass labeled images as input, as shown in the
figure.
Figure 4.18 (Categorical Data)
Most algorithms do not support this type of input, so it is converted into numeric
data. Data that is not in numeric form is called categorical data, and the conversion
of categorical data into numeric data is called one-hot encoding. The categorical data
in the project under discussion is shown in the figure given below.
There are four columns which contain categorical data, as shown in the figure given
above. Conversion of the education column's categorical data is shown in the figure
given below.
Conversion of the gender column's categorical data is shown in the figure given
below.
Conversion of the marriage status column's categorical data is shown in the figure
given below.
Conversion of the default column's categorical data is shown in the figure given.
After the one-hot encoding, we end up with 13 columns, as shown in the figure given
below; a sketch of this step follows.
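A sketch of one-hot encoding with pandas get_dummies; the file name and the
categorical column names (SEX, EDUCATION, MARRIAGE) are assumptions, not
confirmed by the report:

```python
import pandas as pd

df = pd.read_csv("UCI_Credit_Card.csv")  # hypothetical file name

# One-hot encode the categorical columns; column names are assumptions.
df = pd.get_dummies(df, columns=["SEX", "EDUCATION", "MARRIAGE"])
print(df.columns.tolist())
```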
The output column of the data set that is fed to the system while training should be
balanced, otherwise it affects the training of the system. In the case of the system
under development there was an imbalance, which had the shape shown in the figure
given below.
Figure 4.24 (Imbalanced Classes Graphical Presentation)
The numeric description is shown in the figure given below.
To overcome this problem, SMOTE was used. After the balancing is done, the
numeric shape is shown in the figure given below.
Balancing is done in Python, using the PyCharm IDE, as shown in the figure given
below; a minimal sketch follows.
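A minimal sketch of balancing with SMOTE from imbalanced-learn, on synthetic
stand-in data rather than the project's data set:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic imbalanced data standing in for the loan data set.
X, y = make_classification(n_samples=1000, weights=[0.78, 0.22], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class samples until the classes are balanced.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after:", Counter(y_res))
```

Note that SMOTE should only be applied to the training split, never to the test data,
to avoid leaking synthetic samples into evaluation.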
4.4.6 Scaling Data
Before feeding the data to the system, one of the most important steps is data scaling.
In data scaling, the ranges of the features are equalized. Let's take an example.
In the figure given above, if we consider the first column, the range of the features is
14.4 to 56.5, whereas in the case of the second column the range is 4.5 to 8.41. This
kind of range difference can cause bad training of the system, therefore one has to
scale the data before feeding it to the system. In the case of the system developed,
scaling is done using the Robust Scaler, and its output is demonstrated in figure 4.29
below; a minimal sketch of this step follows.
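A minimal sketch of applying scikit-learn's RobustScaler (the values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Two features with very different ranges (values illustrative).
X = np.array([[56.5, 4.50],
              [30.2, 6.30],
              [14.4, 8.41]])

# RobustScaler centers on the median and scales by the interquartile range,
# which makes it less sensitive to outliers than standard scaling.
X_scaled = RobustScaler().fit_transform(X)
print(X_scaled)
```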
Figure 4.29 (scaled data)
The train/test split divides the data into two parts, training data and test data. The
training set contains the known output, which the model learns from; the output is
then removed from the testing data set. A minimal sketch of this process is given
below.
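A minimal sketch of such a split with scikit-learn, on synthetic stand-in data rather
than the project's own listing:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data

# Hold out 20% of the rows as the test set; shuffling (the default) randomizes
# which rows land in each split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)
```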
The reason for this process is to avoid over-fitting and under-fitting; the graphical
demonstration is shown in the figure given below.
The process has drawbacks if it is not randomized, and the randomization is
described in the figure given below.
The figure shown above can be described as follows. Let's say there are 20 examples,
divided into 5 sets of 4 examples each; these 5 sets are called K folds and the process
is called cross-validation. The 4 held-out examples are first placed at the right-most
side, as shown in the first slot of the figure, and then the allocation of the split
examples changes while training, according to the figure shown above. A minimal
sketch of this process follows.
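A minimal sketch of 5-fold cross-validation with scikit-learn, mirroring the
20-examples-in-5-folds description above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=20, random_state=0)  # 20 examples, as above

# cv=5 splits the data into 5 folds of 4 examples; each fold takes one turn as
# the held-out set while the other 4 folds are used for training.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores, scores.mean())
```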
4.4.8 Models
The following models were used to predict whether the user is going to default on
the loan that has been sanctioned to him/her in the next month:
• Logistic Regression
• Decision Tree
• K Nearest Neighbor
• Random Forest
efficient, so it would not be false to say that both are somewhat similar. However, the
predicted output of logistic regression, unlike linear regression, is transformed by
using a non-linear function; this function is called the logistic function, sigmoid
function or logit function. Unlike linear regression, it provides output between 0 and
1. Following is the input of the logistic regression:
After applying the logistic regression model to the input, the accuracy of the output
is shown in the figure given below.
The ROC curve is another output of applying the model to the input; a sketch of
computing both metrics follows.
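A sketch of computing accuracy and ROC outputs for a fitted logistic regression
model, on synthetic stand-in data rather than the project's pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Accuracy on the held-out test set.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# ROC curve points and area under the curve, from predicted probabilities.
proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, proba)
print("AUC:", roc_auc_score(y_test, proba))
```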
Figure 4.36 (Logistic Regression Graphical Representation)
Decision Tree
Another algorithm that is frequently used for classification problems in supervised
learning is the decision tree. The decision tree model is also called CART:
C : Classification
A : And
R : Regression
T : Tree
The tree is a flow-chart-like structure where each internal node represents a test,
each branch represents the outcome of a test, each leaf or terminal node holds a class
label, and the top-most node is called the root node. Following is the input of the
decision tree:
Figure 4.37 (Decision Tree Input)
After applying the decision tree model to the input, the accuracy of the output is
shown in the figure given below.
The ROC curve is another output of applying the model to the input.
Figure 4.39 (Decision Tree ROC)
As explained earlier, random forest is one of the best, most powerful and most
frequently used algorithms in supervised machine learning. It is capable of solving
both regression and classification problems. As the name suggests, the algorithm
creates a forest of a number of decision trees. As the number of trees increases, so
does the robustness of the prediction, which in return increases the accuracy over a
single decision tree. The multiple decision trees can be built using algorithms such
as:
GINI approach
Information gain
Other decision tree algorithms
Following is the input of the random forest:
After applying the random forest model to the input, the accuracy of the output is
shown in the figure given below.
The ROC curve is another output of applying the model to the input.
Figure 4.42 (Random Forest ROC)
K-Nearest Neighbor is used in pattern recognition. In KNN, objects are classified
based on the closest training examples in the feature space. KNN is also considered a
type of instance-based learning, or lazy learning; in this type of learning all the
computation is delayed until the classification is done. KNN is one of the
fundamental, basic techniques, used where the machine has no prior or very little
knowledge about the distribution of the data. In KNN, K is the number of nearest
neighbors used by the classifier to predict the output. Following is the input of the K
nearest neighbor model:
After applying the K nearest neighbor model to the input, the accuracy of the output
is shown in the figure given below.
Figure 4.44 (KNN Accuracy)
The ROC curve is another output of applying the model to the input.
4.5 Details about simulation / mathematical modeling
Mathematical equations were used, but no modeling was required during the whole
procedure of the development.
4.6 Summary
In this chapter, the first thing explained is the diagrams regarding the project, like the
data flow diagram and the use case diagram. After that, the algorithms used in this
project are discussed; these include logistic regression, decision tree, KNN and
random forest. Thirdly, the implementation process is discussed, explaining how the
development of the project progressed. Fourthly, the verification of the functionality
implemented in this project is covered.
Chapter 5
SYSTEM TESTING
Testing of the software ensures whether the required functionality has been
developed or not. Testing was completed in different phases, at the completion of
every unit, before launching the next phase. Therefore all of the functionality of the
system was tested, leaving little chance of errors remaining in the system.
available on any OS, e.g. Google Chrome, Safari etc. As our project is a desktop
application, the only platform it can run on is Windows. The reason for making the
system support the Windows platform is that other platforms are not as easy to use.
Test Case #: 1
Test Case #: 2
Test Case #: 3
Test Case #: 4
Test Case #: 5
Test Case #: 6
Test Case #: 7
Test Case #: 8
Test Case #: 9
Test Case #: 10
Test Case #: 11
Test Case #: 12
Test Case #: 13
Test Case #: 14
Test Case #: 15
Test Case #: 16
Chapter 6
RESULTS AND CONCLUSION
In this chapter, you will explain all the results you achieved after completing all that
you explained in the previous chapter. Try to find a balance while explaining your
results: neither make your project/work look worthless in case you were unable to
achieve the goals identified, nor claim to have solved all the problems in the world
by the results you have achieved. Take a step-by-step approach, as identified in the
section headings below.
6.1 Presentation of the findings
6.1.1 Hardware results
As explained earlier, hardware was used, but its purpose was to produce software
results rather than hardware ones; so one can say that there were no hardware results,
and all the results were software results, which are explained in the section below.
6.1.2 Software results
The system passes through 8 steps before completion. The findings of this system are
discussed with the figures given below.
6.1.2.1 Data Cleaning
Since the data sets were taken from Kaggle and were already clean, one can say that
the output of this functionality is the data set we started working on, as shown in the
figure given below.
6.1.2.2 Feature Extraction
The output of the feature extraction is shown in the figure below.
Figure 6.4 (Data Balancing output)
6.1.2.5 Scaling Data
Once scaling is done, the result generated is shown in the figure given below.
Accuracy:
Figure 6.7 (Decision Tree ROC Output)
6.1.2.9 Random Forest
ROC:
Accuracy:
6.2 Discussion of the findings
6.2.1 Comparison with initial GOAL
The comparison between the initial goals and the goals achieved at the end of the
project is explained below.
Initial Goal:
The initial goal was to get cleaned data, because in machine learning the cleaning of
data is extremely important; without it one cannot proceed.
End Goal:
Initial Goal:
The initial goal was to extract the features and to drop the irrelevant data.
End Goal:
Initial Goal:
The initial goal was to convert the categorical data to numerical data.
End Goal:
Initial Goal:
End Goal:
Initial Goal:
The initial goal was to get well-scaled data.
End Goal:
Initial Goal:
The functionality described above has great importance for perfect training.
End Goal:
Since there was no functionality that we could not achieve in this project, one can
say that there are no shortcomings.
6.3 Limitations
The project under discussion had no limitations in the case where the random forest
model is used.
6.4 Recommendations
The recommendation is that one should not use any other approach to develop this
project, although an extension can be made: this project should be integrated with a
bank management system.
6.5 Summary
In this chapter, the first thing discussed is the software results; under this point every
result of each step is discussed. After that, the findings of the project are discussed
along with the diagrams. Thirdly, the comparison with the initial goals is made, with
the shortcomings right after it. Then the limitations are explained, and in the end the
recommendations are elaborated.
Chapter 7
FUTURE WORK
The future work that can be done on this project is to integrate it with a bank
management system. Other work that can be done is time series analysis using
several years of loan data, for the sake of predicting when a client is going to default.
Further analysis can be done on predicting the approximate interest rate a loan
applicant can expect for his profile if his loan is approved. This can be useful for
loan applicants, since some banks approve loans but give very high interest rates to
the customer. It would give customers a rough insight into the interest rates they
should be getting for their profile, and make sure they don't end up paying much
more in interest to the bank. An application can be built which takes various inputs
from the user, like employment length, salary, age, marital status, SSN, address, loan
amount, loan duration etc., and gives a prediction of whether their loan application
will be approved by the banks or not, along with an approximate interest rate.
REFERENCES
APPENDICES
Appendix – A