Journal of Marmara for Pure and Applied Sciences, 18 (2002) 159-174


Marmara University, Printed in Turkey




CLASSIFICATION AND PREDICTION IN A DATA
MINING APPLICATION



SERHAT ÖZEKES¹ and A. YILMAZ ÇAMURCU²

¹ Istanbul Commerce University, Ragıp Gümüşpala Cad. No: 84 Eminönü 34378,
Istanbul - Turkey
² Marmara University, Technical Education Faculty, Electronics and Computer
Education Department, Göztepe, 34722 Kadıköy, Istanbul - Turkey


Summary. Data mining is one of the most prominent technologies of the information age. As computer
systems get cheaper and computing power increases, the amount of data that can be collected
and processed increases. Data mining is a process used in such cases to discover patterns
and trends in large data sets. This study describes a data mining application built on
the classification model and the decision tree technique. The application constructs a
decision tree and extracts classification rules by examining granted loans whose contracts
have already finished. Using these rules, it then predicts whether granted loans whose
contracts are still running will be repaid by the time the contracts finish.

Keywords: Data mining, classification, decision trees.



INTRODUCTION

Data mining takes its name and popularity from the fact that data is stored in the
form of a mountain. The valuable information is like the gems in this mountain.
The main problem is to clear away the worthless rubble and rocks in order to reach
the valuable gems. The models used in data mining are grouped into two
categories: predictive and descriptive [1,2].
The goal of predictive models is to construct a model from data whose
results are known and to use it to predict the results for data sets whose results
are unknown. For instance, a bank might hold the necessary data about loans
granted in previous terms. In this data, the independent variables are the
characteristics of the clients who were granted loans, and the dependent variable
is whether the loan was paid back [3]. The model constructed from this data is used
to predict whether a client will pay back the loan in future loan applications.
In descriptive models, by contrast, patterns in the current data that will
guide the decision-making process are identified [1]. An example is determining the
similarities between families with children that have two or more cars and
whose income is in the interval X-Y, and families with no children whose
income is below the interval X-Y [3].
Data mining models can be divided into three groups according to their
functions [3,4]:
1- Classification and Regression;
2- Clustering;
3- Association Rules.

Classification and regression models are predictive, while clustering and
association rule models are descriptive [5].
Classification and regression are two data analysis methods that
identify important data classes or construct models that can predict future
data trends. Classification predicts categorical values, whereas regression
predicts continuous values. For instance, a classification model may be
constructed to categorize bank loan applications as safe or risky, while a
regression model may be constructed to predict the spending of clients buying
computer products, given their income and occupation [4,5].
In the classification and regression models the following techniques are
mainly used [3]:
1- Decision Trees;
2- Artificial Neural Networks;
3- Genetic Algorithms;
4- K-Nearest Neighbor;
5- Memory Based Reasoning;
6- Naive Bayes.

The decision tree technique is commonly used in data mining because decision
trees are cheap to construct, easy to interpret, easy to integrate with
database systems, and reasonably reliable [3].
In this study we explain the classification model using the decision tree
technique, by means of an application we developed. The application examines the
loans granted by a bank to its clients and their repayment. It derives
classification rules from the characteristics and bank transactions of clients
whose loan contracts are over. Then it uses these rules to predict the loan
status of clients whose loan contracts are still running. The classification rules
discovered by the program may also help bank administrators judge whether offering
a loan to a client is appropriate. The data set used by the program was downloaded
from the Internet [6].



THE DATA SET USED IN THE APPLICATION

The data set used in the application contains all the transactions in a
Czech bank between Jan. 01, 1993 and Dec. 31, 1998; the number of these
transactions is 1,056,320. The data set is made up of eight tables, which cover
the clients, their accounts, transactions, payment orders, granted loans and
credit cards. The tables are available on the Internet as text files. The tables
and their explanations are as follows:

1. Account table: each record describes static characteristics of an account
(4500 objects in the file ACCOUNT.ASC).
2. Client table: each record describes characteristics of a client (5369 objects
in the file CLIENT.ASC).
3. Disposition table: each record relates a client to an account, i.e.
this relation describes the rights of clients to operate accounts (5369 objects
in the file DISP.ASC).




4. Permanent order table: each record describes characteristics of a payment
order (6471 objects in the file ORDER.ASC).
5. Transaction table: each record describes one transaction on an account
(1056320 objects in the file TRANS.ASC).
6. Loan table: each record describes a loan granted for a given account (682
objects in the file LOAN.ASC).
7. Credit card table: each record describes a credit card issued to an account
(892 objects in the file CARD.ASC).
8. Demographic data table: each record describes demographic characteristics
of a district (77 objects in the file DISTRICT.ASC).



DESIGN OF THE SOFTWARE

In the loan table, the status field records the repayment of each loan as one of
four states, called A, B, C and D. States A and B describe clients whose loan
contracts are over. In state A, the contract is over and the client paid back on
time. In state B, the contract is also over, but the client is in debt and could
not pay back. States C and D represent clients whose contracts are still running.
In state C, the contract continues and the client is paying back without any
difficulty. State D represents clients whose contract continues and who are having
repayment problems. In this application, states A and B are examined in order to
predict whether the clients in states C and D will pay back on time.
Examining the data set, we see that 682 of the 5369 clients were granted loans.
76 of these fall into state B or D; that is, they either could not pay back or
are currently unable to pay back. This shows that 11.14% of the clients who were
granted loans have repayment problems. 234 of the 682 clients with loans fall into
state A or B. The remaining 448 clients fall into state C or D.
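The proportions quoted above follow directly from the counts in the loan table. As a quick arithmetic check:

$$\frac{76}{682} \approx 0.1114 = 11.14\%, \qquad 234 + 448 = 682.$$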
While designing the software we followed these steps:
1- The data set is preprocessed.
2- The decision tree is constructed.
3- Classification rules are formed.
4- Classification rules are verified using the training data.
5- Using the test data, the error rate of the decision tree and classification
rules is determined.
6- Using the classification rules, the future of the accounts with loan
states C and D is predicted.

First the data is preprocessed. Since the original data set is in text format, it
is transformed into a database. Because we use the classification model, which is
widely used in data mining to predict future trends from current data, we first
decide on the input variables used to construct the decision tree. There are 12
input variables. By adding the Loan State class to be predicted to these 12 input
variables, we form the training data and the test data. The training data and the
test data are used to construct and verify the decision tree.
These 12 input variables are the following:
1. Sex: Sex of the client.
2. Age: Age of the client.
3. Amount of loan: The amount of the loan granted to the client.
4. Loan duration: Duration of the payback of the loan.
5. Type of the credit card: Type of the credit card owned by the client.
6. District: District where the client lives.
7. Minimum amount: The minimum amount of a transaction performed by
the client in the duration (*).
8. Maximum amount: The maximum amount of a transaction performed
by the client in the duration (*).
9. Average amount: The average amount of the transactions performed by the
client in the duration (*).
10. Minimum account: The minimum amount in the account in the
duration (*).
11. Maximum account: The maximum amount in the account in the
duration (*).
12. Average account: The average amount in the account in the duration
(*).

(*) indicates that the duration is the one year since the last payment for clients
whose loan state is A or C. For clients whose loan state is B or D, this
duration is the one year before the loan was created.



Minimum amount, maximum amount, average amount, minimum account,
maximum account and average account are computed using SQL. The age and
sex input variables are derived from the birth number in the Client table.
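The aggregation described above can be sketched as follows. This is a minimal illustration using an in-memory SQLite table, not the authors' actual queries: the table name `trans`, the column names (`account_id`, `date`, `amount`, `balance`), the sample rows and the one-year window are all assumptions made for the example. The birth-number decoding assumes a YYMMDD number in which the month is offset by 50 for women; check the data set's own documentation before relying on it.

```python
import sqlite3

# Hypothetical miniature of the transaction table (assumed column names).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trans (account_id INTEGER, date TEXT, amount REAL, balance REAL)"
)
conn.executemany(
    "INSERT INTO trans VALUES (?, ?, ?, ?)",
    [
        (1, "1997-01-10", 500.0, 12000.0),
        (1, "1997-06-02", 15.0, 9500.0),
        (1, "1997-11-20", 3000.0, 15500.0),
        (1, "1995-03-01", 9999.0, 100.0),  # outside the window, ignored
        (2, "1997-04-15", 20.0, -50.0),
    ],
)


def account_features(conn, account_id, start, end):
    """Return the six aggregate input variables for one account and window."""
    row = conn.execute(
        """
        SELECT MIN(amount), MAX(amount), AVG(amount),
               MIN(balance), MAX(balance), AVG(balance)
        FROM trans
        WHERE account_id = ? AND date >= ? AND date < ?
        """,
        (account_id, start, end),
    ).fetchone()
    names = ("min_Amount", "max_Amount", "avg_Amount",
             "min_Account", "max_Account", "avg_Account")
    return dict(zip(names, row))


def decode_birth_number(birth_number):
    """Split a YYMMDD birth number into (sex, birth_year).

    Assumption: the month field is increased by 50 for women.
    """
    yy = birth_number // 10000
    mm = (birth_number // 100) % 100
    sex = "F" if mm > 50 else "M"
    return sex, 1900 + yy


features = account_features(conn, 1, "1997-01-01", "1998-01-01")
print(features)
print(decode_birth_number(706213))
```

ISO-formatted date strings compare correctly as text, which is why the window can be expressed with plain `>=` and `<` comparisons here.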
The most important property of the classification model is that the input
variables used in the construction of the decision tree and the data to be predicted
must have categorical values. In this study age, amount of loan, minimum amount,
maximum amount, average amount, minimum account, maximum account and
average account are not categorical. For this reason these variables are divided into
categories with predetermined value intervals.
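A minimal sketch of this categorization step is given below. The interval boundaries for min_Account are read off the category labels that appear in the classification rules later in the paper; reading "K" as thousands is an assumption, and the helper name `interval_label` is hypothetical.

```python
from bisect import bisect_left

# (upper bound, label text) pairs in ascending order; assumed from the
# category labels used in the classification rules.
MIN_ACCOUNT_EDGES = [
    (0, "0"),
    (10_000, "10K"),
    (20_000, "20K"),
    (30_000, "30K"),
    (40_000, "40K"),
]


def interval_label(value, edges):
    """Map a number to a label such as '<=0', '0<...<=10K' or '>40K'."""
    bounds = [bound for bound, _ in edges]
    names = [name for _, name in edges]
    i = bisect_left(bounds, value)  # index of the first bound >= value
    if i == 0:
        return f"<={names[0]}"
    if i == len(bounds):
        return f">{names[-1]}"
    return f"{names[i - 1]}<...<={names[i]}"


for balance in (-120.0, 10_000, 25_000, 40_001):
    print(balance, interval_label(balance, MIN_ACCOUNT_EDGES))
```

Each interval is open on the left and closed on the right, matching the `<...<=` notation of the labels.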
After this categorization, the training data and test data are determined. Because
the goal of this application is to predict the future of the loans in states C and D
by examining loan states A and B, the training and test data groups are formed
from the 234 loans in states A and B. 60% of these 234 cases are used as training
data and the remaining 40% as test data. The training data and test data are chosen
randomly. Thus 140 cases form the training data and 94 cases form the test data.
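The random 60/40 split can be sketched as below. The function name, the fixed seed and the use of integer case identifiers are assumptions made for reproducibility of the example; the paper does not say how the random choice was implemented.

```python
import random


def split_cases(cases, train_fraction=0.6, seed=0):
    """Shuffle the cases and cut them into training and test parts."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)  # rounds down
    return shuffled[:n_train], shuffled[n_train:]


# 234 finished loans (states A and B), identified here by index only.
train, test = split_cases(range(234))
print(len(train), len(test))
```

Rounding the training size down gives the 140 and 94 cases reported in the text.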
A decision tree, as its name suggests, is a prediction technique in the form of a
tree. With its tree structure, it is the most popular classification technique: it is
easy to integrate with information technology, and its rules are easy to
understand [5].
The decision tree is made up of decision nodes, branches and leaves.
A decision node determines the test to be carried out. The result of this test causes
branching without losing data. At every node, testing and branching are carried out
consecutively, and this branching depends on the level above. Every branch of the
tree is a candidate to complete the classification. If the classification cannot be
completed at the end of a branch, a decision node is formed there. If a definite
class is reached at the end of the branch, there is a leaf [4]. This leaf is one of the
classes to be determined from the data. The operation of the decision tree starts from
the root node and follows the consecutive nodes from top to bottom until a leaf is
reached.
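The structure described above can be illustrated with a small sketch. The nested-dictionary representation and the toy tree are assumptions chosen for brevity, not the data structure used by the software: a decision node is a dictionary holding the tested attribute and its branches, and a leaf is simply a class label.

```python
# Toy tree; attribute names and values are illustrative only.
toy_tree = {
    "attribute": "min_Account",
    "branches": {
        "<=0": "B",
        ">40K": "A",
        "0<...<=10K": {
            "attribute": "Loan_Amount",
            "branches": {">325K": "B", "100K<...<=125K": "A"},
        },
    },
}


def classify(tree, record):
    """Walk from the root to a leaf; return None if no branch matches."""
    node = tree
    while isinstance(node, dict):
        value = record.get(node["attribute"])
        if value not in node["branches"]:
            return None  # the tree has no branch for this value
        node = node["branches"][value]
    return node


print(classify(toy_tree, {"min_Account": ">40K"}))
print(classify(toy_tree, {"min_Account": "0<...<=10K", "Loan_Amount": ">325K"}))
print(classify(toy_tree, {"min_Account": "0<...<=10K", "Loan_Amount": "<=20K"}))
```

Returning `None` when no branch matches is one possible way to represent a record the tree cannot classify.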
Data classification using the decision tree technique is made up of two steps.
The first step is training. In this step a predetermined training data set is analyzed by
a classification algorithm to construct the model. This model is expressed as
classification rules or as a decision tree. The second step is classification. In this
step test data is used to verify the classification rules or the decision tree. The
accuracy of the model on the test data is the ratio of correctly classified records
to all records in the test data: the known class of each test record is compared
with the class predicted by the model. If the accuracy of the model is admissible,
the model is used to classify new data whose classes are unknown [5,7].
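The accuracy measure just described amounts to a simple ratio. A minimal sketch, with hypothetical labels standing in for real test records:

```python
def accuracy(known, predicted):
    """Fraction of records whose predicted class equals the known class."""
    if len(known) != len(predicted):
        raise ValueError("known and predicted must have the same length")
    if not known:
        raise ValueError("at least one test record is required")
    correct = sum(k == p for k, p in zip(known, predicted))
    return correct / len(known)


# Hypothetical test labels, for illustration only.
known = ["A", "A", "B", "A", "B"]
predicted = ["A", "B", "B", "A", "B"]

acc = accuracy(known, predicted)
error_rate = 1.0 - acc
print(acc, error_rate)
```

The error rate reported by a tool is then one minus this accuracy.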
Decision tree classification relies on constructing a decision tree from a
training set chosen from the data set. The quality of the decision tree depends
on the size of the tree and the accuracy of its classification [8-11]. At this
stage the determination of the nodes in the decision tree is very important. It
must be decided which fields of the data set will be used, and in which order, to
construct the tree [11,12]. The most commonly used measure for this purpose is
entropy, which is also used in information theory. The higher the
entropy of an attribute, the more uncertainty there is with respect to its outcomes.
Thus we would wish to select attributes in order of increasing entropy, where the
root node of the tree corresponds to the attribute with the lowest entropy. The
entropy of the classification C with respect to attribute $A_k$ is given as [8]:
$$E(C \mid A_k) = \sum_{j=1}^{M_k} p(a_{k,j}) \left[ -\sum_{i=1}^{N} p(c_i \mid a_{k,j}) \log_2 p(c_i \mid a_{k,j}) \right] \tag{1}$$

where
$E(C \mid A_k)$ = entropy of the classification property of attribute $A_k$
$p(a_{k,j})$ = probability of attribute $k$ being at value $j$
$p(c_i \mid a_{k,j})$ = probability that the class value is $c_i$ when attribute $k$ is at its $j$th value
$M_k$ = total number of values for attribute $a_k$; $j = 1, 2, \ldots, M_k$
$N$ = total number of different classes; $i = 1, 2, \ldots, N$
$K$ = total number of attributes; $k = 1, 2, \ldots, K$


The term in the brackets is called the information. Thus, as Equation 1
implies, entropy is the expected information, that is, the sum of the information in
the several possible outcomes, each multiplied by its probability. Logarithms are
generally taken to base 2, so that the information is measured in bits.
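As a small worked illustration of Equation 1 (the numbers are hypothetical and chosen only for easy arithmetic), consider an attribute $A_k$ with $M_k = 2$ equally likely values and $N = 2$ classes. Suppose all records with the first value belong to a single class, while records with the second value are split evenly between the two classes. Taking $0 \log_2 0 = 0$ by convention:

$$E(C \mid A_k) = \tfrac{1}{2}\left[-\left(1 \log_2 1 + 0 \log_2 0\right)\right] + \tfrac{1}{2}\left[-\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right)\right] = \tfrac{1}{2}(0) + \tfrac{1}{2}(1) = 0.5 \text{ bits}.$$

The first value leaves no uncertainty about the class, the second leaves one full bit, and the entropy is their probability-weighted average.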
If a set S of records is partitioned into classes $C_1, C_2, C_3, \ldots, C_i$ on the
basis of the categorical attribute, then the information needed to identify the class of
an element of S is denoted by:

$$I(S) = -\left( p_1 \log_2(p_1) + p_2 \log_2(p_2) + \cdots + p_i \log_2(p_i) \right) \tag{2}$$



where $p_i$ is the probability of the partition $C_i$ [8]. The term in the
brackets in Equation 1 has the same form as Equation 2. Thus the entropy in
Equation 1 can be written as [8]:

$$E(A) = \sum_{i=1}^{n} \frac{|S_i|}{|S|} I(S_i) \tag{3}$$

where $S_1, \ldots, S_n$ are the subsets into which attribute $A$ partitions $S$.
The information gain obtained by branching on attribute $A$ can then be
calculated as [5]:

$$\mathrm{Gain}(A) = I(S) - E(A) \tag{4}$$

The information gain computed for each attribute is used to choose the test
attribute at each node of the decision tree. The attribute with the highest information
gain is chosen as the test attribute for the current node. This attribute minimizes the
information needed to classify the data and the problems that may
occur during branching.
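The selection of a test attribute by information gain can be sketched as follows. This is an illustrative implementation of the class information, the expected information of a split and their difference as defined above; it is not the authors' code. The function names and the four toy records are assumptions.

```python
from collections import Counter
from math import log2


def information(labels):
    """Class information I(S) of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def expected_information(records, attribute, target):
    """E(A): information of each subset S_i weighted by |S_i| / |S|."""
    subsets = {}
    for record in records:
        subsets.setdefault(record[attribute], []).append(record[target])
    total = len(records)
    return sum(len(s) / total * information(s) for s in subsets.values())


def gain(records, attribute, target):
    """Gain(A) = I(S) - E(A)."""
    labels = [record[target] for record in records]
    return information(labels) - expected_information(records, attribute, target)


def best_attribute(records, attributes, target):
    """Attribute with the highest information gain."""
    return max(attributes, key=lambda a: gain(records, a, target))


# Hypothetical training records, for illustration only.
records = [
    {"min_Account": "<=0", "Sex": "M", "Loan_State": "B"},
    {"min_Account": "<=0", "Sex": "F", "Loan_State": "B"},
    {"min_Account": ">40K", "Sex": "M", "Loan_State": "A"},
    {"min_Account": ">40K", "Sex": "F", "Loan_State": "A"},
]

for name in ("min_Account", "Sex"):
    print(name, gain(records, name, "Loan_State"))
print(best_attribute(records, ["min_Account", "Sex"], "Loan_State"))
```

In this toy data min_Account separates the classes perfectly and Sex carries no information, so the former would be chosen as the test attribute.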
After the data is preprocessed, the information gain needed to construct the
decision tree is computed. For each field in the training data, the information
gain is computed from Equations 1, 2, 3 and 4; the results, in descending order,
are as follows:
1. Gain (min_Account) = 0.342
2. Gain (Loan_Amount) = 0.192
3. Gain (min_Amount) = 0.162
4. Gain (max_Amount) = 0.069
5. Gain (avg_Amount) = 0.051
6. Gain (Card_Type) = 0.044
7. Gain (avg_Account) = 0.039
8. Gain (Age) = 0.032
9. Gain (District) = 0.03
10. Gain (max_Account) = 0.029
11. Gain (Loan_Duration) = 0.025
12. Gain (Sex) = 0
After these information gains are computed, the software constructs the
decision tree shown in Figure 1.

After the decision tree in Figure 1 is constructed, the software generates the
following classification rules.

If min_Account='<=0' Then CLASS='B'
If min_Account='>40K' Then CLASS='A'
If min_Account='0<...<=10K' Then
    If Loan_Amount='>325K' Then CLASS='B'
    If Loan_Amount='100K<...<=125K' Then CLASS='A'
    If Loan_Amount='125K<...<=150K' Then
        If min_Amount='10<...<=20' Then CLASS='A'
        If min_Amount='20<...<=30' Then CLASS='B'
    End If
    If Loan_Amount='150K<...<=175K' Then
        If min_Amount='10<...<=20' Then
            If max_Amount='10K<...<=20K' Then CLASS='A'
            If max_Amount='40K<...<=50K' Then CLASS='A'
            If max_Amount='50K<...<=60K' Then CLASS='B'
        End If
        If min_Amount='20<...<=30' Then CLASS='A'
    End If
    If Loan_Amount='175K<...<=200K' Then CLASS='B'
    If Loan_Amount='200K<...<=250K' Then CLASS='B'
    If Loan_Amount='20K<...<=30K' Then CLASS='A'
    If Loan_Amount='250K<...<=275K' Then CLASS='B'
    If Loan_Amount='275K<...<=300K' Then CLASS='B'
    If Loan_Amount='300K<...<=325K' Then CLASS='A'
    If Loan_Amount='30K<...<=40K' Then
        If min_Amount='10<...<=20' Then
            If max_Amount='0<...<=10K' Then CLASS='A'
            If max_Amount='20K<...<=30K' Then CLASS='B'
            If max_Amount='50K<...<=60K' Then CLASS='A'
            If max_Amount='60K<...<=70K' Then CLASS='A'
        End If
    End If
    If Loan_Amount='40K<...<=50K' Then CLASS='A'
    If Loan_Amount='50K<...<=60K' Then CLASS='A'
    If Loan_Amount='60K<...<=70K' Then
        If min_Amount='10<...<=20' Then CLASS='B'
        If min_Amount='20<...<=30' Then CLASS='A'
        If min_Amount='30<...<=40' Then CLASS='B'
        If min_Amount='60<...<=70' Then CLASS='B'
    End If
    If Loan_Amount='70K<...<=80K' Then CLASS='A'
    If Loan_Amount='80K<...<=90K' Then CLASS='B'
    If Loan_Amount='90K<...<=100K' Then
        If min_Amount='<=1' Then CLASS='A'
        If min_Amount='>1K' Then CLASS='B'
        If min_Amount='10<...<=20' Then
            If max_Amount='10K<...<=20K' Then CLASS='A'
            If max_Amount='30K<...<=40K' Then CLASS='A'
            If max_Amount='40K<...<=50K' Then CLASS='A'
            If max_Amount='50K<...<=60K' Then CLASS='A'
            If max_Amount='70K<...<=80K' Then CLASS='B'
        End If
    End If
End If
If min_Account='10K<...<=20K' Then CLASS='A'
If min_Account='20K<...<=30K' Then CLASS='A'
If min_Account='30K<...<=40K' Then CLASS='A'
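Rules of this kind correspond to the root-to-leaf paths of the decision tree. The sketch below shows one way such rules could be enumerated from a tree; the nested-dictionary representation, the function names and the flattened "And" form of the output are assumptions for illustration, whereas the software itself prints nested If blocks. Only a small fragment of the rules above is encoded.

```python
# Fragment of the tree implied by the first few rules (assumed representation:
# a decision node is a dict, a leaf is a class label).
fragment = {
    "attribute": "min_Account",
    "branches": {
        "<=0": "B",
        ">40K": "A",
        "0<...<=10K": {
            "attribute": "Loan_Amount",
            "branches": {">325K": "B", "100K<...<=125K": "A"},
        },
    },
}


def extract_rules(tree, conditions=()):
    """Yield (conditions, class) pairs, one per root-to-leaf path."""
    if not isinstance(tree, dict):
        yield list(conditions), tree
        return
    for value, subtree in tree["branches"].items():
        step = (tree["attribute"], value)
        yield from extract_rules(subtree, conditions + (step,))


def format_rule(conditions, label):
    """Render one path as a flat If ... Then rule."""
    tests = " And ".join(f"{attr}='{value}'" for attr, value in conditions)
    return f"If {tests} Then CLASS='{label}'"


rules = [format_rule(c, label) for c, label in extract_rules(fragment)]
for rule in rules:
    print(rule)
```

Each leaf produces exactly one rule, so the number of rules equals the number of leaves in the tree.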


The classification rules are verified using the test data, as seen in Figure 2.
The software evaluates the classification rules on the min_Account, Loan_Amount,
min_Amount and max_Amount fields of each account and writes the results in the
result column. The Loan_State values from the test data are compared with the
values in the result column, and an error rate is obtained. The error rate shown in
the message box is 12.76%. Since this error rate is admissible, the classification
rules may be used to predict what the loan states C and D will become at the end of
the contract.
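The reported error rate is consistent with 12 of the 94 test cases being misclassified. This count is an inference from the percentage, not a figure stated in the text:

$$\frac{12}{94} \approx 0.12766, \quad \text{which truncates to the reported } 12.76\%,$$

whereas 11 or 13 misclassified cases would give about 11.70% or 13.83% respectively.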







Figure 2. The specification of the classification rules error rate.

The prediction for states C and D is shown in Figure 3. The loan state
column shows the current states of the accounts, and the result column shows the
predicted loan states at the end of the contract. If a client with loan state C is
predicted to end the contract in state B, or a client with loan state D is predicted
to end it in state A, the case is called an unexpected case.
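The comparison between the current and the predicted state can be sketched as below. The function name, the sample cases and the use of `None` for a missing prediction (a case the rules cannot classify) are assumptions for illustration.

```python
from collections import Counter

# A running loan without problems (C) is expected to finish as A,
# one with problems (D) as B.
EXPECTED_END_STATE = {"C": "A", "D": "B"}


def outcome(current_state, predicted_state):
    """Label a prediction as expected, unexpected or unknown."""
    if predicted_state is None:
        return "unknown"  # no rule matched this account
    if EXPECTED_END_STATE[current_state] == predicted_state:
        return "expected"
    return "unexpected"


# Hypothetical (current state, predicted state) pairs.
cases = [("C", "A"), ("C", "B"), ("D", "B"), ("D", "A"), ("C", None)]
summary = Counter(outcome(current, predicted) for current, predicted in cases)
print(summary)
```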










Figure 3. The prediction of the C and D loan states at the end of the contract.

In the predictions for the 448 loans in states C and D, loans in state C are
generally predicted to end in state A and loans in state D in state B. This shows
that accounts with no problems so far are generally expected to repay the loan, and
accounts with problems so far are expected not to repay it. For some accounts,
however, a loan currently in state C is predicted to end in state B, and a loan
currently in state D is predicted to end in state A. There are 45 such cases.
Therefore it seems possible that some clients currently having problems
repaying their loans will solve their problems and repay them. Similarly, some
clients who are currently repaying their loans without any problem are predicted to
run into problems and fail to repay.




The decision tree and classification rules used may not be sufficient to classify
some new data. The cases that cannot be predicted are described as unknown
states. They are shown in Figure 4.



Figure 4. The C and D states that cannot be predicted.



RESULTS AND CONCLUSIONS

When a data mining application is to be developed, first of all the data at hand
and the business problem to be solved must be analyzed and understood very well.
These two key points are the basic factors that affect the success of the data mining
application. After that, choosing the right data mining technique is another
important factor in overcoming the business problem.


The data to be analyzed with data mining techniques may be incomplete,
noisy and inconsistent. Thus, when starting the application, the data must first be
preprocessed. This preprocessing includes data cleaning, data integration, data
transformation and data reduction. The data used in this application was also
preprocessed and arranged for the decision tree technique. One of these
arrangements is the determination of the input variables used in the construction of
the decision tree. Determining the input variables according to the goal is one
of the key points of the decision tree technique.
The evaluation and interpretation of the patterns obtained by
applying the right technique to the preprocessed data is another important point in
data mining applications. When experts interpret the patterns, the valuable
information is obtained. The last step of a data mining application is the
presentation of the information to the users.
In this study, among the predictions for the 448 loans in states C and D,
10 cases are labeled as unknown states. The decision tree and the classification
rules are insufficient in these cases and cannot make any prediction. This is
because the data set used in the application cannot provide an adequate training
data set, so classification rules with adequate capacity to predict states C and D
cannot be obtained.
The software designed for this application uses the entropy measure as its
branching criterion. How far the Gini and Twoing [14] criteria, which are
alternative branching criteria, could help solve the above problem remains to be
studied.
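For orientation, the Gini criterion mentioned above is commonly defined as one minus the sum of the squared class proportions. The sketch below uses that common textbook definition; it is not taken from this study or from the CART guide [14], and the function name is hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """1 - sum(p_i ** 2) over the class proportions p_i of the labels."""
    if not labels:
        raise ValueError("at least one label is required")
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


print(gini_impurity(["A", "A", "B", "B"]))  # evenly mixed node
print(gini_impurity(["A", "A", "A", "A"]))  # pure node
```

Like entropy, the measure is zero for a pure node and largest when the classes are evenly mixed, so it can be used in place of the class information when comparing candidate splits.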



REFERENCES

[1] Zhong, N.; Zhou, L.: Methodologies for Knowledge Discovery and Data Mining,
The Third Pacific-Asia Conference, PAKDD-99, Beijing, China, April 26-28, 1999;
Proceedings, Springer Verlag, (1999).
[2] Fayyad, U.: Mining Databases: Towards Algorithms for Knowledge Discovery,
IEEE Bulletin of the Technical Committee on Data Engineering, 21 (1) (1998) 41-48.
[3] Akpınar, H.: Veri Tabanlarında Bilgi Keşfi ve Veri Madenciliği, İstanbul Üniv.
İşletme Fakültesi Dergisi, 29 (2000) 1.


[4] Berson, A.; Smith, S.; Thearling, K.: Building Data Mining Applications for CRM,
McGraw-Hill Professional Publishing, New York, USA, (2000).
[5] Chaudhuri, S.: Data Mining and Database Systems: Where is the Intersection?, IEEE
Bulletin of the Technical Committee on Data Engineering, 21 (1) (1998) 4-8.
[6] 3rd European Conference on Principles and Practice of Knowledge Discovery in
Databases, http://lisp.vse.cz/pkdd99/DATA/data_berka.zip, Access Date: 18 Dec. 2001.
[7] Seidman, C.: Data Mining with Microsoft SQL Server 2000, Microsoft Press, 1st Ed.,
Washington, USA, (2001).
[8] Güven, E.: Student Performance Assessment In Higher Education Using Data
Mining, MSc Thesis, Boğaziçi Univ., Inst. for Graduate Studies in Science and
Engineering, Istanbul, Turkey, (2001).
[9] Ulaş, M.A.: Market Basket Analysis For Data Mining, MSc Thesis, Boğaziçi Univ.,
Inst. for Graduate Studies in Science and Engineering, Istanbul, Turkey, (2001).
[10] Berry, M.J.A.; Linoff, G.S.: Mastering Data Mining: The Art and Science of
Customer Relationship Management, 1st Ed.; John Wiley & Sons, (1999).
[11] Han, J.; Kamber, M.: Data Mining Concepts and Techniques, 1st Ed.; Morgan
Kaufmann Publishers, San Francisco, USA, (2000).
[12] Agrawal, R.; Imielinski, T.; Swami, A.: Database Mining: A Performance
Perspective, IEEE Transactions on Knowledge and Data Engineering, (1993) 914-925.
[13] Chen, M.; Han, J.; Yu, P.S.: Data Mining: An Overview from Database Perspective,
IEEE Transactions on Knowledge and Data Engineering, 8 (6) (1996).
[14] CART for Windows User's Guide, http://www.salford-systems.com, Access Date: 17
March 2002.











Received December 2002
