ARTICLE INFO

Article history:
Received 30 May 2008
Received in revised form 15 April 2009
Accepted 2 August 2009
Available online 8 August 2009

Keywords:
Credit rating
Consumer loans
Machine learning
Hybrid models
Maximum profits

ABSTRACT

It is very important for financial institutions to develop credit rating systems to help them decide whether to grant credit to consumers before issuing loans. In the literature, statistical and machine learning techniques for credit rating have been extensively studied. Recent studies focusing on hybrid models that combine different machine learning techniques have shown promising results. However, there are various ways to combine techniques into hybrid models, and it is unknown which hybrid machine learning model performs best in credit rating. In this paper, four different types of hybrid models are compared, built by Classification + Classification, Classification + Clustering, Clustering + Classification, and Clustering + Clustering techniques, respectively. A real-world dataset from a bank in Taiwan is considered for the experiment. The experimental results show that the Classification + Classification hybrid model based on the combination of logistic regression and neural networks provides the highest prediction accuracy and maximizes the profit.

© 2009 Elsevier B.V. All rights reserved.
doi:10.1016/j.asoc.2009.08.003
C.-F. Tsai, M.-L. Chen / Applied Soft Computing 10 (2010) 374–380
credit rating, in addition to the single classification and clustering techniques as the baseline models. Moreover, the profit made by these models is also compared, based on a chosen bank as the case.

Four well-known classification techniques, which are decision trees, Bayes classification, logistic regression, and neural networks, and two clustering techniques, which are K-means and expectation maximization, are used to develop the hybrid models. Therefore, the contribution of this paper is to find out which combination method and techniques for the hybrid learning model perform the best as well as provide maximum profits for credit rating.

The organization of this paper is as follows. Section 2 briefly describes the related machine learning techniques used in this paper. In addition, related work is compared and its limitations are discussed. Section 3 presents the research methodology, including the data used, the development of the credit rating models, and the evaluation strategies considered. Section 4 shows the experimental results, and the conclusion is provided in Section 5.

2. Machine learning techniques

2.1. Classification techniques

Classification (or supervised learning) techniques are based on learning by examples that map input vectors into one of several desired output classes. That is, a pattern classifier can be created through the training or learning process. The learning process of creating a classifier is to calculate the approximate distance between input-output examples and produce correct output labels for the training set. This process is called the model generation phase. When the model is generated, it can classify an unknown instance into one of the learned classes in the training set [20].

2.1.1. Decision trees
A decision tree is a classification approach that analyzes data based on a tree structure. One major advantage is that it can produce decision rules that are easy for humans to understand. A decision tree classifies an instance by sorting it through the tree to the appropriate leaf node, i.e. each leaf node represents a classification. Each node represents some attribute of the instance, and each branch corresponds to one of the possible values of this attribute. C4.5 is the most widely used decision tree approach, and it is a later version of the ID3 algorithm [23].

2.1.2. Artificial neural networks
An artificial neural network, also called a neural network, is composed of a group of neural nodes linked by weighted connections. Every node simulates a biological neuron, and the connections among these nodes correspond to the synapses that connect the neurons. The most common type of neural network consists of three layers of units: an input layer, hidden layers, and an output layer; it is called the multilayer perceptron (MLP). A layer of input units is connected to a layer of hidden units, which is connected to a layer of output units [6].

2.1.3. Naïve Bayes classification
Naïve Bayesian classification [4] is based on Bayes' theorem, which uses prior probabilities and probabilities observed in the population to predict posterior probabilities. It is an effective tool to predict class membership in unknown situations.

The naïve Bayes classifier requires all assumptions to be explicitly built into models, which are then used to derive optimal decision/classification rules. It can be used to represent the dependence between random variables (features) and to give a concise and tractable specification of the joint probability distribution for a domain. It is constructed by using the training data to estimate the probability of each class given the feature vectors of a new instance.

2.1.4. Logistic regression
Logistic regression is a simple parametric statistical approach similar to traditional regression analysis. Therefore, the use of logistic regression should also conform to some hypotheses of traditional regression analysis, such as avoiding autocorrelation in the residuals, avoiding multi-collinearity among the independent variables, and the collected data conforming to a normal distribution [7]. Logistic regression uses a series of numerical computations to establish a model from known classification parameters, in order to find out which parameters have the most discriminative ability for each group as well as the classification rules for each group. Like discriminant analysis, logistic regression deals with the relationship between independent variables and a dependent variable when the dependent variable is a list of categories into which objects can be classified. The difference between logistic regression and discriminant analysis is that discriminant analysis must satisfy the assumptions of normal distribution and equal covariance matrices to find the optimal value. Logistic regression does not need these assumptions; moreover, even if these assumptions are satisfied, logistic regression can still provide relatively high prediction accuracy.

2.2. Clustering techniques

Clustering (or unsupervised learning) techniques can be regarded as the process of grouping similar objects into a cluster. In particular, labeled examples are not available. The purpose of clustering techniques is to improve the similarity of the members in a group, making the data within each cluster as similar as possible and the data in different clusters as dissimilar as possible [20].

Clustering algorithms can be classified into two categories: hierarchical and partitional clustering algorithms [12]. Hierarchical clustering creates a hierarchy of clusters using an agglomerative algorithm: distinct singleton clusters are combined one by one until some stopping rule is satisfied, producing a series of nested, tree-structured partitions. On the other hand, partitional clustering is much more popular and has been extensively used in many business problems [21]. Two well-known partitional clustering algorithms, K-means and expectation maximization (EM), are described below.

2.2.1. K-means
The K-means clustering algorithm is a simple and efficient clustering method. K-means clustering is performed based on the following steps [5]:

- Given a group of feature vectors (or data points) as the dataset to be clustered.
- Randomly select k seeds as the initial cluster centers.
- Assign each data point to the nearest cluster.
- Average the positions of the data points in each cluster to obtain the new cluster centers, and then assign every data point to its nearest cluster center.
- Repeat the two previous steps until some convergence criterion is met or the assignments no longer change.

2.2.2. Expectation maximization
The EM algorithm performs the following steps [3]:

Step 1: Estimation. First, assume the mean m_i and standard deviation s_i of each cluster. Then, we can figure out the probability p_i that each point belongs to cluster q_i (q_i = 1), where i represents
the ith cluster. Consequently, the initial values of m_i, s_i, and p_i are used to calculate the cluster probability of every instance x. Next, the probabilities and responsibilities are applied to repeatedly re-estimate the values of m_i, s_i, and p_i.

The responsibility is defined as the degree to which each data point of the dataset is associated with cluster q_i. The responsibility h_n^i for any x can be obtained by:

h_n^i = \frac{(p_i / s_i^d)\, e^{-\|x_n - m_i\|^2 / (2 s_i^2)}}{\sum_j (p_j / s_j^d)\, e^{-\|x_n - m_j\|^2 / (2 s_j^2)}}

In naïve Bayes form, this is:

h_n^i = P(q_n^i = 1 \mid x_n) = \frac{P(q_n^i = 1)\, P(x_n \mid q_n^i = 1)}{\sum_i P(q_n^i = 1)\, P(x_n \mid q_n^i = 1)}

Step 2: Maximization. According to the responsibility of every data point, the mean m and standard deviation s of the clusters are recalculated; the new mean m is the cluster center. The mean m_i is computed from the responsibilities of the data points in each cluster by:

m_i = \frac{\sum_n h_n^i x_n}{\sum_n h_n^i}

As a result, the likelihood of cluster q_i can be obtained based on m_i, s_i, and p_i, defined by:

P(X^n \mid q^i = 1) = \prod_{k=1}^{a} P(X_k^n \mid q^i = 1)

where a denotes the number of attributes of an instance.

2.3. Hybrid machine learning techniques

In general, the hybrid model is based on combining clustering and classification techniques. In Lenard et al. [18], two combination methods are proposed, as shown in Figs. 1 and 2, respectively.

For the first hybrid model, as clustering is an unsupervised learning technique, it cannot distinguish data as accurately as a supervised one. Therefore, a classifier can be trained first, and its output is subsequently used as the input for the cluster to improve the clustering result [10].

On the other hand, the second hybrid model uses the clustering technique first in order to filter out unrepresentative data. That is, the data which cannot be clustered accurately can be regarded as noisy data. Then, the representative data, which are not filtered out by the clustering technique, are used to train the classifier in order to improve the classification result [8].

Besides, there are two other strategies to combine two machine learning techniques. The first one is based on combining two different classification techniques, and the second on combining two different clustering techniques.

For combining two classification techniques, given an original dataset D which contains n training and testing examples, the aim of the first classifier is to pre-process D for data reduction. That is, the correctly classified data D' by the first classifier are collected, where D' contains m examples (m < n and D' ⊂ D). Then, D' is used to train the second classifier. Given a new testing set S, the second classifier could provide better classification results than single classifiers trained on the original dataset D.

Similarly, for the combination of two clustering techniques, the first cluster is also used for data reduction. The correctly clustered data D'' by the first cluster are used to train the second cluster, where D'' contains s examples (s < n and D'' ⊂ D). Finally, a new testing set S can be used to test the second cluster and other single clusters trained on the original dataset D for comparison.

In short, the first component (whether based on a supervised or unsupervised learning technique) of the four hybrid approaches simply performs the task of outlier detection [15]. That is, the D' and D'' datasets are much cleaner than the original dataset D.

Note that some other sophisticated hybrid techniques, such as classifier ensembles [14] and stacking (or stacked generalization) [25], are not considered in this paper. This is because the computational effort of constructing these models is relatively higher than for the above-mentioned four hybrid approaches, which combine two single techniques in a series connection. In particular, classifier ensembles need to combine a number of classifiers in parallel, where the ensemble size is problem dependent and the outputs of these classifiers need to be further combined (or post-processed), for example by majority voting or weighted voting, for the final classification decision. On the other hand, the structure of stacking is much more complex than classifier ensembles: it is composed of n levels of classification, where the first-level classification is based on combining multiple classifiers in parallel, and the second level collects the outputs of the first-level classifiers as the training set to construct the second-level classifier for the final classification. However, there is no definitive answer to the question of how many levels of the stacking scheme perform the best [25].

2.4. Related work

Kumar and Ravi [16], who provide a detailed review of machine learning techniques for finance-related problems, identify one important trend: building soft computing architectures (or hybrid intelligent systems). However, there are very few studies focusing on developing hybrid models for credit rating.

Table 1 compares related work on hybrid learning models in terms of their research problems, hybrid techniques used, and evaluation methods.

In related work, the developed hybrid models are generally compared with some chosen baseline models, based on single machine learning techniques, to draw the final conclusion. Although these studies conclude that hybrid models outperform single classification models, it remains unknown what kind of hybrid model performs best in the credit rating domain.

Moreover, much related work evaluates model performance by assessing accuracy and error rates. However, no study examines the ability of credit rating models to generate maximum profits.

Therefore, this paper compares four different types of hybrid credit rating models in terms of their prediction accuracy, error rates, and the maximum profit they can make (c.f. Section 3.4).
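The series-connection hybrid described above (train a first model, keep only the correctly classified examples D', and train a second model on this cleaner set) can be sketched as follows. This is an illustrative NumPy-only sketch on synthetic data, not the paper's implementation (the authors used Microsoft SQL Server 2005 Data Mining): a hand-rolled logistic regression stands in for the first classifier and a nearest-centroid rule stands in for the second.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the bank dataset D (the real data are not public):
# two Gaussian blobs for "bad credit" (0) and "good credit" (1).
n = 400
X = np.vstack([rng.normal(-1.0, 1.0, size=(n, 2)),
               rng.normal(+1.0, 1.0, size=(n, 2))])
y = np.array([0] * n + [1] * n)

def fit_logistic(X, y, steps=500, lr=0.1):
    """Stage 1: plain logistic regression fitted by gradient descent."""
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

w, b = fit_logistic(X, y)
pred1 = (X @ w + b > 0).astype(int)

# Data reduction: keep only the correctly classified examples D' of D.
mask = pred1 == y
X_red, y_red = X[mask], y[mask]

# Stage 2: a second classifier (nearest centroid, standing in for the
# neural network) trained on the cleaner set D'.
centroids = np.array([X_red[y_red == c].mean(axis=0) for c in (0, 1)])

def predict2(X):
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.argmin(axis=1)

acc = np.mean(predict2(X) == y)
print(f"D' keeps {mask.sum()} of {len(y)} examples; stage-2 accuracy = {acc:.2f}")
```

The first stage acts as the outlier filter described in Section 2.3: the second model never sees the examples the first model could not fit.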
Fig. 1. A classifier combined with a cluster.

3. Research methodology

3.1. The dataset
Table 1
Comparisons of related work.

Study | Problem | Hybrid technique | Evaluation
Hsieh [8] | Credit rating | Clustering + Classification | Accuracy and error rates
Huysmans et al. [10] | Credit rating | Classification + Clustering | Accuracy rates
Jain and Kumar [11] | Time series forecasting | Classification + Classification | Error rates
Kim and Shin [13] | Stock markets | Classification + Classification | Error rates
Lee et al. [17] | Credit rating | Classification + Classification | Accuracy and error rates
Malhotra and Malhotra [19] | Credit rating | Classification + Classification | Accuracy rates
Table 2
The three datasets.

Dataset | Description
Dataset 1 | Data whose proportion of normal to overdue accounts is 1:1, selected randomly from the original dataset.
Dataset 2 | The original data without Dataset 1.
Dataset 3 | Data whose proportion of normal to overdue accounts is the same as in the original dataset, selected from Dataset 2.

Table 3
The dataset information.

Dataset | Overdue : normal ratio | No. of overdue accounts (cases) | No. of normal accounts (cases)
Original dataset | 1:2.729 | 3467 | 9462
Dataset 1 | 1:1 | 1500 | 1500
Dataset 2 | - | 1967 | 7962
Dataset 3 | 1:2.729 | 1967 | 5368

Table 4
The variables in the dataset.

Case unit; delay of revolving credit; the quantity of cash cards; house ownership (yes/no); the guarantor involved; query times from the joint credit information center; living class; connubiality; liabilities; case source; the quantity of credit cards; job code; age.

Table 5
Parameter settings of the baseline models.

Model | Parameters
C4.5 decision tree | SCORE_METHOD = 4; SPLIT_METHOD = 3; MINIMUM_SUPPORT = 10; COMPLEXITY_PENALTY = default; FORCED_REGRESSOR = default
Logistic regression | HOLDOUT_PERCENTAGE = 30; HOLDOUT_SEED = 0; SAMPLE_SIZE = 1000; MAXIMUM_STATES = 100
Neural network | HIDDEN_NODE_RATIO = 4; HOLDOUT_PERCENTAGE = 30; HOLDOUT_SEED = 0; MAXIMUM_STATES = 100; SAMPLE_SIZE = 1000

overdue accounts are certainly different. In order to avoid ineffective model training, which may make the model incline towards the class containing the larger quantity of data, we randomly select data whose proportion of normal to overdue accounts is 1:1 from the original dataset as dataset 1.

In particular, 10-fold cross-validation is used during the training and testing stages. That is, dataset 1 is divided into 10 non-overlapping subsets, in which any 9 of the 10 subsets are used for training and the remaining one for testing. As a result, each model is trained and tested 10 times.

To further validate the trained models, we first collect the original data without dataset 1, i.e. dataset 2, which are unknown data for the trained models. Then, in order to make the validation set similar to the original dataset in terms of the distribution of normal and overdue accounts, dataset 3 is produced from dataset 2 based on the same proportion of normal to overdue accounts as in the original dataset.

3.2. Variables

The dataset contains a number of different input variables,1 shown in Table 4.

The most common variables are connubiality, query times from the joint credit information center, the quantity of cash cards, the quantity of credit cards, liabilities, the job code, and age. The less common variables, which are nevertheless selected as significant variables in this dataset by the entropy algorithm, are the case unit, house ownership, the living class, the case source, delay of revolving credit, and the guarantor involved.

For the output variable, as credit rating can be regarded as a two-class classification problem, the model classifies or predicts a new test case as having either good credit or bad credit.

3.3. Model development

There are six different types of credit rating models developed in this paper.2 They are the four different types of hybrid models (c.f. Section 2.3) and two single baseline models, based on classification and clustering techniques individually.

3.3.1. Single models by classification techniques
The single baseline models using classification techniques are based on C4.5 decision trees, naïve Bayes, logistic regression, and neural networks, respectively. The parameter settings used to construct the four baseline prediction models are shown in Table 5.

After running 10-fold cross-validation, the best baseline model can be obtained. Then, dataset 3 is used to evaluate the performance of these models.

3.3.2. Single models by clustering techniques
In this paper, two clustering techniques are used to develop the single baseline models: the K-means and expectation maximization (EM) algorithms. The number of clusters (i.e. the k value) is set from 2 to 50 to examine their clustering results based on 10-fold cross-validation. In particular, the percentages of the

1 The representative variables are selected by the Microsoft SQL Server 2005 entropy algorithm.
2 In this paper, the Microsoft SQL Server 2005 Data Mining software is used to develop the credit rating models.
As a result, two hybrid models are developed. They are the best baseline model + K-means and the best baseline model + EM. Finally, dataset 3 is used to evaluate the prediction performance of the two Classification + Clustering hybrid models individually.

3.3.4. Hybrid models by Classification + Classification techniques
Similar to the first stage of the Classification + Clustering technique, the correctly predicted data from the best baseline classification model are used to train the four classification models individually, which are C4.5 decision trees, naïve Bayes, logistic regression, and neural networks.

Consequently, there are four hybrid models developed: (1) the best baseline classification model + C4.5 decision trees, (2) the best baseline classification model + naïve Bayes, (3) the best baseline classification model + logistic regression, and (4) the best baseline classification model + neural networks.

Again, dataset 3 is finally used to evaluate the prediction performance of the four Classification + Classification hybrid models individually.

Then, prediction accuracy can be obtained by:

Prediction accuracy = (a + d) / (a + b + c + d)

The Type I error is the rate of prediction errors of a model that incorrectly classifies the good credit group into the bad credit group. Opposed to the Type I error, the Type II error is the rate of prediction errors of a model that incorrectly classifies the bad credit group into the good credit group. Therefore, the Type II error is the more critical one, as it carries higher risks for financial institutions.

In addition, we will examine the breakeven rate, which is based on the values in (a) and (b) of Table 6, to find out the optimal prediction accuracy of the model which can provide maximum profits.3 That is, the breakeven proportion in this case bank is 5.351351:1, which means that the profit made from 5.351351 normal accounts is counterbalanced by 1 overdue account.
Table 8
Prediction accuracy of baseline clustering models.

Model | Testing dataset | Validation dataset | No. of clusters

Table 10
Prediction performance of LR + EM and LR + K-means.

Model | Prediction accuracy | No. of clusters
LR + EM | 76.07% | 41
LR + K-means | 70.37% | 29

Table 11
Prediction performance of the four hybrid models.

Model | Prediction accuracy | Ranking
LR + DT | 72.50% | 4
LR + NB | 78.68% | 3
LR + LR | 82.94% | 2
LR + NN | 83.44% | 1

Table 12
Prediction performance of EM.

Actual | Predicted (bad credit) | Predicted (good credit)

Table 13
Prediction performance of the four hybrid models.

Model | Prediction accuracy | Ranking

4.1.5. Clustering + Classification hybrid models
Regarding the results of the single baseline clustering models, EM with 41 clusters performs the best. Table 12 shows the prediction performance of EM by a confusion matrix.

Therefore, the 1187 data which are accurately predicted by EM are used to train DT, NB, LR, and NN, respectively. Table 13 shows the prediction performance of the four hybrid models using dataset 3. In this type of hybrid model, EM + LR provides the best prediction result.

4.1.6. Clustering + Clustering hybrid models
In this type of hybrid model, the 1187 data which are accurately predicted by EM are used to train K-means (with 29 clusters) and EM (with 41 clusters), respectively. Table 14 shows their prediction performances on dataset 3. The result indicates that the EM + EM hybrid model performs the best.

4.2. Comparisons of the six best credit rating models

Table 15
Performance of the six best models.
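The profit comparison underlying Table 15 rests on the breakeven rate introduced in Section 3.4. A minimal sketch of how a model's net profit could be scored under that rate follows; the account counts passed in are illustrative, not figures from the paper.

```python
# Score a model's net profit in units of "profit per normal account",
# using the case bank's breakeven rate of 5.351351:1 (Section 3.4):
# one accepted overdue account wipes out the profit of 5.351351
# normal accounts.
BREAKEVEN = 5.351351

def net_profit(accepted_good: int, accepted_bad: int) -> float:
    return accepted_good - BREAKEVEN * accepted_bad

print(net_profit(850, 40))   # accepting few bad accounts: profitable
print(net_profit(100, 40))   # same bad accounts, far fewer good ones: a loss
```

Under such a scoring rule, a model with slightly lower accuracy but a lower Type II error can still yield the higher profit, which is why the paper reports profit alongside accuracy and error rates.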