
EE559 Project

German Credit Score Project


Jorge Gomez
e-mail: gomezpon@usc.edu

This document presents the results of pattern classification on the German
Credit Score Dataset. A compact version (9 features) and the full version (20
features) of the dataset were used in this work to compare the performance of
3 different classifiers (Support Vector Machine, Naive Bayes, and K-Nearest
Neighbors). A preprocessing and dimensionality reduction procedure was needed
in order to work with the different types of features (categorical and
numerical) that the dataset contains. As a result of this procedure, the best
classifier (on the full dataset) is KNN, with a training accuracy of 71.71%,
a test accuracy of 67.95%, and an F-Score of 73.79%.

Introduction
The German Credit Dataset is a mixed dataset with 20 features and 1000 samples. 70% of
the observations are labeled "Good" and the rest "Bad". Because of this proportion, the
dataset can be considered unbalanced, which makes it difficult to classify. Another
characteristic of this dataset is the mixed nature of its data: it contains both categorical
and numerical features. Also, misclassifying a person who is "Bad" as "Good" is penalized
more heavily than the opposite error.
A set of 3 classifiers was selected to perform classification on this dataset. SVM, KNN,
and Naive Bayes were chosen because of their good characteristics. The algorithms were
developed using the Statistics and Machine Learning Toolbox from MATLAB.
For each classifier, an optimization procedure for finding the best parameters is performed,
followed by a performance evaluation used to compare the classifiers and decide which one
is best. This paper briefly discusses each step of the development, using both the compact
dataset and the complete version. The Framework section establishes and explains the
system and details the preprocessing step; the Classification section explains the classifier
configuration; finally, the results of the classification using the 3 strategies are shown
and discussed.

Framework
As mentioned in the previous section, in order to work with this dataset it is necessary
to perform a procedure that handles both the categorical and the numerical data. The
procedure is implemented in 3 steps: Preprocessing & Dimension Reduction, Classification,
and Performance Evaluation. As can be seen in Figure 1, the first step consists of the
extraction of the data.

[Figure 1: Project Framework — flowchart: Database → Data Extraction → Preprocessing → Classification → Performance Test; the sets are shuffled and the loop repeats until i = N (the number of iterations), then stops.]

The next step is the preprocessing, in which the data is converted into numerical values
and each categorical feature is expanded into a vector. As a result of this step the number
of columns increases; the new expanded matrix is split using a ratio of 80%-20%.
The training data is normalized (numerical features only) and used to perform PCA to
obtain the principal components, and the transformation matrix obtained from the training
data is also applied to the test data, projecting the original features into a lower-dimensional
approximation. The next step is the classification, in which the classifiers are configured
to find the set of parameters that maximizes the training set accuracy. This process runs
in a loop, using different combinations of training and test sets, in order to obtain average
results and their uncertainty (variance). Finally, to evaluate the classifiers, the same set
of performance metrics is used for all 3 of them. A skeleton of this loop is sketched below.
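The following MATLAB skeleton illustrates one way to arrange this loop (a sketch only; preprocessData is a hypothetical helper standing for the expansion, splitting, normalization, and PCA steps detailed in the next sections):

% Evaluation loop sketch; preprocessData is a hypothetical helper.
N = 20;                                   % number of shuffled train/test splits
accTest = zeros(N, 1);
for i = 1:N
    % new random 80%-20% split on every pass
    [Ztr, ytr, Zte, yte] = preprocessData(rawTable);
    mdl = fitcknn(Ztr, ytr);              % or fitcsvm / fitcnb
    accTest(i) = mean(predict(mdl, Zte) == yte);
end
fprintf('Test accuracy: %.2f%% +/- %.2f%%\n', 100*mean(accTest), 100*std(accTest));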
The features used in this project are:

Table 1: Compact Dataset (9 features)


Age                 Sex              Job
Housing             Duration         Saving Account
Checking Account    Credit Amount    Purpose

Table 2: Complete Dataset (20 features)


Checking Account    Duration             Credit History     Purpose              Credit Amount
Savings Account     Present Employment   Installment Rate   Personal Status      Debtors
Present Residence   Property             Age                Installment Plans    Housing
Existing Credits    Job                  People Liable      Telephone            Foreign Worker

Both sets contain a mix of categorical and numerical features. For instance, in the
compact dataset 3 features (Age, Duration, and Credit Amount) are numerical and the
rest are categorical. In the extended version, 7 are numerical and 13 are categorical.

Preprocessing
The first step of the processing is the feature extraction, in which the information is read
from the file "Proj dataset 1.csv" or "Proj dataset 2.csv". After that, the labels and the
features are split into 2 different variables. The categorical data is transformed into
numbers for processing; for missing data, the value is replaced with a constant value
different from the existing categories of the feature. The next step, called feature
expansion, converts each categorical feature into a vector in which only 1 element is
different from '0' (one-hot encoding). The resulting expanded matrix of numerical and
binary data is split into 2 parts, a training set and a test set, following a ratio of
80%-20%, where a random selection is performed to pick both sets. The dataset contains
different types of data with different scales, so it is necessary to normalize them to
eliminate any effect of the feature scale. For example, Credit Amount takes values on the
order of thousands while Age is on the order of tens, so the two features would have a
different impact on the analysis and could even bias the result and reduce the performance
of the classifier. The next step is PCA: the normalized data is transformed and projected
into a lower dimension for classification, and the selection criterion is to keep the
number of columns that retains at least 95% of the variance of the data. This number
differs between the compact version and the full dataset, but in both cases the resulting
matrix is the input for the next step. The transformation matrix is also stored so it can
be applied to the test data, which has no involvement in the PCA analysis. A minimal
sketch of this chain is shown below.
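A hedged MATLAB sketch of this chain follows (the file name, label column, and variable names are assumptions; for simplicity it normalizes all expanded columns, while the text above normalizes only the numerical ones):

% Preprocessing sketch (assumed file/column layout; Statistics and Machine
% Learning Toolbox required).
T = readtable('Proj_dataset_2.csv');        % assumed file name
y = categorical(T{:, end});                 % assumed: last column holds the labels
T(:, end) = [];

X = [];
for v = 1:width(T)
    col = T{:, v};
    if isnumeric(col)
        X = [X, col];                       % numerical features are kept as-is
    else
        c = categorical(col);
        c = addcats(c, 'Missing');
        c(isundefined(c)) = 'Missing';      % missing data becomes its own category
        X = [X, dummyvar(c)];               % feature expansion: one-hot binary columns
    end
end

cv  = cvpartition(numel(y), 'HoldOut', 0.2);  % random 80%-20% split
Xtr = X(training(cv), :);  ytr = y(training(cv));
Xte = X(test(cv), :);      yte = y(test(cv));

mu = mean(Xtr);  sigma = std(Xtr);          % normalization statistics from training only
sigma(sigma == 0) = 1;                      % guard against constant columns
Xtr = (Xtr - mu) ./ sigma;
Xte = (Xte - mu) ./ sigma;

[coeff, score, ~, ~, explained, muPCA] = pca(Xtr);  % PCA fitted on the training set only
k   = find(cumsum(explained) >= 95, 1);     % keep >= 95% of the variance
Ztr = score(:, 1:k);
Zte = (Xte - muPCA) * coeff(:, 1:k);        % project test data with the stored transform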

[Figure 2: Preprocessing Procedure — flowchart: Database → Data Extraction → Feature Expansion → Data Splitting → PCA.]

Classification
After processing the data, the output is a reduced matrix of 800 samples and the number of fea-
tures depends on the original dataset. As mentioned in previous sections the selected classifiers
to be used with data are:
Table 3: Commands used in the project
Classifier MATLAB Command
Naive Bayes Classifier fitcnb
Support Vector Machine fitcsvm
K-nearest Neighbors fitcknn
The training set is given as input to the classifier algorithm. To find the best set of
parameters for each classifier, an optimization algorithm (Bayesian optimization) is used,
which evaluates combinations of the classifier's parameters and compares their accuracy.
For example, for SVM the optimization technique searches for the best slack variable (box
constraint), gamma (kernel scale), and kernel function between RBF and polynomial (the
sigmoid kernel is not defined in MATLAB). For KNN, the parameters are the number of
neighbors, the distance definition, and a weighting function. Finally, for Naive Bayes,
the parameters are the kernel for the pdf estimation, the width of the window function,
and the prior distribution. For all of the classifiers, a 10-fold cross-validation
procedure was configured to measure the performance of the classifier. Moreover, due to
the nature of the optimization procedure, this approach is repeated 20 times to find the
average accuracy and its standard deviation. A hedged sketch of this configuration is
shown below.
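The following MATLAB sketch shows one way to configure this search (the hyperparameter lists mirror the parameters named above; the exact option spellings are assumptions based on the toolbox documentation):

% Hyperparameter search sketch: Bayesian optimization with 10-fold cross-validation.
opts = struct('Optimizer', 'bayesopt', 'KFold', 10, 'ShowPlots', false);

knn = fitcknn(Ztr, ytr, ...
    'OptimizeHyperparameters', {'NumNeighbors', 'Distance', 'DistanceWeight'}, ...
    'HyperparameterOptimizationOptions', opts);

svm = fitcsvm(Ztr, ytr, ...
    'OptimizeHyperparameters', {'BoxConstraint', 'KernelScale', 'KernelFunction'}, ...
    'HyperparameterOptimizationOptions', opts);

nb = fitcnb(Ztr, ytr, ...
    'OptimizeHyperparameters', 'auto', ...  % tunes the distribution (kernel) and its width
    'HyperparameterOptimizationOptions', opts);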

[Figure 3: Classification Procedure — flowchart: Initial parameters → Preprocessing → Classifier Configuration → Parameter Optimization → Performance Evaluation; the sets are shuffled and the loop repeats until the number of iterations is reached.]


After the training, the optimized classifier is tested using the remaining 20% of the
data (the test set). As a final step, the confusion matrix is calculated to obtain the
precision, recall, and F-score measurements, as in the sketch below. This procedure is
repeated 20 times with different splittings of the data to obtain average values for the
training accuracy, the test accuracy, and the F-Score.
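A sketch of this test stage, assuming the models and projected sets from the previous steps (treating "Good" as the first class in the label ordering is an assumption):

% Test-stage metrics sketch: confusion matrix, precision, recall, F-score.
ypred = predict(knn, Zte);            % any of the three trained models
C = confusionmat(yte, ypred);         % rows: true class, columns: predicted class
TP = C(1,1);  FN = C(1,2);            % assuming "Good" is the first class
FP = C(2,1);  TN = C(2,2);
precision = TP / (TP + FP);
recall    = TP / (TP + FN);
Fscore    = 2 * precision * recall / (precision + recall);
accTest   = (TP + TN) / sum(C(:));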

Classification Performance
Results
Regarding dimensionality reduction, the expanded data is reduced by approximately 50%
while 95% of the variance of the database is retained in the PCA output matrix (see
Figures 4 & 5).

Figure 4: PCA for Compact Dataset

Figure 5: PCA for Complete Dataset

For instance, for the compact dataset ([1000x29] after expansion) the output of PCA is a
[1000x15] matrix. On the other hand, for the full dataset ([1000x62]), the result is a
[1000x30] matrix. After PCA, the classification procedure was performed to obtain the
best parameters. The resulting configuration for each classifier is shown in the
following tables:

Table 4: Naive Bayes Parameters


Kernel Epanechnikov
Width 0.0309

Table 5: SVM Parameters


Kernel RBF
Gamma 22.6574
Slack Variable 1.5873

Table 6: KNN Parameters


Distance Weighting     squared inverse
Number of Neighbors    153

The results shown above come from one complete run of the algorithm; however, because of
the randomness of the data splitting, they may change depending on the particular training
and test sets. To compare the 3 classifiers, the average classification accuracy and
F-Score are shown in the following tables:

Table 7: Compact Dataset, Uniform Prior


Bayes SVM KNN
Accuracy Train 68.4% ± 0.75% 70.24% ± 0.92% 70.66% ± 0.88%
Accuracy Test 62.28% ± 0.88% 61.73% ± 1.39% 64.8% ± 1.04%
F Score 71.46% ± 0.84% 69.52% ± 1.93% 72.56% ± 1.03%

Table 8: Compact Dataset, Empirical Prior


Bayes SVM KNN
Accuracy Train 72.61% ± 0.91% 74.65% ± 1.29% 72.8% ± 1.45%
Accuracy Test 72.15% ± 4.27% 73.80% ± 4.31% 71.65% ± 2.63%
F Score 82.01% ± 3.54% 82.90% ± 2.86% 82.11% ± 1.71%

Table 9: Complete Dataset, Uniform Prior


Bayes SVM KNN
Accuracy Train 69.46% ± 1.27% 71.52% ± 0.94% 71.71% ± 1.51%
Accuracy Test 70.12% ± 3.96% 68.26% ± 2.81% 68.12% ± 3.14%
F Score 77.29% ± 3.79% 74.54% ± 2.62% 73.94% ± 4.44%

Table 10: Complete Dataset, Empirical Prior
Bayes SVM KNN
Accuracy Train 73.93% ± 0.88% 75.53% ± 1.03% 74.77% ± 1.11%
Accuracy Test 73.14% ± 2.86% 74.55% ± 1.96% 73.68% ± 1.96%
F Score 81.94% ± 2.31% 82.87% ± 1.68% 83.31% ± 2.31%

Assuming uniform priors and the complete dataset, the best classifier is KNN with 71.71%
training accuracy; however, SVM performs close to KNN in terms of training accuracy
(71.52%) and even better in terms of F-Score (74.54%), which indicates that the precision
and recall of the SVM are better than those of KNN. Assuming uniform priors counteracts
the unbalancedness of the data, improving the classification and reducing Type I errors
(false positives), which in this case are highly penalized. A sketch of how the two prior
assumptions are configured is given below.
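The two prior assumptions can be set directly when training; a minimal sketch (all three fit functions accept the same 'Prior' option):

% Prior assumption sketch: uniform vs. empirical class priors.
mdlUniform   = fitcknn(Ztr, ytr, 'Prior', 'uniform');    % equal priors counteract the imbalance
mdlEmpirical = fitcknn(Ztr, ytr, 'Prior', 'empirical');  % priors follow the observed class frequencies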

Table 11: Confusion Matrices (Complete Dataset, Uniform Prior)


                Naive Bayes      SVM            KNN
Actual Good     114    31        96     49      109    36
Actual Bad       19    36        10     45       15    40

(Rows: actual class; columns within each block: predicted Good, predicted Bad.)

Table 12: Confusion Matrices (Complete Dataset, Empirical Prior)


                Naive Bayes      SVM            KNN
Actual Good     125    20        109    36      125    20
Actual Bad       29    26         44    11       50     5
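As a worked illustration of how the F-Score follows from such a matrix, take the Naive Bayes column of Table 11 (a single run, so the value differs from the 20-run average reported in Table 9): precision = 114/(114+19) ≈ 0.857, recall = 114/(114+31) ≈ 0.786, and F-Score = 2 × 0.857 × 0.786 / (0.857 + 0.786) ≈ 0.820.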

Moreover, between the two datasets the performance of the classifiers is slightly
different because of the amount of information used in the classification; however, the
overall behavior remains the same, i.e., KNN gives the best accuracy and SVM performs
very close to KNN. In both cases Naive Bayes does not provide a good training accuracy.
Even though its F-Score and test accuracy are higher than those of SVM and KNN, this
result can mislead one into a wrong decision. Using the confusion matrix to analyze this
effect, it is clearly observed that Naive Bayes achieves a better F-Score and test
accuracy because it detects more positive cases, but it also produces a higher number of
false positives compared to SVM and KNN. So, in this case, choosing Naive Bayes might
not be a good option.

Analysis and Discussion


The best classifier using the complete dataset in terms of training accuracy is KNN,
which is 0.5% better than SVM and almost 3% better than Naive Bayes; its F-score is
72.56%. These results depend on the assumptions made before performing the classification.
For instance, it can be assumed that the data is unbalanced and, accordingly, different
weights can be used for the classification; as shown in the previous tables, this directly
affects the accuracy and the F-Score. In the case of the KNN classifier, for example,
assuming the empirical prior gives good performance in terms of accuracy; however, the
confusion matrix clearly shows that it fails at classifying bad samples, turning them
into false positives (see Table 12).
Indeed, all three classifiers improve their performance by about 3% on average when using
the empirical prior probability; however, there is an impact on the detection of bad
cases. Comparing the information in Tables 11 and 12, the successful detection of bad
cases is greater using the uniform prior rather than the empirical assumption, and in
some cases the effect is large (e.g., KNN), which means that the classifiers are
sensitive to the prior probability assumption. This effect arises because the empirical
priors bias the classification toward detecting more "Good" than "Bad" cases, and in
doing so the classifier may label bad samples as good (i.e., produce false positives).
Therefore, analyzing only the accuracy (training and test) or the F-Score can be
misleading when judging the performance of the classifiers. For this reason, the
confusion matrix is a powerful tool for examining correct and incorrect classifications
(false positives, misdetections), especially in applications where it is extremely
important to avoid those types of errors.

