
2012 International Conference on Information Engineering, Lecture Notes in Information Technology, Vol. 25

Research on the Application of Decision Tree to the Analysis of Individual Credit Risk

Yu Yanping (1,a), Qian Zhengming (1,b), Yang Min (1,c), Guan Rui (1,d), Fang Liting (1,e), Guo Penghui (1,f)

1 The School of Economics, Xiamen University, Siming District, Xiamen, 361005, China
b zmqianxm@126.com

Keywords: Data Mining; Decision Tree; Analysis of Individual Credit Risk

Abstract. Currently most financial institutions use data mining methods to analyze individual credit risk, but in China no systematic research has been done on the advantages and disadvantages of the various data mining methods in credit analysis, so no reference views are available to help analysts choose an appropriate method in practice. As a result, the method used in a given application may not be the most suitable one, the results may not be optimal, and the value of the data mining method is greatly reduced. This paper studies, by empirical analysis, the strengths and characteristics of the decision tree method in the analysis of individual credit risk, so as to provide a reference and recommendations for better application of this method.

1. Introduction

Business intelligence is a class of tools that helps users make sound decisions about their own business. Its technical system consists of three parts: the data warehouse (DW), online analytical processing (OLAP) and data mining (DM). Data mining is in essence a decision-support process based on artificial intelligence, machine learning, statistics and related fields. It can analyze an enterprise's raw data largely automatically, make inductive inferences, mine latent patterns, and predict customers' behavior, helping business decision-makers adjust marketing strategies, reduce risk and make the right decisions. As a core competitive strength of banks, credit card companies and other credit institutions, data mining is widely used in credit management.
In our country the application of data mining in financial institutions has only just begun, and there is no systematic research on the advantages and characteristics of the various data mining methods in specific applications, so it is difficult to offer reference views on how to choose an appropriate method, which prevents data mining from being exploited better and more fully. This paper studies the advantages and characteristics of the decision tree model as applied to personal credit risk, in order to provide a reference and recommendations for better application of this method to the analysis of individual credit risk. We also hope that our research can be a valuable starting point for further study of how to make better use of data mining methods in this field.

2. Range of Applications of the Decision Tree Model in the Analysis of Individual Credit Risk

Constructing a decision tree model means building a decision tree from a set of training data. The tree can be binary or multi-branch. Generally speaking, each internal node of a binary tree represents a logical judgment and each leaf node carries a category tag. Then, we can use the constructed
978-1-61275-024-8/10/$25.00 2012 IERI ICIE2012


decision tree to predict new data. Since the process of constructing a decision tree can be viewed as a process of generating rules from data, the decision tree model makes the discovered rules visible and its results easy to understand. Its range of applications is as follows:

1. The decision tree model can be used to classify and predict individuals so as to find homogeneous groups of individuals.
2. The decision tree model can be used to develop marketing or risk-management strategies.
3. Samples can be divided into several homogeneous groups by developing shallow decision trees; this grouping can then be used for further development of a scoring model.

3. Empirical Research on the Application of the Decision Tree

Sampling. This article combines classified (stratified) sampling with simple random sampling. First, we divide the 100,000 records from a bank into a good-credit group (Category 1) and a bad-credit group (Category 2). Second, we randomly draw 30,000 records from Category 1 and 600 records from Category 2.

Determining the Analytical Indicators. First, we determine the credit standard. This paper focuses on analyzing credit risk, so we choose credit status as the dependent variable. However, there are many possible standards for distinguishing and evaluating credit situations, so the evaluation indicator should be concise, clear and consistent with the overall objective. We adopt the following standard: the credit status is judged as poor if both of the following hold: (1) the study period is 12 months; (2) the customer breached contracts more than 2 times. Second, we set up the following system of indicators on the basis of the factors affecting credit risk.
Table 1 A System of Indicators

Argument                              | Data Type | Range
Age                                   | numerical | 18-65
Gender                                | character | male, female
Place of Birth                        | character | south, north, west, east
Marital Status                        | character | married, unmarried
Education                             | character | doctor's degree, master's degree, bachelor's degree, college, secondary, high school, junior high school, primary school, illiteracy
Housing Situation                     | character | own real estate, rental, other
Time Living at Present Address        | numerical | natural number
Occupation                            | character | administrative institutions, state-owned enterprises, private enterprises, SOHO
Occupation Time                       | numerical | natural number
Position Titles                       | character | primary, intermediate, advanced
Monthly Income                        | numerical | natural number
Accounts (Relationship with the Bank) | character | loan account, savings account, no account
Breach of Contract                    | character | yes, no
Credit Status (dependent variable)    | character | 1 (good), 2 (poor)
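The sampling scheme and the credit standard described above (a stratified draw of 30,000 good-credit and 600 bad-credit records; poor credit defined as more than 2 breaches within the 12-month study period) can be sketched as follows. The record layout, seed and toy population counts are invented for illustration; the paper itself works on the bank's real data:

```python
import random

def is_poor_credit(record):
    # Credit standard from Section 3: poor credit if the customer
    # breached contracts more than 2 times within the 12-month study period.
    return record["breaches_12m"] > 2

def stratified_sample(records, n_good, n_bad, seed=42):
    """Classify records into good/bad credit, then draw a simple
    random sample from each stratum (30,000 and 600 in the paper)."""
    bad = [r for r in records if is_poor_credit(r)]
    good = [r for r in records if not is_poor_credit(r)]
    rng = random.Random(seed)
    return rng.sample(good, n_good), rng.sample(bad, n_bad)

# Toy population standing in for the bank's 100,000 records.
population = ([{"breaches_12m": 0} for _ in range(99000)]
              + [{"breaches_12m": 3} for _ in range(1000)])
good_sample, bad_sample = stratified_sample(population, n_good=30000, n_bad=600)
print(len(good_sample), len(bad_sample))  # 30000 600
```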

Empirical Analysis of the Model. We now carry out empirical research on individual credit risk. Parameters. This paper selects "credit status" as the predictable (target) variable, with the other indicators as input variables; all variables are discrete. We construct the model programmatically, using the DMX language of SQL Server 2005. The parameters of the model are as follows:

Table 2 Parameters of the Decision Tree Model

Parameter          | Instruction                                                                                          | Value
Complexity_penalty | Inhibits the growth of the tree: the smaller the value, the greater the likelihood of a split; the larger the value, the smaller the likelihood. | 0.7
Minimum_support    | The minimum number of cases that a leaf node must contain.                                           | 30
Score_method       | The method used to calculate the split score.                                                        | system default
Split_method       | The method used to split a node: Binary (1), Complete (2) or Both (3).                               | Both (3)
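The two parameters with the most direct procedural meaning are Minimum_support (a split is admissible only if every resulting leaf still contains at least 30 cases) and the split score (left at the system default in the paper; its exact formula is not specified there). A hedged sketch of both ideas, using Gini impurity purely as an illustrative stand-in for the score:

```python
MINIMUM_SUPPORT = 30  # Table 2: every leaf node must contain at least 30 cases

def split_allowed(left_cases, right_cases, minimum_support=MINIMUM_SUPPORT):
    """A candidate binary split is admissible only if both children
    satisfy the minimum-support constraint."""
    return left_cases >= minimum_support and right_cases >= minimum_support

def gini(counts):
    """Gini impurity of a node from its class counts, e.g. {1: 400, 2: 200}.
    This is only a stand-in for the split score; the paper leaves
    Score_method at the SQL Server 2005 system default."""
    total = sum(counts.values())
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

print(split_allowed(151, 449))           # True
print(split_allowed(20, 580))            # False: left child below minimum support
print(round(gini({1: 300, 2: 300}), 2))  # 0.5 for a perfectly mixed node
```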

Analysis. Put the sample data into the model and the results are as follows:

Fig. 1 Five-level Decision Tree Model

Fig. 2 Eight-level Decision Tree Model
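The five- and eight-level trees in Figures 1 and 2 classify a record by walking from the root to a leaf: each internal node is a logical judgment on one indicator, and each leaf carries a category tag (Section 2). A minimal sketch of that mechanism; the tiny tree below, its attribute names and thresholds are all invented for illustration, and the paper's actual trees live inside SQL Server 2005's mining viewer:

```python
def make_leaf(category):
    # A leaf node carries only its category tag (credit status 1 or 2).
    return {"leaf": category}

def make_node(attribute, threshold, left, right):
    # An internal node is a logical judgment:
    # left branch if attribute <= threshold, right branch otherwise.
    return {"attr": attribute, "thr": threshold, "left": left, "right": right}

def predict(tree, record):
    """Walk the tree from the root to a leaf and return its category tag."""
    while "leaf" not in tree:
        branch = "left" if record[tree["attr"]] <= tree["thr"] else "right"
        tree = tree[branch]
    return tree["leaf"]

# Hypothetical two-level tree: split on monthly income, then on breaches.
tree = make_node("income", 3000,
                 make_node("breaches", 2, make_leaf(1), make_leaf(2)),
                 make_leaf(1))

print(predict(tree, {"income": 2500, "breaches": 3}))  # 2 (bad credit)
print(predict(tree, {"income": 5000, "breaches": 0}))  # 1 (good credit)
```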

Analysis of the results: 1. Color: the darker a node is (dark green in Figures 1 and 2), the more likely its credit status is Category 2. 2. Further analysis of each node: for each node we can obtain its classification rule, the number of cases supporting the rule, and the probability distribution of the predictable variable.

Fig. 3 Analysis of Nodes

Instructions: (1) Value stands for the predictive value of the dependent variable;

(2) Case stands for the number of samples that comply with the rule after classification; (3) Probability means the probability of each value of the dependent variable under the rule. From Figure 3 the classification rule of this node can be read off (housing situation is 2, marital status is 2, education is 3, and income is 2). For samples conforming to this rule, the probability of credit status 2 is 44.8%, and the number of such samples is 151. The original sample contains 600 samples with credit status 2, so this one rule recovers about one-fourth of them. The probability of credit status 2 is thus very high for samples conforming to the rule; that is, they are high-risk. 3. Investigate the relevance between the independent variables and the dependent variable. The strength of the relationships between variables can be drawn with the association (dependency network) tool of the SQL Server 2005 data mining suite. For our example the following diagram is obtained.
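The figures quoted for this node can be reproduced from the reported counts (151 rule-matching samples with credit status 2, a within-node probability of 44.8%, and 600 bad-credit samples overall):

```python
# Node statistics from Figure 3.
cases_status2 = 151     # rule-matching samples whose credit status is 2
node_probability = 0.448  # probability of status 2 within the node
total_status2 = 600     # bad-credit samples in the whole training set

node_size = round(cases_status2 / node_probability)  # cases at this node
coverage = cases_status2 / total_status2             # share of all bad credit
print(node_size)            # 337
print(round(coverage, 3))   # 0.252, i.e. about one-fourth
```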

Fig. 4 Relevance between the Predictor Variable and Independent Variables

As Figure 4 shows, housing situation, education, occupation, position title, account relationship with the bank, marital status and income are all statistically associated with credit status (the predictable variable), and these indicators can be used to predict it. We now analyze the strength of the correlation of these seven indicators with the predictable variable. First we identify the indicator with the strongest relevance.

Fig. 5 Relevance between the Predictor Variable and Independent Variables

As can be seen from Figure 5, among the seven indicators the housing situation has the strongest relevance to the predictable variable (credit status). Similarly, we can find the indicators with the next strongest relevance.


Fig. 6 Relevance between the Predictor Variable and Independent Variables

From Figure 6 it can be seen that marital status, income and education are the next strongest indicators affecting the predictable variable. What is more, the classification performed by the decision tree is based on the strength of the association between the indicators and the predictable variable. 4. Now we verify the accuracy of the model. We feed another batch of records from the original data into the model to predict the predictable variable, compare the predicted values of credit status with the actual values, and use the prediction accuracy to judge the model's quality.
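Before turning to Figure 7, it is worth noting that the ranking in Figures 4-6 can be roughly mimicked with a generic association measure such as mutual information over a contingency table. The sketch below does not reproduce SQL Server's own scoring, and its counts are invented; it only illustrates why a variable like housing situation outranks a weakly associated one:

```python
from math import log2
from collections import Counter

def mutual_information(pairs):
    """Mutual information (in bits) between two discrete variables,
    estimated from a list of (x, y) observations."""
    n = len(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    pxy = Counter(pairs)
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Invented toy counts: housing tracks credit status closely, gender does not.
housing = [("own", 1)] * 80 + [("rental", 2)] * 15 + [("rental", 1)] * 5
gender = [("m", 1)] * 42 + [("m", 2)] * 8 + [("f", 1)] * 43 + [("f", 2)] * 7
print(mutual_information(housing) > mutual_information(gender))  # True
```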

Fig. 7 Comparison of the Predictive Values and the Actual Values

Instructions: (1) Predict stands for the predicted values; (2) "1 Actual" means the actual value of the sample is 1, and likewise "2 Actual" means the actual value is 2. According to Figure 7, when the predicted value is 2, the number of samples whose actual value is 2 is 274 and the number whose actual value is 1 is 64, giving an accuracy of 81.06%. Similarly, when the predicted value is 1 the accuracy is 97.7%, which shows that the model's prediction accuracy is quite high and fully meets the requirements of the model predictions.

4. Conclusion

The decision tree method can be used to judge and classify samples; note that the decision tree model can only handle discrete target variables. Based on the empirical results above, we draw the following conclusions: (1) The prediction accuracy is 81.06% when we predict the credit status of samples with poor credit. (2) The decision tree model accurately identified 274 samples that are more likely to breach their contracts, which can greatly reduce the risk faced by financial institutions. (3) The prediction error rate for samples predicted to have good credit is 2.3%. The risk a financial institution faces when it mistakes a client with poor credit for one with good credit is much higher than the risk of the reverse mistake, so in general financial institutions require this error rate to be as low as possible in order to better control credit risk.
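The headline figures in conclusions (1) and (3) can be re-derived directly from the confusion counts reported with Figure 7 (274 correct and 64 incorrect cases among samples predicted as status 2, and 97.7% accuracy on samples predicted as status 1):

```python
# Confusion counts from Figure 7 for the hold-out batch.
predicted_2_actual_2 = 274  # predicted bad credit, actually bad
predicted_2_actual_1 = 64   # predicted bad credit, actually good

accuracy_predicting_2 = predicted_2_actual_2 / (predicted_2_actual_2
                                                + predicted_2_actual_1)
print(round(100 * accuracy_predicting_2, 1))  # 81.1 (reported as 81.06%)

# The paper reports 97.7% accuracy when the predicted value is 1,
# i.e. an error rate of 2.3% on that class.
error_rate_predicting_1 = 1 - 0.977
print(round(100 * error_rate_predicting_1, 1))  # 2.3
```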



