We started by tweaking the cost matrix to identify all users who will leave the service. This procedure reduced the recognition rate for those who will remain to only 7.46%. To achieve a more meaningful classification, we evaluated the impact of a broad range of cost-matrix values. Figure 1 plots the number of identified users who will leave the service against the percentage of correctly classified instances, obtained by varying the cost matrix.

[Figure 1: identified minority class instances (0-400, y-axis) against classification accuracy in percent (7.46-93.1, x-axis).]
Fig. 1. Identified leaving users as a function of the percentage of correctly classified instances

The results obtained by the J48 classifier are weaker than those obtained by the algorithms described previously. Even with "cost sensitive" classification included, a typical result is the one shown in Table II:

TABLE II
CONFUSION MATRIX FOR COST-SENSITIVE J48

      a      b    <-- classified as
   2789   1866    a = -1
    209    136    b = 1

The maximum ROC values obtained did not exceed 0.5: J48 gives poor results on unbalanced data sets, and it is also poorly sensitive to cost-matrix changes.

It is interesting to note that IB1 was the only algorithm that identified a number of users who will leave the provider without any cost-matrix settings. It is also one of the slowest algorithms, together with KStar, which generally did not react to cost-matrix changes at all.

Naive Bayes achieved an average score, but had the best classification speed.

Boosting of Decision Stumps using AdaBoost achieved the best results, slightly better than the ADTree algorithm. The result with the best cost-matrix balance, achieved with the DecisionStump algorithm, is shown in Table III:

TABLE III
CONFUSION MATRIX FOR COST-SENSITIVE DECISIONSTUMP BOOSTING

      a      b    <-- classified as
   4408    247    a = -1
    292     53    b = 1

In total, 89% of the instances were classified correctly, and as many as 53 users who will leave the service were identified. The ROC area in this case is 0.697, which is the largest value measured in the evaluation.

All evaluations presented above were performed on the reduced data set generated by clustering, due to the need for faster execution, but also because an average PC is unable to run the more complex algorithms on the full data. Final testing was performed on the whole data set containing 50000 instances. For this purpose the cost matrix was slightly changed, because the ratio of instances of the two classes changed as well. It was fine-tuned again to optimize the AUC value, and the classification results for the complete data set using this cost matrix are shown in Table IV:

TABLE IV
CONFUSION MATRIX FOR COST-SENSITIVE DECISIONSTUMP BOOSTING BASED ON THE COMPLETE DATASET

      a      b    <-- classified as
  43773   1898    a = -1
   2978    649    b = 1

These results, and the corresponding ROC value of 0.72, indicate that the algorithm performs even better than on the reduced data set. The percentage of correctly classified instances was also slightly higher, which in this case also confirms the result obtained by centroid classification. This algorithm is therefore identified as the best model for the classification of unbalanced data sets that describe the churn of users of telecommunications operators. The result achieved in this manner is comparable to the IBM Research Laboratory result, where the ROC value was 0.7651.

V. CONCLUSION

We studied the applicability of different machine learning algorithms, available within the open-source data mining tool Weka, to the problem of predicting the churn of telecommunication users. A number of algorithms were evaluated using the data set provided to researchers within the KDD Cup 2009 competition. We showed that meaningful classification can be achieved using a personal computer and open-source tools.

Working with unbalanced data sets, which are commonplace in the problem considered, is difficult; this was a major challenge in the work presented. None of the applied algorithms is able to perform a meaningful classification of a very unbalanced input data set without employing "cost sensitive" classification.

The best results were achieved using AdaBoost to boost the Decision Stump classifier.

ACKNOWLEDGEMENT

This work has been supported in part by the Ministry of Science and Technological Development of Serbia, grant III43002.
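The cost-sensitive procedure used throughout rests on two simple mechanics: a cost matrix that turns class probabilities into minimum-expected-cost predictions, and headline metrics read directly off a confusion matrix. The following is a minimal Python sketch of those two ideas only; the experiments themselves used Weka's cost-sensitive classifiers, and the penalty value 10 below is purely illustrative, not a cost matrix from the study.

```python
def min_expected_cost_class(probs, cost):
    """probs[j] = P(class j | x); cost[j][i] = cost of predicting i when truth is j.
    Predict the class i that minimizes the expected cost sum_j probs[j] * cost[j][i]."""
    n = len(cost[0])
    expected = [sum(probs[j] * cost[j][i] for j in range(len(probs)))
                for i in range(n)]
    return min(range(n), key=lambda i: expected[i])

# With a plain 0/1 cost matrix the rule reduces to argmax probability...
uniform = [[0, 1], [1, 0]]
# ...but penalizing a missed leaver (true class 1 predicted as 0) more
# heavily flips borderline decisions toward the minority "leaver" class.
skewed = [[0, 1], [10, 0]]
probs = [0.8, 0.2]  # a borderline "stayer"
print(min_expected_cost_class(probs, uniform))  # -> 0 (stay)
print(min_expected_cost_class(probs, skewed))   # -> 1 (leave)

# Table IV confusion matrix from the text (complete data set):
tn, fp = 43773, 1898    # true class -1 (stayers)
fn, tp = 2978, 649      # true class  1 (leavers)
accuracy = (tn + tp) / (tn + fp + fn + tp)
print(round(100 * accuracy, 1), tp)  # -> 90.1 649
```

The skewed matrix shows why the recognition rate of stayers drops as the leaver penalty grows: every increase in the miss cost moves the decision boundary further into the majority class.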
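Since the best-performing combination reported is AdaBoost over Decision Stumps, a compact pure-Python sketch of the textbook algorithm (Freund and Schapire [10]) on a toy one-dimensional set may make the mechanics concrete. This is illustrative only, not the Weka AdaBoostM1/DecisionStump implementation used in the experiments.

```python
import math

def train_stump(xs, ys, w):
    """Best threshold/polarity decision stump on 1-D data under weights w."""
    best = None
    for t in sorted(set(xs)):
        for pol in (1, -1):
            pred = [pol if x >= t else -pol for x in xs]
            err = sum(wi for wi, p, y in zip(w, pred, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, t, pol)
    return best

def adaboost(xs, ys, rounds=10):
    n = len(xs)
    w = [1.0 / n] * n          # start with uniform example weights
    ensemble = []
    for _ in range(rounds):
        err, t, pol = train_stump(xs, ys, w)
        err = max(err, 1e-10)  # guard against a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, t, pol))
        # up-weight the examples this stump got wrong, down-weight the rest
        w = [wi * math.exp(-alpha * y * (pol if x >= t else -pol))
             for wi, x, y in zip(w, xs, ys)]
        s = sum(w)
        w = [wi / s for wi in w]
    def predict(x):
        score = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
        return 1 if score >= 0 else -1
    return predict

# A 1-D pattern no single stump can fit, but a boosted ensemble can:
xs = [0, 1, 2, 3]
ys = [-1, 1, 1, -1]
f = adaboost(xs, ys, rounds=20)
print([f(x) for x in xs])  # -> [-1, 1, 1, -1]
```

The reweighting step is the same cost-proportionate idea exploited by the cost matrix above: examples that matter more (here, previously misclassified ones) receive more weight in the next round.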
REFERENCES

[1] Srdjan Sladojevic, Previdjanje fluktuacije korisnika telekomunikacionih usluga primenom savremenih sistema za istrazivanje podataka (Predicting the churn of telecommunication service users with modern data mining systems; in Serbian), M.Sc. thesis, FTN, Novi Sad, 2010.
[2] Ian H. Witten, Eibe Frank, Len Trigg, Mark Hall, Geoffrey Holmes, and Sally Jo Cunningham, Weka: Practical Machine Learning Tools and Techniques with Java Implementations, Department of Computer Science, University of Waikato, New Zealand.
[3] John G. Cleary, Leonard E. Trigg, K*: An Instance-based Learner Using an Entropic Distance Measure, in: 12th International Conference on Machine Learning, pp. 108-114, 1995.
[4] Gideon Dror, Marc Boulle, Isabelle Guyon, Vincent Lemaire and David Vogel, Winning the KDD Cup Orange Challenge with Ensemble Selection, JMLR: Workshop and Conference Proceedings, KDD Cup, 2009.
[5] Zdravko Markov, Ingrid Russel, An Introduction to the Weka Data Mining System, Central Connecticut State University, University of Hartford.
[6] Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005.
[7] Bianca Zadrozny, John Langford, Naoki Abe, Cost-Sensitive Learning by Cost-Proportionate Example Weighting, Mathematical Sciences Department, IBM T. J. Watson Research Center, Yorktown Heights, NY 10598.
[8] Kai Ming Ting, Cost-Sensitive Classification using Decision Trees, Boosting and MetaCost, School of Computing and Information Technology, Monash University, Churchill, Victoria 3842, Australia.
[9] Andrew P. Bradley, The Use of the Area under the ROC Curve in the Evaluation of Machine Learning Algorithms, Cooperative Research Centre for Sensor Signal and Information Processing, Department of Electrical and Computer Engineering, The University of Queensland, Australia, 1996.
[10] Yoav Freund, Robert E. Schapire, Experiments with a New Boosting Algorithm, in: 13th International Conference on Machine Learning, San Francisco, pp. 148-156, 1996.
[11] D. Aha, D. Kibler, Instance-based Learning Algorithms, Machine Learning, 6:37-66, 1991.