Parametric Comparison Based On Split Criterion On Classification Algorithm

International Journal of Computer Engineering and Technology (IJCET), ISSN 0976-6367 (Print), ISSN 0976-6375 (Online), Volume 4, Issue 2, March-April (2013), pp. 459-470, IAEME: www.iaeme.com/ijcet.asp, Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com
PARAMETRIC COMPARISON BASED ON SPLIT CRITERION ON CLASSIFICATION ALGORITHM IN STREAM DATA MINING
Ms. Madhu S. Shukla*, Dr. K. H. Wandra**, Mr. Kirit R. Rathod***
*(PG-CE Student, Department of Computer Engineering, C.U.Shah College of Engineering and Technology, Gujarat, India)
**(Principal, Department of Computer Engineering, C.U.Shah College of Engineering and Technology, Gujarat, India)
***(Assistant Professor, Department of Computer Engineering)
ABSTRACT
Stream data mining is an emerging research topic. Today, a number of applications generate massive amounts of stream data; examples include sensor networks, real-time surveillance systems, and telecommunication systems. Such data therefore requires intelligent processing that supports its proper analysis and its use in other tasks. Mining stream data is concerned with extracting knowledge structures, represented as models and patterns, from non-stopping streams of information. Classification based on decision tree generation makes the decision process easy in stream data mining. Given the characteristics of stream data, it is essential to handle large amounts of continuous and changing data accurately. In the classification process, attribute selection at each non-leaf decision node thus becomes a critical point of analysis. Performance parameters such as speed of classification, accuracy, and CPU utilization time can be improved if the split criterion is implemented precisely. This paper presents an implementation of different attribute selection criteria and compares each with the alternative method.

Keywords: Stream, Stream Data Mining, Performance Parameter processing, MOA (Massive Online Analysis), Split Criterion.
1. INTRODUCTION
The characteristics of stream data are also its challenges. Because of its huge size, its continuous nature, and the speed with which it changes, stream data requires a real-time response that is produced after analyzing the data. As the data is huge, an algorithm that accesses it is restricted to a single scan. Data mining uses different algorithms for different mining tasks such as classification, clustering, and pattern recognition. In the same way, stream data mining uses different algorithms for different mining tasks. Algorithms for the classification of stream data include the Hoeffding Tree, VFDT (Very Fast Decision Tree), and CVFDT (Concept Adaptation Very Fast Decision Tree). These classification algorithms are based on the Hoeffding bound for decision tree generation: they use the bound to gather an optimum amount of data so that classification can be done accurately. CVFDT is able to detect concept drift, which is itself a challenge in stream data mining.

As stream data is extremely large, a method is required for improving the split criterion at the nodes of the decision tree, so that tree generation becomes faster, accuracy improves, and CPU utilization time is reduced. Two different split criteria are evaluated for stream data classification in this paper, and the algorithm is improved on that basis as part of the research work.

As said earlier, stream data is huge, so in order to perform an analysis we need to take a sample of the data so that it can be processed with ease. The sample should be chosen so that the data it contains is worth analyzing, which means that maximum knowledge is extracted from the sampled data. In this paper, the sampling technique used is an adaptive sliding window in a Hoeffding-bound-based tree algorithm.

2. RELATED WORK
Implementing an algorithm for stream data classification demands better resource utilization as well as better accuracy while classification is ongoing. Here, we look at an improvement to an algorithm that performs concept drift detection while classifying the data. Drift detection is done using a windowing technique.

Sliding Window: This is an advanced technique. The inspiration behind the sliding window is that the user is more concerned with the analysis of the most recent data. Thus, detailed analysis is done over the most recent data items and over summarized versions of the older ones. This idea has been adopted in many techniques in comprehensive data stream mining systems under development.

3. CLASSIFICATION PROCESS
There are many data mining algorithms in practice. Data mining algorithms can be categorized into three types:
1. Classification
2. Clustering
3. Association
A standard classification system normally has three different phases:
1. The training phase, during which the model is built using labeled data.
2. The testing phase, during which the model is tested by measuring its classification accuracy on withheld labeled data.
3. The deployment phase, during which the model is used to predict the class of unlabelled data.
The three phases are carried out in sequence. See Figure 3.1 for the standard classification phases.
Fig 3.1: Phases of standard classification systems

3.1. STREAM DATA MINING
Ordinary classification is usually considered in three phases. In the first phase, a model is built using data, called the training data, for which the property of interest (the class) is already known (labeled data). In the second phase, the model is used to predict the class of data (test data) for which the property of interest is known, but which the model has not previously seen. In the third phase, the model is deployed and used to predict the property of interest for unlabelled data. In stream classification, there is only a single stream of data, with labeled and unlabelled records occurring together in the stream. The training/test and deployment phases therefore interleave. Stream classification of unlabelled records could be required from the beginning of the stream, after some sufficiently long initial sequence of labeled records, at specific moments in time, or for a specific block of records selected by an external analyst.

4. ATTRIBUTE SELECTION CRITERION IN DECISION TREE
Selecting an appropriate splitting criterion helps to improve the performance measurement dimensions. In data stream mining there are three main performance measurement dimensions:
- Accuracy
- The amount of space (computer memory) necessary (model cost or RAM hours)
- The time required to learn from training examples and to predict (evaluation time)
These properties may be interdependent: adjusting the time and space used by an algorithm can influence accuracy. By storing more pre-computed information, such as lookup tables, an algorithm can run faster at the expense of space. An algorithm can also run faster by processing less information, either by stopping early or by storing less and thus having less data to process. The more time an algorithm has, the more likely it is that accuracy can be increased.
There are two major types of attribute selection criterion: Information Gain and Gini Index. The latter is also known as the binary split criterion. During the late 1970s and 1980s, J. Ross Quinlan, a researcher in machine learning, developed a decision tree algorithm known as ID3 [1] (Iterative Dichotomiser). ID3 uses information gain for attribute selection. The information gain Gain(A) is given as

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D)$$

We have developed a new algorithm to calculate information gain. Methodology wise this algorithm is promising. We have divided the algorithm into two parts: the first part calculates Info(D) and the second part calculates Gain(A).

4.1. Information Gain Calculation
Information Gain = (information before split) - (information after split)

Entropy: A common way to measure impurity is entropy:

$$\mathrm{Entropy} = -\sum_i p_i \log_2 p_i$$

where $p_i$ is the probability of class i, computed as the proportion of class i in the set. Entropy comes from information theory: the higher the entropy, the more the information content. For continuous data, the candidate split value is computed as the midpoint $(a_i + a_{i+1})/2$ of adjacent sorted values.
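As a small illustration of the split value for continuous data, the sketch below lists the candidate thresholds for one numeric attribute. It assumes that the split value is the midpoint between each pair of adjacent distinct values after sorting; the function name `candidate_split_points` is hypothetical.

```python
def candidate_split_points(values):
    """Return the midpoints between adjacent distinct sorted attribute values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


# Example: four observed values of one numeric attribute.
print(candidate_split_points([3, 1, 2, 2]))
```

Each candidate threshold induces a binary split (value <= threshold versus value > threshold) whose gain can then be evaluated like that of any other split.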
Calculating Information Gain

Information Gain = entropy(parent) - [average entropy(children)]

Entire population (30 instances):

$$\mathrm{entropy(parent)} = -\frac{14}{30}\log_2\frac{14}{30} - \frac{16}{30}\log_2\frac{16}{30} = 0.996$$

First child (17 instances):

$$\mathrm{entropy(child)} = -\frac{13}{17}\log_2\frac{13}{17} - \frac{4}{17}\log_2\frac{4}{17} = 0.787$$

Second child (13 instances):

$$\mathrm{entropy(child)} = -\frac{1}{13}\log_2\frac{1}{13} - \frac{12}{13}\log_2\frac{12}{13} = 0.391$$

(Weighted) average entropy of children:

$$\frac{17}{30}\cdot 0.787 + \frac{13}{30}\cdot 0.391 = 0.615$$

Information Gain = 0.996 - 0.615 = 0.38

Figure 4.1: Phases of standard classification systems

4.2. Calculating Gini Index
If a data set T contains examples from n classes, the Gini index gini(T) is defined as

$$\mathrm{gini}(T) = 1 - \sum_{j=1}^{n} p_j^2$$

where $p_j$ is the relative frequency of class j in T. gini(T) is minimized if the classes in T are skewed. After splitting T into two subsets T1 and T2 with sizes N1 and N2, the Gini index of the split data is defined as

$$\mathrm{gini}_{split}(T) = \frac{N_1}{N}\,\mathrm{gini}(T_1) + \frac{N_2}{N}\,\mathrm{gini}(T_2)$$

The attribute providing the smallest $\mathrm{gini}_{split}(T)$ is chosen to split the node.
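Both criteria of Sections 4.1 and 4.2 can be sketched in a few lines. The code below is an illustration under stated assumptions, not the implementation evaluated in this paper. Each node is represented only by its list of per-class counts, and the function names are hypothetical. The usage example reuses the 30-instance population of the worked example above, with class counts (14, 16) in the parent and (13, 4) and (1, 12) in the two children.

```python
import math


def entropy(class_counts):
    """Entropy = -sum(p_i * log2(p_i)) over the classes present in the node."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)


def gini(class_counts):
    """gini(T) = 1 - sum(p_j ** 2) over the classes in the node."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)


def weighted_impurity(children, impurity):
    """Average child impurity, weighted by the share of instances in each child."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * impurity(child) for child in children)


def information_gain(parent, children):
    """entropy(parent) minus the weighted average entropy of the children."""
    return entropy(parent) - weighted_impurity(children, entropy)


def gini_split(children):
    """gini_split(T) = N1/N * gini(T1) + N2/N * gini(T2) (any number of children)."""
    return weighted_impurity(children, gini)


# Class counts from the worked example: 30 instances split into 17 and 13.
parent = [14, 16]
children = [[13, 4], [1, 12]]

print(f"entropy(parent)  = {entropy(parent):.4f}")
print(f"entropy(child 1) = {entropy(children[0]):.4f}")
print(f"entropy(child 2) = {entropy(children[1]):.4f}")
print(f"information gain = {information_gain(parent, children):.4f}")
print(f"gini_split       = {gini_split(children):.4f}")
```

The printed entropies and gain agree with the worked example up to rounding of the intermediate values. When several attributes are compared, the one with the largest information gain, or with the smallest `gini_split`, would be selected.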
5. METHODOLOGY AND PROPOSED ALGORITHM
CVFDT (Concept Adaptation Very Fast Decision Tree) is an extended version of VFDT that provides the same speed and accuracy advantages, but adds the ability to detect and respond to changes in the example-generating process. CVFDT uses a sliding window of examples to keep its model consistent. Most systems need to learn a new model from scratch after new data arrives. Instead, CVFDT continuously monitors the quality of its earlier decisions against new data and adjusts those that are no longer correct. Whenever new data arrives, CVFDT increments the counts for the new data and decrements the counts for the oldest data in the window. If the concept is stationary, this has no statistical effect. If the concept is changing, however, some splits will no longer appear best because other attributes provide more gain on the new data. Whenever this occurs, CVFDT grows an alternative subtree with the new best attribute at its root. When the alternative subtree is more accurate on new data, it replaces the old subtree.

5.1 CVFDT ALGORITHM (Based on HoeffdingTree)
1. Alternate trees for each node in HT start as empty.
2. Process examples from the stream indefinitely.
3. For each example (x, y):
4. Pass (x, y) down to a set of leaves using HT and all alternate trees of the nodes (x, y) passes through.
5. Add (x, y) to the sliding window of examples.
6. Remove and forget the effect of the oldest examples if the sliding window overflows.
7. CVFDT Grow.
8. Check split validity if f examples have been seen since the last checking of alternate trees.
9. Return HT.
Fig: 5.1 Flow of CVFDT algorithm
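Two building blocks of the algorithm above can be sketched compactly: the Hoeffding bound that decides when enough examples have been seen to split, and the sliding-window bookkeeping of steps 5 and 6. This is a simplified illustration, not a full CVFDT implementation (alternate trees and split validity checks are omitted). The paper does not state the bound explicitly; the sketch assumes the standard form $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$, where R is the range of the split measure, $\delta$ the allowed error probability and n the number of examples seen. All class and function names are hypothetical.

```python
import math
from collections import Counter, deque


def hoeffding_bound(value_range, delta, n):
    """epsilon = sqrt(R^2 * ln(1/delta) / (2n)) for n observed examples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split once the best attribute beats the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


class SlidingWindowCounts:
    """Keeps (attribute index, value, class) counts for the most recent examples."""

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()
        self.counts = Counter()

    def add(self, x, y):
        # Step 5: add (x, y) to the window and increment its counts.
        self.window.append((x, y))
        for i, value in enumerate(x):
            self.counts[(i, value, y)] += 1
        # Step 6: forget the effect of the oldest example on overflow.
        if len(self.window) > self.window_size:
            old_x, old_y = self.window.popleft()
            for i, value in enumerate(old_x):
                key = (i, value, old_y)
                self.counts[key] -= 1
                if self.counts[key] == 0:
                    del self.counts[key]


# With two classes the range of information gain is log2(2) = 1.
print(should_split(best_gain=0.38, second_gain=0.20, value_range=1.0, delta=1e-7, n=1000))

window = SlidingWindowCounts(window_size=2)
for example in [((0,), "a"), ((0,), "a"), ((1,), "b")]:
    window.add(*example)
print(dict(window.counts))
```

Because the bound shrinks as n grows, a node with two closely matched attributes simply waits for more examples before splitting, while the window counts always reflect only the most recent data.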

6. EXPERIMENTAL ANALYSIS WITH OBSERVATION
Different types of datasets were taken, and the CVFDT algorithm was run on them after importing the datasets into MOA. The split criteria used in the decision tree approach were also compared with the aim of improving the accuracy of the algorithm. The datasets used here are in ARFF format. Some of the data are taken from the repository of the University of California, and some from projects in Spain that work on stream data. The datasets taken were as follows:
1) Sensor
2) Sea
3) Random Tree Generator

The readings taken here are for the Sensor data. It contains information (temperature, humidity, light, and sensor voltage) collected from 54 sensors deployed in the Intel Berkeley Research Lab. The whole stream contains consecutive information recorded over a 2-month period (1 reading per 1-3 minutes). The sensor ID is used as the class label, so the learning task of the stream is to correctly identify the sensor ID (1 out of 54 sensors) purely from the sensor data and the corresponding recording time. As the data stream flows over time, so do the concepts underlying the stream. For example, the lighting during working hours is generally stronger than at night, and the temperature of specific sensors (e.g., in the conference room) may regularly rise during meetings.
Fig: 6.1 MIT Computer Science and Artificial Intelligence Lab data repository
As discussed above, an attribute selection measure is a heuristic for selecting the splitting criterion that best separates the given data. Two common methods used for it are:
1) Entropy-based method (i.e., Information Gain)
2) Gini Index
6.1 RANDOM TREE GENERATOR DATA SET RESULTS

Instances   Information Gain (Accuracy)   Gini Index (Accuracy)
100000      92.6                          81.7
200000      93                            83
300000      94.7                          80.1
400000      96.3                          82.2
500000      94.8                          80.9
600000      96.9                          81.9
700000      96.9                          82.6
800000      96.7                          82.1
900000      98.7                          84
1000000     97.4                          77.9

Table-I: Comparison for accuracy in random tree generator
6.2 SEA DATA SET RESULTS

Instances   Information Gain (Accuracy)   Gini Index (Accuracy)
100000      89.8                          89.3
200000      92.1                          91.6
300000      89.6                          89.3
400000      89.1                          88.9
500000      88.5                          88.5
600000      88.8                          88.1
700000      90.6                          90.6
800000      89.5                          89.3
900000      89.1                          89
1000000     89.9                          89.9

Table-II: Comparison for accuracy for SEA Data
6.3 PERFORMANCE ANALYSIS BASED ON SENSOR DATA SET (CPU UTILIZATION)

Learning evaluation   Evaluation time (CPU seconds)   Evaluation time (CPU seconds)
instances             Info Gain                       Gini Index
100000                6.676843                        8.704856
200000                13.46289                        18.67332
300000                20.23333                        29.40619
400000                26.97257                        39.87386
500000                33.68062                        49.63952
600000                40.40426                        59.06198
700000                47.0499                         67.70443
800000                53.74234                        78.0941
900000                59.93558                        88.14057
1000000               66.79963                        98.48343
1100000               73.27367                        107.1727
1200000               79.27971                        116.9851
1300000               85.53535                        127.016
1400000               91.99379                        136.6257
1500000               98.40543                        145.2993
1600000               104.3803                        152.9278
1700000               110.3083                        160.0102
1800000               116.4859                        168.1223
1900000               121.9928                        174.8459

Table-III: Comparison of CPU Utilization time for SENSOR Data
6.4 PERFORMANCE ANALYSIS BASED ON SENSOR DATA SET (ACCURACY)

Learning evaluation   Classifications correct (percent)   Classifications correct (percent)
instances             Info Gain                           Gini Index
100000                96.3                                98.4
200000                68.3                                69.7
300000                18                                  64.4
400000                43.2                                67.4
500000                62.8                                72.9
600000                92                                  71
700000                97.9                                72.5
800000                97.4                                73.9
900000                96.8                                73.7
1000000               80.6                                68.5
1100000               53.6                                71.2
1200000               71                                  90.3
1300000               84.1                                73.1
1400000               78.5                                83.9
1500000               96.3                                84.9
1600000               50.9                                84.9
1700000               24                                  79
1800000               74.3                                87.6
1900000               98                                  97.8

Table-IV: Comparison of ACCURACY for SENSOR Data
6.5 PERFORMANCE ANALYSIS BASED ON SENSOR DATA SET (TREE SIZE)
Learning evaluation   Tree size (nodes)   Tree size (nodes)
instances             Info Gain           Gini Index
100000                14                  126
200000                30                  270
300000                44                  396
400000                60                  530
500000                76                  666
600000                88                  800
700000                102                 938
800000                122                 1076
900000                136                 1214
1000000               150                 1346
1100000               172                 1466
1200000               196                 1602
1300000               216                 1742
1400000               226                 1868
1500000               240                 1998
1600000               262                 2122
1700000               282                 2238
1800000               292                 2352
1900000               312                 2474

Table-V: Comparison of TREE SIZE for SENSOR Data
6.6 PERFORMANCE ANALYSIS BASED ON SENSOR DATA SET (LEAVES)

Learning evaluation   Tree size (leaves)   Tree size (leaves)
instances             Info Gain            Gini Index
100000                7                    63
200000                15                   135
300000                22                   198
400000                30                   265
500000                38                   333
600000                44                   400
700000                51                   469
800000                61                   538
900000                68                   607
1000000               75                   673
1100000               86                   733
1200000               98                   801
1300000               108                  871
1400000               113                  934
1500000               120                  999
1600000               131                  1061
1700000               141                  1119
1800000               146                  1176
1900000               156                  1237

Table-VI: Comparison of LEAVES for SENSOR Data

6.7 COMPARISON OF ALL DIMENSIONS OF PERFORMANCE TOGETHER FOR SENSOR DATA
Fig 6.2: Comparison of Performance for Sensor Data for every dimension together
7. CONCLUSION
In this paper, we discussed theoretical aspects and practical results of stream data mining classification algorithms with different split criteria. The comparison on different datasets shows the result analysis. Hoeffding trees with the windowing technique and Information Gain as the split criterion spend the least time learning and result in higher accuracy than with the Gini Index. Memory utilization, accuracy, and CPU utilization, which are crucial factors in stream data, are discussed in this paper with practical observations. Classification generates a decision tree, and the tree generated with Information Gain as the split criterion is also smaller, as shown in the tables, along with a dramatic change in accuracy and CPU utilization.

REFERENCES
[1] Elena Ikonomovska, Suzana Loskovska, Dejan Gjorgjevik, A Survey of Stream Data Mining, Eighth National Conference with International Participation, ETAI 2007.
[2] S. Muthukrishnan, Data Streams: Algorithms and Applications, Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2003.
[3] Mohamed Medhat Gaber, Arkady Zaslavsky and Shonali Krishnaswamy, Mining Data Streams: A Review, Centre for Distributed Systems and Software Engineering, Monash University, 900 Dandenong Rd, Caulfield East, VIC 3145, Australia.
[4] P. Domingos and G. Hulten, A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering, Proceedings of the Eighteenth International Conference on Machine Learning, 2001, Williamstown, MA, Morgan Kaufmann.
[5] H. Kargupta, R. Bhargava, K. Liu, M. Powers, P. Blair, S. Bushra, J. Dull, K. Sarkar, M. Klein, M. Vasa, and D. Handy, VEDAS: A Mobile and Distributed Data Stream Mining System for Real-Time Vehicle Monitoring, Proceedings of SIAM International Conference on Data Mining, 2004.
[6] Albert Bifet and Ricard Gavaldà, Adaptive Parameter-free Learning from Evolving Data Streams, Universitat Politècnica de Catalunya, Barcelona, Spain.
[7] Dariusz Brzezinski, Mining Stream with Concept Drift, Master's thesis, Poznan University of Technology.
[8] R. Manickam, D. Boominath and V. Bhuvaneswari, An Analysis of Data Mining: Past, Present and Future, International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 1-9, ISSN Print: 0976-6367, ISSN Online: 0976-6375.
[9] M. Karthikeyan, M. Suriya Kumar and S. Karthikeyan, A Literature Review on the Data Mining and Information Security, International Journal of Computer Engineering & Technology (IJCET), Volume 3, Issue 1, 2012, pp. 141-146, ISSN Print: 0976-6367, ISSN Online: 0976-6375.