
Volume 3, Issue 10, October 2013

ISSN: 2277 128X

International Journal of Advanced Research in Computer Science and Software Engineering


Research Paper Available online at: www.ijarcsse.com

Experimental and Comparative Analysis of Machine Learning Classifiers


Mr. Hitesh H. Parmar
P.G. Student, C.E. Department, Marwadi Education Foundation's Group of Institutions, Rajkot, Gujarat, India

Prof. Glory H. Shah
Asst. Professor, C.E. Department, Marwadi Education Foundation's Group of Institutions, Rajkot, Gujarat, India

Abstract: Classification methods are widely used across a variety of fields, including medicine, banking and finance, social science, and political and economic science, to classify data that may contain many different attributes and would be difficult to classify manually. As people generate more data every day, there is a need for classifiers that can classify newly generated data accurately and efficiently. This paper focuses on the supervised learning technique called Random Forests, varying the values of its different hyperparameters to obtain accurate classification results. It also presents an experimental comparison of the Random Forests classifier with state-of-the-art supervised learning techniques, namely Naïve Bayes, C4.5, and ID3 (Iterative Dichotomiser 3), with respect to correctly classified instances, incorrectly classified instances, and, very importantly, ROC Area, which helps in understanding a classification model and its results. This can also help other researchers decide on a classification model based on their data and number of attributes.

Keywords: Data mining, Machine Learning, Classifiers, Random Forests.

I. INTRODUCTION
In Data Mining there are mainly two techniques available for data analysis: Data Classification and Data Prediction [2]. Classification techniques are mainly used to predict discrete class labels for new observations on the basis of a training data set provided to the classifier algorithm, while prediction techniques generally work with continuous-valued functions.
Classification techniques have been used in many different fields, such as Computer Vision [3], Text Classification [4], Fraud Detection [2], Sentiment Analysis [5], and many others. This paper focuses on supervised classification techniques, which work with two things: a training set, the collection of data that has already been classified, and a testing set, the collection of data whose class labels are to be determined based on the training set. Four classification algorithms are considered: 1. Naïve Bayes, 2. Decision Tree learning ID3 (Iterative Dichotomiser 3), 3. Decision Tree learning C4.5 (an extension of ID3), and 4. Random Forests. The paper is organized into six sections: section one discusses the introduction and usage of machine learning classification techniques, section two discusses the approaches of the four classifier techniques, section three is a literature survey, section four presents experiments and results, followed by the experimental evaluation in section five and the conclusion in section six.

II. UNDERSTANDING SUPERVISED MACHINE LEARNING APPROACH
This section deals with a basic understanding of the four algorithms mentioned above, along with their advantages and disadvantages. The supervised machine learning approach provides very good results in terms of accuracy when more data is available for training the classifier algorithm before testing new input data; it is often said that the more data available for training, the more accurate the results will be [2]. Supervised machine learning approaches have their own advantages and disadvantages [7], described below.
Advantages: Often provides more accurate results than human-driven data analysis. Can analyze very large amounts of data, which is certainly impossible for any human.
Disadvantages: Needs a large amount of training data for accurate results. It is impossible to obtain perfectly accurate results.
All the classifiers mentioned above are described here with an introduction and their working, followed by their strengths, weaknesses, and research issues.

2013, IJARCSSE All Rights Reserved

Page | 955

Hitesh et al., International Journal of Advanced Research in Computer Science and Software Engineering 3(10), October - 2013, pp. 955-963

A. Naïve Bayesian Classifier: The Naïve Bayes classifier is very widely used and is considered a state-of-the-art technique for many different applications, which makes it useful and accurate in providing results (Zhang et al. 2004). It is known as a probabilistic classifier because it applies Bayes' theorem, named after its originator Thomas Bayes, to classify data under strong independence assumptions. The Bayesian classifier considers the presence or absence of a particular feature of a class independently of the presence or absence of the other features (Amiri et al. 2013). The Naïve Bayes algorithm can be used for binary as well as multi-label classification. Results provided by Bayesian classifiers are comparable to approaches such as Decision Trees [10]. Strengths and weaknesses of the Naïve Bayes classifier are shown below.

B. Decision Tree Classifier ID3: The Decision Tree classifier ID3 (Iterative Dichotomiser 3) was developed by (Quinlan et al. 1986). The classifier uses a tree structure to classify the given data into a number of classes based on the training data. The structure is divided into two parts, nodes and branches, and two kinds of node play an especially important role in classifying the data: the Root node, from which every instance starts and travels toward a leaf node according to its feature values, and the Leaf nodes, which contain the actual class labels to be determined. Every internal node in a decision tree represents a feature that helps classify an instance, and each branch represents a value of that node's feature [10]. The ID3 algorithm is good at dealing with categorical attributes [2].
When dealing with multiple attributes, the split point of the decision tree is computed using a measure from information theory called Information Gain (Hunt et al., 1966), which serves as the attribute selection method for the ID3 algorithm. ID3 does not guarantee an optimal solution: it is greedy in nature and settles for a local optimum. The advantages, disadvantages, and research issues of ID3 are summarized below in terms of Strength, Weakness, and Research Issues.
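The attribute-selection step just described can be sketched in a few lines. The following is an illustrative computation (not taken from the paper) of entropy and information gain over categorical data, the quantity ID3 maximises when choosing which attribute to split on:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy of the parent node minus the weighted entropy of the child
    nodes produced by splitting on `feature` (a column index)."""
    n = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    children = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - children

# A toy split that perfectly separates the classes yields the full 1 bit of gain.
rows = [("sunny",), ("sunny",), ("rain",), ("rain",)]
labels = ["no", "no", "yes", "yes"]
print(information_gain(rows, labels, 0))
```

ID3 simply evaluates this gain for every remaining attribute at each node and splits on the attribute with the largest value.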

Decision Tree Classifier: ID3
Strength: Relatively fast; easy to run and easy to understand; automatic variable selection; good at handling missing values.
Weakness: Sharp decision boundaries; the model tends to evolve around the strongest effect; doesn't support pruning.
Research Issue: The larger a decision tree grows, the poorer the accuracy results it tends to return; researchers can work on algorithms that produce decision trees small in both size and depth while still providing good accuracy (Kothari et al. 2001).

Naïve Bayes Classifier
Strength: Relatively fast to train; easy to run and easy to understand; performs well even with small training sets.
Weakness: Relies on a strong assumption of feature independence, which rarely holds exactly in practice.
Research Issue: The Naïve Bayes algorithm is good at dealing with features that are completely independent, and sometimes performs surprisingly well on dependent features as well, so a deeper study is needed of the data characteristics that affect, or can affect, the performance of the Naïve Bayes algorithm (Rish et al. 2001).
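The Bayes-rule computation that the Naïve Bayes classifier performs can be sketched as follows. This is a minimal from-scratch illustration for categorical features (the function names and the toy data are invented for this sketch, not taken from the paper); the product over per-feature conditionals is exactly the "naive" conditional-independence assumption discussed above:

```python
import math
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Estimate class priors and per-class feature-value counts from
    categorical training data."""
    priors = Counter(labels)
    cond = defaultdict(int)  # (feature index, value, class) -> count
    n_values = [len({r[i] for r in rows}) for i in range(len(rows[0]))]
    for row, c in zip(rows, labels):
        for i, value in enumerate(row):
            cond[(i, value, c)] += 1
    return priors, cond, n_values

def predict_nb(row, priors, cond, n_values, alpha=1.0):
    """Return the class maximising P(c) * prod_i P(x_i | c), using log
    probabilities for stability and Laplace smoothing for unseen values."""
    total = sum(priors.values())
    best, best_score = None, float("-inf")
    for c, n_c in priors.items():
        score = math.log(n_c / total)
        for i, value in enumerate(row):
            score += math.log((cond[(i, value, c)] + alpha)
                              / (n_c + alpha * n_values[i]))
        if score > best_score:
            best, best_score = c, score
    return best
```

For example, training on `[("sunny", "hot"), ("sunny", "mild"), ("rain", "mild"), ("rain", "cool")]` with labels `["no", "no", "yes", "yes"]` and then predicting for `("sunny", "hot")` yields `"no"`, since both conditionals favour that class.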



C. Decision Tree Classifier C4.5: The Decision Tree classifier C4.5 was developed by (Quinlan et al. 1993) as an improved version of the earlier ID3 algorithm, which had a few problems: (Tom et al. 1997) shows that for an ID3-generated tree, as the number of nodes grows, accuracy increases on the training data but decreases on unseen test cases; this is called overfitting the data. The C4.5 classifier overcomes this problem. It uses the Gain Ratio as its attribute selection method, and it also provides pruning of the generated tree, which was not possible in ID3; in the pruning operation all irrelevant nodes are eliminated, resulting in a reduction of tree size. Strengths, weaknesses, and research issues for the C4.5 algorithm are mentioned below.

D. Random Forest: Random Forest classifiers are ensemble classifiers that combine many Decision Tree classifiers; the method was developed by (Breiman et al. 2001). It works by generating multiple random trees from bootstrap samples of the training data set, and several other things must be set before execution of the algorithm, such as the number of trees in the forest, the depth of each tree, the number of samples for bagging [13], and the number of features for splitting a node. The main advantage of using Random Forest is its randomness: it does not depend strongly on the particular data, so it is good at dealing with outliers. Random Forest is also good at dealing with high-dimensional data, and its performance and accuracy results are very promising and comparable with state-of-the-art techniques [13].
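The two sources of randomness just described, bootstrap sampling of the rows and random selection of features, can be illustrated with a deliberately tiny sketch. This is not the paper's method: for brevity each "tree" here is a one-level stump on a single randomly chosen feature, whereas a real random forest grows full decision trees and samples a feature subset at every split:

```python
import random
from collections import Counter

def train_stump(rows, labels, feature):
    """A one-level 'tree': map each value of `feature` to the majority
    label seen with it; unseen values fall back to the overall majority."""
    by_value = {}
    for row, lab in zip(rows, labels):
        by_value.setdefault(row[feature], []).append(lab)
    table = {v: Counter(labs).most_common(1)[0][0] for v, labs in by_value.items()}
    default = Counter(labels).most_common(1)[0][0]
    return lambda row: table.get(row[feature], default)

def train_forest(rows, labels, n_trees=25, seed=0):
    """Grow each 'tree' on a bootstrap sample (bagging) and a randomly
    chosen feature -- the randomness that decorrelates the trees."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        brows = [rows[i] for i in idx]
        blabels = [labels[i] for i in idx]
        trees.append(train_stump(brows, blabels, rng.randrange(n_features)))
    return trees

def predict_forest(trees, row):
    """Majority vote over the individual trees' predictions."""
    return Counter(t(row) for t in trees).most_common(1)[0][0]
```

Because each tree sees a different bootstrap sample and a different feature, the trees disagree in different places, and the majority vote averages out their individual errors, which is the intuition behind the accuracy and outlier robustness claimed above.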
One of the best things that Random Forests adds is randomness in how the data is used during classification, and due to this randomness each tree is highly uncorrelated with the other random trees. Correlated trees would not improve accuracy, but when highly uncorrelated trees are combined using the bagging approach, the results are much better than those of highly correlated trees that generate very similar outputs all the time. This is why the Random Forests classifier can provide good accuracy and can efficiently handle outliers. Strengths, weaknesses, and research issues of Random Forests are mentioned below. Random Forests was introduced in 2001, and many researchers have since used it in a number of application fields, including medical research and image processing, and many competitions have been won by making use of Random Forests [36].

Random Forests
Strength: Good at dealing with outliers due to randomness; provides good accuracy.
Weakness: Difficulty in selecting the hyperparameters; time consuming.
Research Issue: Multiple parameters must be set manually when dealing with Random Forests to get good accuracy results, so researchers can focus on optimizing the algorithm to automate the tuning of those parameters.

Decision Tree Classifier: C4.5
Strength: Handles training data with missing values; prunes trees after creation.
Weakness: Not good when dealing with continuous data values [12]; trees created from numeric data sets can be complex.
Research Issue: One problem with C4.5 is its high memory usage while generating rule sets for the given data [35].

III. LITERATURE SURVEY
There have been many experiments performed on these data mining classification techniques, and the results usually change from domain to domain; as (Wolpert et al.
1997) published, there is no algorithm that works best for all domains, a result also known as the "no free lunch" theorems [22]. This literature survey focuses on some of the previous work carried out by researchers in measuring the performance of, and comparing, different data


mining classification algorithms, providing valid information and remarks on their performance, output, and working behavior across various cases and parameters. (Sharma et al. 2011) performed an experimental evaluation in Weka [38] of several classification algorithms, including C4.5, ID3, and CART [23], to determine their accuracy in classifying whether emails are spam or not. The algorithms were then compared based on correctly classified instances. They concluded that the classification accuracy of C4.5 outperformed the other classifiers, with the accuracy of CART also very promising and comparable to C4.5, whereas the remaining algorithms reported lower accuracy [14]. (Yadav et al. 2012) performed an experiment to find the best classifier for predicting student performance, conducted on a sample of 50 students pursuing an MCA at VBS Purvanchal University, Jaunpur, Uttar Pradesh, India. The data included the students' previous semester marks, class test grades, seminar performance, attendance, and assignment work, and with the help of these data they tried to predict the end-semester results of these 50 students, using three classifiers: ID3, CART, and C4.5. They found that the classification accuracy of C4.5 was much better than that of the other two classifiers [15]. (Lakshmi et al. 2013) also worked in the field of educational data mining, using the same three classifiers used by (Yadav et al.
2012); their intention was very similar to that of (Yadav et al. 2012), the only difference being the type of data used. (Lakshmi et al. 2013) used the classifiers to predict the performance of students, collecting a sample of 120 undergraduate students with data including the students' qualification details, location details, financial support, family support and relations, etc. They tested all these samples in Weka with the three classifiers mentioned before, and from their experimental evaluation they achieved the highest accuracy with the CART algorithm, which was also comparable to the accuracy obtained with C4.5 [16]. (Khoonsari et al. 2012) performed many different experiments on two classification approaches, ID3 and C4.5, to check the efficiency and robustness of the techniques. Their experiments were carried out on nine different data sets from the UCI Repository, which contains many gold-standard data sets for machine learning, and most researchers in data mining and machine learning tend to use these data sets to compare the performance results of their experiments against previously published research. The data sets they used were categorical only, with the number of instances varying from 40 to 67557 and the number of attributes also varying across the data sets; none of the data sets contained missing values. They concluded that the robustness and accuracy of the C4.5 classification technique outperform the accuracy results returned by the ID3 classifier [17]. (Patel et al. 2012) demonstrated the use of machine learning classification techniques for a Network Intrusion Detection System, carrying out experiments with several state-of-the-art classification techniques, including Naïve Bayes and C4.5.
The data set they used was DARPA KDD99, also known as a gold-standard data set for researchers working in intrusion detection and evaluation. In their experiments and evaluations C4.5 performed better than the Naïve Bayes classifier, and they also concluded that instead of using only a single classifier, results can be improved by combining more than one classifier so that each compensates for the disadvantages of the other [18]. (Khan et al. 2008) used machine learning classification techniques such as Decision Trees for mining in Oral Medicine, using Decision Tree classifiers to mine large Electronic Medical Records that contain a lot of information useful for teaching students of Oral Medicine. They worked with a data set of examination records of more than 20000 patients, containing more than 180 different attributes and also suffering from missing values. They concluded that C4.5 performs better than ID3 when the data set contains missing values: C4.5 avoids overfitting and can better handle data with missing values compared to ID3, which does not perform well in their presence [19]. (Kotsiantis et al. 2007) presented a very detailed and explanatory survey of many different classification techniques, both supervised and unsupervised, including Decision Trees and Naïve Bayes, focusing on many classifier issues including algorithm selection, issues regarding supervised learning and accuracy, and implementations, and also providing information on the advantages and disadvantages of each classifier approach over the others.
He concluded the paper with the suggestion that researchers should not select a classifier merely on whether it is better than another, but on the basis of the characteristics under which that classifier performs really well. These characteristics include the type of attributes, as some classifiers are good with numbers whereas others handle both numeric and categorical attributes; the number of instances also plays a very important role, as some classifiers, like Naïve Bayes, provide very good results on small data sets, whereas others, like SVM, provide very good accuracy on high-dimensional data. More than one method can also be combined, but this requires a fair amount of study of both methods so that the limitations of one can be handled by the other; integrating multiple methods may, however, increase storage requirements and overall computation time. As there are many classification techniques available in machine learning, a number of parameters are available with whose help these techniques can be compared; a few of those


parameters are described here, as they play a very important role in deciding among classification techniques. They include the classification scheme, the data one wants to work with, data specifications, computation time, the classifier's ability to deal with noise or outliers, classification accuracy, and the number of model parameters to set to get efficient classification output. All these points are briefly described and compared below with respect to the four classification approaches, which will also help researchers in data mining and machine learning to understand the classifiers from these different perspectives.
Classifier Scheme: The classifier scheme generally refers to the way a classifier organizes its classification of the data. This paper considers only two types of scheme: hierarchical, which includes ID3, C4.5, and Random Forests, and probabilistic, which includes Naïve Bayes.
Data Specifications: Some algorithms handle high-dimensional data (many columns but comparatively few rows) very well, e.g. Random Forests [13]. Classifiers like Random Forests, an ensemble of decision trees, are good at dealing with high-dimensional data and can handle both categorical and numerical data, although a large sample size is also a key factor for higher prediction accuracy. Other algorithms, like Naïve Bayes, can perform well on small data sets. Logic-based classifiers such as the Decision Trees ID3 and C4.5 tend to perform better when used with data whose features are categorical or discrete [20].
Computational Time: It is really important that a classifier return promising accuracy predictions within a finite amount of time; the faster the classifier with good predictions, the better. Naïve Bayes is the better approach here, with its fast, short training time. The computation time of ID3 and C4.5 is also comparable with Naïve Bayes [20], but with Random Forest the computation time is higher than for the other three classifiers: Random Forest is an ensemble decision tree classifier, so it must first produce a number of random trees and then, after evaluating all those different random trees, perform the bagging operation, which produces accuracy results that are quite comparable to state-of-the-art classification techniques [13].
Outliers: When dealing with large amounts of data it is not certain that all of it will be accurate; some of the data may be noise or outliers, which the classifier must handle accurately, otherwise its predictions will be biased toward those outliers. Decision tree classifiers like ID3 and C4.5 are robust to outliers in training data [26], the Naïve Bayes classifier provides very high tolerance to outliers [25], and Random Forests uses the concept of bagging, which makes it much less sensitive to noise or outliers than other classifier algorithms.
Accuracy: The accuracy of an algorithm depends on the kind of data the algorithm can handle, the kind of data given as input, and the size of the data. ID3 often faces the problem of overfitting when dealing with large amounts of data, which affects its accuracy predictions [26].
The accuracy of Random Forests, on the other hand, is certainly good, as there is no requirement for pruning trees and overfitting is not a problem for it, unlike for classifiers such as ID3, and Random Forests can also handle categorical, continuous, and binary attributes very well [26]. Naïve Bayes works on the assumption of the independence of child nodes, which is certainly not always correct, and for that reason it is sometimes less accurate than other supervised algorithms [20]. Domingos & Pazzani (1997) provided excellent results comparing the Naïve Bayes classifier with several state-of-the-art classification techniques, including the Decision Tree induction classifier ID3, and found that Naïve Bayes sometimes performs very well relative to the other learning techniques [27].
Parameters: Classification accuracy can also be improved by setting the different parameters available to a classifier. The Naïve Bayes classifier works as-is, since it has no parameter settings, and ID3 likewise works directly from the training data set, whereas C4.5 requires setting the CF (confidence factor), which defaults to 25%, and MS (minimum number of split-off cases), which defaults to 2. These parameter values are the defaults suggested by Quinlan, the inventor of C4.5 [30]. If those parameters are set to values other than the defaults, the classifier may perform surprisingly well, leading to better accuracy predictions [29], but it is difficult to find the exact parameters for a particular algorithm.
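Searching over such parameter settings is itself a simple procedure. The following is an illustrative exhaustive grid search (the parameter names and the scoring function are invented for this sketch; in practice `evaluate` would train the classifier with the given parameters and return held-out accuracy):

```python
import itertools

def grid_search(param_grid, evaluate):
    """Try every combination of hyperparameter values and keep the one
    with the highest score from `evaluate` (e.g. held-out accuracy)."""
    best_params, best_score = None, float("-inf")
    keys = list(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid over Random Forest-style parameters.
grid = {"n_trees": [10, 50, 100], "n_features": [3, 9, 130]}
```

The cost is the product of the grid sizes, which is why manual tuning of this kind quickly becomes the search problem the paper describes.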
So, compared to Naïve Bayes and ID3, C4.5 requires dealing with other parameters beyond just running the algorithm, and the same is true of Random Forests, which has even more hyperparameters than C4.5, such as the number of trees to create in the forest, the depth of each tree, etc. [13]. There has not previously been a survey comparing the supervised classifiers ID3, C4.5, Naïve Bayes, and Random Forests; this paper focuses only on supervised classification techniques. The next section compares all four classifiers in terms of accuracy, with correctly classified instances, incorrectly classified instances, and ROC Area, which are really important in deciding the performance of any classifier model [37].
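The ROC Area used for that comparison can itself be computed without plotting anything, via its rank interpretation. A minimal sketch for binary labels (the function name is ours, not from any library used in the paper):

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via its rank interpretation: the
    probability that a randomly chosen positive instance is scored
    higher than a randomly chosen negative one (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

A classifier that scores every positive above every negative gets an AUC of 1.0 (perfect ranking), while random scoring gives 0.5, which matches the interpretation of ROC Area values used in the experiments that follow.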


IV. EXPERIMENTS AND RESULTS
To check the performance of the classifier algorithms, experiments were performed on five different data sets from the UCI repository, of both numeric and nominal types. Data set information is shown below.

Name of the Data Set    Number of Instances    Number of Attributes
Arrhythmia [A]                  452                    280
Musk [B]                       6598                    169
Splice [C]                     3190                     62
Spect_test [D]                  187                     23
Nursery [E]                   12960                      8
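The experiments in this section replace missing values and hold out 20% of each data set for testing. A minimal sketch of that kind of preprocessing, assuming simple modal imputation for nominal attributes (analogous in spirit to Weka's ReplaceMissingValues filter, though Weka's exact behavior is not reproduced here):

```python
import random
from collections import Counter

def replace_missing(rows, missing="?"):
    """Fill each missing nominal value with its column's most frequent
    observed value."""
    n_cols = len(rows[0])
    modes = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] != missing]
        modes.append(Counter(observed).most_common(1)[0][0])
    return [tuple(modes[j] if v == missing else v for j, v in enumerate(r))
            for r in rows]

def train_test_split(rows, labels, train_fraction=0.8, seed=1):
    """Shuffle, then hold out the final 20% for testing, mirroring the
    80/20 split used for all experiments in this paper."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * train_fraction)
    tr, te = idx[:cut], idx[cut:]
    return ([rows[i] for i in tr], [labels[i] for i in tr],
            [rows[i] for i in te], [labels[i] for i in te])
```

Shuffling before splitting matters because many repository data sets are stored sorted by class; a non-shuffled 80/20 cut could leave entire classes out of the training portion.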

The machine learning tool Weka was used with the above data sets to analyze three things: 1) correctly classified instances, 2) incorrectly classified instances, and 3) ROC Area, for all four algorithms: Naïve Bayes, ID3, C4.5, and Random Forests. The data sets range from low to high dimension; the first two, [A] and [B], contain numeric values as well as missing values, but the ID3 algorithm cannot deal with missing values, so those were filled in by applying the ReplaceMissingValues filter in Weka, and the numeric values were converted to nominal using Weka's NumericToNominal filter. The experiments and results are shown in three charts: the first shows correctly classified instances, the second shows incorrectly classified instances, and the last shows the ROC Area results, which together best characterize a classification algorithm [37]. All experiments were performed by splitting the data sets into 80% for training and 20% for testing, generally considered a good split for supervised classifiers, under Weka version 3.6.10 on a 32-bit Microsoft Windows 7 platform with 3GB of RAM. Fig. 1 below presents the result analysis of correctly classified instances; the vertical axis shows the accuracy of correctly classified instances as a percentage, and the horizontal axis represents the data sets used for the experiment.

[Fig. 1: Correctly Classified Instances of Data sets]

Fig. 2 presents the percentage of incorrectly classified instances for all five data sets. From this chart we can say that the size of the data set also plays a very important role in the classification approach. The vertical axis shows the incorrectly classified instances as a percentage, and the horizontal axis represents the data sets used for the experiment.

[Fig. 2: Incorrectly Classified Instances of Data sets]
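The percentages plotted in Fig. 1 and Fig. 2 follow directly from comparing predictions with the true labels. A minimal sketch (the function name is ours; note that ID3 can additionally leave instances unclassified, which would need to be counted as a third category):

```python
def classification_percentages(true_labels, predicted):
    """Percentages of correctly and incorrectly classified instances,
    as plotted in the two charts above."""
    n = len(true_labels)
    correct = sum(t == p for t, p in zip(true_labels, predicted))
    return 100.0 * correct / n, 100.0 * (n - correct) / n
```

For example, four test instances with one misprediction give (75.0, 25.0).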


Fig. 3 presents the ROC (Receiver Operating Characteristic) Area results for all five data sets. A ROC curve is a plot of True Positive Rate against False Positive Rate, and the ROC Area results provide great information on how accurate the classifier is. The vertical axis represents the ROC value as a probability, where a value of 1 means perfect classification, and the horizontal axis represents the data sets used for the experiment.

[Fig. 3: ROC Area results for Data Sets]

Performing the classification task on the five data sets above yielded some really interesting information about performance, discussed in the Experimental Evaluation section.

V. EXPERIMENT EVALUATIONS
After applying the classification algorithms to the data sets, the evaluation is as follows. The ID3 classifier may sometimes leave data unclassified: as Fig. 1 shows for data sets [A] and [C], the accuracy results were very low because most of the instances were left unclassified, resulting in poor accuracy. In contrast, for data set [B] the accuracy achieved was 100%, and for data set [E] the accuracy was very close to 100% for all three tree-based classifiers. The performance of Naïve Bayes was also much better than that of ID3 for correctly classified instances on all five data sets; likewise, the results of Random Forests and C4.5 are very comparable, but overall Random Forests outperforms the other three classifiers for correctly classified instances on four out of five data sets. Random Forests is a classification algorithm with several parameters, such as the number of random trees and the number of features to be selected, and to get better accuracy those parameters must be set appropriately. All of this performance analysis was carried out using Weka, in which the default number of random trees for Random Forests is 10 and features are selected randomly at runtime. Those default parameter values do not guarantee that the classification results will always be better; sometimes changing them manually can lead to a drastic change in accuracy, and in fact the defaults provided by Weka did not yield good accuracy on the first execution.
In these experiments, Random Forest was therefore tuned with parameter values set manually, instead of Weka's defaults, for the number of trees in the forest and the number of features, as these parameters play a very important role in its classification. The values that helped achieve good accuracy were: 10 trees and 9 random features for data set [A]; 10 trees and 130 random features for data set [B]; 50 trees and 10 random features for data set [C]; 10 trees and 10 random features for data set [D]; and 100 trees and 9 random features for data set [E]. With these parameter values Random Forests provided very promising and accurate results, sometimes 100%, as for data set [B]. Setting those parameters is really a search problem that requires great effort to get good accuracy results. For incorrectly classified instances, ID3 did not perform well: for the two data sets [A] and [C] it produced only about 10% correctly classified and 1% incorrectly classified instances, with the rest of the instances left unclassified. Here again Random Forests outperformed the other three classifiers on four out of five data sets. For ROC Area, ID3 did not perform well compared to the other three classifiers; Naïve Bayes surprisingly performed well, with results very comparable to Random Forests; and compared to C4.5, Random Forests performed better on all five data sets. Overall, from all this data and result analysis, Random Forests performed well compared to the other three algorithms.

VI. CONCLUSION AND FUTURE WORK
Classification techniques are being used in many different application areas, and there is no single classifier that can perform best all the time for every variety of data.
This paper presents an experimental comparison of four different classifiers, ID3, Naïve Bayes, C4.5 and Random Forests, on five standard data sets from UCI. The result analysis clearly shows that Random Forests outperforms the other three classifiers in terms of correctly classified instances, incorrectly classified instances and ROC Area. Future work will focus on optimizing the hyperparameters of Random Forests: at present those hyperparameters must be set manually to improve accuracy, which is time consuming and does not always lead to a better solution. We therefore plan to implement an optimization technique on top of Random Forests for automatic tuning of those hyperparameters, which may lead to better accuracy results.

© 2013, IJARCSSE All Rights Reserved
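The automatic hyperparameter tuning proposed as future work can be approached with an off-the-shelf exhaustive search. The following is a hedged illustration only, assuming Python and scikit-learn rather than the Weka setup used in the paper, of searching over the two hyperparameters that were tuned by hand in the experiments:

```python
# A sketch of automatic tuning via cross-validated grid search; the paper's
# experiments used Weka with manually chosen parameter values instead.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Search over the two parameters the paper varies by hand: number of trees
# and number of random features considered at each split.
param_grid = {"n_estimators": [10, 50, 100], "max_features": [1, 2, 3]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("cross-validated accuracy:", search.best_score_)
```

Exhaustive search grows combinatorially with the grid size, which is one reason the paper's future-work direction of a dedicated optimization technique (e.g. a randomized or heuristic search) is attractive for larger parameter spaces.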


REFERENCES
[1] B. Liu, "Statistical Approaches to Concept-Level Sentiment Analysis," IEEE, Vol. 28, Issue 3, 2013, pp. 6-9.
[2] J. Han, M. Kamber, Data Mining: Concepts and Techniques, 2nd Edition, 2006, p. 258.
[3] R. Szeliski, Computer Vision: Algorithms and Applications, Springer-Verlag, 2002.
[4] B. Baharudin, "A Review of Machine Learning Algorithms for Text-Documents Classification," Journal of Advances in Information Technology, Vol. 1, No. 1, February 2010, pp. 4-20.
[5] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up? Sentiment Classification Using Machine Learning Techniques," Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing, Vol. 10, 2002, pp. 79-86.
[6] G. Ridgeway, D. Madigan, T. Richardson, "Interpretable Boosted Naive Bayes Classification," in R. Agrawal, P. Stolorz, G. Piatetsky-Shapiro (eds), Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, 1998, pp. 101-104.
[7] Machine Learning Algorithms for Classification - http://www.cs.princeton.edu/~schapire/talks/picassominicourse.pdf
[8] H. Zhang, "The Optimality of Naive Bayes," Proceedings of the 17th International FLAIRS Conference, 2004.
[9] M. Amiri, "Using Naïve Bayes Classifier to Accelerate Constructing Fuzzy Intrusion Detection Systems," International Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307, Volume 2, Issue 6, January 2013.
[10] A. S. Galathiya, "Improved Decision Tree Induction Algorithm with Feature Selection, Cross Validation, Model Complexity and Reduced Error Pruning," International Journal of Computer Science and Information Technologies, Vol. 3 (2), pp. 3427-3431.
[11] T. M. Mitchell, lecture slides for the textbook Machine Learning, McGraw Hill, 1997.
[12] J. R. Quinlan, "Improved Use of Continuous Attributes in C4.5," Journal of Artificial Intelligence Research 4 (1996), pp. 77-90.
[13] L. Breiman, "Random Forests," Machine Learning, Vol. 45, Issue 1, 2001, pp. 5-32.
[14] A. K. Sharma, S. Sahni, "A Comparative Study of Classification Algorithms for Spam Email Data Analysis," International Journal on Computer Science and Engineering, Vol. 3, No. 5, May 2011, pp. 1890-1895.
[15] S. K. Yadav, B. Bharadwaj, S. Pal, "A Data Mining Application: A Comparative Study for Predicting Students' Performance," International Journal of Innovative Technology and Creative Engineering, Vol. 1, No. 12 (2012), pp. 13-19.
[16] T. M. Lakshmi, "An Analysis on Performance of Decision Tree Algorithms using Students' Qualitative Data," I.J. Modern Education and Computer Science, 2013, 5, pp. 18-27.
[17] P. E. Khoonsari and A. Motie, "A Comparison of Efficiency and Robustness of ID3 and C4.5 Algorithms Using Dynamic Test and Training Data Sets," International Journal of Machine Learning and Computing, Vol. 2, No. 5, October 2012, pp. 540-543.
[18] A. Ganatra, R. Patel, A. Thakkar, "A Survey and Comparative Analysis of Data Mining Techniques for Network Intrusion Detection Systems," International Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307, Volume 2, Issue 1, March 2012, pp. 265-271.
[19] F. S. Khan, R. M. Anwer, O. Torgersson and G. Falkman, "Data Mining in Oral Medicine Using Decision Trees," World Academy of Science, Engineering and Technology, 37, 2008, pp. 225-230.
[20] S. B. Kotsiantis, "Supervised Machine Learning: A Review of Classification Techniques," Informatica 31 (2007), pp. 249-268.
[21] archive.ics.uci.edu/ml/datasets.html: UCI Machine Learning Repository: Data Sets.
[22] D. H. Wolpert, W. G. Macready, "No Free Lunch Theorems for Optimization," IEEE Transactions on Evolutionary Computation 1, 1997, p. 67.
[23] L. Breiman, Classification and Regression Trees, CRC Press, New York, 1999.
[24] A. Ganatra, H. Bhavsar, "A Comparative Study of Training Algorithms for Supervised Machine Learning," International Journal of Soft Computing and Engineering (IJSCE), ISSN: 2231-2307, Volume 2, Issue 4, September 2012.
[25] A. S. Galathiya, A. P. Ganatra, C. K. Bhensdadia, "Improved Decision Tree Induction Algorithm with Feature Selection, Cross Validation, Model Complexity and Reduced Error Pruning," International Journal of Computer Science and Information Technologies, Vol. 3 (2), 2012, pp. 3427-3431.
[26] N. Horning, "Introduction to Decision Trees and Random Forests," American Museum of Natural History's Center for Biodiversity and Conservation, http://www.whrc.org/education/indonesia/pdf/DecisionTrees_RandomForest_v2.pdf
[27] P. Domingos and M. Pazzani, "On the Optimality of the Simple Bayesian Classifier under Zero-One Loss," Machine Learning 29, 1997, pp. 103-130.
[28] J. R. Beck, "A Backward Adjusting Strategy and Optimization of the C4.5 Parameters to Improve C4.5's Performance," Proceedings of the Twenty-First International FLAIRS Conference, AAAI Press, 2008.
[29] R. E. Banfield, L. O. Hall, K. W. Bowyer, W. P. Kegelmeyer, "A Comparison of Decision Tree Ensemble Creation Techniques," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 29, No. 1, January 2007, pp. 173-180.


[30] J. R. Beck, M. Garcia, M. Zhong, M. Georgiopoulos, and G. C. Anagnostopoulos, "A Backward Adjusting Strategy and Optimization of the C4.5 Parameters to Improve C4.5's Performance," FLAIRS Conference, AAAI Press, 2008, pp. 35-40.
[31] J. R. Quinlan, "Induction of Decision Trees," Machine Learning 1, 1 (March 1986), pp. 81-106.
[32] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, 1993.
[33] I. Rish, "An Empirical Study of the Naive Bayes Classifier," IJCAI-01 Workshop on Empirical Methods in AI.
[34] R. Kothari and M. Dong, "Decision Trees for Classification: A Review and Some New Results," in Lecture Notes in Pattern Recognition, S. K. Pal and A. Pal, Eds., World Scientific Publishing Company, Singapore, 2000.
[35] Experiments on C4.5 - http://rulequest.com/see5-comparison.html
[36] https://www.kaggle.com/wiki/RandomForests
[37] M. Abernethy, "Data Mining with WEKA, Part 2: Classification and Clustering," IBM developerWorks, 2010.
[38] http://www.cs.waikato.ac.nz/ml/weka/


