
International Journal of Advanced Computer Engineering and Communication Technology (IJACECT)

_______________________________________________________________________________________________

Improving the Classification Ratio of the ID3 Algorithm Using Attribute Correlation and Genetic Algorithm

1Priti Bhagwatkar, 2Parmalik Kumar
Department of Computer Science & Engineering
PIT, Bhopal, India; PCST, Bhopal, India
Email: bhagwatkarpriti@gmail.com, Parmalik.kumar@patelcollege.com

Abstract— As the size of a dataset grows, the classification performance of the ID3 algorithm decreases. Various techniques have been used to improve the ID3 algorithm, such as attribute selection methods, neural networks, and fuzzy-based ant colony optimization. All of these techniques face the problem of attribute correlation, which degrades the performance of the ID3 algorithm. In this paper we propose a GA-based ID3 algorithm for data classification, in which attribute correlations are generated using a genetic algorithm. The proposed algorithm was implemented in MATLAB using standard datasets from the UCI Machine Learning Repository. Our experimental results show a better classification ratio than both ID3 and fuzzy ID3.

Index Terms— ID3, attribute correlation, data mining, GA

I. INTRODUCTION

The diversity and applicability of data mining are increasing day by day in engineering and science, for example in the prediction of product market analysis. Data mining provides many techniques for mining data in several fields: association rule mining, clustering, classification, and emerging techniques such as ensemble classification. An ensemble classifier increases the classification rate and improves on the majority voting of individual classification algorithms such as KNN, decision trees, and support vector machines. A new paradigm for ID3 is the GA technique for data classification [1, 2]. This paper applies a classification procedure based on GA selection to the data and proposes an ID3 classifier selection method. In this method, many features are selected for a hybrid process. Then, the standard performance of each feature on the selected ID3 is calculated, and the classifier with the best average performance is chosen to classify the given data. In this computation a weighted average is used; weight values are calculated according to the distances between the given data and each selected feature. There are generally two types of multiple classifier combination: multiple classifier selection and multiple classifier fusion [3, 4]. Multiple classifier selection assumes that each classifier has expertise in some local regions of the feature space and attempts to find which classifier has the highest local accuracy in the vicinity of an unknown test sample. This classifier is then nominated to make the final decision of the system [8].

The attribute correlation technique is a new method for finding the relation between attributes using the correlation coefficient factor. The correlation coefficient factor estimates the correlation value, which is then passed to the genetic algorithm. A genetic algorithm is a population-based search technique that finds the best possible set of values for the classification process. The decision tree is one of the most widely used classification methods in data mining, and its core problem is the choice of splitting attributes. In the ID3 algorithm, information theory is applied to choose the attribute with the biggest information gain as the splitting attribute at each step [10, 11, 12], and a recursive procedure generates the decision tree until a stopping condition is reached. The ID3 algorithm has been applied extensively in many fields, but some inherent defects still exist. The most obvious one is its bias towards attributes with many values. To reduce this bias, researchers have proposed many improved methods, for example modifying the information gain of an attribute by weighting it by the number of attribute values, or adding the user's interestingness or attribute similarity to the information gain as a weight. However, these methods carry specific conditions and restrictions [14, 15]. Therefore, building on these research achievements, this paper proposes an improved ID3 based on weighted, modified information gain using a genetic algorithm, called GA_ID3.

The rest of the paper is organized as follows. Section II describes related work on the ID3 algorithm. Section III discusses attribute correlation and the genetic algorithm. Section IV presents the proposed classification methodology, Section V discusses the experimental results, and Section VI concludes the paper.
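The information-gain criterion at the heart of ID3 can be made concrete with a short sketch. This is a minimal illustration in Python (the paper's own experiments were run in MATLAB), using an invented toy dataset rather than anything from the paper:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain from splitting on attribute index `attr`."""
    n = len(labels)
    split = {}
    for row, y in zip(rows, labels):
        split.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in split.values())
    return entropy(labels) - remainder

# Toy data: two categorical attributes, binary class.
rows = [("sunny", "high"), ("sunny", "low"),
        ("rain", "high"), ("rain", "low")]
labels = ["no", "yes", "no", "yes"]

# ID3 picks the attribute with the largest gain as the splitting attribute.
best = max(range(2), key=lambda a: information_gain(rows, labels, a))
print(best)  # -> 1 (the second attribute separates the classes perfectly)
```

Here attribute 1 yields pure subsets (gain 1 bit) while attribute 0 yields no gain, so ID3 splits on attribute 1 first.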
_______________________________________________________________________________________________
ISSN (Print): 2319-2526, Volume-3, Issue-2, 2014

II. RELATED WORK

In this section we discuss related work on feature correlation and attribute selection for the ID3 algorithm. Nowadays the ID3 algorithm is used for prediction and classification in data science, but the diversity of data decreases its performance. Various techniques have been used to improve the performance of ID3, such as weighting, ant colony optimization, and fuzzy logic. These techniques are discussed here as contributions to the improvement of the ID3 algorithm.

[1] The authors propose a new method for qualitative bankruptcy prediction. Many qualitative bankruptcy prediction models are available; these models use non-financial information as qualitative factors to predict bankruptcy. However, they use only a small number of qualitative factors, and the generated rules contain redundancy and overlap. To improve prediction accuracy, the authors propose a model that applies a larger number of qualitative factors, which are categorized using the fuzzy ID3 algorithm, with prediction rules generated by the Ant Colony Optimization (ACO) algorithm. In fuzzy ID3, the concepts of entropy and information gain help to rank the qualitative parameters, which can then be used to generate prediction rules for qualitative bankruptcy prediction.

[2] The authors describe ID3 as a decision tree mining algorithm that selects the attribute with the highest gain as the test attribute of its sample set, establishes a decision node, and partitions the data in turn. The ID3 algorithm involves repeated logarithm operations, which reduce the efficiency of generating the decision tree for large amounts of data; the authors therefore change the selection criteria for dataset attributes, using the Taylor formula to transform the algorithm, reduce the amount of calculation and the generation time of the decision trees, and thus improve the efficiency of the decision tree classifier. It is shown that applying the improved ID3 algorithm to customer data samples can reduce the computational cost and improve the efficiency of decision tree generation.

[3] The authors propose a fuzzy decision tree for stock market analysis, which has traditionally proven difficult due to the large amount of noise in the data. Decision trees based on the ID3 algorithm are used to derive short-term trading decisions from candlesticks. To handle the large amount of uncertainty in the data, both inputs and output classifications are fuzzified using well-defined membership functions. Testing results of the derived decision trees show significant gains compared to ideal mid- and long-term trading simulations in both frictionless and realistic markets.

[4] The authors propose a new attribute-based method for multiclass data classification. Graph-based representations have been used successfully to support various machine learning and data mining algorithms. The learning algorithms rely strongly on the algorithm employed for constructing the graph from the input data, given as a set of vector-based patterns. A popular way to build such graphs is to treat each data pattern as a vertex; vertices are then connected according to some similarity measure, resulting in a structure known as a data graph.

[6] The authors propose an improved decision tree ID3 algorithm. The decision tree is an important method for both induction research and data mining, mainly used for model classification and prediction, and ID3 is the most widely used decision tree algorithm so far. After illustrating the basic ideas of decision trees in data mining, the paper discusses ID3's shortcoming of inclining towards attributes with many values and presents a new decision tree algorithm combining ID3 with an Association Function (AF).

[7] The authors propose a new approach for detecting network anomalies using improved ID3 with a horizontal-partitioning-based decision tree. During the last decades, different approaches to intrusion detection have been explored; the two most common are misuse detection and anomaly detection. In misuse detection, attacks are detected by matching the current traffic pattern with the signatures of known attacks. Anomaly detection keeps a profile of normal system behavior and interprets any significant deviation from this profile as malicious activity. One of the strengths of anomaly detection is the ability to detect new attacks; its most serious weakness is that it generates too many false alarms.

[8] To address ID3's bias, the authors propose a decision tree algorithm based on attribute importance. The improved algorithm uses attribute importance to increase the information gain of attributes with fewer values and compares ID3 with the improved ID3 by example. The experimental analysis shows that the improved ID3 algorithm produces more reasonable and more effective rules. By introducing attribute importance, the improved algorithm emphasizes attributes with fewer values and higher importance, dilutes attributes with more values and lower importance, and resolves the defect of inclining towards attributes with more values.

[9] This paper summarizes the advances in rough set theory (RST), its extensions, and their applications, and identifies important areas requiring further investigation. Typical example application domains are examined which demonstrate the success of applying RST to a wide variety of areas and

disciplines, and which also exhibit the strengths and limitations of the respective underlying approaches. Formally, a rough set is the approximation of a vague concept (set) by a pair of precise concepts, called the lower and upper approximations, which are a classification of the domain of interest into disjoint categories.

[10] The authors propose a method for anti-spam filtering. The task of an anti-spam filter is to rule out unsolicited bulk e-mail (junk) automatically from a user's mail stream. The two approaches used for classification are based on fuzzy logic and decision trees, building an automatic anti-spam filter that classifies e-mails as spam or legitimate. Systems based on fuzzy similarity and the ID3 approach derive the classification from training data using learning techniques; the fuzzy-based method uses fuzzy sets, and the decision tree method uses a set of heuristic rules, to classify e-mail messages.

[11] The authors study various data mining algorithms based on decision trees. The decision tree algorithm is a data mining model for induction learning from examples: it makes it easy to extract explicit rules, has a small computation cost, can display the important decision properties, and achieves high classification precision. This article puts forward specific solutions to the problems of missing property values, multi-valued property selection, and property selection criteria, and proposes to introduce weighted and simplified entropy into the decision tree algorithm so as to improve the ID3 algorithm.

III. ATTRIBUTE CORRELATION & GA ALGORITHM

The correlation coefficient is a statistical test that measures the strength and quality of the relationship between two variables. Correlation coefficients range from -1 to 1. The absolute value of the coefficient gives the strength of the relationship; absolute values closer to 1 indicate a stronger relationship. The sign of the coefficient gives the direction of the relationship: a positive sign indicates that the two variables increase or decrease together, and a negative sign shows that one variable increases as the other decreases. In machine learning problems, the correlation coefficient is used to evaluate how accurately a feature predicts the target, independent of the context of the other features. The features are then ranked based on the correlation score [11]. For problems where the covariance cov(Xi, Y) between a feature (Xi) and the target (Y) and the variances of the feature (var(Xi)) and target (var(Y)) are known, the correlation can be calculated directly:

    R(i) = cov(Xi, Y) / sqrt(var(Xi) var(Y))    ………(1)

Equation (1) can only be used when the true values of the covariance and variances are known. When these values are unknown, an estimate of the correlation can be made using Pearson's product-moment correlation coefficient over a sample of the population (xk,i, yk). This formula only requires finding the mean of each feature (x̄i) and of the target (ȳ):

    R(i) = [ Σk (xk,i − x̄i)(yk − ȳ) ] / sqrt( Σk (xk,i − x̄i)² Σk (yk − ȳ)² ),  k = 1, …, m    ………(2)

where m is the number of data points. Correlation coefficients can be used for both regressors and classifiers. When the machine is a regressor, the target may take values on any ratio scale; when the learning machine is a classifier, we restrict the range of values of the target to ±1. We then use the coefficient of determination, R(i)², to enforce a ranking of the features according to the goodness of the linear fit between the individual features and the target [25]. When using the correlation coefficient as a feature selection metric, we must remember that the correlation only finds linear relationships between a feature and the target. A feature and the target may be perfectly related in a non-linear manner, yet the correlation could be equal to 0. We may lift this restriction by applying simple non-linear pre-processing techniques to the feature before calculating the correlation coefficients, to establish the goodness of a non-linear relationship between a feature and the target [12].

A genetic algorithm is a population-based heuristic used for optimization. Genetic algorithms combine survival of the fittest among string structures with a structured yet randomized information exchange to form a search algorithm with some of the innovative flair of human search. These algorithms start with a set of random solutions called the initial population. Each member of this population is called a chromosome, and each chromosome consists of a string of genes; the number of genes and their values in each chromosome depend on the problem specification. In the algorithm of this paper, the number of genes in each chromosome is equal to the number of nodes in the tree, and the gene values give the selection priority of the classification for the node, where a higher priority means that the task must be executed earlier [16]. The set of chromosomes in each iteration of the GA is called a generation, and the chromosomes are evaluated by their fitness functions. The new generation, i.e. the offspring, is created by applying operators to the current generation: crossover, which selects two chromosomes of the current population, combines them, and generates a new child (offspring), and mutation, which randomly changes some gene values of a chromosome to create a new offspring. The best offspring are then selected by the evolutionary selection operator according to their fitness values. The GA has four steps, as shown in Figure 1.
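As a concrete illustration of Equations (1) and (2), the following Python fragment computes the Pearson coefficient of each feature against a ±1 target and ranks the features by R(i)². The data here are invented for the example; the paper's own experiments used MATLAB and UCI data:

```python
import math

def pearson(xs, ys):
    """Sample correlation R(i) of Equation (2): only the means of the
    feature and the target are needed."""
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Toy sample: feature 0 is linearly related to the target, feature 1 is noise.
X = [[1.0, 3.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
y = [-1, -1, 1, 1]          # classifier targets restricted to +/-1

# Rank features by R(i)^2, the coefficient of determination.
scores = [pearson([row[i] for row in X], y) ** 2 for i in range(2)]
ranking = sorted(range(2), key=lambda i: scores[i], reverse=True)
print(ranking)  # -> [0, 1]: feature 0 ranks first
```

As noted above, this score only detects linear dependence; a non-linear transform of a feature can be applied before calling pearson to test a non-linear fit.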

Fig. 1: Working process of the genetic algorithm

IV. PROPOSED METHODOLOGY

In this section we discuss the proposed algorithm for data classification. The proposed algorithm is a combination of ID3 and a genetic algorithm. Feature correlation is a very important function in the processing of both the genetic algorithm and the ID3 algorithm. The GA is used to train on both minority- and majority-class data samples for the tree classification process, and the input to the training phase is produced by a data sampling technique for the classifier. While the fitness function selects the initial input of the ID3 algorithm, the GA, optimized with a single value, can find relationships more quickly. The steps are:

1. Sample the data using the sampling technique.
2. Split the data into two parts: a training part and a testing part.
3. Apply the GA function to train on the sample values.
4. Using 2/3 of the sample, fit a tree with a split at each node. For each tree, classify the remaining 1/3 using the tree and calculate the misclassification rate (the out-of-GA error).
5. Repeat for each variable in the tree.
6. Compute the error rate: calculate the overall percentage of misclassification. For variable selection, average the increase in GA error over all trees and, assuming a normal distribution of the increase among the trees, decide an associated value for each feature.
7. Classify with the resulting classifier set; finally, to estimate the entire model, compute the misclassification and convert the binary attributes back to their actual values.

V. EXPERIMENTAL RESULT ANALYSIS

For the experimental analysis of the proposed algorithm we collected 3 datasets from the UCI Machine Learning Repository. The datasets have item counts varying from 150 to 1000 and feature counts from 4 to 10. A few datasets have missing values, which we replaced with negative values. Nominal data types are converted to integers, numbered from 1 in order of appearance. For datasets with multiple classes, we use class 1 as the positive class and all other classes as the negative class. We used 10-fold cross-validation for each experiment. Over the 10 rounds of cross-validation for each dataset in each experiment, we recorded the mean of the average accuracy of the individual classifiers. All processing was performed in MATLAB 7.8.0, and the results are shown in the tables below.

Table 1 shows the comparative results for the wine dataset:

    Dataset          Algorithm    Accuracy (%)    Time
    Wine Dataset     ID3          84.82           34.23
                     FUZZY_ID3    87.46           32.82
                     ID3-GA       93.17           17.89

Table 2 shows the comparative results for the iris dataset:

    Dataset          Algorithm    Accuracy (%)    Time
    Iris Dataset     ID3          84.52           35.11
                     FUZZY_ID3    86.76           31.33
                     ID3-GA       94.67           16.86

Table 3 shows the comparative results for the cancer dataset:

    Dataset          Algorithm    Accuracy (%)    Time
    Cancer Dataset   ID3          83.34           37.43
                     FUZZY_ID3    88.48           34.26
                     ID3-GA       95.23           15.46
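The GA machinery used in the pipeline above (Fig. 1: initial population, fitness evaluation, crossover, mutation, selection) can be sketched in a few lines. This is a generic feature-subset GA in Python, not the paper's MATLAB implementation; the fitness function and the `useful` set are hypothetical stand-ins for "accuracy of ID3 trained on the selected attributes":

```python
import random

random.seed(0)

def genetic_search(n_genes, fitness, pop_size=20, generations=40,
                   cx_rate=0.8, mut_rate=0.05):
    """Minimal GA: random initial population, then repeated rounds of
    one-point crossover, bit-flip mutation, and elitist survival.
    Chromosomes are bit-strings; bit i = 1 means attribute i is kept."""
    pop = [[random.randint(0, 1) for _ in range(n_genes)]
           for _ in range(pop_size)]
    for _ in range(generations):
        offspring = []
        while len(offspring) < pop_size:
            p1, p2 = random.sample(pop, 2)         # pick two parents
            if random.random() < cx_rate:          # one-point crossover
                cut = random.randrange(1, n_genes)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            child = [1 - g if random.random() < mut_rate else g
                     for g in child]               # mutation
            offspring.append(child)
        # evolutionary selection: the fittest pop_size chromosomes survive
        pop = sorted(pop + offspring, key=fitness, reverse=True)[:pop_size]
    return max(pop, key=fitness)

# Hypothetical fitness: reward attributes 0 and 2, penalise the rest,
# standing in for the ID3 accuracy obtained with that attribute subset.
useful = {0, 2}
def fitness(chrom):
    return sum(1 if i in useful else -1
               for i, g in enumerate(chrom) if g)

best = genetic_search(n_genes=6, fitness=fitness)
print(best, fitness(best))  # best attribute subset found and its score
```

In the paper's setting, evaluating a chromosome would mean building an ID3 tree on the attributes it selects (weighted by the correlation scores of Section III) and measuring held-out accuracy; the loop structure is unchanged.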


Fig. 2: Comparative analysis of classification accuracy and execution time of the three algorithms for the wine dataset

Fig. 3: Comparative analysis of classification accuracy and execution time of the three algorithms for the iris dataset

Fig. 4: Comparative analysis of classification accuracy and execution time of the three algorithms for the cancer dataset

VI. CONCLUSION

In this paper we proposed an optimized ID3 method based on a genetic algorithm. Our method uses a feature correlation factor for attribute selection; the feature correlation, the GA, and ID3 are combined to form the GA-ID3 model, which passes through the data, reduces the amount of unclassified data, and improves the majority voting of the classifier. Our experimental results show an improvement over the traditional ID3 classifier. The experiments were performed on UCI datasets: wine, iris, and cancer. The model is stable under different machine learning algorithms, dataset sizes, and feature sizes.

REFERENCES

[1] A. Martin, Aswathy V., Balaji S., T. Miranda Lakshmi, V. Prasanna Venkatesan, "An Analysis on Qualitative Bankruptcy Prediction Using Fuzzy ID3 and Ant Colony Optimization Algorithm", IEEE, 2012, pp. 56-67.

[2] Feng Yang, Hemin Jin, Huimin Qi, "Study on the Application of Data Mining for Customer Groups Based on the Modified ID3 Algorithm in the E-commerce", IEEE, 2012, pp. 78-87.

[3] Carlo Noel Ochotorena, Cecille Adrianne Yap, Elmer Dadios, Edwin Sybingco, "Robust Stock Trading Using Fuzzy Decision Trees", IEEE, 2012, pp. 24-33.

[4] Joao Roberto Bertini Junior, Maria do Carmo Nicoletti, Liang Zhao, "Attribute-based Decision Graphs for Multiclass Data Classification", IEEE, 2013, pp. 97-106.

[6] Chen Jin, Luo De-lin, Mu Fen-xiang, "An Improved ID3 Decision Tree Algorithm", IEEE, 2009, pp. 76-87.

[7] Sonika Tiwari, Roopali Soni, "Horizontal Partitioning ID3 Algorithm: A New Approach of Detecting Network Anomalies Using Decision Tree", IJERT, ISSN: 2278-0181, Vol. 1, Issue 7, September 2012.

[8] Liu Yuxun, Xie Niuniu, "Improved ID3 Algorithm", IEEE, 2010, pp. 34-42.

[9] N. Mac Parthaláin and Q. Shen, "On Rough Sets, Their Recent Extensions and Applications", The Knowledge Engineering Review, Vol. 25:4, pp. 365-395, Cambridge University Press, 2010.


[10] Binsy Thomas, J. W. Bakal, "Fuzzy Similarity and ID3 Algorithm for Anti Spam Filtering", IJEA, ISSN: 2320-0804, Vol. 2, Issue 7, 2013.

[11] Linna Li, Xuemin Zhang, "Study of Data Mining Algorithm Based on Decision Tree", ICCDA, IEEE, 2010, pp. 78-88.

[12] C. H. L. Lee, Y. C. Liaw, L. Hsu, "Investment Decision Making by Using Fuzzy Candlestick Pattern and Genetic Algorithm", IEEE International Conference on Fuzzy Systems, 2011, pp. 2696-2701.

[13] W. Bi and J. Kwok, "Multi-label Classification on Tree and DAG Structured Hierarchies", Proceedings of the 28th International Conference on Machine Learning (ICML 2011), ACM, 2011, pp. 17-24.

[14] Narasimha Prasad, Prudhvi Kumar Reddy, Naidu MM, "An Approach to Prediction of Precipitation Using Gini Index in SLIQ Decision Tree", 4th International Conference on Intelligent Systems, 2013, pp. 56-60.

[15] B. Chandra, P. Paul Varghese, "Fuzzy SLIQ Decision Tree Algorithm", IEEE Transactions on Systems, Man and Cybernetics, Part B: Cybernetics, Vol. 38, 2008.

[16] Sung-Hwan Min, Jumin Lee, Ingoo Han, "Hybrid Genetic Algorithms and Support Vector Machines for Bankruptcy Prediction", Expert Systems with Applications, Elsevier, 2010, pp. 5689-5697.


