_______________________________________________________________________________________________
section V discusses the experimental results, and finally we conclude in section VI.

II. RELATED WORK

This section discusses related work on feature correlation and attribute selection for the ID3 algorithm. Nowadays the ID3 algorithm is used for prediction and classification in data science. The diversity of data decreases the performance of the ID3 algorithm. Various techniques have been used to improve its performance, such as weighting techniques, ant colony optimization, and fuzzy logic. The techniques discussed here are contributions toward improving the ID3 algorithm.

[1] The author proposes a new method for qualitative bankruptcy prediction. Many qualitative bankruptcy prediction models are available. These models use non-financial information as qualitative factors to predict bankruptcy. However, existing models use only a small number of qualitative factors, and the generated rules are redundant and overlapping. To improve prediction accuracy, the author proposes a model that applies a larger number of qualitative factors, which are categorized using the Fuzzy ID3 algorithm, while prediction rules are generated using the Ant Colony Optimization (ACO) algorithm. In Fuzzy ID3, the concepts of entropy and information gain help to rank the qualitative parameters, and this ranking can be used to generate prediction rules in qualitative bankruptcy prediction.

[2] In this paper the author describes ID3 as a mining algorithm based on decision trees, which selects the property with the highest gain as the test attribute of its sample set, establishes a decision node, and divides the samples in turn. The ID3 algorithm involves repeated logarithm operations, which affect the efficiency of generating the decision tree when there is a large amount of data. The selection criterion for data set attributes is therefore changed, using the Taylor formula to transform the algorithm so as to reduce the amount of calculation and the generation time of decision trees, and thus improve the efficiency of the decision tree classifier. It is shown that using the improved ID3 algorithm on customer base data samples can reduce the computational cost and improve the efficiency of decision tree generation.

[3] The author proposes a fuzzy decision tree for stock market analysis, which has traditionally proven difficult due to the large amount of noise present in the data. Decision trees based on the ID3 algorithm are used to derive short-term trading decisions from candlesticks. To handle the large amount of uncertainty in the data, both inputs and output classifications are fuzzified using well-defined membership functions. Testing results of the derived decision trees show significant gains compared to ideal mid- and long-term trading simulations, both in frictionless and realistic markets.

[5] The author proposes a new attribute-based method for multiclass data classification. Graph-based representation has been successfully used to support various machine learning and data mining algorithms. The learning algorithms strongly rely on the algorithm employed for constructing the graph from input data, given as a set of vector-based patterns. A popular way to build such graphs is to treat each data pattern as a vertex; vertices are then connected according to some similarity measure, resulting in a structure known as a data graph.

[6] The author proposes an improved ID3 decision tree algorithm. The decision tree is an important method for both induction research and data mining, and is mainly used for model classification and prediction. ID3 is the most widely used decision tree algorithm so far. After illustrating the basic ideas of decision trees in data mining, the paper discusses the shortcoming that ID3 is inclined to choose attributes with many values, and then presents a new decision tree algorithm combining ID3 and an Association Function (AF).

[7] The author proposes a new approach of Detecting Network Anomalies using improved ID3 with horizontal portioning based decision tree. During the last decades, different approaches to intrusion detection have been explored. The two most common approaches are misuse detection and anomaly detection. In misuse detection, attacks are detected by matching the current traffic pattern with the signatures of known attacks. Anomaly detection keeps a profile of normal system behavior and interprets any significant deviation from this normal profile as malicious activity. One of the strengths of anomaly detection is the ability to detect new attacks. Its most serious weakness is that it generates too many false alarms.

[8] In this paper, to solve this problem, the author proposes a decision tree algorithm based on attribute importance. The improved algorithm uses attribute importance to increase the information gain of attributes that have fewer values, and compares ID3 with the improved ID3 by an example. The experimental analysis of the data shows that the improved ID3 algorithm can obtain more reasonable and more effective rules. By introducing attribute importance, the improved algorithm emphasizes attributes with fewer values and higher importance, dilutes attributes with more values and lower importance, and resolves the classification defect of being inclined to choose attributes with more values.

[9] This paper attempts to summarize the advances in rough set theory (RST), its extensions, and their applications. It also identifies important areas which require further investigation. Typical example application domains are examined which demonstrate the success of the application of RST to a wide variety of areas and
_______________________________________________________________________________________________
ISSN (Print): 2319-2526, Volume-3, Issue-2, 2014
22
International Journal of Advanced Computer Engineering and Communication Technology (IJACECT)
_______________________________________________________________________________________________
disciplines, and which also exhibit the strengths and limitations of the respective underlying approaches. Formally, a rough set is the approximation of a vague concept (set) by a pair of precise concepts, called lower and upper approximations, which are a classification of the domain of interest into disjoint categories.

[10] The author proposes a method for anti-spam filtering. The task of an anti-spam filter is to rule out unsolicited bulk e-mail (junk) automatically from a user's mail stream. The two approaches used for classification here are based on fuzzy similarity and decision trees, and build an automatic anti-spam filter to classify e-mails as spam or legitimate. Systems based on the fuzzy similarity and ID3 approaches derive the classification from training data using learning techniques. The fuzzy-based method uses fuzzy sets, and the decision tree method uses a set of heuristic rules to classify e-mail messages.

[11] Here the author studies various data mining algorithms based on decision trees. A decision tree algorithm is a kind of data mining model that performs inductive learning based on examples. It makes it easy to extract display rules, requires a small amount of computation, can display important decision properties, and has high classification precision. For the study of decision-tree-based data mining algorithms, the article puts forward specific solutions for the problems of missing property values, multi-valued property selection, and property selection criteria, and proposes introducing weighted and simplified entropy into the decision tree algorithm so as to improve the ID3 algorithm.

III. ATTRIBUTE CORRELATION & GA ALGORITHM

The correlation coefficient is a statistical test that measures the strength and quality of the relationship between two variables. Correlation coefficients can range from -1 to 1. The absolute value of the coefficient gives the strength of the relationship; absolute values closer to 1 indicate a stronger relationship. The sign of the coefficient gives the direction of the relationship: a positive sign indicates that the two variables increase or decrease together, and a negative sign shows that one variable increases as the other decreases. In machine learning problems, the correlation coefficient is used to evaluate how accurately a feature predicts the target independent of the context of other features. The features are then ranked based on the correlation score [11]. For problems where the covariance cov(X_i, Y) between a feature (X_i) and the target (Y) and the variances of the feature (var(X_i)) and target (var(Y)) are known, the correlation can be directly calculated:

\[ R(i) = \frac{\mathrm{cov}(X_i, Y)}{\sqrt{\mathrm{var}(X_i)\,\mathrm{var}(Y)}} \tag{1} \]

Equation (1) can only be used when the true values for the covariance and variances are known. When these values are unknown, an estimate of the correlation can be made using Pearson's product-moment correlation coefficient over a sample of the population (x_i, y). This formula only requires finding the mean of each feature and the target to calculate:

\[ R(i) = \frac{\sum_{k=1}^{m} (x_{k,i} - \bar{x}_i)(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{m} (x_{k,i} - \bar{x}_i)^2 \, \sum_{k=1}^{m} (y_k - \bar{y})^2}} \tag{2} \]

where m is the number of data points, x_{k,i} is the value of feature i for data point k, and \bar{x}_i and \bar{y} are the sample means of the feature and the target. Correlation coefficients can be used for both regressors and classifiers. When the machine is a regressor, the range of values of the target may be any ratio scale. When the learning machine is a classifier, we restrict the range of values for the target to ±1. We then use the coefficient of determination, R(i)^2, to enforce a ranking of the features according to the goodness of linear fit between individual features and the target [25]. When using the correlation coefficient as a feature selection metric, we must remember that the correlation only finds linear relationships between a feature and the target. Thus, a feature and the target may be perfectly related in a non-linear manner, but the correlation could be equal to 0. We may lift this restriction by applying simple non-linear pre-processing techniques to the feature before calculating the correlation coefficients, to establish a goodness of non-linear relationship fit between a feature and the target [12].

A genetic algorithm (GA) is a population-based heuristic used for optimization. Genetic algorithms combine survival of the fittest among string structures with a structured yet randomized information exchange to form a search algorithm with some of the innovative flair of human search. These algorithms start with a set of random solutions called the initial population. Each member of this population is called a chromosome. Each chromosome consists of a string of genes. The number of genes and their values in each chromosome depend on the population specification. In the algorithm of this paper, the number of genes in each chromosome is equal to the number of nodes in the tree, and the gene values give the selection priority of the classification for the node, where a higher priority means that the task must be executed earlier [16]. The set of chromosomes in each iteration of the GA is called a generation, and its members are evaluated by their fitness functions. The new generation, i.e., the offspring, is created by applying operators to the current generation. These operators are crossover, which selects two chromosomes of the current population, combines them, and generates a new child (offspring), and mutation, which randomly changes some gene values of a chromosome and creates a new offspring. Then, the best offspring are selected by the evolutionary selection operator according to their fitness values. The GA has four steps as shown below in
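
The crossover, mutation, and selection operators described above can be sketched in Python as follows. This is only a minimal illustration and not the implementation used in this paper: the fitness function `toy_fitness`, the integer gene range, the one-point crossover, and the elitist selection over parents plus offspring are all assumptions made for the example.

```python
import random


def one_point_crossover(parent_a, parent_b, rng):
    """Combine two parents at a random cut point to produce one child."""
    # Requires chromosomes with at least two genes.
    cut = rng.randint(1, len(parent_a) - 1)
    return parent_a[:cut] + parent_b[cut:]


def mutate(chromosome, rate, max_gene, rng):
    """Replace each gene with a new random value with probability `rate`."""
    return [rng.randint(0, max_gene) if rng.random() < rate else gene
            for gene in chromosome]


def run_ga(fitness, n_genes, pop_size=20, generations=30,
           mutation_rate=0.1, max_gene=9, seed=0):
    """Evolve integer-valued chromosomes and return the fittest one found."""
    rng = random.Random(seed)
    # Initial population: a set of random solutions (chromosomes).
    population = [[rng.randint(0, max_gene) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        offspring = []
        for _ in range(pop_size):
            parent_a, parent_b = rng.sample(population, 2)
            child = one_point_crossover(parent_a, parent_b, rng)
            offspring.append(mutate(child, mutation_rate, max_gene, rng))
        # Selection (assumed elitist): keep the fittest of parents + offspring.
        population = sorted(population + offspring,
                            key=fitness, reverse=True)[:pop_size]
    return population[0]


def toy_fitness(chromosome):
    """Hypothetical fitness: larger gene (priority) values score higher."""
    return sum(chromosome)


best = run_ga(toy_fitness, n_genes=5)
print(best, toy_fitness(best))
```

In a GA-ID3 setting, `toy_fitness` would presumably be replaced by a measure of classification quality for the tree encoded by the chromosome; that mapping is not specified here and is left open.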
ID3-GA 95.23 15.46

In this paper we proposed an optimized ID3 method based on a genetic algorithm. Our method uses a feature correlation factor to estimate attributes for selection. These three components are combined to form the GA-ID3 model, which passes through the data, reduces the unclassified data, and improves the majority voting of the classifier. Our experimental results are better in comparison with the traditional ID3 classifier. Our experiments were performed on UCI data sets such as wine, iris, and cancer. The model is stable under different machine learning algorithms, dataset sizes, and feature sizes.
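
As a rough illustration of the correlation-based attribute ranking summarized above, the Pearson sample correlation of each feature with a ±1 target can be computed and the features ranked by its square. This is a sketch under stated assumptions: the feature names and data values are invented for the example, and the function names `pearson` and `rank_features` are hypothetical rather than taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length samples."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                    * sum((y - mean_y) ** 2 for y in ys))
    # A constant column has zero variance; treat it as uncorrelated.
    return num / den if den else 0.0


def rank_features(columns, target):
    """Rank features by squared correlation with the target, best first."""
    scores = {name: pearson(col, target) ** 2 for name, col in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up data: a two-class target encoded as -1 / +1.
target = [-1.0, -1.0, 1.0, 1.0]
columns = {
    "f_noise": [2.0, 1.0, 2.0, 1.0],
    "f_linear": [1.0, 2.0, 3.0, 4.0],
}
ranking = rank_features(columns, target)
print(ranking)

# A perfectly quadratic relationship can still have zero linear correlation.
nonlinear = pearson([-2.0, -1.0, 0.0, 1.0, 2.0], [4.0, 1.0, 0.0, 1.0, 4.0])
print(nonlinear)
```

The last two lines show the limitation of a purely linear score: y = x^2 on a symmetric sample is fully determined by x, yet its correlation is 0, so a non-linear transform of the feature would be needed before ranking.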
REFERENCES
[10] Binsy Thomas, Dr. J. W. Bakal "Fuzzy Similarity and ID3 algorithm for anti spam Filtering" IJEA ISSN: 2320-0804 Vol. 2 Issue 7, 2013.

[11] Linna Li, Xuemin Zhang "Study of Data Mining Algorithm Based on Decision Tree", ICCDA IEEE 2010, Pp 78-88.

[12] C. H. L. Lee, Y. C. Liaw, L. Hsu "Investment Decision Making by Using Fuzzy Candlestick Pattern and Genetic Algorithm" in IEEE International Conference on Fuzzy Systems 2011, Pp 2696-2701.

[13] W. Bi and J. Kwok "Multi-label classification on tree and DAG structured hierarchies" in Proceedings of the 28th International Conference on Machine Learning (ICML 2011). ACM, 2011, Pp 17-24.

[14] Narasimha Prasad, Prudhvi Kumar Reddy, Naidu MM "An Approach to Prediction of Precipitation Using Gini Index in SLIQ Decision Tree" 4th International Conference on Intelligent Systems, 2013. Pp 56-60.

[15] B. Chandra, P. Paul Varghese "Fuzzy SLIQ Decision Tree Algorithm" IEEE Transactions on Systems, Man and Cybernetics Part B: Cybernetics. Vol. 38, 2008.

[16] Sung-Hwan Min, Jumin Lee, Ingoo Han "Hybrid genetic algorithms and support vector machines for bankruptcy prediction" Elsevier Ltd. Expert Systems with Applications, 2010 Pp 5689-5697.