Keywords
Concept Learning System (CLS), Decision Tree (DT), Data Mining (DM), Entropy, Tree Pruning.
1. INTRODUCTION
Data Mining (DM) is the process of exploring data to extract consistent patterns and relations among variables that can be used to make valid predictions. DM has become increasingly popular in a wide range of fields because of the large volumes of data being collected and the growing need to analyze these data. For example, in the pharmaceutical industry, DM is used to analyze massive amounts of data when selecting chemical components for producing new drugs. Successful data mining involves a series of decisions that must be made before the DM process starts: setting up clearly defined goals, selecting the type of prediction (for example, sample classification) appropriate for the data, and choosing the type of model to be used; for classification, one may choose a DT model.
In this paper, we explain how to construct a DT for classification using the CLS algorithm, one of the simplest and earliest techniques used in DM. Further, we explain how to derive rules from the constructed DT; these rules can be used directly to mine the data, or expressed in SQL to mine the data from a database. We also highlight the importance of an optimal ordering of attributes, and finally the concept of pruning is discussed in a nutshell.
Fig-1. An example decision tree: the root node tests Age; the branch age <= 30 leads to a test on Student (with yes/no leaves), the branch 31..40 leads directly to a yes leaf, and the branch age > 40 leads to a test on Credit rating (with fair and excellent branches ending in yes/no leaves).
Fig-2. A parent node testing attribute X, with child nodes (e.g. X2, X3) for the possible outcomes of the test.
Table-1. Sample weather data

Ex. No   Temperature   Humidity   Outlook    Concept satisfied
1        Cool          High       Overcast   YES
2        Hot           Normal     Sunny      NO
3        Hot           Normal     Windy      YES
4        Mild          Normal     Sunny      NO
5        Mild          Low        Rainy      YES
6        Mild          Normal     Rainy      NO
7        Mild          Low        Windy      YES
In this data set there are three categorical input attributes: Temperature, Humidity and Outlook. We are interested in building a system that will enable us to decide whether or not to play the game on the basis of the weather conditions, i.e., we wish to predict the value of the output attribute, Concept satisfied, using Temperature, Humidity and Outlook. We can think of the attribute we wish to predict, Concept satisfied, as the output attribute and the other attributes as input attributes.
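For the worked examples that follow it is convenient to have Table-1 in machine-readable form. The following Python sketch is purely illustrative: the attribute keys, the value strings and the WEATHER_DATA name are our own encoding choices, not part of the original data set description.

# Table-1 encoded as a list of examples; 'concept' is the output attribute.
WEATHER_DATA = [
    {"ex": 1, "temperature": "Cool", "humidity": "High",   "outlook": "Overcast", "concept": "YES"},
    {"ex": 2, "temperature": "Hot",  "humidity": "Normal", "outlook": "Sunny",    "concept": "NO"},
    {"ex": 3, "temperature": "Hot",  "humidity": "Normal", "outlook": "Windy",    "concept": "YES"},
    {"ex": 4, "temperature": "Mild", "humidity": "Normal", "outlook": "Sunny",    "concept": "NO"},
    {"ex": 5, "temperature": "Mild", "humidity": "Low",    "outlook": "Rainy",    "concept": "YES"},
    {"ex": 6, "temperature": "Mild", "humidity": "Normal", "outlook": "Rainy",    "concept": "NO"},
    {"ex": 7, "temperature": "Mild", "humidity": "Low",    "outlook": "Windy",    "concept": "YES"},
]

INPUT_ATTRIBUTES = ["temperature", "humidity", "outlook"]   # input attributes
OUTPUT_ATTRIBUTE = "concept"                                 # attribute we wish to predict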
We can construct a decision tree from the sample database given in Table-1 by using the CLS algorithm; one tree obtained in this way has the following structure (Fig-3).
Examples {1, 2, 3, 4, 5, 6, 7}: Temperature
  Cool -> {1}: YES
  Hot  -> {2, 3}: Outlook
            Sunny -> {2}: NO
            Windy -> {3}: YES
  Mild -> {4, 5, 6, 7}: Outlook
            Sunny -> {4}: NO
            Rainy -> {5, 6}: Humidity
                       Low    -> {5}: YES
                       Normal -> {6}: NO
            Windy -> {7}: YES

Fig-3. Decision tree constructed from Table-1
IF (temp = cool) OR (temp = hot AND outlook = windy) OR (temp = mild AND outlook = windy) OR (temp = mild AND outlook = rainy AND humidity = low) THEN concept = satisfied (YES)
ELSE concept = not satisfied (NO).
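To illustrate how such a rule can be used to mine the data, the sketch below is a hypothetical Python rendering of the rule above (the function name and value strings are our own; the same condition could equally be written as a WHERE clause in SQL).

def concept_satisfied(temp, humidity, outlook):
    """Direct translation of the rule extracted from the tree of Fig-3."""
    satisfied = (
        temp == "Cool"
        or (temp == "Hot" and outlook == "Windy")
        or (temp == "Mild" and outlook == "Windy")
        or (temp == "Mild" and outlook == "Rainy" and humidity == "Low")
    )
    return "YES" if satisfied else "NO"

# Example 5 of Table-1 (Mild, Low, Rainy) is classified as YES:
print(concept_satisfied("Mild", "Low", "Rainy"))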
In the CLS algorithm the attributes may be chosen in any order, which can result in large decision trees if the ordering is not optimal; an optimal ordering would yield the smallest decision tree. We therefore use CLS combined with an efficient ordering of attributes, as sketched below.
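A minimal sketch of a CLS-style tree builder, assuming the WEATHER_DATA encoding shown earlier and a caller-supplied attribute ordering (the dictionary tree representation and function names are our own choices):

def cls_build(examples, attributes):
    """Grow a tree CLS-style: stop when the examples agree on the concept,
    otherwise split on the next attribute in the supplied ordering."""
    classes = {e["concept"] for e in examples}
    if len(classes) == 1:                         # pure node: make a leaf
        return classes.pop()
    if not attributes:                            # nothing left to split on: majority leaf
        return max(classes, key=lambda c: sum(e["concept"] == c for e in examples))
    attribute, rest = attributes[0], attributes[1:]
    node = {"attribute": attribute, "branches": {}}
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        node["branches"][value] = cls_build(subset, rest)
    return node

# cls_build(WEATHER_DATA, ["temperature", "outlook", "humidity"]) reproduces the
# tree of Fig-3; other orderings may produce larger trees.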
4.2 Attribute Selection
The estimation criterion in the decision tree algorithm is the selection of the attribute to test at each decision node in the tree. The goal is to select the attribute that is most useful for classifying the examples. A good quantitative measure of the worth of an attribute is a statistical property called information gain, which measures how well a given attribute separates the training examples according to their target classification. This measure is used to select among the candidate attributes at each step while growing the tree.
4.2.1 Entropy
In order to define information gain precisely, we need a measure commonly used in information theory, called entropy, which characterizes the impurity of an arbitrary collection of examples. Given a set S containing only positive and negative examples of some target concept, the entropy of S relative to this simple binary classification is defined as:
Entropy(S) = -Pp log2 Pp - Pn log2 Pn
where Pp is the proportion of positive examples in S and Pn is the proportion of negative examples in S. In all calculations involving entropy we define 0 log 0 to be 0.
To illustrate, suppose S is a collection of 25 examples, including 15 positive and 10 negative examples [15+, 10-]. Then the entropy of S relative to this classification is
Entropy(S) = -(15/25) log2 (15/25) - (10/25) log2 (10/25) = 0.971
Notice that the entropy is 0 if all members of S belong to the same class; for example, if all members are positive (Pp = 1), then Pn = 0 and Entropy(S) = -1 log2 1 - 0 log2 0 = 0. The entropy is 1 (its maximum) when the collection contains an equal number of positive and negative examples; if the collection contains unequal numbers of positive and negative examples, the entropy is between 0 and 1. Fig-4 shows the form of the entropy function for a binary classification as Pp varies between 0 and 1.
Fig-4. The entropy function for a binary classification, as Pp varies between 0 and 1.
One interpretation of entropy from information theory is that it specifies the minimum
number of bits of information needed to encode the classification of an arbitrary member of S
(i.e., a member of S drawn at random with uniform probability).
For example, if Pp is 1, the receiver knows that the drawn example will be positive,
so no message need be sent, and entropy is 0. On the other hand, if Pp is 0.5, one bit is
required to indicate whether the drawn example is positive or negative. If Pp is 0.8, then a
collection of messages can be encoded using on average less than 1 bit per message by
assigning shorter codes to collections of positive examples and longer codes to less likely
negative examples.
Thus far we have discussed entropy in the special case where the target classification
is binary. If the target attribute takes on c different values, then the entropy of S relative to
this c-wise classification is defined as
Entropy(S) = Σ (i = 1 to c) -Pi log2 Pi
where Pi is the proportion of S belonging to class i. Note that the logarithm is still base 2 because entropy is a measure of the expected encoding length measured in bits.
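A direct transcription of this definition into code (a small illustrative helper; the function name and the list-of-labels input are our own conventions):

from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits of a collection of class labels.
    Classes that do not occur contribute nothing, i.e. 0 log 0 = 0."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

# The [15+, 10-] collection used above:
print(entropy(["+"] * 15 + ["-"] * 10))   # about 0.971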
4.2.2 Information gain
Given entropy as a measure of the impurity of a collection of training examples, we can now define a measure of the effectiveness of an attribute in classifying the training data. The
measure we will use, called information gain, is simply the expected reduction in entropy
caused by partitioning the examples according to this attribute. More precisely, the information gain Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as

Gain(S, A) = Entropy(S) - Σ (v in Values(A)) (|Sv| / |S|) Entropy(Sv)

where Values(A) is the set of all possible values of attribute A and Sv is the subset of S for which attribute A has value v. The first term of Gain is just the entropy of the original collection S and the second term is the expected
value of the entropy after S is partitioned using attribute A. The expected entropy described
by this second term is simply the sum of the entropies of each subset Sv, weighted by the
fraction of examples |Sv|/|S| that belong to Sv. Gain (S,A) is therefore the expected reduction
in entropy caused by knowing the value of attribute A. Put another way, Gain (S,A) is the
information provided about the target attribute value, given the value of some other attribute
A. The value of Gain(S, A) is the number of bits saved when encoding the target value of an
arbitrary member of S, by knowing the value of attribute A.
The process of selecting a new attribute and partitioning the training examples is now repeated for each non-terminal descendant node, choosing at each node the attribute with the maximum information gain.
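A possible implementation of Gain(S, A) on examples stored as dictionaries, mirroring the entropy helper above (again our own encoding, not the paper's notation):

from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attribute, target="concept"):
    """Gain(S, A) = Entropy(S) - sum over v in Values(A) of |Sv|/|S| * Entropy(Sv)."""
    labels = [e[target] for e in examples]
    expected = 0.0                                # expected entropy after the split
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        expected += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - expected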
4.3 Tree Expansion using Information Gain
We explained the concept of entropy in the previous section; now we apply it to choose the best attribute for tree expansion. First we calculate the entropy of the whole training set.

Concept satisfied   Pr
YES                 4/7
NO                  3/7

So the expected entropy for the tree is
Entropy(S) = -(4/7) log2 (4/7) - (3/7) log2 (3/7) = 0.99
Temperature   Concept satisfied   Probability
Cool          YES                 1/7
Hot           NO                  1/7
Hot           YES                 1/7
Mild          NO                  2/7
Mild          YES                 2/7

Within the Hot branch the concept is NO with probability 1/2 and YES with probability 1/2, and likewise within the Mild branch, so S(Hot) = 1 and S(Mild) = 1; within the Cool branch the concept is always YES, so S(Cool) = 0.

Fig-5. Information gain for the Temperature attribute
S(Temperature) = (2/7)·1 + (1/7)·0 + (4/7)·1 = 6/7 = 0.86
Information gain for Temperature = 0.99 - 0.86 = 0.13
Similarly we can calculate the information gain for the attributes Humidity and Outlook.
First Expansion

Attribute      Information Gain
TEMPERATURE    0.13
HUMIDITY       0.52
OUTLOOK        0.70   <- maximum gain, chosen for the split
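These gains can be checked numerically. The following self-contained snippet (using our own tuple encoding of Table-1) reproduces the values in the table above:

from collections import Counter
from math import log2

DATA = [  # Table-1 rows: (temperature, humidity, outlook, concept satisfied)
    ("Cool", "High", "Overcast", "YES"), ("Hot", "Normal", "Sunny", "NO"),
    ("Hot", "Normal", "Windy", "YES"),   ("Mild", "Normal", "Sunny", "NO"),
    ("Mild", "Low", "Rainy", "YES"),     ("Mild", "Normal", "Rainy", "NO"),
    ("Mild", "Low", "Windy", "YES"),
]

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain(column):
    """Information gain of the attribute stored in the given tuple position."""
    labels = [row[3] for row in DATA]
    expected = 0.0
    for value in {row[column] for row in DATA}:
        subset = [row[3] for row in DATA if row[column] == value]
        expected += len(subset) / len(DATA) * entropy(subset)
    return entropy(labels) - expected

for name, column in [("Temperature", 0), ("Humidity", 1), ("Outlook", 2)]:
    print(name, round(gain(column), 2))
# Prints roughly: Temperature 0.13, Humidity 0.52, Outlook 0.7 -> Outlook is chosen first.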
Examples {1, 2, 3, 4, 5, 6, 7}: Outlook
  Sunny    -> {2, 4}: NO
  Overcast -> {1}: YES
  Windy    -> {3, 7}: YES
  Rainy    -> {5, 6}: to be expanded further

Fig-6. First expansion of the tree using the Outlook attribute
After expanding all the nodes according to the information gain criterion, we get the complete decision tree shown in Fig-7.
Examples {1, 2, 3, 4, 5, 6, 7}: Outlook
  Sunny    -> {2, 4}: No
  Overcast -> {1}: Yes
  Windy    -> {3, 7}: Yes
  Rainy    -> {5, 6}: Humidity
                Low    -> {5}: Yes
                Normal -> {6}: No

Fig-7. Complete decision tree constructed using information gain
Now we can construct the following rule from the tree shown in Fig-7:

IF (Outlook is Sunny) OR (Outlook is Rainy AND Humidity is Normal)
THEN concept = not satisfied (NO)
ELSE concept = satisfied (YES).
5. TREE PRUNING
When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of overfitting the data. Such methods typically use statistical measures to remove the least reliable branches, generally resulting in faster classification and an improvement in the ability of the tree to correctly classify independent test data; a sketch of one such method is given below.
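As one concrete example of this idea, the sketch below applies reduced-error pruning, a common approach (not specifically prescribed above): a subtree is collapsed to a single leaf whenever that does not lower accuracy on a held-out validation set. It assumes the dictionary tree representation and the 'concept' key used in the earlier sketches, and for brevity takes the leaf label from the majority class of the validation examples reaching each node.

def subtree_accuracy(tree, examples):
    """Fraction of examples whose concept matches the prediction of the (sub)tree."""
    correct = 0
    for e in examples:
        node = tree
        while isinstance(node, dict):
            node = node["branches"].get(e[node["attribute"]])
            if node is None:                      # value unseen during training: count as a miss
                break
        correct += node == e["concept"]
    return correct / len(examples) if examples else 1.0

def prune(tree, validation, majority="YES"):
    """Reduced-error pruning sketch: prune bottom-up, replacing a subtree by a
    majority-class leaf when validation accuracy does not decrease."""
    if not isinstance(tree, dict):                # already a leaf
        return tree
    if validation:
        labels = [e["concept"] for e in validation]
        majority = max(set(labels), key=labels.count)
    for value, subtree in list(tree["branches"].items()):
        reaching = [e for e in validation if e[tree["attribute"]] == value]
        tree["branches"][value] = prune(subtree, reaching, majority)
    if subtree_accuracy(majority, validation) >= subtree_accuracy(tree, validation):
        return majority                            # collapsing does not hurt: prune
    return tree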
Decision trees provide a clear indication of which fields are most important for
prediction or classification.
Decision trees are less appropriate for estimation tasks where the goal is to predict the
value of a continuous attribute.
Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
7. CONCLUSION
Decision trees provide an efficient method of decision making because they clearly lay out the problem so that all options can be challenged. They allow us to analyze fully the possible consequences of a decision.