
ABSTRACT

Staring at a large spreadsheet is not a good way to analyze data. It is
necessary to find an effective way to combine the computer's power to process data with the
human eye's ability to detect patterns. The techniques of data mining are designed for large
data sets.
The goal of classification is to assign a new object to a class from a given set of
classes based on the attribute values of the object. Different methods have been proposed for
the task of classification, such as the Decision Tree classifier, Neural Networks, the Bayesian
classifier and the K-nearest neighbor algorithm.
In this paper, we explain how to construct a Decision Tree for classification using the simplest
and earliest Concept Learning System algorithm, which is one of the techniques used in Data
Mining. Further, we explain how to construct rules from the constructed Decision Tree. These
rules can be used to mine the data from the database. We also highlight the optimal ordering of
attributes, and finally the concept of pruning is discussed in a nutshell. A simple weather data
set is taken as an example; it concerns the conditions under which some hypothetical outdoor
game may be played.

Keywords
Concept Learning System (CLS), Decision Tree (DT), Data Mining (DM), Entropy,
Tree Pruning.

1. INTRODUCTION
Data Mining is the process of data exploration to extract consistent patterns and
relationships among variables that can be used to make valid predictions. DM has become
increasingly popular in a wide range of fields owing to the large volumes of data collected
and the dramatic need to analyze these data. For example, in the pharmaceutical
industry, DM is used to analyze massive amounts of data when selecting chemical
components for producing new drugs. Successful Data Mining involves a series of
decisions that need to be made before the DM process starts: setting up clearly
defined goals, selecting the type of prediction (for example, classification)
appropriate for the data, and choosing the type of model to be used; for classification one
may choose a DT model.
In this paper, we explain how to construct a DT for classification using the
simplest and earliest CLS algorithm, which is one of the techniques used
in DM. Further, we explain how to construct rules from the constructed DT. These rules
can be used to mine the data directly, or expressed in SQL to mine the data from the database.
We also highlight the optimal ordering of attributes, and finally the concept of pruning is
discussed in a nutshell.

2. DECISION TREE CONCEPT


The Decision Tree is a powerful tool for classification and prediction. A decision tree is a
flowchart-like tree structure where each internal node denotes a test on an attribute, each
branch represents an outcome of the test, and leaf nodes represent classes or class
distributions. A typical decision tree is shown in Fig-1. It represents the concept
buys_computer, that is, it predicts whether or not a customer is likely to purchase a computer.
Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals.
In order to classify an unknown sample, the attribute values of the sample are tested
against the decision tree. A path is traced from the root to a leaf node that holds the class
prediction for the sample. Decision trees can easily be converted to classification rules.

[Figure: the root node tests Age? with branches <=30, 31..40 and >40; the <=30 branch leads
to a Student? test (no -> no, yes -> yes), the 31..40 branch to the leaf yes, and the >40
branch to a Credit rating? test (fair -> yes, excellent -> no).]

Fig-1: Example of a decision tree
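As an illustration only, the tree of Fig-1 can be written as a nested Python dict and used to
classify a sample by tracing a path from the root to a leaf; the dict layout and the classify()
helper below are our own conventions, not taken from the paper.

```python
# The Fig-1 tree as a nested dict: internal nodes test an attribute, leaves are class labels.
buys_computer_tree = {
    "attribute": "age",
    "branches": {
        "<=30":   {"attribute": "student",
                   "branches": {"no": "no", "yes": "yes"}},
        "31..40": "yes",
        ">40":    {"attribute": "credit_rating",
                   "branches": {"fair": "yes", "excellent": "no"}},
    },
}

def classify(tree, sample):
    """Trace a path from the root to a leaf by testing one attribute per node."""
    while isinstance(tree, dict):            # internal node
        value = sample[tree["attribute"]]    # test the attribute of this node
        tree = tree["branches"][value]       # follow the branch for that value
    return tree                              # leaf: the predicted class

print(classify(buys_computer_tree, {"age": "<=30", "student": "yes"}))   # -> yes
```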


3. CONCEPT LEARNING SYSTEM (CLS)
Concept learning is about making the machine learn the domain, and this can be done
using a decision tree. To construct a decision tree, an attribute is considered as the root;
it may take any of the values occurring in the sample, and all the possibilities at each
node are then found out.
[Figure: a parent node containing a mixture of +ve and -ve examples is split on attribute X,
producing child nodes for its values X1, X2 and X3.]

Fig-2: Decision tree sample

3.1. CLS Algorithm


1. Initialize the tree T by setting it to consist of one node containing all the
examples, both +ve and -ve, in the training set.
2. If all the examples (samples) in T are +ve, create a YES node and HALT.
3. If all the examples (samples) in T are -ve, create a NO node and HALT.
4. Otherwise, select an attribute F with values V1, ..., Vn. Partition T into subsets
T1, ..., Tn according to the values of F. Create branches with F as the parent and
T1, ..., Tn as child nodes.
5. Apply the procedure recursively to each child node (a sketch of this procedure in code
is given below).
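A minimal sketch of the procedure in Python is shown below; the nested-dict tree
representation, the "class" key for the target attribute and the fallback to a majority-class
leaf when no attributes remain are our own assumptions, not part of the original CLS
description.

```python
from collections import Counter

def cls(examples, attributes):
    """Recursive CLS tree construction; examples are dicts with a "class" key."""
    labels = {e["class"] for e in examples}
    if labels == {"YES"}:                       # step 2: all examples positive -> YES leaf
        return "YES"
    if labels == {"NO"}:                        # step 3: all examples negative -> NO leaf
        return "NO"
    if not attributes:                          # mixed but nothing left to split on (our fallback)
        return Counter(e["class"] for e in examples).most_common(1)[0][0]
    attr = attributes[0]                        # step 4: select an attribute F
    partitions = {}                             # partition T into T1..Tn by the values of F
    for e in examples:
        partitions.setdefault(e[attr], []).append(e)
    rest = [a for a in attributes if a != attr]
    return {"attribute": attr,                  # step 5: recurse on each child node
            "branches": {v: cls(sub, rest) for v, sub in partitions.items()}}
```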

4. DECISION TREE FROM DATABASE


Consider a simple weather data set concerning the conditions under which some
hypothetical outdoor game may be played. The data are shown in Table-1.
Ex. No   Temperature   Humidity   Outlook    Concept satisfied
  1      Cool          High       Overcast   YES
  2      Hot           Normal     Sunny      NO
  3      Hot           Normal     Windy      YES
  4      Mild          Normal     Sunny      NO
  5      Mild          Low        Rainy      YES
  6      Mild          Normal     Rainy      NO
  7      Mild          Low        Windy      YES

Table-1: Sample weather data
In this data set there are three categorical attributes: Temperature, Humidity and Outlook.
We are interested in building a system that will enable us to decide whether or not to play
the game on the basis of the weather conditions, i.e., we wish to predict the value of play using
Temperature, Humidity and Outlook. We can think of the attribute we wish to predict, Concept
satisfied, as the output attribute and the other attributes as input attributes.
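For later reference, Table-1 can be written as a small Python data set; the dict keys and the
"class" label for Concept satisfied are our own conventions, chosen to match the cls() sketch
of Section 3.1.

```python
weather = [
    {"temperature": "Cool", "humidity": "High",   "outlook": "Overcast", "class": "YES"},  # 1
    {"temperature": "Hot",  "humidity": "Normal", "outlook": "Sunny",    "class": "NO"},   # 2
    {"temperature": "Hot",  "humidity": "Normal", "outlook": "Windy",    "class": "YES"},  # 3
    {"temperature": "Mild", "humidity": "Normal", "outlook": "Sunny",    "class": "NO"},   # 4
    {"temperature": "Mild", "humidity": "Low",    "outlook": "Rainy",    "class": "YES"},  # 5
    {"temperature": "Mild", "humidity": "Normal", "outlook": "Rainy",    "class": "NO"},   # 6
    {"temperature": "Mild", "humidity": "Low",    "outlook": "Windy",    "class": "YES"},  # 7
]

# With the cls() sketch from Section 3.1 and "temperature" listed first, this yields a
# Temperature-rooted tree comparable to Fig-3 (the exact shape depends on which
# attribute is chosen at each node):
tree = cls(weather, ["temperature", "humidity", "outlook"])
```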
We can construct the tree from the sample database given in Table-1 by using the
CLS algorithm. At first we get a tree with the following structure.
[Partial tree: the root holds examples {1, 2, 3, 4, 5, 6, 7} and splits on Temperature into
{1} (Cool), {2, 3} (Hot) and {4, 5, 6, 7} (Mild).]


Then we can further expand the tree, as shown in Fig-3.

[Figure: root over examples {1, 2, 3, 4, 5, 6, 7}, split on Temperature. Cool {1} -> YES.
Hot {2, 3} -> Humidity = Normal {2, 3} -> Outlook: Sunny {2} -> NO, Windy {3} -> YES.
Mild {4, 5, 6, 7} -> Outlook: Sunny {4} -> NO, Windy {7} -> YES, Rainy {5, 6} -> Humidity:
Low {5} -> YES, Normal {6} -> NO.]

Fig-3: Decision tree constructed from Table-1

4.1 Extracting Classification Rules from Decision Tree


The knowledge represented in decision trees can be extracted and represented in the
form of classification IF-THEN rules. One rule is created for each path from the root to a leaf
node. Each attribute-value pair along a given path forms a conjunction in the rule antecedent
(IF part), and the leaf node holds the class prediction, forming the rule consequent (THEN
part). The IF-THEN rules may be easier for humans to understand, particularly if the given
tree is very large.
The decision tree of Fig-3 can be converted to classification IF-THEN rules by tracing
the path from the root node to each leaf node in the tree. After constructing the decision tree
using the CLS algorithm, we can extract the following rules from the DT shown in Fig-3.
IF (temp = mild AND ((outlook = sunny) OR (outlook = rainy AND humidity = normal)))
   OR (temp = hot AND outlook = sunny)
THEN NO

IF (temp = mild AND ((outlook = rainy AND humidity = low) OR outlook = windy))
   OR (temp = hot AND outlook = windy) OR (temp = cool)
THEN YES
By using disjunctive normal form (DNF) we can optimize the rules as shown below:

IF (temp = cool) OR (temp = hot AND outlook = windy) OR (temp = mild AND outlook = windy)
   OR (temp = mild AND outlook = rainy AND humidity = low)
THEN concept = satisfied (YES)
ELSE concept = not satisfied (NO).
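As an illustration only, the optimized DNF rule can be written directly as a Python predicate;
the lower-case attribute values and the function name concept_satisfied are our own convention.

```python
def concept_satisfied(temp, outlook, humidity):
    """True exactly when the DNF rule above classifies the example as YES."""
    return (temp == "cool"
            or (temp == "hot" and outlook == "windy")
            or (temp == "mild" and outlook == "windy")
            or (temp == "mild" and outlook == "rainy" and humidity == "low"))

concept_satisfied("mild", "rainy", "low")     # True  -> YES (example 5 in Table-1)
concept_satisfied("hot", "sunny", "normal")   # False -> NO  (example 2 in Table-1)
```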
In the CLS algorithm, attributes may be chosen in any order, which can result in large
decision trees if the ordering is not optimal; an optimal ordering would result in the smallest
decision tree. We therefore use CLS together with an efficient ordering of attributes.
4.2 Attribute Selection
The estimation criterion in the decision tree algorithm is the selection of the attribute
to test at each decision node in the tree. The goal is to select the attribute that is most useful
for classifying examples. A good quantitative measure of the worth of an attribute is a
statistical property called information gain, which measures how well a given attribute
separates the training examples according to their target classification. This measure is used
to select among the candidate attributes at each step while growing the tree.
4.2.1 Entropy
In order to define information gain precisely, we need a measure commonly used in
information theory, called entropy, that characterizes the impurity of an arbitrary
collection of examples. Given a set S containing only positive and negative examples of
some target concept, the entropy of S relative to this simple, binary classification is
defined as:

Entropy(S) = -Pp log2(Pp) - Pn log2(Pn)

where Pp is the proportion of positive examples in S and Pn is the proportion of negative
examples in S. In all calculations involving entropy we define 0 log2(0) to be 0.
To illustrate, suppose S is a collection of 25 examples, including 15 positive and 10
negative examples [15+, 10-]. Then the entropy of S relative to this classification is

Entropy(S) = -(15/25) log2(15/25) - (10/25) log2(10/25) = 0.970

Notice that the entropy is 0 if all members of S belong to the same class; for example, if all
members are positive then Pp = 1, Pn = 0 and Entropy(S) = -1 log2(1) - 0 log2(0) = 0. The
entropy is 1 (at its maximum) when the collection contains an equal number of positive and
negative examples; in all other cases the entropy lies between 0 and 1. Fig-4 shows the form
of the entropy function relative to a binary classification as Pp varies between 0 and 1.
[Fig-4: the entropy function relative to a binary classification, as the proportion of
positive examples Pp varies between 0 and 1.]
One interpretation of entropy from information theory is that it specifies the minimum
number of bits of information needed to encode the classification of an arbitrary member of S
(i.e., a member of S drawn at random with uniform probability).
For example, if Pp is 1, the receiver knows that the drawn example will be positive,
so no message need be sent and the entropy is 0. On the other hand, if Pp is 0.5, one bit is
required to indicate whether the drawn example is positive or negative. If Pp is 0.8, then a
collection of messages can be encoded using on average less than 1 bit per message by
assigning shorter codes to collections of positive examples and longer codes to less likely
negative examples.
Thus far we have discussed entropy in the special case where the target classification
is binary. If the target attribute takes on c different values, then the entropy of S relative to
this c-wise classification is defined as
Entropy(S) = Σ (i = 1 to c) -Pi log2(Pi)

where Pi is the proportion of S belonging to class i. Note the logarithm is still base 2 because
entropy is a measure of the expected encoding length measured in bits.
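As a check on the formula, a direct transcription into Python (our own helper, not part of any
library) is sketched below; it is used again in the information-gain computation later on.

```python
from math import log2

def entropy(labels):
    """Entropy in bits of a list of class labels; 0 * log2(0) is taken as 0."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

entropy(["+"] * 15 + ["-"] * 10)   # ~0.97, the [15+, 10-] example above
```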
4.2.2 Information gain
Given entropy as a measure of the impurity in a collection of training examples, we can
now define a measure of the effectiveness of an attribute in classifying the training data. The
measure we will use, called information gain, is simply the expected reduction in entropy
caused by partitioning the examples according to this attribute. More precisely, the

information gain, Gain(S, A) of an attribute A, relative to a collection of examples S, is


defined as
Gain(S, A) = Entropy(S) - Σ (v ∈ Values(A)) (|Sv| / |S|) Entropy(Sv)

where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S
for which attribute A has value v (i.e., Sv = {s ∈ S | A(s) = v}). Note that the first term in
the equation for Gain is just the entropy of the original collection S and the second term is the expected
value of the entropy after S is partitioned using attribute A. The expected entropy described
by this second term is simply the sum of the entropies of each subset Sv, weighted by the
fraction of examples |Sv|/|S| that belong to Sv. Gain (S,A) is therefore the expected reduction
in entropy caused by knowing the value of attribute A. Put another way, Gain (S,A) is the
information provided about the target attribute value, given the value of some other attribute
A. The value of Gain(S, A) is the number of bits saved when encoding the target value of an
arbitrary member of S, by knowing the value of attribute A.
The process of selecting a new attribute and partitioning the training examples is
repeated for each non-terminal descendant node; at each node the attribute with the
maximum information gain is chosen.
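A minimal sketch of this computation, reusing the entropy() helper above and the example
dicts introduced earlier (where the key "class" holds the target attribute, by our own
convention):

```python
def information_gain(examples, attribute):
    """Gain(S, A) = Entropy(S) - sum over v of (|Sv| / |S|) * Entropy(Sv)."""
    subsets = {}
    for e in examples:                                      # split S by the value of A
        subsets.setdefault(e[attribute], []).append(e["class"])
    expected = sum(len(sv) / len(examples) * entropy(sv)    # weighted entropy of the split
                   for sv in subsets.values())
    return entropy([e["class"] for e in examples]) - expected
```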
4.3 Tree Expansion using Information Gain
We explained the concept of entropy in the previous section; now we apply it to
choose the best attribute for tree expansion. First we calculate the expected entropy for the
whole tree.

Concept satisfied    Pr
YES                  4/7
NO                   3/7

So the expected entropy for the tree is

S = -(4/7) log2(4/7) - (3/7) log2(3/7) ≈ 0.99


Next, we choose one of the attributes Temperature, Humidity and Outlook by
calculating the information gain corresponding to each. Let us first see how to calculate the
expected entropy and information gain for the Temperature attribute.

Temperature   Concept satisfied   Probability
Cool          YES                 1/7
Hot           NO                  1/7
Hot           YES                 1/7
Mild          NO                  2/7
Mild          YES                 2/7

[Figure: the examples split by Temperature into Hot (Pr(Hot) = 2/7; NO and YES each with
probability 1/2, so S(Hot) = 1), Mild (Pr(Mild) = 4/7; NO and YES each with probability 1/2,
so S(Mild) = 1) and Cool (Pr(Cool) = 1/7; only YES, so S(Cool) = 0).]

Fig-5: Information gain for the Temperature attribute

S(temperature) = (2/7) x 1 + (1/7) x 0 + (4/7) x 1 = 6/7 ≈ 0.86
Information gain for Temperature = 0.99 - 0.86 = 0.13
Similarly, we can calculate the information gain corresponding to the attributes Humidity
and Outlook.
First Expansion

Attribute       Information Gain
TEMPERATURE     0.13
HUMIDITY        0.52
OUTLOOK         0.70   <- choose (maximum)

Table-2: Attributes and their information gain


According to the concept explained in the entropy section, we choose the attribute that has
the maximum information gain. From Table-2 it is obvious that the attribute Outlook has the
maximum information gain, so we choose Outlook as the root attribute for tree expansion.
The tree therefore first expands as shown in Fig-6.
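Under the conventions used in the earlier sketches (the weather list and the entropy() and
information_gain() helpers), the first expansion can be reproduced as follows; the printed
values match Table-2.

```python
for attr in ("temperature", "humidity", "outlook"):
    print(attr, round(information_gain(weather, attr), 2))
# temperature 0.13
# humidity 0.52
# outlook 0.7      <- maximum, so Outlook becomes the root
```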

[Fig-6: first expansion with Outlook as the root over examples {1, 2, 3, 4, 5, 6, 7}:
Overcast {1} -> YES, Sunny {2, 4} -> NO, Windy {3, 7} -> YES, Rainy {5, 6} -> to be
expanded further.]

After expanding all the attributes according to the concept of information gain, we get the
complete decision tree shown in Fig-7.

[Figure: root Outlook over examples {1, 2, 3, 4, 5, 6, 7}: Overcast {1} -> Yes,
Sunny {2, 4} -> No, Windy {3, 7} -> Yes, Rainy {5, 6} -> Humidity: Low {5} -> Yes,
Normal {6} -> No.]

Fig-7: Complete tree

Now we can construct the following rule from the tree shown in Fig-7:

IF (outlook is sunny) OR (outlook is rainy AND humidity is Normal)
THEN NO
ELSE YES
5. TREE PRUNE
When a decision tree is built, many of the branches will reflect anomalies in the
training data due to noise or outliers. Tree pruning methods address this problem of
overfitting the data. Such methods typically use statistical measures to remove the least
reliable branches, generally resulting in faster classification and an improvement in the
ability of the tree to correctly classify independent test data.

There are two common approaches to tree pruning:

1. Pre-pruning.
2. Post-pruning.

Pruning will overcome noise to some extent, but not completely.
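As an illustration only (the paper does not give a pruning procedure), a minimal reduced-error
post-pruning sketch over the nested-dict trees used earlier might look as follows; the
validation list is an assumed held-out set of examples, and classify() is the path-tracing
helper sketched in Section 2.

```python
from collections import Counter

def majority_class(examples):
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def accuracy(tree, examples):
    # Assumes every attribute value in `examples` also appears in the tree.
    return sum(classify(tree, e) == e["class"] for e in examples) / len(examples)

def prune(tree, training, validation):
    """Bottom-up: replace a subtree by a majority-class leaf whenever the leaf
    classifies the local validation examples at least as well as the subtree."""
    if not isinstance(tree, dict) or not validation:     # leaf, or nothing to judge by
        return tree
    attr = tree["attribute"]
    for value, child in list(tree["branches"].items()):  # prune the children first
        t_sub = [e for e in training if e[attr] == value]
        v_sub = [e for e in validation if e[attr] == value]
        tree["branches"][value] = prune(child, t_sub, v_sub)
    leaf = majority_class(training)
    if accuracy(leaf, validation) >= accuracy(tree, validation):
        return leaf                                      # pruning does not hurt: keep the leaf
    return tree
```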
6. STRENGTHS AND WEAKNESSES OF DECISION TREE METHODS
The strengths of decision tree methods are

Decision trees are able to generate understandable rules.

Decision trees perform classification without requiring much computation.

Decision trees provide a clear indication of which fields are most important for
prediction or classification.

The weaknesses of decision tree methods are

Decision trees are less appropriate for estimation tasks where the goal is to predict the
value of a continuous attribute.

Decision trees are prone to errors in classification problems with many classes and a
relatively small number of training examples.

7. CONCLUSION
Decision trees provide an efficient method of decision making because they clearly
lay out the problem so that all options can be challenged, and they allow us to analyze fully
the possible consequences of a decision. They provide a framework to quantify the values of
outcomes and the probabilities of achieving them, and they help us to make the best decision
on the basis of existing information and best guesses. As with all decision-making methods,
decision tree analysis should be used in conjunction with common sense; the decision tree is
just one important part of your decision-making tool kit.
