IDS 572 – Data Mining for Business

Fall 2015
Homework #2

Submitted By-
Group 11
Ankit Bhardwaj (abhard5@uic.edu)
Arpit Gulati (agulat4@uic.edu)
Nitish Puri (npuri5@uic.edu)
Problem 1
a) Input the data set. Set the role of INCOME to target. Use a partition node to divide the
data into 60% train, 40% test.

Ans: The Salary-class.csv file was read in through a Var. File source node, which was then
connected to a Partition node that divides the data into 60% train and 40% test.
b) Create the default C&R decision tree. How many leaves are in the tree?

Ans :

There are a total of 7 leaves, which can be seen by opening the decision tree with the
Viewer option of the C&R model.

c) What are the major predictors of INCOME?


Ans:
The major predictors of INCOME are MSTATUS, C-GAIN, DEGREE and JOBTYPE.
d) Give three rules that describe who is likely to have an INCOME > 50K and who is likely
to have an income <= 50K. These rules should be relevant (support at least 5% in the
training sample) and strong (either confidence more than 75% “> 50K" or 90% “<=
50K"). If there are no three rules that meet these criteria, give the three best rules you
can.
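The support and confidence figures reported with each rule in this answer can be reproduced from the leaf counts. The sketch below is a minimal illustration, not the tool's own computation: the function name is hypothetical, and it follows the convention used in this answer, where support is the number of correctly classified records in the leaf divided by all training records, and confidence is that same count divided by the records reaching the leaf.

```python
def support_and_confidence(n_correct, n_covered, n_total):
    """Return (support, confidence) as computed in this answer.

    n_correct: records in the leaf that carry the predicted class
    n_covered: all records that reach the leaf
    n_total:   all records in the training sample
    """
    support = n_correct / n_total
    confidence = n_correct / n_covered
    return support, confidence


N_TRAIN = 13559  # training records, taken from the figures in this answer

# (n_correct, n_covered) pairs taken from the three rules in this answer
rules = {
    "Rule 1 (<=50K)": (6835, 7170),
    "Rule 2 (>50K)": (942, 1307),
    "Rule 3 (<=50K)": (2932, 4141),
}

for name, (n_correct, n_covered) in rules.items():
    s, c = support_and_confidence(n_correct, n_covered, N_TRAIN)
    print(f"{name}: support={s:.5f}, confidence={c:.5f}")
```

Comparing each printed confidence with the 75% and 90% thresholds in the question shows which rules count as strong and which are only the best available.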

Rule 1 for INCOME <=50K

Support = 6835/13559 = 0.50409 or 50.409%


Confidence = 6835/7170 = 0.95328 or 95.328%

If MSTATUS in [ " Divorced" " Married-spouse-absent" " Never-married" " Separated"
" Widowed" ] and C-GAIN <= 7139.500 then INCOME <=50K

Rule 2 for INCOME >50K

Support = 942/13559 = 0.06947 or 6.947%


Confidence = 942/1307 = 0.72073 or 72.073%
If MSTATUS in [ " Married-AF-spouse" " Married-civ-spouse" ] and DEGREE in [ " Bachelors"
" Doctorate" " Masters" " Prof-school" ] and C-GAIN <= 5095.500 and JOBTYPE in
[ " Armed-Forces" " Exec-managerial" " Handlers-cleaners" " Prof-specialty" " Protective-serv"
" Sales" " Tech-support" ] then INCOME >50K

Rule 3 for INCOME <=50K

Support = 2932/13559 = 0.21624 or 21.624%


Confidence = 2932/4141 = 0.70804 or 70.804%

If MSTATUS in [ " Married-AF-spouse" " Married-civ-spouse" ] and DEGREE in [ " 10th" " 11th"
" 12th" " 1st-4th" " 5th-6th" " 7th-8th" " 9th" " Assoc-acdm" " Assoc-voc" " HS-grad" " Preschool"
" Some-college" ] and C-GAIN <= 5095.500 then INCOME <=50K.
e) Create two more C&R trees. The first is just like the default tree except you do not
“prune tree to avoid overfitting" (on the basic tab). The other does prune, but you
require 500 records in a parent branch and 100 records in a child branch. How do the
three trees differ (briefly). Which seems most accurate on the training data? Which
seems most accurate on the test data?
Ans:

Tree which does not prune


Tree that does prune, but require 500 records in a parent branch and 100 records in a child
branch.

The tree that does not prune has a maximum tree depth of 7 and 17 leaves.
The tree that prunes but requires 500 records in a parent branch and 100 records in a child
branch has a tree depth of 4 and 7 leaves.

The default tree has a tree depth of 4 and 7 leaves.

The default tree and the pruned (500, 100) tree are therefore very similar: they have the same
depth and the same number of leaves. The unpruned tree is larger (depth 7, 17 leaves), and its
predictor importance values differ from those of the other two trees.
When we connect all three trees to an Analysis node, the results are as given below.

All three trees classify 84.96% of the training records and 84.13% of the testing records
correctly, and the three models are in 100% agreement with each other, so none of them is more
accurate than the others on either partition. However, the default tree and the pruned
(500, 100) tree have a depth of only 4, so they are simpler and less prone to overfitting.
Problem 2
a) Input the zoo1.csv to train the decision tree classifier (C5.0) and come up with a decision
tree to classify a new record into one of the categories (pick the “favor accuracy" option
in the C5.0 node). Make sure you examine the data first and think about what field(s) to
use for the classification scheme.
Ans :
After examining the data, it is evident that the attribute Animal has a distinct value for
every record, so it should be excluded when building the tree.
b) Rename the generated node as “fulltree" and fully unfold it while browsing it. Use this
to draw the full tree - how many leaves does it have? What is the classification accuracy
on the training dataset? You can check this through an analysis node or through a table.

Ans:

Total number of leaves in fulltree is 7.

Classification accuracy on Training Data is 100%.


c) Next, reset the option in C5.0 to choose “ruleset" as opposed to “decision tree" and
generate a new node - rename this “fullrules." Once again fully unfold the ruleset and
write out the rules for each type.
Ans : Selected Rule Set as opposed to Decision Tree.

Rules for each type:


Rules for amphibian - contains 1 rule(s)
Rule 1 for amphibian
if feathers = FALSE
and milk = FALSE
and aquatic = TRUE
and breathes = TRUE
then amphibian
Rules for bird - contains 1 rule(s)
Rule 1 for bird
if feathers = TRUE
then bird
Rules for fish - contains 1 rule(s)
Rule 1 for fish
if backbone = TRUE
and breathes = FALSE
then fish
Rules for insect - contains 1 rule(s)
Rule 1 for insect
if feathers = FALSE
and milk = FALSE
and airborne = TRUE
then insect
Rules for invertebrate - contains 1 rule(s)
Rule 1 for invertebrate
if airborne = FALSE
and backbone = FALSE
then invertebrate
Rules for mammal - contains 1 rule(s)
Rule 1 for mammal
if milk = TRUE
then mammal
Rules for reptile - contains 1 rule(s)
Rule 1 for reptile
if feathers = FALSE
and milk = FALSE
and aquatic = FALSE
and backbone = TRUE
then reptile
Default: mammal
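The ruleset above can be read as a small classification function. The sketch below is only an illustration under a simplifying assumption: rules are tried in the order listed and the first match wins, with mammal as the default. The actual C5.0 ruleset may resolve overlapping rules differently, and the function name and example records are hypothetical.

```python
def classify(animal):
    """Classify a dict of boolean attributes with the rules listed above.

    Assumption (not confirmed by the source): rules are checked in listed
    order and the first matching rule decides the class.
    """
    a = animal
    if not a["feathers"] and not a["milk"] and a["aquatic"] and a["breathes"]:
        return "amphibian"
    if a["feathers"]:
        return "bird"
    if a["backbone"] and not a["breathes"]:
        return "fish"
    if not a["feathers"] and not a["milk"] and a["airborne"]:
        return "insect"
    if not a["airborne"] and not a["backbone"]:
        return "invertebrate"
    if a["milk"]:
        return "mammal"
    if not a["feathers"] and not a["milk"] and not a["aquatic"] and a["backbone"]:
        return "reptile"
    return "mammal"  # default class from the ruleset


# Hypothetical records, made up for illustration only
milk_giver = {"feathers": False, "milk": True, "aquatic": False,
              "breathes": True, "airborne": False, "backbone": True}
feathered = {"feathers": True, "milk": False, "aquatic": False,
             "breathes": True, "airborne": True, "backbone": True}

print(classify(milk_giver))  # mammal
print(classify(feathered))   # bird
```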
d) Compare your results from parts (b) and (c) and comment on them.
Ans:

Both models achieve 100% accuracy on the training dataset. However, the analysis shows that
the two models assign different Predictor Importance values to the attributes.

Predictor Importance for fulltree: it gives importance mainly to 3 attributes: milk, backbone
and feathers.

Predictor Importance for fullrules: it gives importance to the attributes milk, feathers,
backbone, airborne, breathes and aquatic.
e) Next, use the ”fulltree" node and an analysis node to classify the records in the testing
dataset, zoo2.csv (to do this just disconnect the zoo1.csv data source node and instead
connect a new data source node at the beginning of the data stream with zoo2.csv as
the variable). Compare the classification accuracy here with what you saw in part (b)
and comment. What are the misclassified animals?
Ans:

Using the Analysis node, we can see that the tree classifies 90% of the records correctly and
10% incorrectly, i.e. 3 records are misclassified. In part (b), by contrast, accuracy on the
training data was 100%, so the tree performs somewhat worse on data it was not trained on.
The three misclassified animals are:
1) Flea: classified as invertebrate, should be insect.
2) Seasnake: classified as fish, should be reptile.
3) Termite: classified as invertebrate, should be insect.
f) Suppose you wished to use a single level tree (i.e., 1R - just one attribute to classify) and
you use the full data set (zoo.csv) to determine this. Which of the three attributes “milk",
“feathers" and “aquatic" yields the best results? Why do you think the results are so
skewed in each case?
Ans:
Case 1, taking “Milk” as the only attribute:

Accuracy is 60.4% when only “Milk” is used to predict “Type”, i.e. 61 out of 101 records are
predicted correctly.
Case 2, taking “Feathers” as the only attribute:

Accuracy is 60.4% when only “Feathers” is used to predict “Type”, i.e. 61 out of 101 records
are predicted correctly.
Case 3, taking “Aquatic” as the only attribute:

Accuracy is 47.52% when only “Aquatic” is used to predict “Type”, i.e. 48 out of 101 records
are predicted correctly.

Clearly, the attributes “Milk” and “Feathers” yield the best results (60.4% each), compared
with “Aquatic”, which gives 47.52%.

The results are skewed in each case because we use only one attribute at a time and therefore
do not have sufficient information to classify all the animals into the correct types. In
addition, in all three cases mammal is the type that the model predicts for most of the
observations: when milk is true the type is mammal, when feathers is false the most common
type is mammal, and when aquatic is false the most common type is mammal. So the class
distribution favors mammal to a great extent.
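The single-attribute (1R) comparison can be sketched in a few lines of Python. This is an illustration rather than the stream used for the answer: the function name is hypothetical and the ten records are made-up toy data, not zoo.csv. Each attribute value is mapped to its most common class, and accuracy is measured on the same records.

```python
from collections import Counter, defaultdict


def one_r_accuracy(records, attribute, target):
    """Accuracy of a one-level tree that predicts, for each value of
    `attribute`, the most common `target` class seen with that value."""
    counts = defaultdict(Counter)
    for record in records:
        counts[record[attribute]][record[target]] += 1
    # Records matching the majority class of their branch are correct.
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / len(records)


# Toy data (hypothetical): 4 mammals, 2 birds, 2 fish, 2 insects
toy = (
    [{"milk": True, "feathers": False, "type": "mammal"}] * 4
    + [{"milk": False, "feathers": True, "type": "bird"}] * 2
    + [{"milk": False, "feathers": False, "type": "fish"}] * 2
    + [{"milk": False, "feathers": False, "type": "insect"}] * 2
)

print(one_r_accuracy(toy, "milk", "type"))      # 0.6
print(one_r_accuracy(toy, "feathers", "type"))  # 0.6
```

Even in this toy set, one binary attribute can separate at most two classes, so every remaining class is absorbed into whichever class dominates its branch.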
Problem 3
(a) What is the gini impurity value for this data set?

Solution: Here, buys-computer is the target variable.

Of the 14 records, 7 are Yes and 7 are No.
$p_{yes} = 7/14 = 0.5$
$p_{no} = 7/14 = 0.5$

The Gini measure of a node is one minus the sum of the squares of the proportions of the
classes.
Gini impurity value for the data set: $1 - (0.5^2 + 0.5^2) = 0.5$
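The same calculation can be written as a short Python function. This is a minimal sketch of the definition given above (one minus the sum of squared class proportions); the function name is illustrative.

```python
def gini(counts):
    """Gini impurity of a node, given the number of records in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# 7 Yes and 7 No records in the full data set
print(gini([7, 7]))  # 0.5
```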
(b) Suppose a decision tree algorithm did the initial split on income. For each of the
children describe the number of records, number of yeses, number of nos, and gini
impurity value.
Ans:-

Income
  HIGH:   0 Yes, 3 No (PURE)
  MEDIUM: 4 Yes, 2 No
  LOW:    3 Yes, 2 No

Description of the records, from the decision tree drawn:

• All 3 records with high income don't buy a computer.
• Out of 6 records with medium income, 4 buy a computer and 2 don't.
• Out of 5 records with low income, 3 buy a computer and 2 don't.

Gini index of the leaf nodes:
Gini(High income node) = 0 (pure)
Gini(Medium income node) = $1 - \left( \left(\tfrac{4}{6}\right)^2 + \left(\tfrac{2}{6}\right)^2 \right) = \tfrac{4}{9}$
Gini(Low income node) = $1 - \left( \left(\tfrac{3}{5}\right)^2 + \left(\tfrac{2}{5}\right)^2 \right) = \tfrac{12}{25}$

Gini(Income), calculated as the weighted average of Gini(High), Gini(Medium) and Gini(Low):
$= 0 \cdot \tfrac{3}{14} + \tfrac{4}{9} \cdot \tfrac{6}{14} + \tfrac{12}{25} \cdot \tfrac{5}{14}$
$= \tfrac{4}{21} + \tfrac{6}{35} = 0.3619$
(c) What is the gain in gini impurity of income obtained by splitting on income as in part
(b)?

Ans :- Gain = Gini index of entire data set – Gini index using ‘Income’ as split variable.
= 0.5 (part (a) calculated) – 0.3619 (part (b) calculated)
= 0.1381.
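The weighted Gini of the split and the resulting gain can be checked with a short Python sketch. The function names are illustrative; the class counts are the ones given in parts (a) and (b).

```python
def gini(counts):
    """Gini impurity of a node, given the number of records in each class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_gini(children):
    """Gini of a split: each child's Gini weighted by its share of records."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


parent = (7, 7)                       # (Yes, No) in the full data set
children = [(0, 3), (4, 2), (3, 2)]   # High, Medium, Low income

split_gini = weighted_gini(children)
gain = gini(parent) - split_gini
print(round(split_gini, 4), round(gain, 4))  # 0.3619 0.1381
```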

(d) Continue building the tree. If you do not split on a node, explain why. Draw the final tree
that you obtained.
Ans :-
We build the tree further, keeping ‘Income’ as the initial split variable, and try ‘Student’
and ‘Credit Rating’ as the next split variable to see what we get on the target variable.
Approach to splitting a node:
1. Follow a trial-and-error approach to choosing the splitting variable.
2. As soon as we get a pure subset, we do not split it further.

Income
  HIGH:   0 Yes, 3 No (PURE)
  MEDIUM: 4 Yes, 2 No
  LOW:    3 Yes, 2 No
Step 1: Choosing ‘Credit rating’ as the next split variable, we look at the values of the target variable for
the combinations <Medium income, Fair credit rating> and <Medium income, Excellent credit rating>.

Income
LOW
HIGH MEDIUM

3 Yes, 2 No
0 Yes, 3 No 4 Yes, 2 No Impurity

PURE Credit rating

FAIR EXCELLENT

1 Yes, 2 No’s
0 Yes, 3 No’s
PURE Impurity

Conclusion: We rule out ‘Credit rating’ as the next splitting variable for ‘MEDIUM’ income,
since it does not give us pure subsets.

Step 2: Choosing ‘Student’ as the next split variable for ‘MEDIUM’ income, we look at the values
of the target variable for the combinations <Medium income, is a student> and <Medium income,
is not a student>.

Income
  HIGH:   0 Yes, 3 No (PURE)
  MEDIUM: 4 Yes, 2 No, split on STUDENT
    YES: 4 Yes, 0 No (PURE)
    NO:  0 Yes, 2 No (PURE)
  LOW:    3 Yes, 2 No
Step 3: We next try ‘Credit Rating’ as the splitting variable for the LOW income group. We check
the values of the target variable for the combinations <Low income, Fair credit rating> and <Low
income, Excellent credit rating>.

FINAL DECISION TREE

Income
  HIGH:   0 Yes, 3 No (PURE)
  MEDIUM: 4 Yes, 2 No, split on STUDENT
    YES: 4 Yes, 0 No (PURE)
    NO:  0 Yes, 2 No (PURE)
  LOW:    3 Yes, 2 No, split on Credit Rating
    FAIR:      0 Yes, 2 No (PURE)
    EXCELLENT: 3 Yes, 0 No (PURE)

Based on the decision tree, we obtained, we can infer the following:


1) If income is high, customer will not buy a computer.
2) If income is medium, and customer is a student, then customer will buy a computer.
3) If income is medium, and customer is not a student, then customer will not buy a
computer.
4) If income is low, and credit rating is fair, then customer will not buy a computer.
5) If income is low, and credit rating is excellent, then customer will buy a computer.
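The five rules above translate directly into a nested conditional. The sketch below is one way to write them in Python; the function name, argument names and lowercase string values are illustrative choices, not part of the data set.

```python
def buys_computer(income, student, credit_rating):
    """Predict buys-computer ("yes"/"no") from the final decision tree.

    income: "high", "medium" or "low"
    student: True or False (used only for medium income)
    credit_rating: "fair" or "excellent" (used only for low income)
    """
    if income == "high":
        return "no"
    if income == "medium":
        return "yes" if student else "no"
    if income == "low":
        if credit_rating == "excellent":
            return "yes"
        if credit_rating == "fair":
            return "no"
        raise ValueError(f"unknown credit rating: {credit_rating!r}")
    raise ValueError(f"unknown income level: {income!r}")


print(buys_computer("medium", True, "fair"))    # yes
print(buys_computer("low", False, "fair"))      # no
```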
Problem 4:
a) Describe the purpose of separating the data into training and testing data.

The purpose of separating the data into training and testing data is to estimate how
accurately the model predicts on data it has not seen. We measure the performance of a
model in terms of its error rate: the percentage of incorrectly classified instances in the
data set. To measure the performance of a model, we divide the data set into:

Training data: the training set (seen data) is used to build the model (determine its
parameters).
Test data: the test set (unseen data) is used to measure the model's performance while
holding the parameters constant. It stands in for data we will receive in the future,
whose Y (dependent variable) value we do not know and must predict using our model.

Sometimes it is useful to temporarily "split" a dataset in order to compare analytic output
across different subsets of data. This can be useful when you want to compare frequency
distributions or descriptive statistics with respect to the categories of some variable (e.g.,
Gender), or want to filter the results so that you can pull out only the information relevant to
the group of interest.

Outcomes of a predictive model on the data set:

• Success: an instance of the data set is predicted correctly.
• Error: an instance of the data set is predicted incorrectly.
• Error rate: the proportion of errors made over the whole set of instances.
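The split and the error rate described above can be sketched with the Python standard library. This is an illustration only: the function names, the fixed seed and the 60/40 default (borrowed from Problem 1) are assumptions, and the Partition node in the modelling tool may sample differently.

```python
import random


def train_test_split(records, train_fraction=0.6, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def error_rate(actual, predicted):
    """Proportion of instances whose predicted class differs from the actual one."""
    errors = sum(a != p for a, p in zip(actual, predicted))
    return errors / len(actual)


train, test = train_test_split(range(100))
print(len(train), len(test))  # 60 40

# Hypothetical labels: one of four predictions is wrong
print(error_rate(["yes", "no", "yes", "no"], ["yes", "yes", "yes", "no"]))  # 0.25
```

The model is fitted on `train` only, and `error_rate` is then computed on `test`, so the reported figure reflects records the model did not see while its parameters were being set.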

b) Which problem do we try to address when using pruning? Please explain.


Pruning addresses overfitting. Overfitting happens when the decision tree includes branches
that fit the training data too specifically. As a result, the error on the training data set is
reduced at the cost of increased error on the test data set. To avoid overfitting we apply a
pruning mechanism. Pruning helps us to:

• Remove the branches of the decision tree that are of low statistical significance.
• Avoid overfitting while building a decision tree for a particular data set.

Things to be considered while pruning:

• Never remove branches (attributes) that are predictive in nature.
• The aim of pruning is to discard parts of a classification model that describe random
variation in the training sample rather than true features of the underlying domain.
Two pruning strategies

• Pre-pruning: the process is done during the construction of the tree. Stopping
criteria decide when to stop expanding a node (allowing a certain level of
"impurity" in each node).

• Post-pruning: the process is done after the construction of the tree. Branches are
removed from the bottom up to a certain limit. It uses criteria similar to those of
pre-pruning.
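As an illustration of a pre-pruning stopping criterion, the sketch below checks minimum record counts before a node is split. The function name is hypothetical; the default thresholds of 500 and 100 are borrowed from Problem 1(e), and a real tree builder would normally combine such a check with an impurity-based test.

```python
def can_split(n_parent, child_sizes, min_parent=500, min_child=100):
    """Pre-pruning check: allow a split only if the parent node and every
    resulting child node hold enough records."""
    if n_parent < min_parent:
        return False
    return all(n >= min_child for n in child_sizes)


print(can_split(600, [400, 200]))  # True
print(can_split(400, [200, 200]))  # False: parent has fewer than 500 records
print(can_split(600, [550, 50]))   # False: one child has fewer than 100 records
```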