
Fall 2015

Homework #2

Submitted by:

Group 11

Ankit Bhardwaj (abhard5@uic.edu)

Arpit Gulati (agulat4@uic.edu)

Nitish Puri (npuri5@uic.edu)

Problem 1

a) Input the data set. Set the role of INCOME to target. Use a partition node to divide the

data into 60% train, 40% test.

Ans: The Salary-class.csv file was read in through a Var. File source node, which was then connected to a Partition node to divide the data into 60% train and 40% test.
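Outside of Modeler, the same partition step can be sketched in Python with scikit-learn; this is only an illustrative equivalent (not part of the Modeler stream), assuming the file has an INCOME column as described above:

    import pandas as pd
    from sklearn.model_selection import train_test_split

    data = pd.read_csv("Salary-class.csv")
    X = data.drop(columns=["INCOME"])   # predictor fields
    y = data["INCOME"]                  # target field

    # 60% train / 40% test, mirroring the Partition node settings
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.6, random_state=42)
    print(len(X_train), len(X_test))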

b) Create the default C&R decision tree. How many leaves are in the tree?

Ans: There are a total of 7 leaves, which can be seen by viewing the decision tree with the Viewer option in the C&R model.

c) What are the major predictors of INCOME?

Ans:

The major predictors of INCOME are MSTATUS, C-GAIN, DEGREE, and JOBTYPE.

d) Give three rules that describe who is likely to have an INCOME > 50K and who is likely to have an income <= 50K. These rules should be relevant (support at least 5% in the training sample) and strong (either confidence more than 75% for “> 50K” or 90% for “<= 50K”). If there are no three rules that meet these criteria, give the three best rules you can.

Rule 1 (confidence = 6835/7170 = 0.95328, or 95.328%):
If MSTATUS in [" Divorced", " Married-spouse-absent", " Never-married", " Separated", " Widowed"] and C-GAIN <= 7139.500, then INCOME <= 50K.

Rule 2 (confidence = 942/1307 = 0.72073, or 72.073%):
If MSTATUS in [" Married-AF-spouse", " Married-civ-spouse"] and DEGREE in [" Bachelors", " Doctorate", " Masters", " Prof-school"] and C-GAIN <= 5095.500 and JOBTYPE in [" Armed-Forces", " Exec-managerial", " Handlers-cleaners", " Prof-specialty", " Protective-serv", " Sales", " Tech-support"], then INCOME > 50K.

Rule 3 (confidence = 2932/4141 = 0.70804, or 70.804%):
If MSTATUS in [" Married-AF-spouse", " Married-civ-spouse"] and DEGREE in [" 10th", " 11th", " 12th", " 1st-4th", " 5th-6th", " 7th-8th", " 9th", " Assoc-acdm", " Assoc-voc", " HS-grad", " Preschool", " Some-college"] and C-GAIN <= 5095.500, then INCOME <= 50K.

e) Create two more C&R trees. The first is just like the default tree except you do not “prune tree to avoid overfitting” (on the Basic tab). The other does prune, but you require 500 records in a parent branch and 100 records in a child branch. How do the three trees differ (briefly)? Which seems most accurate on the training data? Which seems most accurate on the test data?

Ans:

The tree that does not prune has a maximum tree depth of 7 and 17 leaves.

The tree that does prune, but requires 500 records in a parent branch and 100 records in a child branch, has a tree depth of 4 and 7 leaves.

As per the analysis, the default tree and the prune-500/100 tree are very similar: they have the same depth and the same number of leaves. The tree that does not prune, however, has a depth of 7 and 17 leaves, and its predictor importance values differ from those of the other two trees.

When we connect all three trees to the Analysis node, the results are given below.

The results show that all three trees predict 84.96% of the training data and 84.13% of the test data correctly, and the three models are in 100% agreement with each other. However, the default model and the pruning model have a tree depth of 4, so they are more efficient and better at avoiding overfitting.
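For reference, a rough scikit-learn analogue of the three trees (sklearn's CART plays the role of C&R); the parameter mapping is approximate, with cost-complexity pruning (ccp_alpha) standing in for Modeler's “prune tree to avoid overfitting” switch:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    data = pd.read_csv("Salary-class.csv")
    X = pd.get_dummies(data.drop(columns=["INCOME"]))  # sklearn trees need numeric inputs
    y = data["INCOME"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.6, random_state=42)

    trees = {
        "default (pruned)": DecisionTreeClassifier(ccp_alpha=1e-3, random_state=0),
        "no pruning":       DecisionTreeClassifier(random_state=0),
        "pruned, 500/100":  DecisionTreeClassifier(ccp_alpha=1e-3,
                                                   min_samples_split=500,
                                                   min_samples_leaf=100,
                                                   random_state=0),
    }
    for name, t in trees.items():
        t.fit(X_train, y_train)
        print(f"{name}: train={t.score(X_train, y_train):.4f} "
              f"test={t.score(X_test, y_test):.4f} leaves={t.get_n_leaves()}")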

Problem 2

a) Input the zoo1.csv to train the decision tree classifier (C5.0) and come up with a decision tree to classify a new record into one of the categories (pick the “favor accuracy” option in the C5.0 node). Make sure you examine the data first and think about what field(s) to use for the classification scheme.

Ans: After examining the data, it is evident that the attribute Animal has a distinct value for every record and can therefore be excluded when building the tree.
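A sketch of the same setup with scikit-learn (CART rather than C5.0, which has no scikit-learn implementation), assuming the identifier column is named Animal and the target column Type:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    zoo = pd.read_csv("zoo1.csv")
    X = zoo.drop(columns=["Animal", "Type"])  # Animal is unique per record: no predictive value
    y = zoo["Type"]

    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    print(export_text(tree, feature_names=list(X.columns)))  # fully unfolded tree
    print("training accuracy:", tree.score(X, y))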

b) Rename the generated node as “fulltree” and fully unfold it while browsing it. Use this to draw the full tree - how many leaves does it have? What is the classification accuracy on the training dataset? You can check this through an analysis node or through a table.

Ans: The fully unfolded tree classifies the training data with 100% accuracy (verified through an analysis node; see also part (d)).

c) Next, reset the option in C5.0 to choose “ruleset” as opposed to “decision tree” and generate a new node - rename this “fullrules.” Once again fully unfold the ruleset and write out the rules for each type.

Ans: Selected “ruleset” as opposed to “decision tree”. The generated rules for each type are listed below.

Rules for amphibian - contains 1 rule(s)

Rule 1 for amphibian

if feathers = FALSE

and milk = FALSE

and aquatic = TRUE

and breathes = TRUE

then amphibian

Rules for bird - contains 1 rule(s)

Rule 1 for bird

if feathers = TRUE

then bird

Rules for fish - contains 1 rule(s)

Rule 1 for fish

if backbone = TRUE

and breathes = FALSE

then fish

Rules for insect - contains 1 rule(s)

Rule 1 for insect

if feathers = FALSE

and milk = FALSE

and airborne = TRUE

then insect

Rules for invertebrate - contains 1 rule(s)

Rule 1 for invertebrate

if airborne = FALSE

and backbone = FALSE

then invertebrate

Rules for mammal - contains 1 rule(s)

Rule 1 for mammal

if milk = TRUE

then mammal

Rules for reptile - contains 1 rule(s)

Rule 1 for reptile

if feathers = FALSE

and milk = FALSE

and aquatic = FALSE

and backbone = TRUE

then reptile

Default: mammal

d) Compare your results from parts (b) and (c) and comment on them.

Ans:

Both models give 100% accuracy on the training dataset. However, one difference that can be seen from the analysis is that the two models assign different predictor importance values to the attributes.

Predictor importance for fulltree: it gives importance mainly to 3 attributes: milk, backbone, and feathers.

Predictor importance for fullrules: it gives importance to the attributes milk, feathers, backbone, airborne, breathes, and aquatic.

e) Next, use the “fulltree” node and an analysis node to classify the records in the testing dataset, zoo2.csv (to do this just disconnect the zoo1.csv data source node and instead connect a new data source node at the beginning of the data stream with zoo2.csv as the variable). Compare the classification accuracy here with what you saw in part (b) and comment. What are the misclassified animals?

Ans:

Using the analysis node, we can see that the tree predicts 90% of the test records correctly and 10% (3 records) incorrectly, whereas in part (b) the accuracy on the training data was 100%.

The three misclassified animals are:

1) Flea: classified as invertebrate; should be insect.

2) Seasnake: classified as fish; should be reptile.

3) Termite: classified as invertebrate; should be insect.
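The same check can be scripted. A sketch (column names assumed as before) that trains on zoo1.csv and lists the misclassified animals from zoo2.csv:

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    zoo1, zoo2 = pd.read_csv("zoo1.csv"), pd.read_csv("zoo2.csv")
    tree = DecisionTreeClassifier(random_state=0).fit(
        zoo1.drop(columns=["Animal", "Type"]), zoo1["Type"])

    pred = tree.predict(zoo2.drop(columns=["Animal", "Type"]))
    mask = pred != zoo2["Type"].values          # True where the prediction is wrong
    print("test accuracy:", 1 - mask.mean())
    print(zoo2.loc[mask, ["Animal", "Type"]].assign(predicted=pred[mask]))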

f) Suppose you wished to use a single level tree (i.e., 1R - just one attribute to classify) and you use the full data set (zoo.csv) to determine this. Which of the three attributes “milk”, “feathers” and “aquatic” yields the best results? Why do you think the results are so skewed in each case?

Ans:

Case 1, taking “Milk” as the only attribute:

Accuracy is 60.4% when taking only “Milk” as the attribute for predicting “Type”, i.e., 61 out of 101 records are predicted correctly.

Case 2, taking “Feathers” as the only attribute:

Accuracy is 60.4% when taking only “Feathers” as the attribute for predicting “Type”, i.e., 61 out of 101 records are predicted correctly.

Case 3, taking “Aquatic” as the only attribute:

Accuracy is 47.52% when taking only “Aquatic” as the attribute for predicting “Type”, i.e., 48 out of 101 records are predicted correctly.

Clearly, the attributes “Milk” and “Feathers” yield the best results (60.4%) in comparison to “Aquatic”, which gives 47.52%.

The results are skewed in each case because we are using only one attribute at a time, which does not give enough information to classify all the animals into the correct types. Also, in all three cases, mammal is the type each model predicts for most of the observations: when milk is true the type is mammal, and when feathers are false or aquatic is false the majority type is also mammal. So the distribution favors mammal to a great extent.
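The 1R idea is straightforward to reproduce: for a single attribute, predict the majority Type within each value of that attribute and measure accuracy. A sketch on the full zoo.csv, with column names assumed as before:

    import pandas as pd

    zoo = pd.read_csv("zoo.csv")

    def one_r_accuracy(attribute):
        # majority class of Type for each value of the attribute
        majority = zoo.groupby(attribute)["Type"].agg(lambda s: s.mode()[0])
        predictions = zoo[attribute].map(majority)
        return (predictions == zoo["Type"]).mean()

    for attr in ["milk", "feathers", "aquatic"]:
        print(attr, round(one_r_accuracy(attr), 4))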

Problem 3

(a) What is the gini impurity value for this data set?

Of all the 14 records, we have 7 Yes and 7 No.

p(yes) = 7/14 = 0.5

The Gini measure of a node is one minus the sum of the squares of the proportions of the

classes.

Gini impurity value for the data set: 1 − (0.5² + 0.5²) = 0.5
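This definition translates directly into a small helper, which confirms the value above:

    def gini(counts):
        # one minus the sum of squared class proportions
        total = sum(counts)
        return 1 - sum((c / total) ** 2 for c in counts)

    print(gini([7, 7]))  # 0.5 for the 7 Yes / 7 No root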

(b) Suppose a decision tree algorithm did the initial split on income. For each of the

children describe the number of records, number of yeses, number of nos, and gini

impurity value.

Ans:

Splitting on Income gives three child nodes:

High income: 3 records, 0 Yes, 3 No (pure).

Medium income: 6 records, 4 buy a computer (Yes) and 2 don't buy (No).

Low income: 5 records, 3 buy a computer (Yes) and 2 don't buy (No).

Gini Index of leaf nodes

Gini(High income node) = 0 (pure)

Gini(Medium leaf node) = 1 − ((4/6)² + (2/6)²) = 4/9

Gini(Low leaf node) = 1 − ((3/5)² + (2/5)²) = 12/25

Gini(Income) = weighted average of the child Gini values

= (3/14) × 0 + (6/14) × (4/9) + (5/14) × (12/25)

= 4/21 + 6/35 = 0.3619

(c) What is the gain in gini impurity of income obtained by splitting on income as in part

(b)?

Ans: Gain = Gini index of the entire data set − Gini index using ‘Income’ as the split variable

= 0.5 (calculated in part (a)) − 0.3619 (calculated in part (b))

= 0.1381.
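The arithmetic in parts (b) and (c) can be verified with the same helper:

    def gini(counts):  # as in part (a)
        total = sum(counts)
        return 1 - sum((c / total) ** 2 for c in counts)

    # (group size, [Yes, No]) for each income value
    children = [(3, [0, 3]), (6, [4, 2]), (5, [3, 2])]
    gini_income = sum(n / 14 * gini(c) for n, c in children)
    print(round(gini_income, 4))                 # 0.3619
    print(round(gini([7, 7]) - gini_income, 4))  # gain = 0.1381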

(d) Continue building the tree. If you do not split on a node, explain why. Draw the final tree

that you obtained.

Ans:

We will grow the decision tree further, keeping ‘Income’ as the initial split variable, and try different combinations of ‘Student’ and ‘Credit Rating’ to see what we get for the target variable.

Approach to splitting a node:

1. Follow a trial-and-error approach when choosing a splitting variable.

2. The moment we get a pure subset, we do not split further.

[Tree after the initial split on Income: High → 0 Yes, 3 No (PURE); Medium → 4 Yes, 2 No; Low → 3 Yes, 2 No]

Step 1: Choosing ‘Credit Rating’ as the next split variable, we check the values of the target variable for the combinations <Medium income, Fair credit rating> and <Medium income, Excellent credit rating>.

[Tree for Step 1: under Medium income (4 Yes, 2 No), the Credit Rating split leaves an impure child with 1 Yes, 2 No; High stays at 0 Yes, 3 No (PURE) and Low at 3 Yes, 2 No]

Conclusion: We rule out the option of choosing ‘Credit Rating’ as the next splitting variable for MEDIUM income, since we do not get pure subsets.

Step 2: Choosing ‘Student’ as the next split variable for MEDIUM income, we check the values of the target variable for the combinations <Medium income, student = Yes> and <Medium income, student = No>.

[Tree for Step 2: under Medium income, Student = Yes gives 4 Yes, 0 No (PURE) and Student = No gives 0 Yes, 2 No (PURE)]

Step 3: Looking at ‘Credit Rating’ as the next splitting variable for the LOW income group, we check the values of the target variable for the combinations <Low income, Fair credit rating> and <Low income, Excellent credit rating>.

Final tree:

Income
  High:   0 Yes, 3 No (PURE) → predict No
  Medium: 4 Yes, 2 No → split on Student
    Student = Yes: 4 Yes, 0 No (PURE) → predict Yes
    Student = No:  0 Yes, 2 No (PURE) → predict No
  Low:    3 Yes, 2 No → split on Credit Rating
    Fair:      0 Yes, 2 No (PURE) → predict No
    Excellent: 3 Yes, 0 No (PURE) → predict Yes

The final tree gives the following rules:

1) If income is high, customer will not buy a computer.

2) If income is medium, and customer is a student, then customer will buy a computer.

3) If income is medium, and customer is not a student, then customer will not buy a computer.

4) If income is low, and credit rating is fair, then customer will not buy a computer.

5) If income is low, and credit rating is excellent, then customer will buy a computer.

Problem 4:

a) Describe the purpose of separating the data into training and testing data.

The purpose of separating the data into training and testing data is to determine the accuracy of predictions. We measure the performance of a model in terms of its error rate: the percentage of incorrectly classified instances in the data set. To measure performance, we divide the data set into:

Training data: the training set (seen data), used to build the model (determine its parameters).

Test data: the test set (unseen data), used to measure the model's performance (holding the parameters constant). Test data stands in for the data we will get in the future: we don't know its Y/dependent-variable value, and we predict it using our model.


Error: an instance of the data set that is predicted incorrectly.

Error rate: the proportion of errors made over the whole set of instances.

Overfitting is addressed using pruning. Overfitting happens when we include branches in the decision tree that fit the training data too specifically. As a result, the error on the training data set is reduced at the cost of increased error on the test data set. To avoid overfitting we apply a pruning mechanism. Pruning helps us to:

• Remove the branches of the decision tree that have low statistical significance.

• Avoid overfitting when building a decision tree for a particular data set.

• Never remove branches (attributes) that are predictive in nature.

The aim of pruning is to discard parts of a classification model that describe random variation in

the training sample rather than true features of the underlying domain.

Two pruning strategies:

• Pre-pruning: done during the construction of the tree. Criteria are used to stop expanding nodes early (allowing a certain level of "impurity" in each node).

• Post-pruning: done after the construction of the tree. Branches are removed from the bottom up, to a certain limit, using criteria similar to pre-pruning.
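In scikit-learn terms, one possible mapping of the two strategies looks like this (load_iris is just a stand-in data set for the sketch):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Pre-pruning: stopping criteria applied while the tree is grown
    pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10).fit(X, y)

    # Post-pruning: grow fully, then cut weak branches bottom-up
    post = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

    print("pre-pruned leaves:", pre.get_n_leaves())
    print("post-pruned leaves:", post.get_n_leaves())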
