
MACHINE LEARNING

TUTORIAL-1

PROBLEM 1:
Objective: To construct a decision tree for the abalone dataset, draw the decision tree used to
classify a new record, and show the accuracy of the tree.
Total classes in dataset:
There are 29 classes in the given dataset, based on Rings.
Training Data Set: 75 % of the data set (contains 34781 instances)
Test Data: 25 % of the data set
Methodology:
i. Tools used: Weka is a collection of machine learning algorithms for data mining tasks.
Weka contains tools for data pre-processing, classification, regression, clustering,
association rules, and visualization. Here Weka version 3.6.13 is used.
ii. Features (if any)/Preprocess:
The classification algorithm works on nominal data, so the continuous attributes
Length, Diameter, Height, Whole weight, Shucked weight, Viscera weight, and Shell
weight are reduced to a precision of 2 decimal places to improve efficiency.
RINGS Attribute Preprocessing: The Rings attribute is divided into 3 classes:
0-9 is classified as Young
10-14 is classified as Adult
>14 is classified as Old

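The preprocessing described above can be sketched in a few lines of Python. This is only an illustration of the rounding and binning rules, not the Weka filter actually used; the function names and the dict-based record layout are assumptions made for the example.

```python
def ring_class(rings):
    """Map a ring count to the three age classes used in this report."""
    if rings <= 9:
        return "Young"
    if rings <= 14:
        return "Adult"
    return "Old"


def preprocess(record):
    """Round continuous attributes to 2 decimals and bin Rings.

    `record` is a hypothetical dict with one key per abalone attribute.
    """
    out = {}
    for name, value in record.items():
        if name == "Rings":
            out["Age"] = ring_class(value)
        elif isinstance(value, float):
            out[name] = round(value, 2)
        else:
            out[name] = value  # leave any non-numeric attribute untouched
    return out


# Illustrative record (values chosen for the example only).
sample = {"Length": 0.455, "Diameter": 0.365, "Height": 0.095,
          "Whole weight": 0.514, "Shucked weight": 0.2245,
          "Viscera weight": 0.101, "Shell weight": 0.15, "Rings": 15}
print(preprocess(sample))
```

Binning Rings first reduces the 29 original class values to 3, which is what makes the accuracy figures below comparable across test options.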
Results:
Decision Tree for the training data set:
Number of Leaves : 28
Size of the tree : 51

Accuracy:

Test Option                                              Classifier Accuracy %
Cross Validation 10 Fold                                 75.6705 %
Supplied Test Set (25 % of the test data generated
from the training data for validation)                   80.9706 %
Percentage Split 66 %                                    74.0845 %

PROBLEM 2: Perform clustering using K-means on the following data set.

Objective: To cluster the plant dataset using the K-means clustering algorithm.
Total classes in dataset:
The data is in transactional form. It contains the Latin names (species or genus) and state
abbreviations.
The total number of classes in the data set depends on the value of K provided by the user.
Number of classes = K.
Training data set: 100 % (as this is a clustering task, we do not need to create a test data set)
Methodology:
Tools used: Weka is a collection of machine learning algorithms for data mining tasks. Weka
contains tools for data pre-processing, classification, regression, clustering, association rules, and
visualization. Here Weka version 3.6.13 is used.
Building Clusters:
The Simple K-means algorithm is used to cluster the whole data set, partitioning it into K
clusters.
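For readers unfamiliar with the algorithm, the following is a minimal sketch of plain K-means (Lloyd's algorithm) in Python. It is not Weka's implementation, which has its own initialization and attribute handling; the toy rows (one per plant, one 0/1 column per state) are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def k_means(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean(members)
    return centroids, assignments


# Hypothetical toy data: 1 means the plant occurs in that state.
toy = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
]
centroids, labels = k_means(toy, k=2)
sizes = [labels.count(c) for c in range(2)]
print(sizes)
```

The cluster sizes printed at the end correspond to the "Clustered Instances" counts that Weka reports in the results below.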

Results of Clustering:

Value of K    Cluster (share of instances)
2             96%, 4%
3             39%, 4%, 57%
4             37%, 4%, 57%, 2%
5             35%, 4%, 55%, 2%, 3%

=== Run information ===

K=2
Clustered Instances
33395 ( 96%)
1386 ( 4%)

K=3
Clustered Instances
13410 ( 39%)
1549 ( 4%)
19822 ( 57%)

K=5
Clustered Instances
12346 ( 35%)
1549 ( 4%)
19252 ( 55%)
606 ( 2%)
1028 ( 3%)

PROBLEM 3:
1. Objective: To create a classifier for the balance-scale dataset using the K Nearest Neighbor
algorithm.
2. Total classes in dataset:
Number of Instances: 625 (49 balanced, 288 left, 288 right)
Attribute Information:
1. Class Name: 3 (L, B, R)
2. Left-Weight: 5 (1, 2, 3, 4, 5)
3. Left-Distance: 5 (1, 2, 3, 4, 5)
4. Right-Weight: 5 (1, 2, 3, 4, 5)
5. Right-Distance: 5 (1, 2, 3, 4, 5)
So there are 3 classes in total (49 balanced, 288 left, 288 right) in the given dataset.
2.1 Training Data Set: 70 % of total instances (438)
2.2 Test Data Set: the remaining 30 % of the instances (187), removed from the data set


Methodology:
Tools used: Weka is a collection of machine learning algorithms for data mining tasks. Weka
contains tools for data pre-processing, classification, regression, clustering, association rules, and
visualization. Here Weka version 3.6.13 is used.
Building and Testing Classifier
First, the model is trained on the complete data set of 625 instances and tested with 10-fold
cross-validation (K=1). 10-fold cross-validation divides the data set into 10 equal parts, trains
the model on 9 parts, and tests it on the remaining part, repeating this so that each part serves
once as the test set.
Then the model is trained on the training data only (70 % of the data) and tested on the
remaining 30 % as the test set. The accuracy for various values of K is given in the table under
the heading Accuracy Chart.
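Both evaluation procedures can be sketched in plain Python as follows. This is not Weka's implementation, and its accuracy figures will not match the ones reported here, since the shuffling, tie-breaking, and distance handling differ. The sketch also assumes the usual rule for this dataset, which this report does not state: the class is decided by comparing left weight × left distance with right weight × right distance.

```python
import random
from collections import Counter
from itertools import product


def make_balance_scale():
    """Generate all 5^4 = 625 attribute combinations with their class.

    Assumption (not stated in this report): the scale tips towards the
    side with the larger weight * distance product.
    """
    data = []
    for lw, ld, rw, rd in product(range(1, 6), repeat=4):
        left, right = lw * ld, rw * rd
        label = "L" if left > right else "R" if right > left else "B"
        data.append(((lw, ld, rw, rd), label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x."""
    nearest = sorted(
        train, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x))
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return 100.0 * hits / len(test)


def cross_validate(data, k, folds=10, seed=0):
    """Each part is the test set once; the other parts train the model."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    # 625 is not divisible by 10, so the parts are only nearly equal.
    parts = [shuffled[i::folds] for i in range(folds)]
    hits = 0
    for i, test in enumerate(parts):
        train = [row for j, part in enumerate(parts) if j != i for row in part]
        hits += sum(knn_predict(train, x, k) == y for x, y in test)
    return 100.0 * hits / len(data)


data = make_balance_scale()
print(Counter(label for _, label in data))
print("10-fold CV, K=1:", cross_validate(data, k=1))

# 70/30 split: 438 training instances and 187 test instances.
shuffled = data[:]
random.Random(1).shuffle(shuffled)
train, test = shuffled[:438], shuffled[438:]
print("Supplied test set, K=5:", accuracy(train, test, k=5))
```

Other values of K can be substituted in the last line to compare against the accuracy table.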
Results:
Run information for the case Supplied Test Set and K=5:
=== Run information ===

=== Summary ===

Correctly Classified Instances         150               80.2139 %
Incorrectly Classified Instances        37               19.7861 %
Kappa statistic                          0.593
Mean absolute error                      0.1563
Root mean squared error                  0.2936
Relative absolute error                 37.4503 %
Root relative squared error             60.5805 %
Total Number of Instances              187

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0         0         0           0        0           0.609      B
                 1         0.248     0.507       1        0.673       0.97       R
                 0.83      0         1           0.83     0.907       0.982      L
Weighted Avg.    0.802     0.05      0.825       0.802    0.791       0.951

=== Confusion Matrix ===

   a   b   c   <-- classified as
   0  14   0 |   a = B
   0  38   0 |   b = R
   0  23 112 |   c = L
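The summary figures follow directly from the confusion matrix. As a cross-check, the short Python sketch below recomputes accuracy, the kappa statistic, and the per-class TP rate, FP rate, and precision from the matrix above; the variable and function names are illustrative, and a precision with no predicted instances is reported as 0, as in the Weka output.

```python
labels = ["B", "R", "L"]
# Rows are actual classes, columns are predicted classes.
matrix = [
    [0, 14, 0],
    [0, 38, 0],
    [0, 23, 112],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = 100.0 * correct / total

row_sums = [sum(row) for row in matrix]
col_sums = [sum(col) for col in zip(*matrix)]

# Cohen's kappa: observed agreement corrected for chance agreement.
expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
kappa = (correct / total - expected) / (1 - expected)


def per_class(i):
    """Return (TP rate, FP rate, precision) for class index i."""
    tp = matrix[i][i]
    fp = col_sums[i] - tp
    fn = row_sums[i] - tp
    tn = total - tp - fp - fn
    tp_rate = tp / (tp + fn) if tp + fn else 0.0
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return tp_rate, fp_rate, precision


print(f"accuracy = {accuracy:.4f} %, kappa = {kappa:.3f}")
for i, name in enumerate(labels):
    tp_rate, fp_rate, precision = per_class(i)
    print(f"{name}: TP rate {tp_rate:.3f}, FP rate {fp_rate:.3f}, "
          f"precision {precision:.3f}")
```

The matrix also shows why class B scores 0 throughout: all 14 balanced instances in the test set were classified as R.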

Accuracy Chart:
The accuracy for various values of K is given in the table below.

Test Option                                   Value of K    Classifier Accuracy %
Cross Validation 10 Fold                      1             84.8 %
Supplied Test Set (30 % of the split data)                  79.1444 %
Supplied Test Set (30 % of the split data)                  81.2834 %
Supplied Test Set (30 % of the split data)    5             80.2139 %
