
Experiment No. 7 – To perform data pre-processing tasks and demonstrate classification, clustering, and association algorithms on data sets using the data mining tool WEKA.

1. PREPROCESS:
The data collected from the field often contains unwanted elements that lead to wrong analysis. For example, the data may contain null fields, columns that are irrelevant to the current analysis, and so on. Thus, the data must be preprocessed to meet the requirements of the type of analysis you are seeking. This is done in the Preprocess module.
At the very top of the window, just below the title bar, there is a row of tabs. Only the first tab, 'Preprocess', is active while no dataset is open. The first four buttons at the top of the Preprocess section enable you to load data into WEKA. Data can be imported from a file in various formats: ARFF, CSV, C4.5, or binary; it can also be read from a URL or from an SQL database (using JDBC). The easiest and most common way of getting data into WEKA is to store it as an Attribute-Relation File Format (ARFF) file.
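For reference, a minimal ARFF sketch for the golfgame relation used in the runs below might look as follows. The attribute names, types, and nominal values are taken from the Naive Bayes run output; the two data rows are the ones that appear as initial cluster centres in the clustering run, the rest of the dataset is not shown here.

    @relation golfgame

    @attribute condition {Rainy, Sunny, Cloudy}
    @attribute temperature numeric
    @attribute class {Yes, No}

    @data
    Sunny,76,Yes
    Sunny,73,Yes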
2. CLASSIFICATION (NAÏVE BAYES ALGORITHM):

In the Classify tab, you can build a model by using Choose to select a classifier. Naive Bayes is a classification algorithm. Traditionally it assumes that the input values are nominal, although numerical inputs are supported by assuming a distribution (typically Gaussian) for them.

Naive Bayes uses a simple application of Bayes' Theorem in which the prior probability of each class is calculated from the training data and the input features are assumed to be independent of one another given the class (technically, conditionally independent); this assumption is what makes it 'naive'.

This is an unrealistic assumption, because we expect the variables to interact and be dependent; however, it makes the probabilities fast and easy to calculate. Even under this unrealistic assumption, Naive Bayes has been shown to be a very effective classification algorithm.

Naive Bayes calculates the posterior probability of each class and predicts the class with the highest probability. As such, it supports both binary and multi-class classification problems.
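In symbols (a standard statement of the decision rule, not something printed by WEKA): for feature values x1, ..., xn the predicted class is

    c* = argmax over classes c of  P(c) * P(x1 | c) * P(x2 | c) * ... * P(xn | c)

where P(c) is the class prior and each conditional probability P(xi | c) is estimated separately from the training data.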

=== Run information ===

Scheme:       weka.classifiers.bayes.NaiveBayes
Relation:     golfgame
Instances:    14
Attributes:   3
              condition
              temperature
              class
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===
Naive Bayes Classifier

                   Class
Attribute            Yes      No
                  (0.63)  (0.38)
================================
condition
  Rainy              4.0     4.0
  Sunny              5.0     2.0
  Cloudy             3.0     2.0
  [total]           12.0     8.0

temperature
  mean           60.9744 61.8154
  std. dev.      12.0783  9.0532
  weight sum           9       5
  precision       3.1538  3.1538

Time taken to build model: 0 seconds

=== Stratified cross-validation ===


=== Summary ===
Correctly Classified Instances 5 35.7143 %
Incorrectly Classified Instances 9 64.2857 %
Kappa statistic -0.4651
Mean absolute error 0.5546
Root mean squared error 0.5798
Relative absolute error 116.4716 %
Root relative squared error 117.5297 %
Total Number of Instances 14
=== Detailed Accuracy By Class ===

                 TP Rate  FP Rate  Precision  Recall  F-Measure  MCC     ROC Area  PRC Area  Class
                 0.556    1.000    0.500      0.556   0.526      -0.471  0.244     0.542     Yes
                 0.000    0.444    0.000      0.000   0.000      -0.471  0.244     0.290     No
Weighted Avg.    0.357    0.802    0.321      0.357   0.338      -0.471  0.244     0.452

=== Confusion Matrix ===

a b <-- classified as
5 4 | a = Yes
5 0 | b = No
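
The run above comes from the Explorer GUI, but the same evaluation can be reproduced through WEKA's Java API. A minimal sketch, assuming the data is stored in a file named golfgame.arff (an assumed name) with the class as the last attribute:

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class NaiveBayesDemo {
        public static void main(String[] args) throws Exception {
            // Load the ARFF file; "golfgame.arff" is an assumed file name.
            Instances data = DataSource.read("golfgame.arff");
            data.setClassIndex(data.numAttributes() - 1); // class is last

            // 10-fold cross-validation, matching the test mode above.
            NaiveBayes nb = new NaiveBayes();
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(nb, data, 10, new Random(1));

            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString());
        }
    }

The exact fold assignments depend on the random seed, so the numbers will not necessarily match the Explorer output digit for digit.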
3. SIMPLE K-MEANS CLUSTERING:

A clustering algorithm finds groups of similar instances in the dataset. WEKA supports several clustering algorithms, such as EM, FilteredClusterer, HierarchicalClusterer, SimpleKMeans, and so on. Understanding how these algorithms work helps you exploit WEKA's capabilities fully. As in the case of classification, WEKA allows you to visualize the detected clusters graphically. The run below demonstrates clustering on the same golfgame dataset used above. (WEKA also ships with the classic iris dataset, containing three classes of 50 instances each, one class per type of iris plant, which is another common choice for clustering demonstrations.)

=== Run information ===

Scheme:       weka.clusterers.SimpleKMeans -init 0 -max-candidates 100 -periodic-pruning 10000 -min-density 2.0 -t1 -1.25 -t2 -1.0 -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -num-slots 1 -S 10
Relation:     golfgame
Instances:    14
Attributes:   3
              condition
              temperature
              class
Test mode:    evaluate on training data

=== Clustering model (full training set) ===

kMeans
======

Number of iterations: 3
Within cluster sum of squared errors: 8.755332710121525

Initial starting points (random):

Cluster 0: Sunny,76,Yes
Cluster 1: Sunny,73,Yes

Missing values globally replaced with mean/mode

Final cluster centroids:
                          Cluster#
Attribute    Full Data           0           1
                (14.0)       (7.0)       (7.0)
==============================================
condition        Rainy       Sunny       Rainy
temperature    61.2857     66.5714          56
class              Yes         Yes         Yes

Time taken to build model (full training data) : 0 seconds

=== Model and evaluation on training set ===

Clustered Instances

0       7 ( 50%)
1       7 ( 50%)
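
This clustering run, too, can be reproduced from the Java API. A minimal sketch, again assuming a golfgame.arff file; note that no class index is set, because the Explorer run above clustered on all three attributes, including class:

    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class KMeansDemo {
        public static void main(String[] args) throws Exception {
            // "golfgame.arff" is an assumed file name; no class index is
            // set, so all three attributes take part in the clustering.
            Instances data = DataSource.read("golfgame.arff");

            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(2); // -N 2, as in the run above
            km.setSeed(10);       // -S 10, as in the run above
            km.buildClusterer(data);

            // Prints the centroids and within-cluster sum of squared errors.
            System.out.println(km);
        }
    }
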
4. APRIORI ASSOCIATION RULE:
The Apriori algorithm finds frequent itemsets in a dataset and derives probable associations from them in the form of association rules. WEKA provides an implementation of the Apriori algorithm. You can define the minimum support (the fraction of instances in which an itemset occurs) and an acceptable confidence level (how often a rule's consequent holds when its antecedent does) while computing these rules. The run below applies Apriori to the golfgame data; the supermarket dataset provided in the WEKA installation is another common choice for this experiment.

=== Run information ===

Scheme:       weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation:     golfgame
Instances:    14
Attributes:   4
              condition
              con
              cond
              class

=== Associator model (full training set) ===

Apriori
=======

Minimum support: 0.15 (2 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 17

Generated sets of large itemsets:

Size of set of large itemsets L(1): 10
Size of set of large itemsets L(2): 28
Size of set of large itemsets L(3): 16
Size of set of large itemsets L(4): 1

Best rules found:

 1. condition=Overcast 4 ==> class=Yes 4    <conf:(1)> lift:(1.56) lev:(0.1) [1] conv:(1.43)
 2. con=Cool 4 ==> cond=Normal 4    <conf:(1)> lift:(2) lev:(0.14) [2] conv:(2)
 3. condition=Rainy class=No 3 ==> cond=High 3    <conf:(1)> lift:(2) lev:(0.11) [1] conv:(1.5)
 4. condition=Rainy cond=High 3 ==> class=No 3    <conf:(1)> lift:(2.8) lev:(0.14) [1] conv:(1.93)
 5. con=Cool class=Yes 3 ==> cond=Normal 3    <conf:(1)> lift:(2) lev:(0.11) [1] conv:(1.5)
 6. condition=Rainy con=Hot 2 ==> cond=High 2    <conf:(1)> lift:(2) lev:(0.07) [1] conv:(1)
 7. con=Hot class=No 2 ==> condition=Rainy 2    <conf:(1)> lift:(2.8) lev:(0.09) [1] conv:(1.29)
 8. condition=Rainy con=Hot 2 ==> class=No 2    <conf:(1)> lift:(2.8) lev:(0.09) [1] conv:(1.29)
 9. condition=Rainy class=Yes 2 ==> cond=Normal 2    <conf:(1)> lift:(2) lev:(0.07) [1] conv:(1)
10. condition=Rainy cond=Normal 2 ==> class=Yes 2    <conf:(1)> lift:(1.56) lev:(0.05) [0] conv:(0.71)
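
As with the previous runs, this one can be reproduced programmatically. A minimal sketch, assuming a golfgame.arff file whose attributes are all nominal (Apriori cannot handle numeric attributes, so a numeric temperature would first have to be discretized or removed):

    import weka.associations.Apriori;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AprioriDemo {
        public static void main(String[] args) throws Exception {
            // "golfgame.arff" is an assumed file name; all attributes
            // must be nominal for Apriori to run.
            Instances data = DataSource.read("golfgame.arff");

            Apriori apriori = new Apriori();
            apriori.setNumRules(10);              // -N 10: report the 10 best rules
            apriori.setMinMetric(0.9);            // -C 0.9: minimum confidence
            apriori.setLowerBoundMinSupport(0.1); // -M 0.1: lower bound on support
            apriori.buildAssociations(data);

            // Prints the itemset counts and the best rules found.
            System.out.println(apriori);
        }
    }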
