
Exp No: 1 Date: _ _/_ _/_ _

Name of the Experiment: .. ..

WHAT IS WEKA?
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes.
Main features:
- A comprehensive set of data pre-processing tools, learning algorithms and evaluation methods
- Graphical user interfaces (including data visualization)
- An environment for comparing learning algorithms
THE GUI CHOOSER:
The GUI Chooser is used to start the different interfaces of the Weka environment. These interfaces can be considered separate programs, and they vary in form, function and purpose. Depending on the specific need, whether it be simple data exploration, detailed experimentation or tackling very large problems, one of these interfaces will be more appropriate than the others. This project will mainly focus on the Explorer, Experimenter and KnowledgeFlow interfaces of the Weka environment.

The CLI is a text-based interface to the Weka environment. It is the most memory-efficient interface available in Weka.
WEKA DATA MINER:
Weka is a comprehensive set of advanced data mining and analysis tools. The strength of Weka lies in the area of classification, where it covers many of the most current machine learning (ML) approaches. The version of Weka used in this project is version 3-4-4.

Prepared By Kareem, Asst. Prof.

JBREC

At its simplest, it provides a quick and easy way to explore and analyze data. Weka is also suitable for dealing with large data, where the resources of many computers and/or multi-processor computers can be used in parallel. We will be examining different aspects of the software with a focus on its decision tree classification features.
DATA HANDLING:
Weka currently supports three external file formats, namely CSV, binary and C4.5. Weka also allows data to be pulled directly from database servers as well as web servers. Its native data format is known as the ARFF format.
ATTRIBUTE RELATION FILE FORMAT (ARFF):
An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes. ARFF files were developed by the Machine Learning Project at the Department of Computer Science of The University of Waikato for use with the Weka machine learning software.
OVERVIEW:
ARFF files have two distinct sections. The first section is the Header information, which is followed by the Data information.
The ARFF Header Section:
The ARFF Header section of the file contains the relation declaration and attribute declarations.
The @relation Declaration:
The relation name is defined as the first line in the ARFF file. The format is:
@relation <relation-name>
where <relation-name> is a string. The string must be quoted if the name includes spaces.
The @attribute Declarations:
Attribute declarations take the form of an ordered sequence of @attribute statements. Each attribute in the data set has its own @attribute statement, which uniquely defines the name of that attribute and its data type. The order in which the attributes are declared indicates the column position in the data section of the file. For example, if an attribute is the third one declared, then Weka expects that all of that attribute's values will be found in the third comma-delimited column. The format for the @attribute statement is:
@attribute <attribute-name> <datatype>

where the <attribute-name> must start with an alphabetic character. If spaces are to be included in the name, then the entire name must be quoted. The <datatype> can be any of the four types currently (version 3.2.1) supported by Weka:
- numeric
- <nominal-specification>
- string
- date [<date-format>]
where <nominal-specification> and <date-format> are defined below. The keywords numeric, string and date are case insensitive.
Numeric attributes:
Numeric attributes can be real or integer numbers.
Nominal attributes:
Nominal values are defined by providing a <nominal-specification> listing the possible values:
{<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}
For example, the class value of the Iris dataset can be defined as follows:
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}

Values that contain spaces must be quoted.
String attributes:
String attributes allow us to create attributes containing arbitrary textual values. This is very useful in text-mining applications, as we can create datasets with string attributes, then write Weka filters to manipulate strings (like StringToWordVectorFilter). String attributes are declared as follows:
@ATTRIBUTE LCC string
Date attributes:
Date attribute declarations take the form:
@attribute <name> date [<date-format>]
where <name> is the name for the attribute and <date-format> is an optional string specifying how date values should be parsed and printed (this is the same format used by SimpleDateFormat). The default format string accepts the ISO-8601 combined date and time format: "yyyy-MM-dd'T'HH:mm:ss". Dates must be specified in the data section as the corresponding string representations of the date/time (see example below).

ARFF Data Section:
The ARFF Data section of the file contains the data declaration line and the actual instance lines.
The @data Declaration:
The @data declaration is a single line denoting the start of the data segment in the file. The format is:
@data
The instance data:
Each instance is represented on a single line, with carriage returns denoting the end of the instance. Attribute values for each instance are delimited by commas. They must appear in the order that they were declared in the header section (i.e. the data corresponding to the nth @attribute declaration is always the nth field of the instance). Missing values are represented by a single question mark, as in:
@data
4.4,?,1.5,?,Iris-setosa
Values of string and nominal attributes are case sensitive, and any that contain space must be quoted, as follows:
@relation LCCvsLCSH
@attribute LCC string
@attribute LCSH string
@data
AG5, 'Encyclopedias and dictionaries.;Twentieth century.'
AS262, 'Science -- Soviet Union -- History.'
AE5, 'Encyclopedias and dictionaries.'
AS281, 'Astronomy, Assyro-Babylonian.;Moon -- Phases.'
AS281, 'Astronomy, Assyro-Babylonian.; Moon -- Tables.'
Dates must be specified in the data section using the string representation specified in the attribute declaration. For example:
@RELATION Timestamps
@ATTRIBUTE timestamp DATE "yyyy-MM-dd HH:mm:ss"
@DATA
"2001-04-03 12:12:12"
"2001-05-03 12:59:55"
NOTE: All header commands start with @, all comment lines start with %, and blank lines are ignored.
COMMA SEPARATED VALUE (CSV):
Ex:
Sno, Sname, Branch, Year
1, abc, MCA, First
2, def, MCA, First
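Since Weka converts CSV input to ARFF internally, it is instructive to see what such a conversion involves. Below is a minimal, illustrative sketch (not Weka's converter; the inference rule "all-numeric column = numeric attribute, otherwise nominal" is a simplification of what Weka's CSVLoader does):

```python
import csv
import io

def csv_to_arff(csv_text, relation):
    """Convert simple CSV text (header row + data rows) to ARFF text.

    Every column whose values all parse as numbers becomes numeric;
    any other column becomes a nominal attribute listing the values seen.
    """
    rows = list(csv.reader(io.StringIO(csv_text), skipinitialspace=True))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for col, name in enumerate(header):
        values = [r[col] for r in data]
        try:
            [float(v) for v in values]
            lines.append("@attribute %s numeric" % name)
        except ValueError:
            # nominal: enumerate the distinct values in order of appearance
            seen = list(dict.fromkeys(values))
            lines.append("@attribute %s {%s}" % (name, ",".join(seen)))
    lines.append("")
    lines.append("@data")
    lines.extend(",".join(r) for r in data)
    return "\n".join(lines)

csv_text = "Sno,Sname,Branch,Year\n1,abc,MCA,First\n2,def,MCA,First\n"
print(csv_to_arff(csv_text, "students"))
```

Running this on the CSV example above yields a numeric Sno attribute and nominal Sname, Branch and Year attributes.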
Exp No: 2 Date: _ _/_ _/_ _

Name of the Experiment: Data Retrieval and Preparation
Getting the data:
There are three ways of loading data into the Explorer: loading it from a file, reading it over a database connection and, finally, fetching a file from a web server. We will be loading the data from a locally stored file. Weka supports four different file formats, namely CSV, C4.5, flat binary files and the native ARFF format. To demonstrate the functionality of the Explorer environment we will load a CSV file, and then in the following section we will preprocess the data to prepare it for analysis. To open a local data file, click on the Open file button, and in the window that follows select the desired data file.
Preprocess the Data:
First Method:

Initially (in the Preprocess tab) click "Open file" and navigate to the directory containing the data file (.csv or .arff). In this case we will open the above data file.


Since the data is not in ARFF format, a dialog box will prompt you to use the converter, as in the figure. Click on the "Use Converter" button, and click OK in the next dialog box that appears.

You can then click on the Choose button; the available converters are listed below it. Choose the converter you want and click on the OK button.


Once the data is loaded, WEKA recognizes the attributes and, during the scan of the data, computes some basic statistics on each attribute. The left panel in the figure above shows the list of recognized attributes, while the top panels indicate the names of the base relation (or table) and the current working relation (which are the same initially).

(Figure: attribute list, statistical measures of the selected attribute, and the visualization panel.)

Clicking on any attribute in the left panel will show the basic statistics on that attribute. For categorical attributes, the frequency of each attribute value is shown, while for continuous attributes we can obtain the min, max, mean, standard deviation, etc. Note that the visualization in the right bottom panel is a form of cross-tabulation across two attributes. For example, in the figure above, the default visualization panel cross-tabulates "married" with the "pep" attribute (by default the second attribute is the last column of the data file). You can select another attribute using the drop-down list.
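The statistics shown for a continuous attribute are straightforward to reproduce by hand. A small illustrative sketch (not Weka code), applied to the temperature column of the weather data used in the filtering experiments:

```python
import math

def attribute_stats(values):
    """Minimum, maximum, mean and sample standard deviation of a
    numeric attribute, as shown in the Explorer's attribute panel."""
    n = len(values)
    mean = sum(values) / n
    # sample variance (divide by n-1), matching the usual StdDev report
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return {"min": min(values), "max": max(values),
            "mean": mean, "stdev": math.sqrt(var)}

temperature = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
print(attribute_stats(temperature))
```

For the 14 weather instances this gives min 64, max 85, mean about 73.571 and standard deviation about 6.572.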

Second Method: In this method you can load data from a web server. In the Preprocess tab, click on the Open URL button; a pop-up window appears, as below.

Give the name of the web server followed by the file name, then click on the OK button.
Third Method: In this method you can load data from a database. In the Preprocess tab, click on the Open DB button; a window then appears, as shown.

Filtering Algorithms: Filters transform the input dataset in some way. When a filter is selected using the Choose button, its name appears in the line beside that button. Click that line to get a generic object editor in which to specify its properties. What appears in the line is the command-line version of the filter, and the parameters are specified with minus signs. This is a good way of learning how to use the Weka commands directly. There are two kinds of filters: unsupervised and supervised filters. Filters are often applied to a training dataset and then also applied to the test file. If the filter is supervised (for example, if it uses class values to derive good intervals for discretization), applying it to the test data will bias the results. It is the discretization intervals derived from the training data that must

be applied to the test data. When using supervised filters you must be careful to ensure that the results are evaluated fairly, an issue that does not arise with unsupervised filters. We treat Weka's unsupervised and supervised filtering methods separately. Within each type there is a further distinction between attribute filters, which work on the attributes in the datasets, and instance filters, which work on the instances.
Sample weather.arff Dataset:
@relation weather
@attribute outlook {sunny,overcast,rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE,FALSE}
@attribute play {yes,no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
Unsupervised Filters:
Unsupervised Attribute Filters:
Add: SCHEMA: weka.filters.unsupervised.attribute.Add -N unnamed -C 3


Discretize: SCHEMA: weka.filters.unsupervised.attribute.Discretize -B 11 -M -1.0 -R first-last
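The unsupervised Discretize filter's default behavior is equal-width binning (the -B option sets the number of bins). A minimal sketch of the idea for one numeric attribute, using the humidity values of the weather data; this is an illustration of equal-width binning, not Weka's implementation:

```python
def equal_width_bins(values, num_bins):
    """Assign each value to one of num_bins equal-width intervals,
    the way simple unsupervised discretization does."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    labels = []
    for v in values:
        # the maximum value falls in the last bin, not a new one
        b = min(int((v - lo) / width), num_bins - 1)
        labels.append("B%d" % (b + 1))
    return labels

humidity = [85, 90, 86, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 91]
print(equal_width_bins(humidity, 3))
```

With 3 bins the range 65-96 is cut at roughly 75.3 and 85.7, so 85 falls in B2 and 96 in B3.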


NominalToBinary: SCHEMA: weka.filters.unsupervised.attribute.NominalToBinary -R first-last
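The core idea of NominalToBinary is to replace a k-valued nominal attribute with indicator columns. The sketch below uses the simplest one-column-per-value encoding; Weka's actual encoding depends on the number of values and the filter's options, so treat this only as an illustration:

```python
def nominal_to_binary(values, categories):
    """One-hot encode a nominal attribute: one 0/1 column per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]

outlook = ["sunny", "overcast", "rainy", "sunny"]
print(nominal_to_binary(outlook, ["sunny", "overcast", "rainy"]))
# → [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```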


Normalize: SCHEMA: weka.filters.unsupervised.attribute.Normalize -S 1.0 -T 0.0
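Normalize rescales every numeric attribute via min-max normalization; the -S (scale) and -T (translation) parameters set the target range. A minimal sketch of that arithmetic (illustrative, not Weka's implementation):

```python
def normalize(values, scale=1.0, translation=0.0):
    """Min-max normalize a numeric attribute into
    [translation, translation + scale]."""
    lo, hi = min(values), max(values)
    return [translation + scale * (v - lo) / (hi - lo) for v in values]

print(normalize([64, 72, 80]))  # → [0.0, 0.5, 1.0]
```

With the default -S 1.0 -T 0.0 every numeric attribute ends up in [0, 1].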


NumericToBinary: SCHEMA: weka.filters.unsupervised.attribute.NumericToBinary


SwapValues: SCHEMA: weka.filters.unsupervised.attribute.SwapValues -C last -F first -S last


StringToNominal: SCHEMA: weka.filters.unsupervised.attribute.StringToNominal


SparseToNonSparse: SCHEMA: weka.filters.unsupervised.instance.SparseToNonSparse
Sparse ARFF Structure:
@relation <relation name>
@attribute <attribute name> <datatype>
...
@data
{<index> <value1>, <index> <value2>, <index> <value3>, ...}
Sparse Weather ARFF Dataset:
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
{0 sunny,1 85,2 85,3 FALSE,4 no}
{0 sunny,1 80,2 90,3 TRUE,4 no}
{0 overcast,1 83,2 86,3 FALSE,4 yes}
{0 rainy,1 70,2 96,3 FALSE,4 yes}
{0 rainy,1 68,2 80,3 FALSE,4 yes}
{0 rainy,1 65,2 70,3 TRUE,4 no}
{0 overcast,1 64,2 65,3 TRUE,4 yes}
{0 sunny,1 72,2 95,3 FALSE,4 no}
{0 sunny,1 69,2 70,3 FALSE,4 yes}
{0 rainy,1 75,2 80,3 FALSE,4 yes}
{0 sunny,1 75,2 70,3 TRUE,4 yes}
{0 overcast,1 72,2 90,3 TRUE,4 yes}
{0 overcast,1 81,2 75,3 FALSE,4 yes}
{0 rainy,1 71,2 91,3 TRUE,4 no}


Click on the Apply button and save the output as weather1.arff:
@relation weather-weka.filters.unsupervised.instance.SparseToNonSparse
@attribute outlook {sunny,overcast,rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE,FALSE}
@attribute play {yes,no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
NonSparseToSparse: SCHEMA: weka.filters.unsupervised.instance.NonSparseToSparse


Click on the Apply button and save the output as weather2.arff:
@relation weather-weka.filters.unsupervised.instance.NonSparseToSparse
@attribute outlook {sunny,overcast,rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE,FALSE}
@attribute play {yes,no}
@data
{1 85,2 85,3 FALSE,4 no}
{1 80,2 90,4 no}
{0 overcast,1 83,2 86,3 FALSE}
{0 rainy,1 70,2 96,3 FALSE}
{0 rainy,1 68,2 80,3 FALSE}
{0 rainy,1 65,2 70,4 no}
{0 overcast,1 64,2 65}
{1 72,2 95,3 FALSE,4 no}
{1 69,2 70,3 FALSE}
{0 rainy,1 75,2 80,3 FALSE}
{1 75,2 70}
{0 overcast,1 72,2 90}
{0 overcast,1 81,2 75,3 FALSE}
{0 rainy,1 71,2 91,4 no}
Note that values stored internally as 0 (numeric zeros and the first value of each nominal attribute, here "sunny", "TRUE" and "yes") are omitted in sparse format, which is why the rows have different lengths.
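Expanding a sparse row back to a dense one therefore means filling every omitted index with the value stored as 0. A minimal sketch of that expansion (illustrative; Weka's SparseInstance class does the real work):

```python
def sparse_to_dense(sparse_rows, num_attrs, defaults):
    """Expand sparse rows ({index: value} dicts) into full rows.
    An omitted index means the value stored as 0: numeric 0 or the
    first value of a nominal attribute."""
    return [[row.get(i, defaults[i]) for i in range(num_attrs)]
            for row in sparse_rows]

# defaults: first nominal value (or numeric zero) per attribute
defaults = ["sunny", 0, 0, "TRUE", "yes"]
sparse = [{1: 85, 2: 85, 3: "FALSE", 4: "no"},
          {0: "overcast", 1: 64, 2: 65}]
print(sparse_to_dense(sparse, 5, defaults))
# → [['sunny', 85, 85, 'FALSE', 'no'], ['overcast', 64, 65, 'TRUE', 'yes']]
```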

Exp No: 3 Date: _ _/_ _/_ _


Name of the Experiment: ..
Classifying the Data:
The user has the option of applying many different algorithms to the data set that would, in theory, produce a representation of the information that makes observation easier. It is difficult to identify which of the options would provide the best output for the experiment. The best approach is to independently apply a mixture of the available choices and see what yields something close to the desired results. The Classify tab is where the user selects the classifier choices. Again, there are several options to be selected inside the Classify tab. Test options give the user the choice of four different test mode scenarios on the data set:
1. Use training set
2. Supplied test set
3. Cross-validation
4. Percentage split
There is the option of applying any or all of the modes to produce results that can be compared by the user. Additionally, inside the test options toolbox there is a drop-down menu so the user can select various items which, depending on the choice, can provide output options such as saving the results to a file or specifying the random seed value to be applied for the classification. The classifiers in WEKA are trained to produce output that is classified based on the characteristics of the last attribute in the data set. For a different attribute to be used as the class, the option must be selected by the user in the options menu before testing is performed. Finally, the results are calculated and shown in the text box on the lower right. They can be saved to a file and retrieved for comparison at a later time, or viewed within the window after changes and different results have been derived.
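The four test modes differ in how the data is split between training and testing. The fold bookkeeping behind cross-validation can be sketched as follows (plain index arithmetic for illustration; Weka additionally randomizes and stratifies the folds by class):

```python
def cross_validation_folds(n, k):
    """Split instance indices 0..n-1 into k folds and yield
    (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n))
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test

# 14 weather instances, 10-fold cross-validation
splits = list(cross_validation_folds(14, 10))
```

Each instance appears in exactly one test fold, so every instance is predicted exactly once over the k runs.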
Weather Dataset.arff:
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes

sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
Implement weka.classifiers.trees.J48
Use Training Set Testing Option:


=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     weather
Instances:    14
Attributes:   5
              outlook
              temperature
              humidity
              windy
              play
Test mode:    evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree
------------------
outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves  : 5
Size of the tree  : 8

Time taken to build model: 0.02 seconds

=== Evaluation on training set ===
=== Summary ===

Correctly Classified Instances          14              100      %
Incorrectly Classified Instances         0                0      %

Kappa statistic                          1
Mean absolute error                      0
Root mean squared error                  0
Relative absolute error                  0      %
Root relative squared error              0      %
Coverage of cases (0.95 level)         100      %
Mean rel. region size (0.95 level)      50      %
Total Number of Instances               14

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
1        0        1          1       1          1         yes
1        0        1          1       1          1         no

=== Confusion Matrix ===

 a b   <-- classified as
 9 0 | a = yes
 0 5 | b = no

Visualize Tree:


Use Cross Validation Testing Option:

Classifier Output:

=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     weather
Instances:    14
Attributes:   5
              outlook
              temperature
              humidity
              windy
              play
Test mode:    10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree
------------------
outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves  : 5

Size of the tree : 8

Time taken to build model: 0 seconds

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances           9               64.2857 %
Incorrectly Classified Instances         5               35.7143 %
Kappa statistic                          0.186
Mean absolute error                      0.2857
Root mean squared error                  0.4818
Relative absolute error                 60      %
Root relative squared error             97.6586 %
Coverage of cases (0.95 level)          92.8571 %
Mean rel. region size (0.95 level)      64.2857 %
Total Number of Instances               14

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.778    0.6      0.7        0.778   0.737      0.789     yes
0.4      0.222    0.5        0.4     0.444      0.789     no

=== Confusion Matrix ===

 a b   <-- classified as
 7 2 | a = yes
 3 2 | b = no

Visualize Tree:


Use Supplied Test Set Testing Options: We will now use our model to classify new instances. A portion of the new-instances ARFF file is depicted in the figure. Note that the attribute section is identical to the training data (the bank data we used for building our model). However, in the data section, the value of the "pep" attribute is "?" (or unknown).

In the main panel, under "Test options", click the "Supplied test set" radio button, and then click the "Set..." button. This will pop up a window which allows you to open the file containing the test instances, as in the figures.

In this case, we open the file "bank-new.arff" and, upon returning to the main window, we click the "Start" button. This, once again, generates the model from our training data, but this time it applies the model to the new unclassified instances in the "bank-new.arff" file in order to predict the value of the "pep" attribute. The result is depicted in Figure 28. Note that the summary of the results in the right panel does not show any statistics. This is because in our test instances the value of the class attribute ("pep") was left as "?", so WEKA has no actual values to which it can compare the predicted values of the new instances.

Of course, in this example we are interested in knowing how our model managed to classify the new instances. To do so we need to create a file containing all the new instances along with their predicted class values resulting from the application of the model. Doing this is much simpler using the command-line version of the WEKA classifier application. However, it is possible to do so in the GUI version using an "indirect" approach, as follows. First, right-click the most recent result set in the left "Result list" panel. In the resulting pop-up window select the menu item "Visualize classifier errors". This brings up a separate window containing a two-dimensional graph.


We would like to "save" the classification results from which the graph is generated. In the new window, we click on the "Save" button and save the result as the file: "bank-predicted.arff". This file contains a copy of the new instances along with an additional column for the predicted value of "pep".

Note that two attributes have been added to the original new-instances data: "Instance_number" and "predictedpep". These correspond to new columns in the data portion. The "predictedpep" value for each new instance is the last value before the "?", which stands for the (unknown) actual "pep" class value. For example, the predicted value of the "pep" attribute for instance 0 is "YES" according to our model, while the predicted class value for instance 4 is "NO".
Use Percentage Split Testing Options:

Classifier Output:

=== Run information ===

Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     weather
Instances:    14
Attributes:   5
              outlook
              temperature
              humidity
              windy
              play
Test mode:    split 66.0% train, remainder test

=== Classifier model (full training set) ===

J48 pruned tree
------------------
outlook = sunny
|   humidity <= 75: yes (2.0)
|   humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)

Number of Leaves  : 5
Size of the tree  : 8

Time taken to build model: 0 seconds

=== Evaluation on test split ===
=== Summary ===

Correctly Classified Instances           2               40      %
Incorrectly Classified Instances         3               60      %
Kappa statistic                         -0.3636
Mean absolute error                      0.6
Root mean squared error                  0.7746
Relative absolute error                126.9231 %
Root relative squared error            157.6801 %
Coverage of cases (0.95 level)          40      %
Mean rel. region size (0.95 level)      50      %
Total Number of Instances                5

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.667    1        0.5        0.667   0.571      0.333     yes
0        0.333    0          0       0          0.333     no

=== Confusion Matrix ===

 a b   <-- classified as
 2 1 | a = yes
 2 0 | b = no
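The 5 test instances in this run follow directly from the split arithmetic: a 66% percentage split of 14 instances leaves a remainder for testing. A one-liner sketch of that arithmetic (assuming the usual round-to-nearest convention; Weka's exact rounding may differ by a single instance on other sizes):

```python
def percentage_split(n, train_percent):
    """Return (train_size, test_size) for a percentage split,
    rounding the training portion to the nearest instance."""
    train_size = int(round(n * train_percent / 100.0))
    return train_size, n - train_size

print(percentage_split(14, 66))  # → (9, 5)
```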


Exp No: 4 Date: _ _/_ _/_ _

Name of the Experiment: ..


Analyzing the Output:
Run Information:
The first line of the run information section contains information about the learning scheme chosen and its parameters. The parameters chosen (both default and modified) are shown in short form. In our example, the learning scheme was weka.classifiers.trees.J48, the J48 algorithm. The second line shows information about the relation. Relations in Weka are like data files. The name of the relation contains the name of the data file used to build it and the names of any filters that have been applied to it. The next part shows the number of instances in the relation, followed by the number of attributes. This is followed by the list of attributes. The last part shows the type of testing that was employed; in our example it was 10-fold cross-validation.
Classifier Model (full training set):


J48 pruned tree
------------------
Outlook = sunny
|   Humidity <= 75: yes (2.0)
|   Humidity > 75: no (3.0)
Outlook = overcast: yes (4.0)
Outlook = rainy
|   Windy = TRUE: no (2.0)
|   Windy = FALSE: yes (3.0)

Number of Leaves: 5
Size of the tree: 8

This displays information about the model generated using the full training set. It says "full training set" because we used cross-validation, and what is displayed here is the final model, built using all of the dataset. When using tree models, a text display of the generated tree is shown, followed by information about the number of leaves and the overall tree size (above).
Confusion Matrix:
A confusion matrix is an easy way of describing the results of the experiment. The best way to describe it is by example:
=== Confusion Matrix ===
 a b   <-- classified as
 7 2 | a = yes
 3 2 | b = no
The confusion matrix is more commonly named a contingency table. In our case we have two classes, and therefore a 2x2 confusion matrix; in general the matrix can be arbitrarily large. The number of correctly classified instances is the sum of the diagonal elements of the matrix; all others are incorrectly classified (class "a" gets misclassified as "b" exactly twice, and class "b" gets misclassified as "a" three times).
Detailed Accuracy By Class:
The True Positive (TP) rate is the proportion of examples which were classified as class x among all examples which truly have class x, i.e. how much of the class was captured. It is equivalent to Recall. In the confusion matrix, this is the diagonal element divided by the sum over the relevant row, i.e. 7/(7+2) = 0.778 for class yes and 2/(3+2) = 0.4 for class no in our example.
The False Positive (FP) rate is the proportion of examples which were classified as class x among all examples which are not of class x. In the matrix, this is the column sum of class x minus the diagonal element, divided by the row sums of all other classes, i.e. 3/5 = 0.6 for class yes and 2/9 = 0.222 for class no.
The Precision is the proportion of the examples which truly have class x among all those which were classified as class x. In the matrix, this is the diagonal element divided by the sum over the relevant column, i.e. 7/(7+3) = 0.7 for class yes and 2/(2+2) = 0.5 for class no.
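These per-class rates can be checked mechanically. A small sketch computing them from the 2x2 confusion matrix (rows = actual class, columns = predicted class; illustrative code, not Weka's evaluation module):

```python
def per_class_metrics(cm):
    """Compute TP rate (recall), FP rate and precision for each class
    of a square confusion matrix cm[actual][predicted]."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    metrics = []
    for x in range(n_classes):
        tp = cm[x][x]
        row_sum = sum(cm[x])                                # truly class x
        col_sum = sum(cm[r][x] for r in range(n_classes))   # predicted as x
        tp_rate = tp / row_sum
        fp_rate = (col_sum - tp) / (total - row_sum)
        precision = tp / col_sum
        metrics.append((tp_rate, fp_rate, precision))
    return metrics

cm = [[7, 2],   # actual yes: 7 predicted yes, 2 predicted no
      [3, 2]]   # actual no:  3 predicted yes, 2 predicted no
for tp_rate, fp_rate, precision in per_class_metrics(cm):
    print(round(tp_rate, 3), round(fp_rate, 3), round(precision, 3))
# → 0.778 0.6 0.7
#   0.4 0.222 0.5
```

The printed rows reproduce the yes/no lines of the cross-validation run's Detailed Accuracy By Class table.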

For class yes, precision = TP/(TP+FP); for class no, precision = TN/(TN+FN). The F-Measure is simply 2*Precision*Recall/(Precision+Recall), a combined measure of precision and recall.
Evaluating numeric prediction:
The same strategies apply: independent test set, cross-validation, significance tests, etc. The difference lies in the error measures. Let the actual target values be a1, a2, ..., an and the predicted target values be p1, p2, ..., pn.
The most popular measure is the mean-squared error:
((p1 - a1)^2 + ... + (pn - an)^2) / n
The root mean-squared error is:
sqrt(((p1 - a1)^2 + ... + (pn - an)^2) / n)
The mean absolute error is less sensitive to outliers than the mean-squared error:
(|p1 - a1| + ... + |pn - an|) / n
Sometimes relative error values are more appropriate (e.g. 10% for an error of 50 when predicting 500): how much does the scheme improve on simply predicting the average? Writing ā for the mean of the actual values, the relative squared error is:
((p1 - a1)^2 + ... + (pn - an)^2) / ((ā - a1)^2 + ... + (ā - an)^2)
The relative absolute error is:
(|p1 - a1| + ... + |pn - an|) / (|ā - a1| + ... + |ā - an|)
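These measures can be computed directly; a minimal sketch with made-up values (not taken from any Weka run):

```python
import math

def numeric_errors(actual, predicted):
    """Compute the numeric-prediction error measures described above."""
    n = len(actual)
    diffs = [p - a for p, a in zip(predicted, actual)]
    mse = sum(d * d for d in diffs) / n
    mae = sum(abs(d) for d in diffs) / n
    mean_a = sum(actual) / n
    # denominators: the error of always predicting the average
    rse = sum(d * d for d in diffs) / sum((mean_a - a) ** 2 for a in actual)
    rae = sum(abs(d) for d in diffs) / sum(abs(mean_a - a) for a in actual)
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae,
            "RSE": rse, "RAE": rae}

print(numeric_errors([500.0, 300.0, 700.0], [550.0, 280.0, 690.0]))
```

A relative error below 1 (100%) means the scheme beats the trivial predict-the-average baseline.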


Comparing four numeric prediction schemes A-D on these measures:

                         A       B       C       D
Root mean-squared error  67.8    91.7    63.3    57.4
Mean absolute error      41.3    38.5    33.4    29.2
Root rel squared error   42.2%   57.2%   39.4%   35.8%
Relative absolute error  43.1%   40.1%   34.8%   30.4%
Correlation coefficient  0.88    0.88    0.89    0.91

D is best, C second-best; the ranking of A and B is arguable.

Exp No: 5 Date: _ _/_ _/_ _

Name of the Experiment: ..


Clustering:
When clustering, Weka shows the number of clusters and how many instances each cluster contains. For some algorithms the number of clusters can be specified by setting a parameter in the object editor. For probabilistic clustering methods, Weka measures the log-likelihood of the clusters on the training data: the larger this quantity, the better the model fits the data. Increasing the number of clusters normally increases the likelihood, but may overfit. The controls on the Cluster panel are similar to those for Classify. You can specify some of the same evaluation methods: use training set, supplied test set, and percentage split (the last two are used with the log-likelihood).
K-Means Clustering in Weka:
This example illustrates the use of k-means clustering with WEKA. The sample data set used for this example is based on the "bank data" available in ARFF format (bank-data.arff). As an illustration of

performing clustering in WEKA, we will use its implementation of the K-means algorithm to cluster the customers in this bank data set, and to characterize the resulting customer segments, once the data file is loaded in the Weka Explorer.
Bank Data Set:
@relation bank
@attribute Instance_number numeric
@attribute age numeric
@attribute sex {"FEMALE","MALE"}
@attribute region {"INNER_CITY","TOWN","RURAL","SUBURBAN"}
@attribute income numeric
@attribute married {"NO","YES"}
@attribute children {0,1,2,3}
@attribute car {"NO","YES"}
@attribute save_act {"NO","YES"}
@attribute current_act {"NO","YES"}
@attribute mortgage {"NO","YES"}
@attribute pep {"YES","NO"}
@data
0,48,FEMALE,INNER_CITY,17546,NO,1,NO,NO,NO,NO,YES
1,40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO
2,51,FEMALE,INNER_CITY,16575.4,YES,0,YES,YES,YES,NO,NO
3,23,FEMALE,TOWN,20375.4,YES,3,NO,NO,YES,NO,NO
4,57,FEMALE,RURAL,50576.3,YES,0,NO,YES,NO,NO,NO
5,57,FEMALE,TOWN,37869.6,YES,2,NO,YES,YES,NO,YES
6,22,MALE,RURAL,8877.07,NO,0,NO,NO,YES,NO,YES
7,58,MALE,TOWN,24946.6,YES,0,YES,YES,YES,NO,NO
8,37,FEMALE,SUBURBAN,25304.3,YES,2,YES,NO,NO,NO,NO
9,54,MALE,TOWN,24212.1,YES,2,YES,YES,YES,NO,NO
10,66,FEMALE,TOWN,59803.9,YES,0,NO,YES,YES,NO,NO
11,52,FEMALE,INNER_CITY,26658.8,NO,0,YES,YES,YES,YES,NO
12,44,FEMALE,TOWN,15735.8,YES,1,NO,YES,YES,YES,YES
13,66,FEMALE,TOWN,55204.7,YES,1,YES,YES,YES,YES,YES
14,36,MALE,RURAL,19474.6,YES,0,NO,YES,YES,YES,NO
15,38,FEMALE,INNER_CITY,22342.1,YES,0,YES,YES,YES,YES,NO
16,37,FEMALE,TOWN,17729.8,YES,2,NO,NO,NO,YES,NO
17,46,FEMALE,SUBURBAN,41016,YES,0,NO,YES,NO,YES,NO
18,62,FEMALE,INNER_CITY,26909.2,YES,0,NO,YES,NO,NO,YES
19,31,MALE,TOWN,22522.8,YES,0,YES,YES,YES,NO,NO
20,61,MALE,INNER_CITY,57880.7,YES,2,NO,YES,NO,NO,YES
21,50,MALE,TOWN,16497.3,YES,2,NO,YES,YES,NO,NO
22,54,MALE,INNER_CITY,38446.6,YES,0,NO,YES,YES,NO,NO
23,27,FEMALE,TOWN,15538.8,NO,0,YES,YES,YES,YES,NO
24,22,MALE,INNER_CITY,12640.3,NO,2,YES,YES,YES,NO,NO
25,56,MALE,INNER_CITY,41034,YES,0,YES,YES,YES,YES,NO
26,45,MALE,INNER_CITY,20809.7,YES,0,NO,YES,YES,YES,NO
27,39,FEMALE,TOWN,20114,YES,1,NO,NO,YES,NO,YES
28,39,FEMALE,INNER_CITY,29359.1,NO,3,YES,NO,YES,YES,NO
29,61,MALE,RURAL,24270.1,YES,1,NO,NO,YES,NO,YES
30,61,FEMALE,RURAL,22942.9,YES,2,NO,YES,YES,NO,NO


In the pop-up window we enter 5 as the number of clusters (instead of the default value of 2) and we leave the value of "seed" as is. The seed value is used in generating a random number which is, in turn, used for making the initial assignment of instances to clusters. Note that, in general, K-means is quite sensitive to how clusters are initially assigned. Thus, it is often necessary to try different seed values and evaluate the results.
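The role of the seed is easy to see in a bare-bones k-means sketch: it fixes the random initial centroids, and everything downstream depends on them. This is an illustration on 2-D numeric points only, not Weka's SimpleKMeans (which also normalizes distances and handles nominal attributes):

```python
import random

def kmeans(points, k, seed, iterations=100):
    """Plain k-means on 2-D points: the seed fixes the random initial
    centroids, to which the final clustering is sensitive."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                            + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # move each centroid to the mean of its cluster
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return centroids, clusters

points = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0)]
centroids, clusters = kmeans(points, 2, seed=10)
```

On these well-separated points any seed converges to the same two clusters; on harder data, different seeds can yield different final clusterings, which is exactly why Weka exposes the seed.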

Once the options have been specified, we can run the clustering algorithm. We make sure that the "Use Percentage Split" option is selected in the "Cluster Mode" panel and click "Start". Right-clicking the result set in the "Result list" panel lets us view the clustering results in a separate window.
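The "Percentage Split" mode holds out part of the data for evaluation. A minimal sketch of the idea (Weka's exact instance selection may differ):

```python
def percentage_split(instances, train_pct=66):
    """Split a dataset into a training portion and a held-out test portion."""
    cut = len(instances) * train_pct // 100
    return instances[:cut], instances[cut:]

data = list(range(600))               # stand-in for the 600 bank instances
train, test = percentage_split(data)
assert len(train) == 396 and len(test) == 204
```

The 396 training instances here match the "Full Data (396)" figure reported in the model-and-evaluation-on-test-split section of the cluster output.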

Cluster Output:

=== Run information ===

Scheme:       weka.clusterers.SimpleKMeans -N 5 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10
Relation:     bank
Instances:    600
Attributes:   12
              Instance_number
              age
              sex
              region
              income
              married
              children
              car
              save_act
              current_act
              mortgage
              pep
Test mode:    split 66% train, remainder test

=== Clustering model (full training set) ===

kMeans
======

Number of iterations: 14

Within cluster sum of squared errors: 1719.2889887418955
Missing values globally replaced with mean/mode

Cluster centroids:
                            Cluster#
Attribute          Full Data          0          1          2          3          4
                       (600)       (66)      (112)      (120)      (137)      (165)
===================================================================================
Instance_number        299.5   306.6364   265.9732   302.2333    320.292   300.1515
age                   42.395    40.0606    32.7589     51.475    44.3504    41.6424
sex                   FEMALE     FEMALE     FEMALE     FEMALE     FEMALE       MALE
region            INNER_CITY      RURAL INNER_CITY INNER_CITY       TOWN INNER_CITY
income            27524.0312 26206.1992 18260.9218 34922.1563  27626.442 28873.3638
married                  YES         NO        YES        YES        YES        YES
children                   0          3          2          1          0          0
car                       NO         NO         NO         NO         NO        YES
save_act                 YES        YES        YES        YES        YES        YES
current_act              YES        YES        YES        YES        YES        YES
mortgage                  NO         NO         NO         NO         NO        YES
pep                       NO         NO         NO        YES         NO        YES

Time taken to build model (full training data) : 0.14 seconds

=== Model and evaluation on test split ===

kMeans
======

Number of iterations: 10
Within cluster sum of squared errors: 1115.231316606429
Missing values globally replaced with mean/mode

Cluster centroids:
                            Cluster#
Attribute          Full Data          0          1          2          3          4
                       (396)      (131)       (63)       (80)       (41)       (81)
===================================================================================
Instance_number     299.6364   277.1374   347.1905   362.0625   231.5122   271.8642
age                  43.1061    40.4733    49.1111       45.1    51.2439    36.6049
sex                     MALE     FEMALE       MALE       MALE     FEMALE       MALE
region            INNER_CITY INNER_CITY INNER_CITY       TOWN      RURAL INNER_CITY
income             27825.983 25733.2533 32891.0238 30817.2439 32090.4995 22158.1307
married                  YES        YES        YES         NO         NO        YES
children                   0          0          1          0          0          0
car                       NO        YES        YES         NO        YES         NO
save_act                 YES        YES        YES        YES        YES         NO
current_act              YES        YES         NO        YES        YES        YES
mortgage                  NO         NO         NO         NO         NO        YES
pep                       NO         NO        YES        YES         NO        YES

Time taken to build model (percentage split) : 0.04 seconds

Clustered Instances

0       73 ( 36%)
1       37 ( 18%)
2       30 ( 15%)
3       28 ( 14%)
4       36 ( 18%)

The result window shows the centroid of each cluster as well as statistics on the number and percentage of instances assigned to each cluster. Cluster centroids are the mean vectors for each cluster (each dimension value in the centroid represents the mean value for that dimension in the cluster), so centroids can be used to characterize the clusters. For example, the centroid for cluster 1 shows that this is a segment of younger to middle-aged (approx. 38) females living in the inner city with an average income of approx. $28,500, who are married with one child, etc. Furthermore, this group has on average said YES to the PEP product. Another way of understanding the characteristics of each cluster is through visualization. We can do this by right-clicking the result set in the left "Result list" panel and selecting "Visualize cluster assignments".
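Centroid computation itself is simple: the mean for numeric attributes and the mode (most frequent value) for nominal ones. A sketch with illustrative data:

```python
from statistics import mean, mode

def centroid(cluster_rows, numeric_cols):
    """Mean for numeric attributes, mode for nominal ones."""
    columns = list(zip(*cluster_rows))
    return [mean(col) if i in numeric_cols else mode(col)
            for i, col in enumerate(columns)]

# four (age, sex) pairs standing in for one cluster's members
rows = [(48, "FEMALE"), (40, "MALE"), (51, "FEMALE"), (23, "FEMALE")]
assert centroid(rows, numeric_cols={0}) == [40.5, "FEMALE"]
```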

You can choose the cluster number and any of the other attributes for each of the three dimensions available (x-axis, y-axis, and color). Different combinations of choices will result in a visual rendering of different relationships within each cluster. In the above example, we have chosen the cluster number as the x-axis, the instance number (assigned by WEKA) as the y-axis, and the "sex" attribute as the color dimension. This results in a visualization of the distribution of males and females in each cluster. For instance, you can note that clusters 2 and 3 are dominated by males, while clusters 4 and 5 are dominated by females. By changing the color dimension to other attributes, we can see their distribution within each of the clusters. Finally, we may be interested in saving the resulting data set, which includes each instance along with its assigned cluster. To do so, we click the "Save" button in the visualization window and save the result as the file "bank-kmeans.arff".

@relation bank_clustered
@attribute Instance_number numeric
@attribute age numeric
@attribute sex {FEMALE,MALE}
@attribute region {INNER_CITY,TOWN,RURAL,SUBURBAN}
@attribute income numeric

"@attribute married {NO,YES}" "@attribute children {0,1,2,3}" "@attribute car {NO,YES}" "@attribute save_act {NO,YES}" "@attribute current_act {NO,YES}" "@attribute mortgage {NO,YES}" "@attribute pep {YES,NO}" "@attribute Cluster {cluster0,cluster1,cluster2,cluster3,cluster4,cluster5}" @data "0,48,FEMALE,INNER_CITY,17546,NO,1,NO,NO,NO,NO,YES,cluster1" "1,40,MALE,TOWN,30085.1,YES,3,YES,NO,YES,YES,NO,cluster3" "2,51,FEMALE,INNER_CITY,16575.4,YES,0,YES,YES,YES,NO,NO,cluster2" "3,23,FEMALE,TOWN,20375.4,YES,3,NO,NO,YES,NO,NO,cluster5" "4,57,FEMALE,RURAL,50576.3,YES,0,NO,YES,NO,NO,NO,cluster5" "5,57,FEMALE,TOWN,37869.6,YES,2,NO,YES,YES,NO,YES,cluster5" "6,22,MALE,RURAL,8877.07,NO,0,NO,NO,YES,NO,YES,cluster0" "7,58,MALE,TOWN,24946.6,YES,0,YES,YES,YES,NO,NO,cluster2" "8,37,FEMALE,SUBURBAN,25304.3,YES,2,YES,NO,NO,NO,NO,cluster5" "9,54,MALE,TOWN,24212.1,YES,2,YES,YES,YES,NO,NO,cluster2" "10,66,FEMALE,TOWN,59803.9,YES,0,NO,YES,YES,NO,NO,cluster5" "11,52,FEMALE,INNER_CITY,26658.8,NO,0,YES,YES,YES,YES,NO,cluster4" "12,44,FEMALE,TOWN,15735.8,YES,1,NO,YES,YES,YES,YES,cluster1" "13,66,FEMALE,TOWN,55204.7,YES,1,YES,YES,YES,YES,YES,cluster1" "14,36,MALE,RURAL,19474.6,YES,0,NO,YES,YES,YES,NO,cluster5" "15,38,FEMALE,INNER_CITY,22342.1,YES,0,YES,YES,YES,YES,NO,cluster2" "16,37,FEMALE,TOWN,17729.8,YES,2,NO,NO,NO,YES,NO,cluster5" "17,46,FEMALE,SUBURBAN,41016,YES,0,NO,YES,NO,YES,NO,cluster5" "18,62,FEMALE,INNER_CITY,26909.2,YES,0,NO,YES,NO,NO,YES,cluster4" "19,31,MALE,TOWN,22522.8,YES,0,YES,YES,YES,NO,NO,cluster2" "20,61,MALE,INNER_CITY,57880.7,YES,2,NO,YES,NO,NO,YES,cluster2" "21,50,MALE,TOWN,16497.3,YES,2,NO,YES,YES,NO,NO,cluster5" Note that in addition to the "instance_number" attribute, WEKA has also added "Cluster" attribute to the original data set. In the data portion, each instance now has its assigned cluster as the last attribute value. By doing some simple manipulation to this data set, we can easily convert it to a more usable form for additional analysis or processing.


Exp No: 6

Date: _ _/_ _/_ _

Name of the Experiment: ..


Association: The Associate panel is simpler than Classify or Cluster. Weka contains only three algorithms for determining association rules and no methods for evaluating such rules. It shows the output from the Apriori program for association rules on the nominal version of the weather data. Despite the simplicity of the data, several rules are found. The number before the arrow is the number of instances for which the antecedent is true; the number after the arrow is the number of instances in which the consequent is also true; and the confidence (in parentheses) is the ratio between the two. Ten rules are found by default; you can ask for more by using the object editor to change numRules.

Weather Dataset.arff:

@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes

overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
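The counts-before-and-after-the-arrow reading described above can be checked directly against this dataset (a sketch; the rule is the first one reported by Apriori below):

```python
# Each row: (outlook, temperature, humidity, windy, play)
weather = [
    ("sunny", 85, 85, "FALSE", "no"), ("sunny", 80, 90, "TRUE", "no"),
    ("overcast", 83, 86, "FALSE", "yes"), ("rainy", 70, 96, "FALSE", "yes"),
    ("rainy", 68, 80, "FALSE", "yes"), ("rainy", 65, 70, "TRUE", "no"),
    ("overcast", 64, 65, "TRUE", "yes"), ("sunny", 72, 95, "FALSE", "no"),
    ("sunny", 69, 70, "FALSE", "yes"), ("rainy", 75, 80, "FALSE", "yes"),
    ("sunny", 75, 70, "TRUE", "yes"), ("overcast", 72, 90, "TRUE", "yes"),
    ("overcast", 81, 75, "FALSE", "yes"), ("rainy", 71, 91, "TRUE", "no"),
]

# Rule: outlook=overcast ==> play=yes
antecedent = [r for r in weather if r[0] == "overcast"]   # count before the arrow
both = [r for r in antecedent if r[4] == "yes"]           # count after the arrow
confidence = len(both) / len(antecedent)
assert (len(antecedent), len(both), confidence) == (4, 4, 1.0)
```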

Clicking on the "Associate" tab brings up the interface for the association rule algorithms. The Apriori algorithm, which we will use, is the default algorithm selected. However, in order to change the parameters for this run (e.g., support, confidence, etc.) we click on the text box immediately to the right of the "Choose" button. Note that this box, at any given time, shows the specific command line arguments that are to be used for the algorithm. The dialog box for changing the parameters is depicted in Figure a2. Here, you can specify various parameters associated with Apriori. Click on the "More" button to see a synopsis of the different parameters.

WEKA allows the resulting rules to be sorted according to different metrics such as confidence, leverage, and lift. In this example, we have selected lift as the criterion and entered 1.5 as its minimum value. Lift (or improvement) is computed as the confidence of the rule divided by the support of the right-hand side (RHS). In a simplified form, given a rule L => R, lift is the ratio of the probability that L and R occur together to the product of the two individual probabilities for L and R, i.e., lift = Pr(L,R) / Pr(L)·Pr(R). If this value is 1, then L and R are independent. The higher this value, the more likely it is that the co-occurrence of L and R in a transaction is not just a random occurrence, but the result of some relationship between them. Here we also change the default number of rules (10) to 100; this indicates that the program will report no more than the top 100 rules (in this case sorted according to their lift values). The upper bound for minimum support is set to 1.0 (100%) and the lower bound to 0.1 (10%). Apriori in WEKA starts

with the upper bound support and incrementally decreases support (by delta increments, which by default is 0.05, i.e. 5%). The algorithm halts when either the specified number of rules has been generated or the lower bound for minimum support is reached. The significance testing option is applicable only to confidence and is by default not used (-1.0).
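The support schedule can be sketched as follows (an approximation of Weka's behaviour, not its code; note that 15 steps of 0.05 below 1.0 end at 0.25, which matches the "Minimum support: 0.25" and "Number of cycles performed: 15" in the associator output below):

```python
def support_schedule(upper=1.0, lower=0.1, delta=0.05):
    """Support thresholds tried, from one delta below the upper bound down to
    the lower bound (rounding guards against floating-point drift)."""
    levels, s = [], round(upper - delta, 10)
    while s >= lower - 1e-9:
        levels.append(round(s, 2))
        s = round(s - delta, 10)
    return levels

levels = support_schedule()
assert levels[0] == 0.95 and levels[-1] == 0.1 and len(levels) == 18
# 15 cycles (0.95 down to 0.25) correspond to a final minimum support of 0.25
assert levels[14] == 0.25
```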

Once the parameters have been set, the command line text box will show the new command line. We now click on start to run the program.

Associator Output:

=== Run information ===

Scheme:       weka.associations.Apriori -N 5 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation:     weather.symbolic
Instances:    14
Attributes:   5
              outlook
              temperature
              humidity
              windy
              play

=== Associator model (full training set) ===

Apriori
=======

Minimum support: 0.25 (3 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 15

Generated sets of large itemsets:

Size of set of large itemsets L(1): 12
Size of set of large itemsets L(2): 26
Size of set of large itemsets L(3): 4

Best rules found:

 1. outlook=overcast 4 ==> play=yes 4    <conf:(1)> lift:(1.56) lev:(0.1) [1] conv:(1.43)
 2. temperature=cool 4 ==> humidity=normal 4    <conf:(1)> lift:(2) lev:(0.14) [2] conv:(2)
 3. humidity=normal windy=FALSE 4 ==> play=yes 4    <conf:(1)> lift:(1.56) lev:(0.1) [1] conv:(1.43)
 4. outlook=sunny play=no 3 ==> humidity=high 3    <conf:(1)> lift:(2) lev:(0.11) [1] conv:(1.5)
 5. outlook=sunny humidity=high 3 ==> play=no 3    <conf:(1)> lift:(2.8) lev:(0.14) [1] conv:(1.93)


The panel on the left ("Result list") now shows an item indicating the algorithm that was run and the time of the run. You can perform multiple runs in the same session, each with different parameters, and each run will appear as an item in the Result list panel. Clicking on one of the results in this list brings up the details of the run, including the discovered rules, in the right panel. In addition, right-clicking on the result set allows us to save the result buffer into a separate file. In this case, we save the output in the file bank-data-ar1.txt. Note that the rules were discovered based on the specified threshold values for support and lift. For each rule, the frequency counts for the LHS and RHS are given, as well as the values for confidence, lift, leverage, and conviction. Leverage and lift measure similar things, except that leverage measures the difference between the probability of co-occurrence of L and R (see the example above) and the product of the independent probabilities of L and R, i.e., leverage = Pr(L,R) - Pr(L)·Pr(R). In other words, leverage measures the proportion of additional cases covered by both L and R beyond those expected if L and R were independent of each other. Thus, for leverage, values above 0 are desirable, whereas for lift we want to see values greater than 1. Finally, conviction is similar to lift, but it measures the effect of the right-hand side not being true and inverts the ratio: conviction = Pr(L)·Pr(not R) / Pr(L, not R). Thus conviction, in contrast to lift, is not symmetric (and also has no upper bound). In most cases, it is sufficient to focus on a combination of support, confidence, and either lift or leverage to quantitatively measure the "quality" of a rule. However, the real value of a rule, in terms of usefulness and actionability, is subjective and depends heavily on the particular domain and business objectives.
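These metrics follow directly from the frequency counts. A sketch, using rule 1 from the output above (outlook=overcast ==> play=yes, with 4 antecedent instances and 9 of 14 instances having play=yes):

```python
def rule_metrics(n, n_l, n_r, n_lr):
    """Confidence, lift, leverage and conviction for a rule L => R, from the
    total instance count n and the frequency counts of L, R and L-and-R."""
    pl, pr, plr = n_l / n, n_r / n, n_lr / n
    confidence = plr / pl
    lift = plr / (pl * pr)
    leverage = plr - pl * pr
    # textbook conviction: Pr(L)Pr(not R) / Pr(L, not R); infinite at confidence 1
    conviction = pl * (1 - pr) / (pl - plr) if plr < pl else float("inf")
    return confidence, lift, leverage, conviction

conf, lift, lev, conv = rule_metrics(n=14, n_l=4, n_r=9, n_lr=4)
assert conf == 1.0
assert round(lift, 2) == 1.56 and round(lev, 2) == 0.1
assert conv == float("inf")
```

Note that Weka reports a finite conviction (1.43) for this rule; it apparently applies a smoothing term where the textbook formula diverges at confidence 1.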


Exp No: 7

Date: _ _/_ _/_ _

Name of the Experiment: ..


Using the Experimenter: The Experimenter interface to Weka is specialized for conducting experiments in which the user compares several learning schemes on one or more datasets. As its output, the Experimenter produces data that can be used to compare these learning schemes visually and numerically, as well as to conduct significance testing. To demonstrate how to use the Experimenter, we will conduct an experiment comparing two tree-learning algorithms on the birth dataset. There are three main areas in the Experimenter interface, accessed via tabs at the top left of the window: the Setup area, where the experiment parameters are set; the Run area, where the experiment is started and its progress monitored; and the Analyze area, where the results of the experiment are studied.

Click on the Browse button to select the results ARFF (or CSV) file.


Choose any testing option. Click on the New button to create a new experiment.

Click on the Add New button to add a dataset used for comparing the algorithms.

Click on the Add New button to add the classification algorithms to compare.

Setting up the Experiment: The Setup window has seven main areas that must each be configured in order for the experiment to be properly set up. Starting from the top, these areas are Experiment Configuration Mode, Results Destination, Experiment Type, Iteration Control, Datasets, Algorithms and, lastly, the Notes area.

Experiment Configuration Mode: We will be using the simple experimental interface mode, as we do not require the extra features the advanced mode offers. We start by creating a new experiment and then defining its parameters. A new experiment is created by clicking the New button at the top of the window, which creates a blank experiment. After we have finished setting up the experiment, we save it using the Save button. Experiment settings are saved in either the EXP format or the more familiar XML format. These files can be opened later to recall all the experiment configuration settings. Choose Destination: The results of the experiment will be stored in a data file. This area allows one to specify the name and format of this file. This is not necessary if one does not intend to use the data outside of the Experimenter and the data does not need to be examined at a later date. Results can be stored in the ARFF or CSV format, and they can also be sent to an external database.

Set Experiment Type: There are three types of experiments available in the simple interface, varying in how the data is split for training/testing. The options are cross-validation, random split, and random split with order preserved (i.e., the data is split but the order of the instances is not randomized, so it will be instance #1 followed by instance #2 and so on). We will use cross-validation in our example.
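The index bookkeeping behind k-fold cross-validation can be sketched as follows (an illustration of the idea only; Weka additionally randomizes and, for classification, stratifies the folds):

```python
def cross_validation_folds(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    for i in range(k):
        test = list(range(i, n, k))           # every k-th instance, offset i
        held_out = set(test)
        train = [j for j in range(n) if j not in held_out]
        yield train, test

folds = list(cross_validation_folds(n=10, k=5))
assert len(folds) == 5
assert all(len(tr) + len(te) == 10 for tr, te in folds)
# every instance is tested exactly once across the k folds
assert sorted(i for _, te in folds for i in te) == list(range(10))
```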

Iteration Control: For the randomized experiment types, the user has the option of randomizing the data again and repeating the experiment. The Number of Repetitions value controls how many times this takes place. Add data set(s): In this section, the user adds the datasets that will be used in the experiment. Only ARFF files can be used here and, as mentioned before, the Experimenter expects a fully prepared and cleaned dataset. There is no option for choosing the classification variable; the last attribute is always taken as the class attribute. In our example, the birth-weight data set is the only one we will use. Add Algorithms: In this section, the user adds the classification algorithms to be employed in the experiment. The procedure for selecting an algorithm and choosing its options is exactly the same as in the Explorer; the difference is that more than one algorithm can be specified. Algorithms are added by clicking on the Add button in the Algorithm section of the window, which pops up a window the user will use to select the algorithm. This window also displays the available options for the selected algorithm. The first time the window is displayed, the ZeroR rule algorithm is selected, as shown in the picture above. The user can select a different algorithm by clicking on the Choose button; clicking on the More button displays help about the selected algorithm and a description of its available options. For our example, we will add the J48 algorithm with the option for binary splits turned on, and the REPTree algorithm. Individual algorithms can be edited or deleted by selecting the algorithm from the list and clicking the Edit or Delete button. Finally, any extra notes or comments about the experiment setup can be added by clicking on the Notes button at the bottom of the window and entering the information in the window provided. Saving the Experiment Setup:

At this point, we have entered all the options necessary to start our experiment. We now save the experiment setup so that we do not have to re-enter all this information again. This is done by clicking on the Save Options button at the bottom of the window. These settings can be loaded at another time if one wishes to redo or modify the experiment.

Running the Experiment: The next step is to run the experiment, which is done by clicking on the Run tab at the top of the window. There is not much involved in this step; all that is needed is to click on the Start button. The progress of the experiment is displayed in the Status area at the bottom of the window, and any errors are reported in the Log area. Once the experiment has run, the next step is to analyze the results.

Analyzing the output:

Click on the Experiment button to load the experiment results for analysis.


Choose the comparison field used to compare the algorithms.

Sort the dataset in ascending order.

Choose any one Test Base from the given list.


Test Output:
Tester:     weka.experiment.PairedCorrectedTTester
Analysing:  Percent_correct
Datasets:   1
Resultsets: 3
Confidence: 0.05 (two tailed)
Sorted by:
Date:       4/6/12 9:55 PM

Dataset                   (1) trees.J4 | (2) trees  (3) trees
-------------------------------------------------------------
weather.symbolic (525)           50.03 |     60.03      69.02
-------------------------------------------------------------
(v/ /*)                                |   (0/1/0)    (0/1/0)

Key:
(1) trees.J48 '-C 0.25 -M 2' -217733168393644444
(2) trees.REPTree '-M 2 -V 0.0010 -N 3 -S 1 -L -1 -I 0.0' -9216785998198681299
(3) trees.RandomTree '-K 0 -M 1.0 -S 1' 8934314652175299374
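The tester named in this output, weka.experiment.PairedCorrectedTTester, implements the corrected resampled t-test (the variance correction commonly attributed to Nadeau and Bengio). A sketch of the statistic with made-up score differences; treat this as an approximation of Weka's implementation, not a re-creation of the run above:

```python
from statistics import mean, variance

def corrected_t(diffs, test_frac):
    """Corrected resampled t-statistic for per-run score differences between two
    schemes; test_frac is n_test / n_train (1/9 for 10-fold cross-validation)."""
    k = len(diffs)
    return mean(diffs) / ((1 / k + test_frac) * variance(diffs)) ** 0.5

# hypothetical per-run accuracy differences (scheme A minus scheme B) over 10 runs
diffs = [2.0, 1.5, 2.5, 1.0, 3.0, 2.0, 1.5, 2.5, 2.0, 2.0]
t = corrected_t(diffs, test_frac=1 / 9)
assert round(t, 1) == 7.5
```

The correction term test_frac inflates the variance estimate to compensate for the overlap between training sets across folds, making the test more conservative than a plain paired t-test.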


Exp No: 8

Date: _ _/_ _/_ _

Name of the Experiment: ..


Using the Knowledge Flow: In many respects, this interface is similar to the SAS interface. The interface is based on a network flow model: experiments are defined using objects laid out on a canvas called a KnowledgeFlow layout, and these objects are then connected by links that allow them to pass information to each other. Since every facet of Weka is defined as an object, this interface lends itself quite naturally to the internal workings of Weka. Interface of Knowledge Flow: In this model, an experiment consists of a set of objects with links that dictate how information is passed between them. Each object is responsible for performing a single action or a set of actions. Objects can produce information or whole datasets that other objects need to perform their actions. This information is passed via connections that define which object produces the information, which object receives it, and what type of information is passed. The KnowledgeFlow interface is composed of three main areas: the main action buttons on the left, the object selection area, and the layout area. We shall look at each in turn. Left Action Buttons: The picture shows the action buttons available at the top left of the interface window. The top button sets the mouse into a select/modify mode; in this mode, objects can be moved around and manipulated with the mouse. The button below it allows the user to save the experiment setup to a file, and the next button is used to load previously saved experiment setups. Lastly, the stop button is used to stop a running experiment.


Object Selection: The Object Selection area is located in the top part of the interface window, next to the action buttons we just looked at. This area serves as a toolbox from which the user picks objects to add to the experiment layout. Objects are grouped into seven categories: Evaluation, Visualization, Filters, Classifiers, Clusterers, DataSources and DataSinks. Each category is located on a single tab; the user switches categories by clicking on the tab at the top and then selects the desired object to add. Objects are added to the experiment layout by first left-clicking on the desired object and then left-clicking on an empty area of the layout.

The Experiment Layout: The layout area is the largest part of the KnowledgeFlow interface window. It is here that the experiment is laid out and controlled. It is useful to think of this area as a canvas on which the experiment will be drawn. As described before, a user clicks on an object to select it and then clicks here to place it on the layout. Some objects have user-definable configuration options. Once an object has been placed on the layout, its configuration can be accessed by right-clicking on it and selecting Configure from the menu that pops up.

Different objects require different kinds of input(s) and produce different types of outputs. These inputs and outputs are controlled via connections. A connection is displayed as an arrow going from one object to another. The picture on the left shows a connection from the datasource object CSV Loader to the filter object Discretize. It is an active connection, indicated by the red color of the arrow, and it is of type dataset. A dataset connection passes a dataset from one object to another; in this case, the loaded CSV file is passed to the filter object. Certain objects can accept only certain types of connections as input and provide only certain types of connections as outputs. In the example above, a filter object will only accept a single dataset-type connection as input. Output connections are also limited in the types of connections they can offer, but in general not in their number; a datasource object can therefore offer an unlimited number of dataset-type connections as long as there are objects that can accept them.
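The object-and-connection model can be mimicked in a few lines (a toy illustration of the dataflow idea, not Weka's API):

```python
class Node:
    """Minimal stand-in for a KnowledgeFlow object: receives a dataset,
    transforms it, and forwards the result along its outgoing connections."""
    def __init__(self, transform):
        self.transform = transform
        self.targets = []
        self.received = None

    def connect(self, other):             # draw a dataset "arrow" to another node
        self.targets.append(other)

    def accept(self, dataset):
        self.received = self.transform(dataset)
        for target in self.targets:
            target.accept(self.received)

loader = Node(lambda d: d)                          # CSV-loader-like source
discretize = Node(lambda d: [tuple(x) for x in d])  # placeholder filter step
sink = Node(lambda d: d)                            # viewer-like sink
loader.connect(discretize)
discretize.connect(sink)

loader.accept([[1, "yes"], [2, "no"]])
assert sink.received == [(1, "yes"), (2, "no")]
```

Starting the source node pushes the dataset through every connected object in turn, which is essentially what "Start Loading" does in the KnowledgeFlow.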


Right-clicking on an object brings up a menu listing its possible connections. Clicking on a connection from the menu creates an arrow that follows the mouse until the user left-clicks on the desired destination object; when this is done, the arrow points to that object. Objects that are ready to receive a connection of that type from the particular source object are marked as shown in the picture on the right: the blue diamond shapes on the four corners indicate that the object is ready to accept the connection. Lastly, deleting an object is also done by right-clicking on it and selecting Delete from the menu. This action also deletes all connections to and from the object. Status Pane: This is found in the bottom section of the interface and displays the status of a running experiment. Details of the experiment as it runs, or once it has completed, are accessed by clicking on the Log button at the bottom right. There is a bug in this version of Weka where the status does not indicate that an experiment has finished; the Log window also does not display details of the experiment as it should. Setting Up the Experiment: To demonstrate using the KnowledgeFlow interface, we will redo the experiment done in the Experimenter interface. Unlike that experiment, though, we will start with the raw CSV file and preprocess it as we did when using the Explorer interface. Basic Experiment Setup: The first step of the basic experiment setup is to define how the data will be brought in. The second step is to define what (if any) pre-processing needs to take place. The next is to define how the model will be trained and tested and what learning scheme will be used to build the model. The final step is to define how the results of the experiment will be displayed and/or saved.

The Birth-weight Example

For our experiment, we first require a DataSource object to bring in the data. Since we are getting the data from a CSV file, we add a CSV Loader object to the layout and configure it to load birth.csv, the birth-weight data file. When we examined this data file using the Explorer, we had to drop some unnecessary attributes and also convert some attributes Weka did not recognize as nominal. We will deal with the second problem first by adding the unsupervised Discretize filter and configuring it to discretize attributes 4 to 8 and 10

(smoke, ptl, ht, ui, ftv and weight). To drop the unnecessary attributes we use the Remove filter, configured to remove attributes 1 and 9 (bwt and id). Both filters require a dataset as input and produce a dataset as output. We first connect the CSV Loader object to the Discretize object and then connect that to the Remove object. We also need to ensure that weight is specified as the class variable. The way to do this is to use the Class Assigner object and set it to use the last attribute (weight) as the class variable. This object is located in the Evaluation tab of the object selection area; it requires a dataset as input and produces a dataset as output, so we connect it accordingly. The experiment layout at this point should look something like the picture above. The next step is to add our classifiers and connect the training and testing datasets to them. For this experiment, we will be using the J48 (with binary splits) and REPTree algorithms, found in the Classifiers tab of the object selection area. After adding them to the layout, we connect them to the CrossValidation FoldMaker, as shown in the picture on the right. We also need a way to measure the performance of these classifiers; to do this we add a Classifier PerformanceEvaluator object to each classifier. Finally, we need a way to visualize the performance of both models, so we add a graphical and a text visualizer.

The Model PerformanceChart object is a graphical visualizer that takes threshold data as input and produces ROC-style curves. It can accept more than one classifier, so it is possible to view both curves on the same graph. The graph displayed is identical to the threshold curve we looked at in the Explorer interface, except that it can display multiple curves. Similar functionality is planned for the cost and margin curve functions but is not yet implemented. The TextViewer object is a general-purpose text viewer that takes any text as input; this object will display a summary of the performance of the two models. The experiment setup is now complete, and we can start it by right-clicking on the first object, the CSV Loader, and selecting Start Loading from the menu that pops up. The running state of the experiment should be displayed in the status area as well as in the log window, but neither is working in this version of Weka. To look at the results, we can right-click on the visualization objects and select Show Plot or Show Results. Both displays keep the results of past experiment runs, which can be recalled and viewed again.

Filters: Add: weka.filters.unsupervised.attribute.Add

Result:

NonSparseToSparse: weka.filters.unsupervised.instance.NonSparseToSparse

Result:


Classifiers: CrossValidation:

Result:
