
Weka Exercise 1

Getting Acquainted With Weka


Most CE802 students use the Weka package as the basis of their assignments. This is because
it not only provides implementations of a wide range of learning procedures but also includes
the machinery for running systematic experiments and reporting relevant statistics for the
results. In other words, it will do a lot of the work for you.
These exercises serve two purposes:
- They enable you to discover what facilities Weka provides and how to use them.
- They allow you to see some of the learning procedures that we discuss in the
  lectures in action.
Obtaining Weka
Implementations of Weka for a wide variety of machines/operating systems can be
downloaded from the Weka website ( http://www.cs.waikato.ac.nz/ml/weka/index.html ).
Various versions of Weka are on offer; you almost certainly want the stable version, which is
currently Weka 3.6. Since Weka is written in Java it requires the Java virtual machine.
Choose the appropriate download option if you do not already have this on your computer.
The code comes as a self-extracting executable file (weka-3-6-8.exe) so installation is very
simple indeed.
Running Weka
Assuming you do not override the defaults during installation, Weka will be located in a
folder called Weka-3.6 in the Program Files folder. The main program can be launched via a
shortcut or by clicking on a file called either weka.exe or weka.jar (there are minor
differences between different versions). Once launched, a small window will appear, usually
in the top right of your screen, through which you choose the interface you want to use.
The Explorer is the most useful for most CE802 assignments. Clicking on the button will
launch the Explorer interface.
The Explorer Interface
This is probably the most confusing part of becoming familiar with Weka because you are
presented with quite a complex screen.
Initially "preprocess" will have been selected. This is the tab you select when you want to tell
Weka where to find the data set that you want to use.
Weka processes data sets that are in its own ARFF format. Conveniently, the download will
have set up a folder within the Weka-3.6 folder called "data". This contains a selection of data
files in ARFF format.
ARFF format files
You do not need to know about ARFF format unless you wish to convert data from other
formats. However, it is useful to see the information that such files provide to Weka.
The following is an example of an ARFF file for a dataset similar to the one used in the
decision tree lecture:
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
It consists of three parts. The @relation line gives the dataset a name for use within Weka.
The @attribute lines declare the attributes of the examples in the data set (note that this will
include the classification attribute). Each line specifies an attribute's name and the values it
may take. In this example the attributes have nominal values so these are listed explicitly. In
other cases attributes might take numbers as values and in such cases this would be indicated
as in the following example:
@attribute temperature numeric
The remainder of the file lists the actual examples, in comma separated format; the attribute
values appear in the order in which they are declared above.
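To make the three-part structure concrete, the following is a minimal Python sketch of a reader for files like the one above. It is not Weka's own loader: the names `parse_arff` and `ARFF_TEXT` are hypothetical, and the sketch assumes only nominal or numeric attributes, with no quoting, missing values, or sparse data.

```python
# Minimal ARFF reader (illustrative sketch only, not Weka's loader).
# Assumes nominal or numeric attributes, no quoted values, no missing values.

ARFF_TEXT = """\
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""


def parse_arff(text):
    """Return (relation, attributes, rows).

    attributes is a list of (name, spec) pairs, where spec is either a list
    of allowed nominal values or the string "numeric".
    rows is a list of dicts mapping attribute name to the value as a string.
    """
    relation = None
    attributes = []
    rows = []
    in_data = False
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            # Values appear in the order in which the attributes were declared.
            values = [value.strip() for value in line.split(",")]
            names = [name for name, _ in attributes]
            rows.append(dict(zip(names, values)))
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            name, _, spec = rest.strip().partition(" ")
            spec = spec.strip()
            if spec.startswith("{"):
                spec = [value.strip() for value in spec.strip("{}").split(",")]
            else:
                spec = spec.lower()
            attributes.append((name, spec))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print("relation:", relation)
print("attributes:", [name for name, _ in attributes])
print("examples:", len(rows))
```

The last declared attribute ('play') is treated here like any other; deciding which attribute is the class is a separate step, as it is in Weka.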
Opening a data set
In the Explorer window, click on "Open file" and then use the browser to navigate to the
'data' folder within the Weka-3.6 folder. Select the file called weather.nominal.arff. (This is in
fact the file listed above.)
This is a 'toy' data set, like the ones used in class for demonstration purposes. In this case, the
normal usage is to learn to predict the 'play' attribute from four others providing information
about the weather.
The Explorer window should now look like this:
[Screenshot: Explorer window, Preprocess tab, with weather.nominal.arff loaded]
Most of the information it displays is self-explanatory: it is a data set containing 14 examples
(instances) each of which has 5 attributes. The 'play' attribute has been suggested as the class
attribute (i.e. the one that will be predicted from the others).
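The summary figures on this screen can be reproduced by hand. The sketch below, in plain Python rather than Weka, counts the instances and attributes, then tallies how often each value of one attribute occurs and how the class values split within each value. The helper names `value_counts` and `class_breakdown` are hypothetical, and the data is typed in directly from the file listed earlier.

```python
from collections import Counter

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""
rows = [dict(zip(ATTRIBUTES, line.split(","))) for line in DATA.splitlines()]


def value_counts(rows, attribute):
    """Count how many examples take each value of the given attribute."""
    return Counter(row[attribute] for row in rows)


def class_breakdown(rows, attribute, class_attribute="play"):
    """For each value of the attribute, count the class values among its examples."""
    breakdown = {}
    for row in rows:
        breakdown.setdefault(row[attribute], Counter())[row[class_attribute]] += 1
    return breakdown


print("instances:", len(rows), "attributes:", len(ATTRIBUTES))
counts = value_counts(rows, "outlook")
breakdown = class_breakdown(rows, "outlook")
for value, count in counts.items():
    print(value, count, dict(breakdown[value]))
```

For 'outlook' this gives 5 sunny, 4 overcast and 5 rainy examples; the per-value class counts are the numbers that a bar chart of class against 'outlook' would be drawn from.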
Most of the right hand side of the window gives you information about the attributes. Initially, it
will give you information about the first attribute ('outlook'). This shows that it has 3
possible values and tells you how many there are of each value. The bar chart in the lower right
shows how the values of the suggested class variable are distributed across the possible
values of 'outlook'.
If you click on 'temperature' in the panel on the left, the information about the 'outlook'
attribute will be replaced by the corresponding information about the temperature attribute.
Choosing a classifier
Next we must select a machine learning procedure to apply to this data. The task is
classification so click on the 'classify' tab near the top of the Explorer window.
The window should now look like this:
[Screenshot: Explorer window, Classify tab]
By default, a classifier called ZeroR has been selected. We want a different classifier so click
on the Choose button. A hierarchical pop-up menu appears. Click to expand 'Trees', which
appears at the end of this menu, then select J48, which is the decision tree program we want.
The Explorer window now looks like this, indicating that J48 has been chosen:
[Screenshot: Explorer window, Classify tab, with J48 selected]
The other information alongside J48 indicates the parameters that have been chosen for the
program. For this exercise we will ignore these.
Choosing the experimental procedures
The panel headed 'Test options' allows the user to choose the experimental procedure. We
shall have more to say about this later in the course. For the present exercise click on 'Use
training set'. (This will simply build a tree using all the examples in the data set.)
The small panel half way down the left hand side indicates which attribute will be used as the
classification attribute. It will currently be set to 'play'. (Note that this is what actually
determines the classification attribute; the 'class' attribute on the pre-process screen is
simply to allow you to see how a variable appears to depend on the values of other attributes.)
Running the decision tree program
Now, simply click the start button and the program will run. The results will appear in the
scrollable panel on the right of the Explorer window. Normally these will be of great interest
but for present purposes all we need to notice is that the resulting tree classified all 14
training examples correctly. The tree constructed is presented in indented format, a common
method for large trees:
J48 pruned tree
------------------
outlook = sunny
|   humidity = high: no (3.0)
|   humidity = normal: yes (2.0)
outlook = overcast: yes (4.0)
outlook = rainy
|   windy = TRUE: no (2.0)
|   windy = FALSE: yes (3.0)
Number of Leaves : 5
Size of the tree : 8
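The claim that this tree classifies all 14 training examples correctly can be checked by hand. The sketch below transcribes the printed tree into a plain Python function and applies it to the training data; `classify` is a hypothetical name, and the code does not learn anything, it only re-applies the tree that J48 reported.

```python
ATTRIBUTES = ["outlook", "temperature", "humidity", "windy", "play"]
DATA = """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
"""
rows = [dict(zip(ATTRIBUTES, line.split(","))) for line in DATA.splitlines()]


def classify(example):
    """Hand transcription of the pruned tree printed above."""
    if example["outlook"] == "sunny":
        return "no" if example["humidity"] == "high" else "yes"
    if example["outlook"] == "overcast":
        return "yes"
    # Remaining case: outlook = rainy
    return "no" if example["windy"] == "TRUE" else "yes"


# 'Use training set' evaluation: test on the same examples the tree was built from.
correct = sum(classify(row) == row["play"] for row in rows)
print(f"{correct} of {len(rows)} training examples classified correctly")
```

Note that 'temperature' is never tested: the tree needs only three of the four weather attributes to separate these examples.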
The panel on the lower left headed 'Result list (right-click for options)' provides access to
more information about the results. Right clicking will produce a menu from which
'Visualize Tree' can be selected. This will display the decision tree in a more attractive
format:
[Screenshot: graphical view of the decision tree]
Note that this form of display is really only suitable for small trees. Comparing the two forms
should make it clear how the indented format works.
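One way to see how the indented format works is to generate it from a nested representation of the tree. The following Python sketch is an illustration only: the `TREE` structure and the function names are hypothetical, not Weka's internal representation, and the per-leaf instance counts such as (3.0) are omitted.

```python
# A node is either a class label (a string, i.e. a leaf) or a pair
# (attribute, {value: subtree}) representing a test on that attribute.
TREE = ("outlook", {
    "sunny": ("humidity", {"high": "no", "normal": "yes"}),
    "overcast": "yes",
    "rainy": ("windy", {"TRUE": "no", "FALSE": "yes"}),
})


def format_tree(node, depth=0):
    """Return the lines of the indented display: one line per branch,
    with one '|   ' prefix for each level below the root."""
    attribute, branches = node
    lines = []
    for value, child in branches.items():
        prefix = "|   " * depth + f"{attribute} = {value}"
        if isinstance(child, str):
            lines.append(f"{prefix}: {child}")  # leaf: show the predicted class
        else:
            lines.append(prefix)  # internal node: subtree follows, indented
            lines.extend(format_tree(child, depth + 1))
    return lines


def count_leaves(node):
    if isinstance(node, str):
        return 1
    return sum(count_leaves(child) for child in node[1].values())


def count_nodes(node):
    """Tree size: internal test nodes plus leaves."""
    if isinstance(node, str):
        return 1
    return 1 + sum(count_nodes(child) for child in node[1].values())


print("\n".join(format_tree(TREE)))
print("Number of Leaves :", count_leaves(TREE))
print("Size of the tree :", count_nodes(TREE))
```

Each level of nesting adds one vertical bar, so the depth of a test in the graphical view corresponds to the number of bars in front of it in the indented view.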