

• Attributes = columns
• Instances = rows (measurements)

Classifiers come with default parameters. To change or override them, just click on the name of
the classifier:

• As you can see there are several options, which vary from classifier to classifier. To see what
a classifier can and can’t do, you can click “More” and “Capabilities”.
o “More” = gives a synopsis and other information, such as which options you can use and
what those options mean.
o “Capabilities” = tells you which class types and attribute types you can use with this
particular classification algorithm.
▪ So, for this one you can use nominal attributes, missing values, empty
nominal attributes etc.
▪ This is particularly helpful because some algorithms can’t deal with
missing values, so you can use this information to check whether your
algorithm can handle missing values etc.
You can change the parameters and save these settings rather than having to change them manually each
time.

Generally, when you are working on a machine learning problem, you have a set of labelled examples
that are given to you; let's call this A. Sometimes you also have a separate set of examples not
intended to be used for training; let's call this B. This gives us the four test options in Weka:

1. Use training set:

a. Build a model on file A, apply it to file A: This is testing on the training set. This is
generally a bad idea from an evaluation point of view. It is like seeing exactly the same
questions in the exam as you saw while practising. Some classifiers (e.g. nearest
neighbour) always get 100% on the training set.
2. Supplied test set:
a. Build a model on file A, apply it to file B: If you have a file B, this is the one you want
to do. But you don't always have a file B.
3. Cross-validation:
a. But data is expensive! And we could be using B for training instead of testing. So,
what if I take file A and divide it into 5 equal chunks, T, U, V, W, X? Then I train on T, U, V,
W and test on X. Then I train on T, U, V, X and test on W, and similarly test on V, U
and T, and then average the accuracy results. Because I'm repeating the process 5
times, I get lower variance, and so my estimate of the accuracy of my classifier is
likely to be closer to the truth. This would be 5-fold cross-validation. It also takes 5
times as long!
4. Percentage split:
a. Ugh. I hate that it takes 5 times as long, so why don't I do just one fold? Split the
data into 80% training and 20% test. This is percentage split.
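The cross-validation option above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Weka's implementation: the names `k_fold_indices` and `cross_validate` are made up for this example, the folds are plain contiguous chunks (Weka also randomises and stratifies the data), and `evaluate` stands in for whatever trains a classifier on the training indices and returns its accuracy on the test indices.

```python
from statistics import mean


def k_fold_indices(n_instances, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    base, extra = divmod(n_instances, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # spread any remainder over the first folds
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(n_instances, k, evaluate):
    """Train on k-1 folds, test on the held-out fold, and average the k scores."""
    folds = k_fold_indices(n_instances, k)
    scores = []
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        scores.append(evaluate(train_idx, test_idx))  # hypothetical train-and-score callback
    return mean(scores)


# Five folds over 80 instances: each fold is tested once while the other four train.
print([len(fold) for fold in k_fold_indices(80, 5)])  # [16, 16, 16, 16, 16]
```

Each instance is used for testing exactly once, which is why the averaged estimate is steadier than a single split, at the cost of building the model k times.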
In classify tab there is a pull-down menu:

This is where you can choose what attribute you are trying to predict (using linear regression as

I clicked Start for the attribute “healthGrade” and got a message along the lines of “can’t do numeric
classes”. So when you first get your feet wet in WEKA, one of the 1st things you will notice is that
not every algorithm can predict every kind of attribute.

So, this logistic regression algorithm won’t predict numeric class data but will predict nominal class

For test options choose cross validation with 10-fold.

So, relating all 6 of these data attributes, let's see what the 1st logistic regression formula starts out as.

So, click start

It has taken the 6 attributes and assigned a weight to each of them. With all six, logistic regression
initially predicted with 92.5% accuracy.

As I interpreted that data, I looked at the confusion matrix at the bottom and saw that it gets tripped
up more on the successfuls, where it is 32 out of 36 correct, than on the unsuccessfuls, where it is 42 out of 44.

Now let’s go back and see whether using only those 2 attributes helps or hurts this logistic regression
algorithm in its prediction accuracy.

We tick the attributes we don’t need and click Remove.

So, we have two independent variables, and based on how they relate to each other we are going to
see how closely we can predict the dependent outcome variable.

Click start:

This time we get 98.75% prediction accuracy and at bottom we see that all 44 unsuccessful
measurements were predicted correctly, and 35 of 36 successfuls were predicted correctly.
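The two accuracy figures follow directly from the confusion matrix counts quoted in the text. A minimal sketch in Python: the helper name `accuracy` and the row order (successful first, then unsuccessful) are assumptions made for this example, and the off-diagonal counts are just the remainders implied by "32 out of 36", "42 out of 44" and so on.

```python
def accuracy(confusion):
    """Fraction of instances on the main diagonal (rows = actual, columns = predicted)."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Rows: actual successful, actual unsuccessful (assumed order).
all_six_attributes = [[32, 4], [2, 42]]
two_attributes = [[35, 1], [0, 44]]

print(accuracy(all_six_attributes))  # 0.925
print(accuracy(two_attributes))      # 0.9875
```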

For this logistic regression algorithm, removing attributes during pre-processing improved the
predictions, which is not always the case.

So, this is another one of those first-time-seeing-it-in-WEKA lessons: for some algorithms, like this
one, it does not pay to throw all the data you can at them, whereas for other algorithms, like the J48
decision tree, giving them everything can be the better thing to do.

Let's go to Classify and choose J48, which is a decision tree. Again I ask myself, hypothetically, what
do I want this decision tree to predict or find out? Since logistic regression did not work with
numeric classes, I figured I would try to predict a numeric class this time.

So, I went to the pull-down menu and chose “hrv”: that is, if I give this decision tree algorithm
the other 6 attributes, how accurately can it predict the 7th one (hrv)? Or can it at
Like the logistic regression algorithm, this J48 decision tree algorithm does not work with numeric
classes. J48 does work with nominal classes, and since I had 3 of them I ran a decision tree on each,
just to get a little more familiar with WEKA. 1st I did “healthGrade”:

As you can see it predicted 100% accurately. To see how, I clicked Visualize tree:

It looked at all 6 attributes and picked hrv as the most relevant for a prediction. Based on one
question, in fact (was hrv <= 69.5 or > 69.5?), it predicted with 100% accuracy whether healthGrade
was successful or unsuccessful. A small but neat thing as well: I graded success by whether hrv was
above or below 70, and the decision tree here says, “you actually didn’t have any values from 69.6 to
70, so 69.5 is your actual line in the sand to classify success”. So that can be a clear way of thinking
about a metric.

Next, I put “patternSleep” in the pull-down menu, and again the decision tree got 100%. I clicked
Visualize tree:
It shows that it related dayID most closely to the outcome variable. Was the dayID more than 10? It
could predict based on this one question. But the 3rd nominal class variable was not as simple to predict.

sequenceID was predicted correctly only 86.25% of the time.

Visualizing the decision tree shows what the model settled on. The tree started with the
hoursAwake variable, and further down it added the dayID variable, but at certain points I saw
uncertainty. The tree is reflecting the confusion matrix. Let's look at the sequenceID variable more
closely to see why.

It looks nice and segmented, but it has overlap. I took 4 heart measurements each day: the 1st as I
woke up, the 2nd two to 5 hours after that, the 3rd about 2 to 5 hours after that, and the last 2 or
more hours after that, before I went to sleep. So the 2nd sequence can fall at 5 hours awake and so
can the 3rd; the 3rd can fall at 7 hours and so can the fourth. So now you see why it is so easy to
predict the 1st one and so difficult to predict the middle ones.

Along those lines, the decision tree showed me another introductory WEKA lesson: after 18 days
it was only 83% accurate. I saw this decision tree algorithm improve to 86% with 8 more
instances. So we might expect that with more instances or more attributes this tree will have more
potential branches and may become more accurate, although that is not guaranteed.

Next was a neural network: choose MultilayerPerceptron. When you see MultilayerPerceptron, think
neural network. Again, to get my feet wet with neural networks I stepped back and considered a
hypothetical: what could I want a neural network to predict or find out for me? I chose 2 things.
1st, predict the complex sequenceID variable which gave the decision tree trouble; 2nd, predict a
numeric class, hrv again, as the 1st two algorithms only work with nominal classes. Then again, I asked
whether I should remove some attributes, as with logistic regression, or leave them all in, as with the
decision tree. It seemed that the more attributes MultilayerPerceptron had the better, so I left all 7
in. I looked at the test options and chose percentage split this time. Testing at a 66% split means that
it builds its initial model or formula using 53 of the 80 instances, which are the heart measurements I
took. After the initial model is built, it runs it on the remaining third, or 27 instances, to
test its accuracy.
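The 53/27 figures can be reproduced with a little arithmetic. The sketch below assumes the training size is the instance count times the percentage, rounded to the nearest whole instance; that rounding rule and the helper name `percentage_split_sizes` are assumptions for illustration, chosen because they match the numbers in the text.

```python
import math


def percentage_split_sizes(n_instances, train_percent):
    """Return (training size, test size) for a percentage split."""
    # Round half up to the nearest whole instance (assumed rounding rule).
    n_train = math.floor(n_instances * train_percent / 100 + 0.5)
    return n_train, n_instances - n_train


print(percentage_split_sizes(80, 66))  # (53, 27)
```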

So, let’s see how accurate the algorithm was in predicting sequenceID on the 27 instances
that it held out for testing.
It was about 89% (24 of 27). It got the whole 1st sequence right, but again there is uncertainty on
the middle ones. And as you can see, it got all of the last one right.

I looked at the model it built and didn’t quite understand it, but it was helpful nonetheless to see
a 3rd unique algorithm that can be used in data analysis.

I was just trying to get an initial experience with WEKA, and so I was happy to see that
MultilayerPerceptron, a neural network algorithm, can handle a numeric class such as hrv. I
clicked Start, looked at the result, and noticed that a numeric class does not get the same summary
as a nominal class. Remember how I got 24 out of 27 correctly classified instances, or 89%? Well, here
with a numeric class the summary is less about correct vs incorrect. Instead of saying it got 0 out of
27 exactly right, the emphasis is on how close it got.

Error statistics were in the summary, but here I went to “More options” so I could output the
predictions and see accuracy in another way.
So I re-ran it, and the prediction output showed what this neural network model predicted for
those 27 instances, side by side with the actual values. My 1st thought was that this is the type of
thing I can paste into a spreadsheet for anyone to understand, say when needing to ask “are these
predictions close enough?”, because here is where we are with the model. I imagined them, without
knowing what root mean squared error is, looking at the error figures in the summary (RMSE and RAE)
and saying either that the predictions are close enough to be useful now, or that they are not useful yet.
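For readers who, like the hypothetical spreadsheet audience, have not met these error measures, here is a minimal Python sketch using their standard definitions. The function names and the four actual/predicted values are invented for illustration and are not taken from the hrv data. The relative absolute error here compares against always predicting the mean of the actual values; Weka's exact baseline may differ, so treat that as an assumption.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors, ignoring their sign."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Like MAE, but squaring first makes large errors count for more."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def relative_absolute_error(actual, predicted):
    """Total absolute error as a percentage of a predict-the-mean baseline."""
    mean_actual = sum(actual) / len(actual)
    baseline = sum(abs(a - mean_actual) for a in actual)
    model = sum(abs(p - a) for a, p in zip(actual, predicted))
    return 100 * model / baseline


# Hypothetical values, side by side like the prediction output.
actual = [70, 72, 68, 74]
predicted = [71, 70, 68, 77]

print(mean_absolute_error(actual, predicted))      # 1.5
print(relative_absolute_error(actual, predicted))  # 75.0
print(root_mean_squared_error(actual, predicted))
```

A relative absolute error below 100% means the model does better than simply guessing the average every time.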

Finally, to visualise the neural network (right click the name), I turned the GUI option on and then
re-ran it via Start.

I thought this visual was pretty self-explanatory, and you can see why they call it a neural network.
At the bottom (Epoch etc.) are the training counters: an epoch is one pass through the training data,
and with more epochs (click Start) this algorithm's error rate decreases over time (Error per Epoch got
smaller). So I can plug in 50,000 epochs, for example, and the training error should keep shrinking,
though a lower training error does not guarantee better predictions on new data.
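The shrinking Error per Epoch can be reproduced with a toy example. The sketch below is not MultilayerPerceptron: it trains a single linear neuron by gradient descent on made-up data (y = 2x + 1), and the function name, learning rate and data are all assumptions for illustration. It only shows the general idea that each pass through the data nudges the weights and the average error tends to fall.

```python
def train_linear_neuron(xs, ys, epochs, learning_rate=0.01):
    """Fit y ~ weight * x + bias, recording the mean squared error of each epoch."""
    weight, bias = 0.0, 0.0
    error_per_epoch = []
    for _ in range(epochs):
        squared_error = 0.0
        for x, y in zip(xs, ys):
            error = (weight * x + bias) - y
            # Nudge the parameters against the error gradient.
            weight -= learning_rate * error * x
            bias -= learning_rate * error
            squared_error += error ** 2
        error_per_epoch.append(squared_error / len(xs))
    return weight, bias, error_per_epoch


# Hypothetical training data following y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

weight, bias, history = train_linear_neuron(xs, ys, epochs=500)
print(history[0], history[-1])  # error in the first epoch vs the last epoch
```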
Last, I looked at one more algorithm in WEKA, the Support Vector Machine. Go to
classifiers > functions and you will see SMO and SMOreg; both are based on support vector machines, as
you can see if you look at their properties. Even though both are based on support vector
machines, they are different algorithms; you can see this by clicking the name at the top after choosing one.

Let’s choose SMOreg and try to predict a nominal variable. Here it is the reverse of before: this
algorithm won’t accept a nominal class, but it will accept a numeric class. Another introductory lesson:
this doesn’t mean that a nominal class can never work with SMOreg. I may be able to edit the nominal
class just by opening the data file, representing the data differently, and saving the variable this
time as a numeric class, for example. Or I may want to run one of the many filters in WEKA. As a data
miner you spend a lot of time changing data in ways that can make an algorithm work better.

Those are some examples, again, of what’s called data pre-processing, and data pre-processing is a big
part of using WEKA. Now let’s use the training set as is, without any cross-validation, and to see what
it looks like let’s check the prediction output as we did before. Take note that the original order is
intact and left as is. Now let’s run 10-fold cross-validation.
Here there are 10 consecutive runs numbered 1-8, 1-8, 1-8 and so on, for my 80 instances. Let’s compare
their prediction accuracy.

With 10-fold cross-validation the mean absolute error is 2.75% and the root mean squared error is 3.5%.
How does that compare to the test without cross-validation?

Click it in the result list. There is less error, but now I saw why it is less valid as a predictive
model: the model was tested on the same instances it was trained on.
02 Exploring the Explorer

The graphs shown at the bottom right are like histograms of the attribute values, coloured by the
attribute we are trying to predict. Blue corresponds to positive and red to negative.

03 Exploring datasets

By default, the last attribute in WEKA is taken to be the class value. You can change it if you like,
so you can decide to predict a different attribute.

These attributes or features can be discrete (“nominal”) or continuous (“numeric”). What
we looked at in the weather data were discrete, or what we call nominal, attribute values, where they
belong to a certain fixed set. Or they can be numeric.

Also, the class can be discrete (“classification”) or continuous (“regression”). We
are looking at a discrete class, as it can be “yes” or “no”. Another type of machine learning
problem can involve continuous classes, where you are trying to predict a number, which is called
a “regression problem” in the trade.

Opened weather.numeric.arff: same as before, but “temperature” and “humidity” are
now numeric attributes where before they were nominal attributes.

Opened glass.arff

04 Building a classifier

Hi! You probably learned a bit about flowers if you did the activity associated with the last lesson.
Now, we're going to actually build a classifier: Lesson 1.4 Building a classifier. We're going to use a
system called J48 (I'll tell you why it's called J48 in a minute) to analyse the glass dataset that we
looked at in the last lesson.

I've got the glass dataset open here. I'm going to go to the Classify panel, and I choose a classifier here.
There are different kinds of classifiers, Weka has bayes classifiers, functions classifiers, lazy
classifiers, meta classifiers, and so on.
We're going to use a tree classifier, J48 is a tree classifier, so I'm going to open trees and click J48.
Here is the J48 classifier. Let's run it. If we just press start, we've got the dataset, we've got the
classifier, and lo and behold, it's done it. It's a bit of an anti-climax, really. Weka makes things very
easy for you to do. The problem is understanding what it is that you have done, let's take a look.
Here is some information about the dataset (the glass dataset): the number of instances and attributes.

Then it's printed out a representation of a tree here, we'll look at these trees later on, but just note
that this tree has got 30 leaves and 59 nodes altogether, and the overall accuracy is 66.8%. So, it's
done pretty well.

Down at the bottom, we've got a confusion matrix. Remember there were about seven different
kinds of glass. The first row is building windows made of float glass: you can see that 50 of these
have been classified as 'a', which is correctly classified, 15 of them have been classified as ‘b’,
which is building windows of non-float glass, so those are errors, and 3 have been classified as 'c',
and so on. This is a confusion matrix. Most of the weight is down the main diagonal, which we like to
see because that indicates correct classifications. Everything off the main diagonal indicates a
misclassification. That's the confusion matrix.
Let's investigate this a bit further. We're going to open a configuration panel for J48. Remember I
chose it by clicking the Choose button. Now, if I click it here (the name), I get a configuration panel. I
clicked J48 in this menu, and I get a configuration panel, which gives a bunch of parameters. I'm not
going to really talk about these parameters. Let's just look at one of them, the unpruned parameter,
which by default is false. What we've just done is to build a pruned tree, because unpruned is False.
We can change this to make it True and build an unpruned tree. We've changed the configuration.

We can run it again. It just ran again, and now we have a potentially different result. Let's just have a
look. We have got 67% correct classification and what did we have before? These are the previous
runs, this is the previous run, and there we had 66% and now, in this run that we've just done with
the unpruned tree, we've got 67% accuracy, and the tree is the same size.

So that's one option, I'm just going to look at another option, and then we'll look at some trees. I'm
going to click the configuration panel again, and I'm going to change the minNumObj parameter.
What is that? That is the minimum number of instances per leaf, I'm going to change that from 2 up
to 15 to have larger leaves.

These are the leaves of the tree here, and these numbers in brackets are the number of instances
that get to the leaf. When there are 2 numbers, this means that 1 incorrectly classified instance got
to this leaf and 5 correctly classified instances got there.

You can see that all of these leaves are pretty small, sometimes with just 2 or 3 instances, and here
is one with 31 instances. We've now constrained this number: when the tree is generated, it is
always going to be 15 or more.

Let’s run it again, now we've got a worse result, 61% correct classification.
But a much smaller tree, with only 8 leaves. Now, we can visualize this tree. If I right click on the
line—these are the lines that describe each of the runs that we've done, and this is the third run—if I
right click on that, I get a little menu, and I can visualize the tree.

There it is. If I right click on empty space, I can fit this to the screen. This is the decision tree. It
says: first look at the Barium (Ba) content; if it's large, then it must be headlamps. If it's small, then
look at Magnesium (Mg). If that's small, then let's look at potassium (K), and if that's small, then we've got

That sounds like a pretty good thing to me; I don't want too much potassium in my tableware. This is
a visualization of the tree and it's the same tree that you can see by looking above in log. This is a
different representation of the same tree.

I'll just show you one more thing about this configuration panel, which is the More button.

This gives you more information about the classifier, about J48. It's always useful to look at that to
see where these classifiers have come from. In this case, let me explain why it's called J48: it’s based
on a famous system called C4.5, which was described in a book. The book is referenced here.
In fact, I think I've got it on my shelf here. This book here, "C4.5: Programs for Machine Learning", is
by an Australian computer scientist called Ross Quinlan. He started out with a system called ID3 (I
think that might have been in his PhD thesis), which morphed through various versions into C4.5.
It became famous; the book came out, and so on. He continued to work on this system, it went up to
C4.8, and then he went commercial. Up until then, these were all open source systems.
When we built Weka, we took the latest version of C4.5, which was C4.8, and we rewrote it. Weka's
written in Java, so we called it J48. Maybe it's not a very good name, but that's the name that stuck.
There's a little bit of history for you.

We've talked about classifiers in Weka, I've shown you where you find the classifiers, we classified
the glass dataset. We looked at how to interpret the output from J48, in particular the confusion
matrix. We looked at the configuration panel for J48, we looked at a couple of options: pruned
versus unpruned trees and the option to avoid small leaves. I told you how J48 really corresponds to
the machine learning system that most people know as C4.5. C4.5 and C4.8 were really pretty
similar, so we just talk about J48 as if it's synonymous with C4.5. You can read about this in the
book— Section 11.1 about Building a decision tree and Examining the output. Now, off you go, and
do the activity associated with this lesson.

05 Using a filter


Weka includes many filters that can be used before invoking a classifier to clean up the
dataset or alter it in some way. Filters help with data preparation. For example, you can easily
remove an attribute, or you can remove all instances that have a certain value for an attribute (e.g.
instances for which humidity has the value high). Surprisingly, removing attributes sometimes
leads to better classification, and also to simpler decision trees.

Hello! In the last lesson, we looked at using a classifier in Weka, J48.

In this lesson, we’re going to look at another of Weka’s principal features: filters. One of the
main messages of this course is that it’s really important when you’re data mining to get close to
your data, and to think about pre-processing it, or filtering it in some way, before applying a
classifier. I’m going to start by using a filter to remove an attribute from the weather
data. Let me start up the Weka Explorer and open the weather data.

I’m going to remove the “humidity” attribute: that’s attribute number 3. I can look at filters; just
like we chose classifiers using this Choose button on the Classify panel, we choose filters by
using the Choose button here. There are a lot of different filters. AllFilter and MultiFilter
are ways of combining filters. We have supervised and unsupervised filters. Supervised
filters are ones that use the class value for their operation. They aren’t so common as
unsupervised filters, which don’t use the class value. There are attribute filters and
instance filters. We want to remove an attribute. So, we’re looking for an attribute
filter, there are so many filters in Weka that you just have to learn to look around and find what
you want.

I’m going to look for removing an attribute. Here we go, “Remove”. Now, before, when we
configured the J48 classifier, we clicked here. I’m going to click here, and we can configure
the filter. This is “A filter that removes a range of attributes from the
dataset”. I can specify a range of attributes here. I just want to remove one. I think it was
attribute number 3 we were going to remove. I can “invert the selection” and remove
all the other attributes and leave 3, but I’m just going to leave it like that. Click OK, and watch
“humidity” go when we apply the filter. Nothing happens until you click Apply to apply
the filter.

I’ve just applied it, and here we are, the “humidity” attribute has been removed. Luckily I can
undo the effect of that and put it back by pressing the “Undo” button. That’s how to remove an
attribute.

Actually, the bad news is there is a much easier way to remove an attribute: you don’t need to
use a filter at all. If you just want to remove an attribute, you can select it here and click the
“Remove” button at the bottom. It does the same job. Sorry about that. But filters are really useful
and can do much more complex things than that. Let’s, for example, imagine removing, not an
attribute, but let’s remove all instances where humidity has the value “high”. That is,
attribute number 3 has this first value. That’s going to remove 7 instances from the dataset. There
are 14 instances altogether, so we’re going to get left with a reduced dataset of 7 instances.

Let’s look for a filter to do that. We want to remove instances, so it’s going to be an instance
filter. I just have to look down here and see if there is anything suitable, how about
RemoveWithValues? I can click that to configure it, and I can click “More” to see what it does.
Here it says it “Filters instances according to the value of an attribute”,
which is exactly what we want. We’re going to set the “attributeIndex”; we want the third
attribute (humidity) so type 3, and the “nominalIndices” to 1 the first value. We can remove
a number of different values; we’ll just remove the first value. Now we’ve configured that. Nothing
happens until we apply the filter.

Watch what happens when we apply it. We still have the “humidity” attribute there, but we have
zero elements with high humidity. In fact, the dataset has been reduced to only 7 instances. Recall
that when you do anything here, you can save the results. So, we could save that reduced dataset if
we wanted, but I don’t want to do that now. I’m going to undo this.
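What the two filters do to the data can be mimicked outside Weka. The Python sketch below is an illustration only and does not use Weka's API: the function names and parameters are made up to echo the Remove and RemoveWithValues options, indices are 1-based like the attribute numbers in the Explorer, and the rows are the 14 instances of the nominal weather data.

```python
# Columns: outlook, temperature, humidity, windy, play
WEATHER = [
    ("sunny", "hot", "high", "FALSE", "no"),
    ("sunny", "hot", "high", "TRUE", "no"),
    ("overcast", "hot", "high", "FALSE", "yes"),
    ("rainy", "mild", "high", "FALSE", "yes"),
    ("rainy", "cool", "normal", "FALSE", "yes"),
    ("rainy", "cool", "normal", "TRUE", "no"),
    ("overcast", "cool", "normal", "TRUE", "yes"),
    ("sunny", "mild", "high", "FALSE", "no"),
    ("sunny", "cool", "normal", "FALSE", "yes"),
    ("rainy", "mild", "normal", "FALSE", "yes"),
    ("sunny", "mild", "normal", "TRUE", "yes"),
    ("overcast", "mild", "high", "TRUE", "yes"),
    ("overcast", "hot", "normal", "FALSE", "yes"),
    ("rainy", "mild", "high", "TRUE", "no"),
]
HUMIDITY_VALUES = ["high", "normal"]  # declared order: "high" is the first value


def remove_attributes(rows, indices, invert=False):
    """Drop the listed 1-based columns, or keep only those when invert is True."""
    listed = set(indices)
    keep = [i for i in range(1, len(rows[0]) + 1) if (i in listed) == invert]
    return [tuple(row[i - 1] for i in keep) for row in rows]


def remove_with_values(rows, attribute_index, value_order, nominal_indices):
    """Drop every row whose attribute takes one of the listed 1-based values."""
    targets = {value_order[i - 1] for i in nominal_indices}
    return [row for row in rows if row[attribute_index - 1] not in targets]


without_humidity = remove_attributes(WEATHER, [3])
not_high = remove_with_values(WEATHER, 3, HUMIDITY_VALUES, [1])
print(len(without_humidity[0]), len(not_high))  # 4 7
```

As in the Explorer, the attribute filter changes the number of columns while the instance filter changes the number of rows.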

We removed the instances where humidity is high. We have to think about, when we’re looking
for filters, whether we want a supervised or an unsupervised filter, whether we want an
attribute filter or an instance filter, and then just use your common sense to look
down the list of filters to see which one you want.

Sometimes when you filter data you get much better classification. Here’s a really simple
example. I’m going to open the “glass.arff” dataset that we saw before. Here’s the glass
dataset. I’m going to use J48, which we did before. It’s a tree classifier.

I’m going to use Cross-validation with 10 Folds, start that, and I get an accuracy of
66.8%. Let’s remove Fe; when we do, we get a smaller dataset. Go and run J48 again.
Now we get an accuracy of 67.3%. So, we’ve improved the accuracy a little bit by removing that
attribute. Sometimes the effect is pretty dramatic. Actually, in this dataset, I’m going to remove
everything except the refractive index and Magnesium (Mg). I’m going to remove all of these
attributes and am left with a much smaller dataset with two attributes. Apply J48 again.

Now I’ve got an even better result, 68.7% accuracy. I can visualize that tree, of course (remember?)
by right-clicking and selecting “Visualize tree”, and have a look and see what it
means. It’s much easier to visualize trees when they are smaller. This is a good one to look at and
consider what the structure of this decision tree is.

That’s it for now. We’ve looked at filters in Weka; supervised versus unsupervised,
attribute versus instance filters. To find the right filter you need to look around. They can be
very powerful, and judiciously removing attributes can both improve performance and increase
comprehensibility. For more, look at section 11.2 on loading and filtering files.
06 Visualizing your data

Anyway, one of the constantly recurring themes in this course is the necessity to get close to your
data, look at it in every possible way. In this last lesson of the first class, we're going to look at
visualizing your data. This is what we're going to do, we're going to use the Visualize panel, I'm
going to open the iris dataset. You came across the iris dataset in one of the activities, I think.
I'm using it because it has numeric attributes, 4 numeric attributes: sepallength,
sepalwidth, petallength, petalwidth. The classes are the three kinds of iris flower:
Iris-setosa, Iris-versicolor, and Iris-virginica.

Let's go to the Visualize panel and visualize this data. There is a matrix of two-dimensional plots,
a five-by-five matrix of plots. I can select one of these plots; I'm going to be looking at a plot of
sepalwidth on the x-axis and petalwidth on the y-axis.

That's a plot of the data. The colors correspond to the three classes.
I can actually change the colors by clicking where it says Iris-s… at the bottom. If I don't like
those, I could select another color, but I'm going to leave them the way they are.

I can look at individual data points by clicking on them. This is talking about instance number 86
with a sepallength of 6, sepalwidth of 3.4, and so on. That's a versicolor, which is why
this spot is coloured red. We can look at individual instances.

We can change the x- and y-axis using the menus here.

Better still, if we click on this little set of bars here, which represent the attributes, the axes
change. If I click the 1st one, the x-axis will change to sepallength. Clicking on the 2nd one, the
x-axis is sepalwidth. Clicking on the 3rd, the x-axis is petallength, and so on.

If I right click, then it will change the y-axis to sepallength. So, I can quickly browse around
these different plots.

There is a Jitter slider. Sometimes points sit right on top of each other, and jitter just adds a
little bit of randomness to the x- and the y-coordinates. With a little bit of jitter on here, the
darker spots represent multiple instances.

The jitter function in the Visualize panel just adds artificial random noise to the coordinates of the
plotted points in order to spread the data out a bit (so that you can see points that might have been
obscured by others).
If I click on one of those, I can see that the point represents three separate instances, all of class
iris-setosa, and they all have the same values of petallength and sepalwidth, both of
which are being plotted on this graph. The sepalwidth and petallength are 3.0 and 1.4 for
each of the three instances.

If I click another one here, this one is two instances with very similar sepalwidths and
petallengths, both of the class versicolor. The jitter slider helps you distinguish
between points that are in fact very close together.
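The idea behind the Jitter slider can be sketched as follows. This is an illustration only, not Weka's plotting code: the function name `jitter`, the uniform noise, and the `amount` parameter are assumptions, and the three identical points stand in for the three setosa instances with sepalwidth 3.0 and petallength 1.4 mentioned above.

```python
import random


def jitter(points, amount, seed=None):
    """Return a copy of the points with small uniform noise added to x and y."""
    rng = random.Random(seed)  # seeded so the display is repeatable
    return [
        (x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
        for x, y in points
    ]


# Three instances that sit exactly on top of each other in the plot.
stacked = [(3.0, 1.4), (3.0, 1.4), (3.0, 1.4)]

spread = jitter(stacked, amount=0.05, seed=1)
print(len(set(stacked)), len(set(spread)))  # 1 3
```

The noise only affects where the points are drawn; the underlying attribute values are unchanged.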

Another thing we can do is select bits of this dataset. I'm going to choose Select Rectangle here.

If I draw a rectangle now, I can select these points. If I were to click the Submit button (top left
button) for this rectangle, then all other points would be excluded and just these points would appear
on the graph, with the axes re-scaled appropriately.

Here we go. I've submitted that rectangle, and you can see that there's just the red points and green
points there. I could save that if I wanted as a different dataset, or I could reset it and maybe try
another kind of selection like this, where I'm going to have some blue points, some red and some
green points and see what that looks like. This might be a way of cleaning up outliers in your data, by
selecting rectangles and saving the new dataset. That's visualizing the dataset itself. What about
visualizing the result of a classifier?
Let's get rid of this visualize panel and go back to the Preprocess panel. I'm going to use a
classifier. I'm going to use, guess what, J48. Let's find it under trees. I'm going to run it.

Then if I right click on this entry here in the log area, I can view classifier errors.

Here we've got the class plotted against the predicted class. The square boxes represent errors. If I
click on one of these, I can, of course, change the different axes if I want (box on right similar to
above). I can change the x-axis and the y-axis, but I'm going to go back to class and predicted class as
the axis.

If I click on one of these boxes, I can see where the errors are. There are two instances where the
predicted class is versicolor, and the actual class is virginica.
We can see these in the confusion matrix. The actual class is virginica (last row, 2nd value, 2),
and the predicted class is versicolor, that's 'b'. These two entries in the confusion matrix are
represented by these instances here (above, highlighted in red).

If I look at another point, say this one, here I've got one instance which is in fact a setosa
predicted to be a versicolor. That is this setosa (1st row, 2nd value in the confusion matrix)
predicted to be a versicolor. I can look at this plot and find out where the misclassifications are
actually occurring, the errors in the confusion matrix.
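Reading errors off a confusion matrix, as done above, amounts to listing the non-zero cells off the main diagonal. A minimal Python sketch: the helper name `misclassifications` is made up, and the matrix is illustrative. Only the two cells discussed in the text (1 setosa predicted as versicolor, 2 virginica predicted as versicolor) come from the lesson; the remaining counts are assumed values that add up to 50 instances per class.

```python
def misclassifications(confusion, labels):
    """List (actual, predicted, count) for every non-zero off-diagonal cell."""
    cells = []
    for i, row in enumerate(confusion):        # rows are the actual classes
        for j, count in enumerate(row):        # columns are the predicted classes
            if i != j and count > 0:
                cells.append((labels[i], labels[j], count))
    return cells


labels = ["setosa", "versicolor", "virginica"]
confusion = [
    [49, 1, 0],   # actual setosa: 1 predicted as versicolor (from the text)
    [0, 47, 3],   # actual versicolor: counts assumed for illustration
    [0, 2, 48],   # actual virginica: 2 predicted as versicolor (from the text)
]

for actual, predicted, count in misclassifications(confusion, labels):
    print(f"{count} x actual {actual} predicted as {predicted}")
```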

So, get down and dirty with your data and visualize it. You can do all sorts of things. You can clean it
up, detect outliers. You can look at the classification errors. For example, there’s a filter that
allows you to add the classifications as a new attribute.

Let's just go and have a look at that. I'm going to go and find a filter. We're going to add an
attribute. It's supervised because it uses a class. Add an attribute, and
AddClassification. Here I get to choose, in the configuration panel, the machine learning
scheme. I'm going to choose J48, of course, and I'm going to set outputClassification
to True. That's configured it, and I'm going to apply it. It will add a new attribute.

It's done it, and this attribute is the classification according to J48. Weka is very powerful. You can
do all sorts of things with classifiers and filters. That's the end of the first class. There's a
section of the book on Visualization, section 11.2.
07 Class 1 Questions

Some people were going straight to the YouTube videos, and if you just go straight to the YouTube videos you don't see the
activities. You should be seeing this picture when you go to the course. This is the website, and you need to go and look at
the course from here.

The 2nd edition of the book is fine; it is much the same as the 3rd.

The next thing is about the algorithms. People want to learn about the details of the algorithms and how they work. Are
you going to learn about those? Is there a MOOC class that goes into the algorithms provided by Weka, rather than the
mechanics of running it?

The answer is "yes": you will be learning something about these algorithms. I've put the syllabus here; it's on the course
webpage. You can see from this syllabus what we're going to be doing. We're going to be looking at, for example, the J48
algorithm for decision trees and pruning decision trees; the nearest neighbour algorithm for instance-based learning; and
linear regression; classification by regression. We'll look at quite a few algorithms in Classes 3 and 4. I'm not going to tell
you about the algorithms in gory detail, however: they can get quite tricky inside.

What I want to do is to communicate the overall way that they work -- the idea behind the algorithms -- rather than the
details. The book does give you full details of exactly how these algorithms work inside. We're not going to be able to cover
them in that much detail in the course, but we will be talking about how the algorithms work and what they do.

The next thing I want to talk about: someone asked about using Naive Bayes. How can we use the NaiveBayes classifier
algorithm on a dataset, and how can we test for particular data, whether it fits into particular classes? Let me go to Weka
here. We're going to be covering this in future lessons -- Lesson 3.3 on Naive Bayes and so on -- but I'll just show you. All of
this is very easy. If I go to Classify, and I want to run Naive Bayes, I just need to find NaiveBayes. I happen to know it's in the
bayes section, and I can run it here. Just like that. We've just run NaiveBayes. I'll be doing this more slowly and looking
more at the output in Lesson 3.3. A natural thing to ask is if you had a particular test instance, which way would Naive
Bayes classify it, or any other kind of classifier?

This is the weather data we're using here, and I've created a file and called it It's a standard ARFF file,
and I got it by editing the weather.nominal.arff file. You can see that I've just got one day here. I've got the same header as
for the regular weather file and just one day -- but I could have several days if I wanted. I've put a question mark for the
class, because I want to know what class is predicted for that. We'll be talking about this in Lesson 2.1 -- you're probably
doing it right now -- but we can use a "supplied test set". I'm going to set that one that I created, which I called, as my test set. I can run this, and it will evaluate it on the test set. On the "More options..." menu --
you'll be learning about this in Lesson 4.3 -- there's an "Output predictions" option, here.

If I now run it and look up here, I will find instance number 1, the actual class was "?" -- I showed you that, that was what
was in the ARFF file -- and the predicted class is "no". There's some other information. This is how I can find out what
predictions would be on new test data. Actually, there's nothing stopping me from setting as my test file the same as the
training file. I can use weather.nominal.arff as my test file, and run it again. Now, I can see these are the 14 instances in the
standard weather data. This is their actual class, this is the predicted class, predicted by, in this case, Naive Bayes. There's a
mark in this column whenever there's an error, whenever the actual class differs from the predicted class. Again, we get
that by, in the "More options..." menu, checking "Output predictions". We're going to talk about that in other lessons. I just
wanted to show you that it's very easy to do these things in Weka.

The final thing I just wanted to mention is, if you're configuring a classifier -- any classifier, or indeed any filter -- there are
these buttons at the bottom. There's an "Open" and "Save" button, as well as the OK button that we normally use. These
buttons are not about opening files in the Explorer, they're about saving configured classifiers. So, you could set
parameters here and save that configuration with a name and a file and then open it later on. We don't do that in this
course, so we never use these Open and Save buttons here in the GenericObjectEditor. This is the GenericObjectEditor that
I get by clicking a classifier or filter. Just ignore the Open and Save buttons here. They do not open ARFF files for you. That's
all I wanted to say. Carry on with Class 2.

08 Be a classifier

Hi! Welcome back to Data Mining with Weka. This is Class 2. In the first class, we downloaded Weka
and we looked around the Explorer and a few datasets; we used a classifier, the J48
classifier; we used a filter to remove attributes and to remove some instances; we
visualized some data—we visualized classification errors on a dataset; and along the way we looked
at a few datasets, the weather data, both the nominal and numeric version, the glass data,
and the iris dataset.

This class is all about evaluation. In Lesson 1.4, we built a classifier using J48. In this first
lesson of the second class, we're going to see what it's like to actually be a classifier ourselves.
Then, later on in subsequent lessons in this class, we're going to look at more about evaluation:
training and testing, baseline accuracy and cross-validation.

First of all, we're going to see what it's like to be a classifier. We're going to construct a
decision tree ourselves, interactively. I'm going to just open up Weka here. The Weka Explorer. I'm
going to load the segment-challenge.arff dataset.

Let's first of all look at the class. The class values are brickface, sky, foliage, cement,
window, path, and grass. It looks like this is kind of an image analysis dataset. When we look at
the attributes, we see things like the centroid of columns and rows, pixel counts, line
densities, means of intensities, and various other things. Saturation, hue, and the
class, as I said before, is different kinds of texture: bricks, sky, foliage, and so on.
That's the segment challenge dataset.

Now, I'm going to select the user classifier. The user classifier is a tree
classifier. We'll see what it does in just a minute. Before I start, this is really quite important.
I'm going to use a supplied test set. I'm going to set the test set, which is used to evaluate the
classifier to be segment-test.arff. The training set is segment-challenge, the test
set is segment-test. Now we're all set. I'm going to start the classifier.
What we see is a window with two panels: the Tree Visualizer and the Data Visualizer.

Let's start with the Data Visualizer.

We looked at visualization in the last class, how you can select different attributes for the x and y.
I'm going to plot the region-centroid-row against the intensity-mean.

That's the plot I get.

Now, we're going to select a class so I am going to select Rectangle. If I draw out with my mouse
a rectangle here, I'm going to have a rectangle that's pretty well pure reds, as far as I can see. I'm
going to submit this rectangle. You can see that that area has gone, and the picture has been
rescaled.

Now I'm building up a tree here. If I look at the Tree Visualizer, I've got a tree. We've split
on these two attributes, region-centroid-row and intensity-mean. Here we've got sky,
these are all sky classes. Here we've got a mixture of brickface, foliage, cement, window,
path, and grass. We're kind of going to build up this tree. What I want to do is to take this node
and refine it a bit more.

Here is the Data Visualizer again. I'm going to select a rectangle containing these items here
and submit that. They've gone from this picture.

You can see that here, I've created this split, another split on region-centroid-row and
intensity-mean, and here, this is almost all path, 233 path instances, and then a mixture here
(bottom right). This is a pure node we've got over there, (top left small), this is almost a pure node
(middle square), this is the one I want to work on (bottom right).

I'm going to cover some of those instances now. Let's take this lot here and submit that. Then I'm
going to take this lot here and submit that. Maybe I'll take those ones there and submit that. This
little cluster here seems pretty uniform. Submit that. I haven't actually changed the axes, but, of
course, at any time, I could change these axes to better separate the remaining classes. I could kind
of mess around with these.

Actually, a quick way to do it is to click here on these bars (as I did previously). Left click for x and
right click for y. I can quickly explore different pairs of axes to see if I can get a better split. Here's
the tree I've created.
It looks like this. You can see that we have successively elaborated down this branch here. When I
finish with this, I can “Accept The Tree” by right clicking.

Actually, before I do that, let me just show you that we were selecting rectangles here but I've got
other things I can select: a polygon or a polyline. If I don't want to use rectangles, I can use
polygons or polylines. If you like, you can experiment with those to select different shaped areas.
There's an area I've got selected I just can't quite finish it off. Alright, I right clicked to finish it off. I
could submit that. I'm not confined to rectangles; I can use different shapes. I'm not going to do
that. I'm satisfied with this tree for the moment. I'm going to accept the tree. Once I do this, there
is no going back, so you want to be sure. If I accept the tree, "Are you sure?" “Yes”.

Here, I've got a confusion matrix, and I can look at the errors. My tree classifies 78% of the
instances correctly, nearly 79% correctly, and 21% incorrectly.

That's not too bad, especially considering how quickly I built that tree. It's over to you now. I'd like
you to play around and see if you can do better than this by spending a little bit longer on getting a
nice tree.

I'd like you to reflect on a couple of things. First of all, what strategy you're using to build this tree.
Basically, we're covering different regions of the instance space, trying to get pure regions to create
pure branches. This is kind of like a bottom-up covering strategy. We cover this area and this area
and this area.
Actually, that's not how J48 works. When it builds its trees, it tries to do a judicious split through
the whole dataset. At the very top level, it'll split the entire dataset into two in a way that doesn't
necessarily separate out particular classes but makes it easier when it starts working on each half of
the dataset further splitting in a top-down manner in order to try and produce an optimal tree. It will
produce trees much better than the one that I just produced with the user classifier. I'd also like you
to reflect on what it is we're trying to do here.

Given enough time, you could produce a 'perfect' tree for the dataset, but don't forget that the
dataset that we've loaded is the training dataset. We're going to evaluate this tree on a different
dataset, the test dataset, which hopefully comes from the same source, but is not identical to the
training dataset. We're not trying to precisely fit the training dataset; we’re trying to fit it in a way
that generalizes the kinds of patterns exhibited in the dataset. We're looking for something that will
perform well on the test data. That highlights the importance of evaluation in machine learning.
That's what this class is going to be about, different ways of evaluating your classifier. That's it.
There's some information in the course text about the user classifier, section 11.2 “Do it
yourself: the User Classifier” which you can read if you like. Please go on and do the activity
associated with this lesson and produce your own classifier.

09 Training and testing

Hi! This is Lesson in Data Mining with Weka, and here we're going to look at training and testing in
a little bit more detail.

Here's a situation. We've got a machine learning algorithm, and we feed into it training data, and it
produces a classifier -- a basic machine learning situation. For that classifier, we can test it with
some independent test data. We can put that into the classifier and get some evaluation results,
and, separately, we can deploy the classifier in some real situation to make predictions on fresh data
coming from the environment.
It's really important in classification, when you're looking at your evaluation results, you only get
reliable evaluation results if the test data is different from the training data. That's what we're going
to look at in this lesson.

What if you only have one dataset? If you just have one dataset, you should divide it into two parts.
Maybe use some of it for training and some of it for testing. Perhaps, 2/3 of it for training and 1/3 of
it for testing. It's really important that the training data is different from the test data. Both training
and test sets are produced by independent sampling from an infinite population. That's the basic
scenario here, but they're different independent samples. It's not the same data. If it is the same
data, then your evaluation results are misleading. They don't reflect what you should actually expect
on new data when you deploy your classifier.
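As a rough illustration of dividing one dataset into disjoint training and test parts, here is a small Python sketch. It is not how Weka implements its percentage split; the function name `holdout_split` and the use of integers as stand-ins for instances are assumptions made for the example.

```python
import random

def holdout_split(instances, train_fraction=2 / 3, seed=1):
    """Shuffle a copy of the data and cut it into disjoint training and test parts."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Integers stand in for the instances of a 1500-instance dataset.
instances = list(range(1500))
train, test = holdout_split(instances)

# No instance appears in both parts, so the test data is genuinely unseen.
overlap = set(train) & set(test)
print(len(train), len(test), len(overlap))
```

The point of the check on `overlap` is the one made above: if any test instance had also been used for training, the evaluation would no longer reflect performance on new data.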

Here we're going to look at the segment dataset, which we used in the last lesson. I'm going to open
the segment-challenge.arff. I'm going to use a supplied test set. First of all, I'm going to use the J48
tree learner. I'm going to use a “Supplied test set”, and I will set it to the appropriate segment-
test.arff file, I'm going to open that. Now we've got a test set, and let's see how it does.

In the last lesson, on the same data with the user classifier, I think I got 79% accuracy. J48 does much
better; it gets 96% accuracy on the same test set. Suppose I was to evaluate it on the training set? I
can do that by just specifying under “Test options” “Use training set”. Now it will train it again and
evaluate it on the training set, which is not what you're supposed to do, because you get misleading
results.

Here, it's saying the accuracy is 99% on the training set. That is not representative of what we would
get using this on independent data. If we had just one dataset, if we didn't have a test dataset, we
could do a “Percentage split”. Here's a percentage split, this is going to be 66% training data and 34%
test data. That's going to make a random split of the dataset. If I run that, I get 95%, that's just about
the same as what we got when we had an independent test set, just slightly worse.
If I were to run it again, if we had a different split, we'd expect a slightly different result, but actually,
I get exactly the same results, 95%. That's because Weka, before it does a run, reinitializes the
random number generator. The reason is to make sure that you can get repeatable results; if it
didn't do that, then the results that you got would not be repeatable. However, if you wanted to
have a look at the differences that you might get on different runs, then there is a way of resetting
the random number between each run. We're going to look at that in the next lesson. That's this
lesson.

The basic assumption of machine learning is that the training and test sets are independently
sampled from an infinite population, the same population. If you have just one dataset, you should
hold part of it out for testing, maybe 33% as we just did or perhaps 10%. We would expect a slight
variation in results each time if we hold out a different set, but Weka produces the same results
each time by design by making sure it reinitializes the random number generator each time. We ran
J48 on the segment-challenge dataset. If you'd like, you can go and look at the course text on
Training and testing, Section 5.1 (Training and testing) and please go and do the activity associated
with this lesson.

10 Repeated training and testing

Hello again! In the last lesson, we looked at training and testing. We saw that we can evaluate a
classifier on an independent test set, or using a Percentage split with a certain percentage of
the dataset used to train and the rest used for testing, or -- and this is generally a very bad idea -- we
can evaluate it on the training set itself, which gives misleadingly optimistic performance figures. In
this lesson, we're going to look a little bit more at training and testing. In fact, what we're going to
do is repeatedly train and test using percentage split.

Now, in the last lesson, we saw that if you simply repeat the training and testing, you get the same
result each time because Weka initializes the random number generator before it does each run to
make sure that you know what's going on when you do the same experiment again tomorrow. But,
there is a way of overriding that.

So, we will be using independent random numbers on different occasions to produce a percentage
split of the dataset into a training and test set. I'm going to open the segment-
challenge.arff. That's what we used before. Notice there are 1500 instances here; that’s
quite a lot. I'm going to go to Classify. I'm going to choose J48, our standard method, I guess.
I’m going to use a percentage split, and because we've got 1500 instances, I'm going to
choose 90% for training and just 10% for testing. I reckon that 10% -- that's 150 instances -- for
testing is going to give us a reasonable estimate, and we might as well train on as many as we can to
get the most accurate classifier.
I'm going to run this, and the accuracy figure I get -- this is what I got in the last lesson -- is 96.6667%.
Now, this is misleadingly high accuracy here. I'm going to call that 96.7%, or 0.967, and then I'm
going to do it again and just see how much variation we get in that figure, initializing the random
number generator to a different value each time.

If I go to the More options menu, I get a number of options here which are quite useful:
outputting the model, we're doing that; outputting statistics; we can output
different evaluation measures; we're doing the confusion matrix; we're storing the
prediction for visualization; we can output the predictions if we want; we
can do a cost-sensitive evaluation; and we can set the random seed for cross-
validation or percentage split. That's set by default to 1. I'm going to change that to 2, a
different random seed.

We could also output the source code for the classifier if we wanted, but I just want to change
the random seed. Then I want to run it again.

Before we got 0.967, and this time we get 0.94, 94%, quite different, you see. If I were then to
change this again to, say, 3 and run it again, again I get 94%. If I change it again to 4 and run it again,
I get 96.7%. Let's do one more, change it to 5, run it again, and now I get 95.3%.
Here's a table with these figures in it. If we run it 10 times, we get this set of results. Given this set of
experimental results, we can calculate the mean and standard deviation. The sample
mean is the sum of all of these error figures (or these success rates, I should say) divided by the
number of them, 10. That's 0.949, about 95%; that's really what we would expect to get. That's a
better estimate than the 96.7% that we started out with, a more reliable estimate. We can calculate
the sample variance: we take the deviation from the mean (we subtract the mean from each
of these numbers), we square that, add them up, and we divide, not by n, but by n-1. That might
surprise you, perhaps. The reason for it being n-1 is that we've actually calculated the mean
from this sample. When the mean is calculated from the sample, you need to divide by n-1, leading
to a slightly larger variance estimate than if you were to divide by n.
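The calculation just described can be written out in Python as below. Note the assumption: only the five success rates quoted earlier (seeds 1 to 5) are used, because the full ten-run table is not reproduced in these notes, so the numbers this prints will not match the ten-run mean of 0.949.

```python
import math
import statistics

# Success rates for seeds 1..5 as quoted in the text (not the full 10-run table).
rates = [0.967, 0.940, 0.940, 0.967, 0.953]
n = len(rates)

# Sample mean: sum of the values divided by how many there are.
mean = sum(rates) / n

# Sample variance: squared deviations from the mean, divided by n-1 (not n).
variance = sum((x - mean) ** 2 for x in rates) / (n - 1)
std_dev = math.sqrt(variance)

# statistics.stdev also divides by n-1, so it should agree with the manual result.
library_std_dev = statistics.stdev(rates)

print(round(mean, 4), round(std_dev, 4), round(library_std_dev, 4))
```

Dividing by n instead (what `statistics.pstdev` does) would give a slightly smaller figure, which is the difference the n-1 remark is pointing at.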

We take the square root of that, and in this case, we get a standard deviation of 1.8%. Now you can
see that the real performance of J48 on the segment-challenge dataset is approximately 95%
accuracy, plus or minus approximately 2%. Anywhere, let's say, between 93-97% accuracy. These
figures that you get, that Weka puts out for you, are misleading. You need to be careful how you
interpret them, because the result is certainly not 95.333%. There's a lot of variation on all of these
figures.

Remember, the basic assumption is the training and test sets are sampled independently from an
infinite population, and you should expect a slight variation in results perhaps more than just a slight
variation in results. You can estimate the variation in results by setting the random-number
seed and repeating the experiment. You can calculate the mean and the standard deviation
experimentally, which is what we just did. Off you go now, and do the activity associated with this
lesson. I'll see you in the next lesson.

11 Baseline accuracy

Hello again! In this lesson we're going to look at an important new concept called baseline
accuracy. We're going to actually use a new dataset, the diabetes dataset so open
diabetes.arff. Have a quick look at this dataset.
The class is tested_negative or tested_positive for diabetes. We've got attributes like
preg, which I think has to do with the number of times they've been pregnant; age, which is the
age. Of course, we can learn more about this dataset by looking at the ARFF file itself.

Here is the diabetes dataset. You can see it's diabetes in Pima Indians, and there's a lot of information
here. The attributes: number of times pregnant, plasma glucose concentration, and so
on. Diabetes pedigree function.

I'm going to use percentage split, I'm going to try a few different classifiers. Let's look at J48
first, our old friend J48.

We get 76% with J48. I'm going to look at some other classifiers. You'll learn about these classifiers
later on in this course, but right now we're just going to look at a few. Look at the NaiveBayes
classifier in the bayes category and run that.
Here we get 77%, a little bit better, but probably not significant.

Let's choose in the lazy category IBk, again, we'll learn about this later on.

Here we get 73%, quite a bit worse.

We'll use one final one, PART, for partial rules, in the rules category.

Here we get 74%. We'll learn about these classifiers later, but they are just different classifiers,
alternative to J48.

You can see that J48 and NaiveBayes are pretty good, probably about the same, the 1%
difference between them probably isn't significant. IBk and PART are probably about the same
performance, again, 1% between them. There is a fair gap, I guess, between those bottom two and
the top two, which probably is significant.

I'd like you to think about these figures. Is 76% accuracy good? If we go back and look at
this dataset:
For the class, we see that there are 500 negative instances and 268 positive
instances. If you had to guess, you'd guess it would be negative, and you'd be right 500/768 of the time
(768 being the sum of these two things, the total number of instances). If you always guess
[negative], you'd be right that fraction of the time, and 500/768 works out to 65%.

Actually, there's a rules classifier called ZeroR, which does exactly that. The ZeroR classifier just
looks for the most popular class and guesses that all the time. If I run this on the training set:

That will give us the exact same number, 500/768 which is 65%. It's a very, very simple, kind of
trivial classifier, that always just guesses the most popular class. It's ok to evaluate that on the
training set, because it's hardly using the training set at all to form the classifier.
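The baseline idea is simple enough to sketch in Python. This is only an illustration of "always guess the most popular class", not Weka's ZeroR source; the function name `zero_r` is made up, and the class counts are the 500 and 268 quoted above.

```python
from collections import Counter

def zero_r(train_labels):
    """Return the most frequent class; the 'classifier' predicts it for everything."""
    return Counter(train_labels).most_common(1)[0][0]

# Class distribution of the diabetes data as described in the text.
labels = ["tested_negative"] * 500 + ["tested_positive"] * 268

majority = zero_r(labels)

# Accuracy of always guessing the majority class, evaluated on the same data.
accuracy = sum(1 for y in labels if y == majority) / len(labels)

print(majority, round(accuracy * 100, 1))
```

Because the prediction depends only on the class counts and ignores every attribute, evaluating it on the training set does not flatter it the way it would flatter J48.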

That's what we would call the baseline. The baseline gives 65% accuracy, and J48 gives 76%
accuracy, significantly above the baseline, but not all that much above the baseline.

It's always good when you're looking at these figures to consider what the very simplest kind of
classifier, the baseline classifier, would get you. Sometimes, the baseline might give you the best
results.

I'm going to open a dataset here; we're not going to discuss this dataset. It's a bit of a strange
dataset, not really designed for this kind of classification. It's called supermarket. I'm going to
open supermarket, and without even looking at it, I'm just going to apply a few schemes here.
I'm going to apply ZeroR, and I get 64%. I'm going to apply J48 and I think I'll use a percentage
split for evaluation because it's not fair to use the training set here. Now I get 63%, that's worse than
the baseline. If I try NaiveBayes (these are the ones I tried before), I get again 63%, worse than
the baseline. If I choose IBk, this is going to take a little while here; it's a rather slow scheme. Here
we are; it's finished now, only 38%, that is way, way worse than the baseline. We'll just try PART,
partial decision rules, here we get 63%.

The upshot is that the baseline actually gave a better performance than any of these classifiers and
one of them was really atrocious compared with the baseline. This is because, for this dataset, the
attributes are not really informative.

The rule here is, don't just apply Weka to a dataset blindly. You need to understand what's going on.
When you do apply Weka to a dataset, always make sure that you try the baseline classifier ZeroR,
before doing anything else. In general, simplicity is best. Always try simple classifiers before you try
more complicated ones.

Also, you should consider, when you get these small differences whether the differences are likely to
be significant. We saw these 1% differences in the last lesson that were probably not at all
significant. You should always try a simple baseline, you should look at the dataset. We shouldn't
blindly apply Weka to a dataset; we should try to understand what's going on. That's this lesson. Off
you go and do the activity associated with this lesson, and I'll see you soon!

12 Cross validation

I want to introduce you to the standard way of evaluating the performance of a machine learning
algorithm, which is called cross-validation. A couple of lessons back, we looked at evaluating on an
independent test set, and we also talked about evaluating on the training set (don't do that). We
also talked about evaluating using the holdout method by taking the one dataset and holding out a
little bit for testing and using the rest for training.

There is a fourth option on Weka's Classify panel, which is called cross-validation, and that's what
we're going to talk about here. Cross-validation is a way of improving upon repeated holdout. We
tried using the holdout method with different random-number seeds each time. That's called
repeated holdout. Cross-validation is a systematic way of doing repeated holdout that actually
improves upon it by reducing the variance of the estimate.

We take a training set and we create a classifier. Then we're looking to evaluate the performance of
that classifier, and there is a certain amount of variance in that evaluation, because it's all statistical
underneath. We want to keep the variance in the estimation as low as possible.

Cross-validation is a way of reducing the variance, and a variant on cross-validation called stratified
cross-validation reduces it even further. I'm going to explain that in this class.

In a previous lesson, we held out 10% for the testing and we repeated that 10 times. That's the
repeated holdout method. We've got one dataset, and we divided it independently 10 separate
times into a training set and a test set. With cross-validation, we divide it just once, but we divide
into, say, 10 pieces. Then, we take 9 of the pieces and use them for training and the last piece we
use for testing. Then, with the same division, we take another 9 pieces and use them for training and
the held-out piece for testing.

We do the whole thing 10 times, using a different segment for testing each time. In other words, we
divide the dataset into 10 pieces, and then we hold out each of these pieces in turn for testing, train
on the rest, do the testing and average the 10 results. That would be 10-fold cross-validation.
Divide the dataset into 10 parts (these are called folds), hold out each part in turn and average the
results. So, each data point in the dataset is used once for testing and 9 times for training. That's 10-
fold cross-validation.
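The bookkeeping of 10-fold cross-validation can be checked with a short Python sketch. It only tracks which instances land in which fold; the helper `k_fold_indices` is hypothetical, and the training and scoring steps are left as a comment.

```python
import random
from collections import Counter

def k_fold_indices(n_instances, k=10, seed=1):
    """Shuffle the instance indices once, then deal them into k folds."""
    rng = random.Random(seed)
    indices = list(range(n_instances))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

folds = k_fold_indices(1500, k=10)

times_tested = Counter()
times_trained = Counter()
for held_out, test_fold in enumerate(folds):
    # Training data is every fold except the one currently held out.
    train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
    # A real run would train on `train` and score on `test_fold` here.
    times_tested.update(test_fold)
    times_trained.update(train)

print(set(times_tested.values()), set(times_trained.values()))
```

The two counters confirm the statement above: with a single division into 10 folds, every instance is tested exactly once and used for training exactly 9 times.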

Stratified cross-validation is a simple variant where, when we do the initial division into 10 parts, we
ensure that each fold has got approximately the correct proportion of each of the class values. Of
course, there are many different ways of dividing a dataset into 10 equal parts; we just make sure we
choose a division that has approximately the right representation of class values in each of the folds.
That's stratified cross-validation. It helps reduce the variance in the estimate a little bit more.
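One simple way to get folds with roughly the right class proportions is to deal out each class separately, as in this Python sketch. This is an assumed illustration of the idea, not Weka's actual stratification routine; `stratified_folds` is a made-up name, and the 500/268 class counts are those of the diabetes data from the baseline lesson.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=1):
    """Deal the instances of each class round-robin so every fold keeps the class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    position = 0  # keeps dealing where the previous class stopped, so sizes stay even
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        for index in members:
            folds[position % k].append(index)
            position += 1
    return folds

labels = ["tested_negative"] * 500 + ["tested_positive"] * 268
folds = stratified_folds(labels)

negatives_per_fold = [sum(1 for i in fold if labels[i] == "tested_negative") for fold in folds]
positives_per_fold = [len(fold) - neg for fold, neg in zip(folds, negatives_per_fold)]
print(negatives_per_fold, positives_per_fold)
```

Each fold ends up with about 65% negatives, matching the whole dataset, which is what keeps any single fold from being unusually easy or hard.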

Then, once we've done the cross-validation, what Weka does is run the algorithm an eleventh time
on the whole dataset. That will then produce a classifier that we might deploy in practice. We use
10-fold cross-validation in order to get an evaluation result and estimate of the error and then
finally, we do classification one more time to get an actual classifier to use in practice.
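The whole procedure, 10 evaluation runs plus a final run on all the data, might be sketched as follows. The majority-class learner is a stand-in so that the example is self-contained; the names `train_majority` and `cross_validate` are assumptions, the folds are taken without shuffling to keep the code short, and fold accuracies are simply averaged as described above.

```python
from collections import Counter

training_runs = []  # records the size of the training set for every learner call

def train_majority(labels):
    """Stand-in learner: the returned classifier always predicts the majority class."""
    training_runs.append(len(labels))
    majority = Counter(labels).most_common(1)[0][0]
    return lambda instance: majority

def cross_validate(labels, k=10):
    n = len(labels)
    folds = [list(range(i, n, k)) for i in range(k)]
    accuracies = []
    for test_fold in folds:
        held_out = set(test_fold)
        train_labels = [labels[i] for i in range(n) if i not in held_out]
        classify = train_majority(train_labels)
        correct = sum(1 for i in test_fold if classify(i) == labels[i])
        accuracies.append(correct / len(test_fold))
    # Final run on the whole dataset: this is the classifier you would deploy.
    final_classifier = train_majority(labels)
    return sum(accuracies) / k, final_classifier

labels = ["tested_negative"] * 500 + ["tested_positive"] * 268
estimate, model = cross_validate(labels)
print(len(training_runs), round(estimate, 3))
```

The accuracy estimate comes only from the 10 held-out folds; the final classifier is never evaluated, it is just the one built from the most data.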

That's what I wanted to tell you. Cross-validation is better than repeated holdout, and we'll look at
that in the next lesson. Stratified cross-validation is even better. Weka does stratified cross-validation
by default.

With 10-fold cross-validation, Weka invokes the learning algorithm 11 times, one for each fold of the
cross-validation and then a final time on the entire dataset. The practical rule of thumb is that if
you've got lots of data, you can use a percentage split and evaluate it just once. Otherwise, if you
don't have too much data, you should use stratified 10-fold cross-validation.

How big is lots? Well, this is what everyone asks. How long is a piece of string, you know? It's hard to
say, but it depends on a few things. It depends on the number of classes in your dataset. If you've
got a two-class dataset, then if you had, say, 100-1000 data points, that would probably be good
enough for a pretty reliable evaluation if you did a 90% and 10% split into training and test sets. If
you had, say, 10,000 data points in a two-class problem, then I think you'd have lots and lots of data;
you wouldn't need to go to cross-validation. If, on the other hand, you had 100 different classes,
then that's different, right? You would need a larger dataset, because you want a fair representation
of each class when you do the evaluation.

It's really hard to say exactly; it depends on the circumstances. If you've got thousands and
thousands of data points, you might just do things once with a holdout. If you've got less than a
thousand data points, even with a two-class problem, then you might as well do 10-fold cross-
validation. It really doesn't take much longer. Well, it takes 10-times as long, but the times are
generally pretty short. You can read more about this in Section 5.3 (cross-validation). Now it's time
for you to go and do the activity associated with this [lesson].

13 Cross validation results

We're here to talk about Lesson 2.6, which is about cross-validation results. We learned
about cross-validation in the last lesson.

I said that cross-validation was a better way of evaluating your machine learning algorithm,
evaluating your classifier, than repeated holdout, repeating the holdout method. Cross-
validation does things 10 times. You can use holdout to do things 10 times, but cross-
validation is a better way of doing things.

Let's just do a little experiment here. I'm going to start up Weka and open the diabetes dataset. The
baseline accuracy, which ZeroR gives me (that's the default classifier, by the way,
rules>ZeroR): if I just run that, well, it will evaluate it using cross-validation. Actually, for a
true baseline, I should just use the training set.
That'll just look at the chances of getting a correct result if we simply guess the most likely class, in
this case 65.1%. That's the baseline accuracy. That's the first thing you should do with any dataset.
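
A minimal Python sketch of what a ZeroR-style baseline computes: always predict the most frequent class and report how often that is right. The function name `zero_r` is hypothetical, and the labels are illustrative only (9 "yes" and 5 "no", the class counts of the weather data), not the diabetes file.

```python
from collections import Counter

def zero_r(labels):
    """Return the majority class and the fraction of instances it covers."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    return majority_class, majority_count / len(labels)

# Illustrative labels only: 9 "yes" and 5 "no".
labels = ["yes"] * 9 + ["no"] * 5
baseline_class, baseline_accuracy = zero_r(labels)
print(baseline_class, f"{baseline_accuracy:.1%}")
```

Any classifier worth using should beat this figure on the same data.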

Then we're going to look at J48, which is down here under trees. There it is. I'm going to evaluate
it with 10-fold cross-validation.

It takes just a second to do that. I get a result of 73.8%, and we can change the random-number
seed like we did before. The default is 1; let's put a random-number seed of 2.

Run it again. I get 75%.

Do it again. Change it to, say, 3; I can choose anything I want, of course. Run it again, and I get
75.5%.

These are the numbers I get on this slide with 10 different random-number seeds. Those are
the same numbers on this slide in the right-hand column, the 10 values I got: 73.8%, 75.0%,
75.5%, and so on.
I can calculate the mean, which for that right-hand column is 74.5%, and the sample standard
deviation, which is 0.9%, using just the same formulas that we used before. Before, we used these
formulas for the holdout method, where we repeated the holdout 10 times. These are the results you
get on this dataset if you repeat holdout, that is, using 90% for training and 10% for testing, which
is, of course, what we're doing with 10-fold cross-validation.
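
The mean and sample standard deviation mentioned above can be sketched in Python as follows. The helper name `mean_and_sample_std` is hypothetical, and the list holds only the three accuracies quoted in the text; the other seven runs are not reproduced here, so the result will not match the 74.5% and 0.9% reported for all 10 runs.

```python
import math

def mean_and_sample_std(values):
    """Mean and sample standard deviation (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance)

# Only the three accuracies quoted in the text; the full column has 10 values.
accuracies = [73.8, 75.0, 75.5]
mean, std = mean_and_sample_std(accuracies)
print(f"mean = {mean:.1f}%, sample std = {std:.1f}%")
```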

I would get those results there, and if I average those, I get a mean of 74.8%, which is satisfactorily
close to 74.5%, but I get a larger standard deviation, quite a lot larger standard
deviation of 4.6%, as opposed to 0.9% with cross-validation.

Now, you might be asking yourself why use 10-fold cross-validation. With Weka we can
use 20-fold cross-validation or anything; we just set the number of folds here beside the cross-
validation box to whatever we want. So, we can use 20-fold cross-validation.
What that would do would be to divide the dataset into 20 equal parts and repeat 20 times: take
one part out, train on the other 95% of the dataset, and test on the part taken out. Then do it a
21st time on the whole dataset.
So, why 10, why not 20? Well, that's a good question really, and there's not a very good answer. We
want to use quite a lot of data for training, because, in the final analysis, we're going to use the
entire dataset for training.
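
A minimal Python sketch of how a dataset of a given size could be divided into k folds. The function name `k_fold_splits` is hypothetical, and the sketch leaves out shuffling and stratification, which a real evaluation would normally include; it only shows which instances are held out in each repetition.

```python
def k_fold_splits(n_instances, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_instances))
    # Fold sizes differ by at most one when n_instances is not divisible by k.
    fold_sizes = [n_instances // k + (1 if i < n_instances % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

# With 100 instances and 20 folds, each repetition trains on 95 and tests on 5.
splits = list(k_fold_splits(100, 20))
for train, test in splits[:2]:
    print(len(train), len(test))
```

With k = 10 the same function trains on 90% each time, which is the default discussed here.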

If we're using 10-fold cross-validation, then we're using 90% of the dataset. Maybe it
would be a little better to use 95% of the dataset for training with 20-fold cross-
validation. On the other hand, we want to make sure that what we evaluate on is a valid
statistical sample. So, in general, it's not necessarily a good idea to use a large number of folds with
cross-validation. Also, of course, 20-fold cross-validation will take twice as long
as 10-fold cross-validation. The upshot is that there isn't a really good answer to this
question, but the standard thing to do is to use 10-fold cross-validation, and that's why
it's Weka's default.
We've shown in this lesson that cross-validation really is better than repeated holdout.
Remember, on the last slide, we found that we got about the same mean for repeated holdout as for
cross-validation, but we got a much smaller variance for cross-validation.

We know that when we evaluate this machine learning method, J48, on this dataset, diabetes, we get
74.5% accuracy, probably somewhere between 73.5% and 75.5%. That is actually substantially
larger than the baseline. So, J48 is doing something for us better than the baseline. Cross-
validation reduces the variance of the estimate. That's the end of this class. Off you go and do
the activity. I'll see you at the next class.

14 Class 2 Questions

Hi! Well, Class 2 has gone flying by, and here are some things I'd like to discuss. First of all, we made
some mistakes in the answers to the activities. Sorry about that. We've corrected them. Secondly, a
general point, some people have been asking questions, for example, about huge datasets. How big
a dataset can Weka deal with? The answer is pretty big, actually. But it depends on what you do, and
it's a fairly complicated question to discuss. If it's not big enough, there are ways of improving things.

Anyway, issues like that should be discussed on the Weka mailing list, or you should look in the
Weka FAQ, where there's quite a lot of discussion on this particular issue. The Weka API: the
programming interface to Weka. You can incorporate the Weka routines in your program. It's
wonderful stuff, but it's not covered in this MOOC. So, the right place to discuss those issues is the
Weka mailing list. Finally, personal emails to me. You know, there are 5,000 people on this MOOC,
and I can't cope with personal emails, so please send them to the mailing list and not to me.

I'd like to discuss the issues of numeric precision in Weka. Weka prints percentages to 4 decimal
places; it prints most numbers to 4 decimal places. That’s misleadingly high precision; don't take
these at face value. For example, here we've done an experiment using a 40% percentage split, and
we get 92.3333% accuracy printed out. Well, that's the exact right answer to the wrong question;
we're not interested in the performance on this particular test set. What we're interested in is how
Weka will do in general on data from this source. We certainly can't infer that to be this percentage
to 4 decimal place accuracy.

In Class 2, we're trying to sensitize you to the fact that these figures aren't to be taken at face value.
For example, there we are with a 40% split. If we do a 30% split, we get not 92.3333% but
92.381%. The difference between these two numbers is completely insignificant. You shouldn't be
saying this is better than the other number. They are both the same, really, within the amount of
statistical fuzz that's involved in the experiment.

We're trying to train you to write your answers to the nearest percentage point, or perhaps 1
decimal place. Those are the answers that are being accepted as correct. The reason we're doing
that is to try to train you to think about these numbers and what they really represent, rather than
just copy/pasting whatever Weka prints out. These numbers need to be interpreted. For example, in
Activity 2.6, question 2, the 4-digit answer would be 0.7354%, and 0.7 and 0.74 are the only
accepted answers.

In question 5, the 4-decimal place accuracy is 1.7256%, and we would accept 1.73%, 1.7% and 2%.
We're a bit selective in what we'll accept here.

I want to move on to the user classifier now. Some people got some confusing results, because they
created splits that involved the class attribute. When you're dealing with the test set, you don't
know the class attribute -- that's what you're trying to find out. So, it doesn't make sense to create
splits in the decision tree that involve testing the class attribute. If you do that, you're going to get 0
accuracy on test data, because the class value cannot be evaluated on the test data. That was the
cause of that confusion.

Here's the league table for the user classifier. J48 gets 96.2%, just as a reference point. Magda did
really well and got very close to that, with 93.9%. It took her 6.5-7 minutes, according to the script
that she mailed in. Myles did pretty well -- 93.5%. In the class, I got 78% in just a few seconds. I
think if you get over 90% you're doing pretty well on this dataset for the user classifier. The point is
not to get a good result, it's to think about the process of classification.

Let's move to Activity 2.2, partitioning the datasets for training and testing. Question 1 asked you to
evaluate J48 with percentage split, using 10% for the training set, 20%, 40%, 60%, and 80%. What
you observed is that the accuracy increases as we go through that set of numbers. "Performance
always increases" for those numbers. It doesn't always increase in general. In general, you would
expect an increasing trend, the more training data the better the performance, asymptoting off at
some point. You would expect some fluctuation, though, so sometimes you would expect it to go
down and up again. In this particular case, performance always increases.

You were asked to estimate J48's true accuracy on the segment-challenge dataset in Question 4.
Well, "true accuracy", what do we mean by "true accuracy"? I guess maybe it's not very well defined,
but what one thinks of is if you have a large enough training set, the performance of J48 is going to
increase up to some kind of point, and what would that point be?

Actually, if you do this -- in fact, you've done it! -- you found that training sets between 60% and
97-98%, using the percentage split option, consistently yield correctly classified instances in
the range 94-97%. So, 95% is probably the best fit from this selection of possible numbers. It's true,
by the way, that greater weight is normally given to the training portion of this split. Usually when
we use percentage split, we would use 2/3, or maybe 3/4, or maybe 90% of the data for training, and
the smaller amount for the test data.

Questions 6 and 7 were confusing, and we've changed those. The issue there was how a classifier's
performance, and secondly the reliability of the estimate of the classifier's performance, is expected
to increase as the volume of the training data increases. Or, how they change with the size of the
dataset. The performance is expected to increase as the volume of training data increases, and the
reliability of the estimate is also expected to increase as the volume of test data increases. With the
percentage split option, there's a trade-off between the amount of test data and the amount of
training data. That's what that question is trying to get at.

Activity 2.3 Question 5: "How do the mean and standard deviation estimates depend on the number
of samples?" Well, the answer is that roughly speaking both stay the same. Let me find Activity 2.3,
Question 5. As you increase the number of samples, you expect the estimated mean to converge to
the true value of the mean, and the estimated standard deviation to converge to the true standard
deviation. So, they would both stay about the same; this is, in fact, now marked as correct. Actually,
because of the "n-1" in the denominator of the formula for variance, it's true that the standard
deviation decreases a tiny bit, but it's a very small effect. So, we've also accepted that answer as
correct. That's how the mean and standard deviation estimates depend on the number of samples.
Perhaps a more important question is how the reliability of the mean would change.

What decreases is the standard error of the estimate of the mean, which is the standard deviation
of the theoretical distribution of the large population of such estimates. The estimate of the mean is
a better, more reliable estimate with a larger training set size.
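
A minimal Python sketch of why the mean becomes more reliable: the standard error is commonly estimated as the sample standard deviation divided by the square root of the number of samples, assuming the samples are independent. The function name `standard_error` and the sample counts are hypothetical; 4.6 is the repeated-holdout standard deviation quoted earlier.

```python
import math

def standard_error(sample_std, n_samples):
    """Standard error of the mean, assuming independent samples."""
    return sample_std / math.sqrt(n_samples)

# The standard deviation stays about the same, but the standard error
# of the mean shrinks as the number of samples grows.
for n in (10, 40, 160):
    print(n, round(standard_error(4.6, n), 2))
```

Quadrupling the number of samples halves the standard error, while the standard deviation itself stays roughly where it was.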

"The supermarket dataset is weird." Yes, it is weird: it's intended to be weird. Actually, in the
supermarket dataset, each instance represents a supermarket trolley, and, instead of putting a 0 for
every item you don't buy -- of course, when we go to the supermarket, we don't buy most of the
items in the supermarket -- the ARFF file codes that as a question mark, which stands for "missing
value".

We're going to discuss missing values in Class 5. This dataset is suitable for association rule learning,
which we're not doing in this course. The message I'm trying to emphasize here is that you need to
understand what you're doing, not just process datasets blindly. Yes, it is weird.

There's been some discussion on the mailing list about cross-validation and the extra model. When
you do cross-validation, you're trying to do two things. You're trying to get an estimate of the
expected accuracy of a classifier, and you're trying to actually produce a really good classifier. To
produce a really good classifier to use in the future, you want to use the entire training set to train
up the classifier. To get an estimate of its accuracy, however, you can't do that unless you have an
independent test set. So cross-validation takes 90% for training and 10% for testing, repeats that 10
times, and averages the results to get an estimate. Once you've got the estimate, if you want an
actual classifier to use, the best classifier is one built on the full training set.

The same is true with a percentage split option. Weka will evaluate the percentage split, but then it
will print the classifier that it produces from the entire training set to give you a classifier to use on
your problem in the future. There's been a little bit of discussion on advanced stuff. I think maybe a
follow-up course might be a good idea here.

Someone noticed that if you apply a filter to the training set, you need to apply exactly the same
filter to the test set, which is sometimes a bit difficult to do, particularly if the training and test sets
are produced by cross-validation. There's an advanced classifier called the "FilteredClassifier" which
addresses that problem.

In his response to a question on the supermarket dataset, Peter mentioned "unbalanced" datasets,
and the cost of different kinds of error. This is something that Weka can take into account with a
cost sensitive evaluation, and there is a classifier called the CostSensitiveClassifier that allows you to
do that.

Finally, someone just asked a question on attribute selection: how do you select a good subset of
attributes? Excellent question! There's a whole Attribute Selection panel, which we're not able to
talk about in this MOOC. This is just an introductory MOOC on Weka. Maybe we'll come up with an
advanced, follow-up MOOC where we're able to discuss some of these more advanced issues. That's

15 Simplicity first

Hi! This is the third class of Data Mining with Weka, and in this class, we’re going to look at some
simple machine learning methods and how they work. We're going to start out emphasizing the
message that simple algorithms often work very well. In data mining, maybe in life in general, you
should always try simple things before you try more complicated things. There are many different
kinds of simple structure. For example, it might be that one attribute in the dataset does all the work,
everything depends on the value of one of the attributes. Or, it might be that all of the attributes
contribute equally and independently. Or a simple structure might be a decision tree that tests just a
few of the attributes. We might calculate the distance from an unknown sample to the nearest
training sample, or a result may depend on a linear combination of attributes.

We're going to look at all of these simple structures in the next few lessons. There's no universally
best learning algorithm. The success of a machine learning method depends on the domain. Data
mining really is an experimental science. We're going to look at the OneR rule learner, where one
attribute does all the work. It's extremely simple, very trivial, actually, but we're going to start with
simple things and build up to more complex things. OneR learns what you might call a one-level
decision tree, or a set of rules that all test one particular attribute. A tree that branches only at the
root node depending on the value of a particular attribute, or, equivalently, a set of rules that test
the value of that particular attribute.

In the basic version of OneR, there’s one branch for each value of the attribute. We choose which
attribute first, and we make one branch for each possible value of the attribute. Each branch assigns
the most frequent class that comes down that branch. The error rate is the proportion of instances
that don't belong to the majority class of their corresponding branch. We choose the attribute with
the smallest error rate. Let's look at what this actually means. Here's the algorithm. For each
attribute, we're going to make some rules. For each value of the attribute, we're going to make a
rule that counts how often each class appears, finds the most frequent class, makes the rule assign
that most frequent class to this attribute value combination, and then we're going to calculate the
error rate of this attribute's rules.

We're going to repeat that for each of the attributes in the dataset and choose the attribute with the
smallest error rate. Here's the weather data again. What OneR does, is it looks at each attribute in
turn, outlook, temperature, humidity, and wind, and forms rules based on that. For outlook, there
are three possible values: sunny, overcast, and rainy. We just count out of the 5 sunny instances, 2
of them are yeses and 3 of them are nos. We're going to choose a rule, if it's sunny choose no. We're
going to get 2 errors out of 5. For overcast, all of the 4 overcast values of outlook lead to yes values
for the class play.

So, we're going to choose the rule if outlook is overcast, then yes, giving us 0 errors. Finally, for
outlook is rainy we're going to choose yes, as well, and that would also give us 2 errors out of the 5
instances. We've got a total number of errors if we branch on outlook of 4. We can branch on
temperature and do the same thing. When temperature is hot, there are 2 nos and 2 yeses. We just
choose arbitrarily in the case of a tie so we'll choose if it's hot, let's predict no, getting 2 errors. If
temperature is mild, we'll predict yes, getting 2/6 errors, and if the temperature is cool, we'll predict
yes, getting 1 out of the 4 instances as an error.

And the same for humidity and wind. We look at the total error values; we choose the rule with the
lowest total error value -- either outlook or humidity. That's a tie, so we'll just choose arbitrarily, and
choose outlook. That's how OneR works, it’s as simple as that. Let's just try it. Here's Weka. I'm going
to open the nominal weather data. I'm going to go to Classify. This is such a trivial dataset that the
results aren't very meaningful, but if I just run ZeroR to start off with, I get a [success] rate of 64%.
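
The OneR procedure walked through above can be sketched in Python as follows. This is an illustration, not Weka's implementation: the function names are hypothetical, the 14 rows are the standard nominal weather data typed in by hand (an assumption, since the table itself is on the slide), and ties are broken by taking whichever class or attribute comes first.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]

# Nominal weather data; the last column is the class, play.
WEATHER = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]

def rule_for_attribute(rows, col):
    """Build one rule per attribute value and count the training errors."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[col]][row[-1]] += 1
    rule, errors = {}, 0
    for value, class_counts in counts.items():
        majority_class, majority_count = class_counts.most_common(1)[0]
        rule[value] = majority_class
        errors += sum(class_counts.values()) - majority_count
    return rule, errors

def one_r(rows, attribute_names):
    """Pick the attribute whose rules make the fewest errors (first wins ties)."""
    best = None
    for col, name in enumerate(attribute_names):
        rule, errors = rule_for_attribute(rows, col)
        if best is None or errors < best[2]:
            best = (name, rule, errors)
    return best

attribute, rule, errors = one_r(WEATHER, ATTRIBUTES)
print(attribute, rule, f"{errors}/{len(WEATHER)} errors")
```

On this data outlook and humidity both make 4 errors out of 14, and the sketch keeps outlook simply because it is examined first.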

If I now choose OneR and run that, I get a rule, and the rule I get is branched on outlook: if it's sunny
then choose no, overcast choose yes, and rainy choose yes. We get 10 out of 14 instances correct on
the training set. We're evaluating this using cross-validation. Doesn't really make much sense on
such a small dataset. Interesting, though, that the [success] rate we get, 42%, is pretty bad, worse
than ZeroR. Actually, with any 2-class problem, you would expect to get a success rate of at least
50%. Tossing a coin would give you 50%. This OneR scheme is not performing very well on this trivial
dataset. Notice the rule it finally prints out: since we're using 10-fold cross-validation, it does the
whole thing 10 times and then on the 11th time calculates a rule from the entire dataset, and that's
what it prints out.

That's where this rule comes from. OneR, one attribute does all the work. This is a very simple
method of machine learning described in 1993, 20 years ago, in a paper called "Very Simple
Classification Rules Perform Well on Most Commonly Used Datasets" by a guy called Rob Holte, who
lives in Canada. He did an experimental evaluation of the OneR method on 16 commonly used
datasets. He used cross-validation, just like we've told you, to evaluate these things, and he found
that the simple rules from OneR often outperformed far more complex methods that had been
proposed for these datasets. How can such a simple method work so well? Some datasets really are
simple, and others are so small, noisy, or complex that you can't
learn anything from them. So, it's always worth trying the simplest things first. Section 4.1 of the
course text talks about OneR. Now it's time for you to go and do the activity associated with this
lesson. Bye for now!

16 Model Overfitting

Hi! Before we go on to talk about some more simple classifier methods, we need to talk about
overfitting. Any machine learning method may 'overfit' the training data; that's when it produces a
classifier that fits the training data too tightly and doesn't generalize well to independent test data.
Remember the user classifier that you built at the beginning of Class 2, when you built a classifier
yourself? Imagine tediously putting a tiny circle around every single training data point. You could
build a classifier very laboriously that would be 100% correct on the training data, but probably
wouldn't generalize very well to independent test data. That's overfitting. It's a general problem.
We're going to illustrate it with OneR.

We're going to look at the numeric version of the weather problem, where temperature and
humidity are numbers and not nominal values. If you think about how OneR works, when it comes to
make a rule on the attribute temperature, it's going to make a complex rule that branches 14 different
ways perhaps for the 14 different instances of the dataset. Each rule is going to have zero errors; it's
going to get it exactly right. If we branch on temperature, we're going to get a perfect rule, with a
total error count of zero. In fact, OneR has a parameter that limits the complexity of rules. I'm not
going to talk about how it works. It's pretty simple, but it's just a bit distracting and not very
important. The point is that the parameter allows you to limit the complexity of the rules that are
produced by OneR. Let's open the numeric weather data. We can go to OneR and choose it. There's
OneR, and let's just create a rule.

Here the rule is based on the outlook attribute. This is exactly what happened in the last lesson with
the nominal version of the weather data. Let's just remove the outlook attribute and try it again.
Now let's see what happens when we classify with OneR. Now it branches on humidity. If humidity is
less than 82.5%, it's a yes day; if it's greater than 82.5%, it's a no day and that gets 10 out of 14
instances correct. So far so good, that's using the default setting of OneR's parameter that controls
the complexity of the rules it generates.

We can go and look at OneR; remember, you can configure a classifier by clicking on it. We see
that there's a parameter called minBucketSize, and it's set to 6 by default, which is a good
compromise value. I'm going to change that value to 1, and then see what happens. Run OneR again,
and now I get a different kind of rule. It's branching many different ways on the temperature
attribute. This rule is overfitted to the dataset. It's a very accurate rule on the training data, but it
won't generalize well to independent test data.
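
A minimal Python sketch of the same effect, using hypothetical data rather than the weather file: a rule that branches on every distinct value of a numeric attribute memorises the training set perfectly, even when the attribute carries no information about the class. The function names and the random labels are assumptions made for illustration; unseen values fall back to the majority class.

```python
import random
from collections import Counter

def fit_lookup(xs, ys):
    """Memorise the majority class for every distinct attribute value."""
    by_value = {}
    for x, y in zip(xs, ys):
        by_value.setdefault(x, Counter())[y] += 1
    table = {x: c.most_common(1)[0][0] for x, c in by_value.items()}
    default = Counter(ys).most_common(1)[0][0]  # used for unseen values
    return table, default

def accuracy(model, xs, ys):
    """Fraction of instances whose predicted class matches the true class."""
    table, default = model
    hits = sum(table.get(x, default) == y for x, y in zip(xs, ys))
    return hits / len(ys)

rng = random.Random(0)
# Hypothetical data: every attribute value is unique and unrelated to the class.
train_x = list(range(0, 200, 2))
test_x = list(range(1, 200, 2))
train_y = [rng.choice(["yes", "no"]) for _ in train_x]
test_y = [rng.choice(["yes", "no"]) for _ in test_x]

model = fit_lookup(train_x, train_y)
train_acc = accuracy(model, train_x, train_y)
test_acc = accuracy(model, test_x, test_y)
print(f"training accuracy {train_acc:.0%}, test accuracy {test_acc:.0%}")
```

The training figure is perfect by construction, which is exactly why it says nothing about performance on independent data.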

Now let's see what happens with a more realistic dataset. I'll open diabetes, which is a numeric
dataset. All the attributes are numeric, and the class is either
tested_negative or tested_positive. Let's run ZeroR to get a baseline figure for this dataset. Here I
get 65% for the baseline. We really ought to be able to do better than that. Let's run OneR with the
default parameter settings, that is, a value of 6 for OneR's parameter that controls rule complexity.
We get 71.5%. That's pretty good.

We're evaluating using cross-validation. OneR outperforms the baseline accuracy by quite a bit --
71% versus 65%. If we look at the rule, it branches on "plas". This is the plasma-glucose
concentration. So, depending on which of these regions the plasma-glucose concentration falls into,
then we're going to predict a negative or a positive outcome. That seems like quite a sensible
rule.

Now, let's change OneR's parameter to make it overfit. We'll configure OneR, find the
minBucketSize parameter, and change it to 1. When we run OneR again, we get 57% accuracy, quite
a bit lower than the ZeroR baseline of 65%.

If you look at the rule, here it is. It's testing a different attribute, pedi, which -- if you look at the
comments of the ARFF file -- happens to be the diabetes pedigree function, whatever that is. You
can see that this attribute has a lot of different values, and it looks like we're branching on pretty
well every single one. That gives us lousy performance when evaluated by cross-validation, which is
what we're doing now. If you were to evaluate it on the training set, you would expect to see very
good performance. Yes, here we get 87.5% accuracy on the training set, which is very good for this
dataset. Of course, that figure is completely misleading; the rule is strongly overfitted to the training
dataset and doesn't generalize well to independent test sets.

That's a good example of overfitting. Overfitting is a general phenomenon that plagues all machine
learning methods. We've illustrated it by playing around with the parameter of the OneR method,
but it happens with all machine learning methods. It's one reason why you should never evaluate on
the training set. Overfitting can occur in more general contexts. Let's suppose you've got a dataset
and you choose a very large number of machine learning methods, say a million different machine
learning methods, and choose the best for your dataset using cross-validation. Well, because you've
used so many machine learning methods, you can't expect to get the same performance on new test
data.

You've chosen so many, that the one that you've ended up with is going to be overfitted to the
dataset you're using. It's not sufficient just to use cross-validation and believe the results. In this
case, you might divide the data three ways, into a training set, a test set, and a validation set.
Choose the method using the training and test set. By all means, use your million machine learning
methods and choose the best on the training and test set, or the best using cross-validation on
the training set. But then, leave aside this separate validation set for use at the end, once you've
chosen your machine learning method, and evaluate it on that to get a much more realistic
assessment of how it would perform on independent test data. Overfitting is a really big problem in
machine learning. You can read a bit more about OneR and what this parameter actually does in the
course text in Section 4.1. Off you go now and do the activity associated with this class. Bye for now.

17 Using probabilities

Hi! This is Lesson 3.3 on using probabilities. It's the one bit of Data Mining with Weka where we're
going to see a little bit of mathematics, but don't worry, I'll take you through it gently. The OneR
strategy that we've just been studying assumes that there is one of the attributes that does all the
work, that takes the responsibility of the decision. That's a simple strategy. Another simple strategy
is the opposite, to assume all of the attributes contribute equally and independently to the decision.
This is called the "Naive Bayes" method -- I'll explain the name later on. There are two assumptions
that underlie Naive Bayes: that the attributes are equally important and that they are statistically
independent, that is, knowing the value of one of the attributes doesn't tell you anything about the
value of any of the other attributes.

This independence assumption is never actually correct, but the method based on it often works
well in practice. There's a theorem in probability called "Bayes Theorem" after this guy Thomas
Bayes from the 18th century. It's about the probability of a hypothesis H given evidence E. In our
case, the hypothesis is the class of an instance and the evidence is the attribute values of the
instance. The theorem is that Pr[H|E] -- the probability of the class given the instance, the
hypothesis given the evidence -- is equal to Pr[E|H] times Pr[H] divided by Pr[E]. Pr[H] by itself is
called the [prior] probability of the hypothesis H.

That's the probability of the event before any evidence is seen. That's really the baseline probability
of the event. For example, in the weather data, I think there are 9 yeses and 5 nos, so the baseline
probability of the hypothesis "play equals yes" is 9/14 and "play equals no" is 5/14. What this
equation says is how to update that probability Pr[H] when you see some evidence, to get what's called
the "a posteriori" probability of H, that means after the evidence. The evidence in our case is the
attribute values of an unknown instance. That's E. That's Bayes Theorem. Now, what makes this
method "naive"? The naive assumption is -- I've said it before -- that the evidence splits into parts
that are statistically independent.
The parts of the evidence in our case are the four different attribute values in the weather data.
When you have independent events, the probabilities multiply, so Pr[H|E], according to the top
equation, is the product of Pr[E|H] times the prior probability Pr[H] divided by Pr[E]. Pr[E|H] splits
up into these parts: Pr[E1|H], the first attribute value; Pr[E2|H], the second attribute value; and so
on for all of the attributes. That's maybe a bit abstract; let's look at the actual weather data. On the
right-hand side is the weather data. In the large table at the top, we've taken each of the attributes.

Let's start with "outlook". Under the "yes" hypothesis and the "no" hypothesis, we've looked at how
many times the outlook is "sunny". It's sunny twice under yes and 3 times under no. That comes
straight from the data in the table. Overcast: when the outlook is overcast, it's always a "yes"
instance, so there were 4 of those, and zero "no" instances. Then, rainy is 3 "yes" instances and 2
"no" instances. Those numbers just come straight from the data table given the instance values.
Then, we take those numbers and underneath we make them into probabilities. Let's say we know
the hypothesis. Let's say we know it's a "yes".

Then the probability of it being "sunny" is 2/9ths, "overcast" is 4/9ths, and "rainy" 3/9ths,
simply because when you add up 2 plus 4 plus 3 you get 9. Those
are the probabilities. If we know that the outcome is "no", the probabilities are "sunny" 3/5ths,
"overcast" 0/5ths, and "rainy" 2/5ths. That's for the "outlook" attribute. That's what we're looking
for, you see, the probability of each of these attribute values given the hypothesis H. The next
attribute is temperature, and we just do the same thing with that to get the probabilities of the 3
values -- hot, mild, and cool -- under the "yes" hypothesis or the "no" hypothesis.

The same with humidity and windy. Play, that's the prior probability -- Pr[H]. It's "yes" 9/14ths of the
time, "no" 5/14ths of the time, even if you don't know anything about the attribute values. The
equation we're looking at is this one below, and we just need to work it out. Here's an example.
Here's an unknown day, a new day. We don't know what the value of "play" is, but we know it's
sunny, cool, high, and windy. We can just multiply up these probabilities. If we multiply for the yes
hypothesis, we get 2/9ths times 3/9ths times 3/9ths times 3/9ths -- those are just the numbers on
the previous slide, Pr[E1|H], Pr[E2|H], Pr[E3|H], Pr[E4|H] -- and finally times Pr[H], that is 9/14ths.

That gives us a likelihood of 0.0053 when you multiply them. Then, for the "no" class, we do the
same to get a likelihood of 0.0206. These numbers are not probabilities.
Probabilities have to add up to 1. They are likelihoods. But we can get the probabilities from them by
using a straightforward technique of normalization. Take those likelihoods for "yes" and "no" and we
normalize them as shown below to make them add up to 1. That's how we get the probability of
"play" on a new day with different attribute values. Just to go through that again. The evidence is
"outlook" is "sunny", "temperature" is "cool", "humidity" is "high", "windy" is "true" -- and we don't
know what play is. The [likelihood] of a "yes", given the evidence is the product of those 4
probabilities -- one for outlook, temperature, humidity and windy -- times the prior probability,
which is just the baseline probability of a "yes". That product of fractions is divided by Pr[E].
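
The calculation for this new day can be sketched in Python with exact fractions. The "yes" fractions are the ones quoted above; the "no" fractions for cool, high and windy are read off the standard weather data (an assumption, since only the outlook values are spelled out in the text), and the variable names are illustrative.

```python
from fractions import Fraction as F
from math import prod

# Prior probabilities Pr[H] for each class.
prior = {"yes": F(9, 14), "no": F(5, 14)}

# Pr[Ei|H] for the evidence: outlook=sunny, temperature=cool,
# humidity=high, windy=true.
conditionals = {
    "yes": [F(2, 9), F(3, 9), F(3, 9), F(3, 9)],
    "no": [F(3, 5), F(1, 5), F(4, 5), F(3, 5)],
}

# Likelihood of each class: product of the conditionals times the prior.
likelihood = {c: prod(conditionals[c]) * prior[c] for c in prior}

# Normalizing makes the two values add up to 1, so Pr[E] is never needed.
total = sum(likelihood.values())
probability = {c: likelihood[c] / total for c in likelihood}

for c in ("yes", "no"):
    print(c, f"likelihood {float(likelihood[c]):.4f}",
          f"probability {float(probability[c]):.1%}")
```

The likelihoods come out as 0.0053 and 0.0206, matching the figures above, and after normalization "no" is the more probable class for this day.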

We don't know what Pr[E] is, but it doesn't matter, because we can do the same calculation for
the "no" hypothesis, which gives us another equation just like this, and then we can calculate the
actual probabilities by normalizing them so that the two probabilities add up to 1. Pr[yes|E] plus
Pr[no|E] equals 1. It's actually quite simple when you look at it in numbers, and it's simple when you
look at it in Weka, as well. I'm going to go to Weka here, and I'm going to open the nominal weather
data, which is here. We've seen that before, of course, many times. I'm going to go to Classify. I'm
going to use the NaiveBayes method. It's under this bayes category here. There are a lot of
implementations of different variants of Bayes. I'm just going to use the straightforward NaiveBayes
method here. I'll just run it.

This is what we get. The success probability calculated according to cross-validation. More
interestingly, we get the model. The model is just like the table I showed you before divided
under the "yes" class and the "no" class. We've got the four attributes -- outlook, temperature,
humidity, and windy -- and then, for each of the attribute values, we've got the number of times
that attribute value appears. Now, there's one little but important difference between this table and
the one I showed you before.

Let me go back to my slide and look at these numbers. You can see that for outlook under "yes" on
my slide, I've got 2, 4, and 3, and Weka has got 3, 5, and 4. That's 1 more each time for a total of 12,
instead of a total of 9. Weka adds 1 to all of the counts. The reason it does this is to get rid of the
zeros. In the original table under outlook, under "no", the probability of overcast given "no" is zero,
and we're going to be multiplying that into things. What that would mean in effect, if we took that
zero at face value, is that the probability of the class being "no" given any day for which the outlook
was overcast would be zero. Anything multiplied by zero is zero.
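
Here is a small sketch of the add-one idea in Python. The outlook counts are the ones from the slide (2, 4, 3 under "yes"; 3, 0, 2 under "no"); the helper name `estimate` is made up for this illustration and is not a Weka function.

```python
from fractions import Fraction as F

# Outlook counts per class, as on the lecture slide.
counts = {
    "yes": {"sunny": 2, "overcast": 4, "rainy": 3},
    "no": {"sunny": 3, "overcast": 0, "rainy": 2},
}


def estimate(class_counts, add=0):
    """Relative frequencies after adding `add` to every count."""
    total = sum(class_counts.values()) + add * len(class_counts)
    return {value: F(c + add, total) for value, c in class_counts.items()}


raw = estimate(counts["no"])              # overcast has probability 0
smoothed = estimate(counts["no"], add=1)  # counts become 4, 1, 3 out of 8

print("Pr[overcast|no] raw:", raw["overcast"])
print("Pr[overcast|no] with add-one:", smoothed["overcast"])
print("yes, with add-one:", estimate(counts["yes"], add=1))
```

With `add=1` the "yes" counts become 3, 5 and 4 out of 12, matching the Weka output, and the zero under "no" becomes 1 out of 8.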

These zeros in probability terms have sort of a veto over all of the other numbers, and we don't
want that. We don't want to categorically conclude that it can't be a "no" day just because it's
overcast and we've never seen an overcast outlook on a "no" day before. That's called the
"zero-frequency problem", and Weka's solution -- the most common solution -- is very simple: we just add
1 to all the counts. That's why all those numbers in the Weka table are 1 bigger than the numbers in
the table on the slide. Aside from that, it's all exactly the same. We're avoiding zero frequencies by
effectively starting all counts at 1 instead of starting them at 0, so they can't end up at 0. That's the
Naive Bayes method. The assumption is that all attributes contribute equally and independently to
the outcome. That works surprisingly well, even in situations where the independence assumption is
clearly violated. Why does it work so well when the assumption is wrong? That's a good question.

Basically, classification doesn't need accurate probability estimates. We're just going to choose as
the class the outcome with the largest probability. As long as the greatest probability is assigned
to the correct class, it doesn't matter if the probability estimates are all that accurate. This actually
means that if you add redundant attributes you get problems with Naive Bayes. The extreme case of
dependence is where two attributes have the same values, identical attributes. That will cause
havoc with the Naive Bayes method. However, Weka contains methods for attribute selection to
allow you to select a subset of fairly independent attributes after which you can safely use Naive
Bayes. There's quite a bit of stuff on statistical modeling in Section 4.2 of the course text. Now you
need to go and do that activity. See you soon!
18 Decision trees

Hi! Here in Lesson 3.4, we're continuing our exploration of simple classifiers by looking at classifiers
that produce decision trees. We're going to look at J48. We've used this classifier quite a bit so far.
Let's have a look at how it works inside. J48 is based on a top-down strategy, a recursive divide and
conquer strategy. You select which attribute to split on at the root node, and then you create a
branch for each possible attribute value, and that splits the instances into subsets, one for each
branch that extends from the root node. Then you repeat the procedure recursively for each
branch, selecting an attribute at each node, and you use only instances that reach that branch to
make the selection.
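
The recursive procedure can be sketched in a few lines of Python. This is a bare-bones, ID3-style illustration and not J48 itself: it has no pruning, handles nominal attributes only, and picks the attribute by information gain, the heuristic this lesson goes on to describe. The rows are the standard 14-instance nominal weather data; the names `ROWS`, `NAMES`, `gain` and `build` are made up for this sketch.

```python
from collections import Counter
from math import log2

NAMES = ["outlook", "temperature", "humidity", "windy"]
# Nominal weather data: outlook, temperature, humidity, windy, play (the class).
ROWS = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"),
    ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rainy", "mild", "high", "true", "no"),
]


def entropy(rows):
    """Entropy in bits of the class distribution of `rows`."""
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in Counter(r[-1] for r in rows).values())


def split(rows, i):
    """Group rows by the value of attribute number i."""
    subsets = {}
    for r in rows:
        subsets.setdefault(r[i], []).append(r)
    return subsets


def gain(rows, i):
    """Entropy before the split minus weighted entropy after it."""
    after = sum(len(s) / len(rows) * entropy(s) for s in split(rows, i).values())
    return entropy(rows) - after


def build(rows, attrs):
    classes = {r[-1] for r in rows}
    if len(classes) == 1:  # pure node: stop
        return classes.pop()
    if not attrs:  # nothing left to split on: use the majority class
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    best = max(attrs, key=lambda i: gain(rows, i))
    remaining = [i for i in attrs if i != best]
    # One branch per value of the chosen attribute; recurse on each subset.
    return {NAMES[best]: {v: build(s, remaining) for v, s in split(rows, best).items()}}


tree = build(ROWS, [0, 1, 2, 3])
print(tree)
```

On this data the sketch splits on outlook at the root, then on humidity under sunny and on windy under rainy, and stops immediately on the pure overcast branch.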

At some point you stop; perhaps you continue until all instances at a node have the same class. The
question is, how do you select a good attribute for the root node? This is the weather data, and
as you can see, outlook has been selected for the root node. Here are the four possibilities: outlook,
windy, humidity, and temperature. These are the consequences of splitting on each of these
attributes. What we're really looking for is a pure split, a split into pure nodes. We would be
delighted if we found an attribute that split exactly into one node where they are all yeses, another
node where they are all nos, and perhaps a third node where they are all yeses again.

That would be the best thing. What we don't want is mixtures, because when we get mixtures of
yeses and nos at a node, then we've got to split again. You can see that splitting on outlook looks
pretty good. We get one branch with two yeses and three nos, then we get a pure yes branch for
overcast, and, when outlook is rainy, we get three yeses and two nos. How are we going to
quantify this to decide which one of these attributes produces the purest nodes? We're on a quest
here for purity. The aim is to get the smallest tree, and top-down tree induction methods use some
kind of heuristic.

The most popular heuristic to produce pure nodes is an information theory-based heuristic. I'm not
going to explain information theory to you, that would be another MOOC of its own -- quite an
interesting one, actually. Information theory was founded by Claude Shannon, an American
mathematician and scientist who died about 12 years ago. He was an amazing guy. He did some
amazing things. One of the most amazing things, I think, is that he could ride a unicycle and juggle
clubs at the same time when he was in his 80's. That's pretty impressive. He came up with the whole idea
of information theory and quantifying entropy, which measures information in bits.

This is the formula for entropy: the sum of p log p's for each of the possible outcomes. I'm not really
going to explain it to you. All of those minus signs are there because logarithms are negative if
numbers are less than 1 and probabilities always are less than 1. So, the entropy comes out to be a
positive number. What we do is we look at the information gain. How much information in bits do
you gain by knowing the value of an attribute? That is, the entropy of the distribution before the
split minus the entropy of the distribution after the split. Here's how it works out for the weather data.

These are the number of bits. If you split on outlook, you gain 0.247 bits. I know you might be
surprised to see fractional numbers of bits; normally we think of 1 bit or 8 bits or 32 bits, but
information theory shows how you can regard bits as fractions. These produce fractional numbers of
bits. I don't want to go into the details. You can see, knowing the value for windy gives you only
0.048 bits of information. Humidity is quite a bit better; temperature is way down there at 0.029
bits. We're going to choose the attribute that gains the most bits of information, and that, in this
case, is outlook.
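
If you want to check these numbers yourself, here is a short Python sketch. Entropy is computed as minus the sum of p log2 p over the class proportions, and the gain is the entropy before the split minus the weighted average entropy after it. The [yes, no] counts per branch are assumed from the standard nominal weather data (only the outlook counts are spelled out in the lecture), and the function names are made up for this illustration.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a list of class counts, e.g. [9, 5]."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)


def info_gain(before, branches):
    """Entropy before the split minus the weighted entropy after it."""
    n = sum(before)
    after = sum(sum(b) / n * entropy(b) for b in branches)
    return entropy(before) - after


# [yes, no] counts on each branch, one inner list per attribute value.
splits = {
    "outlook": [[2, 3], [4, 0], [3, 2]],      # sunny, overcast, rainy
    "temperature": [[2, 2], [4, 2], [3, 1]],  # hot, mild, cool
    "humidity": [[3, 4], [6, 1]],             # high, normal
    "windy": [[6, 2], [3, 3]],                # false, true
}

gains = {attr: info_gain([9, 5], branches) for attr, branches in splits.items()}
for attr, g in gains.items():
    print(f"{attr}: {g:.3f} bits")

best = max(gains, key=gains.get)
print("split on:", best)
```

The sketch reproduces 0.247 bits for outlook, 0.048 for windy and 0.029 for temperature, with humidity in between at about 0.152.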

At the top level of this tree, the root node, we're going to split on outlook. Having decided to split on
outlook, we need to look at each of 3 branches that emanate from outlook corresponding to the 3
possible values of outlook, and consider what to do at each of those branches. At the first branch,
we might split on temperature, windy or humidity. We're not going to split on outlook again because
we know that outlook is sunny. For all instances that reach this place, the outlook is sunny. For the
other 3 things, we do exactly the same thing. We evaluate the information gain for temperature at
that point, for windy and humidity, and we choose the best. In this case, it's humidity with a gain of
0.971 bits. You can see that, if we branch on humidity, then we get pure nodes: 3 nos in one and
2 yeses in the other. When we get that, we don't need to split anymore. We're on a quest for purity.
That's how it works. It just carries on until it reaches the end, until it has pure nodes. Let's open up
Weka, and just do this with the nominal weather data.
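
A quick Python check of that 0.971 figure, as a sketch: on the sunny branch there are 2 yeses and 3 nos, and splitting on humidity leaves two pure nodes, so the gain is simply the entropy of the 2/3 mixture. The `entropy` helper is made up for this illustration.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a list of class counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)


sunny = [2, 3]                  # [yes, no] among the 5 sunny instances
by_humidity = [[0, 3], [2, 0]]  # high -> 3 nos, normal -> 2 yeses (both pure)

after = sum(sum(b) / sum(sunny) * entropy(b) for b in by_humidity)
gain = entropy(sunny) - after

print(f"gain from humidity on the sunny branch: {gain:.3f} bits")
print(f"entropy of an even 3/3 mixture: {entropy([3, 3]):.3f} bits")
```

The gain is a little under 1 bit because the node being split is not an even mixture; an even 3/3 mixture split into pure nodes would give exactly 1 bit.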

Of course, we've done this before, but I'll just do it again. It won't take long. J48 is the workhorse
data mining algorithm. There's the data. We're going to choose J48. It's a tree classifier. We're going
to run this, and we get a tree -- the very tree I showed you before -- split first on outlook: sunny,
overcast and rainy. Then, if it's sunny, split on humidity, 3 instances reach that node. Then split on
normal, 3 yes instances reach that node, and so on. We can look at the tree using Visualize the tree
in the right-click menu. Here it is. These are the number of yes instances that reach this node and
the number of no instances. In the case of this particular tree, of course we're using cross-validation.

It's done an 11th run on the whole dataset. It's given us these numbers by looking at the training set.
In fact, this becomes a pure node here. Sometimes you get 2
numbers here -- 3/2 or 3/1. The first number indicates the number of correct things that reach that
node, so in this case the number of nos. If there was another number following the 3, that would
indicate the number of yeses, that is, incorrect things that reach that node. But that doesn't occur in
this very simple situation. There you have it, J48: top-down induction of decision trees. It's soundly
based in information theory. It's a pretty good data mining algorithm. 10 years ago I might have said
it's the best data mining algorithm, but some even better ones, I think, have been produced since then.

However, the real advantage of J48 is that it's reliable and robust, and, most importantly, it produces
a tree that people can understand. It's very easy to understand the output of J48. That's really
important when you're applying data mining. There are a lot of different criteria you could use
for attribute selection. Here we're using information gain. Actually, in practice, these don't normally
make a huge difference. There are some important modifications that need to be done to this
algorithm to be useful in practice. I've only really explained the basic principles. The actual J48
incorporates some more complex stuff to make it work under different circumstances in practice.
We'll talk about those in the next lesson. Section 4.3 of the text, "Divide-and-conquer: Constructing
decision trees", explains the simple version of J48 that I've explained here. Now you should go and do
the activity associated with this lesson. Good luck! See you next time!

19 Pruning decision trees
Hi! In the last class, we looked at a bare-bones algorithm for constructing decision trees. To get an
industrial strength decision tree induction algorithm, we need to add some more complicated stuff,
notably pruning. We're going to talk in this [lesson] about pruning decision trees. Here's a guy
pruning a tree, and that's a good image to have in your mind when we're talking about decision
trees. We're looking at those little twigs and little branches around the edge of the tree, seeing if
they're worthwhile, and snipping them off if they're not contributing. That way, we'll get a decision
tree that might perform worse on the training data, but perhaps generalizes better to independent
test data. That's what we want. Here's the weather data again. I'm sorry to keep harking back to the
weather data, but it's just a nice simple example that we all know now. I've added here a new
attribute. I call it an ID code attribute, which is different for each instance.

I've just given them an identification code: a, b, c, and so on. Let's just think from the last lesson,
what's going to happen when we consider which is the best attribute to split on at the root, the first
decision. We're going to be looking for the information gain from each of our attributes separately.
We're going to gain a lot of information by choosing the ID code. Actually, if you split on the ID code,
that tells you everything about the instance we're looking at. That's going to be a maximal amount of
information gain, and clearly we're going to split on that attribute at the root node of the decision
tree. But that's not going to generalize at all to new weather instances. To get around this problem,
having constructed a decision tree, decision tree algorithms then automatically prune it back. You
don't see any of this, it just happens when you start the algorithm in Weka. How do we prune? There
are some simple techniques for pruning, and some more complicated techniques for pruning. A very
simple technique is to not continue splitting if the nodes get very small.
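
The ID code argument above can be checked numerically. In this Python sketch, splitting on the ID code gives 14 branches with one instance each, so every branch is pure and the gain equals the full entropy of the class distribution (9 yeses, 5 nos). The `entropy` helper and variable names are made up for this illustration.

```python
from math import log2


def entropy(counts):
    """Entropy in bits of a list of class counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)


before = entropy([9, 5])  # class distribution of the whole weather dataset

# One branch per ID code: 9 branches hold a single yes, 5 hold a single no.
id_branches = [[1, 0]] * 9 + [[0, 1]] * 5
after = sum(sum(b) / 14 * entropy(b) for b in id_branches)
gain_id = before - after

print(f"gain from splitting on the ID code: {gain_id:.3f} bits")
```

That is about 0.940 bits, far more than the 0.247 bits for outlook, which is why an unpruned, gain-driven tree would pick the ID code even though it tells you nothing about new instances.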

I said in the last lesson that we're going to keep splitting until each node has just one class
associated with it. Perhaps that's not such a good idea. If we have a very small node with a couple of
instances, it's probably not worth splitting that node. That's actually a parameter in J48. I've got
Weka going here. I'm going to choose J48 and look at the parameters. There's a parameter called
minNumObj. If I mouse over that parameter, it says "The minimum number of instances per leaf".
The default value for that is 2. The second thing we do is to build a full tree and then work back from
the leaves. It turns out to be better to build a full tree and prune back rather than trying to do
forward pruning as you're building the tree. We apply a statistical test at each stage.

That's the confidenceFactor parameter. It's here. The default value is 0.25. "The confidence factor
used for pruning [smaller values incur more pruning]." Then, sometimes it's good to prune an
interior node, and to raise the subtree beneath that interior node up one level. That's called
subtreeRaising. That's this parameter here. We can switch it on or switch it off. "Whether to
consider the subtree raising operation during pruning." Subtree raising actually increases the
complexity of the algorithm, so it would work faster if you turned off subtree raising on a large
problem. I'm not going to talk about the details of these methods. Pruning is a messy and
complicated subject, and it's not particularly illuminating.

Actually, I don't really recommend playing around with these parameters here. The default values on
J48 tend to do a pretty good job. Of course, it's become apparent to you now that the need to prune
is really a result of the original unpruned tree overfitting the training dataset. This is another
instance of overfitting. Sometimes simplifying a decision tree gives better results, not just a smaller,
more manageable tree, but actually better results. I'm going to open the diabetes data. I'm going to
choose J48, and I'm just going to run it with the default parameters. I get an accuracy of 73.8%,
evaluated using cross-validation. The size of the tree is 20 leaves, and a total of 39 nodes. That's 19
interior nodes and 20 leaf nodes. Let's switch off pruning. J48 prunes by default. We're going to
switch off pruning. We've got an unpruned option here, which is false, which means it's pruning.
I'm going to change that to true -- which means it's not pruning any more -- and run it again. Now we
get a slightly worse result, 72.7%, probably not significantly worse. We get a slightly larger tree, with 22
leaves. That's a double whammy, really. We've got a bigger tree, which is harder to understand,
and we've got a slightly worse prediction result. We would prefer the pruned [tree] in this example
on this dataset. I'm going to show you a more extreme example with the breast cancer data. I don't
think we've looked at the breast cancer data before. The class is no-recurrence-events versus
recurrence-events, and there are attributes like age, menopause, tumor size, and so on. I'm going to
go classify this with J48 in the default configuration. I need to switch on pruning -- that is, make
unpruned false -- and then run it. I get an accuracy of 75.5%, and I get a fairly small tree with 4
leaves and 2 internal nodes. I can look at that tree here, or I can visualize the tree.

We get this nice, simple little decision structure here, which is quite comprehensible and performs
pretty well, 75% accuracy. I'm going to switch off pruning. Make unpruned true, and run it again.
First of all, I get a much worse result, 69.6% -- probably significantly worse than the 75.5% I had
before. More importantly, I get a huge tree. It's massive. If I try to visualize that, I probably
won't be able to see very much. I can try to fit that to my screen, and it's still impossible to see
what's going on here. In fact, if I look at the textual description of the tree, it's just extremely
complicated. That's a bad thing. Here, an unpruned tree is a very bad idea. We get a huge tree which
does quite a bit worse than a much simpler decision structure. J48 does pruning by default and, in
general, you should let it do pruning according to the default parameters. That would be my
recommendation. We've talked about J48, or, in other words, C4.5. Remember, in Lesson 1.4, we
talked about the progression from C4.5 by Ross Quinlan.

Here is a picture of Ross Quinlan, an Australian computer scientist, at the bottom of the screen. The
progression from C4.5 from Ross to J48, which is the Java implementation essentially equivalent to
C4.5. It's a very popular method. It's a simple method and easy to use. Decision trees are very
attractive because you can look at them and see what the structure of the decision is, see what's
important about your data. There are many different pruning methods, and their main effect is to
change the size of the tree. They have a small effect on the accuracy, and it often makes the
accuracy worse. They often have a huge effect on the size of the tree, as we just saw with the breast
cancer data. Pruning is actually a general technique to guard against overfitting, and it can be
applied to structures other than trees, like decision rules. There's a lot more we could say about
decision trees. For example, we've been talking about univariate decision trees -- that is, ones that
have a single test at each node. You can imagine a multivariate tree, where there is a compound
test. The test of the node might be 'if this attribute is that AND that attribute is something else'. You
can imagine more complex decision trees produced by more complex decision tree algorithms. In
general, C4.5/J48 is a popular and useful workhorse algorithm for data mining. You can read a lot
more about decision trees if you go to the course text. Section 6.1 tells you about pruning and gives
you the mathematical details of the pruning methods that I've just sketched here. It's time for you to
do the activity, and I'll see you in the next lesson. Bye for now!

20 Nearest neighbor

Hi! I'm sitting here in New Zealand. It's on the globe behind me. That's New Zealand, at the top of
the world, surrounded by water. But that's not where I'm from originally. I moved here about 20
years ago. Here on this map, of course, this is New Zealand -- Google puts things with the north at
the top, which is probably what you're used to. I came here from the University of Calgary in Canada,
where I was for many years. I used to be head of computer science for the University of Calgary. But,
originally, I'm from Belfast, Northern Ireland, which is here in the United Kingdom. So, my accent
actually is Northern Irish, not New Zealand. This is not a New Zealand accent. We're going to talk
here in the last lesson of Class 3 about another machine learning method called the nearest
neighbor, or instance-based, machine learning method.

When people talk about rote learning, they just talk about remembering stuff without really thinking
about it. It's the simplest kind of learning. Nearest neighbor implements rote learning. It just
remembers the training instances, and then, to classify a new instance, it searches the training set
for one that is most like the new instance. The representation of the knowledge here is just the set
of instances. It's a kind of lazy learning. The learner does nothing until it has to do some predictions.
Confusingly, it's also called instance-based learning. Nearest neighbor learning and instance-based
learning are the same thing. Here is just a little picture of 2-dimensional instance space. The blue
points and the white points are two different classes -- yes and no, for example.

Then we've got an unknown instance, the red one. We want to know which class it's in. So, we
simply find the closest instance in each of the classes and see which is closest. In this case, it's the
blue class. So, we would classify that red point as though it belonged to the blue class. If you think
about this, that's implicitly drawing a line between the two clouds of points. It's a straight line here,
the perpendicular bisector of the line that joins the two closest points. The nearest neighbor method
produces a linear decision boundary. Actually, it's a little bit more complicated than that. It produces
a piece-wise linear decision boundary with sometimes a bunch of little linear pieces of the decision
boundary. Of course, the trick is what do we mean by "most like".

We need a similarity function, and conventionally, people use the regular distance function, the
Euclidean distance, which is the sum of the squares of the differences between the attributes.
Actually, it's the square root of the sum of the squares, but since we're just comparing two
instances, we don't need to take the square root. Or, you might use the Manhattan or city block
distance, which is the sum of the absolute differences between the attribute values. Of course, I've
been talking about numeric attributes here. If attributes are nominal, we need the difference
between different attribute values. Conventionally, people just say the distance is 1 if the attribute
values are different and 0 if they are the same. It might be a good idea with nearest neighbor
learning to normalize the attributes so that they all lie between 0 and 1, so the distance isn't skewed
by some attribute that happens to be on some gigantic scale.
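
The distance functions just described might look like this in Python. It is a minimal sketch, not Weka's implementation: nominal values are represented as strings, numeric values as numbers, and the function names are made up for this illustration.

```python
from math import sqrt


def normalize(column):
    """Rescale a numeric column so its values lie between 0 and 1."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid dividing by zero
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def diff(a, b):
    """Per-attribute difference: 0/1 for nominal values, a - b for numeric ones."""
    if isinstance(a, str) or isinstance(b, str):
        return 0.0 if a == b else 1.0
    return float(a - b)


def euclidean(x, y):
    """Square root of the sum of squared attribute differences."""
    return sqrt(sum(diff(a, b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute attribute differences (city block distance)."""
    return sum(abs(diff(a, b)) for a, b in zip(x, y))


x = (0.0, 0.0, "sunny")
y = (3.0, 4.0, "sunny")
print(euclidean(x, y), manhattan(x, y))
print(normalize([2, 4, 10]))
```

When you only need to know which instance is closest, you can drop the `sqrt` and compare squared distances, as mentioned above.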

What about noisy instances? If we have a noisy dataset, then by accident we might find an
incorrectly classified training instance as the nearest one to our test instance. You can guard against
that by using the k-nearest-neighbors. k might be 3 or 5, and you look for the 3 or the 5 nearest
neighbors and choose the majority class amongst those when classifying an unknown point. That's
the k-nearest-neighbor method. In Weka, it's called IBk (instance-based learning with parameter k),
and it's in the lazy class. Let's open the glass dataset. Go to Classify and choose the lazy classifier IBk.
Let's just run it. We get an accuracy of 70.6%. The model is not really printed here, because there is
no model. It's just the set of training instances. We're using 10-fold cross-validation, of course. Let's
change the value of k, this kNN is the k value. It's set by default to 1. (The number of neighbors to
use.) We'll change that to, say, 5 and run that. In this case, we get a slightly worse result. This is not
such a noisy dataset, I guess. Let's change it to 20 and run it again.

We get 65% accuracy, slightly worse again. If we had a noisy dataset, we might find that the accuracy
figures improved as k got a little bit larger. Then, it would always start to decrease again. If we set k to
be an extreme value, close to the size of the whole dataset, then we're taking the distance of the
test instance to all of the points in the dataset and averaging those, which will probably give us
something close to the baseline accuracy. Here, I'll set k to be a ridiculous value like 100. I'm going to
take the 100 nearest instances and average their classes. We get an accuracy of 35%, which, I think is
pretty close to the baseline accuracy for this dataset. Let me just find that out with ZeroR, the
baseline accuracy is indeed 35%. Nearest neighbor is a really good method. It's often very accurate.
It can be slow. A simple implementation would involve scanning the entire training dataset to make
each prediction, because we've got to calculate the distance of the unknown test instance from all of
the training instances to see which is closest.
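
Here is a sketch of that simple implementation in Python: a linear scan over the training set followed by a majority vote among the k nearest instances. It is an illustration of the idea, not IBk's code; the tiny dataset, including one deliberately mislabeled point, is invented for the example, and attributes are assumed numeric and already on comparable scales.

```python
from collections import Counter


def squared_distance(x, y):
    """Squared Euclidean distance; the square root isn't needed for ranking."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def knn_classify(train, query, k=1):
    """Scan every training instance, keep the k nearest, return the majority class."""
    nearest = sorted(train, key=lambda inst: squared_distance(inst[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented 2-dimensional data: a blue cluster, a white cluster, and one
# noisy "white" instance sitting inside the blue cluster.
train = [
    ((1.0, 1.0), "blue"),
    ((1.2, 0.8), "blue"),
    ((0.8, 1.1), "blue"),
    ((3.0, 3.0), "white"),
    ((3.2, 2.9), "white"),
    ((1.1, 1.0), "white"),  # noisy instance
]
query = (1.1, 1.05)

print("k=1:", knn_classify(train, query, k=1))
print("k=3:", knn_classify(train, query, k=3))
```

With k=1 the noisy instance happens to be the nearest one and decides the class; with k=3 it is outvoted by its two blue neighbors.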

There are more sophisticated data structures that can make this faster, so you don't need to scan
the whole dataset every time. It assumes all attributes are equally important. If that wasn't the case,
you might want to look at schemes for selecting or weighting attributes depending on their
importance. If we've got noisy instances, then we can use a majority vote over the k nearest
neighbors, or we might weight instances according to their prediction accuracy. Or, we might try to
identify reliable prototypes, one for each of the classes. This is a very old method. Statisticians have
used k-nearest-neighbor since the 1950's. There's an interesting theoretical result. If the number (n)
of training instances approaches infinity, and k also gets larger in such a way that k/n approaches 0,
but k also approaches infinity, the error of the k-nearest-neighbor method approaches the
theoretical minimum error for that dataset. There is a theoretical guarantee that with a huge dataset
and large values of k, you're going to get good results from nearest neighbor learning. There's a
section in the text, Section 4.7 on Instance-based learning. This is the last lesson of Class 3. Off you
go and do the activity, and I'll see you in Class 4. Bye for now!

21 Questions

Hi! We've just finished Class 3, and here are some of the issues that arose. I have a list of them here,
so let's start at the top. Numeric precision in activities has caused a little bit of unnecessary angst. So
we've simplified our policy. In general, we're asking you to round your percentages to the nearest
integer. We certainly don't want you typing in those Some people are getting the wrong results in
Weka. One reason you might get the wrong results is that the random seed is not set to the default
value. Whenever you change the random seed, it stays there until you change it back or until you
restart Weka. Just restart Weka or reset the random seed to 1. Another thing you should do is check
your version of Weka. We asked you to download 3.6.10. There have been some bug fixes since the
previous version, so you really do need to use this new version.

One of the activities asked you to copy an attribute, and some people found some surprising things
with Weka claiming 100% accuracy. If you accidentally ask Weka to predict something that's already
there as an attribute, it will do very well, with very high accuracy! It's very easy to mislead yourself
when you're doing data mining. You just need to make sure you know what you're trying to predict,
you know what the attributes are, and you haven't accidentally included a copy of the class attribute
as one of the attributes that's being used for prediction. There's been some discussion on the
mailing list about whether OneR is really always better than ZeroR on the training set. In fact, it is.
Someone proved it. (Thank you Jurek for sharing that proof with us.) Someone else found a
counterexample! "If we had a dataset with 10 instances, 6 belonging to Class A and 4 belonging to
Class B, with attribute values selected randomly, wouldn't ZeroR outperform OneR? -- OneR would
be fooled by the randomness of attribute values."

It's kind of anthropomorphic to talk about OneR being "fooled by" things. It's not fooled by anything.
It's not a person; it's not a being: it's just an algorithm. It just gets an input and does its thing with
the data. If you think that OneR might be fooled, then why don't you try it? Set up this dataset with
10 instances, 6 in A and 4 in B, select the attributes randomly, and see what happens. I think you'll
be able to convince yourself quite easily that this counterexample isn't a counterexample at all. It is
definitely true that OneR is always better than ZeroR on the training set. That doesn't necessarily
mean it's going to be better on an independent test set, of course. The next thing is Activity 3.3,
which asks you to repeat attributes with NaiveBayes.
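
If you'd like to try the OneR experiment without setting up an ARFF file, here is a rough Python sketch. It uses a simplified OneR for nominal attributes only (no numeric discretization or missing values, unlike Weka's), and the function names, the number of attributes and the attribute values are all made up for this illustration. On the training set, the simplified OneR can tie with ZeroR but never does worse, because predicting the majority class within each attribute value gets at least as many instances right as predicting one class everywhere.

```python
import random
from collections import Counter, defaultdict


def zero_r_accuracy(labels):
    """Training accuracy of always predicting the majority class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)


def one_r_accuracy(columns, labels):
    """Training accuracy of the best single-attribute rule."""
    best = 0
    for column in columns:
        by_value = defaultdict(Counter)
        for value, label in zip(column, labels):
            by_value[value][label] += 1
        # For each attribute value, predict its majority class.
        correct = sum(c.most_common(1)[0][1] for c in by_value.values())
        best = max(best, correct)
    return best / len(labels)


def random_trial(rng, n_attrs=3, values="xyz"):
    """10 instances, 6 in class A and 4 in class B, with random attribute values."""
    labels = ["A"] * 6 + ["B"] * 4
    columns = [[rng.choice(values) for _ in labels] for _ in range(n_attrs)]
    return zero_r_accuracy(labels), one_r_accuracy(columns, labels)


rng = random.Random(1)
results = [random_trial(rng) for _ in range(1000)]
print(all(one >= zero for zero, one in results))
```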

Some people asked "why are we doing this?" It's just an exercise! We're just trying to understand
NaiveBayes a bit better, and what happens when you get highly correlated attributes, like repeated
attributes. With NaiveBayes, enough repetitions mean that the other attributes won't matter at all.
This is because all attributes contribute equally to the decision, so multiple copies of an attribute
skew it in that direction. This is not true with other learning algorithms. It's true for NaiveBayes, but
it's not true for OneR or J48, for example. Copied attributes don't affect OneR at all. The copying
exercise is just to illustrate what happens with NaiveBayes when you have non-independent
attributes. It's not something you do in real life.

That said, you might copy an attribute in order to transform it in some way, for example. Someone
asked about the mathematics. In Bayes formula you get Pr[E|H]^k, if the attribute was repeated k
times, in the top line. How does this work mathematically? First of all, I'd just like to say that the
Bayes formulation assumes independent attributes. Bayes expansion is not true if the attributes are
dependent. But the algorithm works off that, so let's see what would happen. If you can stomach a
bit of mathematics, here's the equation for the probability of the hypothesis given the evidence
(Pr[H|E]). H might be Play is "yes" or Play is "no", for example, in the weather data. It's equal to this
fairly complicated formula at the top, which, let me just simplify it by writing "..." for all the bits after
here. So Pr[E1|H]^k, where E1 is repeated k times, times all the other stuff, divided by Pr[E]. What
the algorithm does: because we don't know Pr[E], we normalize the 2 probabilities by calculating
Pr[yes|E] using this formula and Pr[no|E], and normalizing them so that they add up to 1.

That then computes Pr[yes|E] as this thing here -- which is at the top, up here -- Pr[E1|yes]^k,
divided by that same thing, plus the corresponding thing for "no". If you look at this formula and just
forget about the "...", what's going to happen is that these probabilities are less than 1. If we take
them to the k'th power, they are going to get very small as k gets bigger. In fact, they're going to
approach 0. But one of them is going to approach 0 faster than the other one. Whichever one is
bigger -- for example, if the "yes" one is bigger than the "no" one -- then it's going to dominate. The
normalized probability then is going to be 1 if the "yes" probability is bigger than the "no"
probability, otherwise 0. That's what's actually going to happen in this formula as k approaches
infinity. The result is as though there is only one attribute: E1.
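
You can watch this happen numerically with a short Python sketch. All four probabilities here are hypothetical numbers chosen for illustration, not values from the weather data: the repeated attribute E1 slightly favors "no" (0.5 against 0.4), while the product of all the other terms strongly favors "yes" (0.3 against 0.03).

```python
def posterior_yes(k, e1_yes=0.4, e1_no=0.5, rest_yes=0.3, rest_no=0.03):
    """Normalized Pr[yes|E] when attribute E1 is repeated k times.

    e1_* are Pr[E1|yes] and Pr[E1|no]; rest_* stand for the "..." terms,
    that is, the other attributes' probabilities times the prior.
    All default values are hypothetical.
    """
    yes = e1_yes ** k * rest_yes
    no = e1_no ** k * rest_no
    return yes / (yes + no)


for k in (1, 10, 20, 100):
    print(k, round(posterior_yes(k), 4))
```

With one copy of E1 the other attributes win and "yes" is predicted. As k grows, the repeated attribute takes over and the probability of "yes" heads toward 0, just as if E1 were the only attribute.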

That's a mathematical explanation of what happens when you copy attributes in NaiveBayes. Don't
worry if you didn't follow that; that was just for someone who asked.

Decision trees and bits.
Someone said on the mailing list that in the lecture there was a condition that resulted in branches
with all "yes" or all "no" results completely determining things. Why was the information gain only
[0.971] and not the full 1 bit? This is the picture they were talking about. Here, "humidity"
determines these are all "no" and these are all "yes" for high and normal humidity, respectively.
When you calculate the information gain -- and this is the formula for information gain -- you get
0.971 bits. You might expect 1 (and I would agree), and you would get 1 if you had 3 no's and 3 yes's
here, or if you had 2 no's and 2 yes's. But because there is a slight imbalance between the number of
no's and the number of yes's, you don't actually get 1 bit under these circumstances.

There were
some questions on Class 2 about stratified cross-validation, which tries to get the same proportion of
class values in each fold. Some suggested maybe you should choose the number of folds so that it
can do this exactly, instead of approximately.
If you chose as the number of folds an exact divisor of the number of elements in each class, we'd be
able to do this exactly. "Would that be a good thing to do?" was the question. The answer is no, not
really. These things are all estimates, and you're treating them as though they were exact answers.
They are all just estimates. There are more important considerations to take into account when
determining the number of folds to do in your cross-validation. Like: you want a large enough test
set to get an accurate estimate of the classification performance, and you want a large enough
training set to train the classifier adequately. Don't worry about stratification being approximate.
The whole thing is pretty approximate actually. Someone else asked "why is there a 'Use training
set'" option on the Classify tab. It's very misleading to take the evaluation you get on the training
data seriously, as we know. So why is it there in Weka? Well, we might want it for some purposes.
For example, it does give you a quick upper bound on an algorithm's performance: it couldn't
possibly do better than it would do on the training set. That might be useful, allowing you to quickly
reject a learning algorithm. The important thing here is to understand what is wrong with using the
training set for a performance estimate, and what overfitting is. Rather than changing the interface
so you can't do bad things, I would rather protect you by educating you about what the issues are
here. There have been quite a few suggested topics for a follow-up course: attribute selection,
clustering, the Experimenter, parameter optimization, the KnowledgeFlow interface, and simple
command line interface.
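
Going back to the information-gain question answered earlier in this session, the 0.971 figure can be reproduced with a few lines of code. This is a sketch under an assumption: the lecture does not state the counts at that node, so the 3 "no" and 2 "yes" used here are assumed because they give the slight imbalance described and match the quoted number.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info_gain(parent, children):
    """Entropy of the parent minus the weighted entropy of the children."""
    total = sum(parent)
    remainder = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - remainder

# Counts are [yes, no]. Assumed node: 2 yes and 3 no, split perfectly by humidity.
gain_unbalanced = info_gain([2, 3], [[0, 3], [2, 0]])
# A perfectly balanced node (3 yes, 3 no), also split perfectly.
gain_balanced = info_gain([3, 3], [[0, 3], [3, 0]])
print(gain_unbalanced, gain_balanced)
```

Both splits are perfect, so the gain equals the entropy of the node before the split, and that is only a full bit when the two classes are equally frequent.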

We're considering a followup course, and we'll be asking you for feedback on that at the end of this
course. Finally, someone said "Please let me know if there is a way to make a small donation" -- he's
enjoying the course so much! Well, thank you very much. We'll make sure there is a way to make a
small donation at the end of the course. That's it for now. On with Class 4. I hope you enjoy Class 4,
and we'll talk again later. Bye for now!

22 Classification boundaries

Hello, again, and welcome to Data Mining with Weka, back here in New Zealand. In this class, Class
4, we're going to look at some pretty cool machine learning methods. We're going to look at linear
regression, classification by regression, logistic regression, support vector machines, and ensemble
learning. The last few of these are contemporary methods, which haven't been around very long.
They are kind of state-of-the-art machine learning methods. Remember, there are 5 classes in this
course, so next week is Class 5, the last class. We'll be tidying things up and summarizing things then.
You're well over halfway through; you're doing well. Just hang on in there. In this lesson, we're going
to start by looking at classification boundaries for different machine learning methods. We're going
to use Weka's Boundary Visualizer, which is another Weka tool that we haven't encountered yet. I'm
going to use a 2-dimensional dataset. I've prepared iris.2d.arff.

I took the regular iris dataset and deleted a couple of attributes -- sepallength and sepalwidth --
leaving me with this 2D dataset, and the class. We're going to look at that using the Boundary
Visualizer. You get that from this Visualization menu on the Weka Chooser. There are a lot of tools in
Weka, and we're just going to look at this one here, the Boundary Visualizer. I'm going to open the
same file in the Boundary Visualizer, the 2-dimensional iris dataset. Here we've got a plot of the
data. You can see that we're plotting petalwidth on the y-axis against petallength on the x-axis. This
is a picture of the dataset, with setosa in red, versicolor in green, and virginica in blue. I'm going to choose a classifier. Let's begin
with the OneR classifier, which is in rules. I'm going to "plot training data" and just going to let it rip.
The color diagram shows the decision boundaries, with the training data superimposed on it.
Let's look at what OneR does to this dataset in the Explorer. OneR has chosen to split on petalwidth.
If it's less than a certain amount, we get a setosa; if it's intermediate, we get a versicolor; and if it's
greater than the upper boundary, we get a virginica. It's the same as what's being shown here.
We're splitting on petalwidth. If it's less than a certain amount, we get a setosa; in the middle, a
versicolor; and at the top, a virginica. This is a spatial representation of the decision boundary that
OneR creates on this dataset. That's what the Boundary Visualizer does; it draws decision
boundaries. It shows here that OneR chooses an attribute -- in this case petalwidth -- to split on. It
might have chosen petallength, in which case we'd have vertical decision boundaries.
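
The stripes come from the fact that a OneR rule tests a single attribute. Here is a minimal sketch of such a rule; the function name and the two threshold values are illustrative assumptions, since the lecture only says "a certain amount".

```python
def one_r_rule(petallength, petalwidth, low=0.8, high=1.75):
    """A OneR-style rule on petalwidth only.

    low and high are hypothetical split points, not values from the lecture.
    petallength is accepted but never used, which is why the decision
    regions are horizontal stripes in the petallength/petalwidth plane.
    """
    if petalwidth < low:
        return "setosa"
    if petalwidth < high:
        return "versicolor"
    return "virginica"

# Moving along the x-axis (petallength) never changes the prediction.
for petallength in (1.0, 4.0, 7.0):
    print(petallength, one_r_rule(petallength, petalwidth=1.3))
```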

Either way, we're going to get stripes from OneR. I'm going to go ahead and look at some boundaries
for other schemes. Let's look at IBk, which is a "lazy" classifier. That's the instance-based learner we
looked at in the last class. I'm going to run that. Here we get a different kind of pattern. I'll just stop
it there. We've got diagonal lines. Down here are the setosas underneath this diagonal line; the
versicolors in the intermediate region; and the virginicas, by and large, in the top right-hand corner.
Remember what [IBk] does. It takes a test instance. Let's say we had an instance here, just on this
side of the boundary, in the red. Then it chooses the nearest instance to that.

That would be this one, I guess. That's kind of nearer than this one here. This is a red point. If I
were to cross over the boundary here, it would choose a green class, because this would be the
nearest instance then. If you think about it, this boundary goes halfway between this nearest red
point and this nearest green point. Similarly, if I take a point up here, I guess the two nearest
instances are this blue one and this green one. This blue one is closer. In this case, the boundary
goes along this straight line here. You can see that it's not just a single line: this is a piecewise linear
line, so this part of the boundary goes exactly halfway between these two points quite close to it.
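
To see why the boundary sits halfway between the two nearest points, here is a small nearest-neighbour sketch. The two training points and their coordinates are made up for illustration.

```python
from math import dist

def nearest_label(point, training):
    """Return the label of the training instance closest to point (1-NN)."""
    _, label = min(training, key=lambda item: dist(point, item[0]))
    return label

# Two hypothetical training instances: ((petallength, petalwidth), label).
training = [((1.0, 1.0), "red"), ((3.0, 2.0), "green")]

# The midpoint (2.0, 1.5) lies on the perpendicular bisector of the line
# joining the two points. Stepping off it towards either point flips the class.
print(nearest_label((1.8, 1.4), training))
print(nearest_label((2.2, 1.6), training))
```

With many training points, each pair of neighbouring points from different classes contributes one such bisector segment, which gives the piecewise linear boundary.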

Down here, the boundary goes exactly halfway between these two points. It's the perpendicular
bisector of the line joining these points. So, we get a piecewise linear boundary made up of little
pieces. It's kind of interesting to see what happens if we change the parameter: if we look at, say, 5
nearest neighbors instead of just 1. Now we get a slightly blurry picture, because whereas down
here in the pure red region the 5 nearest neighbors are all red, if we look in the intermediate region here, then the nearest
neighbors to a point here -- this is going to be in the 5, and this might be another one in the 5, and
there might be a couple more down here in the 5. So we get an intermediate color here, and IBk
takes a vote. If we had 3 reds and 2 greens, then we'd be in the red region and that would be
depicted as this darker red here.
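
The vote that produces the intermediate colors can be sketched as follows. The training points are hypothetical, arranged so that a query point near the boundary has 3 red and 2 green points among its 5 nearest neighbors.

```python
from collections import Counter
from math import dist

def knn_vote(point, training, k):
    """Fraction of the k nearest training instances that belong to each class."""
    nearest = sorted(training, key=lambda item: dist(point, item[0]))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: n / k for label, n in counts.items()}

# Hypothetical training data: ((petallength, petalwidth), label).
training = [
    ((1.0, 0.2), "red"), ((1.2, 0.3), "red"), ((1.4, 0.2), "red"),
    ((1.6, 0.9), "green"), ((1.8, 1.0), "green"), ((4.0, 1.3), "green"),
    ((5.0, 2.0), "blue"),
]

votes = knn_vote((1.5, 0.5), training, k=5)
print(votes)
```

The class with the largest fraction wins, and the fractions themselves are what the Boundary Visualizer blends into the in-between shades.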

If it had been the other way round with more greens than reds, we'd be in the green region. So
we've got a blurring of these boundaries. These are probabilistic descriptions of the boundary. Let
me just change k to 20 and see what happens. Now we get the same shape, but even more blurry
boundaries. The Boundary Visualizer reveals the way that machine learning schemes are thinking, if
you like: their internal representation of the dataset. It helps you think about the sorts of things
that machine learning methods do. Let's choose another scheme. I'm going to choose NaiveBayes.
When we talked about NaiveBayes, we only talked about discrete attributes. With continuous
attributes, I'm going to choose a supervised discretization method. Don't worry about this detail, it's
the most common way of using NaiveBayes with numeric attributes. Let's look at that picture.

This is interesting. When you think about NaiveBayes, it treats each of the two attributes as
contributing equally and independently to the decision. It sort of decides what it should be along this
dimension and decides what it should be along this dimension and multiplies the two together.
Remember the multiplication that went on in NaiveBayes. When you multiply these things together,
you get a checkerboard pattern of probabilities, multiplying up the probabilities. That's because the
attributes are being treated independently. That's a very different kind of decision boundary from
what we saw with instance-based learning. That's what's so good about the Boundary Visualizer: it
helps you think about how things are working inside. I'm going to do one more example.
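
The checkerboard can be reproduced with a toy version of the calculation. Everything numeric here is an assumption for illustration: each attribute is taken to be discretized into three bins, the per-bin conditional probabilities are invented, and the three classes are given equal prior probability.

```python
CLASSES = ["setosa", "versicolor", "virginica"]

# Hypothetical Pr[bin | class] for each attribute after discretization.
P_WIDTH = {
    "setosa": [0.90, 0.05, 0.05],
    "versicolor": [0.05, 0.80, 0.15],
    "virginica": [0.05, 0.15, 0.80],
}
P_LENGTH = {
    "setosa": [0.90, 0.05, 0.05],
    "versicolor": [0.05, 0.75, 0.20],
    "virginica": [0.05, 0.20, 0.75],
}

def class_probabilities(width_bin, length_bin):
    """Multiply the per-attribute probabilities, then normalize to sum to 1."""
    scores = {c: P_WIDTH[c][width_bin] * P_LENGTH[c][length_bin] for c in CLASSES}
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Print the most probable class for each cell, top row = largest petalwidth bin.
for width_bin in (2, 1, 0):
    row = []
    for length_bin in (0, 1, 2):
        probs = class_probabilities(width_bin, length_bin)
        row.append(max(probs, key=probs.get))
    print(row)
```

Because every cell's score is one factor from the row times one factor from the column, the picture is made of rectangular cells rather than diagonal or curved regions.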

I'm going to do J48, which is in trees. Here we get this kind of structure. Let's take a look at what
happens in the Explorer if we choose J48. We get this little decision tree: split first on petalwidth; if
it's less than 0.6 it's a setosa for sure. Then split again on petalwidth; if it's greater than 1.7, it's a
virginica for sure. Then, in between, split on petallength and then again on petalwidth, getting a
mixture of versicolors and virginicas. We split first on petalwidth; that's this split here. Remember
the vertical axis is the petalwidth axis. If it's less than a certain amount, it's a setosa for sure. Then
we split again on the same axis. If it's greater than a certain amount, it's a virginica for sure. If it's in
the intermediate region, we split on the other axis, which is petallength. Down here, it's a versicolor
for sure, and here we're going to split again on the petalwidth attribute. Let's change the
minNumObj parameter, which controls the minimum size of the leaves.
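
The tree just described can be written as nested tests, which makes the axis-parallel rectangles easy to see. This is a sketch: the 0.6 and 1.7 thresholds come from the lecture, but the petallength split, the final petalwidth split, the labels on the last two leaves, and the handling of values exactly on a boundary are all assumptions.

```python
def tree_predict(petallength, petalwidth, length_split=4.9, width_split=1.5):
    """Nested tests in the shape of the J48 tree described in the lecture.

    length_split and width_split are assumed values; the lecture does not
    give them. The labels of the last two leaves are also assumed.
    """
    if petalwidth <= 0.6:              # first split on petalwidth
        return "setosa"
    if petalwidth > 1.7:               # second split on the same axis
        return "virginica"
    if petallength <= length_split:    # intermediate region: split on petallength
        return "versicolor"
    # final split on petalwidth again (assumed leaf labels)
    return "virginica" if petalwidth <= width_split else "versicolor"

print(tree_predict(1.4, 0.2), tree_predict(4.0, 1.3), tree_predict(6.0, 2.3))
```

Every test compares one attribute with a constant, so each leaf corresponds to a rectangle in the plot.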

If we increase that, we're going to get a simpler tree. We discussed this parameter in one of the
lessons of Class 3. If we run now, then we get a simpler version, corresponding to the simpler rules
we get with this parameter set. Or we can set the parameter to a higher value, say 10, and run it
again. We get even simpler rules, very similar to the rules produced by OneR. We've looked at
classification boundaries. Classifiers create boundaries in instance space and different classifiers
have different capabilities for carving up instance space. That's called the "bias" of the classifier --
the way in which it's capable of carving up the instance space. We looked at OneR, IBk, NaiveBayes,
and J48, and found completely different biases, completely different ways they carve up the instance
space. Of course, this kind of visualization is restricted to numeric attributes and 2-dimensional
plots, so it's not a very general tool, but it certainly helps you think about these different classifiers.
You can read about classification boundaries in Section 17.3 of the course text. Now off you go and
do the activity associated with this lesson. Good luck! We'll see you later.

23 Linear regression

Hi! This is Lesson 4.2 on Linear Regression. Back in Lesson 1.3, we actually mentioned the difference
between a classification problem and a regression problem. A classification problem is when what
you're trying to predict is a nominal value, whereas in a regression problem what you're trying to
predict is a numeric value. We've seen examples of datasets with nominal and numeric attributes
before, but we've never looked at the problem of regression, of trying to predict a numeric value as
the output of a machine learning scheme. That's what we're doing in this [lesson], linear regression.
We've only had nominal classes so far, so now we're going to look at numeric classes. This is a
classical statistical method, dating back more than 2 centuries. This is the kind of picture you see.
You have a cloud of data points in 2 dimensions, and we're trying to fit a straight line to this cloud of
data points and looking for the best straight-line fit. Only in our case we might have more than 2
dimensions, there might be multiple dimensions.

It's still a standard problem. Let's just look at the 2-dimensional case here. You can write a straight
line equation in this form, with weights w0 plus w1a1 plus w2a2, and so on. Just think about this in
one dimension where there's only one "a". Forget about all the things at the end here, just consider
w0 plus w1a1. That's the equation of this line -- it's the equation of a straight line -- where w0 and
w1 are two constants to be determined from the data. This, of course, is going to work most
naturally with numeric attributes, because we're multiplying these attribute values by weights. We'll
worry about nominal attributes in just a minute. We're going to calculate these weights from the
training data -- w0, w1, and w2. Those are what we're going to calculate from the training data.
Then, once we've calculated the weights, we're going to predict the value for the first training
instance, a1. The notation gets really horrendous here. I know it looks pretty scary, but it's pretty
simple. We're using this linear sum with these weights that we've calculated, using the attribute
values of the first [training] instance in order to get the predicted value for that instance.

We're going to get predicted values for the training instances using this rather horrendous formula
here. I know it looks pretty scary, but it's actually not so scary. These w's are just numbers that we've
calculated from the training data, and then these things here are the attribute values of the first
training instance a1 -- that 1 at the top here means it's the first training instance. This 1, 2, 3 means
it's the first, second, and third attribute. We can write this in this neat little sum form here, which
looks a little bit better. Notice, by the way, that we're defining a0 -- the zeroth attribute value -- to
be 1. That just makes this formula work. For the first training instance, that gives us this number x,
the predicted value for the first training instance and this particular value of a1. Then we're choosing
the weights to minimize the squared error on the training data. This is the actual x value for this i'th
training instance. This is the predicted value for the i'th training instance. We're going to take the
difference between the actual and the predicted value, square them up, and add them all together.
And that's what we're trying to minimize.
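
The formulas on the slide are not reproduced in this transcript. Written out from the description above, they would look roughly like this, assuming k attributes and n training instances (the symbols k, n, and the "predicted" subscript are notation chosen here, not taken from the slide):

```latex
\[
x^{(1)}_{\text{predicted}}
  = w_0 a_0^{(1)} + w_1 a_1^{(1)} + w_2 a_2^{(1)} + \dots + w_k a_k^{(1)}
  = \sum_{j=0}^{k} w_j a_j^{(1)},
  \qquad a_0^{(1)} = 1
\]
\[
\text{choose } w_0, \dots, w_k \text{ to minimize }
\sum_{i=1}^{n} \Bigl( x^{(i)} - \sum_{j=0}^{k} w_j a_j^{(i)} \Bigr)^{2}
\]
```

The superscript is the instance number and the subscript is the attribute number, as in the lecture.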

We get the weights by minimizing this sum of squared errors. That's a mathematical job; we don't
need to worry about the mechanics of doing that. It's a standard matrix problem. It works fine if
there are more instances than attributes. You couldn't expect this to work if you had a huge number
of attributes and not very many instances. But providing there are more instances than attributes --
and usually there are, of course -- that's going to work ok. If we did have nominal values, if we just
have a 2-valued (binary) attribute, we could just convert it to 0 and 1 and use those numbers. If we
have multi-valued nominal attributes, you'll have a look at that in the activity at the end of this
lesson. We're going to open a regression dataset and see what it does: cpu.arff.
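
For the one-attribute case, the weights that minimize the squared error have a simple closed form. The sketch below uses that textbook formula on made-up data; it is not how Weka's LinearRegression is implemented, which handles many attributes at once.

```python
def fit_line(a, x):
    """Least-squares w0 and w1 for x = w0 + w1 * a with a single attribute."""
    n = len(a)
    mean_a = sum(a) / n
    mean_x = sum(x) / n
    covariance = sum((ai - mean_a) * (xi - mean_x) for ai, xi in zip(a, x))
    variance = sum((ai - mean_a) ** 2 for ai in a)
    w1 = covariance / variance
    w0 = mean_x - w1 * mean_a
    return w0, w1

# Hypothetical data lying exactly on the line x = 1 + 2a.
w0, w1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(w0, w1)
```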

This is a regular kind of dataset. It's got numeric attributes, and the most important thing here is that
it's got a numeric class -- we're trying to predict a numeric value. We can run LinearRegression; it's in
the functions category. We just run it, and this is the output. We've got the model here. The class
has been predicted as a linear sum. These are the weights I was talking about. It's this weight times
this attribute value plus this weight times this attribute value, and so on. Minus -- and this is w0, the
constant weight, not modified by an attribute. This is a formula for computing the class. When you
use that formula, you can look at the success of it in terms of the training data. The correlation
coefficient, which is a standard statistical measure, is 0.9.

That's pretty good. Then there are various other error figures here that are printed. On the slide, you
can see the interpretation of these error figures. It's really hard to know which one to use. They all
tend to produce the same sort of picture, but I guess the exact one you should use depends on the
application. There's the mean absolute error and the root mean squared error, which is the standard
metric to use. That's linear regression. I'm actually going to look at nonlinear regression here. A
"model tree" is a tree where each leaf has one of these linear regression models. We create a tree
like this, and then at each leaf we have a linear model, which has got those coefficients. It's like a
patchwork of linear models, and this set of 6 linear patches approximates a continuous function.
There's a method under "trees" with the rather mysterious name of M5P. If we just run that, that
produces a model tree. Maybe I should just visualize the tree.
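
The two error figures named above, the mean absolute error and the root mean squared error, are straightforward to compute by hand. A minimal sketch on invented numbers:

```python
from math import sqrt

def mean_absolute_error(actual, predicted):
    """Average size of the errors, ignoring their sign."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Square root of the average squared error; large errors count for more."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical actual and predicted class values.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 40]
mae = mean_absolute_error(actual, predicted)
rmse = root_mean_squared_error(actual, predicted)
print(mae, rmse)
```

The root mean squared error is never smaller than the mean absolute error, and the gap widens when a few predictions are badly off.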

Now I can see the model tree, which is similar to the one on the slide. You can see that each of these
-- in this case 5 -- leaves has a linear model -- LM1, LM2, LM3, ... And if we look back here, the linear
models are defined like this: LM1 has this linear formula; this linear formula for LM2; and so on. We
chose trees > M5P, we ran it, and we looked at the output. We could compare these performance
figures -- 92-93% correlation, mean absolute error of 30, and so on -- with the ones for regular linear
regression, which got a slightly lower correlation, and a slightly higher absolute error -- in fact, I think
all these error figures are slightly higher. That's something we'll be asking you to do in the activity
associated with this lesson. Linear regression is a well-founded, venerable mathematical technique.
Practical problems often require non-linear solutions. The M5P method builds trees of regression
models, with linear models at each leaf of the tree. You can read about this in the course text in
Section 4.6. Off you go now and do the activity associated with this lesson.

24 Classification by regression

Hi! Welcome back! In the last lesson, we looked at linear regression -- the problem of predicting, not
a nominal class value, but a numeric class value. The regression problem. In this lesson, we're going
to look at how to use regression techniques for classification. It sounds a bit weird, but regression
techniques can be really good under certain circumstances, and we're going to see if we can apply
them to ordinary classification problems. In a 2-class problem, it's quite easy really. We're going to
call the 2 classes 0 and 1 and just use those as numbers, and then come up with a regression line
that, presumably for most 0 instances has a pretty low value, and for most 1 instances has a larger
value, and then come up with a threshold for determining whether, if it's less than that threshold,
we're going to predict class 0; if it's greater, we're going to predict class 1. If we want to generalize
that to more than 2 classes, we could use a separate regression for each class. We set the output to 1 for instances that belong to the class, and 0
for instances that don't. Then come up with a separate regression line for each class, and given an
unknown test example, we're going to choose a class with the largest output. That would give us n
regressions for a problem where there are n different classes.
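
Choosing the class with the largest output looks like this in code. The weights below are invented for illustration; in practice each set would come from a regression trained with the target set to 1 for that class and 0 for the others.

```python
def linear_output(weights, attributes):
    """w0 + w1*a1 + w2*a2 + ...; weights[0] is the constant term w0."""
    return weights[0] + sum(w * a for w, a in zip(weights[1:], attributes))

def classify(models, attributes):
    """Evaluate one regression per class and return the class with the largest output."""
    outputs = {label: linear_output(w, attributes) for label, w in models.items()}
    return max(outputs, key=outputs.get)

# Hypothetical weights [w0, w1, w2] for a 3-class problem with 2 attributes.
models = {
    "setosa": [1.2, -0.2, -0.3],
    "versicolor": [0.1, 0.15, -0.2],
    "virginica": [-0.7, 0.1, 0.5],
}
print(classify(models, [1.4, 0.2]), classify(models, [6.0, 2.3]))
```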

We could alternatively use pairwise regression: take every pair of classes -- that's roughly n squared over 2 pairs --
and have a linear regression line for each pair of classes, discriminating an instance in one class of
that pair from the other class of that pair. We're going to work with a 2-class problem, and we're
going to investigate 2-class classification by regression. I'm going to open diabetes.arff. Then I'm
going to convert the class. Actually, let's just try to apply regression to this. I'm going to try
LinearRegression. You see it's grayed out here. That means it's not applicable. I can select it, but I
can't start it. It's not applicable because linear regression applies to a dataset where the class is
numeric, and we've got a dataset where the class is nominal.

We need to fix that. We're going to change this from these 2 labels to 0 and 1, respectively. We'll do
that with a filter. We want to change an attribute. It's unsupervised. We want to change a nominal
to a binary attribute, so that's the NominalToBinary filter. We want to apply that to the 9th
attribute. The default will apply it to all the attributes, but we just want to apply it to the 9th
attribute. I'm hoping it will change this attribute from nominal to binary. Unfortunately, it doesn't. It
doesn't have any effect, and the reason it doesn't have any effect is because these attribute filters
don't work on the class value. I can change the class value; we're going to give this "No class", so
now this is not the class value for the dataset. Run the filter again. Now I've got what I want: this
attribute "class" is either 0 or 1. In fact, this is the histogram -- there are this number of 0's and this
number of 1's, which correspond to the two different values in the original dataset. Now, we've got
our LinearRegression, and we can just run it.

This is the regression line. It's a line, 0.02 times the "pregnancy" attribute, plus this times the "plas"
attribute, and so on, plus this times the "age" attribute, plus this number. That will give us a number
for any given instance. We can see that number if we select "Output predictions" and run it again.
Here is a table of predictions for each instance in the dataset. This is the instance number; this is the
actual class of the instance, which is 0 or 1; this is the predicted class, which is a number --
sometimes it's less than 0. We would hope that these numbers are generally fairly small for 0's and
generally larger for 1's. They sort of are, although it's not really easy to tell. This is the error value
here in the fourth column. I'm going to do more extensive investigation, and you might ask why are
we bothering to do this? First of all, it's an interesting idea that I want to explore. It will lead to quite
good performance for classification by regression, and it will lead into the next lesson on logistic
regression, which is an excellent classification technique. Perhaps most importantly, we'll learn how
to do some cool things with the Weka interface. My strategy is to add a new attribute called
"classification" that gives this predicted number, and then we're going to use OneR to optimize a
split point for the two classes.

We'll have to restore the class back to its original nominal value, because, remember, I just
converted it to numeric. Here it is in detail. We're going to use a supervised attribute filter
[AddClassification]. This is actually pretty cool, I think. We're going to add a new attribute called
"classification". We're going to choose a classifier for that -- LinearRegression. We need to set
"outputClassification" to "True". If we just run this, it will add a new attribute to the dataset. It's
called "classification", and it's got these numeric values, which correspond exactly to the numeric
values that were predicted here by the linear regression scheme. Now, we've got this "classification"
attribute, and what I'd like to do now is to convert the class attribute back to nominal from numeric.
I want to use OneR now, and OneR will only work with a nominal class. Let me convert that. I want
NumericToNominal. I want to run that on attribute number 9.

Let me apply that, and now, sure enough, I've got the two labels 0 and 1. This is a nominal attribute
with these two labels. I'll be sure to make that one the class attribute. Then I get the colors back -- 2
colors for the 2 classes. Really, I want to predict this "class" based on the value of "classification",
that numeric value. I'm going to delete all the other attributes. I'm going to go to my Classify panel
here. I'm going to predict "class" -- this nominal value "class" -- and I'm going to use OneR. I think I'll
stop outputting the predictions because they just get in the way; and run that. It's 72-73%, and that's
a bit disappointing. But actually, when you look at this, OneR has produced this really overfitted rule.
We want a single split point. If it's less than this then predict 0, otherwise predict 1.
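
What "a single split point" means can be sketched directly: try each candidate threshold on the regression output and keep the one that classifies the most instances correctly. This is a simplified stand-in for what OneR does, not Weka's algorithm, and the values and labels are invented.

```python
def best_split(values, labels):
    """Best rule of the form 'predict 1 if value >= threshold, else 0'.

    Returns (threshold, accuracy). Candidate thresholds are the midpoints
    between consecutive distinct values.
    """
    ordered = sorted(set(values))
    candidates = [(lo + hi) / 2 for lo, hi in zip(ordered, ordered[1:])]
    best = None
    for threshold in candidates:
        correct = sum((v >= threshold) == bool(y) for v, y in zip(values, labels))
        accuracy = correct / len(values)
        if best is None or accuracy > best[1]:
            best = (threshold, accuracy)
    return best

# Hypothetical regression outputs and the true 0/1 classes.
values = [-0.1, 0.2, 0.3, 0.45, 0.5, 0.7, 0.9, 1.1]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
threshold, accuracy = best_split(values, labels)
print(threshold, accuracy)
```

A rule with one threshold cannot fit every training instance when the classes overlap, which is exactly what keeps it from overfitting.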

We can get around that by changing this "b" parameter, the minBucketSize parameter, to be
something much larger. I'm going to change it to 100 and run it again. Now I've got much better
performance, 77% accuracy, and this is the kind of split I've got: if the classification -- that is the
regression value -- is less than 0.47 I'm going to call it a 0; otherwise I'm going to call it a 1. So I've
got what I wanted, classification by regression. We've extended linear regression to classification.
This performance of 76.8% is actually quite good for this problem. It was easy to do with 2 classes, 0
and 1; otherwise you need to have a regression for each class -- multi-response linear regression --
or else for each pair of classes -- pairwise linear regression. We learnt quite a few things about
Weka. We learned about unsupervised attribute filters to convert nominal attributes to binary, and
numeric attributes back to nominal. We learned about this cool filter AddClassification, which adds
the classification according to a machine learning scheme as an attribute in the dataset. We learned
about setting and unsetting the class of the dataset, and we learned about the minimum bucket size
parameter to prevent OneR from overfitting. That's classification by regression. In the next lesson,
we're going to do better. We're going to look at logistic regression, an advanced technique which
effectively does classification by regression in an even more effective way. We'll see you soon.

25 Logistic regression

Hi! Welcome back to Data Mining with Weka. In the last lesson, we looked at classification by
regression, how to use linear regression to perform classification tasks. In this lesson we're going to
look at a more powerful way of doing the same kind of thing. It's called "logistic regression". It's
fairly mathematical, and we're not going to go into the dirty details of how it works, but I'd like to
give you a flavor of the kinds of things it does and the basic principles that underlie logistic
regression. Then, of course, you can use it yourself in Weka without any problem. One of the things
about data mining is that you can sometimes do better by using prediction probabilities rather than
actual classes. Instead of predicting whether it's going to be a "yes" or a "no", you might do better to
predict the probability with which you think it's going to be a "yes" or a "no". For example, the
weather is 95% likely to be rainy tomorrow, or 72% likely to be sunny, instead of saying it's definitely
going to be rainy or it's definitely going to be sunny. Probabilities are really useful things in data
mining. NaiveBayes produces probabilities; it works in terms of probabilities. We've seen that in an
earlier lesson. I'm going to open diabetes and run NaiveBayes. I'm going to use a percentage split
with 90%, so that leaves 10% as a test set.

Then I'm going to make sure I output the predictions on those 10%, and run it. I want to look at the
predictions that have been output. This is a 2-class dataset, the classes are tested_negative and
tested_positive, and these are the instances -- number 1, number 2, number 3, etc. This is the actual
class -- tested_negative, tested_positive, tested_negative, etc. This is the predicted class --
tested_negative, tested_negative, tested_negative, tested_negative, etc. This is a plus under the
error column to say where there's an error, so there's an error with instance number 2. These are
the actual probabilities that come out of NaiveBayes. So for instance 1 we've got a 99% probability
that it's negative, and a 1% probability that it's positive.

So we predict it's going to be negative; that's why that's tested_negative. And in fact we're correct; it
is tested_negative. This instance, which is actually incorrect, we're predicting 67% for
negative and 33% for positive, so we decide it's a negative, and we're wrong. We might have been
better saying that here we're really sure it's going to be a negative, and we're right; here we think it's
going to be a negative, but we're not sure, and it turns out that we're wrong. Sometimes it's a lot
better to think in terms of the output as probabilities, rather than being forced to make a binary,
black-or-white classification. Other data mining methods produce probabilities, as well. If I look at
ZeroR, and run that, these are the probabilities -- 65% versus 35%. Of course, it's ZeroR! -- it always
produces the same thing.

In this case, it always says tested_negative and always has the same probabilities. The reason why
the numbers are like that, if you look at the slide here, is that we've chosen a 90% training set and a
10% test set, and the training set contains 448 negative instances and 243 positive instances.
Remember the "Laplace Correction" in Lesson 3.2? -- we add 1 to each of those counts to get 449
and 244. That gives us a 65% probability for being a negative instance. That's where these numbers
come from. If we look at J48 and run that, then we get more interesting probabilities here -- the
negative and positive probabilities, respectively. You can see where the errors are.
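
The ZeroR probabilities quoted above follow directly from the training-set counts and the Laplace correction. A short check, using the counts given in the lecture (the function name is made up):

```python
def laplace_probabilities(counts):
    """Add 1 to every class count, then normalize so the results sum to 1."""
    total = sum(counts.values()) + len(counts)
    return {label: (n + 1) / total for label, n in counts.items()}

# Counts from the 90% training split described in the lecture.
probs = laplace_probabilities({"tested_negative": 448, "tested_positive": 243})
print(probs)
```

That is 449/693 for negative and 244/693 for positive, which round to the 65% and 35% that ZeroR reports for every test instance.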

These probabilities are all different. Internally, J48 uses probabilities in order to do its pruning
operations. We talked about that when we discussed J48's pruning, although I didn't explain
explicitly how the probabilities are derived. The idea of logistic regression is to make linear
regression produce probabilities, too. This gets a little bit hairy. Remember, when we use linear
regression for classification, we calculate a linear function using regression and then apply a
threshold to decide whether it's a 0 or a 1. It's tempting to imagine that you can interpret these
numbers as probabilities, instead of thresholding like that, but that's a mistake. They're not
probabilities. These numbers that come out on the regression line are sometimes negative, and
sometimes greater than 1.

They can't be probabilities, because probabilities don't work like that. In order to get better
probability estimates, a slightly more sophisticated technique is used. In linear regression, we have a
linear sum. In logistic regression, we have the same linear sum down here -- the same kind of linear
sum that we saw before -- but we embed it in this kind of formula. This is called a "logit transform".
A logit transform -- this is multi-dimensional with a lot of different a's here. If we've got just one
dimension, one variable, a1, then if this is the input to the logit transform, the output looks like this:
it's between 0 and 1. It's sort of an S-shaped curve that applies a softer function. Rather than just 0
and then a step function, it's a soft version of a step function that never gets below 0, never gets
above 1, and has a smooth transition in between. When you're working with a logit transform,
instead of minimizing the squared error (remember, when we do linear regression we minimize the
squared error), it's better to choose weights to maximize a probabilistic function called the "log-
likelihood function", which is this pretty scary looking formula down at the bottom.
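
The formulas on the slide are not reproduced in this transcript. As a rough sketch, assuming the usual form of the logistic function and of the log-likelihood for 0/1 classes, the two ideas look like this in Python (function and variable names are illustrative only):

```python
import math

def logistic(weights, bias, attributes):
    """Embed the linear sum in the S-shaped logistic function."""
    linear_sum = bias + sum(w * a for w, a in zip(weights, attributes))
    return 1.0 / (1.0 + math.exp(-linear_sum))

def log_likelihood(weights, bias, instances, labels):
    """Sum of the log-probabilities assigned to the true 0/1 labels."""
    total = 0.0
    for attributes, label in zip(instances, labels):
        p = logistic(weights, bias, attributes)
        total += label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total

# Tiny made-up dataset with one attribute: small values are class 0.
instances = [[-2.0], [-1.0], [1.0], [2.0]]
labels = [0, 0, 1, 1]
good = log_likelihood([1.0], 0.0, instances, labels)
bad = log_likelihood([-1.0], 0.0, instances, labels)
print(good > bad)  # weights that fit the data get the higher log-likelihood
```

Training then amounts to searching for the weights that make this log-likelihood as large as possible, rather than minimizing squared error.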

That's the basis of logistic regression. We won't talk about the details any more: let me just do it.
We're going to use the diabetes dataset. In the last lesson we got 76.8% with classification by
regression. Let me tell you if you do ZeroR, NaiveBayes, and J48, you get these numbers here. I'm
going to find the logistic regression scheme. It's in "functions", and called "Logistic". I'm going to use
10-fold cross-validation. I'm not going to output the predictions. I'll just run it -- and I get 77.2%
accuracy. That's the best figure in this column, though it's not much better than NaiveBayes, so you
might be a bit skeptical about whether it really is better. I did this 10 times and calculated the means
myself, and we get these figures for the mean of 10 runs. ZeroR stays the same, of course, at 65.1%;
it produces the same accuracy on each run. NaiveBayes and J48 are different, and here logistic
regression gets an average of 77.5%, which is appreciably better than the other figures in this
column. You can extend the idea to multiple classes. When we did this in the previous lesson, we
performed a regression for each class, a multi-response regression.

That actually doesn't work well with logistic regression, because you need the probabilities to sum to
1 over the various different classes. That introduces more computational complexity and needs to be
tackled as a joint optimization problem. The result is logistic regression, a popular and powerful
machine learning method that uses the logit transform to predict probabilities directly. It works
internally with probabilities, like NaiveBayes does. We also learned in this lesson about prediction
probabilities that can be obtained from other methods, and how to calculate probabilities from
ZeroR. You can read in the course text about logistic regression in Section 4.6. Now you should go
and do the activity associated with this lesson. See you soon.

26 Support vector machines

Hello again. In most courses, there comes a point where things start to get a little tough. In the last
couple of lessons, you've seen some mathematics that you probably didn't want to see, and you
might have realized that you'll never completely understand how all these machine learning
methods work in detail. I want you to know that what I'm trying to convey is the gist of modern
machine learning methods, not the details. What's important is that you can use them and that you
understand a little bit of the principles behind how they work. And the math is almost finished. So
hang in there; things will start to get easier -- and anyway, there's not far to go: just a few more
lessons. I told you before that I play music. Someone came round to my house last night with a
contrabassoon. It's the deepest, lowest instrument in the orchestra. You don't often see or hear one.
So, here I am, trying to play a contrabassoon for the first time. I think this has got to be the lowest
point of our course, Data Mining with Weka! Today I want to talk about support vector machines,
another advanced machine learning technique. We looked at logistic regression in the last lesson,
and we found that it produces linear boundaries in the space.

In fact, here I've used Weka's Boundary Visualizer to show the boundary produced by a logistic
regression machine -- this is on the 2D Iris data, plotting petalwidth against petallength. This black
line is the boundary between these classes, the red class and the green class. It might be more
sensible, if we were going to put a boundary between these two classes, to try and drive it through
the widest channel between the two classes, the maximum separation from each class. Here's a
picture where the black line now is right down the middle of the channel between the two classes.
Actually, mathematically, we can find that line by taking the two critical members, one from each
class -- they're called support vectors; these are the critical points that define the channel -- and take
the perpendicular bisector of the line joining those two support vectors.
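
Here is a small Python sketch of that perpendicular-bisector construction for the simplest case of exactly two support vectors. The two points are made up for illustration, not taken from the iris data.

```python
def perpendicular_bisector(sv_a, sv_b):
    """Return (w, bias) so that w.x + bias = 0 is the perpendicular
    bisector of the segment joining sv_a and sv_b."""
    # The normal vector points from sv_b towards sv_a.
    w = [a - b for a, b in zip(sv_a, sv_b)]
    midpoint = [(a + b) / 2 for a, b in zip(sv_a, sv_b)]
    # The boundary must pass through the midpoint.
    bias = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, bias

def side(w, bias, point):
    """+1 on sv_a's side, -1 on sv_b's side, 0 exactly on the boundary."""
    value = sum(wi * xi for wi, xi in zip(w, point)) + bias
    return 1 if value > 0 else -1 if value < 0 else 0

# Two hypothetical support vectors, one from each class.
w, bias = perpendicular_bisector((1.0, 1.0), (3.0, 3.0))
print(side(w, bias, (1.0, 1.0)), side(w, bias, (3.0, 3.0)), side(w, bias, (2.0, 2.0)))
```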

That's the idea of support vector machines. We're going to put a line between the two classes, but
not just any old line that separates them. We're trying to drive the widest channel between the two
classes. Here's another picture. We've got two clouds of points, and I've drawn a line around the
outside of each cloud -- the green cloud and the brown cloud. It's clear that any interior points aren't
going to affect this hyperplane, this plane, this separating line. I call it a line, but in multiple dimensions
it would be a plane, or a hyperplane in four or more dimensions. There's just a few of the points in
each cloud that define the position of the line: the support vectors. In this case, there are three
points. Support vectors define the boundary. The thing is that all the other instances in the training
data could be deleted without changing the position of the dividing hyperplane. There's a simple
equation and this is the last equation in this course. A simple equation that gives the formula for the
maximum margin hyperplane as a sum over the support vectors.
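
The equation itself is on the slide and not in this transcript. As a hedged sketch, assuming the standard form (a bias plus a sum, over the support vectors only, of coefficient times class times dot product), it can be written in Python as below. The support vectors are hypothetical, and the coefficients are worked out by hand for the two-vector case; finding them in general is the job of the training algorithm.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def decision_value(x, support_vectors, classes, alphas, bias):
    """bias + sum_i alpha_i * y_i * (sv_i . x), over support vectors only."""
    return bias + sum(
        alpha * y * dot(sv, x)
        for sv, y, alpha in zip(support_vectors, classes, alphas)
    )

# Two hypothetical support vectors, one per class (+1 and -1).
svs = [(1.0, 1.0), (3.0, 3.0)]
ys = [1, -1]
# With exactly two support vectors the coefficients have a closed form:
# alpha = 2 / squared distance, which puts the support vectors at +1 and -1.
diff = [p - q for p, q in zip(svs[0], svs[1])]
alpha = 2.0 / dot(diff, diff)
alphas = [alpha, alpha]
midpoint = [(p + q) / 2 for p, q in zip(svs[0], svs[1])]
bias = -dot([alpha * d for d in diff], midpoint)

print(decision_value(svs[0], svs, ys, alphas, bias))  # value at the first support vector
print(decision_value(svs[1], svs, ys, alphas, bias))  # value at the second support vector
```

The sign of the decision value gives the predicted class, and no training point other than the support vectors appears anywhere in the sum.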

These are kind of a vector product with each of the support vectors, and the sum there. It's pretty
simple to calculate this maximum margin hyperplane once you've got the support vectors. It's a very
easy sum, and, like I say, it only depends on the support vectors. None of the other points play any
part in this calculation. Now in real life, you might not be able to drive a straight line between the
classes. Classes are called "linearly separable" if there exists a straight line that separates the two
classes. In this picture, the two classes are not linearly separable. It might be a little hard to see, but
there are some blue points on the green side of the line, and a couple of green points on the blue
side of the line. It's not possible to get a single straight line that divides these points. That makes
support vector machines -- the mathematics -- a little more complicated. But it's still possible to
define the maximum margin hyperplane under these conditions.

That's it: support vector machines. It's a linear decision boundary. Actually, there's a really clever
technique which allows you to get more complex boundaries. It's called the "Kernel trick". By using
different formulas for the "kernel" -- and in Weka you just select from some possible different
kernels -- you can get different shapes of boundaries, not just straight lines. Support vector
machines are fantastic because they're very resilient to overfitting. The boundary just depends on a
very small number of points in the dataset. So, it's not going to overfit the dataset, because it
doesn't depend on almost all of the points in the dataset, just a few of these critical points -- the
support vectors.
So, it's very resilient to overfitting, even with large numbers of attributes. In Weka, there are a
couple of implementations of support vector machines. We could look in the "functions" category
for "SMO". Let me have a look at that over here. If I look in "functions" for "SMO", that implements
an algorithm called "Sequential Minimal Optimization" for training a support vector classifier. There
are a few parameters here, including, for example, the different choices of kernel. You can choose
different kernels: you can play around and try out different things. There are a few other
parameters. Actually, the SMO algorithm is restricted to two classes, so this will only work with a 2-
class dataset. There are other, more comprehensive, implementations of support vector machines in
Weka. There's a library called "LibSVM", an external library, and Weka has an interface to this
library. This is a wrapper class for the LibSVM tools. You need to download these separately from
Weka and put them in the right Java classpath. You can see that there are a lot of different
parameters here, and, in fact, a lot of information on this support vector machine package. That's
support vector machines. You can read about them in Section 6.4 of the textbook if you like, and
please go and do the associated activity. See you soon for the last lesson in this class.

27 Ensemble learning

Hello again! We're up to the last lesson in the fourth class, Lesson 4.6 on Ensemble Learning. In real
life, when we have important decisions to make, we often choose to make them using a committee.
Having different experts sitting down together, with different perspectives on the problem, and
letting them vote, is often a very effective and robust way of making good decisions. The same is
true in machine learning. We can often improve predictive performance by having a bunch of
different machine learning methods, all producing classifiers for the same problem, and then letting
them vote when it comes to classifying an unknown test instance. One of the disadvantages is that
this produces output that is hard to analyze. There are actually approaches that try and produce a
single comprehensible structure, but we're not going to be looking at any of those. So the output will
be hard to analyze, but you often get very good performance. It's a fairly recent technique in
machine learning. We're going to look at four methods, called "bagging", "randomization",
"boosting", and "stacking". They're all implemented in Weka, of course. With bagging, we want to
produce several different decision structures.

Let's say we use J48 to produce decision trees, then we want to produce slightly different decision
trees. We can do that by having several different training sets of the same size. We can get those by
sampling the original training set. In fact, in bagging, you sample the set "with replacement", which
means that sometimes you might get two of the same [instances] chosen in your sample. We
produce several different training sets, and then we build a model for each one -- let's say a decision
tree -- using the same machine learning scheme, or using some other machine learning scheme.
Then we combine the predictions of the different models by voting, or if it's a regression situation
you would average the numeric result rather than voting on it.
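
A minimal Python sketch of the two ingredients just described, sampling with replacement and combining predictions, might look like this. It is an illustration with made-up names, not Weka's Bagging implementation, and it leaves out the base learner itself.

```python
import random
from collections import Counter

def bootstrap_sample(instances, rng):
    """Draw len(instances) items with replacement, so repeats can occur."""
    return [rng.choice(instances) for _ in instances]

def majority_vote(predictions):
    """Combine class predictions from several models by voting."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Combine numeric predictions (regression) by averaging."""
    return sum(predictions) / len(predictions)

rng = random.Random(1)              # fixed seed so the run is repeatable
training_set = list(range(10))      # stand-in for 10 training instances
bags = [bootstrap_sample(training_set, rng) for _ in range(3)]
for bag in bags:
    print(sorted(bag))              # same size as the original, usually with repeats

print(majority_vote(["yes", "no", "yes"]))
print(average([1.0, 2.0, 3.0]))
```

In a full version you would train one model per bag and pass their predictions for a test instance to `majority_vote` or `average`.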

This is very suitable for learning schemes that are called "unstable". Unstable learning schemes are
ones where a small change in the training data can make a big change in the model. Decision trees
are a really good example of this. You can get a decision tree and just make a tiny little change in the
training data and get a completely different kind of decision tree. Whereas with NaiveBayes, if you
think about how NaiveBayes works, little changes in the training set aren't going to make much
difference to the result of NaiveBayes, so that's a "stable" machine learning method. In Weka we
have a "Bagging" classifier in the meta set. I'm going to choose meta > Bagging: here it is. We can
choose here the bag size -- this is saying a bag size of 100%, which is going to sample the training set
to get another set the same size, but it's going to sample "with replacement".

That means we're going to get different sets of the same size every time we sample, but each set
might contain repeats of the original training [instances]. Here we choose which classifier we want
to bag, and we can choose the number of bagging iterations here, and a random-number seed.
That's the bagging method. The next one I want to talk about is "random forests". Here, instead of
randomizing the training data, we randomize the algorithm. How you randomize the algorithm
depends on what the algorithm is. Random forests are when you're using decision tree algorithms.
Remember when we talked about how J48 works? -- it selects the best attribute for splitting on each
time. You can randomize this procedure by not necessarily selecting the very best, but choosing a
few of the best options, and randomly picking amongst them.
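
Taking that description at face value, the randomized choice can be sketched in Python as follows. The attribute scores are illustrative numbers, and Weka's RandomForest may well differ in detail from this simplified picture.

```python
import random

def pick_split_attribute(scores, top_k, rng):
    """Choose at random among the top_k best-scoring attributes."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return rng.choice(ranked[:top_k])

# Illustrative attribute scores (higher is better).
scores = {"outlook": 0.247, "humidity": 0.152, "windy": 0.048, "temperature": 0.029}

rng = random.Random(42)
print(pick_split_attribute(scores, 1, rng))   # top_k=1 is the usual "pick the best"
print(pick_split_attribute(scores, 2, rng))   # either of the two best attributes
```

Calling this with different random seeds at each node is what makes the trees in the forest differ from one another.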

That gives you different trees every time. Generally, if you bag decision trees, if you randomize them
and bag the result, you get better performance. In Weka, we can look under "trees" classifiers for
RandomForest. Again, that's got a bunch of parameters. The maximum depth of the trees produced --
I think 0 would be unlimited depth. The number of features we're going to use. We might select,
say 4 features; we would select from the top 4 features -- every time we decide on the decision to
put in the tree, we select that from among the top 4 candidates. The number of trees we're going to
produce, and so on. That's random forests. Here's another kind of algorithm: it's called "boosting".
It's iterative: new models are influenced by the performance of previously built models.

Basically, the idea is that you create a model, and then you look at the instances that are
misclassified by that model. These are the hard instances to classify, the ones it gets wrong. You put
extra weight on those instances to make a training set for producing the next model in the iteration.
This encourages the new model to become an "expert" for instances that were misclassified by all
the earlier models. The intuitive justification for this is that in a real life committee, committee
members should complement each other's expertise by focusing on different aspects of the
problem. In the end, to combine them we use voting, but we actually weight models according to
their performance. There's a very good scheme called AdaBoostM1, which is in Weka and is a
standard and very good boosting implementation -- it often produces excellent results.
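
The lesson doesn't give the update rule, so the following Python sketch assumes the commonly described AdaBoost.M1 scheme: shrink the weights of correctly classified instances by e/(1-e), renormalize, and give the model a vote of -log(e/(1-e)), where e is its weighted error. Treat it as an illustration rather than Weka's exact code.

```python
import math

def boost_weights(weights, correct):
    """One AdaBoost.M1-style reweighting step.

    weights: current instance weights; correct: True where the model was right.
    Returns (new_weights, model_vote_weight).
    """
    total = sum(weights)
    error = sum(w for w, ok in zip(weights, correct) if not ok) / total
    if error == 0 or error >= 0.5:
        raise ValueError("boosting stops when the error is 0 or at least 0.5")
    factor = error / (1 - error)          # below 1, so correct instances shrink
    updated = [w * factor if ok else w for w, ok in zip(weights, correct)]
    scale = total / sum(updated)          # keep the total weight unchanged
    return [w * scale for w in updated], -math.log(factor)

# Four equally weighted instances; the model gets the last one wrong.
new_weights, vote = boost_weights([0.25, 0.25, 0.25, 0.25], [True, True, True, False])
print(new_weights)   # the misclassified instance now carries half the total weight
print(vote)          # more accurate models get a larger vote
```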

There are a few parameters to this as well; particularly the number of iterations. The final ensemble
learning method is called "stacking". Here we're going to have base learners, just like the learners
we talked about previously. We're going to combine them not with voting, but by using a meta-
learner, another learner scheme that combines the output of the base learners. We're going to call
the base learners level-0 models, and the meta-learner is a level-1 model. The predictions of the
base learners are input to the meta-learner. Typically you use different machine learning schemes as
the base learners to get different experts that are good at different things. You need to be a little bit
careful in the way you generate data to train the level-1 model: this involves quite a lot of cross-
validation, I won't go into that. In Weka, there's a meta classifier called "Stacking", as well as
"StackingC" -- which is a more efficient version of Stacking.

Here is Stacking; you can choose different meta-classifiers here, and the number of stacking folds.
We can choose different classifiers; different level-0 classifiers, and a different meta-classifier. In
order to create multiple level-0 models, you need to specify a meta-classifier as the level-0 model. It
gets a little bit complicated; you need to fiddle around with Weka to get that working. That's it then.
We've been talking about combining multiple models into ensembles to produce an ensemble for
learning, and the analogy is with committees of humans. Diversity helps, especially when learners
are unstable. And we can create diversity in different ways. In bagging, we create diversity by
resampling the training set. In random forests, we create diversity by choosing alternative branches
to put in our decision trees. In boosting, we create diversity by focusing on where the existing model
makes errors; and in stacking, we combine results from a bunch of different kinds of learner using
another learner, instead of just voting. There's a chapter in the course text on Ensemble learning --
it's quite a large topic, really. There's an activity that you should go and do before we proceed to the
next class, the last class in this course. We'll learn about putting it all together, taking a more global
view of the machine learning process. We'll see you then.
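
To make the stacking arrangement from this lesson concrete, here is a toy Python sketch in which the predictions of two level-0 models become the input to a level-1 model. Everything in it is hypothetical: the level-0 models are fixed rules, the level-1 "learner" is just a lookup table, and, for brevity, the level-1 data is built from the same instances rather than through the cross-validation that the lesson says is needed in practice.

```python
from collections import Counter, defaultdict

def level1_rows(level0_models, instances):
    """Turn each instance into the tuple of level-0 predictions."""
    return [tuple(model(x) for model in level0_models) for x in instances]

def train_meta_learner(rows, labels):
    """A deliberately tiny level-1 model: for each combination of level-0
    predictions, remember the most common true class."""
    seen = defaultdict(Counter)
    for row, label in zip(rows, labels):
        seen[row][label] += 1
    table = {row: counts.most_common(1)[0][0] for row, counts in seen.items()}
    default = Counter(labels).most_common(1)[0][0]
    return lambda row: table.get(row, default)

# Two hypothetical level-0 models on a single numeric attribute.
level0 = [lambda x: "a" if x < 5 else "b", lambda x: "a" if x < 3 else "b"]
xs = [1, 2, 4, 6, 7]
ys = ["a", "a", "b", "b", "b"]

meta = train_meta_learner(level1_rows(level0, xs), ys)
prediction = meta(level1_rows(level0, [4])[0])
print(prediction)  # the level-1 model resolves the disagreement between the base models
```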

28 Class 4 Questions

Hi! Well, it's summertime here in New Zealand. Summer's just arrived, and, as you can see, I'm
sitting outside for a change of venue. This is Class 5 of the MOOC -- the last class! Here are a few
comments on Class 4, some issues that came up. We had a couple of errors in the activities; we
corrected those pretty quickly. Some of the activities are getting harder -- you will have noticed that!
But I think if you're doing the activities you'll be learning a lot. You learn a lot through doing the
activities, so keep it up! And the Class 5 activities are much easier. There was a question about
converting nominal variables to numeric in Activity 4.2. Someone said the result of the supervised
nominal binary filter was weird. Yes, well, it is a little bit weird. If you click the "More" button for
that filter, it says that k-1 new binary attributes are generated in the manner described in this book
(if you can get hold of it). Let me just tell you a little bit more about this. I've come up with an
example of a nominal attribute called "fruit", and it has 3 values: orange, apple, and banana. In this
dataset, the class is "juicy"; it's a numeric measure of juiciness.

I don't know about where you live, but in New Zealand oranges are juicier than apples, and apples
are juicier than bananas. I'm assuming that in this dataset, if you average the juiciness of all the
instances where the fruit attribute equals orange you get a larger value than if you do this with all
the instances where the fruit attribute equals apple, and that's larger than for banana. That sort of
orders these values. Let's consider ways of making "fruit" into a set of binary attributes. The simplest
method, and the one that's used by the unsupervised conversion filter, is Method 1 here. We create
3 new binary attributes; I've just called them "fruit=orange", "fruit=apple", and "fruit=banana". The
first attribute value is 1 if it's an orange and 0 otherwise.

The second attribute, "fruit=apple", is 1 if it's an apple and 0 otherwise, and the same for banana. Of
course, of these three binary attributes, exactly one of them has to be "1" for any instance. Here's
another way of doing it, Method 2. We take each possible subset: as well as "orange", "apple" and
"banana", we have another binary variable for "orange_or_apple", another for "orange_or_banana",
and another for "apple_or_banana". For example, if the value of fruit was "orange", then the first
attribute ("fruit=orange") would be 1, the fourth attribute ("orange_or_apple") would be 1, and the
fifth attribute ("orange_or_banana") would be 1. All of the others would be 0. This effectively
creates a binary attribute for each subset of possible values of the "fruit" attribute.

Actually, we don't create one for the empty subset or the full subset (with all 3 of the values in). We
get 2^k-2 new attributes for a k-valued attribute. That's impractical in general, because 2^k grows very fast
as k grows. The third method is the one that is actually used, and this is the one that's described in
that book. We create 2 new attributes (k-1, in general, for a k-valued attribute):
"fruit=orange_or_apple" and "fruit=apple". For oranges, the first attribute is 1 and the second is 0;
for apples, they're both 1; and for bananas, they're both 0. That's assuming this ordering of class
values: orange is largest in juiciness, and banana is smallest in juiciness. There's a theorem that, if
you're making a decision tree, the best way of splitting a node for a nominal variable with k values is
one of the k-1 positions -- well, you can read this. In fact, this theorem is reflected in Method 3.

That is the best way of splitting these attribute values. Whether it's a good thing in practice or not,
well, I don't know. You should try it and see. Perhaps you can try Method 3 for the supervised
conversion filter and Method 1 for the unsupervised conversion filter and see which produces the
best results on your dataset. Weka doesn't implement Method 2, because the number of attributes
explodes with the number of possible values, and you could end up with some very large datasets.
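
Methods 1 and 2 are easy to sketch in Python, which also makes the 2^k-2 growth visible. The attribute naming below follows the example in this answer; the functions are illustrative and are not Weka's filters. Method 3 is left out because it depends on the class-based ordering of the values.

```python
from itertools import combinations

def method1(values, value):
    """One binary attribute per value."""
    return {f"fruit={v}": int(v == value) for v in values}

def method2(values, value):
    """One binary attribute per non-empty proper subset of the values."""
    row = {}
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            row["fruit=" + "_or_".join(subset)] = int(value in subset)
    return row

fruits = ["orange", "apple", "banana"]
print(method1(fruits, "orange"))
print(method2(fruits, "orange"))

# The number of Method 2 attributes is 2^k - 2, which explodes as k grows.
for k in (3, 5, 10):
    print(k, len(method2([f"v{i}" for i in range(k)], "v0")))
```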
The next question is about simulating multiresponse linear regression: "Please explain!" Well, we're
looking at a Weka screen like this. We're running linear regression on the iris dataset where we've
mapped the values so that the class for any Virginica instance is 1 and 0 for the others. We've done it
with this kind of configuration. This is the default configuration of the makeIndicator filter. It's
working on the last attribute -- that's the class.

In this case, the value index is last, which means we're looking at the last value, which, in fact, is
Virginica. We could put a number here to get the first, second, or third values. That's how we get the
dataset, and then we run linear regression on this to get a linear model. Now, I want to look at the
output for the first 4 instances. We've got an actual class of 1, 1, 0, 0 and the predicted value of
these numbers. I've written those down in this little table over here: 1, 1, 0, 0 and these numbers.
That's for the dataset where all of the Virginicas are mapped to 1 and the other irises are mapped to
0. When we do the corresponding mapping with Versicolors, we get this as the actual class -- we just
run Weka and look at what appeared on the screen -- and this is the predicted value. We get these
for Setosa. So, you can see that the first instance is actually a Virginica - 1, 0, 0. I've put in bold the
largest of these 3 numbers. This is the largest, 0.966, which is bigger than 0.117 and -0.065, so
multiresponse linear regression is going to predict Virginica for instance 1. It's got the largest value.
And that's correct. For the second instance, it's also a Virginica, and it's also the largest of the 3
values in its row. For the third instance, it's actually a Versicolor.
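
The decision rule used for the first instance, picking the class whose regression model gives the largest output, is a one-liner in Python. In this sketch 0.966 is the Virginica model's output as stated above; the transcript does not say which of the other two numbers belongs to Versicolor and which to Setosa, so that assignment is an assumption.

```python
def predict_class(outputs):
    """Pick the class whose regression model gives the largest output."""
    return max(outputs, key=outputs.get)

# Outputs of the three per-class linear models for instance 1.
# The split of 0.117 and -0.065 between the last two classes is assumed.
instance1 = {"Iris-virginica": 0.966, "Iris-versicolor": 0.117, "Iris-setosa": -0.065}
print(predict_class(instance1))  # Iris-virginica
```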

The actual output is 1 for the Versicolor model, but the largest prediction is still for the Virginica
model. It's going to predict Virginica for an iris that's actually Versicolor. That's going to be a mistake.
In the [fourth] case, it's actually a Setosa -- the actual column is 1 for Setosa -- and this is the largest
value in the row, so it's going to correctly predict Setosa. That's how multiresponse linear regression
works. "How does OneR use the rules it generates? Please explain!" Well, here's the rule generated
by OneR. It hinges on attribute 6. Of course, if you click the "Edit" button in the Preprocess panel,
you can see the value of this attribute for each instance. This is what we see in the Explorer when we
run OneR. You can see the predicted instances here. These are the predicted instances -- g, b, g, b, g,
g, etc. These are the predictions. The question is, how does it get these predictions. This is the value
of attribute 6 for instance 1. What the OneR code does is go through each of these conditions and
look to see if it's satisfied. Is 0.02 less than -0.2? -- no, it's not. Is it less than -0.01? -- no, it's not.

Is it less than 0.001? -- no, it's not. (It's surprisingly hard to get these right, especially when you've
got all of the other decimal places in the list here.) Is it less than 0.1? -- yes, it is. So rule 4 fires -- this
is rule 4 -- and predicts "g". I've written down here the number of the rule clause that fires. In this
case, for instance 2, the value of the attribute is -0.4, and that satisfies the first rule. So this satisfies
number 1, and we predict "b". And so on down the list. That's what OneR does. It goes through the
rule evaluating each of these clauses until it finds one that is true, and then it uses the corresponding
prediction as its output. Moving on to ensemble learning questions. There were some questions on
ensemble learning, about these ten OneR models. "Are these ten alternative ways of classifying the
data?" Well, in a sense, but they are used together: AdaBoost.M1 combines them. In practice you
don't just pick one of them and use that: AdaBoost combines these models inside itself -- the
predictions it prints are produced by its combined model.
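
Going back to the OneR question, the clause-by-clause evaluation described there can be sketched in Python as follows. The four thresholds and the predictions for clauses 1 and 4 come from the walk-through; the predictions for clauses 2 and 3 and the fall-through default are placeholders, since the transcript doesn't state them.

```python
def oner_predict(value, clauses, default):
    """Return (clause_number, prediction) for the first clause that fires."""
    for number, (threshold, prediction) in enumerate(clauses, start=1):
        if value < threshold:
            return number, prediction
    return None, default

# (threshold, prediction) pairs; clauses 2 and 3 use placeholder predictions.
clauses = [(-0.2, "b"), (-0.01, "g"), (0.001, "b"), (0.1, "g")]

print(oner_predict(0.02, clauses, default="b"))   # instance 1: clause 4 fires, predicts g
print(oner_predict(-0.4, clauses, default="b"))   # instance 2: clause 1 fires, predicts b
```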

The weights are used in the combination to decide how much weight to give each of these models.
And when Weka reports a certain accuracy, that's for the combined model. It's not the average; it's
not the best; it's combined in the way that AdaBoost combines them. That's all done internally in the
algorithm. I didn't really explain the details of how the algorithm works; you'll have to look that up, I
guess. The point is AdaBoostM1 combines these models for you. You don't have to think of them as
separate models. They're all combined by AdaBoostM1. Someone complained that we're supposed
to be looking for simplicity, and this seems pretty complicated. That's true. The real disadvantage of
these kinds of models, ensemble models, is that it's hard to look at the rules. It's hard to see inside
to see what they're doing. Perhaps you should be a bit wary of that. But they can produce very good
results. You know how to test machine learning methods reliably using cross-validation or whatever.
So, sometimes they're good to use.

"How does Weka make predictions? How can you use Weka to make predictions?" You can use the
"Supplied test set" option on the Classify panel to put in a test set and see the predictions on that.
Or, alternatively, there is a program -- if you can run Java programs -- there's a program here. This is
how you run it: "java weka.classifiers.trees.J48" with your ARFF data file, and you put question
marks there to indicate the class. Then you give it the model, which you've output from the Explorer.
You can look at how to do this on the Weka Wiki on the FAQ list: "using Weka to make predictions".
Can you bootstrap learning? Someone talked about some friends of his who were using training data
to train a classifier and using the results of the classification to create further training data, and
continuing the cycle -- kind of bootstrapping. That sounds very attractive, but it can also be unstable.
It might work, but I think you'd be pretty lucky for it to work well. It's a potentially rather unreliable
way of doing things -- believing the classifications on new data and using that to further train the
classifier. He also said these friends of his don't really look into the classification algorithm. I guess
I'm trying to tell you a little bit about how each classification algorithm works, because I think it
really does help to know that.

You should be looking inside and thinking about what's going on inside your data mining method. A
couple of suggestions of things not covered in this MOOC: FilteredClassifier and association rules,
the Apriori association rule learner. As I said before, maybe we'll produce a follow-up MOOC and
include topics like this in it. That's it for now. Class 5 is the last class. It's a short class. Go ahead and
do it. Please complete the assessments and finish off the course. It'll be open this week, and it'll
remain open for one further week if you're getting behind. But after that, it'll be closed. So, you
need to get on with it.

29 The data mining process

Hello again! This is the last class of Data Mining with Weka, and we're going to step back a little bit
and take a look at some more global issues with regard to the data mining process. It's a short class
with just four lessons: the data mining process, pitfalls and pratfalls, data mining and ethics, and
finally, a quick summary. Let's get on with Lesson 5.1. This might be your vision of the data mining
process. You've got some data or someone gives you some data. You've got Weka. You apply Weka
to the data, you get some kind of cool result from that, and everyone's happy. If so, I've got bad
news for you. It's not going to be like that at all. Really, this would be a better way to think about it.
You're going to have a circle; you're going to go round and round the circle. It's true that Weka is
important -- it's in the very middle of the circle here. It's going to be crucial, but it's only a small part
of what you have to do. Perhaps the biggest problem is going to be to ask the right kind of question.
You need to be answering a question, not just vaguely exploring a collection of data. Then, you need
to get together the data that you can get hold of that gives you a chance of answering this question
using data mining techniques. It's hard to collect the data.

You're probably going to have an initial dataset, but you might need to add some demographic data,
or some weather data, or some data about other stuff. You're going to have to go to the web and
find more information to augment your dataset. Then you'll merge all that together: do some
database hacking to get a dataset that contains all the attributes that you think you might need -- or
that you think Weka might need. Then you're going to have to clean the data. The bad news is that
real world data is always very messy. That's a long and painstaking process of looking around,
looking at the data, trying to understand it, trying to figure out what the anomalies are and whether
it's good to delete them or not. That's going to take a while. Then you're going to need to define
some new features, probably. This is the feature engineering process, and it's the key to successful
data mining. Then, finally, you're going to use Weka, of course. You might go around this circle a few
times to get a nice algorithm for classification, and then you're going to need to deploy the algorithm
in the real world. Each of these processes is difficult.

You need to think about the question that you want to answer. "Tell me something cool about this
data" is not a good enough question. You need to know what you want to know from the data. Then
you need to gather it. There's a lot of data around, like I said at the very beginning, but the trouble is
that we need classified data to use classification techniques in data mining. We need expert
judgements on the data, expert classifications, and there's not so much data around that includes
expert classifications, or correct results. They say that more data beats a clever algorithm. So rather
than spending time trying to optimize the exact algorithm you're going to use in Weka, you might be
better off employed in getting more and more data. Then you've got to clean it, and like I said
before, real data is very mucky. That's going to be a painstaking matter of looking through it and
looking for anomalies. Feature engineering, the next step, is the key to data mining. We'll talk about
how Weka can help you a little bit in a minute. Then you've got to deploy the result. Implementing it
-- well, that's the easy part. The difficult part is to convince your boss to use this result from this data
mining process that he probably finds very mysterious and perhaps doesn't trust very much. Getting
anything actually deployed in the real world is a pretty tough call.

The key technical part of all this is feature engineering, and Weka has a lot of [filters] that will help
with this. Here are just a few of them. It might be worthwhile defining a new feature, a new
attribute that's a mathematical expression involving existing attributes. Or you might want to modify
an existing attribute. With AddExpression, you can use any kind of mathematical formula to create a
new attribute from existing ones. You might want to normalize or center your data, or standardize it
statistically. Transform a numeric attribute to have a zero mean -- that's "center". Or transform it to
a given numeric range -- that's "normalize". Or give it a zero mean and unit variance, that's a
statistical operation called "standardization". You might want to take those numeric attributes and
discretize them into nominal values.
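
The three numeric transformations just mentioned are simple enough to sketch in plain Python. This is an illustration of the definitions, not Weka's filters; in particular, using the population standard deviation in `standardize` is an assumption, and a constant attribute would cause a division by zero here.

```python
import statistics

def center(values):
    """Shift the values so that their mean is zero."""
    mean = statistics.mean(values)
    return [v - mean for v in values]

def normalize(values, low=0.0, high=1.0):
    """Rescale the values linearly into the range [low, high]."""
    smallest, largest = min(values), max(values)
    return [low + (v - smallest) * (high - low) / (largest - smallest) for v in values]

def standardize(values):
    """Give the values zero mean and unit variance."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)   # population standard deviation (assumed)
    return [(v - mean) / sd for v in values]

data = [2.0, 4.0, 6.0, 8.0]
print(center(data))
print(normalize(data))
print(standardize(data))
```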

Weka has both supervised and unsupervised attribute discretization filters. There are a lot of other
transformations. For example, the PrincipalComponents transformation involves a matrix analysis of
the data to select the principal components in a linear space. That's mathematical, and Weka
contains a good implementation. RemoveUseless will remove attributes that don't vary at all, or vary
too much. Actually, I think we encountered that in one of our activities. Then, there are a couple of
filters that help you deal with time series, when your instances represent a series over time. You
probably want to take the difference between one instance and the next, or a difference with some
kind of lag -- one instance and the one 5 before it, or 10 before it. These are just a few of the filters
that Weka contains to help you with your feature engineering. The message of this lesson is that
Weka is only a small part of the entire data mining process, and it's the easiest part. In this course,
we've chosen to tell you about the easiest part of the process! I'm sorry about that. The other bits
are, in practice, much more difficult. There's an old programmer's blessing: "May all your problems
be technical ones". It's the other problems -- the political problems in getting hold of the data, and
deploying the result -- those are the ones that tend to be much more onerous in the overall data
mining process. So good luck! There's some stuff about this in the course text. Section 1.3 contains
information on Fielded Applications, all of which have gone through this kind of process in order to
get them out there and used in the field. There's an activity associated with this lesson. Off you go
and do it, and we'll see you in the next lesson.
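
As a footnote to the time-series filters mentioned in this lesson, here is a minimal Python sketch of taking the difference between one instance and an earlier one. The function name and the sample readings are hypothetical; it shows the idea only, not how Weka implements it.

```python
def lagged_difference(series, lag=1):
    """Return series[i] - series[i - lag] for every i that has a value `lag` steps back."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

readings = [10, 12, 15, 14, 18, 21, 20]     # hypothetical measurements, oldest first
print(lagged_difference(readings))          # difference with the previous instance
print(lagged_difference(readings, lag=5))   # difference with the instance 5 before
```

Note that the first `lag` instances have nothing to be compared with, so the differenced series is shorter than the original.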

30 Pitfalls and pratfalls

Hi! Welcome back for another few minutes in New Zealand. In the last lesson, Lesson 5.1, we
learned that Weka only helps you with a small part of the overall data mining process, the technical
part, which is perhaps the easy part. In this lesson, we're going to learn that there are many pitfalls
and pratfalls even in that part. Let me just define these for you. A "pitfall" is a hidden or unsuspected
danger or difficulty, and there are plenty of those in the field of machine learning. A "pratfall" is a
stupid and humiliating action, which is very easy to do when you're working with data. The first
lesson is that you should be skeptical. In data mining it's very easy to cheat. Whether you're cheating
consciously or unconsciously, it's easy to mislead yourself or mislead others about the significance of
your results. For a reliable test, you should use a completely fresh sample of data that has never
been seen before. You should save something for the very end, that you don't use until you've
selected your algorithm, decided how you're going to apply it, and the filters, and so on.

At the very, very end, having done all that, run it on some fresh data to get an estimate of how it will
perform. Don't be tempted to then change it to improve it so that you get better results on that
data. Always do your final run on fresh data. We've talked a lot about overfitting, and this is basically
the same kind of problem. Of course, you know not to test on the training set. We've talked about
that endlessly throughout this course. Data that's been used for development in any way is tainted.
Any time you use some data to help you make a choice of the filter, or the classifier, or how you're
going to treat your problem, then that data is tainted. You should be using completely fresh data to
get evaluation results. Leave some evaluation data aside for the very end of the process. That's the
first piece of advice. Another thing I haven't told you about in this course so far is missing values. In
real datasets, it's very common that some of the data values are missing.

They haven't been recorded. They might be unknown; we might have forgotten to record them; they
might be irrelevant. There are two basic strategies for dealing with missing values in a dataset. You
can omit instances where the attribute value is missing, or somehow find a way of omitting that
particular attribute in that instance. Or you can treat missing as a separate possible value. You need
to ask yourself, is there significance in the fact that a value is missing? They say that if you've got
something wrong with you and go to the doctor, and he does some tests on you: if you just record
the tests that he does -- not the results of the test, but just the ones he chooses to do -- there's a
very good chance that you can work out what's wrong with you just from the existence of the tests,
not from their results. That's because the doctor chooses tests intelligently. The fact that he doesn't
choose a test doesn't mean that that value is missing, or accidentally not there. There's huge
significance in the fact that he's chosen not to do certain tests. This is a situation where "missing"
should be treated as a separate possible value.

There's significance in the fact that a value is missing. But in other situations, a value might be
missing simply because a piece of equipment malfunctioned, or for some other reason -- maybe
someone forgot something. Then there's no significance in the fact that it's missing. Pretty well all
machine learning algorithms deal with missing values. In an ARFF file, if you put a question mark as a
data value, that's treated as a missing value. All methods in Weka can deal with missing values. But
they make different assumptions about them. If you don't appreciate this, it's easy to get misled. Let
me just take two simple and well known (to us) examples -- OneR and J48. They deal with missing
values in different ways. I'm going to load the nominal weather data and run OneR on it: I get 43%.
Let me run J48 on it, to get 50%. I'm going to edit this dataset by changing the value of "outlook" for
the first four "no" instances to "missing". That's how we do it here in this editor. If we were to write
this file out in ARFF format, we'd find that these values are written into the file as question marks.
Now, if we look at "outlook", you can see that it says here there are 4 missing values.
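
To make the two strategies for missing values concrete, here is a small Python sketch on the edited weather data. It keeps only the "outlook" attribute and the class, and builds a simple rule that predicts the majority class for each attribute value. This is a simplified stand-in written for illustration, not Weka's OneR, and the figures it prints are counts on the training data, not the evaluation results Weka reports.

```python
from collections import Counter, defaultdict

MISSING = "?"  # ARFF writes a missing value as a question mark

# (outlook, play) pairs from the 14-instance nominal weather data, in file order.
weather = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
    ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]

def blank_first_no_outlooks(data, how_many=4):
    """Set outlook to missing for the first `how_many` instances whose class is "no"."""
    edited, remaining = [], how_many
    for outlook, play in data:
        if play == "no" and remaining > 0:
            outlook, remaining = MISSING, remaining - 1
        edited.append((outlook, play))
    return edited

def one_attribute_rule(data):
    """Map each attribute value to the majority class among instances with that value."""
    by_value = defaultdict(Counter)
    for value, label in data:
        by_value[value][label] += 1
    return {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}

def training_accuracy(rule, data):
    """Return (number classified correctly, number of instances) for the given data."""
    correct = sum(1 for value, label in data if rule[value] == label)
    return correct, len(data)

edited = blank_first_no_outlooks(weather)
print(Counter(outlook for outlook, _ in edited))

# Strategy 1: omit the instances whose outlook is missing.
kept = [(o, p) for o, p in edited if o != MISSING]
print(training_accuracy(one_attribute_rule(kept), kept))

# Strategy 2: treat "missing" as a fourth possible value of outlook.
rule = one_attribute_rule(edited)
print(rule)
print(training_accuracy(rule, edited))
```

With "missing" treated as a value of its own, the rule can branch on it, and because the values were removed only from "no" instances, that branch is a perfect predictor. The way the data was edited has leaked the class into the attribute.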

If you count up these labels -- 2, 4, and 4 -- that's 10 labels. Plus another 4 that are missing, to make
the 14 instances. Let's go back to J48 and run it again. We still get 50%, the same result. Of course,
this is a tiny dataset, but the fact is
that the results here are not affected by the fact that a few of the values are missing. However, if we
run OneR, I get a much higher accuracy, a 93% accuracy. The rule that I've got is "branch on
outlook", which is what we had before I think. Here it says there are 4 possibilities: if it's sunny, it's a
yes; if it's overcast it's a yes; if it's rainy, it's a yes; and if it's missing, it's a no. Here, OneR is using the
fact that a value is missing as significant, as something you can branch on. Whereas if you were to
look at a J48 tree, it would never have a branch that corresponded to a missing value. It treats them
differently. This difference is very important to know and remember. The final thing I want to tell you about in this
lesson is the "no free lunch" theorem. There's no free lunch in data mining. Here's a way to illustrate
it. Suppose you've got a 2-class problem with 100 binary attributes. Let's say you've got a huge
training set with a million instances and their classifications in the training set.

The number of possible instances is 2^100, one for each combination of the 100 binary attribute
values. And you know 10^6 of them. So you don't
know the classes of 2^100 - 10^6 examples. Let me tell you that 2^100 - 10^6 is 99.999...% of 2^100.
There's this huge number of examples that you just don't know the classes of. How could you
possibly figure them out? If you apply a data mining scheme to this, it will figure them out, but how
could you possibly figure out all of those things just from the tiny amount of data that you've been
given? In order to generalize, every learner must embody some knowledge or assumptions beyond
the data it's given. Each learning algorithm implicitly provides a set of assumptions. The best way to
think about those assumptions is to think back to the Boundary Visualizer we looked at in Lesson 4.1.
You saw that different machine learning schemes are capable of drawing different kinds of
boundaries in instance space. These boundaries correspond to a set of assumptions about the sort of
decisions we can make. There's no universal best algorithm; there's no free lunch. There's no single
best algorithm. Data mining is an experimental science, and that's why we've been teaching you how
to experiment with data mining yourself.
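
The arithmetic behind the no-free-lunch illustration is easy to check with exact integers. The sketch below assumes 100 binary attributes (which is what the 2^100 in the text implies) and a training set of a million instances; the variable names are chosen for the example.

```python
from fractions import Fraction

n_attributes = 100        # binary attributes, so 2**100 distinct possible instances
n_known = 10**6           # instances whose class the training set tells us

n_possible = 2**n_attributes
n_unknown = n_possible - n_known

# Exact fractions avoid the rounding that would turn the answer into 1.0.
known_fraction = Fraction(n_known, n_possible)
unknown_fraction = Fraction(n_unknown, n_possible)

print(f"possible instances: {n_possible:.3e}")
print(f"fraction known:     {float(known_fraction):.3e}")
print(f"fraction unknown:   1 - {float(known_fraction):.3e}")
```

Even a million labelled examples cover a vanishingly small part of the instance space, so whatever a learner predicts for the rest must come from its built-in assumptions, not from the data alone.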

This is just a summary. Be skeptical: when people tell you about data mining results and they say
that it gets this kind of accuracy, then to be sure about that you want to have them test their
classifier on your new, fresh data that they've never seen before. Overfitting has many faces.
Different learning schemes make different assumptions about missing values, which can really
change the results. There is no universal best learning algorithm. Data mining is an experimental
science, and it's very easy to be misled by people quoting the results of data mining experiments.
That's it for now. Off you go and do the activity. We'll see you in the next lesson.

31 Data mining and ethics

Hi! Welcome to Lesson 5.3 of Data Mining with Weka. Before we start, I thought I'd show you where
I live. I told you before that I moved to New Zealand many years ago. I live in a place called Hamilton.
Let me just zoom in and see if we can find Hamilton in the North Island of New Zealand, around the
center of the North Island. This is where the University of Waikato is. Here is the university; this is
where I live. This is my journey to work: I cycle every morning through the countryside. As you can
see, it's really nice. I live out here in the country. I'm a sheep farmer! I've got four sheep, three in the
paddock and one in the freezer. I cycle in -- it takes about half an hour -- and I get to the university. I
have the distinction of being able to go from one week to the next without ever seeing a traffic light,
because I live out on the same edge of town as the university. When I get to the campus of the
University of Waikato, it's a very beautiful campus. We've got three lakes. There are two of the lakes,
and another lake down here. It's a really nice place to work!

So I'm very happy here. Let's move on to talk about data mining and ethics. In Europe, they have a
lot of pretty stringent laws about information privacy. For example, if you're going to collect any
personal information about anyone, a purpose must be stated. The information should not be
disclosed to others without consent. Records kept on individuals must be accurate and up to date.
People should be able to review data about themselves. Data should be deleted when it's no longer
needed. Personal information must not be transmitted to other locations. Some data is too sensitive
to be collected, except in extreme circumstances. This is true in some countries in Europe,
particularly Scandinavia. It's not true, of course, in the United States. Data mining is about collecting
and utilizing recorded information, and it's good to be aware of some of these ethical issues. People
often try to anonymize data so that it's safe to distribute for other people to work on, but
anonymization is much harder than you think. Here's a little story for you.

When Massachusetts released medical records summarizing every state employee's hospital record
in the mid-1990s, the Governor gave a public assurance that it had been anonymized by removing
all identifying information -- name, address, and social security number. He was surprised to receive
his own health records (which included a lot of private information) in the mail shortly afterwards!
People could be re-identified from the information that was left there. There's been quite a bit of
research done on re-identification techniques. For example, using publicly available records on the
internet, 50% of Americans can be identified from their city, birth date, and sex, and more still if you
include their zip code as well.
There was some interesting work done on a movie database. Netflix released a database of 100
million records of movie ratings. They got individuals to rate movies [on the scale] 1-5, and they had
a whole bunch of people doing this -- a total of 100 million records.

It turned out that you could identify 99% of people in the database if you knew their ratings for 6
movies and approximately when they saw them. Even if you only know their ratings for 2 movies,
you can identify 70% of people. This means you can use the database to find out the other movies
that these people watched. They might not want you to know that. Re-identification is remarkably
powerful, and it is incredibly hard to anonymize data effectively in a way that doesn't destroy the
value of the entire dataset for data mining purposes. Of course, the purpose of data mining is to
discriminate: that's what we're trying to do! We're trying to learn rules that discriminate one class
from another in the data -- who gets the loan? -- who gets a special offer? But, of course, certain
kinds of discrimination are unethical, not to mention illegal. For example, racial, sexual, and religious
discrimination is certainly unethical, and in most places illegal.

But it depends on the context. Sexual discrimination is usually illegal ... except for doctors. Doctors
are expected to take gender into account when they make their diagnoses. They don't
want to tell a man that he is pregnant, for example. Also, information that appears innocuous may
not be. For example, area codes -- zip codes in the US -- correlate strongly with race; membership of
certain organizations correlates with gender. So although you might have removed the explicit racial
and gender information from your database, it might still be inferred from other
information that's there. It's very hard to deal with data: it has a way of revealing secrets about itself
in unintended ways. Another ethical issue concerning data mining is that correlation does not imply
causation. Here's a classic example: as ice cream sales increase, so does the rate of drownings.
Therefore, ice cream consumption causes drowning?

Probably not. They're probably both caused by warmer temperatures -- people going to beaches.
What data mining reveals is simply correlations, not causation. Really, we want causation. We want
to be able to predict the effects of our actions, but all we can look at using data mining techniques is
correlation. To understand about causation, you need a deeper model of what's going on. I just
wanted to alert you to some of the issues, some of the ethical issues, in data mining, before you go
away and use what you've learned in this course on your own datasets: issues about the privacy of
personal information; the fact that anonymization is harder than you think; re-identification of
individuals from supposedly anonymized data is easier than you think; data mining and
discrimination -- it is, after all, about discrimination; and the fact that correlation does not imply
causation. There's a section in the textbook, Data mining and ethics, which you can read for more
background information, and there's a little activity associated with this lesson, which you should go
and do now. I'll see you in the next lesson, which is the last lesson of the course.

32 Summary

Hi! This is the last lesson in the course Data mining with Weka, Lesson 5.4 - Summary. We'll just have
a quick summary of what we've learned here. One of the main points I've been trying to convey is
that there's no magic in data mining. There's a huge array of alternative techniques, and they're all
fairly straightforward algorithms. We've seen the principles of many of them. Perhaps we don't
understand the details, but we've got the basic idea of the main methods of machine learning used
in data mining. And there is no single, universal best method. Data mining is an experimental
science. You need to find out what works best on your problem. Weka makes it easy for you. Using
Weka you can try out different methods, you can try out different filters, different learning methods.
You can play around with different datasets.

It's very easy to do experiments in Weka. Perhaps you might say it's too easy, because it's important
to understand what you're doing, not just blindly click around and look at the results. That's what
I've tried to emphasize in this course -- understanding and evaluating what you're doing. There are
many pitfalls you can fall into if you don't really understand what's going on behind the scenes. It's
not a matter of just blindly applying the tools in the workbench. We've stressed in the course the
focus on evaluation, evaluating what you're doing, and the significance of the results of the
evaluation. Different algorithms differ in performance, as we've seen. In many problems, it's not a
big deal.
The differences between the algorithms are really not very important in many situations, and you
should perhaps be spending more time on looking at the features and how the problem is described
and the operational context that you're working in, rather than stressing about getting the absolute
best algorithm. It might not make all that much difference in practice. Use your time wisely. There's
a lot of stuff that we've missed out. I'm really sorry I haven't been able to cover more of this stuff.
There's a whole technology of filtered classifiers, where you want to filter the training data, but not
the test data. That's especially true when you've got a supervised filter, where the results of the
filter depend on the class values of the training instances. You want to filter the training data, but
not the test data, or maybe take a filter designed for the training data and apply the same filter to
the test data without re-optimizing it for the test data, which would be cheating. You often want to
do this during cross-validation.
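
Here is a rough Python sketch of what "filter the training data and apply the same filter to the test data" means inside a cross-validation loop. The `Standardizer` class, the `folds` helper, and the sample values are all made up for the illustration; it is not Weka code, and a real filtered classifier would also train and test a classifier on the filtered folds.

```python
from statistics import mean, pstdev

class Standardizer:
    """A filter whose parameters are learned from training data only."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.std_ = pstdev(values) or 1.0   # fall back to 1.0 for a constant attribute
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.std_ for v in values]

def folds(items, k):
    """Yield (train, test) splits, taking every k-th item as the test fold."""
    for i in range(k):
        test = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test

values = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]   # hypothetical numeric attribute
for train, test in folds(values, k=3):
    scaler = Standardizer().fit(train)       # parameters come from the training fold
    train_filtered = scaler.transform(train)
    test_filtered = scaler.transform(test)   # same parameters, not re-fitted on test
    print(test, [round(v, 3) for v in test_filtered])
```

The point of the design is that `fit` never sees the test fold. Calling `fit` on the whole dataset before splitting, or again on the test fold, would let information about the test data leak into the model.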

The trouble in Weka is that you can't get hold of those cross-validation folds; it's all done internally.
Filtered classifiers are a simple way of dealing with this problem. We haven't talked about costs of
different decisions and different kinds of errors, but in real life different errors have different costs.
We've talked about optimizing the error rate, or the classification accuracy, but really, in most
situations, we should be talking about costs, not raw accuracy figures, and these are different things.
There's a whole panel in the Weka Explorer for attribute selection, which helps you select a subset of
attributes to use when learning, and in many situations it's really valuable, before you do any
learning, to select an appropriate small subset of attributes to use. There are a lot of clustering
techniques in Weka. Clustering is where you want to learn something even when there is no class
value: you want to cluster the instances according to their attribute values.

Association rules are another kind of learning technique where we're looking for associations
between attributes. There's no particular class, but we're looking for any strong associations
between any of the attributes. Again, that's another panel in the Explorer. Text classification. There
are some fantastic text filters in Weka which allow you to handle textual data as words, or as
characters, or n-grams (sequences of three, four, or five consecutive characters). You can do text
mining using Weka. Finally, we've focused exclusively on the Weka Explorer, but the Weka
Experimenter is also worth getting to know. We've done a fair amount of rather boring, tedious,
calculations of means and standard deviations manually by changing the random-number seed and
running things again. That's very tedious to do by hand. The Experimenter makes it very easy to do
this automatically. So, there's a lot more to learn, and I'm wondering if you'd be interested in an
Advanced Data Mining with Weka course. I'm toying with the idea of putting one on, and I'd like you
to let us know what you think about the idea, and what you'd like to see included. Let me just finish
off here with a final thought. We've been talking about data, data mining. Data is recorded facts, a
change of state in the world, perhaps.

That's the input to our data mining process, and the output is information, the patterns -- the
expectations -- that underlie that data: patterns that can be used for prediction in useful applications
in the real world. We're going from data to information. Moving up in the world of people, not
computers, "knowledge" is the accumulation of your entire set of expectations, all the information
that you have and how it works together -- a large store of expectations and the different situations
where they apply. Finally, I like to define "wisdom" as the value attached to knowledge. I'd like to
encourage you to be wise when using data mining technology. You've learned a lot in this course.
You've got a lot of power now that you can use to analyze your own datasets. Use this technology
wisely for the good of the world. That's my final thought for you. There is an activity associated with
this lesson, a little revision activity. Go and do that, and then do the final assessment, and we will
send you your certificate if you do well enough. Good luck! It's been good talking to you, and maybe
we'll see you in an advanced version of this course. Bye for now!