
• Attributes = columns

• Instances = rows (measurements)

Classifiers have default parameters; to change or override them, just click on the name of the classifier:

As you can see there are several options, which vary from classifier to classifier. To see what these classifiers can and can't do, you can click "More" and "Capabilities".

o "More" = gives a synopsis plus other info, such as what options you can use and what those options mean.

o "Capabilities" = tells you what classes or attributes you can use with this particular classifying algorithm.

So, for this one you can use nominal attributes, missing values, empty

nominal attributes etc.

This is particularly helpful, as some algorithms can't deal with missing values, so you can use this info to check whether your classifying algorithm can handle missing values etc.

You can change the parameters and save these settings rather than having to change them manually each time.

Generally, when you are building a machine learning problem, you have a set of examples that are

given to you that are labelled, let's call this A. Sometimes you have a separate set of examples not

intended to be used for training, let's call this B. This gives us the four options in Weka:

1. Use training set:

a. Build a model on file A, apply it to file A: This is testing on the training set. This is generally a bad idea from an evaluation point of view. It is like the exam containing exactly the same questions you practised on. Some classifiers (e.g. nearest neighbour) always get 100% on the training set.

2. Supplied test set:

a. Build a model on file A, apply it to file B: If you have a file B, this is the one you want

to do. But you don't always have a file B.

3. Cross-validation:

a. But data is expensive! And we could be using B for training instead of testing. So, what if I take file A and divide it into 5 equal chunks, T, U, V, W, X? Then I train on T, U, V, W and test on X. Then I train on T, U, V, X and test on W, and similarly hold out V, U and T in turn, and then average the accuracy results. Because I'm repeating the process 5 times, I get lower variance, and so my estimate of the accuracy of my classifier is likely to be closer to the truth. This would be 5-fold cross-validation. It also takes 5 times as long!

4. Percentage split:

a. Ugh. I hate that it takes 5 times as long, so why don't I do just one fold? Split the data into 80% training and 20% test. This is percentage split.
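The fold-splitting behind options 3 and 4 can be sketched with nothing but the Python standard library. This is an illustrative sketch, not Weka's code, and the `evaluate` callback is a hypothetical stand-in for training and testing a classifier:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds
    (the T, U, V, W, X chunks from the description above)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, k, evaluate):
    """Train on k-1 folds, test on the held-out fold, average the k accuracies."""
    folds = k_fold_indices(len(data), k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k

def percentage_split(n, train_pct=80, seed=0):
    """One-shot split: e.g. 80% of the indices for training, 20% for test."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    cut = n * train_pct // 100
    return idx[:cut], idx[cut:]
```

Note that Weka's own cross-validation additionally stratifies the folds by class; this sketch skips that.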

In the Classify tab there is a pull-down menu:

This is where you can choose which attribute you are trying to predict (using logistic regression as the classifier).

I clicked Start for the attribute "healthGrade" and got a message roughly saying "can't do numeric classes". So in WEKA, when you are getting your feet wet, this is one of the first things you will notice: not every algorithm can predict every kind of data.

So, this logistic regression algorithm won't predict a numeric class but will predict a nominal class.

So, relating all six data attributes, let's see what the first logistic regression formula starts out as.

It has taken the six attributes and assigned a weight to each of them; with all six, logistic regression initially predicted with 92.5% accuracy.

Interpreting that, I looked at the confusion matrix at the bottom and saw that it gets tripped up more on the successfuls, where it is 32 out of 36 correct, than on the unsuccessfuls, where it is 42 out of 44 correct.
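That 92.5% figure is just the confusion matrix's main diagonal divided by the total; a quick stdlib-Python check using the counts above:

```python
def accuracy_from_confusion(matrix):
    """Accuracy = sum of the main diagonal / sum of all cells."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# 32 of 36 successfuls and 42 of 44 unsuccessfuls correct:
m = [[32, 4],
     [2, 42]]
print(round(accuracy_from_confusion(m) * 100, 2))  # 92.5
```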

Now let's go back and see whether using only those two data points helps or hurts this logistic regression algorithm's prediction accuracy.

So, we have two independent variables, and based on how they relate to each other we are going to see how closely we can predict the dependent outcome variable.

Click start:

This time we get 98.75% prediction accuracy, and at the bottom we see that all 44 unsuccessful measurements were predicted correctly, and 35 of 36 successfuls were predicted correctly.

For this logistic regression algorithm, removing data points during pre-processing improves the algorithm, which is not always the case.

So, this is another one of those first-time-in-WEKA lessons: for some algorithms like this one it is inefficient to throw all the data you can at it, whereas for other algorithms, like the J48 decision tree, that is the more efficient thing to do.

Let's go to Classify and choose J48, which is a decision tree, and again ask hypothetically: what do I want this decision tree to predict or find out? Since logistic regression did not work with numeric classes, I figured I'd try to predict a numeric class this time.

So, I went to the pull-down menu and chose the decision tree to predict "hrv"; that is, if I give this decision tree algorithm all six data points, how accurately can it predict the seventh (hrv)? Or can it at all?

Like the logistic regression algorithm, this J48 decision tree algorithm does not work with numeric classes. J48 does work with nominal classes, so since I had three of those I ran a decision tree on each, just to get a little more familiar with WEKA. First I did "healthGrade":

As you can see, it predicted 100% accurately; to see how, I clicked visualise tree:

It looked at all six data points and picked hrv as the most relevant for a prediction; based on one question, in fact: was the hrv <= or > 69.5? It predicted 100% whether healthGrade was successful or unsuccessful. A small but neat thing as well: I graded success as above or below 70, and the decision tree here says, "you actually didn't have any values from 69.6 to 70, so 69.5 is your actual line in the sand to classify success". That can be a clearer way of thinking about a metric.
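That 69.5 comes from how decision-tree learners pick numeric thresholds. A common scheme (a simplified sketch, not necessarily Weka's exact rule) evaluates candidate cut points between adjacent sorted values where the class changes, illustrated here with made-up hrv values:

```python
def candidate_thresholds(values, labels):
    """Candidate numeric split points: midpoints between adjacent sorted
    values whose class labels differ (a common decision-tree scheme)."""
    pairs = sorted(zip(values, labels))
    cuts = []
    for (v1, c1), (v2, c2) in zip(pairs, pairs[1:]):
        if v1 != v2 and c1 != c2:
            cuts.append((v1 + v2) / 2)
    return cuts

# If the hrv values jump from 69 (unsuccessful) to 70 (successful),
# the only candidate threshold is 69.5:
print(candidate_thresholds([65, 69, 70, 75],
                           ["u", "u", "s", "s"]))  # [69.5]
```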

Next, I put "patternSleep" in the pull-down menu; again the decision tree got 100%, and I clicked visualise tree:

It shows it related dayID most to the outcome variable. Was the dayID more than 10? It could predict based on this one question. But the third nominal class variable was not as simple to predict.

Visualizing the decision tree shows what the model settled on. The formula started with the hoursAwake variable; further down it added the dayID variable, but at certain points I saw uncertainty. The tree is reflecting the confusion matrix. Let's look closer at the sequenceID variable to see why.

It looks nice and segmented, but it has overlap. I took four heart measurements each day: the 1st as I woke up, the 2nd two to five hours after that, the 3rd about two to five hours after that, and the last two or more hours after that, before I went to sleep. So the 2nd sequence can have a 5-hour measurement and the 3rd can have a 5-hour measurement; the 3rd can have a 7-hour and the 4th can have a 7-hour. So now you see why it is so easy to predict the 1st one and so difficult to predict the middle ones.

And along those lines the decision tree showed me another introductory WEKA lesson: after 18 days it was only 83% accurate, and I saw this decision tree algorithm improve to 86% with 8 more instances. So, we can assume that with more instances or more attributes this tree will have more potential branches and get more and more accurate.

Next was a neural network: choose MultilayerPerceptron, which is a neural network (see MultilayerPerceptron, think neural network). Again, to get my feet wet with neural networks I stepped back and considered a hypothetical: what could I want a neural network to predict or find out for me? I chose two things. First, predict the complex sequenceID variable which gave the decision tree trouble; second, predict a numeric class, hrv again, as the first two algorithms only work with nominal classes. Then again I asked whether I should remove some data points, as with logistic regression, or leave them all, as with the decision tree. It seemed that the more data points MultilayerPerceptron had the better, so I left all seven in. I looked at test options and chose percentage split this time. Testing at a 66% split means that it builds its initial model or formula using 53 of the 80 instances, which are heart measurements I took. After the initial model is built, it runs on the last third, or 27 instances, to test its accuracy.

So, let's see how accurate the algorithm was in predicting sequenceID on the set of 27 instances that it held out for testing.

It was 88%; the 1st sequence it got all right, but again with the middle ones there is uncertainty. And as you can see, the last one it got all right.

I looked up at the model it built and didn't quite understand it, but it was helpful nonetheless to see now a third unique algorithm that can be used in data analysis.

I was just trying to get an initial experience with WEKA, and thus was happy to see that MultilayerPerceptron is a neural network algorithm that can handle a numeric class such as hrv. I clicked Start, looked at the result, and noticed that a numeric class does not have an identical summary to a nominal class. Remember how I got 24 out of 27 correctly classified instances, or 89%? Well, here with a numeric class it seems less about correct vs incorrect. Instead of saying it got 0 out of 27 correct, the emphasis is more on how close it got.

Error statistics were in the summary, but here I went to More options so I could output predictions and see accuracy in another way.

So, then I re-ran it, and the output predictions showed what this neural network model predicted for those 27, side by side with the actual values. My first thought was that this is the type of thing I can paste into a spreadsheet for anyone to understand: say, needing to ask "are these predictions close enough?", because that is where we are at with the model. I thought of someone who doesn't know what root mean squared error is, saying "I see a close enough accuracy rate here between these two (RMSE and RAE), it's useful now", or maybe saying "no, it's not useful now".
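Those two statistics, mean absolute error and root mean squared error, can be computed directly from such a predicted-vs-actual listing. A stdlib-Python sketch with made-up numbers (not the values from this run):

```python
import math

def mae(actual, predicted):
    """Mean absolute error: the average size of the misses."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error: like MAE, but penalises big misses more."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

actual    = [70.0, 65.0, 72.0, 68.0]
predicted = [69.0, 66.0, 70.0, 68.0]
print(mae(actual, predicted))               # 1.0
print(round(rmse(actual, predicted), 3))    # 1.225
```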

Finally, with the neural network, to visualise the formula I turned the GUI on (right-click the name) and then re-ran it via Start.

I thought this visual was pretty self-explanatory, and you can see why they call it a neural network. At the bottom (Epoch etc.) those are events: with more events (click Start) this algorithm's error rate decreases over time (Error per Epoch gets smaller). So, I can plug in 50,000 events, for example, and it will get smarter.

Last, I looked at one more algorithm in WEKA called Support Vector Machine. Go to classifiers > functions and see there are SMO and SMOreg; both are based on support vector machines if you look at their properties. You will see that even though both are based on support vector machines, they are different algorithms; you can see this by clicking the name at the top after choosing it.

Let's choose SMOreg. Here, the reverse of before: this algorithm won't accept a nominal class but will accept a numeric class. Another introductory lesson: this doesn't mean a nominal class can't ever work with SMOreg. I may be able to edit something with a nominal class just by opening the data file, representing the data differently, and saving the variable this time as a numeric class, for example. Or I may want to run one of the many filters in WEKA. As a data miner you spend a lot of time changing data in ways that can make an algorithm more efficient.

Those are some examples of what's called data pre-processing, and data pre-processing is a big part of using WEKA. Now let's use the training set as is, without any cross-validation, and to see what it looks like let's check the output as we did before. Take note of the original order being intact and left as is. Now let's run 10-fold cross-validation.

Here are 10 consecutive folds of 8 instances each for my 80 instances. Let's compare their prediction accuracy.

With 10-fold cross-validation, the mean absolute error is 2.75% and the root mean squared error is 3.5%. How does that compare to the test without cross-validation?

Click it in the result list. Less error, but now I saw why it's less valid as a predictive model.

CLASS 1

02 Exploring the Explorer

https://www.youtube.com/watch?v=CV6dohykPhY&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=2

The graphs shown at the bottom right are like histograms of the attribute values in terms of the attribute we are trying to predict. Blue corresponds to positive and red to negative.

03 Exploring datasets

https://www.youtube.com/watch?v=qhumDUIh0Mo&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=3

By default, the last attribute in WEKA is always the class value; you can change it if you like, so you can decide to predict a different attribute.

The attributes we looked at in the weather data were discrete, or what we call nominal, attribute values, where they belonged to a certain fixed set. Or they can be numeric.

Here we are looking at a discrete class, as it can be "yes" or "no". Another type of machine learning problem can involve continuous classes where you are trying to predict a number, which is called a "regression problem" in the trade.

Now the attributes are numeric, whereas before they were nominal.

Opened glass.arff

04 Building a classifier

https://www.youtube.com/watch?v=_0OF2c8m56k&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=4

Hi! You probably learned a bit about flowers if you did the activity associated with the last lesson. Now, we're going to actually build a classifier: Lesson 1.4 Building a classifier. We're going to use a system called J48 (I'll tell you why it's called J48 in a minute) to analyse the glass dataset that we looked at in the last lesson.

I've got the glass dataset open here. I'm going to go to the Classify panel, and I choose a classifier here.

There are different kinds of classifiers, Weka has bayes classifiers, functions classifiers, lazy

classifiers, meta classifiers, and so on.

We're going to use a tree classifier, J48 is a tree classifier, so I'm going to open trees and click J48.

Here is the J48 classifier. Let's run it. If we just press start, we've got the dataset, we've got the

classifier, and lo and behold, it's done it. It's a bit of an anti-climax, really. Weka makes things very

easy for you to do. The problem is understanding what it is that you have done, so let's take a look.

Here is some information about the dataset: the glass dataset, the number of instances and attributes. Then it's printed out a representation of a tree; we'll look at these trees later on, but just note that this tree has got 30 leaves and 59 nodes altogether, and the overall accuracy is 66.8%. So, it's done pretty well.

Down at the bottom, we've got a confusion matrix, remember there were about seven different

kinds of glass. This is building windows made of float glass, you can see that 50 of these have been

classified as 'a', which is correctly classified, 15 of them have been classified as ‘b’ which is building

windows non-float glass, so those are errors, and 3 have been classified as 'c', and so on. This is a

confusion matrix. Most of the weight is down the main diagonal, which we like to see because that

indicates correct classifications. Everything off the main diagonal indicates a misclassification. That's

the confusion matrix.

Let's investigate this a bit further. We're going to open a configuration panel for J48. Remember I chose it by clicking the Choose button; now, if I click J48 here (the name), I get a configuration panel, which gives a bunch of parameters. I'm not going to really talk about these parameters. Let's just look at one of them, the unpruned parameter, which by default is False. What we've just done is to build a pruned tree, because unpruned is False. We can change this to True and build an unpruned tree. We've changed the configuration.

We can run it again. It just ran again, and now we have a potentially different result. Let's just have a

look. We have got 67% correct classification and what did we have before? These are the previous

runs, this is the previous run, and there we had 66% and now, in this run that we've just done with

the unpruned tree, we've got 67% accuracy, and the tree is the same size.

So that's one option, I'm just going to look at another option, and then we'll look at some trees. I'm

going to click the configuration panel again, and I'm going to change the minNumObj parameter.

What is that? That is the minimum number of instances per leaf, I'm going to change that from 2 up

to 15 to have larger leaves.

These are the leaves of the tree here, and these numbers in brackets are the number of instances

that get to the leaf. When there are 2 numbers, this means that 1 incorrectly classified instance got

to this leaf and 5 correctly classified instances got there.

You can see that all of these leaves are pretty small, sometimes with just 2 or 3 instances, though here is one with 31. We've now constrained this number: when the tree is generated, this number is always going to be 15 or more.
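The minNumObj constraint is a form of pre-pruning: a candidate split is rejected if it would leave branches that are too small. Here is a deliberately simplified, hypothetical stdlib-Python sketch of that check (not Weka's actual code, whose rule differs in detail):

```python
def split_allowed(groups, min_num_obj=15):
    """Reject a candidate split if any resulting branch would hold
    fewer than min_num_obj instances (in the spirit of J48's minNumObj)."""
    return all(len(g) >= min_num_obj for g in groups)

# A split producing branches of 20 and 31 instances is fine,
# but one producing branches of 5 and 46 is rejected:
print(split_allowed([range(20), range(31)]))  # True
print(split_allowed([range(5), range(46)]))   # False
```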

Let’s run it again, now we've got a worse result, 61% correct classification.

But a much smaller tree, with only 8 leaves. Now, we can visualize this tree. If I right click on the

line—these are the lines that describe each of the runs that we've done, and this is the third run—if I

right click on that, I get a little menu, and I can visualize the tree.

There it is. If I right click on empty space, I can fit this to the screen. This is the decision tree. It says first look at the Barium (Ba) content: if it's large, then it must be headlamps. If it's small, then look at Magnesium (Mg); if that's small, then look at potassium (K), and if that's small, then we've got tableware.

That sounds like a pretty good thing to me; I don't want too much potassium in my tableware. This is

a visualization of the tree and it's the same tree that you can see by looking above in log. This is a

different representation of the same tree.

I'll just show you one more thing about this configuration panel, which is the More button.

This gives you more information about the classifier, about J48. It's always useful to look at that to see where these classifiers have come from. In this case, let me explain why it's called J48. It's based on a famous system called C4.5, which was described in a book. The book is referenced here. In fact, I think I've got it on my shelf here: this book, "C4.5: Programs for Machine Learning" by an Australian computer scientist called Ross Quinlan. He started out with a system called ID3 (I think that might have been in his PhD thesis), which morphed through various versions into C4.5. C4.5 became quite famous; the book came out, and so on. He continued to work on the system, it went up to C4.8, and then he went commercial. Up until then, these were all open source systems.

When we built Weka, we took the latest version of C4.5, which was C4.8, and we rewrote it. Weka's

written in Java, so we called it J48, maybe it's not a very good name, but that's the name that stuck.

There's a little bit of history for you.

We've talked about classifiers in Weka, I've shown you where you find the classifiers, we classified

the glass dataset. We looked at how to interpret the output from J48, in particular the confusion

matrix. We looked at the configuration panel for J48, we looked at a couple of options: pruned

versus unpruned trees and the option to avoid small leaves. I told you how J48 really corresponds to

the machine learning system that most people know as C4.5. C4.5 and C4.8 were really pretty

similar, so we just talk about J48 as if it's synonymous with C4.5. You can read about this in the

book— Section 11.1 about Building a decision tree and Examining the output. Now, off you go, and

do the activity associated with this lesson.

05 Using a filter

https://www.youtube.com/watch?v=Ui9dM8-RMv0&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=5

Brief:

Weka includes many filters that can be used before invoking a classifier to clean up the dataset or alter it in some way. Filters help with data preparation; for example, you can easily remove an attribute, or you can remove all instances that have a certain value for an attribute (e.g. instances for which humidity has the value high). Surprisingly, removing attributes sometimes leads to better classification, and also simpler decision trees!

In this lesson, we’re going to look at another of Weka’s principal features: filters. One of the

main messages of this course is that it’s really important when you’re data mining to get close to

your data, and to think about pre-processing it, or filtering it in some way, before applying a

classifier. I’m going to start by using a filter to remove an attribute from the weather

data. Let me start up the Weka Explorer and open the weather data.

I’m going to remove the “humidity” attribute: that’s attribute number 3. I can look at filters; just

like we chose classifiers using this Choose button on the Classify panel, we choose filters by

using the Choose button here. There are a lot of different filters, Allfilter and MultiFilter

are ways of combining filters. We have supervised and unsupervised filters. Supervised

filters are ones that use the class value for their operation. They aren’t so common as

unsupervised filters, which don’t use the class value. There are attribute filters and

instance filters. We want to remove an attribute, so we're looking for an attribute filter. There are so many filters in Weka that you just have to learn to look around and find what you want.

I’m going to look for removing an attribute. Here we go, “Remove”. Now, before, when we

configured the J48 classifier, we clicked here. I’m going to click here, and we can configure

the filter. This is “A filter that removes a range of attributes from the

dataset”. I can specify a range of attributes here. I just want to remove one. I think it was

attribute number 3 we were going to remove. I can “invert the selection” and remove

all the other attributes and leave 3, but I’m just going to leave it like that. Click OK, and watch

"humidity" go when we apply the filter. Nothing happens until you click Apply to apply the filter.

I’ve just applied it, and here we are, the “humidity” attribute has been removed. Luckily I can

undo the effect of that and put it back by pressing the “Undo” button. That’s how to remove an

attribute.

Actually, the bad news is there is a much easier way to remove an attribute: you don’t need to

use a filter at all. If you just want to remove an attribute, you can select it here and click the

“Remove” button at the bottom. It does the same job. Sorry about that. But filters are really useful

and can do much more complex things than that. Let’s, for example, imagine removing, not an

attribute, but let’s remove all instances where humidity has the value “high”. That is,

attribute number 3 has this first value. That’s going to remove 7 instances from the dataset. There

are 14 instances altogether, so we’re going to get left with a reduced dataset of 7 instances.

Let’s look for a filter to do that. We want to remove instances, so it’s going to be an instance

filter. I just have to look down here and see if there is anything suitable, how about

RemoveWithValues? I can click that to configure it, and I can click “More” to see what it does.

Here it says it "Filters instances according to the value of an attribute", which is exactly what we want. We're going to set the "attributeIndex": we want the third attribute (humidity), so type 3; and set the "nominalIndices" to 1, the first value. We could remove a number of different values, but we'll just remove the first one. Now we've configured that. Nothing happens until we apply the filter.

Watch what happens when we apply it. We still have the “humidity” attribute there, but we have

zero elements with high humidity. In fact, the dataset has been reduced to only 7 instances. Recall

that when you do anything here, you can save the results. So, we could save that reduced dataset if

we wanted, but I don’t want to do that now. I’m going to undo this.

We removed the instances where humidity is high. We have to think about, when we’re looking

for filters, whether we want a supervised or an unsupervised filter, whether we want an

attribute filter or an instance filter, and then just use your common sense to look

down the list of filters to see which one you want.
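The behaviour of RemoveWithValues (drop every instance whose chosen attribute has a given value) can be sketched in a few lines of stdlib Python, using a made-up miniature of the weather data:

```python
# Each instance is a dict; we drop those where "humidity" is "high",
# mimicking RemoveWithValues with attributeIndex=3, nominalIndices=1.
instances = [
    {"outlook": "sunny", "humidity": "high",   "play": "no"},
    {"outlook": "rainy", "humidity": "normal", "play": "yes"},
    {"outlook": "sunny", "humidity": "normal", "play": "yes"},
    {"outlook": "rainy", "humidity": "high",   "play": "no"},
]

def remove_with_value(data, attribute, value):
    """Keep only instances whose attribute does NOT equal the given value."""
    return [inst for inst in data if inst[attribute] != value]

filtered = remove_with_value(instances, "humidity", "high")
print(len(filtered))  # 2
```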

Sometimes when you filter data you get much better classification. Here's a really simple example. I'm going to open the "glass.arff" dataset that we saw before. Here's the glass dataset; I'm going to use J48, which we used before. It's a tree classifier.

I'm going to use cross-validation with 10 folds, start that, and I get an accuracy of 66.8%. Let's remove Fe; when we do, we get a smaller dataset. Go and run J48 again.

Now we get an accuracy of 67.3%. So, we’ve improved the accuracy a little bit by removing that

attribute. Sometimes the effect is pretty dramatic, actually, in this dataset, I’m going to remove

everything except the refractive index and Magnesium (Mg). I’m going to remove all of these

attributes and am left with a much smaller dataset with two attributes. Apply J48 again.

Now I've got an even better result, 68.7% accuracy. I can visualize that tree, of course (remember?) by right-clicking and selecting "Visualize tree", and have a look and see what it means. It's much easier to visualize trees when they are smaller. This is a good one to look at and consider what the structure of this decision is.

That’s it for now. We’ve looked at filters in Weka; supervised versus unsupervised,

attribute versus instance filters. To find the right filter you need to look. They can be

very powerful, and judiciously removing attributes can both improve performance and increase

comprehensibility. For further stuff look at section 11.2 on loading and filtering files.

06 Visualizing your data

https://www.youtube.com/watch?v=dGNEiWTkx-M&index=6&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7

Anyway, one of the constantly recurring themes in this course is the necessity to get close to your

data, look at it in every possible way. In this last lesson of the first class, we're going to look at

visualizing your data. This is what we're going to do, we're going to use the Visualize panel, I'm

going to open the iris dataset. You came across the iris dataset in one of the activities, I think.

I'm using it because it has numeric attributes, 4 numeric attributes: sepallength, sepalwidth, petallength, petalwidth. The classes are the three kinds of iris flower: Iris-setosa, Iris-versicolor, and Iris-virginica.

Let's go to the Visualize panel and visualize this data. There is a matrix of two-dimensional plots, a five-by-five matrix of plots. If I select one of these plots, I'm going to be looking at a plot of sepalwidth on the x-axis and petalwidth on the y-axis.

That's a plot of the data. The colors correspond to the three classes.

I can actually change the colors by clicking where it says Iris-s… at the bottom. If I don't like those, I could select another color, but I'm going to leave them the way they are.

I can look at individual data points by clicking on them. This is talking about instance number 86 with a sepallength of 6, sepalwidth of 3.4, and so on. That's a versicolor, which is why this spot is coloured red. We can look at individual instances.

Better still, this little set of bars here represents the attributes, so if I click one, the x-axis will change to sepallength. Here the x-axis is sepalwidth (clicking on the 2nd one); clicking on the 3rd, the x-axis is petallength, and so on.

If I right click, it will change the y-axis to sepallength. So, I can quickly browse around these different plots.

There is a Jitter slider. Sometimes, points sit right on top of each other, and jitter just adds a little

bit of randomness to the x- and the y-axis. With a little bit of jitter on here, the darker spots

represent multiple instances.

The jitter function in the Visualize panel just adds artificial random noise to the coordinates of the

plotted points in order to spread the data out a bit (so that you can see points that might have been

obscured by others). From online

If I click on one of those, I can see that that point represents three separate instances, all of class iris-setosa, and they all have the same values of petallength and sepalwidth, both of which are being plotted on this graph. The sepalwidth and petallength are 3.0 and 1.4 for each of the three instances.

If I click another one here: these are two instances with very similar sepalwidths and petallengths, both of class versicolor. The jitter slider helps you distinguish between points that are in fact very close together.
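The jitter idea is easy to picture: each plotted point gets a small random offset so coincident points separate. A stdlib-Python illustration (the 0.05 spread and the seed are arbitrary choices here, not Weka's):

```python
import random

def jitter(points, spread=0.05, seed=1):
    """Add small uniform noise to each (x, y) so coincident points separate."""
    rng = random.Random(seed)
    return [(x + rng.uniform(-spread, spread),
             y + rng.uniform(-spread, spread)) for x, y in points]

# Three identical instances (e.g. sepalwidth 3.0, petallength 1.4)
# become three distinct dots on the plot:
pts = jitter([(3.0, 1.4)] * 3)
print(len(set(pts)))  # 3
```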

Another thing we can do is select bits of this dataset. I'm going to choose Select Rectangle here. If I draw a rectangle now, I can select these points. If I were to click the Submit button (top left) for this rectangle, then all other points would be excluded and just these points would appear on the graph, with the axes re-scaled appropriately.

Here we go. I've submitted that rectangle, and you can see that there's just the red points and green

points there. I could save that if I wanted as a different dataset, or I could reset it and maybe try

another kind of selection like this, where I'm going to have some blue points, some red and some

green points and see what that looks like. This might be a way of cleaning up outliers in your data, by

selecting rectangles and saving the new dataset. That's visualizing the dataset itself. What about

visualizing the result of a classifier?

Let's get rid of this visualize panel and back to the Preprocess panel. I'm going to use a

classifier. I'm going to use, guess what, J48. Let's find it under trees. I'm going to run it.

Then if I right click on this entry here in the log area, I can view classifier errors.

Here we've got the class plotted against the predicted class. The square boxes represent errors. If I

click on one of these, I can, of course, change the different axes if I want (box on right similar to

above). I can change the x-axis and the y-axis, but I'm going to go back to class and predicted class as

the axis.

If I click on one of these boxes, I can see where the errors are. There are two instances where the predicted class is versicolor and the actual class is virginica.

We can see these in the confusion matrix. The actual class is virginica (last row, 2nd value, 2), and the predicted class is versicolor, that's 'b'. These two entries in the confusion matrix are represented by these instances here (highlighted in red above).

If I look at another point, say this one. Here I've got one instance which is in fact a setosa

predicted to be a versicolor. That is this setosa (2nd row 2nd value in confusion matrix)

predicted to be a versicolor. I can look at this plot and find out where the misclassifications are

actually occurring, the errors in the confusion matrix.

So, get down and dirty with your data and visualize it. You can do all sorts of things. You can clean it

up, detect outliers. You can look at the classification errors. For example, there’s a filter that

allows you to add the classifications as a new attribute.

Let's just go and have a look at that. I'm going to go and find a filter. We're going to add an attribute. It's supervised, because it uses the class. Add an attribute: AddClassification. Here, in the configuration panel, I get to choose the machine learning scheme. I'm going to choose J48, of course, and I'm going to set outputClassification to True. That's configured it, and I'm going to apply it. It will add a new attribute.

It's done it, and this attribute is the classification according to J48. Weka is very powerful. You can

do all sorts of things with classifiers and filters. That's the end of the first class. There's a

section of the book on Visualization, section 11.2.

07 Class 1 Questions

https://www.youtube.com/watch?v=PRatbc8lOU8&index=7&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7

Some people were going straight to the YouTube videos, and if you just go straight to the YouTube videos you don't see the

activities. You should be seeing this picture when you go to the course. This is the website, and you need to go and look at

the course from here.

The next thing is about the algorithms. People want to learn about the details of the algorithms and how they work. Are

you going to learn about those? Is there a MOOC class that goes into the algorithms provided by Weka, rather than the

mechanics of running it?

The answer is "yes": you will be learning something about these algorithms. I've put the syllabus here; it's on the course

webpage. You can see from this syllabus what we're going to be doing. We're going to be looking at, for example, the J48

algorithm for decision trees and pruning decision trees; the nearest neighbour algorithm for instance-based learning; and

linear regression; classification by regression. We'll look at quite a few algorithms in Classes 3 and 4. I'm not going to tell

you about the algorithms in gory detail, however: they can get quite tricky inside.

What I want to do is to communicate the overall way that they work -- the idea behind the algorithms -- rather than the

details. The book does give you full details of exactly how these algorithms work inside. We're not going to be able to cover

them in that much detail in the course, but we will be talking about how the algorithms work and what they do.

The next thing I want to talk about: someone asked about using Naive Bayes. How can we use the NaiveBayes classifier

algorithm on a dataset, and how can we test for particular data, whether it fits into particular classes? Let me go to Weka

here. We're going to be covering this in future lessons -- Lesson 3.3 on Naive Bayes and so on -- but I'll just show you. All of

this is very easy. If I go to Classify, and I want to run Naive Bayes, I just need to find NaiveBayes. I happen to know it's in the

bayes section, and I can run it here. Just like that. We've just run NaiveBayes. I'll be doing this more slowly and looking

more at the output in Lesson 3.3. A natural thing to ask is if you had a particular test instance, which way would Naive

Bayes classify it, or any other kind of classifier?

This is the weather data we're using here, and I've created a file and called it weather.one.day.arff. It's a standard ARFF file,

and I got it by editing the weather.nominal.arff file. You can see that I've just got one day here. I've got the same header as

for the regular weather file and just one day -- but I could have several days if I wanted. I've put a question mark for the

class, because I want to know what class is predicted for that. We'll be talking about this in Lesson 2.1 -- you're probably

doing it right now -- but we can use a "supplied test set". I'm going to set that one that I created, which I called

weather.one.day.arff, as my test set. I can run this, and it will evaluate it on the test set. On the "More options..." menu --

you'll be learning about this in Lesson 4.3 -- there's an "Output predictions" option, here.
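The one-day test file described above might look something like this -- a sketch, assuming the standard weather.nominal header (the attribute values on the data line are made up; the '?' is the unknown class to be predicted):

```
@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,cool,high,TRUE,?
```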

If I now run it and look up here, I will find instance number 1, the actual class was "?" -- I showed you that, that was what

was in the ARFF file -- and the predicted class is "no". There's some other information. This is how I can find out what

predictions would be on new test data. Actually, there's nothing stopping me from using the training file itself as the test file. I can use weather.nominal.arff as my test file, and run it again. Now, I can see these are the 14 instances in the

standard weather data. This is their actual class, this is the predicted class, predicted by, in this case, Naive Bayes. There's a

mark in this column whenever there's an error, whenever the actual class differs from the predicted class. Again, we get

that by, in the "More options..." menu, checking "Output predictions". We're going to talk about that in other lessons. I just

wanted to show you that it's very easy to do these things in Weka.

The final thing I just wanted to mention is, if you're configuring a classifier -- any classifier, or indeed any filter -- there are

these buttons at the bottom. There's an "Open" and "Save" button, as well as the OK button that we normally use. These

buttons are not about opening files in the Explorer, they're about saving configured classifiers. So, you could set

parameters here and save that configuration with a name and a file and then open it later on. We don't do that in this

course, so we never use these Open and Save buttons here in the GenericObjectEditor. This is the GenericObjectEditor that

I get by clicking a classifier or filter. Just ignore the Open and Save buttons here. They do not open ARFF files for you. That's

all I wanted to say. Carry on with Class 2.

CLASS 2

08 Be a classifier

https://www.youtube.com/watch?v=blLXSe3_q6A&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=8

Hi! Welcome back to Data Mining with Weka. This is Class 2. In the first class, we downloaded Weka

and we looked around the Explorer and a few datasets; we used a classifier, the J48

classifier; we used a filter to remove attributes and to remove some instances; we

visualized some data—we visualized classification errors on a dataset; and along the way we looked

at a few datasets, the weather data, both the nominal and numeric version, the glass data,

and the iris dataset.

This class is all about evaluation. In Lesson 1.4, we built a classifier using J48. In this first

lesson of the second class, we're going to see what it's like to actually be a classifier ourselves.

Then, later on in subsequent lessons in this class, we're going to look at more about evaluation

training and testing, baseline accuracy and cross-validation.

First of all, we're going to see what it's like to be a classifier. We're going to construct a

decision tree ourselves, interactively. I'm going to just open up Weka here, the Weka Explorer, and load the segment-challenge.arff dataset.

Let's first of all look at the class. The class values are brickface, sky, foliage, cement,

window, path, and grass. It looks like this is kind of an image analysis dataset. When we look at

the attributes, we see things like the centroid of columns and rows, pixel counts, line

densities, means of intensities, and various other things. Saturation, hue, and the

class, as I said before, is different kinds of texture: bricks, sky, foliage, and so on.

That's the segment challenge dataset.

Now, I'm going to select the user classifier. The user classifier is a tree

classifier. We'll see what it does in just a minute. Before I start, this is really quite important.

I'm going to use a supplied test set. I'm going to set the test set, which is used to evaluate the

classifier to be segment-test.arff. The training set is segment-challenge, the test

set is segment-test. Now we're all set. I'm going to start the classifier.

What we see is a window with two panels: the Tree Visualizer and the Data Visualizer.

We looked at visualization in the last class, how you can select different attributes for the x and y.

I'm going to plot the region-centroid-row against the intensity-mean.

Now we're going to make a selection, so I'm going to choose Rectangle. If I draw out with my mouse

a rectangle here, I'm going to have a rectangle that's pretty well pure reds, as far as I can see. I'm

going to submit this rectangle. You can see that that area has gone, and the picture has been

rescaled.

Now I'm building up a tree here. If I look at the Tree Visualizer, I've got a tree. We've split

on these two attributes, region-centroid-row and intensity-mean. Here we've got sky,

these are all sky classes. Here we've got a mixture of brickface, foliage, cement, window,

path, and grass. We're kind of going to build up this tree. What I want to do is to take this node

and refine it a bit more.

Here is the Data Visualizer again. I'm going to select a rectangle containing these items here

and submit that. They've gone from this picture.

You can see that here, I've created this split, another split on region-centroid-row and

intensity-mean, and here, this is almost all path -- 233 path instances -- and then a mixture here

(bottom right). This is a pure node we've got over there, (top left small), this is almost a pure node

(middle square), this is the one I want to work on (bottom right).

I'm going to cover some of those instances now. Let's take this lot here and submit that. Then I'm

going to take this lot here and submit that. Maybe I'll take those ones there and submit that. This

little cluster here seems pretty uniform. Submit that. I haven't actually changed the axes, but, of

course, at any time, I could change these axes to better separate the remaining classes. I could kind

of mess around with these.

Actually, a quick way to do it is to click here on these bars (as did previously). Left click for x and

right click for y. I can quickly explore different pairs of axes to see if I can get a better split. Here's

the tree I've created.

It looks like this. You can see that we have successively elaborated down this branch here. When I

finish with this, I can “Accept The Tree” by right clicking.

Actually, before I do that, let me just show you that we were selecting rectangles here but I've got

other things I can select: a polygon or a polyline. If I don't want to use rectangles, I can use

polygons or polylines. If you like, you can experiment with those to select different shaped areas.

There's an area I've got selected; I just can't quite finish it off. Alright, I right-clicked to finish it off. I could submit that. I'm not confined to rectangles; I can use different shapes. I'm not going to do

that. I'm satisfied with this tree for the moment. I'm going to accept the tree. Once I do this, there

is no going back, so you want to be sure. If I accept the tree, "Are you sure?" “Yes”.

Here, I've got a confusion matrix, and I can look at the errors. My tree classifies nearly 79% of the instances correctly and about 21% incorrectly.

That's not too bad, especially considering how quickly I built that tree. It's over to you now. I'd like

you to play around and see if you can do better than this by spending a little bit longer on getting a

nice tree.

I'd like you to reflect on a couple of things. First of all, what strategy you're using to build this tree.

Basically, we're covering different regions of the instance space, trying to get pure regions to create

pure branches. This is kind of like a bottom-up covering strategy. We cover this area and this area

and this area.

Actually, that's not how J48 works. When it builds its trees, it tries to do a judicious split through

the whole dataset. At the very top level, it'll split the entire dataset into two in a way that doesn't

necessarily separate out particular classes but makes it easier when it starts working on each half of

the dataset further splitting in a top-down manner in order to try and produce an optimal tree. It will

produce trees much better than the one that I just produced with the user classifier. I'd also like you

to reflect on what it is we're trying to do here.

Given enough time, you could produce a 'perfect' tree for the dataset, but don't forget that the

dataset that we've loaded is the training dataset. We're going to evaluate this tree on a different

dataset, the test dataset, which hopefully comes from the same source, but is not identical to the

training dataset. We're not trying to precisely fit the training dataset; we’re trying to fit it in a way

that generalizes the kinds of patterns exhibited in the dataset. We're looking for something that will

perform well on the test data. That highlights the importance of evaluation in machine learning.

That's what this class is going to be about, different ways of evaluating your classifier. That's it.

There's some information in the course text about the user classifier, section 11.2 “Do it

yourself: the User Classifier” which you can read if you like. Please go on and do the activity

associated with this lesson and produce your own classifier.

09 Training and testing

https://www.youtube.com/watch?v=GWcmc6ES_jY&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=9

Hi! This is a lesson in Data Mining with Weka, and here we're going to look at training and testing in a little bit more detail.

Here's a situation. We've got a machine learning algorithm, and we feed into it training data, and it

produces a classifier -- a basic machine learning situation. For that classifier, we can test it with

some independent test data. We can put that into the classifier and get some evaluation results,

and, separately, we can deploy the classifier in some real situation to make predictions on fresh data

coming from the environment.

It's really important in classification: you only get reliable evaluation results if the test data is different from the training data. That's what we're going

to look at in this lesson.

What if you only have one dataset? If you just have one dataset, you should divide it into two parts.

Maybe use some of it for training and some of it for testing. Perhaps, 2/3 of it for training and 1/3 of

it for testing. It's really important that the training data is different from the test data. Both training

and test sets are produced by independent sampling from an infinite population. That's the basic

scenario here, but they're different independent samples. It's not the same data. If it is the same

data, then your evaluation results are misleading. They don't reflect what you should actually expect

on new data when you deploy your classifier.

Here we're going to look at the segment dataset, which we used in the last lesson. I'm going to open

the segment-challenge.arff. I'm going to use a supplied test set. First of all, I'm going to use the J48

tree learner. I'm going to use a “Supplied test set”, and I will set it to the appropriate segment-

test.arff file, I'm going to open that. Now we've got a test set, and let's see how it does.

In the last lesson, on the same data with the user classifier, I think I got 79% accuracy. J48 does much

better; it gets 96% accuracy on the same test set. Suppose I was to evaluate it on the training set? I

can do that by just specifying under “Test options” “Use training set”. Now it will train it again and

evaluate it on the training set, which is not what you're supposed to do, because you get misleading

results.

Here, it's saying the accuracy is 99% on the training set. That is not representative of what we would

get using this on independent data. If we had just one dataset, if we didn't have a test dataset, we

could do a “Percentage split”. Here's a percentage split, this is going to be 66% training data and 34%

test data. That's going to make a random split of the dataset. If I run that, I get 95%, that's just about

the same as what we got when we had an independent test set, just slightly worse.

If I were to run it again with a different split, we'd expect a slightly different result, but actually, I get exactly the same result, 95%. That's because, before each run, Weka reinitializes the random number generator. The reason is to make sure that you can get repeatable results; if it didn't do that, the results you got would not be repeatable. However, if you wanted to

have a look at the differences that you might get on different runs, then there is a way of resetting

the random number between each run. We're going to look at that in the next lesson. That's this

lesson.
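The behaviour described here -- identical results on repeated runs, different results when the seed changes -- is easy to see in a sketch of a percentage split. This is plain Python, not Weka's actual code; the dataset size and seeds are made up for illustration.

```python
import random

def percentage_split(n_instances, train_pct, seed):
    """Shuffle instance indices with a seeded RNG, then split into train/test."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)   # re-seeding makes the split repeatable
    cut = round(n_instances * train_pct / 100)
    return indices[:cut], indices[cut:]

# Same seed, same split: this is why Weka gives identical results on each run.
assert percentage_split(150, 66, seed=1) == percentage_split(150, 66, seed=1)

# A different seed gives a different split, hence a slightly different accuracy.
train_a, test_a = percentage_split(150, 66, seed=1)
train_b, _ = percentage_split(150, 66, seed=2)
assert train_a != train_b
print(len(train_a), len(test_a))   # 99 51
```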

The basic assumption of machine learning is that the training and test sets are independently

sampled from an infinite population, the same population. If you have just one dataset, you should

hold part of it out for testing, maybe 33% as we just did or perhaps 10%. We would expect a slight

variation in results each time if we hold out a different set, but Weka produces the same results

each time by design by making sure it reinitializes the random number generator each time. We ran

J48 on the segment-challenge dataset. If you'd like, you can go and look at the course text on

Training and testing, Section 5.1 (Training and testing) and please go and do the activity associated

with this lesson.

10 Repeated training and testing

https://www.youtube.com/watch?v=GWcmc6ES_jY&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=9

Hello again! In the last lesson, we looked at training and testing. We saw that we can evaluate a classifier on an independent test set, or using a Percentage split with a certain percentage of the dataset used to train and the rest used for testing, or -- and this is generally a very bad idea -- we can evaluate it on the training set itself, which gives misleadingly optimistic performance figures. In

this lesson, we're going to look a little bit more at training and testing. In fact, what we're going to

do is repeatedly train and test using percentage split.

Now, in the last lesson, we saw that if you simply repeat the training and testing, you get the same

result each time because Weka initializes the random number generator before it does each run to

make sure that you know what's going on when you do the same experiment again tomorrow. But,

there is a way of overriding that.

So, we will be using independent random numbers on different occasions to produce a percentage

split of the dataset into a training and test set. I'm going to open the segment-

challenge.arff. That's what we used before. Notice there are 1500 instances here; that’s

quite a lot. I'm going to go to Classify. I'm going to choose J48, our standard method, I guess.

I’m going to use a percentage split, and because we've got 1500 instances, I'm going to

choose 90% for training and just 10% for testing. I reckon that 10% -- that's 150 instances -- for

testing is going to give us a reasonable estimate, and we might as well train on as many as we can to

get the most accurate classifier.

I'm going to run this, and the accuracy figure I get -- the same as in the last lesson -- is 96.6667%. That's a misleadingly precise figure, so I'm going to call it 96.7%, or 0.967. Then I'm going to do it again and see how much variation we get in that figure by initializing the random number generator differently each time.

If I go to the More options menu, I get a number of options here which are quite useful:

outputting the model, we're doing that; outputting statistics; we can output

different evaluation measures; we're doing the confusion matrix; we're storing the

prediction for visualization; we can output the predictions if we want; we

can do a cost-sensitive evaluation; and we can set the random seed for cross-

validation or percentage split. That's set by default to 1. I'm going to change that to 2, a

different random seed.

We could also output the source code for the classifier if we wanted, but I just want to change

the random seed. Then I want to run it again.

Before, we got 0.967; this time we get 0.94 -- 94%, quite different, you see. If I then change the seed to, say, 3 and run it again, I again get 94%. If I change it to 4 and run it again, I get 96.7%. Let's do one more: change it to 5, run it again, and now I get 95.3%.

Here's a table with these figures in; if we run it 10 times, we get this set of results. Given this set of experimental results, we can calculate the mean and standard deviation. The sample mean is the sum of all of these success rates divided by the number of them, 10. That's 0.949, about 95%, which is really what we would expect to get -- a better, more reliable estimate than the 96.7% we started out with. To calculate the sample variance, we take the deviation from the mean -- we subtract the mean from each of these numbers -- square it, add the squares up, and divide, not by n, but by n-1. That might surprise you; the reason it's n-1 is that we've actually calculated the mean from this same sample. When the mean is calculated from the sample, you need to divide by n-1, leading to a slightly larger variance estimate than if you were to divide by n.

We take the square root of that, and in this case, we get a standard deviation of 1.8%. Now you can

see that the real performance of J48 on the segment-challenge dataset is approximately 95%

accuracy, plus or minus approximately 2%. Anywhere, let's say, between 93-97% accuracy. These

figures that you get, that Weka puts out for you, are misleading. You need to be careful how you

interpret them, because the result is certainly not 95.333%. There's a lot of variation on all of these

figures.
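This calculation is exactly what Python's statistics module does (statistics.stdev divides by n-1, as described above). Here it is applied to the five accuracies actually quoted in this lesson -- only a subset of the ten runs, so the numbers differ slightly from the 0.949 and 1.8% on the slide.

```python
import statistics

# Accuracies from the five runs shown above (seeds 1-5). The full experiment
# used ten runs, which is where the 0.949 +/- 0.018 figure comes from.
accuracies = [0.967, 0.94, 0.94, 0.967, 0.953]

mean = statistics.mean(accuracies)     # sum / n
stdev = statistics.stdev(accuracies)   # sqrt(sum((x - mean)^2) / (n - 1))

print(f"mean = {mean:.4f}")    # 0.9534
print(f"stdev = {stdev:.4f}")  # about 0.0135
```

The point carries over directly: quote the mean with an error bar, not a single run's figure to four decimal places.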

Remember, the basic assumption is the training and test sets are sampled independently from an

infinite population, and you should expect a slight variation in results perhaps more than just a slight

variation in results. You can estimate the variation in results by setting the random-number

seed and repeating the experiment. You can calculate the mean and the standard deviation

experimentally, which is what we just did. Off you go now, and do the activity associated with this

lesson. I'll see you in the next lesson.

11 Baseline accuracy

https://www.youtube.com/watch?v=GWcmc6ES_jY&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=9

Hello again! In this lesson we're going to look at an important new concept called baseline

accuracy. We're going to actually use a new dataset, the diabetes dataset, so open diabetes.arff. Have a quick look at this dataset.

The class is tested_negative or tested_positive for diabetes, we've got attributes like

preg, which I think has to do with the number of times they've been pregnant; age, which is the

age. Of course, we can learn more about this dataset by looking at the ARFF file itself.

Here is the diabetes dataset; you can see it's diabetes in Pima Indians, and there's a lot of information here. The attributes: number of times pregnant, plasma glucose concentration, and so on, and the diabetes pedigree function.

I'm going to use percentage split, I'm going to try a few different classifiers. Let's look at J48

first, our old friend J48.

We get 76% with J48, I’m going to look at some other classifiers. You learn about these classifiers

later on in this course, but right now we're just going to look at a few. Look at NaiveBayes

classifier in the bayes category and run that.

Here we get 77%, a little bit better, but probably not significant.

Let's choose IBk in the lazy category; again, we'll learn about this later on. We'll use one final one, PART, partial decision rules, in the rules category.

Here we get 74%. We'll learn about these classifiers later, but they are just different classifiers,

alternative to J48.

You can see that J48 and NaiveBayes are pretty good, probably about the same, the 1%

difference between them probably isn't significant. IBk and PART are probably about the same

performance, again, 1% between them. There is a fair gap, I guess, between those bottom two and

the top two, which probably is significant.

I'd like you to think about these figures. Is it good to get 76% accuracy? If we go back and look at this dataset:

Looking at the class, we see that there are 500 negative instances and 268 positive instances. If you had to guess, you'd guess negative, and you'd be right 500/768 of the time (768 being the sum of those two counts, the total number of instances). If you always guess [negative], that works out to 65%.

Actually, there's a rules classifier called ZeroR, which does exactly that. The ZeroR classifier just

looks for the most popular class and guesses that all the time. If I run this on the training set:

That will give us the exact same number, 500/768 which is 65%. It's a very, very simple, kind of

trivial classifier, that always just guesses the most popular class. It's ok to evaluate that on the

training set, because it's hardly using the training set at all to form the classifier.

That's what we would call the baseline. The baseline gives 65% accuracy, and J48 gives 76%

accuracy, significantly above the baseline, but not all that much above the baseline.
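ZeroR's rule is simple enough to state in a few lines: find the most common class and always predict it. A sketch, using the diabetes class counts from above:

```python
from collections import Counter

# Class counts from the diabetes dataset:
# 500 tested_negative, 268 tested_positive.
labels = ["tested_negative"] * 500 + ["tested_positive"] * 268

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# ZeroR predicts majority_class for every instance, so its accuracy is
# simply the majority class's share of the data.
baseline_accuracy = majority_count / len(labels)
print(majority_class)             # tested_negative
print(f"{baseline_accuracy:.1%}")  # 65.1%
```

Any real classifier should be judged against this number, not against 0%.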

It's always good when you're looking at these figures to consider what the very simplest kind of

classifier the baseline classifier, would get you. Sometimes, baseline might give you the best

results.

I'm going to open a dataset here, we're not going to discuss this dataset. It's a bit of a strange

dataset, not really designed for this kind of classification. It's called supermarket. I'm going to

open supermarket, and without even looking at it, I'm just going to apply a few schemes here.

I'm going to apply ZeroR, and I get 64%. I'm going to apply J48 and I think I'll use a percentage

split for evaluation because it's not fair to use the training set here. Now I get 63%, that's worse than

the baseline. If I try NaiveBayes -- these are the ones I tried before -- I again get 63%, worse than the baseline. If I choose IBk -- this is going to take a little while; it's a rather slow scheme -- here we are, it's finished now: only 38%, which is way, way worse than the baseline. We'll just try PART,

partial decision rules, here we get 63%.

The upshot is that the baseline actually gave a better performance than any of these classifiers and

one of them was really atrocious compared with the baseline. This is because, for this dataset, the

attributes are not really informative.

The rule here is, don't just apply Weka to a dataset blindly. You need to understand what's going on.

When you do apply Weka to a dataset, always make sure that you try the baseline classifier ZeroR,

before doing anything else. In general, simplicity is best. Always try simple classifiers before you try

more complicated ones.

Also, you should consider, when you get these small differences whether the differences are likely to

be significant. We saw these 1% differences in the last lesson that were probably not at all

significant. You should always try a simple baseline, you should look at the dataset. We shouldn't

blindly apply Weka to a dataset; we should try to understand what's going on. That's this lesson. Off you go and do the activity associated with this lesson, and I'll see you soon!

12 Cross validation

I want to introduce you to the standard way of evaluating the performance of a machine learning

algorithm, which is called cross-validation. A couple of lessons back, we looked at evaluating on an

independent test set, and we also talked about evaluating on the training set (don't do that). We

also talked about evaluating using the holdout method by taking the one dataset and holding out a

little bit for testing and using the rest for training.

There is a fourth option on Weka's Classify panel, which is called cross-validation, and that's what

we're going to talk about here. Cross-validation is a way of improving upon repeated holdout. We

tried using the holdout method with different random-number seeds each time. That's called

repeated holdout. Cross-validation is a systematic way of doing repeated holdout that actually

improves upon it by reducing the variance of the estimate.

We take a training set and we create a classifier. Then we're looking to evaluate the performance of

that classifier, and there is a certain amount of variance in that evaluation, because it's all statistical

underneath. We want to keep the variance in the estimation as low as possible.

Cross-validation is a way of reducing the variance, and a variant on cross-validation called stratified

cross-validation reduces it even further. I'm going to explain that in this class.

In a previous lesson, we held out 10% for the testing and we repeated that 10 times. That's the

repeated holdout method. We've got one dataset, and we divided it independently 10 separate

times into a training set and a test set. With cross-validation, we divide it just once, but we divide

into, say, 10 pieces. Then, we take 9 of the pieces and use them for training and the last piece we

use for testing. Then, with the same division, we take another 9 pieces and use them for training and

the held-out piece for testing.

We do the whole thing 10 times, using a different segment for testing each time. In other words, we

divide the dataset into 10 pieces, and then we hold out each of these pieces in turn for testing, train

on the rest, do the testing and average the 10 results. That would be 10-fold cross-validation.

Divide the dataset into 10 parts (these are called folds), hold out each part in turn and average the

results. So, each data point in the dataset is used once for testing and 9 times for training. That's 10-

fold cross-validation.
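The bookkeeping in 10-fold cross-validation -- each instance tested exactly once and trained on 9 times -- can be sketched with plain index arithmetic (this is illustrative Python, not Weka's implementation):

```python
def kfold_indices(n_instances, k=10):
    """Split instance indices into k folds; yield (train, test) index pairs."""
    folds = [list(range(i, n_instances, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        # Train on the other k-1 folds, flattened into one list.
        train = [idx for f, fold in enumerate(folds) if f != held_out
                 for idx in fold]
        yield train, test

# Each of the 150 instances appears in exactly one test fold...
test_sets = [test for _, test in kfold_indices(150)]
assert sorted(i for t in test_sets for i in t) == list(range(150))
# ...and each training set holds the other 9 folds (135 instances).
assert all(len(train) == 135 for train, _ in kfold_indices(150))
```

In practice you would train and evaluate a classifier inside the loop and average the 10 accuracy figures.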

Stratified cross-validation is a simple variant where, when we do the initial division into 10 parts, we

ensure that each fold has got approximately the correct proportion of each of the class values. Of course, there are many different ways of dividing a dataset into 10 equal parts; we just make sure we choose a division that has approximately the right representation of class values in each of the folds.

That's stratified cross-validation. It helps reduce the variance in the estimate a little bit more.
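Stratification just means doing that initial division class by class. One simple way, sketched below, is to deal each class's instances round-robin across the folds, so every fold ends up with roughly the dataset's overall class proportions (the labels and counts here are illustrative, borrowed from the diabetes example).

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Assign instance indices to k folds, dealing each class round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Illustrative two-class data: 500 "no" then 268 "yes".
labels = ["no"] * 500 + ["yes"] * 268
folds = stratified_folds(labels)

# Every fold's positive rate stays close to the overall 268/768 (about 35%).
for fold in folds:
    yes_fraction = sum(labels[i] == "yes" for i in fold) / len(fold)
    print(f"{len(fold)} instances, {yes_fraction:.0%} positive")
```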

Then, once we've done the cross-validation, what Weka does is run the algorithm an eleventh time

on the whole dataset. That will then produce a classifier that we might deploy in practice. We use

10-fold cross-validation in order to get an evaluation result and estimate of the error and then

finally, we do classification one more time to get an actual classifier to use in practice.

That's what I wanted to tell you. Cross-validation is better than repeated holdout, and we'll look at

that in the next lesson. Stratified cross-validation is even better. Weka does stratified cross-validation by default.

With 10-fold cross-validation, Weka invokes the learning algorithm 11 times: once for each fold of the cross-validation, and then a final time on the entire dataset. The practical rule of thumb is that if

you've got lots of data, you can use a percentage split and evaluate it just once. Otherwise, if you

don't have too much data, you should use stratified 10-fold cross-validation.

How big is lots? Well, this is what everyone asks. How long is a piece of string, you know? It's hard to

say, but it depends on a few things. It depends on the number of classes in your dataset. If you've

got a two-class dataset, then if you had, say 100-1000 datapoints, that would probably be good

enough for a pretty reliable evaluation. If you did 90% and 10% split in the training and test set, if

you had, say 10,000 data points in a two-class problem, then I think you'd have lots and lots of data,

you wouldn't need to go to cross-validation. If, on the other hand, you had 100 different classes,

then that's different, right? You would need a larger dataset, because you want a fair representation

of each class when you do the evaluation.

It's really hard to say exactly; it depends on the circumstances. If you've got thousands and

thousands of data points, you might just do things once with a holdout. If you've got less than a

thousand data points, even with a two-class problem, then you might as well do 10-fold cross-

validation. It really doesn't take much longer. Well, it takes 10-times as long, but the times are

generally pretty short. You can read more about this in Section 5.3 (cross-validation). Now it's time

for you to go and do the activity associated with this [lesson].

https://www.youtube.com/watch?v=GWcmc6ES_jY&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=9

We're here to talk about Lesson 2.6, which is about cross-validation results. We learned

about cross-validation in the last lesson.

I said that cross-validation was a better way of evaluating your machine learning algorithm, your

classifier, than repeated holdout. Cross-validation does things 10 times. You can use holdout

to do things 10 times, but cross-validation is a better way of doing things.

Let's just do a little experiment here. I'm going to start up Weka and open the diabetes dataset. The

baseline accuracy is what ZeroR gives me (that's the default classifier, by the way, under

rules>ZeroR). If I just run that, well, it will evaluate it using cross-validation. Actually, for a

true baseline, I should just use the training set.

That'll just look at the chances of getting a correct result if we simply guess the most likely class, in

this case 65.1%. That's the baseline accuracy. That's the first thing you should do with any dataset.
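As a quick check of that 65.1% figure: the diabetes dataset has 768 instances, 500 tested_negative and 268 tested_positive, so guessing the most likely class is right 500/768 of the time. A minimal sketch of ZeroR:

```python
from collections import Counter

def zeror(labels):
    """ZeroR: always predict the most frequent class in the training labels."""
    return Counter(labels).most_common(1)[0][0]

labels = ["tested_negative"] * 500 + ["tested_positive"] * 268
majority = zeror(labels)
baseline = labels.count(majority) / len(labels)
print(majority, round(baseline * 100, 1))   # tested_negative 65.1
```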

Then we're going to look at J48, which is down here under trees. There it is. I'm going to evaluate

it with 10-fold cross-validation.

It takes just a second to do that. I get a result of 73.8%, and we can change the random-number

seed like we did before. The default is 1; let's put a random-number seed of 2.

Do it again. Change it to, say, 3; I can choose anything I want, of course. Run it again, and I get

75.5%.

These are the numbers I get on this slide with 10 different random-number seeds. Those are

the same numbers on this slide in the right-hand column, the 10 values I got: 73.8%, 75.0%,

75.5%, and so on.

I can calculate the mean, which for that right-hand column is 74.5%, and the sample standard

deviation, which is 0.9%, using just the same formulas that we used before. Before, we used these

formulas for the holdout method, where we repeated the holdout 10 times. These are the results you get

on this dataset, if you repeat holdout, that is using 90% for training and 10% for testing, which is,

of course, what we're doing with 10-fold cross-validation.

I would get those results there, and if I average those, I get a mean of 74.8%, which is satisfactorily

close to 74.5%, but I get a larger standard deviation, quite a lot larger standard

deviation of 4.6%, as opposed to 0.9% with cross-validation.
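The mean and sample standard deviation here use the usual formulas, with n-1 in the denominator of the variance. A sketch (only three of the ten accuracies are quoted above, so the remaining values below are made-up stand-ins):

```python
from math import sqrt

def mean_and_sd(xs):
    """Sample mean and sample standard deviation (n-1 denominator)."""
    n = len(xs)
    m = sum(xs) / n
    variance = sum((x - m) ** 2 for x in xs) / (n - 1)
    return m, sqrt(variance)

# 73.8, 75.0 and 75.5 are from the lecture; the rest are hypothetical.
accuracies = [73.8, 75.0, 75.5, 74.2, 74.8, 73.9, 75.1, 74.4, 74.6, 74.0]
m, sd = mean_and_sd(accuracies)
```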

Now, you might be asking yourself why use 10-fold cross-validation. With Weka we can

use 20-fold cross-validation or anything, we just set the number of folds here beside the cross-

validation box to whatever we want. So, we can use 20-fold cross-validation,

what that would do would be to divide the dataset into 20 equal parts and repeat 20 times. Take

one part out, train on the other 95% of the dataset, and then do it a 21st time on the whole dataset.

So, why 10, why not 20? Well, that's a good question really, and there's not a very good answer. We

want to use quite a lot of data for training, because, in the final analysis, we're going to use the

entire dataset for training.

If we're using 10-fold cross-validation, then we're using 90% of the dataset. Maybe it

would be a little better to use 95% of the dataset for training with 20-fold cross-

validation. On the other hand, we want to make sure that what we evaluate on is a valid

statistical sample. So, in general, it's not necessarily a good idea to use a large number of folds with

cross-validation. Also, of course, 20-fold cross-validation will take twice as long

as 10-fold cross-validation. The upshot is that there isn't a really good answer to this

question, but the standard thing to do is to use 10-fold cross-validation, and that's why

it's Weka's default.

We've shown in this lesson that cross-validation really is better than repeated holdout.

Remember, on the last slide, we found that we got about the same mean for repeated holdout as for

cross-validation, but we got a much smaller variance for cross-validation.

We know that when we evaluate this machine learning method, J48, on this dataset, diabetes, we get

74.5% accuracy, probably somewhere between 73.5% and 75.5%. That is actually substantially

larger than the baseline. So, J48 is doing something for us better than the baseline. Cross-

validation reduces the variance of the estimate. That's the end of this class. Off you go and do

the activity. I'll see you at the next class.

14 Class 2 Questions

Hi! Well, Class 2 has gone flying by, and here are some things I'd like to discuss. First of all, we made

some mistakes in the answers to the activities. Sorry about that. We've corrected them. Secondly, a

general point, some people have been asking questions, for example, about huge datasets. How big

a dataset can Weka deal with? The answer is pretty big, actually. But it depends on what you do, and

it's a fairly complicated question to discuss. If it's not big enough, there are ways of improving things.

Anyway, issues like that should be discussed on the Weka mailing list, or you should look in the

Weka FAQ, where there's quite a lot of discussion on this particular issue. The Weka API: the

programming interface to Weka. You can incorporate the Weka routines in your program. It's

wonderful stuff, but it's not covered in this MOOC. So, the right place to discuss those issues is the

Weka mailing list. Finally, personal emails to me. You know, there are 5,000 people on this MOOC,

and I can't cope with personal emails, so please send them to the mailing list and not to me

personally.

I'd like to discuss the issues of numeric precision in Weka. Weka prints percentages to 4 decimal

places; it prints most numbers to 4 decimal places. That's misleadingly high precision; don't take

these at face value. For example, here we've done an experiment using a 40% percentage split, and

we get 92.3333% accuracy printed out. Well, that's the exact right answer to the wrong question:

we're not interested in the performance on this particular test set. What we're interested in is how

Weka will do in general on data from this source. We certainly can't infer that that's this percentage

to 4 decimal place accuracy.

In Class 2, we're trying to sensitize you to the fact that these figures aren't to be taken at face value.

For example, there we are with a 40% split. If we do a 30% split, we get not 92.3333% but

92.381%. The difference between these two numbers is completely insignificant. You shouldn't be

saying this is better than the other number. They are both the same, really, within the amount of

statistical fuzz that's involved in the experiment.

We're trying to train you to write your answers to the nearest percentage point, or perhaps 1

decimal place. Those are the answers that are being accepted as correct. The reason we're doing

that is to try to train you to think about these numbers and what they really represent, rather than

just copy/pasting whatever Weka prints out. These numbers need to be interpreted, for example, in

Activity 2.6 in question 2, the 4-digit answer would be 0.7354%, and 0.7 and 0.74 are the only

accepted answers.

In question 5, the 4-decimal place accuracy is 1.7256%, and we would accept 1.73%, 1.7% and 2%.

We're a bit selective in what we'll accept here.

I want to move on to the user classifier now, some people got some confusing results, because they

created splits that involved the class attribute. When you're dealing with the test set, you don't

know the class attribute -- that's what you're trying to find out. So, it doesn't make sense to create

splits in the decision tree that involve testing the class attribute. If you do that, you're going to get 0

accuracy on test data, because the class value cannot be evaluated on the test data. That was the

cause of the confusion.

Here's the league table for the user classifier. J48 gets 96.2%, just as a reference point. Magda did

really well and got very close to that, with 93.9%. It took her 6.5-7 minutes, according to the script

that she mailed in. Myles did pretty well -- 93.5%. In the class, I got 78% in just a few seconds. I

think if you get over 90% you're doing pretty well on this dataset for the user classifier. The point is

not to get a good result, it's to think about the process of classification.

Let's move to Activity 2.2, partitioning the datasets for training and testing. Question 1 asked you to

evaluate J48 with percentage split, using 10% for the training set, 20%, 40%, 60%, and 80%. What

you observed is that the accuracy increases as we go through that set of numbers. "Performance

always increases" for those numbers. It doesn't always increase in general; in general, you would

expect an increasing trend, the more training data the better the performance, asymptoting off at

some point. You would expect some fluctuation, though, so sometimes you would expect it to go

down and up again. In this particular case, performance always increases.

You were asked to estimate J48's true accuracy on the segment-challenge dataset in Question 4.

Well, "true accuracy", what do we mean by "true accuracy"? I guess maybe it's not very well defined,

but what one thinks of is if you have a large enough training set, the performance of J48 is going to

increase up to some kind of point, and what would that point be?

Actually, if you do this -- in fact, you've done it! -- you found that between 60% training sets and 97-

98% training sets using the percentage split option consistently yield correctly classified instances in

the range 94-97%. So, 95% is probably the best fit from this selection of possible numbers. It's true,

by the way, that greater weight is normally given to the training portion of this split. Usually when

we use percentage split, we would use 2/3, or maybe 3/4, or maybe 90% of the training data, and

the smaller amount for the test data.

Questions 6 and 7 were confusing, and we've changed those. The issue there was how a classifier's

performance, and secondly the reliability of the estimate of the classifier's performance, is expected

to increase as the volume of the training data increases. Or, how they change with the size of the

dataset. The performance is expected to increase as the volume of training data increases, and the

reliability of the estimate is also expected to increase as the volume of test data increases. With the

percentage split option, there's a trade-off between the amount of test data and the amount of

training data. That's what that question is trying to get at.

Activity 2.3 Question 5: "How do the mean and standard deviation estimates depend on the number

of samples?" Well, the answer is that roughly speaking both stay the same. Let me find Activity 2.3,

Question 5, as you increase the number of samples, you expect the estimated mean to converge to

the true value of the mean, and the estimated standard deviation to converge to the true standard

deviation. So, they would both stay about the same, this is, in fact, now marked as correct. Actually,

because of the "n-1" in the denominator of the formula for variance, it's true that the standard

deviation decreases a tiny bit, but it's a very small effect. So, we've also accepted that answer as

correct. That's how the mean and standard deviation estimates depend on the number of samples.

Perhaps a more important question is how the reliability of the mean would change.

What decreases is the standard error of the estimate of the mean, which is the standard deviation

of the theoretical distribution of the large population of such estimates. The estimate of the mean is

a better, more reliable estimate with a larger training set size.
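In formula terms, the standard error of the mean is the sample standard deviation divided by the square root of the number of samples, s / sqrt(n), so it shrinks as the sample grows even though s itself stays about the same. A small sketch (the normal distribution with mean 50 is just an illustrative stand-in):

```python
from math import sqrt
from random import Random

def standard_error(xs):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    n = len(xs)
    m = sum(xs) / n
    s = sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return s / sqrt(n)

rng = Random(1)
small = [rng.gauss(50, 5) for _ in range(10)]      # 10 samples
large = [rng.gauss(50, 5) for _ in range(1000)]    # 1000 samples
# The spread of the data is about the same in both cases, but the
# mean of the larger sample is a far more reliable estimate of 50.
```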

"The supermarket dataset is weird." Yes, it is weird: it's intended to be weird. Actually, in the

supermarket dataset, each instance represents a supermarket trolley, and, instead of putting a 0 for

every item you don't buy -- of course, when we go to the supermarket, we don't buy most of the

items in the supermarket -- the ARFF file codes that as a question mark, which stands for "missing

value".

We're going to discuss missing values in Class 5. This dataset is suitable for association rule learning,

which we're not doing in this course. The message I'm trying to emphasize here is that you need to

understand what you're doing, not just process datasets blindly. Yes, it is weird.

There's been some discussion on the mailing list about cross-validation and the extra model. When

you do cross-validation, you're trying to do two things. You're trying to get an estimate of the

expected accuracy of a classifier, and you're trying to actually produce a really good classifier. To

produce a really good classifier to use in the future, you want to use the entire training set to train

up the classifier. To get an estimate of its accuracy, however, you can't do that unless you have an

independent test set. So cross-validation takes 90% for training and 10% for testing, repeats that 10

times, and averages the results to get an estimate. Once you've got the estimate, if you want an

actual classifier to use, the best classifier is one built on the full training set.

The same is true with a percentage split option. Weka will evaluate the percentage split, but then it

will print the classifier that it produces from the entire training set to give you a classifier to use on

your problem in the future. There's been a little bit of discussion on advanced stuff. I think maybe a

follow-up course might be a good idea here.

Someone noticed that if you apply a filter to the training set, you need to apply exactly the same

filter to the test set, which is sometimes a bit difficult to do, particularly if the training and test sets

are produced by cross-validation. There's an advanced classifier called the "FilteredClassifier" which

addresses that problem.

In his response to a question on the supermarket dataset, Peter mentioned "unbalanced" datasets,

and the cost of different kinds of error. This is something that Weka can take into account with a

cost sensitive evaluation, and there is a classifier called the CostSensitiveClassifier that allows you to

do that.

Finally, someone just asked a question on attribute selection: how do you select a good subset of

attributes? Excellent question! There's a whole attribute Selection panel, which we're not able to

talk about in this MOOC. This is just an introductory MOOC on Weka. Maybe we'll come up with an

advanced, follow-up MOOC where we're able to discuss some of these more advanced issues. That's

it.

CLASS 3

15 Simplicity first

https://www.youtube.com/watch?v=wDuNdvgXU_4&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=15

Hi! This is the third class of Data Mining with Weka, and in this class, we’re going to look at some

simple machine learning methods and how they work. We're going to start out emphasizing the

message that simple algorithms often work very well. In data mining, maybe in life in general, you

should always try simple things before you try more complicated things. There are many different

kinds of simple structure. For example, it might be that one attribute in the dataset does all the work,

everything depends on the value of one of the attributes. Or, it might be that all of the attributes

contribute equally and independently. Or a simple structure might be a decision tree that tests just a

few of the attributes. We might calculate the distance from an unknown sample to the nearest

training sample, or a result may depend on a linear combination of attributes.

We're going to look at all of these simple structures in the next few lessons. There's no universally

best learning algorithm. The success of a machine learning method depends on the domain. Data

mining really is an experimental science. We're going to look at the OneR rule learner, where one

attribute does all the work. It's extremely simple, very trivial, actually, but we're going to start with

simple things and build up to more complex things. OneR learns what you might call a one-level

decision tree, or a set of rules that all test one particular attribute. A tree that branches only at the

root node depending on the value of a particular attribute, or, equivalently, a set of rules that test

the value of that particular attribute.

In the basic version of OneR, there's one branch for each value of the attribute. We choose an

attribute first, and we make one branch for each possible value of that attribute. Each branch assigns

the most frequent class that comes down that branch. The error rate is the proportion of instances

that don't belong to the majority class of their corresponding branch. We choose the attribute with

the smallest error rate. Let's look at what this actually means. Here's the algorithm. For each

attribute, we're going to make some rules. For each value of the attribute, we're going to make a

rule that counts how often each class appears, finds the most frequent class, makes the rule assign

that most frequent class to this attribute value combination, and then we're going to calculate the

error rate of this attribute's rules.
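That loop, together with the final step of picking the attribute whose rules make the fewest errors, can be sketched in Python. This is a toy version for nominal attributes, not Weka's implementation; the example data is the outlook column of the weather dataset:

```python
from collections import Counter, defaultdict

def one_r(instances, labels):
    """OneR for nominal attributes.  Each instance is a dict mapping
    attribute name -> value.  For each attribute, build one rule per
    value (predict that value's majority class), count the errors,
    and keep the attribute with the fewest errors."""
    best = None
    for attr in instances[0]:
        counts = defaultdict(Counter)            # value -> class counts
        for inst, label in zip(instances, labels):
            counts[inst[attr]][label] += 1
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - max(c.values())
                     for c in counts.values())
        if best is None or errors < best[0]:     # ties broken arbitrarily
            best = (errors, attr, rules)
    return best[1], best[2], best[0]

# The outlook column of the weather data: sunny has 2 yes / 3 no,
# overcast 4 yes / 0 no, rainy 3 yes / 2 no, so 2 + 0 + 2 = 4 errors.
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy",
           "overcast", "sunny", "sunny", "rainy", "sunny",
           "overcast", "overcast", "rainy"]
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]
attr, rules, errors = one_r([{"outlook": o} for o in outlook], play)
```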

We're going to repeat that for each of the attributes in the dataset and choose the attribute with the

smallest error rate. Here's the weather data again. What OneR does, is it looks at each attribute in

turn, outlook, temperature, humidity, and wind, and forms rules based on that. For outlook, there

are three possible values: sunny, overcast, and rainy. We just count: out of the 5 sunny instances, 2

of them are yeses and 3 of them are nos. We're going to choose the rule: if it's sunny, choose no. We're

going to get 2 errors out of 5. For overcast, all of the 4 overcast values of outlook lead to yes values

for the class play.

So, we're going to choose the rule if outlook is overcast, then yes, giving us 0 errors. Finally, for

outlook is rainy we're going to choose yes, as well, and that would also give us 2 errors out of the 5

instances. We've got a total number of errors if we branch on outlook of 4. We can branch on

temperature and do the same thing. When temperature is hot, there are 2 nos and 2 yeses. We just

choose arbitrarily in the case of a tie, so if it's hot, let's predict no, getting 2 errors. If

temperature is mild, we'll predict yes, getting 2/6 errors, and if the temperature is cool, we'll predict

yes, getting 1 out of the 4 instances as an error.

And the same for humidity and wind. We look at the total error values; we choose the rule with the

lowest total error value -- either outlook or humidity. That's a tie, so we'll just choose arbitrarily, and

choose outlook. That's how OneR works, it’s as simple as that. Let's just try it. Here's Weka. I'm going

to open the nominal weather data. I'm going to go to Classify. This is such a trivial dataset that the

results aren't very meaningful, but if I just run ZeroR to start off with, I get a baseline accuracy of 64%.

If I now choose OneR and run that. I get a rule, and the rule I get is branched on outlook, if it's sunny

then choose no, overcast choose yes, and rainy choose yes. We get 10 out of 14 instances correct on

the training set. We're evaluating this using cross-validation, which doesn't really make much sense on

such a small dataset. Interesting, though, that the success rate we get, 42%, is pretty bad, worse

than ZeroR. Actually, with any 2-class problem, you would expect to get a success rate of at least

50%. Tossing a coin would give you 50%. This OneR scheme is not performing very well on this trivial

dataset. Notice the rule it finally prints out: since we're using 10-fold cross-validation, it does the

whole thing 10 times, and then on the 11th time calculates a rule from the entire dataset, and that's

what it prints out.

That's where this rule comes from. OneR, one attribute does all the work. This is a very simple

method of machine learning described in 1993, 20 years ago in a paper called "Very Simple

Classification Rules Perform Well on Most Commonly Used Datasets" by a guy called Rob Holte, who

lives in Canada. He did an experimental evaluation of the OneR method on 16 commonly used

datasets. He used cross-validation, just like we've told you, to evaluate these things, and he found

that the simple rules from OneR often outperformed far more complex methods that had been

proposed for these datasets. How can such a simple method work so well? Some datasets really are

simple, and others are so small, noisy, or complex that you can't

learn anything from them. So, it's always worth trying the simplest things first. Section 4.1 of the

course text talks about OneR. Now it's time for you to go and do the activity associated with this

lesson. Bye for now!

16 Model Overfitting

https://www.youtube.com/watch?v=rBANBJ9LtVY&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=16

Hi! Before we go on to talk about some more simple classifier methods, we need to talk about

overfitting. Any machine learning method may 'overfit' the training data, that's when it produces a

classifier that fits the training data too tightly and doesn't generalize well to independent test data.

Remember the user classifier that you built at the beginning of Class 2, when you built a classifier

yourself? Imagine tediously putting a tiny circle around every single training data point. You could

build a classifier very laboriously that would be 100% correct on the training data, but probably

wouldn't generalize very well to independent test data. That's overfitting. It's a general problem.

We're going to illustrate it with OneR.

We're going to look at the numeric version of the weather problem, where temperature and

humidity are numbers and not nominal values. If you think about how OneR works, when it comes to

making a rule on the attribute temperature, it's going to make a complex rule that branches 14 different

ways perhaps for the 14 different instances of the dataset. Each rule is going to have zero errors; it's

going to get it exactly right. If we branch on temperature, we're going to get a perfect rule, with a

total error count of zero. In fact, OneR has a parameter that limits the complexity of rules. I'm not

going to talk about how it works. It's pretty simple, but it's just a bit distracting and not very

important. The point is that the parameter allows you to limit the complexity of the rules that are

produced by OneR. Let's open the numeric weather data. We can go to OneR and choose it. There's

OneR, and let's just create a rule.

Here the rule is based on the outlook attribute. This is exactly what happened in the last lesson with

the nominal version of the weather data. Let's just remove the outlook attribute and try it again.

Now let's see what happens when we classify with OneR. Now it branches on humidity. If humidity is

less than 82.5%, it's a yes day; if it's greater than 82.5%, it's a no day and that gets 10 out of 14

instances correct. So far so good, that's using the default setting of OneR's parameter that controls

the complexity of the rules it generates.

We can go and look at OneR; remember, you can configure a classifier by clicking on it. We see

that there's a parameter called minBucketSize, and it's set to 6 by default, which is a good

compromise value. I'm going to change that value to 1, and then see what happens. Run OneR again,

and now I get a different kind of rule. It's branching many different ways on the temperature

attribute. This rule is overfitted to the dataset. It's a very accurate rule on the training data, but it

won't generalize well to independent test data.
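The same effect can be shown outside Weka with a deliberately extreme classifier, one that simply memorizes the training set and answers with the label of the nearest training point. This is a sketch of the general phenomenon, not of OneR; the noisy data source is made up for illustration:

```python
from random import Random

rng = Random(42)

def sample(n):
    """Noisy source: the class is 'pos' when x >= 0.5, but 20% of
    the labels are flipped, so no classifier can do much better
    than about 80% on fresh data from this source."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = "pos" if x >= 0.5 else "neg"
        if rng.random() < 0.2:
            label = "neg" if label == "pos" else "pos"
        data.append((x, label))
    return data

train = sample(200)
test = sample(200)

def predict(x):
    """Answer with the label of the nearest memorized training point:
    a maximally overfitted classifier."""
    return min(train, key=lambda point: abs(point[0] - x))[1]

def accuracy(data):
    return sum(predict(x) == label for x, label in data) / len(data)

train_acc, test_acc = accuracy(train), accuracy(test)
# Perfect on the training set, but much worse on independent data.
```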

Now let's see what happens with a more realistic dataset. I'll open diabetes, which is a numeric

dataset. All the attributes are numeric, and the class is either

tested_negative or tested_positive. Let's run ZeroR to get a baseline figure for this dataset. Here I

get 65% for the baseline. We really ought to be able to do better than that. Let's run OneR with the

default parameter settings, that is, a value of 6 for OneR's parameter that controls rule complexity.

We get 71.5%. That's pretty good.

We're evaluating using cross-validation. OneR outperforms the baseline accuracy by quite a bit --

71% versus 65%. If we look at the rule, it branches on "plas". This is the plasma-glucose

concentration. So, depending on which of these regions the plasma-glucose concentration falls into,

then we're going to predict a negative or a positive outcome. That seems like quite a sensible

rule. Now, let's change OneR's parameter to make it overfit. We'll configure OneR, find the

minBucketSize parameter, and change it to 1. When we run OneR again, we get 57% accuracy, quite

a bit lower than the ZeroR baseline of 65%.

If you look at the rule. Here it is. It's testing a different attribute, pedi, which -- if you look at the

comments of the ARFF file -- happens to be the diabetes pedigree function, whatever that is. You

can see that this attribute has a lot of different values, and it looks like we're branching on pretty

well every single one. That gives us lousy performance when evaluated by cross-validation, which is

what we're doing now. If you were to evaluate it on the training set, you would expect to see very

good performance. Yes, here we get 87.5% accuracy on the training set, which is very good for this

dataset. Of course, that figure is completely misleading; the rule is strongly overfitted to the training

dataset and doesn't generalize well to independent test sets.

That's a good example of overfitting. Overfitting is a general phenomenon that plagues all machine

learning methods. We've illustrated it by playing around with the parameter of the OneR method,

but it happens with all machine learning methods. It's one reason why you should never evaluate on

the training set. Overfitting can occur in more general contexts. Let's suppose you've got a dataset

and you choose a very large number of machine learning methods, say a million different machine

learning methods and choose the best for your dataset using cross-validation. Well, because you've

used so many machine learning methods, you can't expect to get the same performance on new test

data.

You've chosen so many, that the one that you've ended up with is going to be overfitted to the

dataset you're using. It's not sufficient just to use cross-validation and believe the results. In this

case, you might divide the data three ways, into a training set, a test set, and a validation set.

Choose the method using the training and test set. By all means, use your million machine learning

methods and choose the best on the training and test set, or the best using cross-validation on

the training set. But then, leave aside this separate validation set for use at the end, once you've

chosen your machine learning method, and evaluate it on that to get a much more realistic

assessment of how it would perform on independent test data. Overfitting is a really big problem in

machine learning. You can read a bit more about OneR and what this parameter actually does in the

course text in Section 4.1. Off you go now and do the activity associated with this class. Bye for now.

17 Using probabilities

https://www.youtube.com/watch?v=ecOdJ0ON8YQ&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=17

Hi! This is Lesson 3.3 on using probabilities. It's the one bit of Data Mining with Weka where we're

going to see a little bit of mathematics, but don't worry, I'll take you through it gently. The OneR

strategy that we've just been studying assumes that there is one of the attributes that does all the

work, that takes the responsibility of the decision. That's a simple strategy. Another simple strategy

is the opposite, to assume all of the attributes contribute equally and independently to the decision.

This is called the "Naive Bayes" method -- I'll explain the name later on. There are two assumptions

that underlie Naive Bayes: that the attributes are equally important and that they are statistically

independent, that is, knowing the value of one of the attributes doesn't tell you anything about the

value of any of the other attributes.

This independence assumption is never actually correct, but the method based on it often works

well in practice. There's a theorem in probability called "Bayes Theorem" after this guy Thomas

Bayes from the 18th century. It's about the probability of a hypothesis H given evidence E. In our

case, the hypothesis is the class of an instance and the evidence is the attribute values of the

instance. The theorem is that Pr[H|E] -- the probability of the class given the instance, the

hypothesis given the evidence -- is equal to Pr[E|H] times Pr[H] divided by Pr[E]. Pr[H] by itself is

called the prior probability of the hypothesis H.

That's the probability of the event before any evidence is seen. That's really the baseline probability

of the event. For example, in the weather data, I think there are 9 yeses and 5 nos, so the baseline

probability of the hypothesis "play equals yes" is 9/14 and "play equals no" is 5/14. What this

equation says is how to update that probability Pr[H] when you see some evidence, to get what's called

the "a posteriori" probability of H, that means after the evidence. The evidence in our case is the

attribute values of an unknown instance. That's E. That's Bayes Theorem. Now, what makes this

method "naive"? The naive assumption is -- I've said it before -- that the evidence splits into parts

that are statistically independent.

The parts of the evidence in our case are the four different attribute values in the weather data.

When you have independent events, the probabilities multiply, so Pr[H|E], according to the top

equation, is the product of Pr[E|H] times the prior probability Pr[H] divided by Pr[E]. Pr[E|H] splits

up into these parts: Pr[E1|H], the first attribute value; Pr[E2|H], the second attribute value; and so

on for all of the attributes. That's maybe a bit abstract, let's look at the actual weather data. On the

right-hand side is the weather data. In the large table at the top, we've taken each of the attributes.

Let's start with "outlook". Under the "yes" hypothesis and the "no" hypothesis, we've looked at how

many times the outlook is "sunny". It's sunny twice under yes and 3 times under no. That comes

straight from the data in the table. Overcast: when the outlook is overcast, it's always a "yes"

instance, so there were 4 of those, and zero "no" instances. Then, rainy is 3 "yes" instances and 2

"no" instances. Those numbers just come straight from the data table given the instance values.

Then, we take those numbers and underneath we make them into probabilities. Let's say we know

the hypothesis. Let's say we know it's a "yes".

Then the probability of it being "sunny" is 2/9ths, "overcast" is 4/9ths, and "rainy" 3/9ths,

simply because when you add up 2 plus 4 plus 3 you get 9. Those

are the probabilities. If we know that the outcome is "no", the probabilities are "sunny" 3/5ths,

"overcast" 0/5ths, and "rainy" 2/5ths. That's for the "outlook" attribute. That's what we're looking

for, you see, the probability of each of these attribute values given the hypothesis H. The next

attribute is temperature, and we just do the same thing with that to get the probabilities of the 3

values -- hot, mild, and cool -- under the "yes" hypothesis or the "no" hypothesis.

The same with humidity and windy. Play, that's the prior probability -- Pr[H]. It's "yes" 9/14ths of the

time, "no" 5/14ths of the time, even if you don't know anything about the attribute values. The

equation we're looking at is this one below, and we just need to work it out. Here's an example.

Here's an unknown day, a new day. We don't know what the value of "play" is, but we know it's

sunny, cool, high, and windy. We can just multiply up these probabilities. If we multiply for the yes

hypothesis, we get 2/9th times 3/9ths times 3/9ths times 3/9ths -- those are just the numbers on

the previous slide Pr[E1|H], Pr[E2|H], Pr[E3|H] Pr[E4|H] -- finally Pr[H], that is 9/14ths.

That gives us a likelihood of 0.0053 when you multiply them. Then, for the "no" class, we do the

same to get a likelihood of 0.0206. These numbers are not probabilities.

Probabilities have to add up to 1. They are likelihoods. But we can get the probabilities from them by

using a straightforward technique of normalization. Take those likelihoods for "yes" and "no" and we

normalize them as shown below to make them add up to 1. That's how we get the probability of

"play" on a new day with different attribute values. Just to go through that again. The evidence is

"outlook" is "sunny", "temperature" is "cool", "humidity" is "high", "windy" is "true" -- and we don't

know what play is. The [likelihood] of a "yes", given the evidence is the product of those 4

probabilities -- one for outlook, temperature, humidity and windy -- times the prior probability,

which is just the baseline probability of a "yes". That product of fractions is divided by Pr[E].
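That hand calculation can be checked with a short sketch (plain Python using the counts from the weather data, not Weka code):

```python
# Naive Bayes hand calculation for the new day:
# outlook=sunny, temperature=cool, humidity=high, windy=true.
# Per-attribute probabilities come from the weather-data counts.
like_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)  # Pr[E1|yes]...Pr[E4|yes] times Pr[yes]
like_no = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # the same product for the "no" class

# Normalize the likelihoods so the two probabilities add up to 1.
total = like_yes + like_no
p_yes = like_yes / total
p_no = like_no / total

print(round(like_yes, 4), round(like_no, 4))  # 0.0053 and 0.0206
print(round(p_yes, 3), round(p_no, 3))        # roughly 0.205 and 0.795
```

The Pr[E] denominator never has to be computed: it cancels out in the normalization step.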

We don't know what Pr[E] is, but it doesn't matter, because we can do the same calculation for

the "no" class, which gives us another equation just like this, and then we can calculate the actual

probabilities by normalizing them so that the two probabilities add up to 1. Pr[yes|E] plus Pr[no|E]

equals 1. It's actually quite simple when you look at it in numbers, and it's simple when you look at it in Weka, as well. I'm going to go to Weka here, and I'm going to open the nominal weather

look at it in Weka, as well. I'm going to go to Weka here, and I'm going to open the nominal weather

data, which is here. We've seen that before, of course, many times. I'm going to go to Classify. I'm

going to use the NaiveBayes method. It's under this bayes category here. There are a lot of

implementations of different variants of Bayes. I'm just going to use the straightforward NaiveBayes

method here. I'll just run it.

This is what we get. The success probability calculated according to cross-validation. More

interestingly, we get the model. The model is just like the table I showed you before divided

under the "yes" class and the "no" class. We've got the four attributes -- outlook, temperature,

humidity, and windy -- and then, for each of the attribute values, we've got the number of times

that attribute value appears. Now, there's one little but important difference between this table and

the one I showed you before. Let me go back to my slide and look at these numbers.

You can see that for outlook under "yes" on

my slide, I've got 2, 4, and 3, and Weka has got 3, 5, and 4. That's 1 more each time for a total of 12,

instead of a total of 9. Weka adds 1 to all of the counts. The reason it does this is to get rid of the

zeros. In the original table under outlook, under "no", the probability of overcast given "no" is zero,

and we're going to be multiplying that into things. What that would mean in effect, if we took that

zero at face value, is that the probability of the class being "no" given any day for which the outlook

was overcast would be zero. Anything multiplied by zero is zero.
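The add-one fix can be sketched in a few lines (illustrative Python, not Weka's actual implementation):

```python
def laplace_probs(counts):
    # Add 1 to every count so that no attribute value gets probability zero.
    smoothed = [c + 1 for c in counts]
    total = sum(smoothed)
    return [c / total for c in smoothed]

# Outlook under the "no" class: sunny 3, overcast 0, rainy 2.
probs = laplace_probs([3, 0, 2])
print(probs)  # [0.5, 0.125, 0.375] -- overcast can no longer veto the "no" class
```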

These zeros in probability terms have sort of a veto over all of the other numbers, and we don't

want that. We don't want to categorically conclude that it must be a "no" day on a basis that it's

overcast, and we've never seen an overcast outlook on a "no" day before. That's called a "zero-

frequency problem", and Weka's solution -- the most common solution -- is very simple, we just add

1 to all the counts. That's why all those numbers in the Weka table are 1 bigger than the numbers in

the table on the slide. Aside from that, it's all exactly the same. We're avoiding zero frequencies by

effectively starting all counts at 1 instead of starting them at 0, so they can't end up at 0. That's the

Naive Bayes method. The assumption is that all attributes contribute equally and independently to

the outcome. That works surprisingly well, even in situations where the independence assumption is

clearly violated. Why does it work so well when the assumption is wrong? That's a good question.

Basically, classification doesn't need accurate probability estimates. We're just going to choose as

the class the outcome with the largest probability. As long as the greatest probability is assigned

to the correct class, it doesn't matter how accurate the probability estimates are. This actually

means that if you add redundant attributes you get problems with Naive Bayes. The extreme case of

dependence is where two attributes have the same values, identical attributes. That will cause

havoc with the Naive Bayes method. However, Weka contains methods for attribute selection to

allow you to select a subset of fairly independent attributes after which you can safely use Naive

Bayes. There's quite a bit of stuff on statistical modeling in Section 4.2 of the course text. Now you

need to go and do that activity. See you soon!

18 Decision trees

https://www.youtube.com/watch?v=IPh8PxDtgdQ&index=18&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7

Hi! Here in Lesson 3.4, we're continuing our exploration of simple classifiers by looking at classifiers

that produce decision trees. We're going to look at J48. We've used this classifier quite a bit so far.

Let's have a look at how it works inside. J48 is based on a top-down strategy, a recursive divide and

conquer strategy. You select which attribute to split on at the root node, and then you create a

branch for each possible attribute value, and that splits the instances into subsets, one for each

branch that extends from the root node. Then you repeat the procedure recursively for each

branch, selecting an attribute at each node, and you use only instances that reach that branch to

make the selection.
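The recursive divide-and-conquer strategy just described can be sketched like this (a toy version in Python with nominal attributes stored as dictionaries; real J48 adds pruning and many other refinements):

```python
from math import log2
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    # Entropy of the class distribution before the split,
    # minus the weighted entropy of the branches after it.
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attr]].append(label)
    after = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - after

def build_tree(rows, labels, attrs):
    # Stop when the node is pure (or no attributes remain): return a leaf
    # labelled with the majority class.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Select the attribute to split on, create a branch per value,
    # and recurse using only the instances that reach each branch.
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    branches = {}
    for value in {row[best] for row in rows}:
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        branches[value] = build_tree([rows[i] for i in keep],
                                     [labels[i] for i in keep],
                                     attrs - {best})
    return (best, branches)
```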

At the end you stop, perhaps continuing until all instances have the same class. The trick is,

the question is, how do you select a good attribute for the root node? This is the weather data, and

as you can see, outlook has been selected for the root node. Here are the four possibilities: outlook,

windy, humidity, and temperature. These are the consequences of splitting on each of these

attributes. What we're really looking for is a pure split, a split into pure nodes. We would be

delighted if we found an attribute that split exactly into one node where they are all yeses, another

node where they are all nos, and perhaps a third node where they are all yeses again.

That would be the best thing. What we don't want is mixtures, because when we get mixtures of

yeses and nos at a node, then we've got to split again. You can see that splitting on outlook looks

pretty good. We get one branch with two yeses and three nos, then we get a pure yes branch for

overcast, and, when outlook is rainy, we get three yeses and two nos. How are we going to

quantify this to decide which one of these attributes produces the purest nodes? We're on a quest

here for purity. The aim is to get the smallest tree, and top-down tree induction methods use some

kind of heuristic.

The most popular heuristic to produce pure nodes is an information theory-based heuristic. I'm not

going to explain information theory to you, that would be another MOOC of its own -- quite an

interesting one, actually. Information theory was founded by Claude Shannon, an American

mathematician and scientist who died about 12 years ago. He was an amazing guy. He did some

amazing things. One of the most amazing things, I think, is that he could ride a unicycle and juggle

clubs at the same time when he was in his 80's. That's pretty impressive. He came up with the whole idea

of information theory and quantifying entropy, which measures information in bits.

This is the formula for entropy: entropy(p1, p2, ..., pn) = -p1 log p1 - p2 log p2 - ... - pn log pn, a sum over each of the possible outcomes. I'm not really

going to explain it to you. All of those minus signs are there because logarithms are negative if

numbers are less than 1 and probabilities always are less than 1. So, the entropy comes out to be a

positive number. What we do is we look at the information gain. How much information in bits do

you gain by knowing the value of an attribute? That is, the entropy of the distribution before the

split minus the entropy of the distribution after the split. Here's how it works out for the weather

data.

These are the number of bits. If you split on outlook, you gain 0.247 bits. I know you might be

surprised to see fractional numbers of bits; normally we think of 1 bit or 8 bits or 32 bits, but

information theory shows how you can regard bits as fractions. These produce fractional numbers of

bits. I don't want to go into the details. You can see, knowing the value for windy gives you only

0.048 bits of information. Humidity is quite a bit better; temperature is way down there at 0.029

bits. We're going to choose the attribute that gains the most bits of information, and that, in this

case, is outlook.
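Those gain figures can be reproduced directly from the counts (a quick check in Python; the branch counts are the ones from the weather data above):

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

before = entropy([9, 5])  # 9 yes, 5 no before any split: about 0.940 bits
# Splitting on outlook: sunny (2 yes, 3 no), overcast (4 yes, 0 no), rainy (3 yes, 2 no)
branches = [[2, 3], [4, 0], [3, 2]]
after = sum(sum(b) / 14 * entropy(b) for b in branches)
gain = before - after
print(round(gain, 3))  # 0.247 bits
```

Running the same calculation with the branch counts for windy, humidity, and temperature reproduces the other gain figures.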

At the top level of this tree, the root node, we're going to split on outlook. Having decided to split on

outlook, we need to look at each of the 3 branches that emanate from outlook corresponding to the 3

possible values of outlook, and consider what to do at each of those branches. At the first branch,

we might split on temperature, windy or humidity. We're not going to split on outlook again because

we know that outlook is sunny. For all instances that reach this place, the outlook is sunny. For the

other 3 things, we do exactly the same thing. We evaluate the information gain for temperature at

that point, for windy and humidity, and we choose the best. In this case, it's humidity with a gain of

0.971 bits. You can see that, if we branch on humidity, then we get pure nodes: 3 nos in one and

2 yeses in the other. When we get that, we don't need to split anymore. We're on a quest for purity.

That's how it works. It just carries on until it reaches the end, until it has pure nodes. Let's open up

Weka, and just do this with the nominal weather data.

Of course, we've done this before, but I'll just do it again. It won't take long. J48 is the workhorse

data mining algorithm. There's the data. We're going to choose J48. It's a tree classifier. We're going

to run this, and we get a tree -- the very tree I showed you before -- split first on outlook: sunny,

overcast and rainy. Then, if it's sunny, split on humidity: 3 no instances reach the high node, and 2

yes instances reach the normal node, and so on. We can look at the tree using Visualize the tree

in the right-click menu. Here it is. These are the number of yes instances that reach this node and

the number of no instances. In the case of this particular tree, of course we're using cross validation

here.

It's done an 11th run on the whole dataset. It's given us these numbers by looking at the training set.

In fact, this becomes a pure node here. Sometimes you get 2

numbers here -- 3/2 or 3/1. The first number indicates the number of correct things that reach that

node, so in this case the number of nos. If there was another number following the 3, that would

indicate the number of yeses, that is, incorrect things that reach that node. But that doesn't occur in

this very simple situation. There you have it, J48: top-down induction of decision trees. It's soundly

based in information theory. It's a pretty good data mining algorithm. 10 years ago I might have said

it's the best data mining algorithm, but some even better ones, I think, have been produced since

then.

However, the real advantage of J48 is that it's reliable and robust, and, most importantly, it produces

a tree that people can understand. It's very easy to understand the output of J48. That's really

important when you're applying data mining. There are a lot of different criteria you could use

for attribute selection. Here we're using information gain. Actually, in practice, these don't normally

make a huge difference. There are some important modifications that need to be done to this

algorithm to be useful in practice. I've only really explained the basic principles. The actual J48

incorporates some more complex stuff to make it work under different circumstances in practice.

We'll talk about those in the next lesson. Section 4.3 of the text Divide-and-conquer: Constructing

decision trees explains the simple version of J48 that I've explained here. Now you should go and do

the activity associated with this lesson. Good luck! See you next time!

19 Pruning decision trees

https://www.youtube.com/watch?v=Vo675o8ai2w&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=19

Hi! In the last class, we looked at a bare-bones algorithm for constructing decision trees. To get an

industrial strength decision tree induction algorithm, we need to add some more complicated stuff,

notably pruning. We're going to talk in this [lesson] about pruning decision trees. Here's a guy

pruning a tree, and that's a good image to have in your mind when we're talking about decision

trees. We're looking at those little twigs and little branches around the edge of the tree, seeing if

they're worthwhile, and snipping them off if they're not contributing. That way, we'll get a decision

tree that might perform worse on the training data, but perhaps generalizes better to independent

test data. That's what we want. Here's the weather data again. I'm sorry to keep harking back to the

weather data, but it's just a nice simple example that we all know now. I've added here a new

attribute. I call it an ID code attribute, which is different for each instance.

I've just given them an identification code: a, b, c, and so on. Let's just think from the last lesson,

what's going to happen when we consider which is the best attribute to split on at the root, the first

decision. We're going to be looking for the information gain from each of our attributes separately.

We're going to gain a lot of information by choosing the ID code. Actually, if you split on the ID code,

that tells you everything about the instance we're looking at. That's going to be a maximal amount of

information gain, and clearly we're going to split on that attribute at the root node of the decision

tree. But that's not going to generalize at all to new weather instances. To get around this problem,

having constructed a decision tree, decision tree algorithms then automatically prune it back. You

don't see any of this, it just happens when you start the algorithm in Weka. How do we prune? There

are some simple techniques for pruning, and some more complicated techniques for pruning. A very

simple technique is to not continue splitting if the nodes get very small.

I said in the last lesson that we're going to keep splitting until each node has just one class

associated with it. Perhaps that's not such a good idea. If we have a very small node with a couple

instances, it's probably not worth splitting that node. That's actually a parameter in J48. I've got

Weka going here. I'm going to choose J48 and look at the parameters. There's a parameter called

minNumObj. If I mouse over that parameter, it says "The minimum number of instances per leaf".
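As a sketch, that kind of pre-pruning just adds a size check to the stopping condition (a simplified illustration in Python, not Weka's actual code; the name min_num_obj mirrors the J48 parameter):

```python
def should_stop(labels, min_num_obj=2):
    # Stop splitting when the node is pure, or when it holds fewer
    # instances than the minimum worth splitting (minNumObj-style check).
    return len(set(labels)) <= 1 or len(labels) < min_num_obj

print(should_stop(['yes', 'yes', 'no']))             # False: mixed and big enough
print(should_stop(['yes']))                          # True: too small to split
print(should_stop(['yes', 'no', 'no'], min_num_obj=5))  # True: below the threshold
```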

The default value for that is 2. The second thing we do is to build a full tree and then work back from

the leaves. It turns out to be better to build a full tree and prune back rather than trying to do

forward pruning as you're building the tree. We apply a statistical test at each stage.

That's the confidenceFactor parameter. It's here. The default value is 0.25. "The confidence factor

used for pruning [smaller values incur more pruning]." Then, sometimes it's good to prune an

interior node, and to raise the subtree beneath that interior node up one level. That's called

subtreeRaising. That's this parameter here. We can switch it on or switch it off. "Whether to

consider the subtree raising operation during pruning." Subtree raising actually increases the

complexity of the algorithm, so it would work faster if you turned off subtree raising on a large

problem. I'm not going to talk about the details of these methods. Pruning is a messy and

complicated subject, and it's not particularly illuminating.

Actually, I don't really recommend playing around with these parameters here. The default values on

J48 tend to do a pretty good job. Of course, it's become apparent to you now that the need to prune

is really a result of the original unpruned tree overfitting the training dataset. This is another

instance of overfitting. Sometimes simplifying a decision tree gives better results, not just a smaller,

more manageable tree, but actually better results. I'm going to open the diabetes data. I'm going to

choose J48, and I'm just going to run it with the default parameters. I get an accuracy of 73.8%,

evaluated using cross-validation. The size of the tree is 20 leaves, and a total of 39 nodes. That's 19

interior nodes and 20 leaf nodes. Let's switch off pruning. J48 prunes by default. We're going to

switch off pruning. We've got an unpruned option here, which is false, which means it's pruning.

I'm going to change that to true -- which means it's not pruning any more -- and run it again. Now we

get a slightly worse result, 72.7%, probably not significantly worse. We get a slightly larger tree -- 22

leaves. That's a double whammy, really. We've got a bigger tree, which is harder to understand,

and we've got a slightly worse prediction result. We would prefer the pruned [tree] in this example

on this dataset. I'm going to show you a more extreme example with the breast cancer data. I don't

think we've looked at the breast cancer data before. The class is no-recurrence-events versus

recurrence-events, and there are attributes like age, menopause, tumor size, and so on. I'm going to

go classify this with J48 in the default configuration. I need to switch on pruning -- that is, make

unpruned false -- and then run it. I get an accuracy of 75.5%, and I get a fairly small tree with 4

leaves and 2 internal nodes. I can look at that tree here, or I can visualize the tree.

We get this nice, simple little decision structure here, which is quite comprehensible and performs

pretty well, 75% accuracy. I'm going to switch off pruning. Make unpruned true, and run it again.

First of all, I get a much worse result, 69.6% -- probably significantly worse than the 75.5% I had

before. More importantly, I get a huge tree. It's massive. If I try to visualize that, I probably

won't be able to see very much. I can try to fit that to my screen, and it's still impossible to see

what's going on here. In fact, if I look at the textual description of the tree, it's just extremely

complicated. That's a bad thing. Here, an unpruned tree is a very bad idea. We get a huge tree which

does quite a bit worse than a much simpler decision structure. J48 does pruning by default and, in

general, you should let it do pruning according to the default parameters. That would be my

recommendation. We've talked about J48, or, in other words, C4.5. Remember, in Lesson 1.4, we

talked about the progression from C4.5 by Ross Quinlan.

Here is a picture of Ross Quinlan, an Australian computer scientist, at the bottom of the screen. The

progression went from Ross's C4.5 to J48, which is the Java implementation, essentially equivalent to

C4.5. It's a very popular method. It's a simple method and easy to use. Decision trees are very

attractive because you can look at them and see what the structure of the decision is, see what's

important about your data. There are many different pruning methods, and their main effect is to

change the size of the tree. They have a small effect on the accuracy, and it often makes the

accuracy worse. They often have a huge effect on the size of the tree, as we just saw with the breast

cancer data. Pruning is actually a general technique to guard against overfitting, and it can be

applied to structures other than trees, like decision rules. There's a lot more we could say about

decision trees. For example, we've been talking about univariate decision trees -- that is, ones that

have a single test at each node. You can imagine a multivariate tree, where there is a compound

test. The test of the node might be 'if this attribute is that AND that attribute is something else'. You

can imagine more complex decision trees produced by more complex decision tree algorithms. In

general, C4.5/J48 is a popular and useful workhorse algorithm for data mining. You can read a lot

more about decision trees if you go to the course text. Section 6.1 tells you about pruning and gives

you the mathematical details of the pruning methods that I've just sketched here. It's time for you to

do the activity, and I'll see you in the next lesson. Bye for now!

20 Nearest neighbor

https://www.youtube.com/watch?v=2F4KwLwUsP8&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=20

Hi! I'm sitting here in New Zealand. It's on the globe behind me. That's New Zealand, at the top of

the world, surrounded by water. But that's not where I'm from originally. I moved here about 20

years ago. Here on this map, of course, this is New Zealand -- Google puts things with the north at

the top, which is probably what you're used to. I came here from the University of Calgary in Canada,

where I was for many years. I used to be head of computer science for the University of Calgary. But,

originally, I'm from Belfast, Northern Ireland, which is here in the United Kingdom. So, my accent

actually is Northern Irish, not New Zealand. This is not a New Zealand accent. We're going to talk

here in the last lesson of Class 3 about another machine learning method called the nearest

neighbor, or instance-based, machine learning method.

When people talk about rote learning, they just mean remembering stuff without really thinking

about it. It's the simplest kind of learning. Nearest neighbor implements rote learning. It just

remembers the training instances, and then, to classify a new instance, it searches the training set

for one that is most like the new instance. The representation of the knowledge here is just the set

of instances. It's a kind of lazy learning. The learner does nothing until it has to do some predictions.

Confusingly, it's also called instance-based learning. Nearest neighbor learning and instance-based

learning are the same thing. Here is just a little picture of 2-dimensional instance space. The blue

points and the white points are two different classes -- yes and no, for example.

Then we've got an unknown instance, the red one. We want to know which class it's in. So, we

simply find the closest instance in each of the classes and see which is closest. In this case, it's the

blue class. So, we would classify that red point as though it belonged to the blue class. If you think

about this, that's implicitly drawing a line between the two clouds of points. It's a straight line here,

the perpendicular bisector of the line that joins the two closest points. The nearest neighbor method

produces a linear decision boundary. Actually, it's a little bit more complicated than that. It produces

a piece-wise linear decision boundary with sometimes a bunch of little linear pieces of the decision

boundary. Of course, the trick is what do we mean by "most like".

We need a similarity function, and conventionally, people use the regular distance function, the

Euclidean distance, which is the sum of the squares of the differences between the attributes.

Actually, it's the square root of the sum of the squares, but since we're just comparing two

instances, we don't need to take the square root. Or, you might use the Manhattan or city block

distance, which is the sum of the absolute differences between the attribute values. Of course, I've

been talking about numeric attributes here. If attributes are nominal, we need the difference

between different attribute values. Conventionally, people just say the distance is 1 if the attribute

values are different and 0 if they are the same. It might be a good idea with nearest neighbor

learning to normalize the attributes so that they all lie between 0 and 1, so the distance isn't skewed

by some attribute that happens to be on some gigantic scale.
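The distance computations and the 0-1 normalization just described can be sketched like this (a minimal illustration in Python, not Weka's IBk implementation; the majority vote discussed next is included for completeness):

```python
from collections import Counter

def normalize(rows):
    # Min-max normalize each attribute to [0, 1] so that no attribute
    # on a gigantic scale skews the distances.
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((v - l) / (h - l) if h > l else 0.0
                  for v, l, h in zip(row, lo, hi)) for row in rows]

def sq_euclidean(a, b):
    # Squared Euclidean distance; the square root is unnecessary
    # when we only compare distances to rank neighbors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_classify(train, query, k=1):
    # train: list of (attribute_tuple, class_label) pairs.
    # Take the k nearest instances and vote on the majority class.
    nearest = sorted(train, key=lambda inst: sq_euclidean(inst[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

For nominal attributes, sq_euclidean would instead add 1 when two values differ and 0 when they match, as described above.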

What about noisy instances? If we have a noisy dataset, then by accident we might find an

incorrectly classified training instance as the nearest one to our test instance. You can guard against

that by using the k-nearest-neighbors. k might be 3 or 5, and you look for the 3 or the 5 nearest

neighbors and choose the majority class amongst those when classifying an unknown point. That's

the k-nearest-neighbor method. In Weka, it's called IBk (instance-based learning with parameter k),

and it's in the lazy class. Let's open the glass dataset. Go to Classify and choose the lazy classifier IBk.

Let's just run it. We get an accuracy of 70.6%. The model is not really printed here, because there is

no model. It's just the set of training instances. We're using 10-fold cross-validation, of course. Let's

change the value of k; this kNN parameter is the k value. It's set by default to 1. (The number of neighbors to

use.) We'll change that to, say, 5 and run that. In this case, we get a slightly worse result. This is not

such a noisy dataset, I guess. If we change it to 20 and run it again.

We get 65% accuracy, slightly worse again. If we had a noisy dataset, we might find that the accuracy

figures improved as k got a little bit larger. Then, it would eventually start to decrease again. If we set k to

be an extreme value, close to the size of the whole dataset, then we're taking the distance of the

test instance to all of the points in the dataset and averaging those, which will probably give us

something close to the baseline accuracy. Here, if I set k to be a ridiculous value like 100, I'm going to

take the 100 nearest instances and average their classes. We get an accuracy of 35%, which, I think is

pretty close to the baseline accuracy for this dataset. Let me just find that out with ZeroR, the

baseline accuracy is indeed 35%. Nearest neighbor is a really good method. It's often very accurate.

It can be slow. A simple implementation would involve scanning the entire training dataset to make

each prediction, because we've got to calculate the distance of the unknown test instance from all of

the training instances to see which is closest.

There are more sophisticated data structures that can make this faster, so you don't need to scan

the whole dataset every time. It assumes all attributes are equally important. If that wasn't the case,

you might want to look at schemes for selecting or weighting attributes depending on their

importance. If we've got noisy instances, then we can use a majority vote over the k nearest

neighbors, or we might weight instances according to their prediction accuracy. Or, we might try to

identify reliable prototypes, one for each of the classes. This is a very old method. Statisticians have

used k-nearest-neighbor since the 1950's. There's an interesting theoretical result. If the number (n)

of training instances approaches infinity, and k also gets larger in such a way that k/n approaches 0,

but k also approaches infinity, the error of the k-nearest-neighbor method approaches the

theoretical minimum error for that dataset. There is a theoretical guarantee that with a huge dataset

and large values of k, you're going to get good results from nearest neighbor learning. There's a

section in the text, Section 4.7 on Instance-based learning. This is the last lesson of Class 3. Off you

go and do the activity, and I'll see you in Class 4. Bye for now!

21 Questions

https://www.youtube.com/watch?v=aIw4w_o3cg4&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=21

Hi! We've just finished Class 3, and here are some of the issues that arose. I have a list of them here,

so let's start at the top. Numeric precision in activities has caused a little bit of unnecessary angst. So

we've simplified our policy. In general, we're asking you to round your percentages to the nearest

integer. We certainly don't want you typing in long strings of decimal places. Some people are getting the wrong results in

Weka. One reason you might get the wrong results is that the random seed is not set to the default

value. Whenever you change the random seed, it stays there until you change it back or until you

restart Weka. Just restart Weka or reset the random seed to 1. Another thing you should do is check

your version of Weka. We asked you to download 3.6.10. There have been some bug fixes since the

previous version, so you really do need to use this new version.

One of the activities asked you to copy an attribute, and some people found some surprising things

with Weka claiming 100% accuracy. If you accidentally ask Weka to predict something that's already

there as an attribute, it will do very well, with very high accuracy! It's very easy to mislead yourself

when you're doing data mining. You just need to make sure you know what you're trying to predict,

you know what the attributes are, and you haven't accidentally included a copy of the class attribute

as one of the attributes that's being used for prediction. There's been some discussion on the

mailing list about whether OneR is really always better than ZeroR on the training set. In fact, it is.

Someone proved it. (Thank you Jurek for sharing that proof with us.) Someone else found a

counterexample! "If we had a dataset with 10 instances, 6 belonging to Class A and 4 belonging to

Class B, with attribute values selected randomly, wouldn't ZeroR outperform OneR? -- OneR would

be fooled by the randomness of attribute values."

It's kind of anthropomorphic to talk about OneR being "fooled by" things. It's not fooled by anything.

It's not a person; it's not a being: it's just an algorithm. It just gets an input and does its thing with

the data. If you think that OneR might be fooled, then why don't you try it? Set up this dataset with

10 instances, 6 in A and 4 in B, select the attributes randomly, and see what happens. I think you'll

be able to convince yourself quite easily that this counterexample isn't a counterexample at all. It is

definitely true that OneR is always better than ZeroR on the training set. That doesn't necessarily

mean it's going to be better on an independent test set, of course. The next thing is Activity 3.3,

which asks you to repeat attributes with NaiveBayes.

Some people asked "why are we doing this?" It's just an exercise! We're just trying to understand

NaiveBayes a bit better, and what happens when you get highly correlated attributes, like repeated

attributes. With NaiveBayes, enough repetitions mean that the other attributes won't matter at all.

This is because all attributes contribute equally to the decision, so multiple copies of an attribute

skew it in that direction. This is not true with other learning algorithms. It's true for NaiveBayes, but

it's not true for OneR or J48, for example. Copied attributes don't affect OneR at all. The copying

exercise is just to illustrate what happens with NaiveBayes when you have non-independent

attributes. It's not something you do in real life.

Although you might copy an attribute in order to transform it in some way, for example. Someone

asked about the mathematics. In the Bayes formula, if the attribute E1 was repeated k times, you get

Pr[E1|H]^k in the top line. How does this work mathematically? First of all, I'd just like to say that the

Bayes formulation assumes independent attributes. Bayes expansion is not true if the attributes are

dependent. But the algorithm works off that assumption anyway, so let's see what would happen. If you can stomach a

bit of mathematics, here's the equation for the probability of the hypothesis given the evidence

(Pr[H|E]). H might be Play is "yes" or Play is "no", for example, in the weather data. It's equal to this

fairly complicated formula at the top, which, let me just simplify it by writing "..." for all the bits after

here. So Pr[E1|H]^k, where E1 is repeated k times, times all the other stuff, divided by Pr[E]. What

the algorithm does: because we don't know Pr[E], we normalize the 2 probabilities by calculating

Pr[yes|E] using this formula and Pr[no|E], and normalizing them so that they add up to 1.

That then computes Pr[yes|E] as this thing here -- which is at the top, up here -- Pr[E1|yes]^k,

divided by that same thing, plus the corresponding thing for "no". If you look at this formula and just

forget about the "...", what's going to happen is that these probabilities are less than 1. If we take

them to the k'th power, they are going to get very small as k gets bigger. In fact, they're going to

approach 0. But one of them is going to approach 0 faster than the other one. Whichever one is

bigger -- for example, if the "yes" one is bigger than the "no" one -- then it's going to dominate. The

normalized probability then is going to be 1 if the "yes" probability is bigger than the "no"

probability, otherwise 0. That's what's actually going to happen in this formula as k approaches

infinity. The result is as though there is only one attribute: E1.
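
You can check that limiting behavior numerically. A minimal sketch, with made-up probabilities (the "yes" conditional probability set to 0.6, the "no" one to 0.4, equal priors, and the "..." terms ignored):

```python
# Numeric check of the limit: Pr[yes|E] when attribute E1 is repeated k
# times, ignoring the other attributes. All probabilities here are made up.
def posterior_yes(p1_yes, p1_no, prior_yes, prior_no, k):
    top_yes = (p1_yes ** k) * prior_yes
    top_no = (p1_no ** k) * prior_no
    return top_yes / (top_yes + top_no)  # normalize so the two sum to 1

for k in (1, 5, 20):
    print(k, round(posterior_yes(0.6, 0.4, 0.5, 0.5, k), 4))
# The larger conditional probability (0.6, for "yes") dominates, so the
# normalized posterior approaches 1 as k grows.
```

Both raw products shrink toward 0, but their normalized ratio is driven to 0 or 1 by whichever conditional probability is larger, exactly as described.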

That's a mathematical explanation of what happens when you copy attributes in NaiveBayes. Don't

worry if you didn't follow that; that was just for someone who asked. Decision trees and bits.

Someone said on the mailing list that in the lecture there was a condition that resulted in branches

with all "yes" or all "no" results completely determining things. Why was the information gain only

[0.971] and not the full 1 bit? This is the picture they were talking about. Here, "humidity"

determines these are all "no" and these are all "yes" for high and normal humidity, respectively.
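
Those numbers can be checked with a couple of lines; the counts at this node appear to be 3 of one class and 2 of the other:

```python
import math

# Entropy of a class distribution, in bits.
def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# A 3-vs-2 node: slight imbalance, so slightly less than 1 bit.
print(round(entropy([3, 2]), 3))  # → 0.971
# Balanced nodes give the full bit:
print(entropy([3, 3]), entropy([2, 2]))  # → 1.0 1.0
```

Because the split is perfect, the leaves have zero entropy, so the gain is the node's own entropy: 0.971 bits.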

When you calculate the information gain -- and this is the formula for information gain -- you get

0.971 bits. You might expect 1 (and I would agree), and you would get 1 if you had 3 no's and 3 yes's

here, or if you had 2 no's and 2 yes's. But because there is a slight imbalance between the number of

no's and the number of yes's, you don't actually get 1 bit under these circumstances. There were

some questions on Class 2 about stratified cross-validation, which tries to get the same proportion of

class values in each fold. Some suggested maybe you should choose the number of folds so that it

can do this exactly, instead of approximately.

If you chose as the number of folds an exact divisor of the number of elements in each class, we'd be

able to do this exactly. "Would that be a good thing to do?" was the question. The answer is no, not

really. These things are all estimates, and you're treating them as though they were exact answers.

They are all just estimates. There are more important considerations to take into account when

determining the number of folds to do in your cross-validation. Like: you want a large enough test

set to get an accurate estimate of the classification performance, and you want a large enough

training set to train the classifier adequately. Don't worry about stratification being approximate.

The whole thing is pretty approximate actually. Someone else asked "why is there a 'Use training

set' option on the Classify tab?" It's very misleading to take the evaluation you get on the training

data seriously, as we know. So why is it there in Weka? Well, we might want it for some purposes.

For example, it does give you a quick upper bound on an algorithm's performance: it couldn't

possibly do better than it would do on the training set. That might be useful, allowing you to quickly

reject a learning algorithm. The important thing here is to understand what is wrong with using the

training set for a performance estimate, and what overfitting is. Rather than changing the interface

so you can't do bad things, I would rather protect you by educating you about what the issues are

here. There have been quite a few suggested topics for a follow-up course: attribute selection,

clustering, the Experimenter, parameter optimization, the KnowledgeFlow interface, and simple

command line interface.

We're considering a follow-up course, and we'll be asking you for feedback on that at the end of this

course. Finally, someone said "Please let me know if there is a way to make a small donation" -- he's

enjoying the course so much! Well, thank you very much. We'll make sure there is a way to make a

small donation at the end of the course. That's it for now. On with Class 4. I hope you enjoy Class 4,

and we'll talk again later. Bye for now!

CLASS 4

22 Classification boundaries

https://www.youtube.com/watch?v=zoKKTi1qKdg&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=22

Hello, again, and welcome to Data Mining with Weka, back here in New Zealand. In this class, Class

4, we're going to look at some pretty cool machine learning methods. We're going to look at linear

regression, classification by regression, logistic regression, support vector machines, and ensemble

learning. The last few of these are contemporary methods, which haven't been around very long.

They are kind of state-of-the-art machine learning methods. Remember, there are 5 classes in this

course, so next week is Class 5, the last class. We'll be tidying things up and summarizing things then.

You're well over halfway through; you're doing well. Just hang on in there. In this lesson, we're going

to start by looking at classification boundaries for different machine learning methods. We're going

to use Weka's Boundary Visualizer, which is another Weka tool that we haven't encountered yet. I'm

going to use a 2-dimensional dataset. I've prepared iris.2d.arff.

I took the regular iris dataset and deleted a couple of attributes -- sepallength and sepalwidth --

leaving me with this 2D dataset, and the class. We're going to look at that using the Boundary

Visualizer. You get that from this Visualization menu on the Weka Chooser. There are a lot of tools in

Weka, and we're just going to look at this one here, the Boundary Visualizer. I'm going to open the

same file in the Boundary Visualizer, the 2-dimensional iris dataset. Here we've got a plot of the

data. You can see that we're plotting petalwidth on the y-axis against petallength on the x-axis. This

is a picture of the dataset, with setosa in red, versicolor in green, and virginica in blue. I'm going to choose a classifier. Let's begin

with the OneR classifier, which is in rules. I'm going to "plot training data" and just going to let it rip.

The color diagram shows the decision boundaries, with the training data superimposed on it.

Let's look at what OneR does to this dataset in the Explorer. OneR has chosen to split on petalwidth.

If it's less than a certain amount, we get a setosa; if it's intermediate, we get a versicolor; and if it's

greater than the upper boundary, we get a virginica. It's the same as what's being shown here.

We're splitting on petalwidth. If it's less than a certain amount, we get a setosa; in the middle, a

versicolor; and at the top, a virginica. This is a spatial representation of the decision boundary that

OneR creates on this dataset. That's what the Boundary Visualizer does; it draws decision

boundaries. It shows here that OneR chooses an attribute -- in this case petalwidth -- to split on. It

might have chosen petallength, in which case we'd have vertical decision boundaries.

Either way, we're going to get stripes from OneR. I'm going to go ahead and look at some boundaries

for other schemes. Let's look at IBk, which is a "lazy" classifier. That's the instance-based learner we

looked at in the last class. I'm going to run that. Here we get a different kind of pattern. I'll just stop

it there. We've got diagonal lines. Down here are the setosas underneath this diagonal line; the

versicolors in the intermediate region; and the virginicas, by and large, in the top right-hand corner.

Remember what [IBk] does. It takes a test instance. Let's say we had an instance here, just on this

side of the boundary, in the red. Then it chooses the nearest instance to that.

That would be this one, I guess; it's nearer than this one here. This is a red point. If I

were to cross over the boundary here, it would choose a green class, because this would be the

nearest instance then. If you think about it, this boundary goes halfway between this nearest red

point and this nearest green point. Similarly, if I take a point up here, I guess the two nearest

instances are this blue one and this green one. This blue one is closer. In this case, the boundary

goes along this straight line here. You can see that it's not just a single line: this is a piecewise linear

line, so this part of the boundary goes exactly halfway between these two points quite close to it.

Down here, the boundary goes exactly halfway between these two points. It's the perpendicular

bisector of the line joining these points. So, we get a piecewise linear boundary made up of little

pieces. It's kind of interesting to see what happens if we change the parameter: if we look at, say, 5

nearest neighbors instead of just 1. Now we get a slightly blurry picture. Down here in the pure red

region, all 5 nearest neighbors are red; but if we look in the intermediate region here, then the nearest

neighbors to a point here -- this one is going to be in the 5, this might be another one in the 5, and

there might be a couple more down here in the 5. So we get an intermediate color here, and IBk

takes a vote. If we had 3 reds and 2 greens, then we'd be in the red region and that would be

depicted as this darker red here.
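
That nearest-neighbor-and-vote procedure can be sketched in a few lines. The points here are made up, not the actual iris measurements -- (petallength, petalwidth) pairs with their classes:

```python
from collections import Counter

# Minimal k-nearest-neighbor sketch on made-up 2D points.
train = [((1.4, 0.2), "setosa"), ((4.0, 1.3), "versicolor"),
         ((4.5, 1.5), "versicolor"), ((5.5, 2.0), "virginica"),
         ((6.0, 2.2), "virginica")]

def knn(x, k):
    # Sort by squared Euclidean distance and vote among the k nearest.
    near = sorted(train, key=lambda p: (p[0][0] - x[0]) ** 2
                                       + (p[0][1] - x[1]) ** 2)[:k]
    return Counter(c for _, c in near).most_common(1)[0][0]

print(knn((4.2, 1.4), 1))  # the single nearest neighbor decides
print(knn((5.0, 1.7), 3))  # with k=3, the three nearest vote
```

With k=1 this reduces to copying the single nearest neighbor's class, which is what produces the crisp piecewise linear boundary; larger k gives the blurred, probabilistic boundary.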

If it had been the other way round with more greens than reds, we'd be in the green region. So

we've got a blurring of these boundaries. These are probabilistic descriptions of the boundary. Let

me just change k to 20 and see what happens. Now we get the same shape, but even more blurry

boundaries. The Boundary Visualizer reveals the way that machine learning schemes are thinking, if

you like -- the internal representation of the dataset. It helps you think about the sorts of things

that machine learning methods do. Let's choose another scheme. I'm going to choose NaiveBayes.

When we talked about NaiveBayes, we only talked about discrete attributes. With continuous

attributes, I'm going to choose a supervised discretization method. Don't worry about this detail, it's

the most common way of using NaiveBayes with numeric attributes. Let's look at that picture.

This is interesting. When you think about NaiveBayes, it treats each of the two attributes as

contributing equally and independently to the decision. It sort of decides what it should be along this

dimension, decides what it should be along that dimension, and multiplies the two together.

Remember the multiplication that went on in NaiveBayes. When you multiply these things together,

you get a checkerboard pattern of probabilities, multiplying up the probabilities. That's because the

attributes are being treated independently. That's a very different kind of decision boundary from

what we saw with instance-based learning. That's what's so good about the Boundary Visualizer: it

helps you think about how things are working inside. I'm going to do one more example.

I'm going to do J48, which is in trees. Here we get this kind of structure. Let's take a look at what

happens in the Explorer if we choose J48. We get this little decision tree: split first on petalwidth; if

it's less than 0.6 it's a setosa for sure. Then split again on petalwidth; if it's greater than 1.7, it's a

virginica for sure. Then, in between, split on petallength and then again on petalwidth, getting a

mixture of versicolors and virginicas. We split first on petalwidth; that's this split here. Remember

the vertical axis is the petalwidth axis. If it's less than a certain amount, it's a setosa for sure. Then

we split again on the same axis. If it's greater than a certain amount, it's a virginica for sure. If it's in

the intermediate region, we split on the other axis, which is petallength. Down here, it's a versicolor

for sure, and here we're going to split again on the petalwidth attribute. Let's change the

minNumObj parameter, which controls the minimum size of the leaves.

If we increase that, we're going to get a simpler tree. We discussed this parameter in one of the

lessons of Class 3. If we run now, then we get a simpler version, corresponding to the simpler rules

we get with this parameter set. Or we can set the parameter to a higher value, say 10, and run it

again. We get even simpler rules, very similar to the rules produced by OneR. We've looked at

classification boundaries. Classifiers create boundaries in instance space and different classifiers

have different capabilities for carving up instance space. That's called the "bias" of the classifier --

the way in which it's capable of carving up the instance space. We looked at OneR, IBk, NaiveBayes,

and J48, and found completely different biases, completely different ways they carve up the instance

space. Of course, this kind of visualization is restricted to numeric attributes and 2-dimensional

plots, so it's not a very general tool, but it certainly helps you think about these different classifiers.

You can read about classification boundaries in Section 17.3 of the course text. Now off you go and

do the activity associated with this lesson. Good luck! We'll see you later.

23 Linear regression

https://www.youtube.com/watch?v=UMa8knob32c&index=23&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7

Hi! This is Lesson 4.2 on Linear Regression. Back in Lesson 1.3, we actually mentioned the difference

between a classification problem and a regression problem. A classification problem is when what

you're trying to predict is a nominal value, whereas in a regression problem what you're trying to

predict is a numeric value. We've seen examples of datasets with nominal and numeric attributes

before, but we've never looked at the problem of regression, of trying to predict a numeric value as

the output of a machine learning scheme. That's what we're doing in this [lesson], linear regression.

We've only had nominal classes so far, so now we're going to look at numeric classes. This is a

classical statistical method, dating back more than 2 centuries. This is the kind of picture you see.

You have a cloud of data points in 2 dimensions, and we're trying to fit a straight line to this cloud of

data points and looking for the best straight-line fit. Only in our case we might have more than 2

dimensions; there might be multiple dimensions.

It's still a standard problem. Let's just look at the 2-dimensional case here. You can write a straight

line equation in this form, with weights w0 plus w1a1 plus w2a2, and so on. Just think about this in

one dimension where there's only one "a". Forget about all the things at the end here, just consider

w0 plus w1a1. That's the equation of this line -- it's the equation of a straight line -- where w0 and

w1 are two constants to be determined from the data. This, of course, is going to work most

naturally with numeric attributes, because we're multiplying these attribute values by weights. We'll

worry about nominal attributes in just a minute. We're going to calculate these weights from the

training data -- w0, w1, and w2. Those are what we're going to calculate from the training data.

Then, once we've calculated the weights, we're going to predict the value for the first training

instance, a1. The notation gets really horrendous here. I know it looks pretty scary, but it's pretty

simple. We're using this linear sum with these weights that we've calculated, using the attribute

values of the first [training] instance in order to get the predicted value for that instance.

We're going to get predicted values for the training instances using this rather horrendous formula

here. I know it looks pretty scary, but it's actually not so scary. These w's are just numbers that we've

calculated from the training data, and then these things here are the attribute values of the first

training instance a1 -- that 1 at the top here means it's the first training instance. This 1, 2, 3 means

it's the first, second, and third attribute. We can write this in this neat little sum form here, which

looks a little bit better. Notice, by the way, that we're defining a0 -- the zeroth attribute value -- to

be 1. That just makes this formula work. For the first training instance, that gives us this number x,

the predicted value for the first training instance and this particular value of a1. Then we're choosing

the weights to minimize the squared error on the training data. This is the actual x value for this i'th

training instance. This is the predicted value for the i'th training instance. We're going to take the

difference between the actual and the predicted value, square them up, and add them all together.

And that's what we're trying to minimize.
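
As a concrete sketch of that minimization in the one-attribute case, using the standard closed-form solution (toy numbers, not the cpu data):

```python
# Least-squares fit of x = w0 + w1*a1 on made-up data where the class
# value is roughly 2*a1 + 1.
a1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x  = [3.1, 4.9, 7.2, 8.8, 11.0]

n = len(a1)
mean_a = sum(a1) / n
mean_x = sum(x) / n
# Minimizing the sum of squared errors gives the classic closed form:
w1 = sum((a - mean_a) * (v - mean_x) for a, v in zip(a1, x)) / \
     sum((a - mean_a) ** 2 for a in a1)
w0 = mean_x - w1 * mean_a

pred = [w0 + w1 * a for a in a1]
sse = sum((v - p) ** 2 for v, p in zip(x, pred))
print(round(w0, 3), round(w1, 3), round(sse, 3))
```

With more attributes the same minimization becomes the matrix problem mentioned next, but the idea is identical: pick the weights that make the summed squared differences as small as possible.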

We get the weights by minimizing this sum of squared errors. That's a mathematical job; we don't

need to worry about the mechanics of doing that. It's a standard matrix problem. It works fine if

there are more instances than attributes. You couldn't expect this to work if you had a huge number

of attributes and not very many instances. But providing there are more instances than attributes --

and usually there are, of course -- that's going to work ok. If we did have nominal values -- say a

2-valued (binary) attribute -- we could just convert its values to 0 and 1 and use those numbers. If we

have multi-valued nominal attributes, you'll have a look at that in the activity at the end of this

lesson. We're going to open a regression dataset and see what it does: cpu.arff.

This is a regular kind of dataset. It's got numeric attributes, and the most important thing here is that

it's got a numeric class -- we're trying to predict a numeric value. We can run LinearRegression; it's in

the functions category. We just run it, and this is the output. We've got the model here. The class

has been predicted as a linear sum. These are the weights I was talking about. It's this weight times

this attribute value plus this weight times this attribute value, and so on. Minus -- and this is w0, the

constant weight, not modified by an attribute. This is a formula for computing the class. When you

use that formula, you can look at the success of it in terms of the training data. The correlation

coefficient, which is a standard statistical measure, is 0.9.

That's pretty good. Then there are various other error figures here that are printed. On the slide, you

can see the interpretation of these error figures. It's really hard to know which one to use. They all

tend to produce the same sort of picture, but I guess the exact one you should use depends on the

application. There's the mean absolute error and the root mean squared error, which is the standard

metric to use. That's linear regression. I'm actually going to look at nonlinear regression here. A

"model tree" is a tree where each leaf has one of these linear regression models. We create a tree

like this, and then at each leaf we have a linear model, which has got those coefficients. It's like a

patchwork of linear models, and this set of 6 linear patches approximates a continuous function.
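
The patchwork idea can be sketched as follows. The split point and the coefficients here are made up for illustration, not anything M5P would actually produce:

```python
# Sketch of a model tree: the tree routes an instance to a leaf, and each
# leaf applies its own linear model. A made-up 1-attribute tree, two leaves.
def model_tree_predict(a):
    if a < 2.0:                # the tree's split test
        w0, w1 = 1.0, 0.5      # linear model LM1 at this leaf
    else:
        w0, w1 = -1.0, 1.5     # linear model LM2 at this leaf
    return w0 + w1 * a

print(model_tree_predict(1.0))  # → 1.5
print(model_tree_predict(4.0))  # → 5.0
```

Each region of instance space gets its own straight line, and together the pieces approximate a nonlinear function.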

There's a method under "trees" with the rather mysterious name of M5P. If we just run that, that

produces a model tree. Maybe I should just visualize the tree.

Now I can see the model tree, which is similar to the one on the slide. You can see that each of these

-- in this case 5 -- leaves has a linear model -- LM1, LM2, LM3, ... And if we look back here, the linear

models are defined like this: LM1 has this linear formula; this linear formula for LM2; and so on. We

chose trees > M5P, we ran it, and we looked at the output. We could compare these performance

figures -- 92-93% correlation, mean absolute error of 30, and so on -- with the ones for regular linear

regression, which got a slightly lower correlation, and a slightly higher absolute error -- in fact, I think

all these error figures are slightly higher. That's something we'll be asking you to do in the activity

associated with this lesson. Linear regression is a well-founded, venerable mathematical technique.

Practical problems often require non-linear solutions. The M5P method builds trees of regression

models, with linear models at each leaf of the tree. You can read about this in the course text in

Section 4.6. Off you go now and do the activity associated with this lesson.

24 Classification by regression

https://www.youtube.com/watch?v=NZi5-LmhXu8&index=24&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7

Hi! Welcome back! In the last lesson, we looked at linear regression -- the problem of predicting, not

a nominal class value, but a numeric class value. The regression problem. In this lesson, we're going

to look at how to use regression techniques for classification. It sounds a bit weird, but regression

techniques can be really good under certain circumstances, and we're going to see if we can apply

them to ordinary classification problems. In a 2-class problem, it's quite easy really. We're going to

call the 2 classes 0 and 1 and just use those as numbers, and then come up with a regression line

that, presumably for most 0 instances has a pretty low value, and for most 1 instances has a larger

value, and then come up with a threshold for determining whether, if it's less than that threshold,

we're going to predict class 0; if it's greater, we're going to predict class 1. If we want to generalize

that to more than 2 classes, we can perform a separate regression for each class. We set the output to 1 for instances that belong to the class, and 0

for instances that don't. Then come up with a separate regression line for each class, and given an

unknown test example, we're going to choose a class with the largest output. That would give us n

regressions for a problem where there are n different classes.
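
Here is a sketch of that multi-response idea on a made-up 3-class problem: fit one 0/1 least-squares regression per class and predict the class whose regression output is largest.

```python
# Multi-response linear regression for classification. Toy, made-up data.
X = [[0, 0], [1, 0], [4, 0], [5, 0], [0, 4], [0, 5]]
y = [0, 0, 1, 1, 2, 2]

def solve(A, b):
    # Tiny Gauss-Jordan elimination with partial pivoting (3x3 system).
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

rows = [[1] + r for r in X]          # prepend a0 = 1 for the constant w0
AtA = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]

def fit(c):
    # Least squares against a 0/1 target: 1 if the instance is in class c.
    t = [1.0 if cls == c else 0.0 for cls in y]
    Atb = [sum(r[i] * v for r, v in zip(rows, t)) for i in range(3)]
    return solve(AtA, Atb)

W = [fit(c) for c in range(3)]
pred = [max(range(3), key=lambda c: sum(w * v for w, v in zip(W[c], r)))
        for r in rows]
print(pred)  # → [0, 0, 1, 1, 2, 2]
```

Each class gets its own regression line, and the "largest output wins" rule turns the n regression values into a single class prediction.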

We could alternatively use pairwise regression: take every pair of classes -- that's n(n-1)/2 pairs --

and have a linear regression line for each pair of classes, discriminating an instance in one class of

that pair from the other class of that pair. We're going to work with a 2-class problem, and we're

going to investigate 2-class classification by regression. I'm going to open diabetes.arff. Then I'm

going to convert the class. Actually, let's just try to apply regression to this. I'm going to try

LinearRegression. You see it's grayed out here. That means it's not applicable. I can select it, but I

can't start it. It's not applicable because linear regression applies to a dataset where the class is

numeric, and we've got a dataset where the class is nominal.

We need to fix that. We're going to change this from these 2 labels to 0 and 1, respectively. We'll do

that with a filter. We want to change an attribute. It's unsupervised. We want to change a nominal

to a binary attribute, so that's the NominalToBinary filter. We want to apply that to the 9th

attribute. The default will apply it to all the attributes, but we just want to apply it to the 9th

attribute. I'm hoping it will change this attribute from nominal to binary. Unfortunately, it doesn't. It

doesn't have any effect, and the reason it doesn't have any effect is because these attribute filters

don't work on the class value. I can change the class value; we're going to give this "No class", so

now this is not the class value for the dataset. Run the filter again. Now I've got what I want: this

attribute "class" is either 0 or 1. In fact, this is the histogram -- there are this number of 0's and this

number of 1's, which correspond to the two different values in the original dataset. Now, we've got

our LinearRegression, and we can just run it.

This is the regression line. It's a line, 0.02 times the "pregnancy" attribute, plus this times the "plas"

attribute, and so on, plus this times the "age" attribute, plus this number. That will give us a number

for any given instance. We can see that number if we select "Output predictions" and run it again.

Here is a table of predictions for each instance in the dataset. This is the instance number; this is the

actual class of the instance, which is 0 or 1; this is the predicted class, which is a number --

sometimes it's less than 0. We would hope that these numbers are generally fairly small for 0's and

generally larger for 1's. They sort of are, although it's not really easy to tell. This is the error value

here in the fourth column. I'm going to do more extensive investigation, and you might ask why are

we bothering to do this? First of all, it's an interesting idea that I want to explore. It will lead to quite

good performance for classification by regression, and it will lead into the next lesson on logistic

regression, which is an excellent classification technique. Perhaps most importantly, we'll learn how

to do some cool things with the Weka interface. My strategy is to add a new attribute called

"classification" that gives this predicted number, and then we're going to use OneR to optimize a

split point for the two classes.

We'll have to restore the class back to its original nominal value, because, remember, I just

converted it to numeric. Here it is in detail. We're going to use a supervised attribute filter

[AddClassification]. This is actually pretty cool, I think. We're going to add a new attribute called

"classification". We're going to choose a classifier for that -- LinearRegression. We need to set

"outputClassification" to "True". If we just run this, it will add a new attribute to the dataset. It's

called "classification", and it's got these numeric values, which correspond exactly to the numeric

values that were predicted here by the linear regression scheme. Now, we've got this "classification"

attribute, and what I'd like to do now is to convert the class attribute back to nominal from numeric.

I want to use ZeroR now, and ZeroR will only work with a nominal class. Let me convert that. I want

NumericToNominal. I want to run that on attribute number 9.

Let me apply that, and now, sure enough, I've got the two labels 0 and 1. This is a nominal attribute

with these two labels. I'll be sure to make that one the class attribute. Then I get the colors back -- 2

colors for the 2 classes. Really, I want to predict this "class" based on the value of "classification",

that numeric value. I'm going to delete all the other attributes. I'm going to go to my Classify panel

here. I'm going to predict "class" -- this nominal value "class" -- and I'm going to use OneR. I think I'll

stop outputting the predictions because they just get in the way, and run that. It's 72-73%, and that's

a bit disappointing. But actually, when you look at this, OneR has produced this really overfitted rule.

We want a single split point: if it's less than this, predict 0; otherwise predict 1.

We can get around that by changing this "b" parameter, the minBucketSize parameter, to be

something much larger. I'm going to change it to 100 and run it again. Now I've got much better

performance, 77% accuracy, and this is the kind of split I've got: if the classification -- that is the

regression value -- is less than 0.47 I'm going to call it a 0; otherwise I'm going to call it a 1. So I've

got what I wanted, classification by regression. We've extended linear regression to classification.
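
The final step -- choosing a single split point on the regression output -- can be sketched like this, with made-up scores and labels and a simple exhaustive search rather than Weka's OneR code:

```python
# Find the single threshold on a numeric "classification" attribute that
# minimizes training errors -- roughly what OneR does once minBucketSize
# stops it from overfitting. Scores and 0/1 labels are made up.
scores = [0.1, 0.2, 0.35, 0.45, 0.5, 0.6, 0.8, 0.9]
labels = [0,   0,   0,    1,    0,   1,   1,   1]

def errors(t):
    # Predict 1 when the score is at least t; count training mistakes.
    return sum((s >= t) != bool(c) for s, c in zip(scores, labels))

ss = sorted(scores)
candidates = [(a + b) / 2 for a, b in zip(ss, ss[1:])]
best = min(candidates, key=errors)
print(best, errors(best))  # one threshold, a single training error
```

Candidate thresholds are midpoints between consecutive scores; with noisy data, as here, no single threshold is perfect, so we take the one with the fewest mistakes.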

This performance of 76.8% is actually quite good for this problem. It was easy to do with 2 classes, 0

and 1; otherwise you need to have a regression for each class -- multi-response linear regression --

or else for each pair of classes -- pairwise linear regression. We learnt quite a few things about

Weka. We learned about unsupervised attribute filters to convert nominal attributes to binary, and

numeric attributes back to nominal. We learned about this cool filter AddClassification, which adds

the classification according to a machine learning scheme as an attribute in the dataset. We learned

about setting and unsetting the class of the dataset, and we learned about the minimum bucket size

parameter to prevent OneR from overfitting. That's classification by regression. In the next lesson,

we're going to do better. We're going to look at logistic regression, an advanced technique which

effectively does classification by regression in an even more effective way. We'll see you soon.

25 Logistic regression

https://www.youtube.com/watch?v=mz4mPfi1j-g&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=25

Hi! Welcome back to Data Mining with Weka. In the last lesson, we looked at classification by

regression, how to use linear regression to perform classification tasks. In this lesson we're going to

look at a more powerful way of doing the same kind of thing. It's called "logistic regression". It's

fairly mathematical, and we're not going to go into the dirty details of how it works, but I'd like to

give you a flavor of the kinds of things it does and the basic principles that underline logistic

regression. Then, of course, you can use it yourself in Weka without any problem. One of the things

about data mining is that you can sometimes do better by using prediction probabilities rather than

actual classes. Instead of predicting whether it's going to be a "yes" or a "no", you might do better to

predict the probability with which you think it's going to be a "yes" or a "no". For example, the

weather is 95% likely to be rainy tomorrow, or 72% likely to be sunny, instead of saying it's definitely

going to be rainy or it's definitely going to be sunny. Probabilities are really useful things in data

mining. NaiveBayes produces probabilities; it works in terms of probabilities. We've seen that in an

earlier lesson. I'm going to open diabetes and run NaiveBayes. I'm going to use a percentage split

with 90%, so that leaves 10% as a test set.

Then I'm going to make sure I output the predictions on those 10%, and run it. I want to look at the

predictions that have been output. This is a 2-class dataset, the classes are tested_negative and

tested_positive, and these are the instances -- number 1, number 2, number 3, etc. This is the actual

class -- tested_negative, tested_positive, tested_negative, etc. This is the predicted class --

tested_negative, tested_negative, tested_negative, tested_negative, etc. This is a plus under the

error column to say where there's an error, so there's an error with instance number 2. These are

the actual probabilities that come out of NaiveBayes. So for instance 1 we've got a 99% probability

that it's negative, and a 1% probability that it's positive.

So we predict it's going to be negative; that's why that's tested_negative. And in fact we're correct; it

is tested_negative. This instance, which is actually incorrect, we're predicting 67% percent for

negative and 33% for positive, so we decide it's a negative, and we're wrong. We might have been

better saying that here we're really sure it's going to be a negative, and we're right; here we think it's

going to be a negative, but we're not sure, and it turns out that we're wrong. Sometimes it's a lot

better to think in terms of the output as probabilities, rather than being forced to make a binary,

black-or-white classification. Other data mining methods produce probabilities, as well. If I look at

ZeroR, and run that, these are the probabilities -- 65% versus 35%. Of course, it's ZeroR! -- it always

produces the same thing.

In this case, it always says tested_negative and always has the same probabilities. The reason why

the numbers are like that, if you look at the slide here, is that we've chosen a 90% training set and a

10% test set, and the training set contains 448 negative instances and 243 positive instances.

Remember the "Laplace Correction" in Lesson 3.2? -- we add 1 to each of those counts to get 449

and 244. That gives us a 65% probability for being a negative instance. That's where these numbers

come from. If we look at J48 and run that, then we get more interesting probabilities here -- the

negative and positive probabilities, respectively. You can see where the errors are.
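The arithmetic behind those ZeroR probabilities is easy to check. Here's a minimal sketch of the calculation (the counts 448 and 243 are the ones quoted from the slide):

```python
# Counts from the 90% training split of the diabetes data (from the slide).
neg, pos = 448, 243

# Laplace correction (Lesson 3.2): add 1 to each count.
neg_l, pos_l = neg + 1, pos + 1
total = neg_l + pos_l  # 693

p_negative = neg_l / total
p_positive = pos_l / total

print(round(p_negative, 3))  # 0.648 -- the "65%" figure ZeroR reports
print(round(p_positive, 3))  # 0.352
```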

These probabilities are all different. Internally, J48 uses probabilities in order to do its pruning

operations. We talked about that when we discussed J48's pruning, although I didn't explain

explicitly how the probabilities are derived. The idea of logistic regression is to make linear

regression produce probabilities, too. This gets a little bit hairy. Remember, when we use linear

regression for classification, we calculate a linear function using regression and then apply a

threshold to decide whether it's a 0 or a 1. It's tempting to imagine that you can interpret these

numbers as probabilities, instead of thresholding like that, but that's a mistake. They're not

probabilities. These numbers that come out on the regression line are sometimes negative, and

sometimes greater than 1.

They can't be probabilities, because probabilities don't work like that. In order to get better

probability estimates, a slightly more sophisticated technique is used. In linear regression, we have a

linear sum. In logistic regression, we have the same linear sum down here -- the same kind of linear

sum that we saw before -- but we embed it in this kind of formula. This is called a "logit transform".

A logit transform is multi-dimensional, with a lot of different a's here. If we've got just one

dimension, one variable, a1, then if this is the input to the logit transform, the output looks like this:

it's between 0 and 1. It's sort of an S-shaped curve that applies a softer function. Rather than just 0

and then a step function, it's a soft version of a step function that never gets below 0, never gets

above 1, and has a smooth transition in between. When you're working with a logit transform,

instead of minimizing the squared error (remember, when we do linear regression we minimize the

squared error), it's better to choose weights to maximize a probabilistic function called the "log-

likelihood function", which is this pretty scary looking formula down at the bottom.
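The one-dimensional logit transform described above is simple to write down. This is a sketch of the formula only -- Weka's Logistic classifier additionally fits the weights by maximizing the log-likelihood:

```python
import math

def logit_transform(x):
    """Map a linear sum x (any real number) into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# A linear regression output can be negative or greater than 1,
# but the transformed value is always a legitimate probability.
for linear_sum in (-5.0, -1.0, 0.0, 1.0, 5.0):
    p = logit_transform(linear_sum)
    assert 0.0 < p < 1.0
    print(linear_sum, round(p, 3))

# The S-curve passes through 0.5 when the linear sum is 0.
print(logit_transform(0.0))  # 0.5
```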

That's the basis of logistic regression. We won't talk about the details any more: let me just do it.

We're going to use the diabetes dataset. In the last lesson we got 76.8% with classification by

regression. Let me tell you if you do ZeroR, NaiveBayes, and J48, you get these numbers here. I'm

going to find the logistic regression scheme. It's in "functions", and called "Logistic". I'm going to use

10-fold cross-validation. I'm not going to output the predictions. I'll just run it -- and I get 77.2%

accuracy. That's the best figure in this column, though it's not much better than NaiveBayes, so you

might be a bit skeptical about whether it really is better. I did this 10 times and calculated the means

myself, and we get these figures for the mean of 10 runs. ZeroR stays the same, of course, at 65.1%;

it produces the same accuracy on each run. NaiveBayes and J48 are different, and here logistic

regression gets an average of 77.5%, which is appreciably better than the other figures in this

column. You can extend the idea to multiple classes. When we did this in the previous lesson, we

performed a regression for each class, a multi-response regression.

That actually doesn't work well with logistic regression, because you need the probabilities to sum to

1 over the various different classes. That introduces more computational complexity and needs to be

tackled as a joint optimization problem. The result is logistic regression, a popular and powerful

machine learning method that uses the logit transform to predict probabilities directly. It works

internally with probabilities, like NaiveBayes does. We also learned in this lesson about prediction

probabilities that can be obtained from other methods, and how to calculate probabilities from

ZeroR. You can read in the course text about logistic regression in Section 4.6. Now you should go

and do the activity associated with this lesson. See you soon.

https://www.youtube.com/watch?v=d8y3NpZKezI&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=26

Hello again. In most courses, there comes a point where things start to get a little tough. In the last

couple of lessons, you've seen some mathematics that you probably didn't want to see, and you

might have realized that you'll never completely understand how all these machine learning

methods work in detail. I want you to know that what I'm trying to convey is the gist of modern

machine learning methods, not the details. What's important is that you can use them and that you

understand a little bit of the principles behind how they work. And the math is almost finished. So

hang in there; things will start to get easier -- and anyway, there's not far to go: just a few more

lessons. I told you before that I play music. Someone came round to my house last night with a

contrabassoon. It's the deepest, lowest instrument in the orchestra. You don't often see or hear one.

So, here I am, trying to play a contrabassoon for the first time. I think this has got to be the lowest

point of our course, Data Mining with Weka! Today I want to talk about support vector machines,

another advanced machine learning technique. We looked at logistic regression in the last lesson,

and we found that these produce linear boundaries in the space.

In fact, here I've used Weka's Boundary Visualizer to show the boundary produced by a logistic

regression machine -- this is on the 2D Iris data, plotting petalwidth against petallength. This black

line is the boundary between these classes, the red class and the green class. It might be more

sensible, if we were going to put a boundary between these two classes, to try and drive it through

the widest channel between the two classes, the maximum separation from each class. Here's a

picture where the black line now is right down the middle of the channel between the two classes.

Actually, mathematically, we can find that line by taking the two critical members, one from each

class -- they're called support vectors; these are the critical points that define the channel -- and take

the perpendicular bisector of the line joining those two support vectors.
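That perpendicular-bisector construction can be sketched in a few lines. This is illustrative geometry only, with two made-up support vectors -- not Weka's SMO algorithm:

```python
# Two made-up support vectors, one from each class (2D points).
s_pos = (4.0, 3.0)   # the critical point of one class
s_neg = (1.0, 1.0)   # the critical point of the other

# The normal of the separating line is the vector joining them...
w = (s_pos[0] - s_neg[0], s_pos[1] - s_neg[1])

# ...and the line passes through their midpoint: the perpendicular bisector.
mid = ((s_pos[0] + s_neg[0]) / 2, (s_pos[1] + s_neg[1]) / 2)

def side(p):
    """Sign of w . (p - mid): which side of the boundary p falls on."""
    d = w[0] * (p[0] - mid[0]) + w[1] * (p[1] - mid[1])
    return 1 if d > 0 else -1

print(side(s_pos))  # 1   -- each support vector lies on its own side
print(side(s_neg))  # -1
```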

That's the idea of support vector machines. We're going to put a line between the two classes, but

not just any old line that separates them. We're trying to drive the widest channel between the two

classes. Here's another picture. We've got two clouds of points, and I've drawn a line around the

outside of each cloud -- the green cloud and the brown cloud. It's clear that any interior points aren't

going to affect this hyperplane, this plane, this separating line. I call it a line, but in multiple dimensions

it would be a plane, or a hyperplane in four or more dimensions. There are just a few of the points in

each cloud that define the position of the line: the support vectors. In this case, there are three

points. Support vectors define the boundary. The thing is that all the other instances in the training

data could be deleted without changing the position of the dividing hyperplane. There's a simple

equation and this is the last equation in this course. A simple equation that gives the formula for the

maximum margin hyperplane as a sum over the support vectors.

These are kind of a vector product with each of the support vectors, and the sum there. It's pretty

simple to calculate this maximum margin hyperplane once you've got the support vectors. It's a very

easy sum, and, like I say, it only depends on the support vectors. None of the other points play any

part in this calculation. Now in real life, you might not be able to drive a straight line between the

classes. Classes are called "linearly separable" if there exists a straight line that separates the two

classes. In this picture, the two classes are not linearly separable. It might be a little hard to see, but

there are some blue points on the green side of the line, and a couple of green points on the blue

side of the line. It's not possible to get a single straight line that divides these points. That makes

support vector machines -- the mathematics -- a little more complicated. But it's still possible to

define the maximum margin hyperplane under these conditions.

That's it: support vector machines. It's a linear decision boundary. Actually, there's a really clever

technique which allows you to get more complex boundaries. It's called the "Kernel trick". By using

different formulas for the "kernel" -- and in Weka you just select from some possible different

kernels -- you can get different shapes of boundaries, not just straight lines. Support vector

machines are fantastic because they're very resilient to overfitting. The boundary just depends on a

very small number of points in the dataset. So, it's not going to overfit the dataset, because it

doesn't depend on almost all of the points in the dataset, just a few of these critical points -- the

support vectors.

So, it's very resilient to overfitting, even with large numbers of attributes. In Weka, there are a

couple of implementations of support vector machines. We could look in the "functions" category

for "SMO". Let me have a look at that over here. If I look in "functions" for "SMO", that implements

an algorithm called "Sequential Minimal Optimization" for training a support vector classifier. There

are a few parameters here, including, for example, the different choices of kernel. You can choose

different kernels: you can play around and try out different things. There are a few other

parameters. Actually, the SMO algorithm is restricted to two classes, so this will only work with a 2-

class dataset. There are other, more comprehensive, implementations of support vector machines in

Weka. There's a library called "LibSVM", an external library, and Weka has an interface to this

library. This is a wrapper class for the LibSVM tools. You need to download these separately from

Weka and put them in the right Java classpath. You can see that there are a lot of different

parameters here, and, in fact, a lot of information on this support vector machine package. That's

support vector machines. You can read about them in Section 6.4 of the textbook if you like, and

please go and do the associated activity. See you soon for the last lesson in this class.

27 Ensemble learning

https://www.youtube.com/watch?v=WJZN4eatdeM&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=27

Hello again! We're up to the last lesson in the fourth class, Lesson 4.6 on Ensemble Learning. In real

life, when we have important decisions to make, we often choose to make them using a committee.

Having different experts sitting down together, with different perspectives on the problem, and

letting them vote, is often a very effective and robust way of making good decisions. The same is

true in machine learning. We can often improve predictive performance by having a bunch of

different machine learning methods, all producing classifiers for the same problem, and then letting

them vote when it comes to classifying an unknown test instance. One of the disadvantages is that

this produces output that is hard to analyze. There are actually approaches that try and produce a

single comprehensible structure, but we're not going to be looking at any of those. So the output will

be hard to analyze, but you often get very good performance. It's a fairly recent technique in

machine learning. We're going to look at four methods, called "bagging", "randomization",

"boosting", and "stacking". They're all implemented in Weka, of course. With bagging, we want to

produce several different decision structures.

Let's say we use J48 to produce decision trees, then we want to produce slightly different decision

trees. We can do that by having several different training sets of the same size. We can get those by

sampling the original training set. In fact, in bagging, you sample the set "with replacement", which

means that sometimes you might get two of the same [instances] chosen in your sample. We

produce several different training sets, and then we build a model for each one -- let's say a decision

tree -- using the same machine learning scheme, or using some other machine learning scheme.

Then we combine the predictions of the different models by voting, or if it's a regression situation

you would average the numeric result rather than voting on it.
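The bagging recipe just described can be sketched generically. The `learn` function here is a stand-in for J48 or any other unstable learner; the toy learner at the bottom simply memorizes the majority class of the sample it sees:

```python
import random
from collections import Counter

def bag(train, learn, n_models=10, seed=1):
    """Train n_models classifiers, each on a bootstrap sample of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Sample with replacement: same size as the original training set,
        # but some instances may appear twice and others not at all.
        sample = [rng.choice(train) for _ in train]
        models.append(learn(sample))
    return models

def predict(models, x):
    """Classify x by letting the models vote; the majority wins."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Toy "learner": a classifier that always predicts the majority class
# of the sample it was trained on (a stand-in for a real tree learner).
def learn(sample):
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

train = [(0, "no"), (1, "yes"), (2, "yes"), (3, "yes")]
models = bag(train, learn, n_models=11)
print(predict(models, 99))  # the vote over the 11 bagged models
```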

This is very suitable for learning schemes that are called "unstable". Unstable learning schemes are

ones where a small change in the training data can make a big change in the model. Decision trees

are a really good example of this. You can get a decision tree and just make a tiny little change in the

training data and get a completely different kind of decision tree. Whereas with NaiveBayes, if you

think about how NaiveBayes works, little changes in the training set aren't going to make much

difference to the result of NaiveBayes, so that's a "stable" machine learning method. In Weka we

have a "Bagging" classifier in the meta set. I'm going to choose meta > Bagging: here it is. We can

choose here the bag size -- this is saying a bag size of 100%, which is going to sample the training set

to get another set the same size, but it's going to sample "with replacement".

That means we're going to get different sets of the same size every time we sample, but each set

might contain repeats of the original training [instances]. Here we choose which classifier we want

to bag, and we can choose the number of bagging iterations here, and a random-number seed.

That's the bagging method. The next one I want to talk about is "random forests". Here, instead of

randomizing the training data, we randomize the algorithm. How you randomize the algorithm

depends on what the algorithm is. Random forests are when you're using decision tree algorithms.

Remember when we talked about how J48 works? -- it selects the best attribute for splitting on each

time. You can randomize this procedure by not necessarily selecting the very best, but choosing a

few of the best options, and randomly picking amongst them.

That gives you different trees every time. Generally, if you bag decision trees, if you randomize them

and bag the result, you get better performance. In Weka, we can look under "tree" classifiers for

RandomForest. Again, that's got a bunch of parameters. The maximum depth of the trees produced -

- I think 0 would be unlimited depth. The number of features we're going to use. We might select,

say 4 features; we would select from the top 4 features -- every time we decide on the decision to

put in the tree, we select that from among the top 4 candidates. The number of trees we're going to

produce, and so on. That's random forests. Here's another kind of algorithm: it's called "boosting".

It's iterative: new models are influenced by the performance of previously built models.

Basically, the idea is that you create a model, and then you look at the instances that are

misclassified by that model. These are the hard instances to classify, the ones it gets wrong. You put

extra weight on those instances to make a training set for producing the next model in the iteration.

This encourages the new model to become an "expert" for instances that were misclassified by all

the earlier models. The intuitive justification for this is that in a real life committee, committee

members should complement each other's expertise by focusing on different aspects of the

problem. In the end, to combine them we use voting, but we actually weight models according to

their performance. There's a very good scheme called AdaBoostM1, which is in Weka and is a

standard and very good boosting implementation -- it often produces excellent results.
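The reweighting step just described can be sketched as a single round. This shows the standard AdaBoost.M1 update -- shrink the weights of correctly classified instances by e/(1-e), renormalize, and weight the model by log((1-e)/e) -- whereas Weka's AdaBoostM1 does rather more bookkeeping:

```python
import math

def reweight(weights, correct):
    """One AdaBoost.M1 round: shrink the weights of correctly classified
    instances by e/(1-e), renormalise, and return the model's vote weight."""
    e = sum(w for w, ok in zip(weights, correct) if not ok)  # weighted error
    factor = e / (1.0 - e)   # assumes 0 < e < 0.5, as AdaBoost.M1 requires
    new = [w * factor if ok else w for w, ok in zip(weights, correct)]
    total = sum(new)
    alpha = math.log(1.0 / factor)   # model weight: log((1-e)/e)
    return [w / total for w in new], alpha

# Four instances with equal weights; the model got the last one wrong.
weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]
new_weights, alpha = reweight(weights, correct)
print([round(w, 3) for w in new_weights])  # [0.167, 0.167, 0.167, 0.5]
print(round(alpha, 3))                     # 1.099
```

The misclassified instance ends up carrying half the total weight, so the next model is pushed to get it right.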

There are a few parameters to this as well, particularly the number of iterations. The final ensemble

learning method is called "stacking". Here we're going to have base learners, just like the learners

we talked about previously. We're going to combine them not with voting, but by using a meta-

learner, another learner scheme that combines the output of the base learners. We're going to call

the base learners level-0 models, and the meta-learner is a level-1 model. The predictions of the

base learners are input to the meta-learner. Typically you use different machine learning schemes as

the base learners to get different experts that are good at different things. You need to be a little bit

careful in the way you generate data to train the level-1 model: this involves quite a lot of cross-

validation, I won't go into that. In Weka, there's a meta classifier called "Stacking", as well as

"StackingC" -- which is a more efficient version of Stacking.

Here is Stacking; you can choose different meta-classifiers here, and the number of stacking folds.

We can choose different classifiers; different level-0 classifiers, and a different meta-classifier. In

order to create multiple level-0 models, you need to specify a meta-classifier as the level-0 model. It

gets a little bit complicated; you need to fiddle around with Weka to get that working. That's it then.

We've been talking about combining multiple models into ensembles to produce an ensemble for

learning, and the analogy is with committees of humans. Diversity helps, especially when learners

are unstable. And we can create diversity in different ways. In bagging, we create diversity by

resampling the training set. In random forests, we create diversity by choosing alternative branches

to put in our decision trees. In boosting, we create diversity by focusing on where the existing model

makes errors; and in stacking, we combine results from a bunch of different kinds of learner using

another learner, instead of just voting. There's a chapter in the course text on Ensemble learning --

it's quite a large topic, really. There's an activity that you should go and do before we proceed to the

next class, the last class in this course. We'll learn about putting it all together, taking a more global

view of the machine learning process. We'll see you then.

28 Class 4 Questions

https://www.youtube.com/watch?v=50BPP-NMieM&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=28

Hi! Well, it's summertime here in New Zealand. Summer's just arrived, and, as you can see, I'm

sitting outside for a change of venue. This is Class 5 of the MOOC -- the last class! Here are a few

comments on Class 4, some issues that came up. We had a couple of errors in the activities; we

corrected those pretty quickly. Some of the activities are getting harder -- you will have noticed that!

But I think if you're doing the activities you'll be learning a lot. You learn a lot through doing the

activities, so keep it up! And the Class 5 activities are much easier. There was a question about

converting nominal variables to numeric in Activity 4.2. Someone said the result of the supervised

nominal binary filter was weird. Yes, well, it is a little bit weird. If you click the "More" button for

that filter, it says that k-1 new binary attributes are generated in the manner described in this book

(if you can get hold of it). Let me just tell you a little bit more about this. I've come up with an

example of a nominal attribute called "fruit", and it has 3 values: orange, apple, and banana. In this

dataset, the class is "juicy"; it's a numeric measure of juiciness.

I don't know about where you live, but in New Zealand oranges are juicier than apples, and apples

are juicier than bananas. I'm assuming that in this dataset, if you average the juiciness of all the

instances where the fruit attribute equals orange you get a larger value than if you do this with all

the instances where the fruit attribute equals apple, and that's larger than for banana. That sort of

orders these values. Let's consider ways of making "fruit" into a set of binary attributes. The simplest

method, and the one that's used by the unsupervised conversion filter, is Method 1 here. We create

3 new binary attributes; I've just called them "fruit=orange", "fruit=apple", and "fruit=banana". The

first attribute value is 1 if it's an orange and 0 otherwise.

The second attribute, "fruit=apple", is 1 if it's an apple and 0 otherwise, and the same for banana. Of

course, of these three binary attributes, exactly one of them has to be "1" for any instance. Here's

another way of doing it, Method 2. We take each possible subset: as well as "orange", "apple" and

"banana", we have another binary variable for "orange_or_apple", another for "orange_or_banana",

and another for "apple_or_banana". For example, if the value of fruit was "orange", then the first

attribute ("fruit=orange") would be 1, the fourth attribute ("orange_or_apple") would be 1, and the

fifth attribute ("orange_or_banana") would be 1. All of the others would be 0. This effectively

creates a binary attribute for each subset of possible values of the "fruit" attribute.

Actually, we don't create one for the empty subset or the full subset (with all 3 of the values in). We

get 2^k - 2 binary attributes for a k-valued attribute. That's impractical in general, because 2^k grows very fast

as k grows. The third method is the one that is actually used, and this is the one that's described in

that book. We create 2 new attributes (k-1, in general, for a k-valued attribute):

"fruit=orange_or_apple" and "fruit=apple". For oranges, the first attribute is 1 and the second is 0;

for apples, they're both 1; and for bananas, they're both 0. That's assuming this ordering of class

values: orange is largest in juiciness, and banana is smallest in juiciness. There's a theorem that, if

you're making a decision tree, the best way of splitting a node for a nominal variable with k values is

one of the k-1 positions -- well, you can read this. In fact, this theorem is reflected in Method 3.

That is the best way of splitting these attribute values. Whether it's a good thing in practice or not,

well, I don't know. You should try it and see. Perhaps you can try Method 3 for the supervised

conversion filter and Method 1 for the unsupervised conversion filter and see which produces the

best results on your dataset. Weka doesn't implement Method 2, because the number of attributes

explodes with the number of possible values, and you could end up with some very large datasets.
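The three conversion methods can be compared side by side on the "fruit" example. These little encoders are illustrative sketches, not Weka's filter code:

```python
from itertools import combinations

values = ["orange", "apple", "banana"]  # ordered by average juiciness

# Method 1 (unsupervised filter): one indicator attribute per value.
def method1(v):
    return [int(v == x) for x in values]

# Method 2: one indicator per non-empty, non-full subset of values,
# giving 2^k - 2 attributes -- this is the one that explodes.
subsets = [s for r in range(1, len(values)) for s in combinations(values, r)]
def method2(v):
    return [int(v in s) for s in subsets]

# Method 3 (supervised filter): k-1 attributes derived from the class
# ordering -- here "fruit=orange_or_apple" and "fruit=apple", as in the text.
def method3(v):
    return [int(v in ("orange", "apple")), int(v == "apple")]

for v in values:
    print(v, method1(v), method2(v), method3(v))
```

For oranges Method 3 gives [1, 0], for apples [1, 1], and for bananas [0, 0], matching the table described above.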

The next question is about simulating multiresponse linear regression: "Please explain!" Well, we're

looking at a Weka screen like this. We're running linear regression on the iris dataset where we've

mapped the values so that the class for any Virginica instance is 1 and 0 for the others. We've done it

with this kind of configuration. This is the default configuration of the makeIndicator filter. It's

working on the last attribute -- that's the class.

In this case, the value index is last, which means we're looking at the last value, which, in fact, is

Virginica. We could put a number here to get the first, second, or third values. That's how we get the

dataset, and then we run linear regression on this to get a linear model. Now, I want to look at the

output for the first 4 instances. We've got an actual class of 1, 1, 0, 0 and the predicted value of

these numbers. I've written those down in this little table over here: 1, 1, 0, 0 and these numbers.

That for the dataset where all of the Virginicas are mapped to 1 and the other irises are mapped to

0. When we do the corresponding mapping with Versicolors, we get this as the actual class -- we just

run Weka and look at what appeared on the screen -- and this is the predicted value. We get these

for Setosa. So, you can see that the first instance is actually a Virginica - 1, 0, 0. I've put in bold the

largest of these 3 numbers. This is the largest, 0.966, which is bigger than 0.117 and -0.065, so

multiresponse linear regression is going to predict Virginica for instance 1. It's got the largest value.
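The selection rule at work here is simply an argmax over the three per-class regression outputs. A sketch using the predicted values quoted above for instance 1:

```python
# Predicted values of the three per-class regression models for instance 1
# (the numbers quoted above).
predictions = {"Virginica": 0.966, "Versicolor": 0.117, "Setosa": -0.065}

# Multiresponse linear regression predicts the class whose model output
# is largest -- the outputs needn't be probabilities, just comparable.
predicted_class = max(predictions, key=predictions.get)
print(predicted_class)  # Virginica
```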

And that's correct. For the second instance, it's also a Virginica, and it's also the largest of the 3

values in its row. For the third instance, it's actually a Versicolor.

The actual output is 1 for the Versicolor model, but the largest prediction is still for the Virginica

model. It's going to predict Virginica for an iris that's actually Versicolor. That's going to be a mistake.

In the [fourth] case, it's actually a Setosa -- the actual column is 1 for Setosa -- and this is the largest

value in the row, so it's going to correctly predict Setosa. That's how multiresponse linear regression

works. "How does OneR use the rules it generates? Please explain!" Well, here's the rule generated

by OneR. It hinges on attribute 6. Of course, if you click the "Edit" button in the Preprocess panel,

you can see the value of this attribute for each instance. This is what we see in the Explorer when we

run OneR. You can see the predicted instances here -- g, b, g, b, g,

g, etc. The question is, how does it get these predictions? This is the value

of attribute 6 for instance 1. What the OneR code does is go through each of these conditions and

looks to see if it's satisfied. Is 0.02 less than -0.2? -- no, it's not. Is it less than -0.01? -- no, it's not.

Is it less than 0.001? -- no, it's not. (It's surprisingly hard to get these right, especially when you've

got all of the other decimal places in the list here.) Is it less than 0.1? -- yes, it is. So rule 4 fires -- this

is rule 4 -- and predicts "g". I've written down here the number of the rule clause that fires. In this

case, for instance 2, the value of the attribute is -0.4, and that satisfies the first rule. So this satisfies

number 1, and we predict "b". And so on down the list. That's what OneR does. It goes through the

rule evaluating each of these clauses until it finds one that is true, and then it uses the corresponding

prediction as its output. Moving on to ensemble learning questions. There were some questions on

ensemble learning, about these ten OneR models. "Are these ten alternative ways of classifying the

data?" Well, in a sense, but they are used together: AdaBoost.M1 combines them. In practice you

don't just pick one of them and use that: AdaBoost combines these models inside itself -- the

predictions it prints are produced by its combined model.

The weights are used in the combination to decide how much weight to give each of these models.

And when Weka reports a certain accuracy, that's for the combined model. It's not the average; it's

not the best; it's combined in the way that AdaBoost combines them. That's all done internally in the

algorithm. I didn't really explain the details of how the algorithm works; you'll have to look that up, I

guess. The point is AdaBoostM1 combines these models for you. You don't have to think of them as

separate models. They're all combined by AdaBoostM1. Someone complained that we're supposed

to be looking for simplicity, and this seems pretty complicated. That's true. The real disadvantage of

these kinds of models, ensemble models, is that it's hard to look at the rules. It's hard to see inside

to see what they're doing. Perhaps you should be a bit wary of that. But they can produce very good

results. You know how to test machine learning methods reliably using cross-validation or whatever.

So, sometimes they're good to use.

"How does Weka make predictions? How can you use Weka to make predictions?" You can use the

"Supplied test set" option on the Classify panel to put in a test set and see the predictions on that.

Or, alternatively, there is a program -- if you can run Java programs -- there's a program here. This is

how you run it: "java weka.classifiers.trees.J48" with your ARFF data file, and you put question

marks there to indicate the class. Then you give it the model, which you've output from the Explorer.

You can look at how to do this on the Weka Wiki on the FAQ list: "using Weka to make predictions".

Can you bootstrap learning? Someone talked about some friends of his who were using training data

to train a classifier and using the results of the classification to create further training data, and

continuing the cycle -- kind of bootstrapping. That sounds very attractive, but it can also be unstable.

It might work, but I think you'd be pretty lucky for it to work well. It's a potentially rather unreliable

way of doing things -- believing the classifications on new data and using that to further train the

classifier. He also said these friends of his don't really look into the classification algorithm. I guess

I'm trying to tell you a little bit about how each classification algorithm works, because I think it

really does help to know that.

You should be looking inside and thinking about what's going on inside your data mining method. A

couple of suggestions of things not covered in this MOOC: FilteredClassifier and association rules,

the Apriori association rule learner. As I said before, maybe we'll produce a follow-up MOOC and

include topics like this in it. That's it for now. Class 5 is the last class. It's a short class. Go ahead and

do it. Please complete the assessments and finish off the course. It'll be open this week, and it'll

remain open for one further week if you're getting behind. But after that, it'll be closed. So, you

need to get on with it.

CLASS 5

29 The data mining process

https://www.youtube.com/watch?v=Jtu_4qN4bPk&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=29

Hello again! This is the last class of Data Mining with Weka, and we're going to step back a little bit

and take a look at some more global issues with regard to the data mining process. It's a short class

with just four lessons: the data mining process, pitfalls and pratfalls, data mining and ethics, and

finally, a quick summary. Let's get on with Lesson 5.1. This might be your vision of the data mining

process. You've got some data or someone gives you some data. You've got Weka. You apply Weka

to the data, you get some kind of cool result from that, and everyone's happy. If so, I've got bad

news for you. It's not going to be like that at all. Really, this would be a better way to think about it.

You're going to have a circle; you're going to go round and round the circle. It's true that Weka is

important -- it's in the very middle of the circle here. It's going to be crucial, but it's only a small part

of what you have to do. Perhaps the biggest problem is going to be to ask the right kind of question.

You need to be answering a question, not just vaguely exploring a collection of data. Then, you need

to get together the data that you can get hold of that gives you a chance of answering this question

using data mining techniques. It's hard to collect the data.

You're probably going to have an initial dataset, but you might need to add some demographic data,

or some weather data, or some data about other stuff. You're going to have to go to the web and

find more information to augment your dataset. Then you'll merge all that together: do some

database hacking to get a dataset that contains all the attributes that you think you might need -- or

that you think Weka might need. Then you're going to have to clean the data. The bad news is that

real world data is always very messy. That's a long and painstaking process of looking around,

looking at the data, trying to understand it, trying to figure out what the anomalies are and whether

it's good to delete them or not. That's going to take a while. Then you're going to need to define

some new features, probably. This is the feature engineering process, and it's the key to successful

data mining. Then, finally, you're going to use Weka, of course. You might go around this circle a few

times to get a nice algorithm for classification, and then you're going to need to deploy the algorithm

in the real world. Each of these processes is difficult.

You need to think about the question that you want to answer. "Tell me something cool about this

data" is not a good enough question. You need to know what you want to know from the data. Then

you need to gather it. There's a lot of data around, like I said at the very beginning, but the trouble is

that we need classified data to use classification techniques in data mining. We need expert

judgements on the data, expert classifications, and there's not so much data around that includes

expert classifications, or correct results. They say that more data beats a clever algorithm. So rather

than spending time trying to optimize the exact algorithm you're going to use in Weka, your time might be

better employed getting more and more data. Then you've got to clean it, and like I said

before, real data is very mucky. That's going to be a painstaking matter of looking through it and

looking for anomalies. Feature engineering, the next step, is the key to data mining. We'll talk about

how Weka can help you a little bit in a minute. Then you've got to deploy the result. Implementing it

-- well, that's the easy part. The difficult part is to convince your boss to use this result from this data

mining process that he probably finds very mysterious and perhaps doesn't trust very much. Getting

anything actually deployed in the real world is a pretty tough call.

The key technical part of all this is feature engineering, and Weka has a lot of filters that will help

with this. Here are just a few of them. It might be worthwhile defining a new feature, a new

attribute that's a mathematical expression involving existing attributes. Or you might want to modify

an existing attribute. With AddExpression, you can use any kind of mathematical formula to create a

new attribute from existing ones. You might want to normalize or center your data, or standardize it

statistically. Transform a numeric attribute to have a zero mean -- that's "center". Or transform it to

a given numeric range -- that's "normalize". Or give it a zero mean and unit variance, that's a

statistical operation called "standardization". You might want to take those numeric attributes and

discretize them into nominal values.
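Those three transformations are easy to confuse, so here is a small stdlib-only Python sketch of what each one does. The attribute values are invented for illustration; in Weka itself you'd use the corresponding unsupervised attribute filters rather than writing this by hand:

```python
# Sketch of the three scaling operations described above.
# The sample attribute values are made up for illustration.
from statistics import mean, pstdev

values = [2.0, 4.0, 6.0, 8.0]

# "Center": subtract the mean, so the result has zero mean.
m = mean(values)
centered = [v - m for v in values]

# "Normalize": rescale linearly into a given range, here [0, 1].
lo, hi = min(values), max(values)
normalized = [(v - lo) / (hi - lo) for v in values]

# "Standardize": zero mean and unit variance (statistical standardization).
sd = pstdev(values)
standardized = [(v - m) / sd for v in values]

print(centered)      # zero-mean version
print(normalized)    # values squeezed into [0, 1]
print(standardized)  # zero mean, unit variance
```

Centering only shifts the values; normalizing fixes their range; standardizing fixes both the mean and the spread.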

Weka has both supervised and unsupervised attribute discretization filters. There are a lot of other

transformations. For example, the PrincipalComponents transformation involves a matrix analysis of

the data to select the principal components in a linear space. That's mathematical, and Weka

contains a good implementation. RemoveUseless will remove attributes that don't vary at all, or vary

too much. Actually, I think we encountered that in one of our activities. Then, there are a couple of

filters that help you deal with time series, when your instances represent a series over time. You

probably want to take the difference between one instance and the next, or a difference with some

kind of lag -- one instance and the one 5 before it, or 10 before it. These are just a few of the filters

that Weka contains to help you with your feature engineering. The message of this lesson is that

Weka is only a small part of the entire data mining process, and it's the easiest part. In this course,

we've chosen to tell you about the easiest part of the process! I'm sorry about that. The other bits

are, in practice, much more difficult. There's an old programmer's blessing: "May all your problems

be technical ones". It's the other problems -- the political problems in getting hold of the data, and

deploying the result -- those are the ones that tend to be much more onerous in the overall data

mining process. So good luck! There's some stuff about this in the course text. Section 1.3 contains

information on Fielded Applications, all of which have gone through this kind of process in order to

get them out there and used in the field. There's an activity associated with this lesson. Off you go

and do it, and we'll see you in the next lesson.

30 Pitfalls and pratfalls

https://www.youtube.com/watch?v=36ci5kLsm44&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=30

Hi! Welcome back for another few minutes in New Zealand. In the last lesson, Lesson 5.1, we

learned that Weka only helps you with a small part of the overall data mining process, the technical

part, which is perhaps the easy part. In this lesson, we're going to learn that there are many pitfalls

and pratfalls even in that part. Let me just define these for you. A "pitfall" is a hidden or unsuspected

danger or difficulty, and there are plenty of those in the field of machine learning. A "pratfall" is a

stupid and humiliating action, which is very easy to do when you're working with data. The first

lesson is that you should be skeptical. In data mining it's very easy to cheat. Whether you're cheating

consciously or unconsciously, it's easy to mislead yourself or mislead others about the significance of

your results. For a reliable test, you should use a completely fresh sample of data that has never

been seen before. You should save something for the very end, that you don't use until you've

selected your algorithm, decided how you're going to apply it, and the filters, and so on.

At the very, very end, having done all that, run it on some fresh data to get an estimate of how it will

perform. Don't be tempted to then change it to improve it so that you get better results on that

data. Always do your final run on fresh data. We've talked a lot about overfitting, and this is basically

the same kind of problem. Of course, you know not to test on the training set. We've talked about

that endlessly throughout this course. Data that's been used for development in any way is tainted.

Any time you use some data to help you make a choice of the filter, or the classifier, or how you're

going to treat your problem, then that data is tainted. You should be using completely fresh data to

get evaluation results. Leave some evaluation data aside for the very end of the process. That's the

first piece of advice. Another thing I haven't told you about in this course so far is missing values. In

real datasets, it's very common that some of the data values are missing.

They haven't been recorded. They might be unknown; we might have forgotten to record them; they

might be irrelevant. There are two basic strategies for dealing with missing values in a dataset. You

can omit instances where the attribute value is missing, or somehow find a way of omitting that

particular attribute in that instance. Or you can treat missing as a separate possible value. You need

to ask yourself, is there significance in the fact that a value is missing? They say that if you've got

something wrong with you and go to the doctor, and he does some tests on you: if you just record

the tests that he does -- not the results of the test, but just the ones he chooses to do -- there's a

very good chance that you can work out what's wrong with you just from the existence of the tests,

not from their results. That's because the doctor chooses tests intelligently. The fact that he doesn't

choose a test doesn't mean that that value is missing, or accidentally not there. There's huge

significance in the fact that he's chosen not to do certain tests. This is a situation where "missing"

should be treated as a separate possible value.
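The two strategies can be sketched in a few lines of plain Python. The toy patient records below are invented; None plays the role of ARFF's question mark:

```python
# Two ways of handling a missing value, on an invented toy dataset.
# None marks a value that was not recorded.
records = [
    {"test_A": 7.1, "outcome": "sick"},
    {"test_A": None, "outcome": "healthy"},
    {"test_A": 6.4, "outcome": "sick"},
    {"test_A": None, "outcome": "healthy"},
]

# Strategy 1: omit instances where the attribute value is missing.
complete = [r for r in records if r["test_A"] is not None]

# Strategy 2: treat "missing" as a separate possible value --
# appropriate when the absence itself carries information,
# as in the doctor example above.
def code_value(v):
    return "missing" if v is None else "present"

coded = [code_value(r["test_A"]) for r in records]

print(len(complete))  # 2 instances survive omission
print(coded)          # ['present', 'missing', 'present', 'missing']
```

Notice that in this made-up data the missing values line up exactly with the "healthy" outcomes, so strategy 2 preserves a signal that strategy 1 throws away.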

There's significance in the fact that a value is missing. But in other situations, a value might be

missing simply because a piece of equipment malfunctioned, or for some other reason -- maybe

someone forgot something. Then there's no significance in the fact that it's missing. Pretty well all

machine learning algorithms deal with missing values. In an ARFF file, if you put a question mark as a

data value, that's treated as a missing value. All methods in Weka can deal with missing values. But

they make different assumptions about them. If you don't appreciate this, it's easy to get misled. Let

me just take two simple and well known (to us) examples -- OneR and J48. They deal with missing

values in different ways. I'm going to load the nominal weather data and run OneR on it: I get 43%.

Let me run J48 on it, to get 50%. I'm going to edit this dataset by changing the value of "outlook" for

the first four "no" instances to "missing". That's how we do it here in this editor. If we were to write

this file out in ARFF format, we'd find that these values are written into the file as question marks.

Now, if we look at "outlook", you can see that it says here there are 4 missing values.

If you count up these labels -- 2, 4, and 4 -- plus another 4 that are missing, that makes 14. Let's go back to

J48 and run it again. We still get 50%, the same result. Of course, this is a tiny dataset, but the fact is

that the results here are not affected by the fact that a few of the values are missing. However, if we

run OneR, I get a much higher accuracy, a 93% accuracy. The rule that I've got is "branch on

outlook", which is what we had before I think. Here it says there are 4 possibilities: if it's sunny, it's a

yes; if it's overcast it's a yes; if it's rainy, it's a yes; and if it's missing, it's a no. Here, OneR is using the

fact that a value is missing as significant, as something you can branch on. Whereas if you were to

look at a J48 tree, it would never have a branch that corresponded to a missing value. It treats them

differently. That's very important to know and remember. The final thing I want to tell you about in this

lesson is the "no free lunch" theorem. There's no free lunch in data mining. Here's a way to illustrate

it. Suppose you've got a 2-class problem with 100 binary attributes. Let's say you've got a huge training set with a million

instances and their classifications in the training set.

The number of possible instances is 2^100. And you know 10^6 of them. So you don't

know the classes of 2^100 - 10^6 examples. Let me tell you that 2^100 - 10^6 is 99.999...% of 2^100.
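It's worth checking that arithmetic. A quick Python computation, using exact fractions because ordinary floats round the unseen proportion to exactly 1.0:

```python
# The "no free lunch" arithmetic: 100 binary attributes give 2**100
# possible instances, and a million labelled examples covers a
# vanishing fraction of that space.
from fractions import Fraction

total = 2 ** 100   # distinct possible instances
known = 10 ** 6    # labelled training instances
covered = Fraction(known, total)

print(float(covered))      # roughly 8e-25 of the space is labelled
print(float(1 - covered))  # the unseen fraction rounds to 1.0 as a float
```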

There's this huge number of examples that you just don't know the classes of. How could you

possibly figure them out? If you apply a data mining scheme to this, it will figure them out, but how

could you possibly figure out all of those things just from the tiny amount of data that you've been

given. In order to generalize, every learner must embody some knowledge or assumptions beyond

the data it's given. Each learning algorithm implicitly provides a set of assumptions. The best way to

think about those assumptions is to think back to the Boundary Visualizer we looked at in Lesson 4.1.

You saw that different machine learning schemes are capable of drawing different kinds of

boundaries in instance space. These boundaries correspond to a set of assumptions about the sort of

decisions we can make. There's no universal best algorithm; there's no free lunch. There's no single

best algorithm. Data mining is an experimental science, and that's why we've been teaching you how

to experiment with data mining yourself.

This is just a summary. Be skeptical: when people tell you about data mining results and they say

that it gets this kind of accuracy, then to be sure about that you want to have them test their

classifier on your new, fresh data that they've never seen before. Overfitting has many faces.

Different learning schemes make different assumptions about missing values, which can really

change the results. There is no universal best learning algorithm. Data mining is an experimental

science, and it's very easy to be misled by people quoting the results of data mining experiments.

That's it for now. Off you go and do the activity. We'll see you in the next lesson.

31 Data mining and ethics

https://www.youtube.com/watch?v=Rv5mBqaXO30&index=31&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7

Hi! Welcome to Lesson 5.3 of Data Mining with Weka. Before we start, I thought I'd show you where

I live. I told you before that I moved to New Zealand many years ago. I live in a place called Hamilton.

Let me just zoom in and see if we can find Hamilton in the North Island of New Zealand, around the

center of the North Island. This is where the University of Waikato is. Here is the university; this is

where I live. This is my journey to work: I cycle every morning through the countryside. As you can

see, it's really nice. I live out here in the country. I'm a sheep farmer! I've got four sheep, three in the

paddock and one in the freezer. I cycle in -- it takes about half an hour -- and I get to the university. I

have the distinction of being able to go from one week to the next without ever seeing a traffic light,

because I live out on the same edge of town as the university. When I get to the campus of the

University of Waikato, it's a very beautiful campus. We've got three lakes. There are two of the lakes,

and another lake down here. It's a really nice place to work!

So I'm very happy here. Let's move on to talk about data mining and ethics. In Europe, they have a

lot of pretty stringent laws about information privacy. For example, if you're going to collect any

personal information about anyone, a purpose must be stated. The information should not be

disclosed to others without consent. Records kept on individuals must be accurate and up to date.

People should be able to review data about themselves. Data should be deleted when it's no longer

needed. Personal information must not be transmitted to other locations. Some data is too sensitive

to be collected, except in extreme circumstances. This is true in some countries in Europe,

particularly Scandinavia. It's not true, of course, in the United States. Data mining is about collecting

and utilizing recorded information, and it's good to be aware of some of these ethical issues. People

often try to anonymize data so that it's safe to distribute for other people to work on, but

anonymization is much harder than you think. Here's a little story for you.

When Massachusetts released medical records summarizing every state employee's hospital record

in the mid-1990's, the Governor gave a public assurance that it had been anonymized by removing

all identifying information -- name, address, and social security number. He was surprised to receive

his own health records (which included a lot of private information) in the mail shortly afterwards!

People could be re-identified from the information that was left there. There's been quite a bit of

research done on re-identification techniques. For example, using publicly available records on the

internet, 50% of Americans can be identified from their city, birth date, and sex -- more still if you know their zip code as well.

There was some interesting work done on a movie database. Netflix released a database of 100

million records of movie ratings. They got individuals to rate movies on a scale of 1-5, and they had

a whole bunch of people doing this -- a total of 100 million records.

It turned out that you could identify 99% of people in the database if you knew their ratings for 6

movies and approximately when they saw them. Even if you only know their ratings for 2 movies,

you can identify 70% of people. This means you can use the database to find out the other movies

that these people watched. They might not want you to know that. Re-identification is remarkably

powerful, and it is incredibly hard to anonymize data effectively in a way that doesn't destroy the

value of the entire dataset for data mining purposes. Of course, the purpose of data mining is to

discriminate: that's what we're trying to do! We're trying to learn rules that discriminate one class

from another in the data -- who gets the loan? -- who gets a special offer? But, of course, certain

kinds of discrimination are unethical, not to mention illegal. For example, racial, sexual, and religious

discrimination is certainly unethical, and in most places illegal.

But it depends on the context. Sexual discrimination is usually illegal ... except for doctors. Doctors

are expected to take gender into account when they make their diagnoses. They don't

want to tell a man that he is pregnant, for example. Also, information that appears innocuous may

not be. For example, area codes -- zip codes in the US -- correlate strongly with race; membership of

certain organizations correlates with gender. So although you might have removed the explicit racial

and gender information from your database, it still might be able to be inferred from other

information that's there. It's very hard to deal with data: it has a way of revealing secrets about itself

in unintended ways. Another ethical issue concerning data mining is that correlation does not imply

causation. Here's a classic example: as ice cream sales increase, so does the rate of drownings.

Therefore, ice cream consumption causes drowning?

Probably not. They're probably both caused by warmer temperatures -- people going to beaches.

What data mining reveals is simply correlations, not causation. Really, we want causation. We want

to be able to predict the effects of our actions, but all we can look at using data mining techniques is

correlation. To understand about causation, you need a deeper model of what's going on. I just

wanted to alert you to some of the issues, some of the ethical issues, in data mining, before you go

away and use what you've learned in this course on your own datasets: issues about the privacy of

personal information; the fact that anonymization is harder than you think; re-identification of

individuals from supposedly anonymized data is easier than you think; data mining and

discrimination -- it is, after all, about discrimination; and the fact that correlation does not imply

causation. There's a section in the textbook, Data mining and ethics, which you can read for more

background information, and there's a little activity associated with this lesson, which you should go

and do now. I'll see you in the next lesson, which is the last lesson of the course.

32 Summary

https://www.youtube.com/watch?v=W3hE5qrTsuM&list=PLzVF1nAqI9VmC96TbvOPMkXToSmBMHJn7&index=32

Hi! This is the last lesson in the course Data mining with Weka, Lesson 5.4 - Summary. We'll just have

a quick summary of what we've learned here. One of the main points I've been trying to convey is

that there's no magic in data mining. There's a huge array of alternative techniques, and they're all

fairly straightforward algorithms. We've seen the principles of many of them. Perhaps we don't

understand the details, but we've got the basic idea of the main methods of machine learning used

in data mining. And there is no single, universal best method. Data mining is an experimental

science. You need to find out what works best on your problem. Weka makes it easy for you. Using

Weka you can try out different methods, you can try out different filters, different learning methods.

You can play around with different datasets.

It's very easy to do experiments in Weka. Perhaps you might say it's too easy, because it's important

to understand what you're doing, not just blindly click around and look at the results. That's what

I've tried to emphasize in this course -- understanding and evaluating what you're doing. There are

many pitfalls you can fall into if you don't really understand what's going on behind the scenes. It's

not a matter of just blindly applying the tools in the workbench. We've stressed in the course the

focus on evaluation, evaluating what you're doing, and the significance of the results of the

evaluation. Different algorithms differ in performance, as we've seen. In many problems, it's not a

big deal.

The differences between the algorithms are really not very important in many situations, and you

should perhaps be spending more time on looking at the features and how the problem is described

and the operational context that you're working in, rather than stressing about getting the absolute

best algorithm. It might not make all that much difference in practice. Use your time wisely. There's

a lot of stuff that we've missed out. I'm really sorry I haven't been able to cover more of this stuff.

There's a whole technology of filtered classifiers, where you want to filter the training data, but not

the test data. That's especially true when you've got a supervised filter, where the results of the

filter depend on the class values of the training instances. You want to filter the training data, but

not the test data, or maybe take a filter designed for the training data and apply the same filter to

the test data without re-optimizing it for the test data, which would be cheating. You often want to

do this during cross-validation.
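Outside Weka, the same discipline applies in any toolkit: fit the filter's parameters on the training fold only, then apply those frozen parameters to the test fold. A minimal stdlib-only sketch, with an invented dataset and a standardizing "filter":

```python
# Sketch of filtering the training data without re-fitting on the test
# data -- the idea behind Weka's FilteredClassifier. The dataset and
# the train/test split are invented for illustration.
from statistics import mean, pstdev

train = [3.0, 5.0, 7.0, 9.0]
test = [4.0, 10.0]

# Fit the "filter" on the training fold only.
m, sd = mean(train), pstdev(train)

def apply_filter(xs):
    # Apply the parameters learned from the training data, unchanged.
    return [(x - m) / sd for x in xs]

train_f = apply_filter(train)
test_f = apply_filter(test)   # no peeking at the test fold's statistics

print(train_f)
print(test_f)
```

Re-computing the mean and standard deviation on the test fold would leak information from the test data into the preparation step -- exactly the kind of cheating described above.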

The trouble in Weka is that you can't get hold of those cross-validation folds; it's all done internally.

Filtered classifiers are a simple way of dealing with this problem. We haven't talked about costs of

different decisions and different kinds of errors, but in real life different errors have different costs.

We've talked about optimizing the error rate, or the classification accuracy, but really, in most

situations, we should be talking about costs, not raw accuracy figures, and these are different things.

There's a whole panel in the Weka Explorer for attribute selection, which helps you select a subset of

attributes to use when learning, and in many situations it's really valuable, before you do any

learning, to select an appropriate small subset of attributes to use. There are a lot of clustering

techniques in Weka. Clustering is where you want to learn something even when there is no class

value: you want to cluster the instances according to their attribute values.

Association rules are another kind of learning technique where we're looking for associations

between attributes. There's no particular class, but we're looking for any strong associations

between any of the attributes. Again, that's another panel in the Explorer. Text classification. There

are some fantastic text filters in Weka which allow you to handle textual data as words, or as

characters, or n-grams (sequences of three, four, or five consecutive characters). You can do text

mining using Weka. Finally, we've focused exclusively on the Weka Explorer, but the Weka

Experimenter is also worth getting to know. We've done a fair amount of rather boring, tedious

calculations of means and standard deviations manually by changing the random-number seed and

running things again. That's very tedious to do by hand. The Experimenter makes it very easy to do

this automatically. So, there's a lot more to learn, and I'm wondering if you'd be interested in an

Advanced Data Mining with Weka course. I'm toying with the idea of putting one on, and I'd like you

to let us know what you think about the idea, and what you'd like to see included. Let me just finish

off here with a final thought. We've been talking about data, data mining. Data is recorded facts, a

change of state in the world, perhaps.

That's the input to our data mining process, and the output is information, the patterns -- the

expectations -- that underlie that data: patterns that can be used for prediction in useful applications

in the real world. We're going from data to information. Moving up in the world of people, not

computers, "knowledge" is the accumulation of your entire set of expectations, all the information

that you have and how it works together -- a large store of expectations and the different situations

where they apply. Finally, I like to define "wisdom" as the value attached to knowledge. I'd like to

encourage you to be wise when using data mining technology. You've learned a lot in this course.

You've got a lot of power now that you can use to analyze your own datasets. Use this technology

wisely for the good of the world. That's my final thought for you. There is an activity associated with

this lesson, a little revision activity. Go and do that, and then do the final assessment, and we will

send you your certificate if you do well enough. Good luck! It's been good talking to you, and maybe

we'll see you in an advanced version of this course. Bye for now!
