
Machine Learning

Dr. Faraz Akram

Riphah International University


Unsupervised learning
Clustering
Supervised learning
[Figure: labeled training examples (label1 … label5) are fed into a model/predictor]
Supervised learning: given labeled examples

Unsupervised learning

Unsupervised learning: given data, i.e. examples, but no labels

Unsupervised learning: Clustering

[Figure: raw data → extract features → feature vectors (f1, f2, f3, …, fn) → group into classes/clusters → clusters]

No supervision: we're only given data and want to find natural groupings
What is Clustering?
A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
Examples within a cluster are very similar; examples in different clusters are very different.
Clustering example
Image segmentation:
Goal: Break up the image into meaningful or perceptually similar regions
K-Means clustering
An iterative clustering algorithm
Initialize: Pick K random points as cluster centers
Repeat:
1. Assign data points to the closest cluster center
2. Change each cluster center to the average of its assigned points
Stop when no point assignments change
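Putting these steps together, here is a minimal Python/NumPy sketch of the loop (variable names, the Euclidean-distance choice, and the lack of empty-cluster handling are my assumptions, not part of the slides):

import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Basic K-means. X is an (n, d) array: n examples, d features."""
    rng = np.random.default_rng(seed)
    # Initialize: pick K random data points as the cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        # 1. Assign each point to its closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 2. Move each center to the mean of its assigned points
        #    (empty clusters are not handled in this sketch)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Stop when the centers (and hence the assignments) no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels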
K-means: an example
K-means: Initialize centers randomly
K-means: assign points to nearest center
K-means: readjust centers
K-means: assign points to nearest center
K-means: readjust centers
K-means: assign points to nearest center
K-means: readjust centers
K-means: assign points to nearest center

No changes: Done
K-means

Iterate:
Assign/cluster each example to closest center
Recalculate centers as the mean of the points in a cluster

How do we do this?
K-means

Iterate:
Assign/cluster each example to closest center
iterate over each point:
- get distance to each cluster center
- assign to closest center (hard cluster)
Recalculate centers as the mean of the points in a cluster
K-means

Iterate:
Assign/cluster each example to closest center
iterate over each point:
- get distance to each cluster center
- assign to closest center
Recalculate centers as the mean of the points in a cluster

What distance measure should we use?
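The worked example later in these slides (and most K-means implementations) uses Euclidean distance. A small sketch, with an illustrative function name of my choosing:

import numpy as np

def euclidean(x, c):
    """Euclidean (straight-line) distance between point x and center c."""
    x, c = np.asarray(x, dtype=float), np.asarray(c, dtype=float)
    return float(np.sqrt(np.sum((x - c) ** 2)))

# e.g. distance from point (4, 3) to center (2, 1):
# euclidean([4, 3], [2, 1])  ->  2.83 (i.e. sqrt(8))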


K-means

Iterate:
Assign/cluster each example to closest center
Recalculate centers as the mean of the points in a cluster

Where are the cluster centers?


K-means

Iterate:
Assign/cluster each example to closest center
Recalculate centers as the mean of the points in a cluster

How do we calculate these?
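Each new center is simply the component-wise mean of the points currently assigned to that cluster. A tiny illustration (names are mine):

import numpy as np

def recompute_center(points):
    """New cluster center = component-wise mean of the assigned points."""
    return np.mean(np.asarray(points, dtype=float), axis=0)

# e.g. for a cluster containing (2, 1), (4, 3) and (5, 4):
# recompute_center([[2, 1], [4, 3], [5, 4]])  ->  [3.67, 2.67] (approximately)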


Example

We have 4 types of medicines, and each medicine has two features (weight index and pH). Our goal is to group these objects into K = 2 clusters.

Medicine   Weight index (X)   pH (Y)
A          1                  1
B          2                  1
C          4                  3
D          5                  4

[Figure: scatter plot of the four medicines, weight index (X) vs pH (Y)]
Iteration-1
Initial value of centroids: suppose we use medicine A and medicine B as the first centroids, c1 = (1, 1) and c2 = (2, 1).

Medicine   X   Y   Dist to c1   Dist to c2   Cluster
A          1   1   0            1            C-1
B          2   1   1            0            C-2
C          4   3   3.61         2.83         C-2
D          5   4   5            4.24         C-2

[Figure: data points with the two initial centroids]

Recompute centroids: c1 = (1, 1) (only A), c2 = mean of B, C, D = (3.67, 2.67)
Iteration-2
Centroids from iteration 1: c1 = (1, 1), c2 = (3.67, 2.67).

Medicine   X   Y   Dist to c1   Dist to c2   Cluster
A          1   1   0            3.14         C-1
B          2   1   1            2.36         C-1
C          4   3   3.61         0.47         C-2
D          5   4   5            1.89         C-2

[Figure: data points with the recomputed centroids]

Recompute centroids: c1 = mean of A, B = (1.5, 1), c2 = mean of C, D = (4.5, 3.5). A further iteration leaves the assignments unchanged, so the algorithm stops with clusters {A, B} and {C, D}.
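For reference, these two iterations can be reproduced with a few lines of Python/NumPy (this reuses the update loop sketched earlier; the printed values match the tables above):

import numpy as np

X = np.array([[1.0, 1.0],   # A
              [2.0, 1.0],   # B
              [4.0, 3.0],   # C
              [5.0, 4.0]])  # D

centers = X[:2].copy()       # start from A and B, as in iteration 1
for _ in range(10):
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    new_centers = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    if np.allclose(new_centers, centers):
        break
    centers = new_centers

print(labels)   # [0 0 1 1]             -> clusters {A, B} and {C, D}
print(centers)  # [[1.5 1. ] [4.5 3.5]] -> final centroids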
K-means variations/parameters
Initial (seed) cluster centers

Convergence
A fixed number of iterations
Partitions unchanged
Cluster centers don't change
K-means: Initialize centers randomly

What would happen here?

Seed selection ideas?


Seed choice
Results can vary drastically based on random seed selection

Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings

Common choices
Random point in feature space
Random point from dataset
Points least similar to any existing center (furthest centers heuristic)
Try out multiple starting points (a sketch of this follows below)
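Trying multiple starting points usually means running K-means from several seeds and keeping the run with the smallest within-cluster sum of squared distances. A minimal sketch, assuming a kmeans(X, k, seed=...) helper like the one sketched earlier (function and variable names are illustrative):

import numpy as np

def kmeans_restarts(X, k, n_restarts=10):
    """Run K-means from several random seeds and keep the best result,
    judged by the within-cluster sum of squared distances."""
    best = None
    for seed in range(n_restarts):
        centers, labels = kmeans(X, k, seed=seed)   # kmeans() as sketched earlier
        sse = float(np.sum((X - centers[labels]) ** 2))
        if best is None or sse < best[0]:
            best = (sse, centers, labels)
    return best[1], best[2]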
Furthest centers heuristic

μ1 = pick random point

for i = 2 to K:
    μi = point that is furthest from any previous center
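A direct translation of this heuristic into Python/NumPy (function name and details are illustrative; ties and outliers are not handled):

import numpy as np

def furthest_centers(X, k, seed=0):
    """Pick the first center at random, then repeatedly take the data point
    that is furthest from its nearest already-chosen center."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # distance from every point to its nearest chosen center
        d = np.linalg.norm(X[:, None, :] - np.asarray(centers)[None, :, :], axis=2).min(axis=1)
        centers.append(X[d.argmax()])   # the furthest point becomes the next center
    return np.asarray(centers)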
K-means: Initialize furthest from centers

Pick a random point for the first center


K-means: Initialize furthest from centers

What point will be chosen next?


K-means: Initialize furthest from centers

Furthest point from center

What point will be chosen next?


K-means: Initialize furthest from centers

Furthest point from center

What point will be chosen next?


K-means: Initialize furthest from centers

Furthest point from center

Any issues/concerns with this approach?


Furthest points concerns

If k = 4, which points will get chosen?


Furthest points concerns

If we do a number of trials, will we get


different centers?
Furthest points concerns

Doesn't deal well with outliers


K-means++

μ1 = pick random point

for k = 2 to K:
    for i = 1 to N:
        si = min d(xi, μ1…μk−1) // smallest distance to any center

    μk = randomly pick a point proportional to s

How does this help?


K-means++
μ1 = pick random point

for k = 2 to K:
    for i = 1 to N:
        si = min d(xi, μ1…μk−1) // smallest distance to any center

    μk = randomly pick a point proportional to s

- Makes it possible to select points other than the furthest one
- If #points >> #outliers, we will pick good points
- Makes it non-deterministic, which will help with multiple random runs
- Nice theoretical guarantees!
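A sketch of this seeding rule in Python/NumPy, following the slide's description of sampling proportional to s (note: the published k-means++ samples proportional to s squared; the names here are illustrative):

import numpy as np

def pp_seeding(X, k, seed=0):
    """Seeding as described above: the first center is random; each further
    center is sampled with probability proportional to s_i, the distance
    from point i to its nearest existing center.
    (The published k-means++ samples proportional to s_i squared.)"""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        s = np.linalg.norm(X[:, None, :] - np.asarray(centers)[None, :, :], axis=2).min(axis=1)
        probs = s / s.sum()                        # points already chosen get probability 0
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.asarray(centers)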
Pros
Easy to use
Good initial method

Cons
Need to know K
Problems when clusters are of different sizes or densities
Can't handle outliers well
Summary
Definition of clustering
Difference between supervised and unsupervised learning.
Finding labels for each datum.
Clustering algorithms
K-means
Always produces exactly K clusters.
Find new mean values.
Find new clusters.
Stop when nothing changes in the clusters (or the changes are smaller than some very small value).
