
Unit 4: Unsupervised Learning
Adama Science and Technology University
School of Electrical Engineering and Computing
Department of Computer Science and Engineering

Dr. Mesfin Abebe Haile (2019)


Outline

 Grouping Unlabeled Items Using K-means Clustering.

 Association Analysis with the Apriori Algorithm.

 Efficiently Finding Frequent Items with FP-growth.

Grouping Unlabeled Items Using K-means Clustering
Clustering in Machine Learning
 Clustering is the assignment of a set of observations into subsets (clusters) so that observations in the same cluster are similar in some sense.

 Clustering is a method of unsupervised learning.


 Unsupervised: data points have unknown outcome
 Supervised: data points have known outcome
 K-means clustering is an algorithm that groups objects into K groups based on their attributes/features.
 K is a positive integer.
K-means Clustering

 We know beforehand that these objects (medicines) belong to two groups (k = 2): cluster 1 and cluster 2.
 The problem now is to determine which medicines belong to cluster 1 and which belong to the other cluster.
K-means Clustering

 The basic steps of K-means clustering are simple.

 In the beginning, we determine the number of clusters K and assume the centroids (centers) of these clusters.

 We can take any random objects as the initial centroids, or the first K objects in sequence can also serve as the initial centroids.

 The K-means algorithm will then perform the following four steps until convergence.
K-means Clustering

 Step 1: Begin with a decision on the value of k = the number of clusters.

 Step 2: Put any initial partition that classifies the data into k clusters.
 You may assign the training samples randomly, or systematically as follows:
 Take the first k training samples as single-element clusters.
 Assign each of the remaining (N - k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
K-means Clustering

 Step 3: Take each sample in sequence and compute its distance from the centroid of each of the clusters.
 If a sample is not currently in the cluster with the closest centroid, switch this sample to that cluster and update the centroids of the cluster gaining the new sample and the cluster losing it.

 Step 4: Repeat step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.
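The four steps above can be written compactly in code. The following is a minimal NumPy sketch for illustration only; the function and variable names are our own and are not part of the lecture or of any library.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-means sketch. X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose k and take k random samples as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every sample to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update each centroid as the mean of the samples assigned to it
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop when a full pass causes no change (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```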

K-means Algorithm (illustration)

 K = 2 (find two clusters)
 K = 2 (randomly assign cluster centers)
 K = 2 (each point belongs to closest center)
 K = 2 (move each center to cluster’s mean)
 K = 2 (each point belongs to closest center)
 K = 2 (move each center to cluster’s mean)
 K = 2 (points don’t change - converged)
K-means Clustering (Example)

 If the number of data points is less than the number of clusters, we assign each data point as the centroid of a cluster, and each centroid is given a cluster number.

 If the number of data points is greater than the number of clusters, then for each data point we calculate the distance to every centroid and take the minimum.
 A data point is said to belong to the cluster whose centroid is at minimum distance from it.
K-means Clustering (Example)

 Suppose we have several objects (4 types of medicines), and each object has two attributes or features, as shown in the table.
 Our goal is to group these objects into K = 2 groups of medicine based on the two features (pH and weight index).
K-means Clustering (Example)

 Each medicine represents one point with two attributes (X, Y), which we can plot as a coordinate in an attribute space, as shown in the figure.
K-means Clustering (Example)

 1. Initial value of centroids: suppose we use medicine A and medicine B as the first centroids.

 Let C1 and C2 denote the coordinates of the centroids; then C1 = (1, 1) and C2 = (2, 1).
K-means Clustering (Example)

 2. Objects-centroids distances: let us calculate the distance from each cluster centroid to each object.
 Using the Euclidean distance, we obtain the distance matrix at iteration 0.
K-means Clustering (Example)

 Each column in the distance matrix represents one object.

 The first row of the distance matrix corresponds to the distance of each object to the first centroid, and the second row is the distance of each object to the second centroid.

 For example, the distance from medicine C = (4, 3) to the first centroid C1 = (1, 1) is sqrt((4 - 1)^2 + (3 - 1)^2) = sqrt(13) ≈ 3.61.

 Its distance to the second centroid C2 = (2, 1) is sqrt((4 - 2)^2 + (3 - 1)^2) = sqrt(8) ≈ 2.83.
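The whole distance matrix can be reproduced with a few lines of NumPy. The coordinates of A, B and C are taken from the example above; the coordinates of medicine D are assumed to be (5, 4) purely for illustration, since the original data table is not reproduced here.

```python
import numpy as np

# A, B, C from the example; D = (5, 4) is an assumed value for illustration.
points = np.array([[1, 1], [2, 1], [4, 3], [5, 4]], dtype=float)   # A, B, C, D
centroids = np.array([[1, 1], [2, 1]], dtype=float)                # C1, C2

# Rows correspond to the centroids, columns to the objects,
# matching the layout of the distance matrix on the slide.
dist = np.linalg.norm(centroids[:, None, :] - points[None, :, :], axis=2)
print(dist.round(2))
# Each object is then assigned to the centroid with the minimum value in its column.
print(dist.argmin(axis=0))   # 0 -> group 1, 1 -> group 2
```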
K-means Clustering (Example)

 3. Objects clustering: assign each object to the group whose centroid is at minimum distance.
 Thus, medicine A is assigned to group 1, medicine B to group 2, medicine C to group 2 and medicine D to group 2.

 The element of the Group matrix below is 1 if and only if the object is assigned to that group:
 G0 = [1 0 0 0; 0 1 1 1]   (rows: group 1, group 2; columns: A, B, C, D)
K-means Clustering (Example)

 4. Iteration 1, determine centroids: knowing the members of each group, we now compute the new centroid of each group based on these new memberships.
 Group 1 has only one member, so its centroid remains C1 = (1, 1).
 Group 2 has three members, so its centroid is the average of the coordinates of the three members.
K-means Clustering (Example)

 5. Iteration 1, objects-centroids distances: the next step is to compute the distance of all objects to the new centroids.
 Similarly to step 2, we obtain the distance matrix at iteration 1.
K-means Clustering (Example)

 6. Iteration 1, objects clustering: similarly to step 3, we assign each object based on the minimum distance.
 Based on the new distance matrix, we move medicine B to Group 1, while all other objects stay where they are.
 The Group matrix becomes:
 G1 = [1 1 0 0; 0 0 1 1]   (rows: group 1, group 2; columns: A, B, C, D)
K-means Clustering (Example)

 7. Iteration 2, determine centroids: we repeat step 4 to calculate the new centroid coordinates based on the clustering of the previous iteration.

 Group 1 and Group 2 both have two members, so the new centroids are the means of their members' coordinates:
 C1 is the mean of A = (1, 1) and B = (2, 1), i.e. C1 = (1.5, 1), and C2 is the mean of C and D.
K-means Clustering (Example)

 8. Iteration 2, objects-centroids distances: repeating step 2, we obtain the new distance matrix at iteration 2.
K-means Clustering (Example)

 9. Iteration 2, objects clustering: again, we assign each object based on the minimum distance.
K-means Clustering (Example)

 We obtain the result that G2 = G1.

 Comparing the grouping from the last iteration with this iteration shows that the objects no longer change groups.

 Thus, the K-means clustering computation has reached stability and no more iterations are needed.
 We obtain the final grouping as the result.
K-means Clustering (Example)

K-means Clustering

 Which model is the right one?


 Inertia: the sum of squared distances from each point xi to the centroid ck of its assigned cluster:
 inertia = sum over all points i of ||xi - ck||^2

 A smaller value corresponds to tighter clusters.


 Other metrics can be used.
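For example, a common way to compare models with different values of K is to fit one model per K and compare their inertia values (the "elbow" method). The sketch below uses scikit-learn with toy data invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: three Gaussian blobs, invented for illustration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ([0, 0], [3, 3], [0, 3])])

inertias = {}
for k in range(1, 10):
    model = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_   # sum of squared distances to the closest centroid

# Smaller inertia means tighter clusters; look for the "elbow" where adding
# another cluster stops giving a large improvement.
for k, value in inertias.items():
    print(k, round(value, 2))
```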

K-Means : the Syntax

 Import the class containing the clustering method.
 from sklearn.cluster import KMeans

 Create an instance of the class.
 kmeans = KMeans(n_clusters=3, init='k-means++')

 Fit the instance on the data and then predict clusters for new data.
 kmeans = kmeans.fit(X1)
 y_predict = kmeans.predict(X1)

 K-means can also be used in batch mode with MiniBatchKMeans.
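Putting the calls above together, here is a small end-to-end sketch; the data is generated only for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

# Toy data: three blobs, invented for illustration.
rng = np.random.default_rng(0)
X1 = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ([0, 0], [3, 3], [0, 3])])

kmeans = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
kmeans = kmeans.fit(X1)            # fit the instance on the data
y_predict = kmeans.predict(X1)     # predict cluster labels

print(kmeans.cluster_centers_)
print(y_predict[:10])

# The same interface works in batch mode:
mbk = MiniBatchKMeans(n_clusters=3, random_state=0).fit(X1)
```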
Distance Metrics

 Distance metric choice:


 Choice of distance metric is extremely important to clustering success.
 Each metric has strengths and most appropriate use-cases.
 But sometimes choice of distance metric is also based on
empirical evaluation.

Distance Metrics

 Euclidean distance:
 d(x, y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 )
Distance Metrics

 Manhattan distance:
 d(x, y) = |x1 - y1| + |x2 - y2| + ... + |xn - yn|
Distance Metrics

 Cosine distance:
 d(x, y) = 1 - (x · y) / (||x|| ||y||)
Euclidean Vs Cosine Distance

 Euclidean distance is useful for coordinate-based measurements.

 Cosine distance is better for data such as text, where the location of occurrence is less important.

 Euclidean distance is more sensitive to the curse of dimensionality.
Distance Metrics

 Jaccard distance:
 Applies to sets (like word occurrence)
 Sentence A: “I like chocolate ice cream.”
 Set A = {I, like, chocolate, ice, cream}
 Sentence B: “Do I want chocolate cream or vanilla cream?”
 Set B = {Do, I, want, chocolate, cream, or, vanilla}

Distance Metrics

 Jaccard distance:
 Jaccard distance = 1 - |A ∩ B| / |A ∪ B|
 For the two sentences above, A ∩ B = {I, chocolate, cream}, so |A ∩ B| = 3 and |A ∪ B| = 9, and the Jaccard distance is 1 - 3/9 ≈ 0.67.
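That number is easy to check in Python with plain sets (tokenization here is deliberately simplified: punctuation removed, capitalization kept as on the slide):

```python
a = set("I like chocolate ice cream".split())
b = set("Do I want chocolate cream or vanilla cream".split())

jaccard_distance = 1 - len(a & b) / len(a | b)
print(a & b)                        # {'I', 'chocolate', 'cream'}
print(round(jaccard_distance, 2))   # 0.67
```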
Distance Metrics : the Syntax

 Import the general pairwise distance function.
 from sklearn.metrics import pairwise_distances

 Calculate the distances.
 dist = pairwise_distances(X, Y, metric='euclidean')

 Other distance metric choices are: 'cosine', 'manhattan', 'jaccard', etc.
 Distance metric functions can also be imported individually, e.g.:
 from sklearn.metrics.pairwise import euclidean_distances
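A small self-contained example of the calls above; the two point sets are invented for illustration:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

X = np.array([[1.0, 0.0], [1.0, 1.0]])
Y = np.array([[0.0, 1.0], [2.0, 2.0]])

print(pairwise_distances(X, Y, metric='euclidean'))
print(pairwise_distances(X, Y, metric='manhattan'))
print(pairwise_distances(X, Y, metric='cosine'))
```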
Other Types of Clustering

 Other types of clustering:


 Mini-Batch K-Means
 Affinity Propagation
 Mean Shift
 Spectral Clustering
 Ward
 DBSCAN etc…

Association Analysis with the Apriori Algorithm
Mining Association Rules

 The goal of association rule mining is to extract correlation relationships from large datasets of items.
 An illustrative example of association rule mining is the so-called market basket analysis.
Mining Association Rules

 What could a rule be, and what kind of rules are we looking for?
 An example of an association rule could be:
 computer => financial_management_software

 This means that a purchase of a computer implies a purchase of financial_management_software.
 Naturally, this rule may not hold for all customers and every single purchase.
 Thus, we are going to associate two numbers with every such rule.
 These two numbers are called support and confidence.
Notation and Basic Concepts

 They are used as a measure of the interestingness of the rules.


 Let Ω = {i1, i2, … im} be a universe of items.
 Also, let T = {t1, t2, …tn} be a set of all transactions collected over
a given period of time.
 Thus, t ⊆ Ω (“t is a subset of omega”). In reality, each transaction t
is assigned a number, for example a transaction id (TID).

 An association rule is an implication of the form A => B,
 where both A and B are subsets of Ω and A ∩ B = ∅ ("the intersection of sets A and B is an empty set").
Notation and Basic Concepts

 What is support?
 Support (frequency) is simply the probability that a randomly chosen transaction t contains both itemsets A and B:
 support(A => B) = P(A ∪ B) = (number of transactions containing all items of A ∪ B) / (total number of transactions)
Notation and Basic Concepts

 What is confidence?
 Confidence (accuracy) is simply the probability that itemset B is purchased in a randomly chosen transaction t, given that itemset A is purchased:
 confidence(A => B) = P(B | A) = support(A ∪ B) / support(A)
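Both measures are easy to compute directly from a transaction list. The following sketch uses a toy transaction database invented for illustration:

```python
# Toy transaction database, invented for illustration.
transactions = [
    {"computer", "financial_management_software", "printer"},
    {"computer", "financial_management_software"},
    {"computer", "printer"},
    {"printer", "paper"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(A, B, transactions):
    """confidence(A => B) = support(A | B) / support(A)."""
    return support(A | B, transactions) / support(A, transactions)

A = {"computer"}
B = {"financial_management_software"}
print(support(A | B, transactions))    # 0.5    -> support of the rule A => B
print(confidence(A, B, transactions))  # ≈ 0.67 -> confidence of A => B
```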
Notation and Basic Concepts

 The goal is to find interesting rules.


 We would like to detect those with high support and high
confidence.
 Typically, we will select appropriate thresholds for both measures
and then look for all subsets that fulfill given support and
confidence criteria.
 A set of k items is called a k-itemset.
 For example, {bread, skim milk, pringles} is a 3-itemset.
 An itemset whose count (or probability) is greater than some pre-
specified threshold is called a frequent itemset.
Notation and Basic Concepts

 How are we going to find interesting rules from the database T?

 It will be a two-step process:


 Find all frequent itemsets (each of these itemsets will occur at least
as frequently as pre-specified by the minimum support threshold)

 Generate strong association rules from the frequent itemsets (these


rules will satisfy both minimum support threshold and minimum
confidence threshold)

Apriori Algorithm

 In the following transaction database D, find all frequent itemsets.

 The minimum support count is 2.
Apriori Algorithm

 The transaction table and the step-by-step derivation of the frequent itemsets (candidate generation and pruning at each level) were shown as figures on the original slides.
Apriori Algorithm

 The Apriori algorithm employs a level-wise search for frequent itemsets.
 In particular, frequent k-itemsets are used to find frequent (k + 1)-itemsets.

 This is all based on the following property:
 All nonempty subsets of a frequent itemset must also be frequent.
 This means that, in order to find Lk+1, we only need to look at Lk.
 There are two steps in this process: the join step and the prune step.
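A compact sketch of this level-wise search (join + prune) in Python. The transaction list is a toy database loosely modeled on the classic textbook example, not necessarily the table from the original slides, and the code favors clarity over efficiency:

```python
from itertools import combinations

# Toy transaction database, invented for illustration.
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]
min_support_count = 2

def count(itemset):
    """Support count: number of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions)

# L1: frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
Lk = [frozenset([i]) for i in items if count(frozenset([i])) >= min_support_count]
frequent = list(Lk)

while Lk:
    # Join step: combine frequent k-itemsets into candidate (k+1)-itemsets.
    k = len(Lk[0]) + 1
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    # Prune step: every (k-1)-subset of a candidate must itself be frequent.
    candidates = {c for c in candidates
                  if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
    Lk = [c for c in candidates if count(c) >= min_support_count]
    frequent.extend(Lk)

for itemset in frequent:
    print(sorted(itemset), count(itemset))
```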
Apriori Algorithm

 How can we generate association rules from frequent itemsets?


 Once we find all frequent itemsets, we can easily calculate
confidence for any rule.

 For every frequent itemset, we generate all nonempty proper


subsets.
 Run each subset (and its complement) through the above formula.
 Those rules whose confidence is above the pre-specified threshold are output as association rules.
Apriori Algorithm

 In the example below, the itemset l = {I1, I2, I5} is frequent.

 Its nonempty proper subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.
 The confidences of all the candidate association rules can then be computed with the confidence formula above (the table of values was shown as a figure on the original slide).

 If the minimum confidence threshold were 75%, only the second, third and sixth rules would be considered strong and thus output.
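The candidate rules for l = {I1, I2, I5} and their confidences can be enumerated as in the sketch below, which reuses the same toy transaction database as above; note that the order in which the rules are printed need not match the numbering used on the slide:

```python
from itertools import combinations

# Same toy transaction database as in the previous sketch.
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def count(itemset):
    return sum(itemset <= t for t in transactions)

l = frozenset({"I1", "I2", "I5"})
min_confidence = 0.75

for r in range(1, len(l)):
    for antecedent in map(frozenset, combinations(sorted(l), r)):
        consequent = l - antecedent
        conf = count(l) / count(antecedent)  # confidence(A => B) = support(A ∪ B) / support(A)
        verdict = "strong" if conf >= min_confidence else "weak"
        print(f"{set(antecedent)} => {set(consequent)}: {conf:.2f} ({verdict})")
```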
Efficiently Finding Frequent Items with FP-growth
Disadvantages of Apriori

 The candidate generation could be extremely slow (pairs, triplets,


etc.).
 The candidate generation could generate duplicates depending on
the implementation.

 The counting method iterates through all of the transactions each


time.
 Constant items make the algorithm a lot heavier.
 Huge memory consumption.

FP-growth

 FP-Growth is an improvement of Apriori designed to eliminate some of its heavy bottlenecks.
 The algorithm was designed with the benefits of MapReduce in mind, so it works well with any distributed system built around MapReduce.

 FP-Growth (Frequent Pattern growth) addresses these problems of Apriori by using a structure called an FP-tree.
 In an FP-tree, each node represents an item and its current count, and each branch represents a different association.
FP-growth

 The whole algorithm is divided into 5 simple steps. Here is a simple example:
 Our client is named Mario, and these are his transactions:

 TMario= [ [beer, bread, butter, milk] ,


[beer, milk, butter],
[beer, milk, cheese] ,
[beer, cheese, bread],
[beer, butter, diapers, cheese] ]

FP-growth

Step 1:
 The first step is to count all the items across all the transactions.
 TMario = [ beer: 5, bread: 2, butter: 3, milk: 3, cheese: 3, diapers: 1 ]

Step 2: (apply threshold)

 For this example, let's say we have a threshold of 30%, so each item has to appear at least twice.
 TMario = [ beer: 5, bread: 2, butter: 3, milk: 3, cheese: 3 ]   (diapers, with a count of 1, is dropped)
FP-growth

Step 3:
 Now we sort the list according to the count of each item.
 TMario_sorted = [ beer: 5, butter: 3, milk: 3, cheese: 3, bread: 2 ]

Step 4: (build the tree)

 Go through each of the transactions and add its items in the order they appear in our sorted list.
 Transaction to add = [ beer, bread, butter, milk ], which in sorted-list order becomes [ beer, butter, milk, bread ].
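Steps 1-3 and the re-ordering of each transaction can be sketched in a few lines of Python; the variable names are our own:

```python
from collections import Counter

transactions = [
    ["beer", "bread", "butter", "milk"],
    ["beer", "milk", "butter"],
    ["beer", "milk", "cheese"],
    ["beer", "cheese", "bread"],
    ["beer", "butter", "diapers", "cheese"],
]

# Step 1: count all items across all transactions.
counts = Counter(item for t in transactions for item in t)

# Step 2: apply the threshold (30% of 5 transactions -> at least 2 occurrences).
min_count = 2
frequent = {item: c for item, c in counts.items() if c >= min_count}

# Step 3: order the frequent items by descending count.
order = sorted(frequent, key=frequent.get, reverse=True)

# Step 4 (preparation): rewrite each transaction with only the frequent items,
# in sorted-list order, ready to be inserted into the FP-tree.
ordered_transactions = [[i for i in order if i in t] for t in transactions]
print(order)                 # e.g. ['beer', 'butter', 'milk', 'cheese', 'bread']
print(ordered_transactions)
```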
FP-growth

 Transaction 1 to add = [ beer, bread, butter, milk ]
 Transaction 2 = [ beer, milk, butter ]
 Transaction 3 = [ beer, milk, cheese ]
 Only when the transactions differ does the tree split.
 Transaction 4 = [ beer, cheese, bread ]
 Transaction 5 = [ beer, butter, diapers, cheese ]
 (The growing FP-tree after each insertion was shown as a figure on the original slides.)
FP-growth

Step 5:
 We go through every branch of the tree and include in the associations only the nodes whose count passes the threshold.
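A minimal sketch of the tree construction (steps 4-5) for the ordered transactions produced above. This builds the FP-tree only; a complete FP-Growth implementation would also maintain the header/link table and mine conditional trees, which is omitted here:

```python
class FPNode:
    def __init__(self, item=None, parent=None):
        self.item = item        # item name (None for the root)
        self.count = 0          # how many transactions pass through this node
        self.parent = parent
        self.children = {}      # item -> FPNode

def build_fp_tree(ordered_transactions):
    root = FPNode()
    for transaction in ordered_transactions:
        node = root
        for item in transaction:        # items already in sorted-list order
            child = node.children.get(item)
            if child is None:           # the tree splits only where transactions differ
                child = FPNode(item, parent=node)
                node.children[item] = child
            child.count += 1
            node = child
    return root

def print_tree(node, depth=0):
    if node.item is not None:
        print("  " * depth + f"{node.item}: {node.count}")
    for child in node.children.values():
        print_tree(child, depth + 1)

# Output of the previous sketch (frequent items only, in sorted-list order).
ordered_transactions = [
    ["beer", "butter", "milk", "bread"],
    ["beer", "butter", "milk"],
    ["beer", "milk", "cheese"],
    ["beer", "cheese", "bread"],
    ["beer", "butter", "cheese"],
]

tree = build_fp_tree(ordered_transactions)
print_tree(tree)
```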
FP-growth

 FP-Growth beats Apriori by far.

 FP-Growth uses less memory and has a shorter runtime.

 The differences are huge: FP-Growth is more scalable because of its linear running time.
 Don't think twice if you have to choose between these algorithms: use FP-Growth.
Apriori vs. FP-growth
Question & Answer
