K MEANS CLUSTERING
Machine learning algorithms are broadly divided into 2 classes: 1. supervised learning 2. unsupervised learning. K-means clustering is an unsupervised algorithm.
Suppose we are given the following dataset and do not know what the target variable is. We have to group
these data points. By visual examination, we can divide the points into 2 groups (or clusters), as
shown on the right.
Here, we can take k = 2 (as there are 2 clusters). So, we have to take 2 random points to act as the
initial centroids. The next step is to assign each data point to a cluster. The assignment depends on
which of the two random points the data point is nearest to.
For example, the red point is at the center of the red cluster and the green point is at the center of the green
cluster. We have to move (adjust) these 2 points to the centroids of the 2 clusters, repeating until the
data points are clearly separated into these 2 groups.
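The procedure described above (assign each data point to the nearest chosen point, then move the chosen points to the cluster centroids) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the sample points, the starting centroids, and the function names `nearest` and `k_means` are invented for the example.

```python
import math

def nearest(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def k_means(points, centroids, max_iter=100):
    """Alternate the assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # an empty cluster keeps its old centroid
        if new_centroids == centroids:
            break  # nothing moved, so the grouping is stable
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two visually separate groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
# Hypothetical starting centroids (in practice these are chosen at random).
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
print(clusters)
```

With well-separated groups like these, the loop settles after a couple of passes. With a poor random start, the result can differ from run to run, which is one reason library implementations typically try several starts.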
Looking at these data points, we thought there are 2 clusters. But other people may see 4 or
6 clusters, as shown in the figure below:
Nageswarao Datatechs
But how do we determine the correct number of clusters (k)? For this purpose we use the ‘elbow’ method.
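In the elbow method, we start by finding the distance between each data point and the centroid of its
cluster. For each cluster, square these distances and add them up to get that cluster's sum of squared
errors: SSE1 for the first cluster, SSE2 for the second, and so on. We square the distances so that
positive and negative differences do not cancel each other out.
Finally, find the total sum of squared errors, with one term per cluster: SSE = SSE1 + SSE2 + SSE3 + …
The above calculations are done for k=2, where the total is SSE1 + SSE2. When we repeat the same
process for k=3, k=4, etc., we get one total SSE value for each k.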
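The SSE calculation described above can be worked through on a small example. The sketch below assumes two hypothetical clusters whose centroids are already known; the data values and the function name `cluster_sse` are made up for illustration.

```python
def cluster_sse(members, centroid):
    """Sum of squared distances from each member of a cluster to its centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(point, centroid))
        for point in members
    )

# Hypothetical clusters; each centroid is the mean of its cluster's points.
cluster_1 = [(1.0, 1.0), (3.0, 1.0)]
centroid_1 = (2.0, 1.0)
cluster_2 = [(8.0, 8.0), (8.0, 10.0), (8.0, 12.0)]
centroid_2 = (8.0, 10.0)

sse_1 = cluster_sse(cluster_1, centroid_1)  # 1 + 1
sse_2 = cluster_sse(cluster_2, centroid_2)  # 4 + 0 + 4
total_sse = sse_1 + sse_2                   # total SSE for k = 2
print(sse_1, sse_2, total_sse)
```

In scikit-learn, the fitted `KMeans` object exposes this total as `inertia_`.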
Let us plot a graph of SSE values against k values. As k increases, SSE generally goes down, and when
k equals the number of data points SSE becomes 0, because every point is then its own centroid. This is
shown in the following figure. The important point is to find the elbow point in the graph, where the
curve bends and further increases in k reduce SSE only slightly. In the figure, that point is 4. Hence
k=4 is the optimum value for grouping the data points.
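The elbow is normally read off the plot by eye. As a rough aid, one possible heuristic (an assumption here, not the method used in these notes) is to pick the k where the curve bends most sharply, measured by the largest second difference of the SSE values. The SSE numbers below are invented to mimic a curve with its elbow at k=4, and `elbow_k` is a hypothetical helper name.

```python
def elbow_k(ks, sse):
    """Return the k at which the SSE curve bends most sharply."""
    # Second difference at each interior point: how much the drop in SSE shrinks there.
    bends = [sse[i - 1] - 2 * sse[i] + sse[i + 1] for i in range(1, len(sse) - 1)]
    # bends[0] belongs to ks[1], so shift the index by one.
    return ks[1 + bends.index(max(bends))]

# Hypothetical SSE values for k = 1..7 (not computed from any real dataset).
ks = list(range(1, 8))
sse = [100.0, 75.0, 50.0, 25.0, 22.0, 20.0, 18.5]
best_k = elbow_k(ks, sse)
print(best_k)
```

On real data the curve is often smoother than this, so the heuristic is best treated as a suggestion to check against the plot.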
Example for clustering: dividing the items (like vegetables, oils, soaps, etc.) into clusters in
supermarkets.
Problem: Cluster the data related to Age and Income and find some important characteristics.
Dataset: income.csv
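import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# load the dataset named above
df = pd.read_csv('income.csv')

# on the raw data the clusters are not grouped correctly. The reason is that
# Age and Income($) are on very different scales.
# use proper scaling. we use MinMaxScaler()
scaler = MinMaxScaler()
# fit the scaler to Income($) and rescale the column to the range 0 to 1
scaler.fit(df[['Income($)']])
df['Income($)'] = scaler.transform(df[['Income($)']])
# fit the scaler to Age and rescale the column to the range 0 to 1
scaler.fit(df[['Age']])
df['Age'] = scaler.transform(df[['Age']])
df.head()

# run k-means again on the scaled data (this step is reconstructed; k=3 is
# assumed because the plot below uses three groups df1, df2, df3)
km = KMeans(n_clusters=3)
df['cluster'] = km.fit_predict(df[['Age','Income($)']])
df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
df3 = df[df.cluster==2]

# plot each cluster in its own color
plt.scatter(df1.Age,df1['Income($)'],color='green')
plt.scatter(df2.Age,df2['Income($)'],color='red')
plt.scatter(df3.Age,df3['Income($)'],color='black')
# [:,0] --> all rows and column 0, [:,1] --> all rows and column 1
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],color='purple',marker='*',label='centroid')
plt.xlabel('Age')
plt.ylabel('Income ($)')
plt.legend()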
# elbow plot
sse = []
k_rng = range(1,10)
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(df[['Age','Income($)']])
    # inertia_ is the sum of squared distances of the points to their nearest centroid
    sse.append(km.inertia_)
plt.xlabel('K')
plt.ylabel('Sum of squared error')
plt.plot(k_rng,sse)
# as seen in the above plot, the elbow is at k=3. So, we have to take them as 3 groups
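The scaling step used above matters because k-means compares points by distance, and income values are numerically far larger than ages. The sketch below illustrates this with the default MinMaxScaler formula, (x - min) / (max - min), written out in plain Python. The four people and the helper name `min_max_scale` are hypothetical, chosen only to show the effect.

```python
import math

def min_max_scale(values):
    """Rescale values to the range 0 to 1 using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical people A, B, C, D (not taken from income.csv).
ages = [25, 45, 26, 60]
incomes = [60000, 60500, 61500, 160000]

# Raw distances: the income difference swamps the age difference.
raw = list(zip(ages, incomes))
raw_ab = math.dist(raw[0], raw[1])  # A vs B: 20 years apart, incomes 500 apart
raw_ac = math.dist(raw[0], raw[2])  # A vs C: 1 year apart, incomes 1500 apart

# Scaled distances: both features now contribute on the same 0 to 1 scale.
scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)
scaled = list(zip(scaled_ages, scaled_incomes))
scaled_ab = math.dist(scaled[0], scaled[1])
scaled_ac = math.dist(scaled[0], scaled[2])

print(raw_ab < raw_ac)        # on raw values A looks closer to B
print(scaled_ac < scaled_ab)  # after scaling A is closer to C
```

Before scaling, A is grouped with B despite the 20-year age gap. After scaling, A sits next to C, who is almost the same age and has a similar income.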
a) Use the Iris flower dataset from the sklearn library and try to form clusters of flowers using the petal width
and length features. Drop the other two features for simplicity.
c) Draw the elbow plot and use it to figure out the optimal value of k.