
Nageswarao Datatechs

K MEANS CLUSTERING

Machine learning algorithms are commonly divided into 2 classes: 1. supervised learning and 2. unsupervised learning.

K Means clustering comes under unsupervised learning.

In the following dataset, there is no target variable, so there is no known label to predict. Instead,
we have to group the data points. By visual examination, we can divide these points into 2 groups
(or clusters), as shown on the right.

Here, we can take k = 2 (as there are 2 clusters). So, we start with 2 randomly chosen points, which
act as the initial centroids. The next step is to assign each data point to the centroid nearest to it,
so that every centroid gets its own cluster.

For example, the red point is at the center of the red cluster and the green point is at the center of
the green cluster. We adjust these 2 points repeatedly, moving each one to the centroid (mean) of its
cluster, until the data points are clearly separated into these 2 groups.
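
The assign-and-adjust loop described above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the toy data points and the stopping rule (stop when the assignments no longer change) are not from the handout, and scikit-learn's KMeans, used later, is more refined.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, max_iter=100, seed=0):
    """Return (centroids, labels) after running the assign/update loop."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are stable
        labels = new_labels
        # Step 3: move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

The cluster numbers themselves (0 or 1) carry no meaning; which group gets which number depends on the random starting points.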

Looking at these data points, we assumed there are 2 clusters. But other people may see 4 or
6 clusters, as shown in the figure below:

But how do we determine the correct number of clusters (k)? For this purpose we use the ‘elbow’ method.

In the elbow method, we start by finding the distance between each centroid and the data points in its
cluster. We square these distances and add them up, which gives the sum of squared errors (SSE) for
that cluster. Squaring keeps every term non-negative, so deviations in opposite directions do not
cancel out.

Finally, find the total sum of squared errors over all clusters: SSE = SSE1 + SSE2 + SSE3 + …

The above calculations are done for k=2. When we repeat the same process for k=3, k=4, etc., we get
one total SSE value for each k.
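
As a small worked example of the SSE calculation, the sketch below computes the SSE of each cluster and then the total. The two clusters and their points are made-up values chosen so the arithmetic is easy to check by hand, and the helper names are hypothetical.

```python
def centroid_of(points):
    """Mean of the points, coordinate by coordinate."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def sse_for_cluster(points):
    """Sum of squared distances from each point to the cluster centroid."""
    center = centroid_of(points)
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, center))
        for p in points
    )

# Hypothetical clusters (not taken from income.csv).
cluster_1 = [(1.0, 1.0), (1.0, 3.0)]               # centroid (1, 2)
cluster_2 = [(6.0, 5.0), (8.0, 5.0), (7.0, 8.0)]   # centroid (7, 6)

sse_1 = sse_for_cluster(cluster_1)   # 1 + 1 = 2
sse_2 = sse_for_cluster(cluster_2)   # 2 + 2 + 4 = 8
total_sse = sse_1 + sse_2            # SSE = SSE1 + SSE2 = 10
print(sse_1, sse_2, total_sse)
```

In scikit-learn, the same total is available after fitting as the inertia_ attribute of a KMeans object.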

Let us plot a graph of SSE values against k values. As k increases, SSE generally goes down, and it
becomes 0 when every data point is its own cluster (k equals the number of points). This is shown in
the following figure. The important point is to find the elbow point in the graph, where the curve
bends and SSE stops falling sharply. In the figure, that is at k=4. Hence k=4 is the optimum value
for grouping the data points.
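
The elbow is normally read off the plot by eye. If a numeric hint is wanted, one simple heuristic (an assumption here, not a method from this handout) is to pick the k where the curve bends most, measured by the second difference of the SSE values. The SSE numbers below are invented purely to illustrate the calculation.

```python
def elbow_by_second_difference(k_values, sse_values):
    """Return the k at which the SSE curve bends most sharply.

    The bend at position i is sse[i-1] - 2*sse[i] + sse[i+1]; the first and
    last k values have no neighbour on one side, so they are never chosen.
    """
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(sse_values) - 1):
        bend = sse_values[i - 1] - 2 * sse_values[i] + sse_values[i + 1]
        if bend > best_bend:
            best_k, best_bend = k_values[i], bend
    return best_k

# Hypothetical SSE values for k = 1..7 (not measured from any dataset).
k_values = list(range(1, 8))
sse_values = [300.0, 200.0, 110.0, 30.0, 25.0, 22.0, 20.0]

elbow_k = elbow_by_second_difference(k_values, sse_values)
print(elbow_k)
```

This heuristic can be fooled by noisy curves, so it is best treated as a cross-check on the visual reading rather than a replacement for it.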

Example of clustering: dividing the items (like vegetables, oils, soaps, etc.) into clusters in
supermarkets.

Problem: Cluster the data related to Age and Income and find some important characteristics.

Dataset: income.csv

# clustering with K - means


from sklearn.cluster import KMeans
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt

# view the dataset


df = pd.read_csv("F:/k-means/income.csv")
df

# create scatter plot to see the groups or clusters


plt.scatter(df['Age'],df['Income($)'])
plt.xlabel('Age')
plt.ylabel('Income($)')

# since 3 clusters are seen, let us use K means clustering


km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age','Income($)']]) # this gives cluster number
y_predicted # 1,0,2 --> there are 3 clusters

# add this cluster as another column


df['cluster']=y_predicted
df.head()

# find the center coordinates of clusters


km.cluster_centers_

# separate the 3 clusters into 3 dataframes


df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
df3 = df[df.cluster==2]

# scatter plot the clusters with cluster centers


plt.scatter(df1.Age,df1['Income($)'],color='green')
plt.scatter(df2.Age,df2['Income($)'],color='red')
plt.scatter(df3.Age,df3['Income($)'],color='black')
# [:,0] --> all rows and column 0, [:,1] --> all rows and column 1
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],color='purple',marker='*',label='centroid')
plt.xlabel('Age')
plt.ylabel('Income ($)')
plt.legend()

# the clusters are not grouped correctly. The reason is that Age and Income($) are on very
# different scales, so Income($) dominates the distance calculation.
# use proper scaling. we use MinMaxScaler()
scaler = MinMaxScaler()
# fit the scale to income
scaler.fit(df[['Income($)']])
df['Income($)'] = scaler.transform(df[['Income($)']])
# fit the scale to Age
scaler.fit(df[['Age']])
df['Age'] = scaler.transform(df[['Age']])

df.head()
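
With its default settings, MinMaxScaler maps each column to the range 0 to 1 using (x - min) / (max - min). The sketch below does the same arithmetic by hand to show what the scaler is doing. The ages and incomes are made-up values, the function name is hypothetical, and returning 0.0 for a constant column is a choice made for this sketch.

```python
def min_max_scale(values):
    """Rescale values to the range [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: avoid dividing by zero (a choice made for this sketch).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical values (not taken from income.csv).
ages = [25, 30, 40, 45]
incomes = [50000, 60000, 150000, 160000]

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)
print(scaled_ages)
print(scaled_incomes)
```

After scaling, both columns span the same range, so a difference in income no longer outweighs a difference in age when distances are computed.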

# now once again fit the K-means clustering


km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age','Income($)']])
y_predicted

# store the y_predicted values into cluster column


df['cluster']=y_predicted
df.head()

# find cluster centers


km.cluster_centers_

# draw the plot once again


df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
df3 = df[df.cluster==2]

plt.scatter(df1.Age,df1['Income($)'],color='green')
plt.scatter(df2.Age,df2['Income($)'],color='red')
plt.scatter(df3.Age,df3['Income($)'],color='black')
# [:,0] --> all rows and column 0, [:,1] --> all rows and column 1
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],color='purple',marker='*',label='centroid')
plt.xlabel('Age')
plt.ylabel('Income ($)')
plt.legend()

# elbow plot
sse = []
k_rng = range(1,10)
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(df[['Age','Income($)']])
    sse.append(km.inertia_)

plt.xlabel('K')
plt.ylabel('Sum of squared error')
plt.plot(k_rng,sse)

# as seen in the above plot, the elbow is at k=3. So, we have to take them as 3 groups

Task on K-means clustering

a) Use Iris flower dataset from sklearn library and try to form clusters of flowers using petal width
and length features. Drop the other two features for simplicity.

b) Figure out whether any preprocessing, such as scaling, would help here.

c) Draw the elbow plot and from it figure out the optimal value of k.
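
As a possible starting point for part (a), the sketch below loads the Iris data as a DataFrame and keeps only the two petal columns. The variable names are arbitrary, and it assumes a scikit-learn version that supports as_frame=True in load_iris. The clustering, scaling and elbow plot are left as the exercise.

```python
from sklearn.datasets import load_iris

# Load the Iris data as a pandas DataFrame (needs as_frame support in scikit-learn).
iris = load_iris(as_frame=True)

# Keep only the petal features; this drops the two sepal columns.
df_iris = iris.data[["petal length (cm)", "petal width (cm)"]]

print(df_iris.head())
```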
