K Means Clustering

Nageswarao Datatechs
Machine learning algorithms are commonly divided into 2 broad classes: 1. supervised learning 2. unsupervised learning.
K-Means clustering comes under unsupervised learning: it works on data that has no labels.
In the following dataset there is no target variable, so all we can do is group the data points.
By visual examination, we can divide these points into 2 groups (or clusters), as shown on the
right.
Here, we can take k = 2 (as there are 2 clusters). So, we start with 2 random points, which act as
the initial centroids. Each data point is then assigned to the centroid nearest to it, which splits
the data into 2 groups. Next, each centroid is moved (adjusted) to the center (mean) of the points
assigned to it, and the assignment is repeated.
For example, the red point ends up in the center of the red cluster and the green point in the
center of the green cluster. The process stops when the centroids no longer move, at which point
the data points are clearly separated into these 2 groups.
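The procedure described above can be sketched in a few lines of NumPy. This is only an illustration under simple assumptions: the function name `k_means_sketch`, its parameters and the toy points are made up for this example, and it is not the code scikit-learn runs internally.

```python
import numpy as np

def k_means_sketch(points, k, n_iter=100, seed=0):
    """Illustrative k-means loop (hypothetical helper, not scikit-learn's code)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its points
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated toy groups (made-up data for illustration).
toy = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])
labels, centroids = k_means_sketch(toy, k=2)
print(labels)     # one cluster number per point
print(centroids)  # one row per cluster center
```

Which group receives label 0 and which receives label 1 depends on the random starting points, so only the grouping itself is meaningful.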
Looking at these data points, we saw 2 clusters. But other people may see 4 or 6 clusters, as
shown in the figure below:
But how do we determine the correct number of clusters (k)? For this purpose we use the ‘elbow’ method.
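In the elbow method, we start by finding the distance between each data point and the centroid of
its cluster. For each cluster, we square these distances and add them up; this gives the sum of
squared errors for that cluster (SSE1, SSE2, …). Squaring makes every term positive, so deviations
in opposite directions cannot cancel each other.
Finally, find the total sum of squared errors: SSE = SSE1 + SSE2 + SSE3 + …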
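Written as a formula, the total can be expressed as follows. The symbols are assumptions introduced here for illustration: C_j is the set of points in cluster j, \mu_j is its centroid, and x_i is a data point.

```latex
\mathrm{SSE} \;=\; \sum_{j=1}^{k} \mathrm{SSE}_j
\;=\; \sum_{j=1}^{k} \; \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2}
```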
The above calculations are done for k=2. When we repeat the same process for k=3, k=4, etc., we get
one total SSE value for each value of k.
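As a small check of this calculation, the sketch below computes the total SSE by hand for k=2 and compares it with the `inertia_` attribute that scikit-learn's `KMeans` reports. The toy points are made up for illustration, and the `n_init` and `random_state` values are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up toy data: two tight groups of three points each.
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.5]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Look up, for every point, the centroid of the cluster it was assigned to.
own_centroids = km.cluster_centers_[km.labels_]
# Sum of squared distances from each point to its own centroid.
sse_manual = np.sum((X - own_centroids) ** 2)

print(sse_manual)
print(km.inertia_)  # should agree with the manual value up to rounding
```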
Let us plot the SSE values against the k values. As k increases, SSE generally goes down, and it
reaches 0 when every data point is its own cluster. This is shown in the following figure. The
important point is to find the elbow in the graph, that is, the value of k after which SSE
decreases only slowly. In the figure, that is 4. Hence k=4 is the optimum value for grouping the
data points.
Example of clustering: dividing the items in a supermarket (vegetables, oils, soaps, etc.) into
clusters.
Problem: Cluster the data related to Age and Income and find some important characteristics.
Dataset: income.csv
# clustering with K - means

from sklearn.cluster import KMeans
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot as plt
# view the dataset

df = pd.read_csv("F:/k-means/income.csv")
df
# create scatter plot to see the groups or clusters

plt.scatter(df['Age'],df['Income($)'])
plt.xlabel('Age')
plt.ylabel('Income($)')
# since 3 clusters are seen, let us use K means clustering

km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age','Income($)']]) # this gives the cluster number of each row
y_predicted # labels are 0, 1 and 2 --> there are 3 clusters
# add this cluster as another column

df['cluster']=y_predicted
df.head()
# find the center coordinates of clusters

km.cluster_centers_
# separate the 3 clusters into 3 dataframes

df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
df3 = df[df.cluster==2]
# scatter plot the clusters with cluster centers

plt.scatter(df1.Age,df1['Income($)'],color='green')
plt.scatter(df2.Age,df2['Income($)'],color='red')
plt.scatter(df3.Age,df3['Income($)'],color='black')
# [:,0] --> all rows and column 0, [:,1] --> all rows and column 1
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],color='purple',marker='*',label='centroid')
plt.xlabel('Age')
plt.ylabel('Income ($)')
plt.legend()
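# the clusters are not grouped correctly. The reason is that Income($) has a much larger
# numeric range than Age, so it dominates the distance calculation.
# rescale both columns to the range 0 to 1. we use MinMaxScaler()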
scaler = MinMaxScaler()
# fit the scale to income
scaler.fit(df[['Income($)']])
df['Income($)'] = scaler.transform(df[['Income($)']])
# fit the scale to Age
scaler.fit(df[['Age']])
df['Age'] = scaler.transform(df[['Age']])
df.head()
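To see why this rescaling matters, the sketch below compares distances before and after scaling. The three rows are made-up values in the form [Age, Income($)], not rows from income.csv, and `dist` is a small helper defined only for this example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up rows in the form [Age, Income($)]; not taken from income.csv.
rows = np.array([[25.0, 60000.0],
                 [60.0, 61000.0],
                 [26.0, 90000.0]])

def dist(a, b):
    """Euclidean distance between two rows."""
    return float(np.linalg.norm(a - b))

# Raw units: the distance is almost entirely the income difference,
# so the 35-year age gap between rows 0 and 1 hardly counts.
raw_01 = dist(rows[0], rows[1])
raw_02 = dist(rows[0], rows[2])

# After scaling each column to the range 0 to 1, both features contribute.
scaled = MinMaxScaler().fit_transform(rows)
scaled_01 = dist(scaled[0], scaled[1])
scaled_02 = dist(scaled[0], scaled[2])

print(raw_01, raw_02)
print(scaled_01, scaled_02)
```

In raw units row 1 looks far closer to row 0 than row 2 does, while after scaling the two distances are of similar size.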
# now once again fit the K-means clustering

km = KMeans(n_clusters=3)
y_predicted = km.fit_predict(df[['Age','Income($)']])
y_predicted
# store the y_predicted values into cluster column

df['cluster']=y_predicted
df.head()
# find cluster centers

km.cluster_centers_
# draw the plot once again, after splitting the rescaled data by the new cluster labels

df1 = df[df.cluster==0]
df2 = df[df.cluster==1]
df3 = df[df.cluster==2]
plt.scatter(df1.Age,df1['Income($)'],color='green')
plt.scatter(df2.Age,df2['Income($)'],color='red')
plt.scatter(df3.Age,df3['Income($)'],color='black')
# [:,0] --> all rows and column 0, [:,1] --> all rows and column 1
plt.scatter(km.cluster_centers_[:,0],km.cluster_centers_[:,1],color='purple',marker='*',label='centroid')
plt.xlabel('Age')
plt.ylabel('Income ($)')
plt.legend()
# elbow plot
sse = []
k_rng = range(1,10)
for k in k_rng:
    km = KMeans(n_clusters=k)
    km.fit(df[['Age','Income($)']])
    sse.append(km.inertia_) # inertia_ is the sum of squared errors for this k
plt.xlabel('K')
plt.ylabel('Sum of squared error')
plt.plot(k_rng,sse)
# as seen in the above plot, the elbow is at k=3. So, we have to take the data as 3 groups
Task on K-means clustering
a) Use the Iris flower dataset from the sklearn library and try to form clusters of flowers using
the petal width and petal length features. Drop the other two features for simplicity.
b) Figure out whether any preprocessing, such as scaling, would help here.
c) Draw an elbow plot and from it figure out the optimal value of k.
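A possible starting point for part (a) is sketched below. It only loads the data and keeps the two petal columns; the variable names `iris` and `petals` are choices made for this example, and the clustering, scaling and elbow steps are left to the task.

```python
from sklearn.datasets import load_iris

# Load the Iris data as a pandas DataFrame.
iris = load_iris(as_frame=True)

# Keep only the two petal features and drop the sepal features.
petals = iris.data[['petal length (cm)', 'petal width (cm)']]

print(petals.shape)
print(petals.describe().loc[['min', 'max']])  # value ranges, useful for part (b)
```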
