
Cluster Analysis

Lakshmipathi.T H

What is Cluster Analysis?

Cluster analysis identifies and classifies objects (individuals or variables) on the basis of the similarity of the characteristics they possess. It seeks to minimize within-group variance and maximize between-group variance. The result is a number of heterogeneous groups with homogeneous contents: there are substantial differences between the groups, but the individuals within a single group are similar.
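As a small illustration of the two quantities, the following sketch splits a total sum of squares into a within-group and a between-group part. The one-variable data, the group names and the helper functions are made up for illustration and are not taken from the case study.

```python
def mean(values):
    return sum(values) / len(values)

def sum_of_squares(values, center):
    """Sum of squared deviations of the values from a given center."""
    return sum((v - center) ** 2 for v in values)

# Hypothetical one-variable data, already split into two groups.
groups = {"A": [1.0, 2.0, 3.0], "B": [9.0, 10.0, 11.0]}

all_values = [v for g in groups.values() for v in g]
grand_mean = mean(all_values)

total_ss = sum_of_squares(all_values, grand_mean)
# Within-group part: spread of each group around its own mean.
within_ss = sum(sum_of_squares(g, mean(g)) for g in groups.values())
# Between-group part: spread of the group means around the grand mean.
between_ss = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())

# Share of the total variation explained by the grouping.
r_squared = between_ss / total_ss
print(within_ss, between_ss, r_squared)
```

A good grouping keeps `within_ss` small relative to `between_ss`, so `r_squared` is close to 1.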

Clustering procedures

Hierarchical Clustering Procedure
Nonhierarchical Clustering Procedure

Hierarchical Clustering Procedure

Hierarchical clustering follows one of two approaches:
Agglomerative methods start with each observation as its own cluster and, at each step, combine the most similar clusters until only one large cluster remains.
Divisive methods begin with one large cluster and successively split off the items that are most dissimilar, producing smaller and smaller clusters.

Hierarchical Clustering

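This method does not require the number of clusters k as an input, but it needs a termination condition.

[Figure: five objects a, b, c, d, e. Agglomerative clustering runs from Step 0 to Step 4, forming the clusters ab, de, cde and abcde. Divisive clustering runs in the opposite direction, with its Step 0 at abcde and its Step 4 at the single objects.]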
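The agglomerative direction can be sketched as a loop that keeps merging the two closest clusters until a termination condition is met. This is a minimal illustration, assuming Euclidean distance and single linkage; the function names, the two stopping parameters (`stop_at`, `max_distance`) and the sample points are assumptions, not part of the original slides.

```python
import math

def single_link(c1, c2):
    """Shortest distance between any point of c1 and any point of c2."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerate(points, stop_at=1, max_distance=None):
    """Merge the two closest clusters until a termination condition holds.

    stop_at      -- stop when this many clusters remain
    max_distance -- optionally stop when the closest pair is farther apart
    """
    clusters = [[p] for p in points]      # Step 0: every object on its own
    history = []                          # (merge distance, merged cluster)
    while len(clusters) > stop_at:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if max_distance is not None and d > max_distance:
            break                         # termination condition reached
        merged = clusters[i] + clusters[j]
        history.append((d, merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, history

# Hypothetical two-dimensional objects.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (10.0, 0.0)]
two_clusters, history = agglomerate(points, stop_at=2)
close_only, _ = agglomerate(points, max_distance=2.0)
```

With `stop_at=1` and no distance limit the loop ends in one large cluster, which corresponds to the last agglomerative step in the figure.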

Ways of defining inter-cluster distance

Single linkage (based on the shortest distance between objects)
Complete linkage (based on the largest distance between objects)
Average linkage (based on the average distance between objects)
Ward's method (based on the sum of squares between the two clusters, summed over all variables)
Centroid method (based on the distance between cluster centroids)
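The five definitions can be written down compactly for two small clusters. The sketch assumes Euclidean distance; for Ward's method it uses the usual formulation as the increase in the within-cluster sum of squares caused by merging the two clusters. The function names and sample clusters are illustrative assumptions.

```python
import math
from itertools import product

def centroid(cluster):
    """Mean of the cluster on every variable."""
    n = len(cluster)
    return tuple(sum(p[k] for p in cluster) / n for k in range(len(cluster[0])))

def single(c1, c2):
    return min(math.dist(p, q) for p, q in product(c1, c2))

def complete(c1, c2):
    return max(math.dist(p, q) for p, q in product(c1, c2))

def average(c1, c2):
    return sum(math.dist(p, q) for p, q in product(c1, c2)) / (len(c1) * len(c2))

def centroid_distance(c1, c2):
    return math.dist(centroid(c1), centroid(c2))

def ward(c1, c2):
    """Increase in within-cluster sum of squares if c1 and c2 are merged."""
    n1, n2 = len(c1), len(c2)
    return n1 * n2 / (n1 + n2) * math.dist(centroid(c1), centroid(c2)) ** 2

# Two hypothetical clusters of two objects each.
c1 = [(0.0, 0.0), (0.0, 2.0)]
c2 = [(3.0, 0.0), (3.0, 2.0)]

for name, fn in [("single", single), ("complete", complete),
                 ("average", average), ("centroid", centroid_distance),
                 ("ward", ward)]:
    print(name, round(fn(c1, c2), 3))
```

Note that Ward's value is a sum of squares rather than a distance, so it is not on the same scale as the other four.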

Nonhierarchical Clustering Procedure

K-Means Method
This algorithm assigns each item to the cluster having the nearest centroid (mean). The process is composed of these three steps:
1. Partition the items into K initial clusters.
2. Proceed through the list of items, assigning each item to the cluster whose centroid (mean) is nearest. Recalculate the centroid for the cluster receiving the new item and for the cluster losing the item.
3. Repeat step 2 until no more reassignments take place.
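The three steps can be sketched as follows. This is one possible reading of the procedure, not a reference implementation: the round-robin initial partition, the rule that a cluster is never emptied, the Euclidean distance and the sample items are all assumptions made for illustration.

```python
import math

def centroid(cluster):
    """Mean of the cluster on every variable."""
    n = len(cluster)
    return tuple(sum(p[k] for p in cluster) / n for k in range(len(cluster[0])))

def k_means(items, k):
    # Step 1: partition the items into K initial clusters (round-robin here).
    labels = [i % k for i in range(len(items))]

    def members(c):
        return [items[i] for i in range(len(items)) if labels[i] == c]

    centroids = [centroid(members(c)) for c in range(k)]
    changed = True
    while changed:                          # Step 3: repeat until stable
        changed = False
        for i, item in enumerate(items):    # Step 2: go through the list
            old = labels[i]
            new = min(range(k), key=lambda c: math.dist(item, centroids[c]))
            strictly_closer = (math.dist(item, centroids[new])
                               < math.dist(item, centroids[old]))
            # Assumption: the last member of a cluster is never moved away.
            if strictly_closer and labels.count(old) > 1:
                labels[i] = new
                # Recalculate the centroids of the receiving and losing clusters.
                centroids[new] = centroid(members(new))
                centroids[old] = centroid(members(old))
                changed = True
    return labels, centroids

# Hypothetical items forming two visible groups.
items = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = k_means(items, k=2)
print(labels)
```

The result depends on the initial partition, so in practice the procedure is often run from several starting partitions.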

The K-Means Clustering Method

Example (K = 2)

[Figure: a sequence of scatter plots, both axes running from 0 to 10, showing the following steps.]
Arbitrarily choose K objects as the initial cluster centers.
Assign each object to the most similar center.
Update the cluster means.
Reassign the objects.
Update the cluster means and reassign again until the assignments no longer change.

Case Study: Classification of Occupations

Description of Data
The data give the values of four variables for each of 36 occupations in the USA. The question to be addressed is whether the occupations can be classified in some potentially informative way.

Data
occupno  prestige rating  suicide rate  income  education
1        82               23.8          3977    14.4
2        90               37.5          5509    16+
3        76               37            4303    13.6
4        90               20.7          4091    16+
5        87               10.6          2410    16+
6        93               14.2          4366    16+
7        90               45.6          6448    16+
8        88               31.9          4590    16+
9        89               24.3          6284    16+
10       97               31.9          8302    16+
11       59               16            3176    15.8
12       73               16.8          3456    16+
13       81               64.8          4700    12.2
14       45               47.3          3806    11.6
15       39               21.9          2828    12.7
16       34               16.9          3480    12.2
17       41               32.4          3771    12.7
18       16               24.1          2543    12.1
19       33               32.7          2450    8.7
20       53               30.8          3447    11.1
21       67               34.2          4648    8.8
22       57               34.5          3303    9.6
23       26               24.4          2693    9.4
24       29               29.4          3353    9.3
25       10               14.4          1898    10.3
26       15               41.7          2410    8.2
27       19               19.2          3424    9.2
28       10               24.9          2213    8.9
29       13               17.9          2590    9.6
30       24               15.7          2915    9.6
31       20               36            2357    8.8
32       7                24.4          1942    9.8
33       16               42.2          2249    8.7
34       11               38.2          2551    8.5
35       8                20.3          1866    8.2
36       41               47.6          2866    10.6

Hierarchical clustering Output

The CLUSTER Procedure
Average Linkage Cluster Analysis

Variable          Mean      Std Dev   Skewness  Kurtosis  Bimodality
prestige_rating   48.0278   31.3127   0.2162    -1.5505   0.6068
suicide_rate      29.0611   11.8376   0.8011    0.8164    0.4012
income            3533.8    1413.4    1.4960    2.7124    0.5408
education         11.9056   2.9777    0.3413    -1.5276   0.6388
occupno           18.5000   10.5357   0         -1.2000   0.4818

Hierarchical clustering Output (continued)

NCL  Clusters Joined  FREQ  SPRSQ   RSQ    ERSQ   CCC   PSF   PST2
35   OB33   OB34      2     0.0006  0.999  .      .     51.7  .
34   OB29   OB30      2     0.0006  0.999  .      .     50.6  .
33   OB11   OB12      2     0.0007  0.998  .      .     48.5  .
32   OB31   CL35      3     0.0009  0.997  .      .     45.8  1.5
31   OB28   OB32      2     0.0008  0.996  .      .     45.9  .
30   OB4    OB6       2     0.0011  0.995  .      .     43.6  .
29   OB27   CL34      3     0.0013  0.994  .      .     41.3  2.1
28   OB20   OB22      2     0.0012  0.993  .      .     40.8  .
27   OB23   OB24      2     0.0012  0.992  .      .     40.9  .
26   OB15   OB16      2     0.0013  0.99   .      .     40.9  .
25   OB26   CL32      4     0.0021  0.988  .      .     38.5  2.9
24   CL31   OB35      3     0.002   0.986  .      .     37.3  2.5
23   OB19   CL27      3     0.0025  0.984  .      .     35.6  2.1
22   OB1    CL30      3     0.0032  0.98   .      .     33.4  2.9
21   CL26   OB18      3     0.0032  0.977  .      .     32.2  2.5
20   OB2    OB8       2     0.0028  0.974  .      .     32.1  .
19   OB25   CL29      4     0.004   0.97   .      .     31    4.1
18   OB17   CL28      3     0.004   0.966  .      .     30.5  3.4
17   CL19   CL24      7     0.0078  0.959  .      .     27.5  4.5
16   CL20   OB3       3     0.0045  0.954  .      .     27.8  1.6
15   OB5    CL33      3     0.0056  0.949  .      .     27.6  7.7
14   CL18   CL23      6     0.0109  0.938  .      .     25.4  4.9
13   CL22   CL15      6     0.0118  0.926  .      .     23.9  4.4
12   CL25   OB36      5     0.0089  0.917  .      .     24.1  7.5
11   CL21   CL14      9     0.0175  0.899  .      .     22.3  5
10   OB9    OB10      2     0.0072  0.892  .      .     23.9  .
9    CL16   OB7       4     0.0097  0.882  .      .     25.3  2.7
8    OB14   OB21      2     0.0097  0.873  .      .     27.4  .
7    CL11   CL17      16    0.0544  0.818  0.854  -1.9  21.8  13
6    CL9    CL10      6     0.0239  0.794  0.83   -1.4  23.2  4
5    CL7    CL12      21    0.0646  0.73   0.797  -2.3  20.9  9.8
4    CL8    CL5       23    0.0406  0.689  0.75   -1.8  23.7  4.3
3    CL13   CL6       12    0.0858  0.604  0.675  -1.9  25.1  12.1
2    CL3    OB13      13    0.0662  0.537  0.524  0.26  39.5  4.7
1    CL2    CL4       36    0.5373  0      0      0     .     39.5
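One way to read the SPRSQ and RSQ columns of this output: each merge lowers R-squared by that merge's semipartial R-squared, so the RSQ column can be rebuilt by accumulating SPRSQ upward from the final merge. The sketch below checks this against the last seven rows of the output; the dictionary layout is an illustrative choice, and small differences are expected because the printed values are rounded.

```python
# SPRSQ and RSQ for NCL = 7 ... 1, copied from the cluster history output.
sprsq = {7: 0.0544, 6: 0.0239, 5: 0.0646, 4: 0.0406,
         3: 0.0858, 2: 0.0662, 1: 0.5373}
reported_rsq = {7: 0.818, 6: 0.794, 5: 0.730, 4: 0.689,
                3: 0.604, 2: 0.537, 1: 0.0}

# With a single cluster nothing is explained, so R-squared starts at 0.
rsq = {1: 0.0}
for ncl in range(2, 8):
    # Undoing the merge that produced (ncl - 1) clusters restores its SPRSQ.
    rsq[ncl] = rsq[ncl - 1] + sprsq[ncl - 1]

for ncl in sorted(rsq, reverse=True):
    print(ncl, round(rsq[ncl], 3), reported_rsq[ncl])
```

The large SPRSQ of the final merge (0.5373) is what suggests that the two-cluster solution separates the occupations well.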

Dendrogram

Interpreting or labeling the clusters

The two groups of occupations might be labelled blue collar and professional. Cluster 1 consists of bookkeepers, cooks, etc., with generally lower prestige, income and education values but higher suicide rates. Cluster 2 contains occupations such as accountants, architects and dentists, with high prestige, income and education values and relatively low suicide rates.

Nonhierarchical Clustering

/* unstandardized variables */
proc fastclus data=train.clus maxc=2 out=train.abc;
   var prestigerating suiciderate income education;
run;

/* standardized variables */
proc fastclus data=train.Zscore1 maxc=2 out=train.abc1;
   var zprestigerating zsuiciderate zincome zeducation;
run;
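The reason for running the procedure on standardized variables as well can be shown with a few rows of the data. The slides do not show how train.Zscore1 was built; the sketch below assumes ordinary z-scores (mean 0, sample standard deviation 1) and uses only income and education for occupations 1, 3, 13, 14 and 19. The function name and the choice of rows are illustrative.

```python
import math
import statistics

def z_scores(values):
    """Center on the mean and scale by the sample standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]

# Income and education for occupations 1, 3, 13, 14 and 19 of the data table.
income = [3977.0, 4303.0, 4700.0, 3806.0, 2450.0]
education = [14.4, 13.6, 12.2, 11.6, 8.7]

raw = list(zip(income, education))
standardized = list(zip(z_scores(income), z_scores(education)))

# Distance between occupations 1 and 14 before and after standardizing.
raw_distance = math.dist(raw[0], raw[3])
std_distance = math.dist(standardized[0], standardized[3])
print(round(raw_distance, 2), round(std_distance, 2))
```

On the raw scale the distance is almost exactly the income difference (171), so education has practically no influence; after standardizing, both variables contribute on a comparable scale.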

Profiling the Cluster Solution

A "profile" of a cluster is simply the set of mean values for that cluster. The means can be computed for the internal variables (those used to form the clusters) as well as for external variables, which may be demographic, psychographic, or consumption-pattern variables. Once the clusters are formed, techniques such as profiling or discriminant analysis can be used to see which internal variables account for the clustering.
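A profile can be computed with a few lines of code once every observation carries a cluster label. In the sketch below the rows are occupations 1, 3, 25 and 28 from the data table (prestige rating and income only), while the labels "A" and "B" and the function name are hypothetical.

```python
from collections import defaultdict

def profile(rows, labels):
    """Return {cluster: {variable: mean}} for rows given as dictionaries."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row, label in zip(rows, labels):
        counts[label] += 1
        for name, value in row.items():
            sums[label][name] += value
    return {c: {name: total / counts[c] for name, total in sums[c].items()}
            for c in sums}

rows = [
    {"prestige": 82.0, "income": 3977.0},   # occupation 1
    {"prestige": 76.0, "income": 4303.0},   # occupation 3
    {"prestige": 10.0, "income": 1898.0},   # occupation 25
    {"prestige": 10.0, "income": 2213.0},   # occupation 28
]
labels = ["A", "A", "B", "B"]               # hypothetical cluster labels

profiles = profile(rows, labels)
for cluster, means in profiles.items():
    print(cluster, means)
```

The same function works for external variables: add them to each row and they are averaged per cluster in the same way.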

Thank You
