Cluster Analysis

Lakshmipathi.T H
What is Cluster Analysis?
Cluster analysis identifies and classifies objects, individuals, or variables on the basis of the similarity of their characteristics. It seeks to minimize within-group variance and maximize between-group variance. The result is a number of heterogeneous groups with homogeneous contents: the groups differ substantially from one another, but the individuals within a single group are similar.
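The within-group / between-group idea can be made concrete with a small calculation. The following is a minimal Python sketch on hypothetical one-dimensional data (the two groups and their values are made up for illustration): the total sum of squares splits into a within-group part and a between-group part, and a good clustering makes the within-group part small.

```python
def mean(values):
    return sum(values) / len(values)

def sum_of_squares(values, center):
    """Sum of squared deviations of the values from a given center."""
    return sum((v - center) ** 2 for v in values)

# Hypothetical data, already split into two groups (an assumption for illustration)
groups = {
    "A": [1.0, 2.0, 3.0],
    "B": [8.0, 9.0, 10.0],
}

all_values = [v for g in groups.values() for v in g]
grand_mean = mean(all_values)

total_ss = sum_of_squares(all_values, grand_mean)
# Within-group: deviations of each value from its own group mean
within_ss = sum(sum_of_squares(g, mean(g)) for g in groups.values())
# Between-group: deviations of the group means from the grand mean, weighted by size
between_ss = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())

# Share of the total variation explained by the grouping
r_squared = between_ss / total_ss
print(total_ss, within_ss, between_ss, round(r_squared, 3))
```

Here the grouping explains almost all of the variation, because the two groups are tight and far apart.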
Clustering procedures
1. Hierarchical Clustering Procedure
2. Nonhierarchical Clustering Procedure
Hierarchical Clustering Procedure
Hierarchical clustering follows one of two approaches:
Agglomerative methods start with each observation as its own cluster and, at each step, combine the most similar observations or clusters until there is only one large cluster.
Divisive methods begin with one large cluster and repeatedly split off the items that are most dissimilar, producing smaller and smaller clusters.
Hierarchical Clustering
This method does not require the number of clusters k as an input, but it needs a termination condition.
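[Figure: five objects a, b, c, d, e. Agglomerative clustering runs from Step 0 to Step 4, merging the objects into the clusters ab, de, cde and finally abcde. Divisive clustering runs the same sequence in reverse, from Step 0 (abcde) to Step 4 (a, b, c, d, e).]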
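The agglomerative direction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coordinates given to a, b, c, d, e are hypothetical, single linkage is used as the distance between clusters, and the termination condition is a target number of clusters (`stop_at`, a name introduced here).

```python
def single_linkage(c1, c2, points):
    """Shortest distance between any member of c1 and any member of c2."""
    return min(abs(points[i] - points[j]) for i in c1 for j in c2)

def agglomerate(points, stop_at=1):
    """Merge the two closest clusters until only `stop_at` clusters remain."""
    clusters = [(name,) for name in points]  # Step 0: every object is its own cluster
    history = []
    while len(clusters) > stop_at:
        # Find the pair of clusters with the smallest inter-cluster distance
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = single_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
        history.append("".join(sorted(merged)))
    return clusters, history

# Hypothetical positions on a line, chosen for illustration only
points = {"a": 1.0, "b": 1.2, "c": 5.0, "d": 7.0, "e": 7.5}

clusters, history = agglomerate(points, stop_at=1)
print(history)
```

With these coordinates the merges come out as ab, de, cde, abcde. Calling `agglomerate(points, stop_at=2)` stops one step earlier and returns the two clusters ab and cde.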
Ways of defining inter-cluster distance
Single linkage (based on the shortest distance between objects)
Complete linkage (based on the largest distance between objects)
Average linkage (based on the average distance between objects)
Ward's method (based on the sum of squares between the two clusters, summed over all variables)
Centroid method (based on the distance between cluster centroids)
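The five definitions can be compared on one pair of clusters. Below is a minimal Python sketch, assuming Euclidean distance and two small hypothetical clusters; the function names are illustrative. Ward's criterion is written here as the increase in the within-cluster sum of squares caused by merging the two clusters, which is one common way to state it.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def centroid(cluster):
    """Mean of the points in a cluster, coordinate by coordinate."""
    n = len(cluster)
    return tuple(sum(p[k] for p in cluster) / n for k in range(len(cluster[0])))

def single(a, b):
    return min(dist(p, q) for p in a for q in b)

def complete(a, b):
    return max(dist(p, q) for p in a for q in b)

def average(a, b):
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))

def centroid_distance(a, b):
    return dist(centroid(a), centroid(b))

def ward_increase(a, b):
    """Increase in within-cluster sum of squares if a and b are merged."""
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * dist(centroid(a), centroid(b)) ** 2

# Two hypothetical clusters in the plane
A = [(0.0, 0.0), (0.0, 2.0)]
B = [(3.0, 0.0), (3.0, 2.0)]

for name, fn in [("single", single), ("complete", complete), ("average", average),
                 ("centroid", centroid_distance), ("ward", ward_increase)]:
    print(name, round(fn(A, B), 4))
```

Single linkage gives the smallest pairwise distance and complete linkage the largest, with average linkage in between.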
Nonhierarchical Clustering Procedure
K-Means Method
This algorithm assigns each item to the cluster having the nearest centroid (mean). The process is composed of these three steps:
1. Partition the items into K initial clusters.
2. Proceed through the list of items, assigning each item to the cluster whose centroid (mean) is nearest. Recalculate the centroid for the cluster receiving the item and for the cluster losing it.
3. Repeat step 2 until no more reassignments take place.
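The three steps can be sketched as follows. This is a minimal Python illustration, not a production implementation: the data and the initial partition are hypothetical, Euclidean distance is assumed, and one extra rule is added as an assumption (an item is never moved out of a cluster if it is that cluster's last member, so that no cluster becomes empty).

```python
from math import dist  # Euclidean distance (Python 3.8+)

def centroid(items):
    n = len(items)
    return tuple(sum(p[k] for p in items) / n for k in range(len(items[0])))

def k_means(items, labels, max_passes=100):
    """items: list of points; labels: initial cluster index for each item (step 1)."""
    labels = list(labels)
    k = max(labels) + 1
    for _ in range(max_passes):
        moved = False
        for i, item in enumerate(items):  # step 2: go through the list of items
            # Recompute centroids so they reflect every earlier reassignment
            members = [[p for p, l in zip(items, labels) if l == c] for c in range(k)]
            centroids = [centroid(m) for m in members]
            nearest = min(range(k), key=lambda c: dist(item, centroids[c]))
            # Assumption: do not empty a cluster by moving its last member
            if nearest != labels[i] and len(members[labels[i]]) > 1:
                labels[i] = nearest
                moved = True
        if not moved:  # step 3: stop when a full pass makes no reassignment
            break
    final = [centroid([p for p, l in zip(items, labels) if l == c]) for c in range(k)]
    return labels, final

# Hypothetical data: two visibly separate groups, with a deliberately poor start
items = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
initial = [0, 1, 0, 1, 0, 1]

labels, centroids = k_means(items, initial)
print(labels, centroids)
```

Starting from the alternating partition, the first pass moves two items and the second pass makes no change, so the procedure stops with the first three items in one cluster and the last three in the other.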
The K-Means Clustering Method
Example (K=2)
[Figure: sequence of scatter plots on axes running from 0 to 10. Panels: arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign.]
Case Study: Classification of occupations
Description of Data
The data give the values of four variables for each of 36 occupations in the USA. The question to be addressed is: can the occupations be classified in some potentially informative way?
Data
Occupation no.   Prestige rating   Suicide rate   Income   Education
 1               82                23.8           3977     14.4
 2               90                37.5           5509     16+
 3               76                37             4303     13.6
 4               90                20.7           4091     16+
 5               87                10.6           2410     16+
 6               93                14.2           4366     16+
 7               90                45.6           6448     16+
 8               88                31.9           4590     16+
 9               89                24.3           6284     16+
10               97                31.9           8302     16+
11               59                16             3176     15.8
12               73                16.8           3456     16+
13               81                64.8           4700     12.2
14               45                47.3           3806     11.6
15               39                21.9           2828     12.7
16               34                16.9           3480     12.2
17               41                32.4           3771     12.7
18               16                24.1           2543     12.1
19               33                32.7           2450     8.7
20               53                30.8           3447     11.1
21               67                34.2           4648     8.8
22               57                34.5           3303     9.6
23               26                24.4           2693     9.4
24               29                29.4           3353     9.3
25               10                14.4           1898     10.3
26               15                41.7           2410     8.2
27               19                19.2           3424     9.2
28               10                24.9           2213     8.9
29               13                17.9           2590     9.6
30               24                15.7           2915     9.6
31               20                36             2357     8.8
32               7                 24.4           1942     9.8
33               16                42.2           2249     8.7
34               11                38.2           2551     8.5
35               8                 20.3           1866     8.2
36               41                47.6           2866     10.6
Hierarchical clustering Output
The CLUSTER Procedure
Average Linkage Cluster Analysis

Variable          Mean      Std Dev   Skewness   Kurtosis   Bimodality
prestige_rating   48.0278   31.3127   0.2162     -1.5505    0.6068
suicide_rate      29.0611   11.8376   0.8011     0.8164     0.4012
income            3533.8    1413.4    1.4960     2.7124     0.5408
education         11.9056   2.9777    0.3413     -1.5276    0.6388
occupno           18.5000   10.5357   0          -1.2000    0.4818
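The Bimodality column can be reproduced from the Skewness and Kurtosis columns. The sketch below is in Python and uses the sample bimodality coefficient b = (skewness^2 + 1) / (kurtosis + 3(n-1)^2 / ((n-2)(n-3))), which is assumed here to be the formula behind the output; with n = 36 it matches all five tabulated values to four decimals. The function name is illustrative.

```python
def bimodality(skewness, kurtosis, n):
    """Sample bimodality coefficient from skewness, excess kurtosis and sample size."""
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return (skewness ** 2 + 1) / (kurtosis + correction)

n = 36  # number of occupations

# (skewness, kurtosis, bimodality as printed in the output)
table = {
    "prestige_rating": (0.2162, -1.5505, 0.6068),
    "suicide_rate": (0.8011, 0.8164, 0.4012),
    "income": (1.4960, 2.7124, 0.5408),
    "education": (0.3413, -1.5276, 0.6388),
    "occupno": (0.0, -1.2000, 0.4818),
}

for name, (skew, kurt, printed) in table.items():
    print(name, round(bimodality(skew, kurt, n), 4), printed)
```

A uniform distribution has a coefficient of about 0.555, so values above that (here prestige rating and education) are often read as a hint of bimodality, which is consistent with a two-cluster structure.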
Hierarchical clustering Output Cont

NCL   Clusters Joined   FREQ   SPRSQ    RSQ     ERSQ    CCC    PSF
35    OB33   OB34       2      0.0006   0.999   .       .      51.7
34    OB29   OB30       2      0.0006   0.999   .       .      50.6
33    OB11   OB12       2      0.0007   0.998   .       .      48.5
32    OB31   CL35       3      0.0009   0.997   .       .      45.8
31    OB28   OB32       2      0.0008   0.996   .       .      45.9
30    OB4    OB6        2      0.0011   0.995   .       .      43.6
29    OB27   CL34       3      0.0013   0.994   .       .      41.3
28    OB20   OB22       2      0.0012   0.993   .       .      40.8
27    OB23   OB24       2      0.0012   0.992   .       .      40.9
26    OB15   OB16       2      0.0013   0.99    .       .      40.9
25    OB26   CL32       4      0.0021   0.988   .       .      38.5
24    CL31   OB35       3      0.002    0.986   .       .      37.3
23    OB19   CL27       3      0.0025   0.984   .       .      35.6
22    OB1    CL30       3      0.0032   0.98    .       .      33.4
21    CL26   OB18       3      0.0032   0.977   .       .      32.2
20    OB2    OB8        2      0.0028   0.974   .       .      32.1
19    OB25   CL29       4      0.004    0.97    .       .      31
18    OB17   CL28       3      0.004    0.966   .       .      30.5
17    CL19   CL24       7      0.0078   0.959   .       .      27.5
16    CL20   OB3        3      0.0045   0.954   .       .      27.8
15    OB5    CL33       3      0.0056   0.949   .       .      27.6
14    CL18   CL23       6      0.0109   0.938   .       .      25.4
13    CL22   CL15       6      0.0118   0.926   .       .      23.9
12    CL25   OB36       5      0.0089   0.917   .       .      24.1
11    CL21   CL14       9      0.0175   0.899   .       .      22.3
10    OB9    OB10       2      0.0072   0.892   .       .      23.9
9     CL16   OB7        4      0.0097   0.882   .       .      25.3
8     OB14   OB21       2      0.0097   0.873   .       .      27.4
7     CL11   CL17       16     0.0544   0.818   0.854   -1.9   21.8
6     CL9    CL10       6      0.0239   0.794   0.83    -1.4   23.2
5     CL7    CL12       21     0.0646   0.73    0.797   -2.3   20.9
4     CL8    CL5        23     0.0406   0.689   0.75    -1.8   23.7
3     CL13   CL6        12     0.0858   0.604   0.675   -1.9   25.1
2     CL3    OB13       13     0.0662   0.537   0.524   0.26   39.5
1     CL2    CL4        36     0.5373   0       0       0      .

PST2 values as listed in the source (row alignment not recoverable): 13 4 9.8 4.3 12.1 4.7 39.5 2.7 4.1 3.4 4.5 1.6 7.7 4.9 4.4 7.5 5 2.9 2.5 2.1 2.9 2.5 2.1 1.5
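The SPRSQ and RSQ columns are linked: each merge lowers R-squared by its semipartial R-squared, so the RSQ reported for a given number of clusters should equal the sum of SPRSQ over the merges still to come. The Python sketch below checks this for the last rows of the output, using only values from the SPRSQ and RSQ columns; the variable and function names are illustrative.

```python
# SPRSQ of the final six merges, keyed by NCL (number of clusters after the merge)
sprsq = {6: 0.0239, 5: 0.0646, 4: 0.0406, 3: 0.0858, 2: 0.0662, 1: 0.5373}

# RSQ as printed in the output for the same range of NCL
reported_rsq = {7: 0.818, 6: 0.794, 5: 0.730, 4: 0.689, 3: 0.604, 2: 0.537, 1: 0.0}

def rsq_from_sprsq(ncl):
    """R-squared with `ncl` clusters: what the remaining merges will remove."""
    return sum(value for k, value in sprsq.items() if k < ncl)

for ncl in sorted(reported_rsq, reverse=True):
    print(ncl, round(rsq_from_sprsq(ncl), 4), reported_rsq[ncl])
```

The recomputed values agree with the printed RSQ to within rounding. The final merge alone accounts for 0.5373, which is why the two-cluster solution is the natural place to stop.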
Dendrogram
Interpreting or labeling the clusters
The two groups of occupations might be labelled blue collar and professional. Cluster 1 consists of occupations such as bookkeepers and cooks, with generally lower prestige, income and education, but higher suicide rates. Cluster 2 consists of occupations such as accountants, architects and dentists, with high prestige, income and education and relatively low suicide rates.
Nonhierarchical Clustering
/* unstandardized variables */
proc fastclus data=train.clus maxc=2 out=train.abc;
   var prestigerating suiciderate income education;
run;

/* standardized variables */
proc fastclus data=train.Zscore1 maxc=2 out=train.abc1;
   var zprestigerating zsuiciderate zincome zeducation;
run;
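Why run the procedure on both raw and standardized variables? Income is measured in thousands while education is measured in years, so on the raw scale income dominates any Euclidean distance. The Python sketch below (Python rather than SAS, so that it runs without a SAS installation) illustrates this on six occupations from the data listing, assuming the i-th value of each variable belongs to the i-th occupation; only rows with a numeric education value are used. The z-scores use the sample standard deviation, which is an assumption about how the standardized data set was built.

```python
from statistics import mean, stdev

# Occupations 1, 3, 13, 25, 35, 36: (prestige, suicide rate, income, education)
rows = [
    (82, 23.8, 3977, 14.4),
    (76, 37.0, 4303, 13.6),
    (81, 64.8, 4700, 12.2),
    (10, 14.4, 1898, 10.3),
    (8, 20.3, 1866, 8.2),
    (41, 47.6, 2866, 10.6),
]

def z_scores(rows):
    """Standardize each column to mean 0 and sample standard deviation 1."""
    cols = list(zip(*rows))
    mu = [mean(c) for c in cols]
    sd = [stdev(c) for c in cols]
    return [tuple((v - m) / s for v, m, s in zip(r, mu, sd)) for r in rows]

def squared_share(p, q):
    """Share of the squared Euclidean distance contributed by each variable."""
    sq = [(a - b) ** 2 for a, b in zip(p, q)]
    total = sum(sq)
    return [s / total for s in sq]

z = z_scores(rows)
raw_share = squared_share(rows[0], rows[3])  # occupation 1 vs occupation 25
z_share = squared_share(z[0], z[3])

print("income share, raw:", round(raw_share[2], 4))
print("income share, standardized:", round(z_share[2], 4))
```

On the raw scale income supplies nearly the whole distance between the two occupations; after standardization all four variables contribute on a comparable footing.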
Profiling the Cluster Solution
A "profile" of a cluster is merely the set of mean values for that cluster. These can be computed for the internal variables (used to form the clusters) as well as for external variables. The external variables may be demographic, psychographic, or consumption-pattern variables. Once the clusters are formed, other techniques such as profiling or discriminant analysis can be used to see which internal variables account for the clustering.
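Computing a profile is a group-by-and-average operation. The following is a minimal Python sketch; the six rows are read off the data listing (occupations 1, 3, 13, 25, 35, 36, assuming the i-th value of each variable belongs to the i-th occupation), while the cluster labels are hypothetical and stand in for the cluster-membership variable that the clustering procedure would write to its output data set.

```python
from collections import defaultdict
from statistics import mean

names = ["prestige", "suicide_rate", "income", "education"]

rows = [
    (82, 23.8, 3977, 14.4),
    (76, 37.0, 4303, 13.6),
    (81, 64.8, 4700, 12.2),
    (10, 14.4, 1898, 10.3),
    (8, 20.3, 1866, 8.2),
    (41, 47.6, 2866, 10.6),
]
labels = [2, 2, 2, 1, 1, 1]  # hypothetical cluster assignments, for illustration only

def profile(rows, labels, names):
    """Mean of every variable within each cluster."""
    by_cluster = defaultdict(list)
    for row, label in zip(rows, labels):
        by_cluster[label].append(row)
    return {
        label: {name: mean(col) for name, col in zip(names, zip(*members))}
        for label, members in sorted(by_cluster.items())
    }

profiles = profile(rows, labels, names)
for label, means in profiles.items():
    print(label, {name: round(value, 2) for name, value in means.items()})
```

The same function works for external variables: pass their columns in `rows` and their names in `names`, keeping the labels unchanged.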
Thank You
