IPL Cluster

Case Study on Cluster Analysis
In this case study we perform a cluster analysis to form homogeneous groups of players based
on their past performances. The dataset covers 435 players and 13 variables on which the clusters
are formed. The variable names are explained below; the remaining two variables, fours and sixes,
get their names in step 2.
VARIABLE    DESCRIPTION
Mat         Matches played
Inns        Innings batted
Not Out     Not outs
Runs        Runs scored
HS          Highest score in an innings
Ave         Batting average
BF          Balls faced
SR          Batting strike rate
100s        Hundreds scored
50s         Fifties scored
DUCK        Ducks (dismissals for zero runs)
1. We first create a library named ipl. Then we import the data using the following statements:
libname ipl "E:\ARUP\cluster";
run;
proc import datafile="E:\ARUP\cluster\cluster _data.csv" out=ipl.data
dbms=csv replace;
getnames=yes;
datarow=2;
run;
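2. Then we rename a few variables to make them easier to use later. Column headers that begin
with a digit (such as 100s and 50s) are not valid SAS names, so the import presumably stored them
as _00s, _0s, _s and var14. We also create a PDF document, variables.pdf, which lists the
variable names.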
data ipl.data;
set ipl.data;
rename _00s=hundreds _0s=fifties _s=fours var14=sixes;
run;
ods pdf body="E:\ARUP\cluster\variables.pdf";
proc contents data=ipl.data position short;
run;
ods pdf close;
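3. Cluster analysis relies on distances between players, so variables measured on large scales
(such as Runs) would dominate unless we standardize the dataset first. We therefore rescale each
variable (excluding the identifier Player) to mean zero and standard deviation one.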
proc standard data=ipl.data mean=0 std=1 out=ipl.standard;
var Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
run;
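To make the effect of standardization concrete, here is a minimal Python sketch of the same rescaling for a single variable. It is an illustration only, not part of the SAS workflow: the function name standardize and the run totals are made up, and it assumes the sample standard deviation (divisor n - 1).

```python
from statistics import mean, stdev


def standardize(values):
    """Rescale values to mean 0 and sample standard deviation 1."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (divisor n - 1)
    return [(v - m) / s for v in values]


# Hypothetical run totals for five players (not taken from the IPL dataset).
runs = [120.0, 450.0, 980.0, 35.0, 610.0]
z = standardize(runs)

print([round(v, 3) for v in z])
# The standardized values have mean 0 and standard deviation 1,
# up to floating-point rounding.
print(abs(mean(z)) < 1e-9, abs(stdev(z) - 1.0) < 1e-9)
```

After this step every variable contributes on the same scale, whatever its original units.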
The information provided on these pages remains, unless otherwise stated, the copyright of the respective authors. All layout, design, original graphics, concepts and
other World Wide Web Intellectual Property Rights barring the information mentioned above, remains the property and copyright of OrangeTree Global
4. Now we use Ward's method, together with the cubic clustering criterion (ccc), to
cluster the players. The outtree= option saves the clustering history in the
output dataset cluster_tree. The id statement names the variable Player, which
labels the observations that are grouped based on the remaining variables.
ods pdf body="E:\ARUP\cluster\result_1.pdf";
proc cluster data=ipl.standard outtree=ipl.cluster_tree method=ward ccc;
id Player;
run;
ods pdf close;
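As a rough illustration of what Ward's method optimizes, the Python sketch below computes the cost of merging two clusters, i.e. the increase in the within-cluster sum of squares. At each step Ward's method merges the pair with the smallest such cost. The helper names and the small two-variable points are hypothetical, and this is a sketch of the criterion only, not of how proc cluster is implemented.

```python
def centroid(points):
    """Coordinate-wise mean of a list of equal-length points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def wss(points):
    """Within-cluster sum of squared distances to the centroid."""
    c = centroid(points)
    return sum(sum((x - y) ** 2 for x, y in zip(p, c)) for p in points)


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = centroid(a), centroid(b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2


# Hypothetical standardized scores on two variables (not from the IPL data).
a = [[0.0, 0.0], [0.0, 2.0]]
b = [[4.0, 1.0]]   # far from cluster a
c = [[0.5, 1.0]]   # close to cluster a

# The closed-form cost agrees with the direct difference of sums of squares.
print(ward_cost(a, b), wss(a + b) - wss(a) - wss(b))
# Merging a with the nearby cluster c is much cheaper than merging a with b.
print(ward_cost(a, c) < ward_cost(a, b))
```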
Let us take a look at the eigenvalues in the output.
The first three eigenvalues are greater than one. The cumulative proportion of
variation explained by the first three components is 0.8120, i.e. 81.2%. Including
the fourth component raises the variance explained to 87.08%.
5. We now use the dataset cluster_tree, created by the outtree=
option, to generate a dendrogram (tree diagram) and find out the number
of clusters.
ods pdf body="E:\ARUP\cluster\result_2.pdf";
proc tree data=ipl.cluster_tree;
id Player;
copy Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
run;
ods pdf close;
Following is the output.

The y-axis shows the semi-partial R-squared values and the x-axis shows the players' names. If we
draw a horizontal line at level 0.2 (or slightly higher), it intersects two vertical lines, so we get two clusters.
At level 0.4 we have only one cluster, i.e. all the players in one group, which is clearly not desirable. At
level 0.08 the horizontal line intersects four vertical lines. If we take this level as the desired one,
we get four clusters.
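The rule for reading the dendrogram can be sketched in a few lines of Python. A horizontal cut leaves one cluster plus one more for every merge that happens above the cut. The merge heights below are invented for a toy tree of seven observations; they are not the semi-partial R-squared values from the IPL output.

```python
def clusters_at(level, merge_heights):
    """Number of clusters left when the tree is cut at `level`.

    Every merge above the cut is undone, and each undone merge
    adds one cluster to the single all-inclusive group.
    """
    return 1 + sum(1 for h in merge_heights if h > level)


# Hypothetical merge heights for a toy tree of seven observations.
heights = [0.01, 0.02, 0.03, 0.10, 0.15, 0.30]

for level in (0.4, 0.2, 0.08):
    print(level, clusters_at(level, heights))
```

With these made-up heights, cutting at 0.4, 0.2 and 0.08 gives 1, 2 and 4 clusters, mirroring the way the dendrogram is read above.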
6. After deciding upon the number of clusters we move on to the next step. We run the following
set of statements.
ods pdf body= "E:\ARUP\cluster\result_3.pdf";
proc tree data=ipl.cluster_tree nclusters=4 out=ipl.cluster;
id Player;
copy Mat Inns Not_Outs Runs HS Ave BF SR hundreds fifties Ducks fours sixes;
run;
ods pdf close;
In this code the nclusters=4 option cuts the tree into four clusters. The out= option creates the dataset
cluster, which contains all the standardized variables along with the players' names and the corresponding
cluster numbers. The output also shows the same tree diagram.
This completes the creation of the clusters. We now move on to some further analysis based on them.
7. We create a new dataset final_clusters which contains only the players' names and the
corresponding cluster numbers.
data ipl.final_clusters;
set ipl.cluster (keep= Player cluster);
run;
8. Now we merge the original dataset, data, with the new dataset final_clusters. The
resulting dataset is clustered_data. We first sort both datasets by the common
column Player and then merge them, keeping only the players present in both. We use the
following set of statements.
proc sort data=ipl.final_clusters;
by Player;
run;
proc sort data=ipl.data;
by Player;
run;
data ipl.clustered_data;
merge ipl.data(in=x) ipl.final_clusters(in=y);
by player;
if x=1 and y=1;
run;
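The effect of the condition x=1 and y=1 is that only players found in both datasets are kept. The Python sketch below mimics that behaviour under the assumption that each player appears at most once in each dataset; the function name inner_merge and the rows are hypothetical.

```python
def inner_merge(data_rows, cluster_rows, key="Player"):
    """Keep only rows whose key appears in both inputs (x=1 and y=1)."""
    clusters = {row[key]: row for row in cluster_rows}
    merged = []
    # Sorting by the key mirrors the proc sort steps before the merge.
    for row in sorted(data_rows, key=lambda r: r[key]):
        if row[key] in clusters:
            merged.append({**row, **clusters[row[key]]})
    return merged


# Hypothetical rows; player C has no cluster number and is dropped.
data = [
    {"Player": "B", "Runs": 300},
    {"Player": "A", "Runs": 900},
    {"Player": "C", "Runs": 50},
]
final_clusters = [
    {"Player": "A", "cluster": 4},
    {"Player": "B", "cluster": 2},
]

merged = inner_merge(data, final_clusters)
for row in merged:
    print(row)
```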
9. To look at the mean, standard deviation and some other summary statistics of each
cluster, we run proc means with cluster as the class variable.
ods pdf body= "E:\ARUP\cluster\clusterwise_summary.pdf";
proc means data=ipl.clustered_data n mean std skewness kurtosis;
class cluster;
run;
ods pdf close;
The output shows that the four clusters contain 193, 171, 52 and
19 players respectively, which accounts for all 435 players. It also gives
the averages of all the variables for each cluster.
Cluster four appears to group the strongest players, in the sense that
its averages are very high on all the variables.
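For readers who want to see what a class-wise summary amounts to, here is a small Python sketch that counts the players and averages one variable within each cluster. The function name cluster_summary and the rows are made up for illustration and do not reproduce the figures reported above.

```python
from collections import defaultdict
from statistics import mean


def cluster_summary(rows, variable):
    """Return {cluster: (number of players, mean of variable)}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["cluster"]].append(row[variable])
    return {c: (len(v), mean(v)) for c, v in sorted(groups.items())}


# Hypothetical clustered rows (not taken from the IPL dataset).
rows = [
    {"Player": "A", "cluster": 1, "Runs": 100},
    {"Player": "B", "cluster": 1, "Runs": 300},
    {"Player": "C", "cluster": 4, "Runs": 2000},
]

summary = cluster_summary(rows, "Runs")
print(summary)
```

Comparing such per-cluster averages across variables is what lets us describe one cluster as stronger than another.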
10. We can also list the players belonging to cluster four using the code:
ods pdf body="E:\ARUP\cluster\cluster_4.pdf";

proc print data=ipl.clustered_data (keep =cluster player);
where cluster=4;
run;
ods pdf close;
Below is the output.
