
Pattern Recognition

Universiteti i Prishtinës

Second part of the project (submission deadline: 23.01.2017)

Problem Description

An unclassified data set contains hidden data structures. The task in this
assignment is to perform a comprehensive cluster analysis in order to reveal
these structures and the groups of similar data.
The archive contains an unlabeled data set, test.txt, and a set of initial
centroids, centroids.txt. Both files have the following format:
[attribute1_value <space> attribute2_value <space> ... <space>
attribute90_value].
The unlabeled data set includes 350 samples and the initial centroids set consists
of 15 samples. Data instances in both files have 90 attributes.
Finally, prepare an academic report and deliver it together with the source code
and any additional material that you used during your work.
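
The file format described above can be read with a few lines of Python. This is only a sketch: the helper name load_samples is hypothetical, and it assumes that each line holds one sample as plain space-separated numbers, with the square brackets in the format description being notation rather than literal characters in the files.

```python
import numpy as np

def load_samples(path, n_attributes=90):
    """Read one sample per line, attributes separated by whitespace.

    Assumption: the files contain only numbers (no literal brackets).
    """
    data = np.loadtxt(path, ndmin=2)  # default delimiter is any whitespace
    if data.shape[1] != n_attributes:
        raise ValueError(
            f"expected {n_attributes} attributes, found {data.shape[1]}"
        )
    return data
```

With the files from the archive, load_samples("test.txt") would be expected to return an array of shape (350, 90) and load_samples("centroids.txt") one of shape (15, 90).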

Tasks:

1. Implement a simple K-means method that can handle real-valued attributes.
Your program must also support the Euclidean, Manhattan (City Block),
Squared Euclidean (the same as the Euclidean distance, but without taking
the square root) and Chebyshev distances. You are free to use any kind of
weights (for features or data instances) in the program if necessary.

2. Rescale the attribute values to obtain normalized data within the range
[0,1], which is more suitable and reliable for proper cluster analysis. You
can use the following equation for rescaling:
xNew = (x - Min) / (Max - Min). Feel free to use your own rescaling method.

3. Perform clustering of the unlabeled data set. You can use the provided
initial centroids set or generate your own. Consider the following stopping
criteria:
3.1 Maximum number of iterations: 100
3.2 Clusters are consistent (no changes in the group matrix or centroids in
the current iteration, which means that the clusters are balanced).
4. Cluster analysis can also be represented more formally as an optimization
procedure that tries to minimize the Residual Sum of Squares (RSS) objective
function:

$$RSS = \sum_{k=1}^{K} \sum_{x \in \omega_k} \lVert x - \mu(\omega_k) \rVert^2$$

where μ(ωk) is the centroid of a particular cluster k, K is the total number
of clusters, and x is a data sample in the cluster ωk.

4.1 Please provide the value of the RSS function on each iteration in your
program for a particular distance measure and number K.

4.2 Discuss how the value of the RSS function changes (whether it increases
or decreases, and why) during cluster analysis, from the first iteration
until the last one.

5. Try different numbers of clusters in your program (K=2...15) and build a
plot that shows the dependency between the number K and the value of the RSS
function on the last iteration.

5.1 What is the optimal number of clusters K for a given data set?

5.2 Did you get any empty clusters? What is a possible solution for this
problem?
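
The tasks above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not a reference solution: all function names are hypothetical, the RSS is always computed with the squared Euclidean norm regardless of the distance used for assignment, and an empty cluster simply keeps its previous centroid, which is only one of several possible ways to handle it.

```python
import numpy as np

def pairwise_distance(X, C, metric="euclidean"):
    """Distances from every row of X (n, d) to every row of C (k, d)."""
    diff = X[:, None, :] - C[None, :, :]          # shape (n, k, d)
    if metric == "euclidean":
        return np.sqrt((diff ** 2).sum(axis=2))
    if metric == "sqeuclidean":                   # Euclidean without the root
        return (diff ** 2).sum(axis=2)
    if metric == "manhattan":                     # City Block
        return np.abs(diff).sum(axis=2)
    if metric == "chebyshev":
        return np.abs(diff).max(axis=2)
    raise ValueError(f"unknown metric: {metric}")

def rescale(X):
    """Per-attribute rescaling to [0, 1]: xNew = (x - Min) / (Max - Min)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)        # constant attributes map to 0
    return (X - lo) / span

def rss(X, labels, C):
    """Residual Sum of Squares of samples around their assigned centroids."""
    return float(((X - C[labels]) ** 2).sum())

def kmeans(X, C0, metric="euclidean", max_iter=100):
    """Return labels, centroids and the RSS value after each update step."""
    C = np.array(C0, dtype=float)
    labels = None
    history = []
    for _ in range(max_iter):                     # stopping criterion 3.1
        new_labels = pairwise_distance(X, C, metric).argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                 # stopping criterion 3.2
        labels = new_labels
        for k in range(len(C)):
            members = X[labels == k]
            if len(members) > 0:
                C[k] = members.mean(axis=0)
            # Assumption: an empty cluster keeps its previous centroid.
        history.append(rss(X, labels, C))
    return labels, C, history
```

For task 5, one could call kmeans once for each K from 2 to 15, for example with the first K rows of the initial centroids, and plot the last entry of history against K. If the provided centroids are used together with rescaled data, they would need to be rescaled with the same Min and Max values as the data. Note also that the mean-based centroid update is tied to the squared Euclidean objective, so with the Manhattan or Chebyshev distance the RSS is not guaranteed to decrease on every iteration.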
