Sei sulla pagina 1di 4

K-Means Clustering in PHP

Unsupervised learning technique to find clusters (subsets of data with similar caracteristics) in unknown data.
Given a random set of numbers, the problem to solve here is to determine a fixed number of intervals that best describe the distribution of the initial dataset. Instead of trying to identify how many clusters exists in the dataset (also possible, but outside the scope of this article), a K-means clustering algorithm will always return a fixed (K) number of subsets. Given the following data set: D = [1, 3, 2, 5, 6, 2, 3, 1, 30, 36, 45, 3, 15, 19] The idea would be to find 3 clusters with the most similar numbers: S = [[1,1,2,2,3,3,3,7], [15,17], [30,36,45]] For a human being, given such a small set and number of clusters, it would be relatively easy to determine the intervals that best describe the dataset, however that task becomes nearly impossibe as the dataset grows. K-Means algorithm (http://en.wikipedia.org/wiki/K-means_algorithm) will allows us to do this task automatically. It works by identifying the space of possible solutions and, as a first step, it creates each of the K clusters and distributes them randomly (or using an heuristic) throughout the available solution space and recalculates its positions on each iteration until there are no more changes and the clusters are finally found. On each iteration, every value it the dataset is assigned to the closest cluster (which was initially positioned at random) by using a distance (or similarity) function. Once all the values have been assigned to one given cluster, the cluster position is then recalculated so that it is placed in the mean location of the values that are assigned to it. As the algorithm iterates, values will move from one cluster to the other until there arent any more changes at which point the algorithm is finished. Depending on the problem, it might be worth limiting the number of iterations to look out for infinite loops. For our current problem, because it needs to be integrated in an existing application that our company is currently in the process of improving and it is entirely written in PHP, we will choose PHP as our programming language. You can choose any language you are most comfortable with to implement it, but feel free to download the full version of the PHP code at the end of the article. The code comes with some examples and it is fairly well documented. If you ever need to use it, and email or link back is always welcomed.
<?php /** * This code was created by Jose Fonseca (josefonseca@blip.pt) * * Please feel free to use it in either commercial or non comercial applications! * and if you enjoy usin" it feel free to let us #now or to comment on our * technical blo" at http$//code.blip.pt */ print_r(kmeans(array(1, 3, 2, 5, 6, 2, 3, 1, 30, 36, 45, 3, 15, 17), 3)); /**

* This function ta#es a array of inte"ers and the number of clusters to create. * %t returns a multidimensional array containin" the ori"inal data or"ani&ed * in clusters. * * @param array 'data * @param int '# * * @return array */ function kmeans($data, $k) $!"#siti#ns = assi$n_initia%_p#siti#ns($data, $k); $!%&sters = array(); '(i%e(true) $!(an$es = kmeans_!%&sterin$($data, $!"#siti#ns, $!%&sters); i)(*$!(an$es) , $data, $!%&sters); , , ret&rn kmeans_$et_!%&ster_+a%&es($!%&sters, $data); $!"#siti#ns = kmeans_re!a%!&%ate_!p#siti#ns($!"#siti#ns,

/** * */ function kmeans_!%&sterin$($data, $!"#siti#ns, -$!%&sters) $n.(an$es = 0; )#rea!(($data as $data/ey =0 $+a%&e) $minDistan!e = null; $!%&ster = null; )#rea!(($!"#siti#ns as $k =0 $p#siti#n) $distan!e = distan!e($+a%&e, $p#siti#n); i)(is_n&%%($minDistan!e) 11 $minDistan!e 0 $distan!e) $minDistan!e = $distan!e; $!%&ster = $k; , $!%&ster) , i)(*isset($!%&sters2$data/ey3) 11 $!%&sters2$data/ey3 *= $n.(an$es44; , $!%&sters2$data/ey3 = $!%&ster; , , ret&rn $n.(an$es;

function kmeans_re!a%!&%ate_!p#siti#ns($!"#siti#ns, $data, $!%&sters) $k5a%&es = kmeans_$et_!%&ster_+a%&es($!%&sters, $data); )#rea!(($!"#siti#ns as $k =0 $p#siti#n) kmeans_a+$($k5a%&es2$k3); , ret&rn $!"#siti#ns; , $!"#siti#ns2$k3 = empty($k5a%&es2$k3) 6 0 7

function kmeans_$et_!%&ster_+a%&es($!%&sters, $data) $+a%&es = array(); )#rea!(($!%&sters as $data/ey =0 $!%&ster) $+a%&es2$!%&ster323 = $data2$data/ey3; , ret&rn $+a%&es; , function kmeans_a+$($+a%&es) $n = !#&nt($+a%&es); $s&m = array_s&m($+a%&es); ret&rn ($n == 0) 6 0 7 $s&m 8 $n; , /** * (alculates the distance (or similarity) between two )alues. The closer * the return )alue is to *+,-! the more similar the two )alues are. * * @param int '). * @param int ')/ * * @return int */ function distan!e($+1, $+2) ret&rn a9s($+1:$+2); , /** * (reates the initial positions for the "i)en * number of clusters and data. * @param array 'data * @param int '# * * @return array */ function assi$n_initia%_p#siti#ns($data, $k) $min = min($data);

$ma; = ma;($data); $int = !ei%(a9s($ma; : $min) 8 $k); '(i%e($k:: 0 0) $!"#siti#ns2$k3 = $min 4 $int < $k; , ret&rn $!"#siti#ns;

Output =rray ( [0] => =rray ( [0] => [1] => [2] => [3] => [4] => [5] => [6] => [7] => [>] => )

1 3 2 5 6 2 3 1 3

[2] => =rray ( [0] => 30 [1] => 36 [2] => 45 ) [1] => =rray ( [0] => 15 [1] => 17 ) )

Potrebbero piacerti anche