## k-means clustering #

$$k$$-means clustering is an unsupervised learning technique in which $$n$$ data points are partitioned into $$k$$ clusters, each data point being assigned to the cluster with the nearest mean. The resulting clusters form a Voronoi partitioning of the space.

### Problem description #

Given $$n$$ observations $$x_1,\ldots,x_n$$ and a positive integer $$k$$, partition the observations into $$k$$ disjoint sets $$\mathbf{S}=\{S_1,\ldots,S_k\}$$ such that the total within-cluster sum of squares (variance) is minimal, i.e. find $$\mathrm{argmin}_{\mathbf{S}}\sum_{i=1}^k\sum_{x\in S_i}\left\|x-\mu_i\right\|^2$$ where $$\mu_i$$ is the mean of the points in $$S_i$$.
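As a concrete reading of the objective, here is a minimal sketch (assuming NumPy and a given label array encoding the partition; the function name `within_cluster_ss` is illustrative) that evaluates the within-cluster sum of squares:

```python
import numpy as np

def within_cluster_ss(X, labels, k):
    """Total within-cluster sum of squares for a partition of X.

    X is an (n, d) array of observations; labels[i] in {0, ..., k-1}
    names the set S_i that x_i belongs to.
    """
    total = 0.0
    for i in range(k):
        cluster = X[labels == i]
        if len(cluster) == 0:
            continue  # empty clusters contribute nothing
        mu = cluster.mean(axis=0)             # mean mu_i of the points in S_i
        total += ((cluster - mu) ** 2).sum()  # sum of ||x - mu_i||^2 over S_i
    return total
```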

### Algorithm #

This optimisation problem is NP-hard in general, so practical algorithms are iterative heuristics that converge to a local optimum rather than guaranteeing the global one.

#### Naive k-means #

• Initialise a set of “means” $$m_1,\ldots,m_k$$, either randomly or with some heuristic.
• At each “assignment step”, assign every observation to the cluster with the nearest mean; more precisely, define a partition $$S_1,\ldots,S_k$$ where a point is assigned to $$S_i$$ if $$m_i$$ is its nearest mean (breaking ties arbitrarily).
• At each “update step”, recalculate the means as the centroids of the new clusters, and repeat both steps until the algorithm converges (see the sketch after this list).
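A minimal sketch of the full loop, assuming NumPy and random initialisation from the data points (the function name `kmeans` and its parameters are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=100, rng=None):
    """Naive (Lloyd's) k-means on an (n, d) data matrix X."""
    rng = np.random.default_rng(rng)
    # Initialisation: pick k distinct observations as the starting means.
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest mean for each point.
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each mean becomes the centroid of its cluster
        # (an empty cluster keeps its old mean).
        new_means = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else means[i]
            for i in range(k)
        ])
        # Converged once the means stop moving.
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, labels
```

For example, `means, labels = kmeans(X, 3, rng=0)` on an `(n, 2)` array `X` returns three cluster centres and a cluster label for each point.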

### Drawbacks #

• $$k$$ must be chosen as an input parameter and finding the “right” value for a specific problem is nontrivial.
• The algorithm may converge to a local minimum that yields unintuitive clusters; the result depends on the initial means (illustrated below).
• The method implicitly assumes clusters that are roughly “spherical”, well separated, and of similar size.
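To illustrate the local-minimum drawback, here is a small sketch using scikit-learn (assuming it is installed; `init="random"` and `n_init=1` disable the library's smarter seeding and multiple-restart safeguards, so each run keeps whatever local optimum its initialisation leads to):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic blobs; the "true" k is 3.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2))
               for c in ([0, 0], [5, 0], [0, 5])])

# Single runs from random initialisations can land in different local
# minima, visible as different final objective values (inertia).
for seed in range(5):
    km = KMeans(n_clusters=3, init="random", n_init=1,
                random_state=seed).fit(X)
    print(seed, round(km.inertia_, 1))
```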