k-means clustering #

\(k\)-means clustering is an unsupervised learning technique in which \(n\) data points are partitioned into \(k\) clusters, each data point being assigned to the cluster whose mean is nearest. The resulting clusters form a Voronoi partitioning of the space.

Problem description #

Given \(n\) observations \(x_1,\ldots,x_n\) and a positive integer \(k\le n\), partition the observations into \(k\) disjoint sets \(\mathbf{S}=\{S_1,\ldots,S_k\}\) such that the total within-cluster sum of squares (variance) is minimal, i.e. \[ \mathrm{argmin}_{\mathbf{S}}\sum_{i=1}^k\sum_{x\in S_i}\left\|x-\mu_i\right\|^2 \] where \(\mu_i\) is the mean of the points in \(S_i\).
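
To make the objective concrete, here is a minimal numpy sketch (the function name and signature are illustrative, not part of any standard library) that evaluates the within-cluster sum of squares for a given partition:

```python
import numpy as np

def within_cluster_ss(points, labels, k):
    """Total within-cluster sum of squares for a given partition.

    points: (n, d) array of observations.
    labels: (n,) integer array assigning each observation to a cluster.
    """
    total = 0.0
    for i in range(k):
        cluster = points[labels == i]
        if len(cluster) == 0:
            continue
        mu = cluster.mean(axis=0)             # centroid mu_i of S_i
        total += ((cluster - mu) ** 2).sum()  # sum of squared distances to mu_i
    return total
```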

Algorithm #

This optimisation problem is NP-hard in general, so in practice one resorts to iterative heuristics, such as the naive (Lloyd's) algorithm below, which converge to a local optimum.

Naive k-means #

  • Initialise a set of “means” \(m_1,\ldots,m_k\). This can be done randomly or with some heuristic.
  • At each “assignment step”, assign each observation to the cluster with the nearest mean, or more precisely, define a partition \(S_1,\ldots,S_k\) where a point is assigned to \(S_i\) if \(m_i\) is the nearest mean (break ties arbitrarily).
  • Recalculate the means as the centroids of the new clusters, and repeat the two steps until the assignments no longer change (a sketch of this loop follows the list).
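
A minimal numpy sketch of the loop, assuming the initial means are sampled from the observations (the function name and the stopping criterion are illustrative choices, not a reference implementation):

```python
import numpy as np

def naive_kmeans(points, k, rng=None, max_iter=100):
    """Lloyd's algorithm: alternate assignment and update steps.

    points: (n, d) array of observations; returns (labels, means).
    """
    rng = np.random.default_rng(rng)
    # Initialise the means by sampling k distinct observations.
    means = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: index of the nearest mean for each point
        # (ties are broken by argmin, i.e. arbitrarily but deterministically).
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each mean as the centroid of its cluster;
        # an emptied cluster keeps its previous mean.
        new_means = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else means[i]
            for i in range(k)
        ])
        if np.allclose(new_means, means):  # converged: the means stopped moving
            break
        means = new_means
    return labels, means
```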

Drawbacks #

  • \(k\) must be chosen as an input parameter and finding the “right” value for a specific problem is nontrivial.
  • The algorithm may converge to a local minimum that gives unintuitive clusters; a common mitigation is to restart from several random initialisations and keep the best result (see the sketch after this list).
  • It also implicitly assumes roughly “spherical”, separable clusters of similar size.
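
In practice the first two drawbacks are usually tackled empirically. As a short sketch (the data here is a random placeholder), scikit-learn's `KMeans` combines random restarts via `n_init` with the reported objective value `inertia_`, whose “elbow” as a function of \(k\) is a common heuristic for choosing \(k\):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(200, 2))  # placeholder data

# n_init restarts the algorithm from several random initialisations and keeps
# the fit with the lowest within-cluster sum of squares (inertia_), reducing
# the risk of a poor local minimum.
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)  # look for an "elbow" in this curve to pick k
```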