Early Classification: A New Heuristic to Improve the Classification Step of K-Means

Joaquín Pérez, Carlos Eduardo Pires, Leandro Balby, Adriana Mexicano, Miguel Ángel Hidalgo


Cluster analysis is the study of algorithms and techniques for grouping objects according to their intrinsic characteristics and similarity. A widely studied and popular clustering algorithm is K-Means, which is characterized by its ease of implementation and high computational cost. Although various performance improvements have been proposed for K-Means, the algorithm is still considered an expensive alternative for clustering large scale datasets. This work proposes a new heuristic for reducing the number of calculations needed in the classification step of K-Means, without high loss of quality reduction, by using statistical information about the displacement of centroids at each iteration. Our heuristic, called Early Classification (EC for short), identifies and excludes from future calculations those objects that, according to an equidistance threshold, have low likelihood of cluster change in subsequent iterations. To validate our proposal, a set of experiments is performed on synthetic and real-world datasets from the UCI Machine Learning repository. In addition, we compared our heuristic against two other state-of-the-art variations of K-Means. The Wilcoxon signed-rank test was used for calculating the statistical significance. The results are promising since the execution time of K-Means was reduced up to 82.49%, with a quality reduction of only 3.31%. Moreover, as the experiments will show, the superiority of our method is even more evident on large datasets.


Clustering; K-Means; Performance Optimization; Unsupervised Learning

Full Text:


An official publication of the Brazilian Computer Society Special Interest Group on Databases.