Estimating the Number of Clusters in Mixture Models Using Data Thinning

Introduction

Determining the number of clusters K for a dataset of n data points (X1, …, Xn) is a difficult task when K is unknown. A common approach runs a clustering algorithm for a range of candidate values of K and selects the value that minimizes an objective function. However, this procedure fits and validates the model on the same data: the in-sample objective typically keeps improving as K grows, which biases the selection toward overfitting.
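To make the flaw concrete: the standard k-means objective (the within-cluster sum of squares) decreases monotonically in K when evaluated on the data used for fitting, so minimizing it cannot select K on its own. The following minimal sketch, assuming NumPy and scikit-learn are available, illustrates this on data with two true clusters:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Two well-separated Gaussian clusters in 2D, so the true K is 2.
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

    for K in range(1, 7):
        km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
        # inertia_ (within-cluster sum of squares) keeps shrinking as K grows,
        # even past the true K = 2, because the same data are fit and scored.
        print(K, km.inertia_)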

Background

In supervised learning, the problem of fitting and validating a model on the same data is avoided by splitting the data into training, validation, and test sets. In unsupervised settings, where labels are not available, this strategy is not directly applicable. Recently, a novel approach has been proposed: splitting each data point Xi into two points, Xi^train and Xi^test. The model is then trained on the training set (X1^train, …, Xn^train) and validated on the test set (X1^test, …, Xn^test).

The concept of splitting a random variable Y into two independent random variables Y1 and Y2 such that Y = Y1 + Y2 is known as thinning. While thinning is well known for Poisson processes [0], recent research has shown its applicability to a much wider range of distributions [1, 2, 3]. This development enables sample splitting in various unsupervised tasks, facilitating the separation of training and validation sets even without labeled data.
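The Poisson case is the classical instance: conditionally on X, drawing X^train ~ Binomial(X, eps) and setting X^test = X - X^train yields X^train ~ Poisson(eps * lam) and X^test ~ Poisson((1 - eps) * lam), with the two halves independent. A short NumPy sketch (eps = 0.5 is an illustrative choice):

    import numpy as np

    rng = np.random.default_rng(1)
    eps = 0.5                       # fraction of information kept for training
    lam = 10.0

    X = rng.poisson(lam, size=100_000)    # X ~ Poisson(lam)
    X_train = rng.binomial(X, eps)        # X_train | X ~ Binomial(X, eps)
    X_test = X - X_train                  # X = X_train + X_test by construction

    # Classical Poisson thinning: X_train ~ Poisson(eps * lam),
    # X_test ~ Poisson((1 - eps) * lam), and the two are independent.
    print(X_train.mean(), X_test.mean())        # both approx. 5.0
    print(np.corrcoef(X_train, X_test)[0, 1])   # approx. 0.0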

Research Objective

This project investigates the use of data thinning to infer the number of clusters K in a mixture model. For mixtures of exponential-family distributions, when the number of clusters K is known, optimal clustering can be achieved by adapting k-means (Lloyd's algorithm) to use a Bregman divergence in place of the traditional squared Euclidean loss [4, 5].
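As an illustration of how these pieces could fit together, the sketch below implements Bregman hard clustering for the Poisson case, using the divergence d(x, mu) = x log(x / mu) - x + mu generated by phi(x) = x log x - x, together with one possible thinning-based selection loop. The function names, the choice of eps, and the centroid rescaling are illustrative assumptions, not the project's final method. Note that for every Bregman divergence the optimal centroid is still the cluster mean [4], so only the assignment step of Lloyd's algorithm changes:

    import numpy as np

    def poisson_bregman(x, mu):
        # Bregman divergence for the Poisson case, with the convention
        # 0 log 0 = 0: d(x, mu) = x log(x / mu) - x + mu.
        x = np.asarray(x, dtype=float)
        with np.errstate(divide="ignore", invalid="ignore"):
            t = np.where(x > 0, x * np.log(x / mu), 0.0)
        return t - x + mu

    def bregman_kmeans(X, K, n_iter=50, seed=0):
        # Lloyd's algorithm with a Bregman assignment step; the centroid
        # update is unchanged because the mean minimizes the average
        # Bregman divergence to the points of a cluster [4].
        rng = np.random.default_rng(seed)
        mu = np.maximum(
            X[rng.choice(len(X), K, replace=False)].astype(float), 1e-8)
        for _ in range(n_iter):
            D = np.stack(
                [poisson_bregman(X, m).sum(axis=1) for m in mu], axis=1)
            z = D.argmin(axis=1)
            for k in range(K):
                if np.any(z == k):
                    mu[k] = np.maximum(X[z == k].mean(axis=0), 1e-8)
        return z, mu

    def select_K(X, K_max=8, eps=0.5, seed=0):
        # Hypothetical selection loop: thin each count, fit on the training
        # half, and score the independent test half at the fitted centroids.
        rng = np.random.default_rng(seed)
        X_train = rng.binomial(X, eps)    # requires count-valued X
        X_test = X - X_train
        losses = []
        for K in range(1, K_max + 1):
            z, mu = bregman_kmeans(X_train, K, seed=seed)
            # E[X_test] = ((1 - eps) / eps) * E[X_train], hence the rescaling.
            mu_test = np.maximum(mu * (1 - eps) / eps, 1e-8)
            losses.append(poisson_bregman(X_test, mu_test[z]).sum())
        return int(np.argmin(losses)) + 1, losses

Unlike the in-sample objective shown in the Introduction, this held-out loss need not decrease monotonically in K, which is what makes selection by minimization possible.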

References:

[0] See for example: https://stats.libretexts.org/Bookshelves/Probability_Theory/Probability_Mathematical_Statistics_and_Stochastic_Processes_(Siegrist)/14%3A_The_Poisson_Process/14.05%3A_Thinning_and_Superpositon or https://www.probabilitycourse.com/chapter11/11_1_3_merging_and_splitting_poisson_processes.php

[1] Neufeld, A., Dharamshi, A., Gao, L. L., & Witten, D. (2024). Data thinning for convolution-closed distributions. Journal of Machine Learning Research, 25(57), 1-35.

[2] Dharamshi, A., Neufeld, A., Motwani, K., Gao, L. L., Witten, D., & Bien, J. (2024). Generalized data thinning using sufficient statistics. Journal of the American Statistical Association, 1-26.

[3] Dharamshi, A., Neufeld, A., Gao, L. L., Bien, J., & Witten, D. (2024). Decomposing Gaussians with Unknown Covariance. arXiv preprint arXiv:2409.11497.

[4] Banerjee, A., Merugu, S., Dhillon, I. S., Ghosh, J., & Lafferty, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6(10).

[5] Dreveton, M., Gözeten, A., Grossglauser, M., & Thiran, P. (2024). Universal Lower Bounds and Optimal Rates: Achieving Minimax Clustering Error in Sub-Exponential Mixture Models. Proceedings of the Thirty Seventh Conference on Learning Theory, PMLR 247:1451-1485.