Enhancing Unsupervised Learning Through Data Thinning: An Exploration of Sample Splitting


Clustering a set of n data points, X_1, ..., X_n, into an optimal number of clusters, K, poses a challenge when K is unknown. A common yet flawed approach involves employing a clustering algorithm for different values of K (e.g., K=2 to K=10) and selecting the K that yields the lowest objective. This method is problematic as it involves both fitting and validating the model on the same dataset.


In supervised learning, the practice of fitting and validating different models is mitigated by splitting the data into training, validation, and test sets. Unfortunately, this standard data-splitting technique is not directly applicable in an unsupervised setting, where labels for validation and test sets are unavailable. Recently, a novel approach has been proposed to address this limitation: splitting each data point X_i into two sets, X_i^{train} and X_i^{test}. Subsequently, the model is trained on X_1^{train}, ..., X_n^{train}, and validated on X_1^{test}, ..., X_n^{test}.

The process of splitting a random variable Y into two independent random variables, Y_1 and Y_2, such that Y = Y_1 + Y_2 is known in statistics as thinning. While thinning Poisson processes is a well-established concept [0], recent research has demonstrated its broader applicability to various distributions beyond Poisson [1,2]. This breakthrough opens the door to sample splitting in a wide array of unsupervised tasks, allowing for the separation of training and validation sets even in the absence of labelled data.

Research Objective

This project aims to explore the application of data thinning for clustering non-Gaussian mixture models, focusing on mixture models of exponential family. In particular, we will use the inherent relationship between exponential family distributions and Bregman divergences and its application to clustering [3].


[0] See for example: https://stats.libretexts.org/Bookshelves/Probability_Theory/Probability_Mathematical_Statistics_and_Stochastic_Processes_(Siegrist)/14%3A_The_Poisson_Process/14.05%3A_Thinning_and_Superpositon or https://www.probabilitycourse.com/chapter11/11_1_3_merging_and_splitting_poisson_processes.php

[1] Neufeld, A., Dharamshi, A., Gao, L. L., & Witten, D. (2023). Data thinning for convolution-closed distributions. arXiv preprint arXiv:2301.07276.

[2] Dharamshi, A., Neufeld, A., Motwani, K., Gao, L. L., Witten, D., & Bien, J. (2023). Generalized Data Thinning Using Sufficient Statistics. arXiv preprint arXiv:2303.12931.

[3] Banerjee, A., Merugu, S., Dhillon, I. S., Ghosh, J., & Lafferty, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6(10).