Enhancing Unsupervised Learning Through Data Thinning: An Exploration of Sample Splitting

Introduction

Determining the optimal number of clusters $K$ in a set of $n$ data points $(X_1, \ldots, X_n)$ is challenging when $K$ is unknown. A common, yet flawed, approach is to run a clustering algorithm for a range of $K$ values and select the one that minimizes the objective. This is problematic because the model is fit and validated on the same dataset: the within-cluster objective can only decrease as $K$ grows, so minimizing it rewards overfitting rather than recovering the true number of clusters.
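To make the failure mode concrete, here is a minimal sketch (our own illustration, using simulated data and scikit-learn; none of the names below come from the proposal) that fits k-means for a range of $K$ and prints the objective on the same data used for fitting:

```python
# Sketch of the flawed approach: the objective is computed on the same data
# used for fitting, so it can only decrease as k grows.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Simulated data: three well-separated clusters in two dimensions.
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0.0, 3.0, 6.0)])

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))  # inertia_ = within-cluster sum of squares
```

Running this, the printed objective keeps shrinking well past the true $K = 3$, so it cannot by itself identify the number of clusters.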

Background

In supervised learning, the problem of fitting and validating on the same data is avoided by splitting the data into training, validation, and test sets. In unsupervised settings, however, where labels are not available, this strategy is not directly applicable. Recently, a novel approach has been proposed: splitting each data point $X_i$ into two independent parts, $X_i^{\text{train}}$ and $X_i^{\text{test}}$. The model is then trained on $X_1^{\text{train}}, \ldots, X_n^{\text{train}}$ and validated on $X_1^{\text{test}}, \ldots, X_n^{\text{test}}$, as sketched below.
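A minimal sketch of this train-and-validate loop, assuming the two halves have already been produced by a symmetric split so that both halves share the same cluster means (the thinning step that produces them is the subject of the next paragraph); all names are illustrative:

```python
# Hedged sketch: X_train[i] and X_test[i] are the two independent halves of
# observation i (assumed to come from a symmetric, eps = 0.5 thinning).
from sklearn.cluster import KMeans

def thinned_loss(X_train, X_test, k):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
    # Centroid assigned to each observation, learned from the training halves.
    centers = km.cluster_centers_[km.labels_]
    # Score against the held-out halves: unlike the naive objective, this
    # loss is not guaranteed to decrease as k grows.
    return ((X_test - centers) ** 2).sum()
```

Because the held-out halves are independent of the fitted model, the loss stops improving once $k$ exceeds the true number of clusters, which is exactly the behavior the naive approach lacks.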

The concept of splitting a random variable $Y$ into two independent random variables $Y_1$ and $Y_2$ such that $Y = Y_1 + Y_2$ is known as thinning. While thinning is well known for Poisson processes [0], recent research has shown its applicability to a much wider range of distributions [1, 2]. This development enables sample splitting in various unsupervised tasks, facilitating the separation of training and validation sets even without labeled data.
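For the Poisson case the construction is classical and easy to check by simulation: if $Y \sim \text{Poisson}(\lambda)$ and $Y_1 \mid Y \sim \text{Binomial}(Y, \varepsilon)$, then $Y_1 \sim \text{Poisson}(\varepsilon \lambda)$ and $Y_2 = Y - Y_1 \sim \text{Poisson}((1-\varepsilon)\lambda)$, with $Y_1$ and $Y_2$ independent. A short NumPy sketch (our own illustration):

```python
# Poisson thinning by binomial sampling.
import numpy as np

rng = np.random.default_rng(0)
lam, eps, n = 10.0, 0.5, 100_000

Y = rng.poisson(lam, size=n)
Y1 = rng.binomial(Y, eps)   # "training" half: Poisson(eps * lam)
Y2 = Y - Y1                 # "test" half:     Poisson((1 - eps) * lam)

print(Y1.mean())                  # approx. eps * lam = 5
print(np.corrcoef(Y1, Y2)[0, 1])  # approx. 0: the two halves are independent
```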

Research Objective

This project aims to explore the application of data thinning to clustering non-Gaussian mixture models, focusing on mixtures of exponential family distributions. In particular, we will investigate the relationship between exponential family distributions, Bregman divergences, and their applications in clustering [3], as sketched below.
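As background for that investigation: [3] shows that every regular exponential family corresponds to a unique Bregman divergence, that k-means-style hard clustering under this divergence amounts to fitting a mixture of that family, and that the divergence-minimizing centroid of a cluster is always its arithmetic mean. The sketch below (our own illustration, not code from [1-3]) clusters Poisson counts using the generalized KL divergence $d(x, \mu) = x \log(x/\mu) - x + \mu$, the divergence matched to the Poisson family:

```python
import numpy as np

def poisson_bregman_kmeans(X, k, n_iter=50, seed=0):
    """Hard Bregman clustering of count data X (n x d) under the
    generalized KL divergence; all names here are illustrative."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # d(x, mu) = x*log(x/mu) - x + mu, with 0*log(0) taken as 0.
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = X[:, None, :] / centers[None, :, :]
            xlog = np.where(X[:, None, :] > 0, X[:, None, :] * np.log(ratio), 0.0)
        d = (xlog - X[:, None, :] + centers[None, :, :]).sum(axis=2)
        labels = d.argmin(axis=1)
        # Bregman centroids are plain arithmetic means (Banerjee et al., 2005).
        for c in range(k):
            if np.any(labels == c):
                centers[c] = X[labels == c].mean(axis=0)
    return labels, centers

# Illustrative usage on a two-component Poisson mixture:
rng = np.random.default_rng(1)
X = np.vstack([rng.poisson(lam, size=(100, 2)) for lam in (3.0, 15.0)])
labels, centers = poisson_bregman_kmeans(X, k=2)
print(centers)  # should sit near (3, 3) and (15, 15)
```

With squared Euclidean distance in place of the generalized KL divergence, the same loop reduces to ordinary k-means, which is the Gaussian instance of this correspondence.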

References:

[0] See for example: https://stats.libretexts.org/Bookshelves/Probability_Theory/Probability_Mathematical_Statistics_and_Stochastic_Processes_(Siegrist)/14%3A_The_Poisson_Process/14.05%3A_Thinning_and_Superpositon or https://www.probabilitycourse.com/chapter11/11_1_3_merging_and_splitting_poisson_processes.php

[1] Neufeld, A., Dharamshi, A., Gao, L. L., & Witten, D. (2024). Data thinning for convolution-closed distributions. Journal of Machine Learning Research, 25(57), 1-35.

[2] Dharamshi, A., Neufeld, A., Motwani, K., Gao, L. L., Witten, D., & Bien, J. (2024). Generalized data thinning using sufficient statistics. Journal of the American Statistical Association, 1-26.

[3] Banerjee, A., Merugu, S., Dhillon, I. S., Ghosh, J., & Lafferty, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6(10).