Bayesian Optimization for Soft Clustering

Contact: Anthony Bardou

If you are interested in this project, please send a resume and transcripts to Anthony Bardou and Maximilien Dreveton.

Context

Mixture models are a flexible and powerful tool for modeling data generated from multiple underlying distributions. By assuming the data arises from a mixture of components, they can uncover latent structures and handle multimodal or heterogeneous data [MM1].
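As a concrete illustration (not part of the project description itself), the short sketch below fits a two-component Gaussian mixture to synthetic multimodal data and reads out the soft cluster assignments. It assumes NumPy and scikit-learn are available.

    # A minimal mixture-model sketch, assuming NumPy and scikit-learn.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # Synthetic 1-D data drawn from two latent components.
    data = np.concatenate([
        rng.normal(loc=-2.0, scale=0.5, size=300),
        rng.normal(loc=3.0, scale=1.0, size=200),
    ]).reshape(-1, 1)

    # Fit a two-component Gaussian mixture with EM.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
    print("means:", gmm.means_.ravel())
    # Soft clustering: each point gets posterior responsibilities,
    # not a hard label.
    print("responsibilities:", gmm.predict_proba(data[:3]))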

Student's t-distribution is a compelling alternative to the Gaussian distribution in mixture models, particularly in scenarios where robustness to outliers and heavy-tailed data is crucial [MM2]. Unlike the Gaussian distribution, whose light tails make it highly sensitive to large deviations from the mean, Student's t-distribution incorporates a tunable degrees-of-freedom parameter that controls tail heaviness. By using t-distributions in mixture models, we can achieve more robust clustering and a better representation of real-world data distributions.
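The effect of the degrees-of-freedom parameter is easy to see numerically. The SciPy sketch below (illustrative only) compares Gaussian and Student-t tail probabilities for several values of nu.

    # Tail heaviness as a function of the degrees of freedom (SciPy sketch).
    from scipy.stats import norm, t

    # Probability of a value falling more than 4 scale units from the center.
    print("Gaussian tail P(|X| > 4):", 2 * norm.sf(4.0))
    for nu in (1, 3, 10, 100):
        # Small nu gives heavy tails; as nu grows, the t-distribution
        # approaches the Gaussian.
        print(f"Student-t (nu={nu}) tail:", 2 * t.sf(4.0, df=nu))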

However, estimating the degrees-of-freedom parameter of a Student's t-distribution is challenging because no closed-form expression exists for its maximum likelihood estimate, and various numerical schemes have been proposed in the statistics literature [S1].
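To make the difficulty concrete, the sketch below profiles the log-likelihood over nu numerically, with location and scale held fixed for simplicity. In practice, [S1] embeds such estimation inside EM-type algorithms (EM, ECM, ECME).

    # Numerical ML estimation of the degrees of freedom (no closed form),
    # here with location 0 and scale 1 held fixed for simplicity.
    from scipy.optimize import minimize_scalar
    from scipy.stats import t

    sample = t.rvs(df=5.0, size=2000, random_state=1)

    def neg_log_likelihood(nu):
        # The likelihood has no closed-form maximizer in nu,
        # hence the numerical search.
        return -t.logpdf(sample, df=nu).sum()

    # One-dimensional bounded search over nu.
    result = minimize_scalar(neg_log_likelihood, bounds=(0.5, 50.0),
                             method="bounded")
    print("estimated nu:", result.x)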

The Bayesian Optimization (BO) framework provides sample-efficient algorithms able to globally optimize an arbitrary black-box objective function [BO1]. As such, it has the potential to greatly improve the search for the optimal hyperparameters of clustering models when closed-form expressions are unavailable. This in turn could lead to a drastic performance increase for clustering algorithms in several difficult settings (e.g., exploitation of richer mixture models, large numbers of components in the mixture [BO2], dynamic datasets [BO3]).
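For intuition, here is a deliberately minimal BO loop (Gaussian process surrogate plus expected improvement) on a one-dimensional toy objective, assuming scikit-learn and SciPy are available. Production BO methods, including those in [BO2] and [BO3], are considerably more sophisticated.

    # A minimal BO sketch: GP surrogate + expected improvement (minimization).
    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import Matern

    def black_box(x):
        # Stand-in for an expensive objective, known only through evaluations.
        return np.sin(3.0 * x) + 0.5 * x

    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 5.0, size=(3, 1))   # small initial design
    y = black_box(X).ravel()
    grid = np.linspace(0.0, 5.0, 500).reshape(-1, 1)

    for _ in range(15):
        # Fit the GP surrogate to all observations so far.
        gp = GaussianProcessRegressor(kernel=Matern(nu=2.5),
                                      normalize_y=True).fit(X, y)
        mu, sigma = gp.predict(grid, return_std=True)
        best = y.min()
        # Expected improvement as the acquisition function.
        z = (best - mu) / np.maximum(sigma, 1e-9)
        ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
        x_next = grid[np.argmax(ei)].reshape(1, -1)
        X = np.vstack([X, x_next])
        y = np.append(y, black_box(x_next).ravel())

    print("best x found:", X[np.argmin(y)][0], "value:", y.min())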

Subject

As far as we know, the interaction between the clustering framework and the BO framework remains largely unexplored.

The student will participate in the design and implementation of clustering algorithms that exploit BO, and will run experiments comparing them to state-of-the-art clustering solutions. One hypothetical form such an interaction could take is sketched below.
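As an illustration only, and not the project's method: BO can tune a mixture model's hyperparameters against held-out log-likelihood. The sketch below uses scikit-optimize's gp_minimize to select the number of components of a Gaussian mixture; the degrees of freedom of a t-mixture could be tuned in the same way.

    # A hypothetical BO-for-clustering sketch using scikit-optimize.
    import numpy as np
    from skopt import gp_minimize
    from skopt.space import Integer
    from sklearn.mixture import GaussianMixture
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    # Synthetic 2-D data from three latent components.
    data = np.concatenate([
        rng.normal(-3.0, 1.0, size=(400, 2)),
        rng.normal(2.0, 0.5, size=(300, 2)),
        rng.normal(6.0, 1.5, size=(300, 2)),
    ])
    train, valid = train_test_split(data, random_state=0)

    def objective(params):
        (k,) = params
        gmm = GaussianMixture(n_components=k, random_state=0).fit(train)
        # BO minimizes, so return the negative held-out log-likelihood.
        return -gmm.score(valid)

    result = gp_minimize(objective, [Integer(1, 10)],
                         n_calls=15, random_state=0)
    print("selected number of components:", result.x[0])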

What's in it for You?

This project is an opportunity to get familiar with both the clustering and the BO frameworks. Should you make significant contributions to the development of a competitive clustering algorithm with BO at its core, you will be given the opportunity to author a publication submitted to a top AI/ML conference.

Skills

  • Excellent programming skills

  • Python proficiency

  • Familiarity with BO would be a plus

  • Familiarity with clustering techniques would be a plus

  • Interest in pursuing a research-oriented career would be a plus

References

[MM1] McLachlan, G. J., Lee, S. X., & Rathnayake, S. I. (2019). Finite mixture models. Annual Review of Statistics and Its Application, 6(1), 355-378.

[MM2] Peel, D., & McLachlan, G. J. (2000). Robust mixture modelling using the t distribution. Statistics and Computing, 10, 339-348.

[S1] Liu, C., & Rubin, D. B. (1995). ML estimation of the t distribution using EM and its extensions, ECM and ECME. Statistica Sinica, 5, 19-39.

[BO1] Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press.

[BO2] Bardou, A., Thiran, P., & Begin, T. (2024). Relaxing the Additivity Constraints in Decentralized No-Regret High-Dimensional Bayesian Optimization. In The Twelfth International Conference on Learning Representations (ICLR).

[BO3] Bardou, A., Thiran, P., & Ranieri, G. (2024). This Too Shall Pass: Removing Stale Observations in Dynamic Bayesian Optimization. arXiv preprint arXiv:2405.14540.