INDY Lab - Research areas

Graph Mining

Graph-structured datasets range from online social networks (OSNs), such as Facebook, LinkedIn, and Twitter, to biological networks, such as proteins-protein interaction (PPI) and gene regulatory networks. Data analytics techniques have the potential to extract knowledge and make predictions from such graph data. We develop stochastic models for large graphs, algorithms for their analysis, and fundamental results about their properties.

Recent publications

War of Words: The Competitive Dynamics of Legislative Processes
V. Kristof, M. Grossglauser and P. Thiran
The Web Conference - WWW ’20, Taipei, Taiwan, April 20-24, 2020.
[video] [code] [data] [view at publisher] [abstract]

A body of law is an example of a dynamic corpus of text documents that are jointly maintained by a group of editors who compete and collaborate in complex constellations. Our goal is to develop predictive models for this process, thereby shedding light on the competitive dynamics of parliamentarians who make laws. For this purpose, we curated a dataset of 450 000 legislative edits introduced by European parliamentarians over the last ten years. An edit modifies the status quo of a law, and could be in competition with another edit if it modifies the same part of that law. We propose a model for predicting the success of such edits, in the face of both the inertia of the status quo and the competition between overlapping edits. The parameters of this model can be interpreted in terms of the influence of parliamentarians and of the controversy of laws.

Learning Hawkes Processes from a Handful of Events
F. Salehi, W. Trouleau, M. Grossglauser and P. Thiran
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada, December 8-14, 2019.
[code] [abstract]

Learning the causal-interaction network of multivariate Hawkes processes is a useful task in many applications. Maximum-likelihood estimation is the most common approach to solve the problem in the presence of long observation sequences. However, when only short sequences are available, the lack of data amplifies the risk of overfitting and regularization becomes critical. Due to the challenges of hyper-parameter tuning, state-of-the-art methods only parameterize regularizers by a single shared hyper-parameter, hence limiting the power of representation of the model. To solve both issues, we develop in this work an efficient algorithm based on variational expectation-maximization. Our approach is able to optimize over an extended set of hyper-parameters. It is also able to take into account the uncertainty in the model parameters by learning a posterior distribution over them. Experimental results on both synthetic and real datasets show that our approach significantly outperforms state-of-the-art methods under short observation sequences.

Learning Hawkes Processes Under Synchronization Noise
W. Trouleau, J. Etesami, M. Grossglauser, N. Kiyavash and P. Thiran
36th International Conference on Machine Learning, Long Beach, California, USA, June 9-15, 2019.
[abstract]

Multivariate Hawkes processes (MHP) are widely used in a variety of fields to model the occurrence of discrete events. Prior work on learning MHPs has only focused on inference in the presence of perfect traces without noise. We address the problem of learning the causal structure of MHPs when observations are subject to an unknown delay. In particular, we introduce the so-called synchronization noise, where the stream of events generated by each dimension is subject to a random and unknown time shift. We characterize the robustness of the classic maximum likelihood estimator to synchronization noise, and we introduce a new approach for learning the causal structure in the presence of noise. Our experimental results show that our approach accurately recovers the causal structure of MHPs for a wide range of noise levels, and significantly outperforms classic estimation methods.

Mobility Mining

The mobility traces generated by millions of users on the move with their smartphones is central in many sectors of the digital economy. Services built on mobility fuel a host of new business models, including location-based advertisement, navigation and transportation, and augmented reality. We explore machine learning problems and applications that exploit the rich structure in mobility data within a large population of mobile agents.

Recent publications

Uncovering Latent Behaviors in Ant Colonies
M. Kafsi, R. Braunschweig, D. Mersch, M. Grossglauser, L. Keller and P. Thiran
2016 SIAM International Conference on Data Mining, Miami, Florida, USA, May 5-7, 2016.
[full text] [view at publisher] [abstract]

Many biological systems exhibit collective behaviors that strengthen their adaptability to their environment, compared to more solitary species. Describing these behaviors is challenging yet necessary in order to understand these biological systems. We propose a probabilistic model that enables us to uncover the collective behaviors observed in a colony of ants. This model is based on the assumption that the behavior of an individual ant is a time-dependent mixture of latent behaviors that are specific to the whole colony. We apply this model to a large-scale dataset obtained by observing the mobility of nearly 1000 Camponotus fellah ants from six different colonies. Our results indicate that a colony typically exhibits three classes of behaviors, each characterized by a specific spatial distribution and a level of activity. Moreover, these spatial distributions, which are uncovered automatically by our model, match well with the ground truth as manually annotated by domain experts. We further explore the evolution of the behavior of individual ants and show that it is well captured by a second order Markov chain that encodes the fact that the future behavior of an ant depends not only on its current behavior but also on its preceding one.

Traveling Salesman in Reverse: Conditional Markov Entropy for Trajectory Segmentation
M. Kafsi, M. Grossglauser and P. Thiran
2015 IEEE International Conference on Data Mining, Atlantic City, NJ, USA, November 14-17, 2015.
[full text] [view at publisher] [abstract]

We are interested in inferring the set of waypoints (or intermediate destinations) of a mobility trajectory in the absence of timing information. We find that, by mining a dataset of real mobility traces, computing the entropy of conditional Markov trajectory enables us to uncover waypoints, even though no timing information nor absolute geographic location is provided. We build on this observation and design an efficient algorithm for trajectory segmentation. Our empirical evaluation demonstrates that the entropy-based heuristic used by our segmentation algorithm outperforms alternative approaches as it is 43% more accurate than a geometric approach and 20% more accurate than path-stretch based approach. We further explore the link between trajectory entropy, mobility predictability and the nature of intermediate locations using a route choice model on real city maps.

Hierarchical Routing Over Dynamic Wireless Networks
D. Tschopp, S. Diggavi and M. Grossglauser
Random Structures & Algorithms, 2015.
[view at publisher] [abstract]

The topology of a mobile wireless network changes over time. Maintaining routes between all nodes requires the continuous transmission of control information, which consumes precious power and bandwidth resources. Many routing protocols have been developed, trading off control overhead and route quality. In this paper, we ask whether there exist low-overhead schemes that produce low-stretch routes, even in large networks where all the nodes are mobile. We present a scheme that maintains a hierarchical structure within which constant-stretch routes can be efficiently computed between every pair of nodes. The scheme rebuilds each level of the hierarchy periodically, at a rate that decreases exponentially with the level of the hierarchy. We prove that this scheme achieves constant stretch under a mild smoothness condition on the nodal mobility processes. Furthermore, we prove tight bounds for the network-wide control overhead under the additional assumption of the connectivity graph forming a doubling metric space. Specifically, we show that for a connectivity model combining the random geometric graph with obstacles, constant-stretch routes can be maintained with a total overhead of n log2 n bits of control information per time unit.

Epidemics

Epidemic models capture how infectuous processes propagate on graphs, such as a disease through a contact graph, or an idea through a social network. These models allow us to study the dynamics and long-term asymptotics of epidemic processes, and to explore measures to manage them, such as vaccination to slow the process, or deliberate infections to optimize the spread of an opinion. We are also particularly interested in estimation problems on epidemics under monitoring budget constraints.

Recent publications

A General Framework for Sensor Placement in Source Localization
B. Spinelli, E. Celis and P. Thiran
IEEE Transactions on Network Science and Engineering, 2019.
[full text] [view at publisher] [abstract]

When an epidemic spreads in a given network of individuals or communities, can we detect its source using only the information provided by a small set of nodes? We propose a general framework that incorporates two dimensions. First, we can either rely exclusively on a set of selected nodes (i.e., sensors) which always reveal their state independently of any particular epidemic (these are called static), or we can add some sensors (called dynamic) as an epidemic spreads, depending on which additional information is required. Second, the method can either localizes the source after an epidemic has spread through the entire network (offline), or while the epidemic is ongoing (online). We empirically study the performance of offline and online localization both with and without dynamic sensors. Our analysis shows that, by using dynamic sensors, the number of sensors necessary to localize the source is reduced by up to a factor of 10 and that, even with high-variance transmission delays, the source can be localized by using fewer than 5% of the nodes as sensors.

Distributed Processes on Graphs

Large scales are inherent to modern information processing systems, which can no longer be designed to work in a centralized and controlled manner, but will need to cope with a strong random component and rely on principles of self-organization. We develop distributed algorithms for information processing, message passing, decision making, learning and optimization in large-scale networks whose nodes are often randomly deployed, have very limited functionalities, and may fail for various reasons.

Discrete-Choice Models

Discrete choice models capture how we select one option among a set of alternatives. In the digital world, we are forced to make choices all the time: we click on a search result in a list, choose a song from iTunes, select a route to a destination, or pick a restaurant on Yelp. Because of this prevalence of choosing, there has been a resurgence of interest in choice models in the machine learning community. We explore new applications and stochastic models for choices, comparisons, and rankings in different contexts, and problems of large-scale inference and active learning for these models.

Recent publications

It’s All Relative: Learning Interpretable Models for Scoring Subjective Bias in Documents from Pairwise Comparisons
A. Suresh, C. H. Wu and M. Grossglauser
18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2024), Malta, March 17th -22nd, 2024.
[abstract]

We propose an interpretable model to score the subjective bias present in documents, based only on their textual content. Our model is trained on pairs of revisions of the same Wikipedia article, where one version is more biased than the other. Although prior approaches based on bias classification have struggled to obtain a high accuracy for the task, we are able to develop a useful model for scoring bias by learning to accurately perform pairwise comparisons. We show that we can interpret the parameters of the trained model to discover the words most indicative of bias. We also apply our model in three different settings by studying the temporal evolution of bias in Wikipedia articles, comparing news sources based on bias, and scoring bias in law amendments. In each case, we demonstrate that the outputs of the model can be explained and validated, even for the two domains that are outside the training-data domain. We also use the model to compare the general level of bias between domains, where we see that legal texts are the least biased and news media are the most biased, with Wikipedia articles in between.

War of Words II: Enriched Models of Law-Making Processes
V. Kristof, A. Suresh, M. Grossglauser and P. Thiran
The Web Conference 2021 (WWW '21), Ljubljana, Slovenia, April 19-23, 2021.
[view at publisher] [abstract]

The European Union law-making process is an instance of a peer- production system. We mine a rich dataset of law edits and intro- duce models predicting their adoption by parliamentary committees. Edits are proposed by parliamentarians, and they can be in conflict with edits of other parliamentarians and with the original proposition in the law. Our models combine three different categories of features: (a) Explicit features extracted from data related to the edits, the parliamentarians, and the laws, (b) latent features that capture bi-linear interactions between parliamentarians and laws, and (c) text features of the edits. We show experimentally that this combination enables us to accurately predict the success of the edits. Furthermore, it leads to model parameters that are interpretable, hence provides valuable insight into the law-making process.

Active Learning

A large class of applications require to learn tasks on the fly and/or to (inter-) actively obtain new training samples from a data-source, for example when the environment is time-varying, or when it is computationally difficult to train off-line over the whole dataset because of its large size. We explore, develop and apply learning methods that balance the competing goal of gaining more knowledge by exploring new data, with that of optimizing its operation by exploiting the best configuration of the system based on current knowledge.

Recent publications

Scalable and Efficient Comparison-based Search without Features
D. Chumbalov, L. Maystre and M. Grossglauser
37th International Conference on Machine Learning (ICML 2020), Online, July 13-18, 2020.
[abstract]

We consider the problem of finding a target object t using pairwise comparisons, by asking an oracle questions of the form “Which object from the pair (i, j) is more similar to t?”. Objects live in a space of latent features, from which the oracle generates noisy answers. First, we consider the non-blind setting where these features are accessible. We propose a new Bayesian comparison-based search algorithm with noisy answers; it has low computational complexity yet is efficient in the number of queries. We provide theoretical guarantees, deriving the form of the optimal query and proving almost sure convergence to the target t. Second, we consider the blind setting, where the object features are hidden from the search algorithm. In this setting, we combine our search method and a new distributional triplet embedding algorithm into one scalable learning framework called LEARN2SEARCH. We show that the query complexity of our approach on two real-world datasets is on par with the non-blind setting, which is not achievable using any of the current state-of-the- art embedding methods. Finally, we demonstrate the efficacy of our framework by conducting an experiment with users searching for movie actors.

A User Study of Perceived Carbon Footprint
V. Kristof, V. Quelquejay-Leclère, R. Zbinden, L. Maystre, M. Grossglauser and P. Thiran
Climate Change Workshop at NeurIPS, Vancouver, BC, Canada, December 8-14, 2019.
[abstract]

We propose a statistical model to understand people's perception of their carbon footprint. Driven by the observation that few people think of CO2 impact in absolute terms, we design a system to probe people's perception from simple pairwise comparisons of the relative carbon footprint of their actions. The formulation of the model enables us to take an active-learning approach to selecting the pairs of actions that are maximally informative about the model parameters. We define a set of 18 actions and collect a dataset of 2183 comparisons from 176 users on a university campus. The early results reveal promising directions to improve climate communication and enhance climate mitigation.

Just Sort It! A Simple and Effective Approach to Active Preference Learning
L. Maystre and M. Grossglauser
International Conference on Machine Learning, Sydney, Australia, August 6-11, 2017.
[full text] [abstract]

We address the problem of learning a ranking by using adaptively chosen pairwise comparisons. Our goal is to recover the ranking accurately but to sample the comparisons sparingly. If all comparison outcomes are consistent with the ranking, the optimal solution is to use an efficient sorting algorithm, such as Quicksort. But how do sorting algorithms behave if some comparison outcomes are inconsistent with the ranking? We give favorable guarantees for Quicksort for the popular Bradley-Terry model, under natural assumptions on the parameters. Furthermore, we empirically demonstrate that sorting algorithms lead to a very simple and effective active learning strategy: repeatedly sort the items. This strategy performs as well as state-of-the-art methods (and much better than random sampling) at a minuscule fraction of the computational cost.

Wireless and Hybrid Networking

Wireless sensor and local networks rely on principles of self-organization for their operation. We develop algorithms to schedule and route transmissions in wireless networks and optimise throughput, delay, robustness and/or energy consumption. We also evaluate the benefits of combining different technologies such as wireless and power-line communications, which are widespread in home networking, to provide accrued and reliable performance in random and time-varying environments.

Recent publications

Multi-Armed Bandit in Action: Optimizing Performance in Dynamic Hybrid Networks
S. Henri, C. Vlachou and P. Thiran
IEEE/ACM Transactions on Networking, 2018-07-27.
[view at publisher] [abstract]

Today's home networks are often composed of several technologies such as Wi-Fi or power-line communication (PLC). Yet, current network protocols rarely fully harness this diversity and use each technology for a specific, pre-defined role, for example, wired media as a backbone and the wireless medium for mobility. Moreover, a single path is generally employed to transmit data; this path is established in advance and remains in use as long as it is valid, although multiple possible paths offer more robustness against varying environments. We introduce HyMAB, an algorithm that explores different multipaths and finds the best one in a mesh hybrid network, while keeping congestion under control. We employ the multi-armed-bandit framework and prove that HyMAB achieves optimal throughput under a static scenario. HyMAB design also accounts for real-network intricacies and dynamic conditions; it adapts to varying environments and switches multipaths when needed. We implement HyMAB on a PLC/Wi-Fi test bed. This is, to the best of our knowledge, the first implementation on a real test bed of multi-armed-bandit strategies in the context of routing. Our experimental results confirm the optimality of HyMAB and its ability to adapt to dynamic network conditions, as well as the gains provided by employing multi-armed-bandit strategies.

EMPoWER Hybrid Networks: Exploiting Multiple Paths over Wireless and ElectRical Mediums
S. Henri, C. Vlachou, J. Herzen and P. Thiran
ACM Conference on emerging Networking EXperiments and Technologies (CoNEXT) 2016, Irvine, California, USA, December 12-15, 2016.
[full text] [view at publisher] [abstract]

Several technologies, such as WiFi, Ethernet and power-line communications (PLC), can be used to build residential and enterprise networks. These technologies often co-exist; most networks use WiFi, and buildings are readily equipped with electrical wires that can offer a capacity up to 1 Gbps with PLC. Yet, current networks do not exploit this rich diversity and often operate far below the available capacity. We design, implement, and evaluate EMPoWER, a system that exploits simultaneously several potentially-interfering mediums. It operates at layer 2.5, between the MAC and IP layers, and combines routing (to find multiple concurrent routes) and congestion control (to efficiently balance traffic across the routes). To optimize resource utilization and robustness, both components exploit the heterogeneous nature of the network. They are fair and efficient, and they operate only within the local area network, without affecting remote Internet hosts. We demonstrate the performance gains of EMPoWER, by simulations and experiments on a 22-node testbed. We show that PLC/WiFi, benefiting from the diversity offered by wireless and electrical mediums, provides significant throughput gains (up to 10x) and improves coverage, compared to multi-channel WiFi.

Electri-Fi Your Data: Measuring and Combining Power-Line Communications with WiFi
C. Vlachou, S. Henri and P. Thiran
ACM Internet Measurement Conference (IMC) 2015, Tokyo, Japan, October 28-30, 2015.
[full text] [view at publisher] [abstract]

Power-line communication (PLC) is widely used as it offers high data-rates and forms a network over electrical wiring, an existing and ubiquitous infrastructure. PLC is increasingly being deployed in hybrid networks that combine multiple technologies, the most popular among which is WiFi. However, so far, it is not clear to which extent PLC can boost network performance or how hybrid implementations can exploit to the fullest this technology. We compare the spatial and temporal variations of WiFi and PLC. Despite the potential of PLC and its vast deployment in commercial products, little is known about its performance. To route or load balance traffic in hybrid networks, a solid understanding of PLC and its link metrics is required. We conduct experiments in a testbed of more than 140 links. We introduce link metrics that are crucial for studying PLC and that are required for quality-aware algorithms by recent standardizations of hybrid networks. We explore the spatial and temporal variation of PLC channels, showing that they are highly asymmetric and that link quality and link-metric temporal variability are strongly correlated. Based on our variation study, we propose and validate a capacity estimation technique via a metric that only uses the frame header. We also focus on retransmissions due to channel errors or to contention, a metric related to delay, and examine the sensitivity of metrics to background traffic. Our performance evaluation provides insight into the implementation of hybrid networks; we ease the intricacies of understanding the performance characteristics of the PHY and MAC layers.