Graph Mining

Graph-structured datasets range from online social networks (OSNs), such as Facebook, LinkedIn, and Twitter, to biological networks, such as protein-protein interaction (PPI) networks and gene regulatory networks. Data analytics techniques have the potential to extract knowledge and make predictions from such graph data. We develop stochastic models for large graphs, algorithms for their analysis, and fundamental results about their properties.

Recent publications

Learning Hawkes Processes from a Handful of Events
F. Salehi, W. Trouleau, M. Grossglauser and P. Thiran
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada, December 8-14, 2019.
[full text]

Learning the causal-interaction network of multivariate Hawkes processes is a useful task in many applications. Maximum-likelihood estimation is the most common approach to solve the problem in the presence of long observation sequences. However, when only short sequences are available, the lack of data amplifies the risk of overfitting and regularization becomes critical. Due to the challenges of hyper-parameter tuning, state-of-the-art methods only parameterize regularizers by a single shared hyper-parameter, hence limiting the representational power of the model. To solve both issues, we develop in this work an efficient algorithm based on variational expectation-maximization. Our approach can optimize over an extended set of hyper-parameters, and it accounts for the uncertainty in the model parameters by learning a posterior distribution over them. Experimental results on both synthetic and real datasets show that our approach significantly outperforms state-of-the-art methods under short observation sequences.
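
The estimators discussed above all build on the multivariate Hawkes likelihood. As a point of reference, here is a minimal sketch (not the paper's variational expectation-maximization algorithm) of the exponential-kernel intensity and log-likelihood that maximum-likelihood and regularized estimators maximize; the variable names and the single shared decay rate beta are illustrative assumptions.

```python
import numpy as np

def hawkes_log_likelihood(events, mu, alpha, beta, T):
    """Log-likelihood of a multivariate Hawkes process with exponential kernels.

    events : list of np.ndarray; events[i] holds the event times of dimension i
    mu     : (d,) baseline intensities
    alpha  : (d, d) excitation matrix; alpha[i, j] = influence of dimension j on i
    beta   : decay rate of the exponential kernel (assumed shared, for brevity)
    T      : observation horizon
    """
    d = len(events)
    ll = 0.0
    for i in range(d):
        # Log-intensity of dimension i evaluated at each of its own events.
        for t in events[i]:
            excitation = sum(
                alpha[i, j] * np.sum(np.exp(-beta * (t - events[j][events[j] < t])))
                for j in range(d)
            )
            ll += np.log(mu[i] + excitation)
        # Compensator: integral of the intensity of dimension i over [0, T].
        ll -= mu[i] * T + sum(
            (alpha[i, j] / beta) * np.sum(1.0 - np.exp(-beta * (T - events[j])))
            for j in range(d)
        )
    return ll
```

The nested sums make this quadratic in the number of events per dimension; recursive formulations and, as in the paper, a posterior over the parameters learned by variational inference are what make estimation practical on short sequences.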

Learning Hawkes Processes Under Synchronization Noise
W. Trouleau, J. Etesami, M. Grossglauser, N. Kiyavash and P. Thiran
36th International Conference on Machine Learning, Long Beach, California, USA, June 9-15, 2019.

Multivariate Hawkes processes (MHP) are widely used in a variety of fields to model the occurrence of discrete events. Prior work on learning MHPs has only focused on inference in the presence of perfect traces without noise. We address the problem of learning the causal structure of MHPs when observations are subject to an unknown delay. In particular, we introduce the so-called synchronization noise, where the stream of events generated by each dimension is subject to a random and unknown time shift. We characterize the robustness of the classic maximum likelihood estimator to synchronization noise, and we introduce a new approach for learning the causal structure in the presence of noise. Our experimental results show that our approach accurately recovers the causal structure of MHPs for a wide range of noise levels, and significantly outperforms classic estimation methods.
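
To make the noise model above concrete, the following sketch applies synchronization noise to a set of event streams: each dimension is shifted by its own random, unknown offset before estimation. The uniform offset distribution and its scale are assumptions made purely for illustration.

```python
import numpy as np

def add_synchronization_noise(events, noise_scale, rng=None):
    """Shift each dimension's event stream by an independent, unknown offset.

    events      : list of np.ndarray; events[i] holds the event times of dimension i
    noise_scale : spread of the per-dimension time shifts (illustrative choice)
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = []
    for times in events:
        shift = rng.uniform(-noise_scale, noise_scale)  # hidden from the estimator
        noisy.append(np.sort(times + shift))
    return noisy
```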

PROPER: global protein interaction network alignment through percolation matching
E. Kazemi, H. Hassani, M. Grossglauser and H. P. Modarres
BMC Bioinformatics, 2016.
[full text] [view at publisher]

Background: The alignment of protein-protein interaction (PPI) networks enables us to uncover the relationships between different species, which leads to a deeper understanding of biological systems. Network alignment can be used to transfer biological knowledge between species. Although different PPI-network alignment algorithms were introduced during the last decade, developing an accurate and scalable algorithm that can find alignments with high biological and structural similarities among PPI networks is still challenging. Results: In this paper, we introduce a new global network alignment algorithm for PPI networks called PROPER. Compared to other global network alignment methods, our algorithm shows higher accuracy and speed over real PPI datasets and synthetic networks. We show that the PROPER algorithm can detect large portions of conserved biological pathways between species. Also, using a simple parsimonious evolutionary model, we explain why PROPER performs well based on several different comparison criteria. Conclusions: We highlight that PROPER has high potential in further applications such as detecting biological pathways, finding protein complexes and PPI prediction. The PROPER algorithm is available at http://proper.epfl.ch.
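
PROPER builds on percolation matching: starting from a set of seed pairs (obtained, for instance, from sequence similarity), candidate pairs of proteins accumulate marks from already-matched neighboring pairs and are matched once they collect enough marks. The sketch below illustrates this seed-and-expand idea on generic adjacency dictionaries; it is a simplified percolation-matching loop under an assumed mark threshold r, not the full PROPER algorithm.

```python
from collections import defaultdict

def percolation_match(neigh1, neigh2, seeds, r=2):
    """Seed-and-expand (percolation-style) matching of two graphs.

    neigh1, neigh2 : dict mapping each node to the set of its neighbors
    seeds          : iterable of already-matched (u, v) pairs
    r              : number of marks a candidate pair needs before it is matched
    """
    matched = dict(seeds)            # node in graph 1 -> node in graph 2
    used2 = set(matched.values())
    marks = defaultdict(int)
    frontier = list(matched.items())
    while frontier:
        u, v = frontier.pop()
        # Each matched pair spreads one mark to every neighboring candidate pair.
        for u2 in neigh1[u]:
            if u2 in matched:
                continue
            for v2 in neigh2[v]:
                if v2 in used2:
                    continue
                marks[(u2, v2)] += 1
                if marks[(u2, v2)] >= r:
                    matched[u2] = v2
                    used2.add(v2)
                    frontier.append((u2, v2))
                    break
    return matched
```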

Mobility Mining

The mobility traces generated by millions of users on the move with their smartphones are central to many sectors of the digital economy. Services built on mobility fuel a host of new business models, including location-based advertisement, navigation and transportation, and augmented reality. We explore machine learning problems and applications that exploit the rich structure in mobility data within a large population of mobile agents.

Recent publications

Uncovering Latent Behaviors in Ant Colonies
M. Kafsi, R. Braunschweig, D. Mersch, M. Grossglauser, L. Keller and P. Thiran
2016 SIAM International Conference on Data Mining, Miami, Florida, USA, May 5-7, 2016.
[full text] [view at publisher]

Many biological systems exhibit collective behaviors that strengthen their adaptability to their environment, compared to more solitary species. Describing these behaviors is challenging yet necessary in order to understand these biological systems. We propose a probabilistic model that enables us to uncover the collective behaviors observed in a colony of ants. This model is based on the assumption that the behavior of an individual ant is a time-dependent mixture of latent behaviors that are specific to the whole colony. We apply this model to a large-scale dataset obtained by observing the mobility of nearly 1000 Camponotus fellah ants from six different colonies. Our results indicate that a colony typically exhibits three classes of behaviors, each characterized by a specific spatial distribution and a level of activity. Moreover, these spatial distributions, which are uncovered automatically by our model, match well with the ground truth as manually annotated by domain experts. We further explore the evolution of the behavior of individual ants and show that it is well captured by a second-order Markov chain that encodes the fact that the future behavior of an ant depends not only on its current behavior but also on its preceding one.
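
The second-order Markov chain mentioned above can be summarized with a small counting sketch: given sequences of (already inferred) behavior labels, estimate P(next behavior | previous behavior, current behavior). The additive smoothing constant is an illustrative assumption, and the paper's model additionally infers the latent behaviors themselves rather than taking labels as given.

```python
from collections import defaultdict

def fit_second_order_markov(sequences, states, smoothing=1.0):
    """Estimate P(next | previous, current) from labeled behavior sequences.

    sequences : list of lists of behavior labels (e.g., one list per ant)
    states    : iterable of all possible behavior labels
    smoothing : additive (Laplace) smoothing constant -- illustrative choice
    """
    states = list(states)
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for prev, curr, nxt in zip(seq, seq[1:], seq[2:]):
            counts[(prev, curr)][nxt] += 1.0
    transitions = {}
    for context, nxt_counts in counts.items():
        total = sum(nxt_counts.values()) + smoothing * len(states)
        transitions[context] = {
            s: (nxt_counts.get(s, 0.0) + smoothing) / total for s in states
        }
    return transitions
```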

Traveling Salesman in Reverse: Conditional Markov Entropy for Trajectory Segmentation
M. Kafsi, M. Grossglauser and P. Thiran
2015 IEEE International Conference on Data Mining, Atlantic City, NJ, USA, November 14-17, 2015.
[full text] [view at publisher]

We are interested in inferring the set of waypoints (or intermediate destinations) of a mobility trajectory in the absence of timing information. We find that, by mining a dataset of real mobility traces, computing the entropy of conditional Markov trajectories enables us to uncover waypoints, even though neither timing information nor absolute geographic location is provided. We build on this observation and design an efficient algorithm for trajectory segmentation. Our empirical evaluation demonstrates that the entropy-based heuristic used by our segmentation algorithm outperforms alternative approaches, as it is 43% more accurate than a geometric approach and 20% more accurate than a path-stretch-based approach. We further explore the link between trajectory entropy, mobility predictability and the nature of intermediate locations using a route choice model on real city maps.
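
The key quantity here is the entropy of Markov trajectories between two locations. The paper derives closed-form expressions for it; the sketch below only illustrates the quantity with a plain Monte Carlo estimate on a first-order Markov chain, assuming the destination is absorbing and reached with high probability within the simulation horizon.

```python
import numpy as np

def trajectory_entropy_mc(P, source, target, n_samples=5000, max_len=1000, rng=None):
    """Monte Carlo estimate of the entropy of Markov trajectories from source to target.

    P : (n, n) row-stochastic transition matrix with `target` absorbing.
    Returns an estimate of H = -E[log P(trajectory)], in nats.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = P.shape[0]
    log_probs = []
    for _ in range(n_samples):
        state, logp = source, 0.0
        for _ in range(max_len):
            if state == target:
                break
            nxt = rng.choice(n, p=P[state])
            logp += np.log(P[state, nxt])
            state = nxt
        if state == target:   # discard the (rare) walks truncated before absorption
            log_probs.append(logp)
    return -float(np.mean(log_probs))
```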

Hierarchical Routing Over Dynamic Wireless Networks
D. Tschopp, S. Diggavi and M. Grossglauser
Random Structures & Algorithms, 2015.
[view at publisher]

The topology of a mobile wireless network changes over time. Maintaining routes between all nodes requires the continuous transmission of control information, which consumes precious power and bandwidth resources. Many routing protocols have been developed, trading off control overhead and route quality. In this paper, we ask whether there exist low-overhead schemes that produce low-stretch routes, even in large networks where all the nodes are mobile. We present a scheme that maintains a hierarchical structure within which constant-stretch routes can be efficiently computed between every pair of nodes. The scheme rebuilds each level of the hierarchy periodically, at a rate that decreases exponentially with the level of the hierarchy. We prove that this scheme achieves constant stretch under a mild smoothness condition on the nodal mobility processes. Furthermore, we prove tight bounds for the network-wide control overhead under the additional assumption of the connectivity graph forming a doubling metric space. Specifically, we show that for a connectivity model combining the random geometric graph with obstacles, constant-stretch routes can be maintained with a total overhead of n log² n bits of control information per time unit.
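
The rebuild schedule described above is easy to picture: each level of the hierarchy is refreshed exponentially less often than the level below it. The sketch below shows one such schedule; the factor-of-two period is an illustrative assumption, not the constant analyzed in the paper.

```python
def levels_to_rebuild(t, num_levels, base_period=1):
    """Hierarchy levels due for a rebuild at integer time t.

    Level k is rebuilt every base_period * 2**k time units, so higher levels
    (covering larger regions of the network) are refreshed exponentially less often.
    """
    return [k for k in range(num_levels) if t % (base_period * 2 ** k) == 0]
```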

Epidemics

Epidemic models capture how infectious processes propagate on graphs, such as a disease through a contact graph, or an idea through a social network. These models allow us to study the dynamics and long-term asymptotics of epidemic processes, and to explore measures to manage them, such as vaccination to slow the process, or deliberate infections to optimize the spread of an opinion. We are also particularly interested in estimation problems on epidemics under monitoring budget constraints.

Recent publications

Learning Hawkes Processes from a Handful of Events
F. Salehi, W. Trouleau, M. Grossglauser and P. Thiran
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada, December 8-14, 2019.
[full text]

Learning the causal-interaction network of multivariate Hawkes processes is a useful task in many applications. Maximum-likelihood estimation is the most common approach to solve the problem in the presence of long observation sequences. However, when only short sequences are available, the lack of data amplifies the risk of overfitting and regularization becomes critical. Due to the challenges of hyper-parameter tuning, state-of-the-art methods only parameterize regularizers by a single shared hyper-parameter, hence limiting the representational power of the model. To solve both issues, we develop in this work an efficient algorithm based on variational expectation-maximization. Our approach can optimize over an extended set of hyper-parameters, and it accounts for the uncertainty in the model parameters by learning a posterior distribution over them. Experimental results on both synthetic and real datasets show that our approach significantly outperforms state-of-the-art methods under short observation sequences.

Learning Hawkes Processes Under Synchronization Noise
W. Trouleau, J. Etesami, M. Grossglauser, N. Kiyavash and P. Thiran
36th International Conference on Machine Learning, Long Beach, California, USA, June 9-15, 2019.

Multivariate Hawkes processes (MHP) are widely used in a variety of fields to model the occurrence of discrete events. Prior work on learning MHPs has only focused on inference in the presence of perfect traces without noise. We address the problem of learning the causal structure of MHPs when observations are subject to an unknown delay. In particular, we introduce the so-called synchronization noise, where the stream of events generated by each dimension is subject to a random and unknown time shift. We characterize the robustness of the classic maximum likelihood estimator to synchronization noise, and we introduce a new approach for learning the causal structure in the presence of noise. Our experimental results show that our approach accurately recovers the causal structure of MHPs for a wide range of noise levels, and significantly outperforms classic estimation methods.

A General Framework for Sensor Placement in Source Localization
B. Spinelli, E. Celis and P. Thiran
IEEE Transactions on Network Science and Engineering, 2017.
[full text] [view at publisher]

When an epidemic spreads in a given network of individuals or communities, can we detect its source using only the information provided by a small set of nodes? We propose a general framework that incorporates two dimensions. First, we can either rely exclusively on a set of selected nodes (i.e., sensors) which always reveal their state independently of any particular epidemic (these are called static), or we can add some sensors (called dynamic) as an epidemic spreads, depending on which additional information is required. Second, the method can localize the source either after an epidemic has spread through the entire network (offline) or while the epidemic is ongoing (online). We empirically study the performance of offline and online localization both with and without dynamic sensors. Our analysis shows that, by using dynamic sensors, the number of sensors necessary to localize the source is reduced by up to a factor of 10 and that, even with high-variance transmission delays, the source can be localized by using fewer than 5% of the nodes as sensors.
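
As a concrete illustration of what static sensors make possible, the sketch below scores every node as a candidate source given the infection times observed at the sensors, assuming (purely for illustration) that the epidemic spreads at roughly one hop per unit time. This generic distance-based heuristic is not the framework proposed in the paper, which also covers dynamic sensors and online localization.

```python
import networkx as nx
import numpy as np

def localize_source(G, sensor_times):
    """Rank candidate sources given infection times observed at static sensors.

    G            : networkx graph of the contact network
    sensor_times : dict mapping sensor node -> observed infection time
    The best candidate minimizes the variance of the implied start times
    (observed time minus graph distance), under a unit-speed spreading assumption.
    """
    scores = {}
    for candidate in G.nodes:
        dist = nx.single_source_shortest_path_length(G, candidate)
        implied_starts = [sensor_times[s] - dist[s] for s in sensor_times if s in dist]
        if len(implied_starts) == len(sensor_times):
            scores[candidate] = float(np.var(implied_starts))
    return min(scores, key=scores.get)
```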

Distributed Processes on Graphs

Large scale is inherent to modern information-processing systems, which can no longer be designed to work in a centralized and controlled manner, but must cope with a strong random component and rely on principles of self-organization. We develop distributed algorithms for information processing, message passing, decision making, learning and optimization in large-scale networks whose nodes are often randomly deployed, have very limited functionalities, and may fail for various reasons.

Discrete-Choice Models

Discrete choice models capture how we select one option among a set of alternatives. In the digital world, we are forced to make choices all the time: we click on a search result in a list, choose a song from iTunes, select a route to a destination, or pick a restaurant on Yelp. Because such choices are so pervasive, there has been a resurgence of interest in choice models in the machine learning community. We explore new applications and stochastic models for choices, comparisons, and rankings in different contexts, as well as problems of large-scale inference and active learning for these models.

Recent publications

Pairwise Comparisons with Flexible Time-Dynamics
L. Maystre, V. Kristof and M. Grossglauser
25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD), Anchorage, AK, August 04-08, 2019.
[view at publisher]

Inspired by applications in sports where the skill of players or teams competing against each other varies over time, we propose a probabilistic model of pairwise-comparison outcomes that can capture a wide range of time dynamics. We achieve this by replacing the static parameters of a class of popular pairwise-comparison models by continuous-time Gaussian processes; the covariance function of these processes enables expressive dynamics. We develop an efficient inference algorithm that computes an approximate Bayesian posterior distribution. Despite the flexibility of our model, our inference algorithm requires only a few linear-time iterations over the data and can take advantage of modern multiprocessor computer architectures. We apply our model to several historical databases of sports outcomes and find that our approach a) outperforms competing approaches in terms of predictive performance, b) scales to millions of observations, and c) generates compelling visualizations that help in understanding and interpreting the data.
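
A minimal sketch of the generative side of such a model: each competitor's skill is a function drawn from a Gaussian process, and the outcome of a comparison at a given time follows a logistic (Bradley-Terry style) law on the skill difference. The squared-exponential kernel, its hyper-parameters, and the logistic link are illustrative assumptions; the paper's contribution is the approximate Bayesian inference algorithm, which is not shown here.

```python
import numpy as np

def se_kernel(t1, t2, variance=1.0, lengthscale=50.0):
    """Squared-exponential covariance between two sets of times."""
    d = np.subtract.outer(np.asarray(t1, float), np.asarray(t2, float))
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_skill_trajectory(times, rng, variance=1.0, lengthscale=50.0):
    """Draw one skill trajectory from a zero-mean Gaussian-process prior."""
    K = se_kernel(times, times, variance, lengthscale)
    return rng.multivariate_normal(np.zeros(len(times)), K + 1e-9 * np.eye(len(times)))

def win_probability(skill_a_t, skill_b_t):
    """Probability that A beats B at a given time, via a logistic link."""
    return 1.0 / (1.0 + np.exp(-(skill_a_t - skill_b_t)))
```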

Can Who-Edits-What Predict Edit Survival?
A. B. Yardım, V. Kristof, L. Maystre and M. Grossglauser
24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, United Kingdom, August 19-23, 2018.
[view at publisher]

As the number of contributors to online peer-production systems grows, it becomes increasingly important to predict whether the edits that users make will eventually be beneficial to the project. Existing solutions either rely on a user reputation system or consist of a highly specialized predictor that is tailored to a specific peer-production system. In this work, we explore a different point in the solution space that goes beyond user reputation but does not involve any content-based feature of the edits. We view each edit as a game between the editor and the component of the project that is being edited. We posit that the probability that an edit is accepted is a function of the editor's skill, of the difficulty of editing the component and of a user-component interaction term. Our model is broadly applicable, as it only requires observing data about who makes an edit, what the edit affects and whether the edit survives or not. We apply our model on Wikipedia and the Linux kernel, two examples of large-scale peer-production systems, and we seek to understand whether it can effectively predict edit survival: in both cases, we provide a positive answer. Our approach significantly outperforms those based solely on user reputation and bridges the gap with specialized predictors that use content-based features. It is simple to implement, computationally inexpensive, and it enables us to discover interesting structure in the data.
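
The acceptance model described above boils down to a single scoring function. The sketch below shows how an edit's survival probability could be computed from an editor-skill term, a component-difficulty term, and a low-rank user-component interaction term; the logistic link and the embedding dimension are assumptions made for illustration, and the fitting of the parameters (for example by maximum likelihood) is omitted.

```python
import numpy as np

def edit_survival_probability(skill, difficulty, user_vec, comp_vec):
    """Probability that an edit survives, given already-fitted model parameters.

    skill      : scalar skill of the editor
    difficulty : scalar difficulty of editing the component
    user_vec   : latent embedding of the editor (interaction term)
    comp_vec   : latent embedding of the component (interaction term)
    """
    score = skill - difficulty + float(np.dot(user_vec, comp_vec))
    return 1.0 / (1.0 + np.exp(-score))
```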

ChoiceRank: Identifying Preferences from Node Traffic in Networks
L. Maystre and M. Grossglauser
International Conference on Machine Learning, Sydney, Australia, August 6-11, 2017.
[code] [full text]

Understanding how users navigate in a network is of high interest in many applications. We consider a setting where only aggregate node-level traffic is observed and tackle the task of learning edge transition probabilities. We cast it as a preference learning problem, and we study a model where choices follow Luce's axiom. In this case, the O(n) marginal counts of node visits are a sufficient statistic for the O(n^2) transition probabilities. We show how to make the inference problem well-posed regardless of the network's structure, and we present ChoiceRank, an iterative algorithm that scales to networks that contain billions of nodes and edges. We apply the model to two clickstream datasets and show that it successfully recovers the transition probabilities using only the network structure and marginal (node-level) traffic data. Finally, we also consider an application to mobility networks and apply the model to one year of rides on New York City's bicycle-sharing system.
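
In the spirit of the model above, the sketch below estimates node strengths under Luce's axiom from aggregate traffic only, using an MM-style fixed-point iteration: node-level arrival and departure counts are the sufficient statistics, and transition probabilities follow from the strengths. This is a simplified sketch, not necessarily the exact ChoiceRank update, which also incorporates regularization.

```python
def luce_strengths_from_traffic(out_neighbors, arrivals, departures, n_iter=100):
    """Estimate node strengths from node-level traffic under Luce's axiom.

    out_neighbors : dict mapping node -> list of out-neighbors
    arrivals      : dict mapping node -> number of observed arrivals (in-traffic)
    departures    : dict mapping node -> number of observed departures (out-traffic)
    The implied transition probability i -> j is lam[j] / sum(lam[k] for k in out_neighbors[i]).
    """
    nodes = list(out_neighbors)
    lam = {v: 1.0 for v in nodes}
    in_neighbors = {v: [] for v in nodes}
    for i, outs in out_neighbors.items():
        for j in outs:
            in_neighbors[j].append(i)
    for _ in range(n_iter):
        out_sums = {i: sum(lam[k] for k in out_neighbors[i]) for i in nodes}
        new_lam = {}
        for j in nodes:
            denom = sum(departures[i] / out_sums[i]
                        for i in in_neighbors[j] if out_sums[i] > 0)
            new_lam[j] = arrivals[j] / denom if denom > 0 else lam[j]
        total = sum(new_lam.values())                 # normalize for identifiability
        lam = {v: new_lam[v] * len(nodes) / total for v in nodes}
    return lam
```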

Active Learning

A large class of applications requires learning tasks on the fly and/or (inter-)actively obtaining new training samples from a data source, for example when the environment is time-varying or when training offline over the whole dataset is computationally difficult because of its sheer size. We explore, develop, and apply learning methods that balance the competing goals of gaining more knowledge by exploring new data and of optimizing operation by exploiting the best configuration of the system under current knowledge.

Recent publications

Just Sort It! A Simple and Effective Approach to Active Preference Learning
L. Maystre and M. Grossglauser
International Conference on Machine Learning, Sydney, Australia, August 6-11, 2017.
[full text]

We address the problem of learning a ranking by using adaptively chosen pairwise comparisons. Our goal is to recover the ranking accurately but to sample the comparisons sparingly. If all comparison outcomes are consistent with the ranking, the optimal solution is to use an efficient sorting algorithm, such as Quicksort. But how do sorting algorithms behave if some comparison outcomes are inconsistent with the ranking? We give favorable guarantees for Quicksort for the popular Bradley-Terry model, under natural assumptions on the parameters. Furthermore, we empirically demonstrate that sorting algorithms lead to a very simple and effective active learning strategy: repeatedly sort the items. This strategy performs as well as state-of-the-art methods (and much better than random sampling) at a minuscule fraction of the computational cost.
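
The strategy above is easy to sketch: run Quicksort with a noisy comparison oracle and repeat. Below is a minimal version under the Bradley-Terry outcome model; the item strengths, the oracle, and the absence of any aggregation across repeated sorts are simplifying assumptions, not the paper's full procedure.

```python
import random

def bt_compare(strength_a, strength_b):
    """Noisy comparison: True if A is judged better than B, with
    Bradley-Terry probability strength_a / (strength_a + strength_b)."""
    return random.random() < strength_a / (strength_a + strength_b)

def noisy_quicksort(items, compare):
    """Quicksort driven by a (possibly noisy) pairwise-comparison oracle."""
    if len(items) <= 1:
        return list(items)
    pivot, better, worse = items[0], [], []
    for x in items[1:]:
        # Each pair is compared exactly once, so noisy outcomes stay consistent.
        (better if compare(x, pivot) else worse).append(x)
    return noisy_quicksort(better, compare) + [pivot] + noisy_quicksort(worse, compare)
```

Repeating the sort and aggregating the resulting rankings (or the collected comparisons) yields the active-learning strategy evaluated in the paper.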

Wireless and Hybrid Networking

Wireless sensor and local networks rely on principles of self-organization for their operation. We develop algorithms to schedule and route transmissions in wireless networks and to optimize throughput, delay, robustness and/or energy consumption. We also evaluate the benefits of combining different technologies, such as wireless and power-line communications, which are widespread in home networking, to provide increased and reliable performance in random and time-varying environments.

Recent publications

Multi-Armed Bandit in Action: Optimizing Performance in Dynamic Hybrid Networks
S. Henri, C. Vlachou and P. Thiran
IEEE/ACM Transactions on Networking, 2018-07-27.
[view at publisher]

Today's home networks are often composed of several technologies such as Wi-Fi or power-line communication (PLC). Yet, current network protocols rarely fully harness this diversity and use each technology for a specific, pre-defined role, for example, wired media as a backbone and the wireless medium for mobility. Moreover, a single path is generally employed to transmit data; this path is established in advance and remains in use as long as it is valid, although multiple possible paths offer more robustness against varying environments. We introduce HyMAB, an algorithm that explores different multipaths and finds the best one in a mesh hybrid network, while keeping congestion under control. We employ the multi-armed-bandit framework and prove that HyMAB achieves optimal throughput under a static scenario. HyMAB design also accounts for real-network intricacies and dynamic conditions; it adapts to varying environments and switches multipaths when needed. We implement HyMAB on a PLC/Wi-Fi test bed. This is, to the best of our knowledge, the first implementation on a real test bed of multi-armed-bandit strategies in the context of routing. Our experimental results confirm the optimality of HyMAB and its ability to adapt to dynamic network conditions, as well as the gains provided by employing multi-armed-bandit strategies.
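
To illustrate the multi-armed-bandit viewpoint, the sketch below applies a textbook UCB1 policy to choose among candidate (multi)paths, with measured throughput (normalized to [0, 1]) as the reward. It is a generic bandit sketch under stationarity assumptions, not HyMAB itself, which additionally handles congestion control and dynamic network conditions.

```python
import math

class PathUCB:
    """UCB1-style selection among candidate paths; reward = normalized throughput."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.counts = {p: 0 for p in self.paths}
        self.means = {p: 0.0 for p in self.paths}
        self.t = 0

    def select(self):
        self.t += 1
        for p in self.paths:                      # try every path once first
            if self.counts[p] == 0:
                return p
        return max(self.paths, key=lambda p: self.means[p]
                   + math.sqrt(2.0 * math.log(self.t) / self.counts[p]))

    def update(self, path, reward):
        self.counts[path] += 1
        self.means[path] += (reward - self.means[path]) / self.counts[path]
```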

EMPoWER Hybrid Networks: Exploiting Multiple Paths over Wireless and ElectRical Mediums
S. Henri, C. Vlachou, J. Herzen and P. Thiran
ACM Conference on emerging Networking EXperiments and Technologies (CoNEXT) 2016, Irvine, California, USA, December 12-15, 2016.
[full text] [view at publisher]

Several technologies, such as WiFi, Ethernet and power-line communications (PLC), can be used to build residential and enterprise networks. These technologies often co-exist; most networks use WiFi, and buildings are readily equipped with electrical wires that can offer capacities of up to 1 Gbps with PLC. Yet, current networks do not exploit this rich diversity and often operate far below the available capacity. We design, implement, and evaluate EMPoWER, a system that exploits simultaneously several potentially interfering mediums. It operates at layer 2.5, between the MAC and IP layers, and combines routing (to find multiple concurrent routes) and congestion control (to efficiently balance traffic across the routes). To optimize resource utilization and robustness, both components exploit the heterogeneous nature of the network. They are fair and efficient, and they operate only within the local area network, without affecting remote Internet hosts. We demonstrate the performance gains of EMPoWER, by simulations and experiments on a 22-node testbed. We show that PLC/WiFi, benefiting from the diversity offered by wireless and electrical mediums, provides significant throughput gains (up to 10x) and improves coverage, compared to multi-channel WiFi.

Electri-Fi Your Data: Measuring and Combining Power-Line Communications with WiFi
C. Vlachou, S. Henri and P. Thiran
ACM Internet Measurement Conference (IMC) 2015, Tokyo, Japan, October 28-30, 2015.
[full text]

Power-line communication (PLC) is widely used as it offers high data rates and forms a network over electrical wiring, an existing and ubiquitous infrastructure. PLC is increasingly being deployed in hybrid networks that combine multiple technologies, the most popular among which is WiFi. However, so far, it is not clear to what extent PLC can boost network performance or how hybrid implementations can fully exploit this technology. We compare the spatial and temporal variations of WiFi and PLC. Despite the potential of PLC and its vast deployment in commercial products, little is known about its performance. To route or load-balance traffic in hybrid networks, a solid understanding of PLC and its link metrics is required. We conduct experiments in a testbed of more than 140 links. We introduce link metrics that are crucial for studying PLC and that are required for quality-aware algorithms by recent standardizations of hybrid networks. We explore the spatial and temporal variation of PLC channels, showing that they are highly asymmetric and that link quality and link-metric temporal variability are strongly correlated. Based on our variation study, we propose and validate a capacity estimation technique via a metric that only uses the frame header. We also focus on retransmissions due to channel errors or to contention, a metric related to delay, and examine the sensitivity of metrics to background traffic. Our performance evaluation provides insight into the implementation of hybrid networks and sheds light on the performance characteristics of the PHY and MAC layers.