Technological advances and the information era allow the collection of massive amounts of data at unprecedented resolution. Making use of this data to gain insights into complex phenomena requires characterizing the relationships among a large number of variables. Probabilistic graphical models explicitly capture the statistical relationships between the variables of interest in the form of a network. Such a representation, in addition to enhancing interpretability of the model, often enables computationally efficient inference. My group studies graphical models and develops theory, methodology and algorithms that allow these models to be applied to novel, scientifically important problems. In particular, our work to date has broken new ground by providing a systematic approach to studying Gaussian graphical models, a framework that is rich enough to capture broad phenomena, but also allows systematic statistical and computational investigations. More generally, we study models with linear constraints on the covariance matrix or its inverse, as they arise in various applications and allow efficient computation. We use a holistic approach that combines ideas from machine learning, mathematical statistics, convex optimization, combinatorics, and applied algebraic geometry. For example, by leveraging the inherent algebraic and combinatorial structure in graphical models, we have uncovered statistical and computational limitations and developed new algorithms for learning directed graphical models to perform causal inference.

 

Building on our theoretical work, my group also develops scalable algorithms with provable guarantees for applications to genomics, in particular for learning gene regulatory networks. Recent technological developments have led to an explosion of single-cell imaging and sequencing data. Since most experiments require fixing a cell, one can only obtain one data modality per cell, take one snapshot of a particular cell in time, and observe a cell either before or after a perturbation (but not both). Hence a major computational challenge going forward is how to integrate the emerging single-cell data to identify regulatory modules in health and disease. To address this challenge, my group has developed the first provably consistent causal structure discovery algorithms that can integrate observational and interventional data from gene knockout or knockdown experiments. In addition, we recently developed methods based on autoencoders and optimal transport to integrate and translate between different single-cell data modalities and data measured at different time points of a biological process. Since autoencoders have become key tools for representation learning in biological applications, my group studies their theoretical properties, and in recent work we characterized their inductive biases. Together with our geometric models that link the packing of the DNA in the cell nucleus to gene expression, our methods have led to new biomarkers for early cancer prognosis based on single-cell images of DNA-stained cell nuclei.

 

In what follows, we provide more details about our work in the above-mentioned areas and some selected publications. A complete list of publications can be found here or in my CV.

Thanks to NSF, ONR, DARPA, IBM, the Sloan Foundation and the Simons Foundation for supporting my research!

Theory and Methodology for Causal Inference

  

Causal inference is a cornerstone of scientific discovery. Genomics provides a unique opportunity for method development in causal inference, since the development of genome editing technologies has made it possible, for the first time, to obtain large-scale interventional data sets. Answering a central question regarding experimental design, we showed, quite surprisingly, that soft interventions (such as gene knockdown experiments) provide the same amount of structural causal information as hard interventions (such as gene knockout experiments), despite being less invasive. This line of work resulted in the first provably consistent algorithms for learning causal graphs that can make use of soft and hard interventions, and of non-Gaussian and zero-inflated data (as is common for single-cell RNA-seq). Most recently, we started developing optimal experimental design schemes for proposing perturbation experiments in genomics. The ultimate goal is to develop robust strategies for iteratively planning, performing, and learning from interventions, for example, to reprogram cells or direct the differentiation of pluripotent cells towards a specific cell type.
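To illustrate why interventional data carries structural information that observational data lacks, here is a minimal simulation sketch. It is a toy example and not one of the algorithms described above: the two-variable model X -> Y, the coefficient 0.8, the shift of 2.0, and the function name simulate are all assumptions made for illustration. Observationally, X and Y are correlated, which is equally compatible with X -> Y and Y -> X. A hard intervention on the effect Y removes the dependence, while a hard intervention on the cause X leaves it intact, which identifies the direction of the edge.

```python
import numpy as np

def simulate(n, rng, intervene_on=None):
    """Sample from the hypothetical linear Gaussian model X -> Y.

    intervene_on=None gives observational data; "X" or "Y" applies a hard
    intervention that replaces that variable by independent noise.
    """
    x = rng.standard_normal(n)
    if intervene_on == "X":
        x = 2.0 + rng.standard_normal(n)   # X is set externally
    y = 0.8 * x + rng.standard_normal(n)   # Y depends on X
    if intervene_on == "Y":
        y = 2.0 + rng.standard_normal(n)   # Y is set externally, cutting X -> Y
    return x, y

rng = np.random.default_rng(0)
# Sample correlation between X and Y in each of the three settings.
corr = {
    target: np.corrcoef(*simulate(20_000, rng, target))[0, 1]
    for target in (None, "X", "Y")
}
for target, value in corr.items():
    print(f"intervention on {target}: corr(X, Y) = {value:.3f}")
```

In this sketch the correlation stays clearly positive in the observational setting and under the intervention on X, and is close to zero under the intervention on Y.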

 

 
Representation Learning: Over-Parameterization and Generalization

  

Deep learning models have reached state-of-the-art accuracy in a number of tasks from image classification to machine translation, but the underlying mechanisms driving these successes are not yet well understood. Specifically, while deep networks used in practice are often over-parameterized, i.e., large enough to perfectly fit training data, these networks perform well on test data, which seemingly contradicts the notion of overfitting. Since representation learning is key in many biological applications, we have focused on neural networks used in generative modeling (autoencoders, GANs, flow-based models) to understand this phenomenon. In recent work, we showed that while over-parameterized autoencoders have the capacity to learn the identity map, they are self-regularizing and instead learn functions that are locally contractive at the training examples. In particular, we showed that increasing depth makes the autoencoders more contractive around the training examples, while increasing width enlarges the basins of attraction, thereby suggesting that wider, shallower networks may be preferable in practice. We used this insight to develop autoencoders for embedding data from large-scale drug screens and to identify drug signatures that generalize across different cell types.

 

 
Total Positivity for Modeling Positive Dependence

  

In many applications, it is of interest to model positive dependence (e.g., co-regulated genes, correlated stocks). MTP2 (multivariate total positivity of order 2), an important notion in probability theory and statistical physics introduced in the 1970s, is the strongest known form of positive dependence. While the MTP2 constraint is very restrictive and has not been used for modeling, we recently showed that MTP2 is implied by latent tree models and that, quite surprisingly, MTP2 distributions appear in practice in a variety of domains ranging from phylogenetics to finance. In addition, we showed that MTP2 has intriguing properties as an implicit regularizer for modeling in the high-dimensional setting: computing the MLE in MTP2 exponential families is a convex optimization problem. In particular, for quadratic exponential families (such as Gaussian or Ising models) it leads to sparsity of the underlying graph without the need for a tuning parameter. In fact, high-dimensional sparsistency in Gaussian graphical models under MTP2 can be obtained without any tuning parameters. More recently, we also studied MTP2 in the non-parametric setting and we exploited the MTP2 property for portfolio selection (assets are often positively dependent) and genomics (for identifying spatially clustered co-regulated genes).
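For Gaussian distributions, the MTP2 property has a simple matrix characterization: a Gaussian is MTP2 exactly when all off-diagonal entries of its precision (inverse covariance) matrix are nonpositive, i.e., the precision matrix is an M-matrix. The following is a minimal sketch of this check, not an estimation procedure; the function name is_gaussian_mtp2 and the tolerance tol are assumptions made for illustration.

```python
import numpy as np

def is_gaussian_mtp2(cov, tol=1e-10):
    """Check whether the Gaussian with covariance `cov` is MTP2.

    Uses the characterization that all off-diagonal entries of the
    precision matrix must be nonpositive (up to a numerical tolerance).
    """
    precision = np.linalg.inv(cov)
    off_diagonal = precision[~np.eye(precision.shape[0], dtype=bool)]
    return bool(np.all(off_diagonal <= tol))

# A chain-structured precision matrix with nonpositive off-diagonal entries.
K = np.array([[ 2.0, -1.0,  0.0],
              [-1.0,  2.0, -1.0],
              [ 0.0, -1.0,  2.0]])
print(is_gaussian_mtp2(np.linalg.inv(K)))

# Two negatively correlated variables cannot be MTP2.
print(is_gaussian_mtp2(np.array([[1.0, -0.5],
                                 [-0.5, 1.0]])))
```

The first example satisfies the sign constraint and the second violates it, since negative correlation produces a positive off-diagonal entry in the precision matrix.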

 
 
 
3D Genome Organization and Gene Regulation

  

The spatial proximity of chromosomes and genome regions is critical for gene activity. We introduced a new geometric model for the organization of chromosomes based on the theory of packing in mathematics. We modeled a chromosome arrangement as a minimal overlap configuration of ellipsoids of various sizes and shapes inside an ellipsoidal container, the cell nucleus. To find locally optimal ellipsoid packings, we devised a bilevel optimization procedure that provably converges to these optima. By applying this model, we were able to predict the organization of chromosomes in different cell types and how it affects gene regulation in health and disease.

Autoencoders and Optimal Transport for Early Cancer Detection

  

Abnormalities in nuclear and chromatin organization are hallmarks of many diseases including cancer. We built on our packing models using neural networks to detect subtle changes in nuclear morphometrics in single-cell fluorescence images. Importantly, our method highlights the features indicative of each class in an image, thereby yielding high classification accuracy and interpretable image features that can aid the pathologist. Early cancer detection requires studying disease progression by observing single cells over time and integrating different data modalities. However, experimental techniques such as imaging or sequencing usually destroy the cell, making controlled time-course experiments infeasible. To overcome this problem, we developed a framework based on autoencoders to integrate and translate different data modalities and combined it with optimal transport (OT) to infer cellular trajectories. Our results demonstrate the promise of computational methods based on autoencoding and OT in settings where existing experimental strategies fail.

 

Gaussian Graphical Models

  

We uncovered the deep interplay between mathematical statistics, applied algebraic geometry, and convex optimization with regard to Gaussian graphical models. By developing a geometric understanding of maximum likelihood estimation, we obtained new results on the minimum number of observations needed for existence of the maximum likelihood estimator in Gaussian graphical models. In the Bayesian treatment of Gaussian graphical models, the G-Wishart distribution plays an important role, since it serves as the conjugate prior. It had been unknown whether the normalizing constant of the G-Wishart distribution for a general graph could be represented explicitly, and a considerable body of computational literature emerged that attempted to avoid this apparent intractability. We solved this 20-year-old problem by providing an explicit representation of the G-Wishart normalizing constant for general graphs.
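The question of how many observations are needed for the MLE to exist can be illustrated on the simplest case, a chain graph. The sketch below uses the classical closed-form MLE for decomposable graphs (sum of inverted clique blocks minus inverted separator blocks), not our results for general graphs. The function name chain_graph_mle_precision, the sample sizes, and the assumption of a known zero mean are made for illustration. With fewer observations than variables the sample covariance matrix is singular, so the unconstrained MLE does not exist, yet the MLE under the chain graph constraints does.

```python
import numpy as np

def chain_graph_mle_precision(S):
    """MLE of the precision matrix for the chain graph 0 - 1 - ... - (p-1).

    Cliques are the edges {i, i+1}; separators are the interior nodes.
    Requires every 2 x 2 clique block of S to be positive definite.
    """
    p = S.shape[0]
    K = np.zeros((p, p))
    for i in range(p - 1):
        # Add the inverse of each clique block, padded with zeros.
        K[i:i + 2, i:i + 2] += np.linalg.inv(S[i:i + 2, i:i + 2])
    for j in range(1, p - 1):
        # Subtract the inverse of each separator block.
        K[j, j] -= 1.0 / S[j, j]
    return K

rng = np.random.default_rng(0)
n, p = 4, 6                      # fewer observations than variables
X = rng.standard_normal((n, p))
S = X.T @ X / n                  # sample covariance, mean assumed known to be zero

K_hat = chain_graph_mle_precision(S)
Sigma_hat = np.linalg.inv(K_hat)

print("rank of S:", np.linalg.matrix_rank(S), "out of", p)
print("smallest eigenvalue of K_hat:", np.linalg.eigvalsh(K_hat).min())
```

The estimate K_hat is positive definite and tridiagonal, and its inverse agrees with S on the diagonal and on the edges of the chain, which are exactly the likelihood equations for this model.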