© 2017 By Caroline Uhler

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Technological advances and the information era allow the collection of massive amounts of data at unprecedented resolution. Making use of this data to gain insights into complex phenomena requires characterizing the relationships among a large number of variables. Probabilistic graphical models explicitly capture the statistical relationships between the variables of interest in the form of a network. Such a representation, in addition to enhancing interpretability of the model, often enables computationally efficient inference. My group studies graphical models and develops theory, methodology and algorithms to allow application of these models to scientifically important novel applications. In particular, our work to date has broken new grounds on providing a systematic approach to studying Gaussian graphical models, a framework that is rich enough to capture broad phenomena, but also allows systematic statistical and computational investigations. More generally, we study models with linear constraints on the covariance matrix or its inverse, as they arise in various applications and allow efficient computation. We use a holistic approach that combines ideas from machine learning, mathematical statistics, convex optimization, combinatorics, and applied algebraic geometry. For example, by leveraging the inherent algebraic and combinatorial structure in graphical models, we have uncovered statistical and computational limitations and developed new algorithms for learning directed graphical models to perform causal inference.

 

Building on our theoretical work, my group also develops scalable algorithms with provable guarantees for applications to genomics, in particular for learning gene regulatory networks. Recent technological developments have led to an explosion of single-cell imaging and sequencing data. Since most experiments require fixing a cell, one can only obtain one data modality per cell, take one snapshot of a particular cell in time, and observe a cell either before or after a perturbation (but not both). Hence a major computational challenge going forward is how to integrate the emerging single-cell data to identify regulatory modules in health and disease. Towards solving this challenge, my group has developed the first provably consistent causal structure discovery algorithms that can integrate observational and interventional data from gene knockout or knockdown experiments. In addition, we recently developed methods based on autoencoders and optimal transport to integrate and translate between different single-cell data modalities and data measured from different time points of a biological process. Together with our geometric models that link the packing of the DNA in the cell nucleus to gene expression, our methods have led to new biomarkers for early cancer prognosis based on single-cell images of DNA stained cell nuclei.

 

In what follows, we provide more details about our work in the above-mentioned areas. A complete list of publications can be found here or in my CV.

Thanks to NSF, ONR, DARPA, IBM, the Sloan Foundation and the Simons Foundation for supporting my research!

 
Gaussian Graphical Models

  

We uncovered the deep interplay between mathematical statistics, applied algebraic geometry, and convex optimization with regards to Gaussian graphical models. By developing a geometric understanding for maximum likelihood estimation, we obtained new results on the minimum number of observations needed for existence of the maximum likelihood estimator in Gaussian graphical models. In the Bayesian treatment of Gaussian graphical models, the G-Wishart distribution plays an important role, since it serves as the conjugate prior. It has been unknown whether the normalizing constant of the G-Wishart distribution for a general graph could be represented explicitly, and a considerable body of computational literature emerged that attempted to avoid this apparent intractability. We solved this 20-year old problem by providing an explicit representation of the G-Wishart normalizing constant for general graphs.

Other Linearly Constrained Covariance Models

  

In many applications it is of interest to pose linear equality or inequality constraints on the covariance matrix or its inverse. While maximum likelihood estimation for Gaussian models with linear constraints on the inverse covariance matrix (e.g. Gaussian graphical models) leads to a convex optimization problem that can be solved efficiently, maximum likelihood estimation for Gaussian models with linear constraints on the covariance matrix is a non-convex problem that typically has many local optima. Nevertheless, in recent work we provide sufficient conditions for any hill-climbing method to converge to the global optimum provided the number of samples is large enough (n>14p). Hence, surprisingly, maximum likelihood estimation for linear Gaussian covariance models behaves as if it were a convex optimization problem. We also studied Gaussian models with linear inequality constraints, in particular distributions that are multivariate totally positive of order two (MTP2). This property, introduced in the 1970s, is a strong form of positive dependence, an important notion in probability theory and statistical physics. We showed that MTP2 distributions have remarkable properties with respect to Markov structures, making such distributions interesting for modeling in the high-dimensional setting. To relax the assumption of Gaussianity, we are currently developing methods for non-parametric density estimation under MTP2.

 
 
Causal Inference

  

Causal inference is a cornerstone of scientific discovery. It is of particular interest to determine causal structure among variables based on observational data, since conducting randomized controlled trials is often impractical or prohibitively expensive. Unfortunately, in general observational data alone cannot uniquely identify a directed graphical model, since different directed graphical models can satisfy the same conditional independence relations (such graphs are Markov equivalent). It is therefore important to understand the set of Markov equivalence classes and their sizes. In recent work we shed new light onto this statistical problem by recasting it into the language of combinatorial optimization. On the learning side, using methods from algebraic geometry and combinatorics we proved that current methodologies for learning causal graphs have severe limitations, namely they require the so-called faithfulness assumption, which is extremely restrictive. It is therefore of interest to study the minimal assumptions required for learning directed graphical models. In recent work, we introduced the sparsest permutation algorithm and proved that it is consistent under strictly weaker assumptions than faithfulness. It is conjectured that this algorithm in fact meets information-theoretically the minimal assumptions needed for causal inference. Most recently, we showed that also a greedy version of the sparsest permutation algorithm is consistent and can in fact be adapted to obtain the first consistent algorithm for learning interventional Markov equivalence classes from a mix of observational and interventional data, as is becoming available in genomics.

 

 
Gene Regulation and Chromosome Packing

  

The same string of genetic information encodes about 200 different cell types in our body. The emerging hypothesis is that the spatial organization of the genome is crucial in order to differentially turn on expression programs. In collaboration with the Shivashankar lab, a cell biology lab at the National University of Singapore, we probe this hypothesis using experiments and geometric models by integrating different single-cell modalities.