From causal representation learning to multi-study genomics

In this talk, we begin with an introduction to causal representation learning, an emerging field that synthesizes latent variable modeling, causal graphical models, nonparametric statistics, and deep learning. We then introduce nonlinear multi-study factor analysis, where the goal is to identify the graphical structure relating latent factors to high-dimensional observed data, while separating components that are shared across studies from those that are study-specific. As a particular example, we consider platelet gene expression data from patients in different disease groups. In these data, factors correspond to clusters of co-expressed genes; we may expect some clusters (or biological pathways) to be active for all diseases, while others are active only for a specific disease. To learn such structure, we propose a nonlinear multi-study sparse factor model, in which each observed feature depends on only a small subset of latent factors. In the genomics example, this corresponds to the assumption that each gene is active in only a few biological processes. Under an anchor assumption, we prove that the latent factors are identified. Further, we demonstrate that our method recovers meaningful factors in the platelet gene expression data. We conclude by discussing a method for assessing the quality of the estimated latent distribution, based on Bayesian predictive checks, and by outlining open questions and challenges in causal representation learning.
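To make the structure described above concrete, the following is a minimal simulation sketch and not the speaker's model or estimation method. Everything specific in it is an assumption chosen for illustration: the `tanh` nonlinearity, the Gaussian noise, the sizes, the names `MASK`, `ACTIVE`, and `simulate_study`, and the reading of the anchor assumption as "each factor has some features that depend on that factor alone". It shows a sparsity mask linking features to factors, one factor shared by both studies, and one factor specific to each study.

```python
import math
import random
import statistics

# Hypothetical sparsity pattern: MASK[j][k] = 1 if feature j may depend on factor k.
# Rows 0-5 are "anchor" features (one factor each); row 6 mixes factors 0 and 1.
MASK = [
    [1, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [1, 1, 0],
]

# Hypothetical study structure: factor 0 is shared, factor 1 is specific to
# study A, and factor 2 is specific to study B.
ACTIVE = {"A": [1, 1, 0], "B": [1, 0, 1]}


def simulate_study(n, mask, active, rng, noise_sd=0.1):
    """Draw n samples for one study from the illustrative generative sketch."""
    n_factors = len(mask[0])
    data = []
    for _ in range(n):
        # Latent factors: standard normal when active in this study, else zero.
        z = [rng.gauss(0.0, 1.0) if active[k] else 0.0 for k in range(n_factors)]
        row = []
        for mask_row in mask:
            # Each feature sees only its masked factors, through a nonlinearity.
            signal = math.tanh(sum(m * zk for m, zk in zip(mask_row, z)))
            row.append(signal + rng.gauss(0.0, noise_sd))
        data.append(row)
    return data


def column_variance(data, j):
    """Population variance of feature j across samples."""
    return statistics.pvariance([row[j] for row in data])


rng = random.Random(0)
data_a = simulate_study(500, MASK, ACTIVE["A"], rng)
data_b = simulate_study(500, MASK, ACTIVE["B"], rng)

# Feature 4 depends only on factor 2, which is inactive in study A,
# so it is close to pure noise there but carries signal in study B.
print("feature 4 variance, study A:", column_variance(data_a, 4))
print("feature 4 variance, study B:", column_variance(data_b, 4))
```

In this toy setup, features driven by a study-specific factor vary only in the study where that factor is active, which is the kind of shared versus study-specific pattern the abstract describes for co-expressed gene clusters.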
