Biostatistics Seminars
The Department of Biostatistics at the University of Michigan is proud to invite leading scholars from around the world to visit Ann Arbor to share their expertise, wisdom, and experience. All are welcome to attend these seminars, which are held in person.
Tianxi Cai
Professor of Biomedical Informatics, Harvard Medical School; John Rock Professor of Population and Translational Data Sciences, Harvard T.H. Chan School of Public Health
Harvard University
DATE: Thursday, January 22, 2026
TIME: 3:15 p.m.
LOCATION: SPH I, Room 1690
TITLE: Toward Durable AI in Healthcare: Generalizable Learning from Imperfect EHR Data
ABSTRACT: Electronic Health Record (EHR) data offers a promising foundation for real-world evidence, yet its utility is often severely limited by fragmented, imperfect data and significant heterogeneity across health systems. These inherent data flaws create major bottlenecks in generating evidence efficiently, often resulting in fragile models that are highly susceptible to data shift and rapid aging. Consequently, the challenge lies not just in accessing data, but in efficiently transforming these messy, disparate sources into reliable, enduring AI solutions. This presentation outlines a comprehensive strategy for overcoming these limitations and deriving robust clinical insights from imperfect data. We will discuss how representation learning can address data sparsity and fragmentation by extracting stable latent features from incomplete patient histories (a toy numerical sketch of this idea follows the topic list below). To tackle system heterogeneity and ensure model longevity, we introduce robust transfer learning frameworks designed to immunize algorithms against distributional shifts. Furthermore, we demonstrate how leveraging knowledge networks can bridge gaps in fragmented data by grounding models in broader biomedical context. Complementing these structural approaches, we touch on the use of Large Language Models (LLMs) to identify clinical outcomes not directly available in structured fields, solving the problem of unobserved endpoints. By integrating these diverse methodologies, we aim to establish a blueprint for efficiently building AI ecosystems that remain reliable and durable despite the complexities of real-world healthcare data.
TOPICS: Data Integration, High-Dimensional Data, Machine Learning, Network Inference, Precision Health, Predictive Modeling
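The representation-learning idea in the abstract can be made concrete with a minimal sketch: extract low-dimensional patient features from a matrix with missing entries via mean imputation and a truncated SVD. This is an illustrative toy under invented assumptions (simulated data, made-up dimensions and variable names), not the speaker's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "EHR" matrix: 200 patients x 50 codes, generated from 5 latent factors.
n_patients, n_codes, n_factors = 200, 50, 5
U = rng.normal(size=(n_patients, n_factors))
V = rng.normal(size=(n_factors, n_codes))
X_full = U @ V + 0.1 * rng.normal(size=(n_patients, n_codes))

# Fragmented records: about 40% of entries are unobserved.
mask = rng.random(X_full.shape) < 0.4
X = np.where(mask, np.nan, X_full)

# Crude completion: fill missing entries with column (code-level) means.
col_means = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), col_means, X)

# Truncated SVD recovers low-dimensional patient representations.
U_hat, s, Vt = np.linalg.svd(X_filled - X_filled.mean(0), full_matrices=False)
patient_embedding = U_hat[:, :n_factors] * s[:n_factors]  # 200 x 5

print(patient_embedding.shape)
```

In practice one would use a purpose-built completion or embedding method; the point of the sketch is only that a low-rank factorization can recover patient-level features from an incomplete matrix.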
Yixin Wang
Assistant Professor of Statistics
University of Michigan
DATE: Thursday, February 12, 2026
TIME: 3:30 p.m.
LOCATION: SPH I, Room 1690
TITLE: Causal Inference with Unstructured Data
ABSTRACT: Causal inference traditionally relies on tabular data, where treatments, outcomes, and covariates are manually collected and labeled. However, many real-world problems involve unstructured data (images, text, and videos) where treatments or outcomes are high-dimensional and unstructured, or all causal variables are hidden within the unstructured observations. This talk explores causal inference in such settings. We begin with cases where all causal variables (treatments, outcomes, and covariates) are hidden in unstructured observations. These problems require a crucial first step: extracting high-level latent causal factors from raw unstructured inputs. We develop algorithms to identify these factors. While traditional methods often assume statistical independence, causal factors are often correlated or causally connected. Our key observation is that, despite correlations, the causal connections (or lack thereof) among factors leave geometric signatures in the latent factors' support, that is, the ranges of values each can take (a toy numerical illustration follows the topic list below). These signatures allow us to provably identify latent causal factors from passive observations, interventions, or multi-domain datasets (up to different transformations). Next, we tackle cases where unstructured data itself serves as either the treatment or the outcome. Here, standard causal queries like the average treatment effect (ATE) are not suitable: subtracting one text, image, or video outcome from another is meaningless. High-dimensional unstructured treatments also challenge the overlap assumption required for causal identification. To address these challenges, we propose new causal queries: for unstructured outcomes, we pinpoint the outcome features most affected by the treatment; for unstructured treatments, we identify the influential treatment features driving outcome differences. Finally, we extend these ideas to decision-making algorithms, such as optimizing natural language actions for desired outcomes.
TOPICS: Artificial Intelligence, Causal Inference, Machine Learning
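The "geometric signatures in the support" observation can be illustrated numerically: independent latent factors fill an axis-aligned rectangle, while a factor that causally depends on another is confined to a band, so the joint support is no longer a product of the marginals. This sketch uses invented toy data and is not the talk's identification algorithm; it simply measures how much of the bounding box the samples occupy.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Case A: independent latent factors -> support is a product (a rectangle).
z1_a = rng.uniform(-1, 1, n)
z2_a = rng.uniform(-1, 1, n)

# Case B: z2 causally depends on z1 -> support is a narrow tilted band.
z1_b = rng.uniform(-1, 1, n)
z2_b = z1_b + rng.uniform(-0.2, 0.2, n)

def box_occupancy(z1, z2, bins=40):
    """Fraction of cells of the axis-aligned bounding box that contain samples:
    near 1.0 for product (rectangular) support, well below 1 when one factor
    constrains the other."""
    H, _, _ = np.histogram2d(z1, z2, bins=bins)
    return (H > 0).mean()

print(f"independent factors: occupancy ~ {box_occupancy(z1_a, z2_a):.2f}")  # ~1.00
print(f"causally linked:     occupancy ~ {box_occupancy(z1_b, z2_b):.2f}")  # far below 1
```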
Kris Sankaran
Assistant Professor of Statistics
University of Wisconsin-Madison
DATE: Thursday, February 19, 2026
TIME: 3:30 p.m.
LOCATION: SPH I, Room 1690
TITLE: New Diagnostics for Dimensionality Reduction of Genomic Data
ABSTRACT: Dimensionality reduction helps organize high-dimensional genomics data into manageable low-dimensional representations, like differentiation trajectories in single-cell data and community profiles in metagenomic sequencing. Such reductions are powerful but sensitive to hyperparameters and prone to misinterpretation. This talk addresses two risks in dimensionality reduction. First, we consider how to choose the number of topics K in topic modeling, a method from population genetics and language modeling that is now common in microbiome analysis. While broadly useful, topic models require users to specify the number of topics K, which governs the resolution of the learned topics. We discuss a new technique, topic alignment, for comparing topics across models with different resolutions (a simplified sketch of cross-resolution comparison follows the topic list below). Simulation studies show that this approach distinguishes between true and spurious topics. Second, we examine the distortions introduced by nonlinear dimensionality reduction methods, like t-SNE and UMAP, when applied to single-cell data. For example, these methods can introduce spurious clusters and fail to preserve cell density. We adopt the RMetric algorithm from manifold learning to measure local distortions, and we develop visualizations to explore them. Representing cells as deformed ellipses highlights changes in local geometry, and an interactive interface selectively reveals evidence of distortion without overwhelming the viewer. Through case studies on simulated and real data, we find that the visualizations can flag fragmented neighborhoods, support hyperparameter tuning, and enable method selection. Topic alignment and distortion visualization are available as software packages, with case studies in online vignettes: https://go.wisc.edu/7h58r9, https://go.wisc.edu/ss5ts9.
TOPICS: Computational Statistics, Genomics Research, High-Dimensional Data, Machine Learning, Microbiome Research, Multi-Omics Research, Data Visualization
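As a simplified stand-in for topic alignment (the actual method lives in the packages linked above), one can fit topic models at two resolutions and compare their topic-word distributions with cosine similarity; finer-resolution topics with no good coarse-level match are candidates for spurious splits. The count matrix below is random noise, purely to make the sketch self-contained and runnable.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)

# Toy count matrix: 300 samples x 80 taxa (a stand-in for microbiome counts).
counts = rng.poisson(lam=2.0, size=(300, 80))

# Fit topic models at two resolutions, K = 3 and K = 5.
ldas = {k: LatentDirichletAllocation(n_components=k, random_state=0).fit(counts)
        for k in (3, 5)}

# Topic-word weights, normalized so each topic is a probability distribution.
beta = {k: m.components_ / m.components_.sum(axis=1, keepdims=True)
        for k, m in ldas.items()}

# Crude cross-resolution comparison: cosine similarity between the K=3 and
# K=5 topics. Fine topics that match no coarse topic well may be spurious.
sim = cosine_similarity(beta[3], beta[5])
print(np.round(sim, 2))
print("best coarse match for each K=5 topic:", sim.argmax(axis=0))
```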
Liyue Shen
Assistant Professor of Electrical Engineering and Computer Science
University of Michigan
DATE: Friday, March 13, 2026
TIME: 10:00 a.m.
LOCATION: SPH I, Room 1690
TITLE: To be announced
ABSTRACT: To be announced
TOPICS: To be announced
Ambuj Tewari
Professor of Statistics
University of Michigan
DATE: Thursday, March 26, 2026
TIME: 3:30 p.m.
LOCATION: SPH I, Room 1690
TITLE: Statistical Perspectives on Large Language Models
ABSTRACT: Large language models (LLMs) dominate natural language processing and are among the most widely deployed AI technologies in history. At their heart, however, lies a statistical pipeline composed of distinct phases. LLMs are first trained using next-token prediction, with the transformer architecture serving as a large parametric model. At deployment, model outputs are governed by choices that range from simple decoding rules to more complex test-time strategies (a toy decoding sketch follows the topic list below). Finally, the resulting text is judged using a variety of downstream metrics, or may be fed into an AI text detector. This talk examines different stages of the LLM pipeline from a statistical perspective. First, we address a basic question about the transformer architecture: since the parameterization cost of a transformer layer is independent of context length, can one prove that the same is true of generalization bounds? Second, we study the relationship between training objectives and test-time evaluation metrics through the lens of surrogate loss theory, clarifying when these objectives are aligned and when they are fundamentally mismatched. Third, we ask whether it is possible to derive rigorous statistical tests for LLM-generated text with provable guarantees, even for short text samples.
TOPICS: Artificial Intelligence, Machine Learning
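The "simple decoding rules" mentioned in the abstract can be illustrated with a toy next-token distribution: greedy decoding deterministically takes the argmax token, while temperature sampling rescales the logits before sampling. The vocabulary and logits below are invented; this sketches standard decoding in general, not anything specific to the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy next-token distribution: logits over a 6-word vocabulary.
vocab = ["the", "model", "data", "is", "robust", "noisy"]
logits = np.array([1.2, 0.7, 0.5, 2.0, 0.1, -0.3])

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# Greedy decoding: deterministic, always picks the highest-logit token.
print("greedy:", vocab[int(np.argmax(logits))])

# Temperature sampling: T < 1 sharpens the distribution, T > 1 flattens it.
for T in (0.5, 1.0, 2.0):
    p = softmax(logits / T)
    sample = rng.choice(vocab, p=p)
    print(f"T={T}: p_max={p.max():.2f}, sampled token: {sample}")
```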
Alfred Hero
John H. Holland Distinguished University Professor of Electrical Engineering and Computer Science; R. Jamison and Betty Williams Professor of Engineering
University of Michigan
DATE: Thursday, April 2, 2026
TIME: 3:30 p.m.
LOCATION: SPH I, Room 1690
TITLE: To be announced
ABSTRACT: To be announced
TOPICS: To be announced
Xiaowu Dai
Assistant Professor of Statistics and Data Science
University of California, Los Angeles
DATE: Thursday, April 9, 2026
TIME: 3:30 p.m.
LOCATION: SPH I, Room 1690
TITLE: To be announced
ABSTRACT: To be announced
TOPICS: To be announced