Biostatistics Seminars
The Department of Biostatistics at the University of Michigan is proud to invite leading scholars from around the world to visit Ann Arbor to share their expertise, wisdom, and experience. All are welcome to attend these seminars, which are held in person.
Tianxi Cai
Professor of Biomedical Informatics, Harvard Medical School; John Rock Professor of Population and Translational Data Sciences, Harvard T.H. Chan School of Public Health
Harvard University
DATE: Thursday, January 22, 2026
TIME: 3:15 p.m.
LOCATION: SPH I, Room 1690
TITLE: Toward Durable AI in Healthcare: Generalizable Learning from Imperfect EHR Data
ABSTRACT: Electronic Health Record (EHR) data offers a promising foundation for real-world evidence, yet its utility is often severely limited by fragmented, imperfect data and significant heterogeneity across health systems. These inherent data flaws create major bottlenecks in generating evidence efficiently, often resulting in fragile models that are highly susceptible to data shift and rapid aging. Consequently, the challenge lies not just in accessing data, but in efficiently transforming these messy, disparate sources into reliable, enduring AI solutions. This presentation outlines a comprehensive strategy for overcoming these limitations and deriving robust clinical insights from imperfect data. We will discuss how representation learning can address data sparsity and fragmentation by extracting stable latent features from incomplete patient histories (a toy numerical sketch of this idea follows the topic list below). To tackle system heterogeneity and ensure model longevity, we introduce robust transfer learning frameworks designed to immunize algorithms against distributional shifts. Furthermore, we demonstrate how leveraging knowledge networks can bridge gaps in fragmented data by grounding models in broader biomedical context. Complementing these structural approaches, we touch on the use of Large Language Models (LLMs) to identify clinical outcomes not directly available in structured fields, solving the problem of unobserved endpoints. By integrating these diverse methodologies, we aim to establish a blueprint for efficiently building AI ecosystems that remain reliable and durable despite the complexities of real-world healthcare data.
TOPICS: Data Integration, High-Dimensional Data, Machine Learning, Network Inference, Precision Health, Predictive Modeling
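The representation-learning idea in the abstract can be made concrete with a minimal sketch: extract low-dimensional patient features from a matrix with missing entries via mean imputation and a truncated SVD. This is an illustrative toy under invented assumptions (simulated data, made-up dimensions and variable names), not the speaker's method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "EHR" matrix: 200 patients x 50 codes, generated from 5 latent factors.
n_patients, n_codes, n_factors = 200, 50, 5
U = rng.normal(size=(n_patients, n_factors))
V = rng.normal(size=(n_factors, n_codes))
X_full = U @ V + 0.1 * rng.normal(size=(n_patients, n_codes))

# Fragmented records: about 40% of entries are unobserved.
mask = rng.random(X_full.shape) < 0.4
X = np.where(mask, np.nan, X_full)

# Crude completion: fill missing entries with column (code-level) means.
col_means = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), col_means, X)

# Truncated SVD recovers low-dimensional patient representations.
U_hat, s, Vt = np.linalg.svd(X_filled - X_filled.mean(0), full_matrices=False)
patient_embedding = U_hat[:, :n_factors] * s[:n_factors]  # 200 x 5

print(patient_embedding.shape)
```

In practice one would use a purpose-built completion or embedding method; the point of the sketch is only that a low-rank factorization can recover patient-level features from an incomplete matrix.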
Yixin Wang
Assistant Professor of Statistics
University of Michigan
DATE: Thursday, February 12, 2026
TIME: 3:30 p.m.
LOCATION: SPH I, Room 1690
TITLE: Causal Inference with Unstructured Data
ABSTRACT: Causal inference traditionally relies on tabular data, where treatments, outcomes, and covariates are manually collected and labeled. However, many real-world problems involve unstructured data (images, text, and videos) where treatments or outcomes are high-dimensional and unstructured, or all causal variables are hidden within the unstructured observations. This talk explores causal inference in such settings. We begin with cases where all causal variables (treatments, outcomes, and covariates) are hidden in unstructured observations. These problems require a crucial first step: extracting high-level latent causal factors from raw unstructured inputs. We develop algorithms to identify these factors. While traditional methods often assume statistical independence, causal factors are often correlated or causally connected. Our key observation is that, despite correlations, the causal connections (or lack thereof) among factors leave geometric signatures in the latent factors' support, that is, the ranges of values each can take (a toy numerical illustration follows the topic list below). These signatures allow us to provably identify latent causal factors from passive observations, interventions, or multi-domain datasets (up to different transformations). Next, we tackle cases where unstructured data itself serves as either the treatment or the outcome. Here, standard causal queries like the average treatment effect (ATE) are not suitable: subtracting one text, image, or video outcome from another is meaningless. High-dimensional unstructured treatments also challenge the overlap assumption required for causal identification. To address these challenges, we propose new causal queries: for unstructured outcomes, we pinpoint the outcome features most affected by the treatment; for unstructured treatments, we identify the influential treatment features driving outcome differences. Finally, we extend these ideas to decision-making algorithms, such as optimizing natural language actions for desired outcomes.
TOPICS: Artificial Intelligence, Causal Inference, Machine Learning
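The "geometric signatures in the support" observation can be illustrated numerically: independent latent factors fill an axis-aligned rectangle, while a factor that causally depends on another is confined to a band, so the joint support is no longer a product of the marginals. This sketch uses invented toy data and is not the talk's identification algorithm; it simply measures how much of the bounding box the samples occupy.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Case A: independent latent factors -> support is a product (a rectangle).
z1_a = rng.uniform(-1, 1, n)
z2_a = rng.uniform(-1, 1, n)

# Case B: z2 causally depends on z1 -> support is a narrow tilted band.
z1_b = rng.uniform(-1, 1, n)
z2_b = z1_b + rng.uniform(-0.2, 0.2, n)

def box_occupancy(z1, z2, bins=40):
    """Fraction of cells of the axis-aligned bounding box that contain samples:
    near 1.0 for product (rectangular) support, well below 1 when one factor
    constrains the other."""
    H, _, _ = np.histogram2d(z1, z2, bins=bins)
    return (H > 0).mean()

print(f"independent factors: occupancy ~ {box_occupancy(z1_a, z2_a):.2f}")  # ~1.00
print(f"causally linked:     occupancy ~ {box_occupancy(z1_b, z2_b):.2f}")  # far below 1
```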
Kris Sankaran
Assistant Professor of Statistics
University of Wisconsin-Madison
DATE: Thursday, February 19, 2026
TIME: 3:30 p.m.
LOCATION: SPH I, Room 1690
TITLE: New Diagnostics for Dimensionality Reduction of Genomic Data
ABSTRACT: Dimensionality reduction helps organize high-dimensional genomics data into manageable low-dimensional representations, like differentiation trajectories in single-cell data and community profiles in metagenomic sequencing. Such reductions are powerful but sensitive to hyperparameters and prone to misinterpretation. This talk addresses two risks in dimensionality reduction. First, we consider how to choose the number of topics K in topic modeling, a method from population genetics and language modeling that is now common in microbiome analysis. While broadly useful, topic models require users to specify the number of topics K, which governs the resolution of the learned topics. We discuss a new technique, topic alignment, for comparing topics across models with different resolutions (a simplified sketch of cross-resolution comparison follows the topic list below). Simulation studies show that this approach distinguishes between true and spurious topics. Second, we examine the distortions introduced by nonlinear dimensionality reduction methods, like t-SNE and UMAP, when applied to single-cell data. For example, these methods can introduce spurious clusters and fail to preserve cell density. We adopt the RMetric algorithm from manifold learning to measure local distortions, and we develop visualizations to explore them. Representing cells as deformed ellipses highlights changes in local geometry, and an interactive interface selectively reveals evidence of distortion without overwhelming the viewer. Through case studies on simulated and real data, we find that the visualizations can flag fragmented neighborhoods, support hyperparameter tuning, and enable method selection. Topic alignment and distortion visualization are available as software packages, with case studies in online vignettes: https://go.wisc.edu/7h58r9, https://go.wisc.edu/ss5ts9.
TOPICS: Computational Statistics, Genomics Research, High-Dimensional Data, Machine Learning, Microbiome Research, Multi-Omics Research, Data Visualization
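As a simplified stand-in for topic alignment (the actual method lives in the packages linked above), one can fit topic models at two resolutions and compare their topic-word distributions with cosine similarity; finer-resolution topics with no good coarse-level match are candidates for spurious splits. The count matrix below is random noise, purely to make the sketch self-contained and runnable.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)

# Toy count matrix: 300 samples x 80 taxa (a stand-in for microbiome counts).
counts = rng.poisson(lam=2.0, size=(300, 80))

# Fit topic models at two resolutions, K = 3 and K = 5.
ldas = {k: LatentDirichletAllocation(n_components=k, random_state=0).fit(counts)
        for k in (3, 5)}

# Topic-word weights, normalized so each topic is a probability distribution.
beta = {k: m.components_ / m.components_.sum(axis=1, keepdims=True)
        for k, m in ldas.items()}

# Crude cross-resolution comparison: cosine similarity between the K=3 and
# K=5 topics. Fine topics that match no coarse topic well may be spurious.
sim = cosine_similarity(beta[3], beta[5])
print(np.round(sim, 2))
print("best coarse match for each K=5 topic:", sim.argmax(axis=0))
```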
Liyue Shen
Assistant Professor of Electrical Engineering and Computer Science
University of Michigan
DATE: Friday, March 13, 2026
TIME: 10:00 a.m.
LOCATION: SPH I, Room 1690
TITLE: To be announced
ABSTRACT: To be announced
TOPICS: To be announced
Ambuj Tewari
Professor of Statistics
University of Michigan
DATE: Thursday, March 26, 2026
TIME: 3:30 p.m.
LOCATION: SPH I, Room 1690
TITLE: Statistical Perspectives on Large Language Models
ABSTRACT: Large language models (LLMs) dominate natural language processing and are among the most widely deployed AI technologies in history. At their heart, however, lies a statistical pipeline composed of distinct phases. LLMs are first trained using next-token prediction, with the transformer architecture serving as a large parametric model. At deployment, model outputs are governed by choices that range from simple decoding rules to more complex test-time strategies (a toy decoding sketch follows the topic list below). Finally, the resulting text is judged using a variety of downstream metrics, or may be fed into an AI text detector. This talk examines different stages of the LLM pipeline from a statistical perspective. First, we address a basic question about the transformer architecture: since the parameterization cost of a transformer layer is independent of context length, can one prove that the same is true of generalization bounds? Second, we study the relationship between training objectives and test-time evaluation metrics through the lens of surrogate loss theory, clarifying when these objectives are aligned and when they are fundamentally mismatched. Third, we ask whether it is possible to derive rigorous statistical tests for LLM-generated text with provable guarantees, even for short text samples.
TOPICS: Artificial Intelligence, Machine Learning
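The "simple decoding rules" mentioned in the abstract can be illustrated with a toy next-token distribution: greedy decoding deterministically takes the argmax token, while temperature sampling rescales the logits before sampling. The vocabulary and logits below are invented; this sketches standard decoding in general, not anything specific to the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy next-token distribution: logits over a 6-word vocabulary.
vocab = ["the", "model", "data", "is", "robust", "noisy"]
logits = np.array([1.2, 0.7, 0.5, 2.0, 0.1, -0.3])

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# Greedy decoding: deterministic, always picks the highest-logit token.
print("greedy:", vocab[int(np.argmax(logits))])

# Temperature sampling: T < 1 sharpens the distribution, T > 1 flattens it.
for T in (0.5, 1.0, 2.0):
    p = softmax(logits / T)
    sample = rng.choice(vocab, p=p)
    print(f"T={T}: p_max={p.max():.2f}, sampled token: {sample}")
```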
Alfred Hero
John H. Holland Distinguished University Professor of Electrical Engineering and Computer Science; R. Jamison and Betty Williams Professor of Engineering
University of Michigan
DATE: Thursday, April 2, 2026
TIME: 3:30 p.m.
LOCATION: SPH I, Room 1690
TITLE: To be announced
ABSTRACT: To be announced
TOPICS: To be announced
Xiaowu Dai
Assistant Professor of Statistics and Data Science
University of California, Los Angeles
DATE: Thursday, April 9, 2026
TIME: 3:30 p.m.
LOCATION: SPH I, Room 1690
TITLE: To be announced
ABSTRACT: To be announced
TOPICS: To be announced