Seminar

Making a Training Dataset from Multiple Data Distributions

Over time we might accumulate lots of data from several different populations, e.g., data on the spread of a virus across different countries. Yet what we wish to model is not any one of these populations. One might want a model for the spread of the virus that is robust across the different countries, or that is predictive for a new location with only limited data. We survey and formalize the objectives that these goals impose on mixing different distributions into a training dataset; such objectives have historically been hard to optimize. We show that, assuming the trained models are near "optimal" for their training distribution, these objectives simplify to convex objectives, and we provide methods to optimize the reduced objectives. Experimental results show improvements across language modeling, bio-assay, and census data tasks.

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca.

Rolling Extrapolation of Censored Survival Data and Its Applications to Lifetime Outcome Estimation

Estimation of lifetime outcomes is a fundamental problem in biostatistics, epidemiology, and health economic evaluation. In many cohort studies, however, follow-up durations are limited and survival data are heavily censored, making direct estimation of lifetime survival, life expectancy, and cumulative disease burden impossible. Conventional approaches typically rely on parametric survival models, but long-term extrapolations are often highly sensitive to model misspecification and may produce substantial bias.

In this talk, I will present a novel statistical framework, termed the Rolling Extrapolation Algorithm (REA), for extrapolating censored survival data beyond the observed follow-up period. The method incorporates external population information through a matched reference cohort and models relative survival between the study and reference populations. A key observation is that the logit transformation of relative survival often exhibits approximate linearity under broad classes of excess hazard models. Rather than performing a single long-term extrapolation, REA fits a restricted cubic spline model and iteratively predicts one step ahead, updating the fitted model in a rolling fashion until a lifetime horizon is reached.
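The rolling step described above can be illustrated with a minimal Python sketch. It is not the speaker's implementation: for brevity it swaps the restricted cubic spline for a straight-line fit on the logit scale over a trailing window, and the names `rolling_extrapolate`, `fit_line`, and `window` are hypothetical. The input is assumed to be a series of relative survival values strictly between 0 and 1 on an equally spaced time grid.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def expit(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def fit_line(ys):
    """Least-squares line through (0, ys[0]), (1, ys[1]), ...; returns (slope, intercept)."""
    n = len(ys)
    x_mean = (n - 1) / 2.0
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean


def rolling_extrapolate(rel_surv, n_ahead, window=6):
    """Extend a relative-survival series by n_ahead steps, one step at a time.

    Assumption: a linear fit on the logit scale stands in for the
    restricted cubic spline mentioned in the abstract.
    """
    if window < 2 or len(rel_surv) < 2:
        raise ValueError("need at least two points to fit a line")
    z = [logit(w) for w in rel_surv]
    for _ in range(n_ahead):
        recent = z[-window:]                    # trailing window, including earlier predictions
        slope, intercept = fit_line(recent)
        z.append(intercept + slope * len(recent))  # predict one step past the window
    return [expit(v) for v in z]
```

If relative survival is taken to be the ratio of study to reference survival (an assumption about the definition used in the talk), multiplying the extended series by the reference cohort's survival at the same time points would give an extrapolated study survival curve.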

Simulation studies demonstrate that REA substantially improves extrapolation accuracy compared with conventional one-shot spline and parametric approaches under a variety of hazard patterns. The resulting lifetime survival estimates can be combined with longitudinal quality-of-life, disability, healthcare expenditure, and productivity data to estimate life expectancy, years of life lost, disability-adjusted life years, and lifetime economic burden. Applications will be illustrated using nationwide cohort studies in Taiwan, including analyses of long-term PM2.5 exposure and healthy lifestyle factors.

The talk will focus on the statistical principles underlying REA, its theoretical motivation, empirical performance, practical implementation, and remaining methodological challenges in survival extrapolation and lifetime outcome estimation.

To join this seminar virtually, please request Zoom connection details from hr.ops@stat.ubc.ca.

A novel class of mixed Poisson distributions and wastewater-based epidemiology

Mixed Poisson families are widely used to model count data with overdispersion, zero inflation, or heavy tails in a variety of applications including finance, biology, and the physical sciences. The mixing distribution assigned to the Poisson rate is typically restricted to have nonnegative support. Surprisingly, this assumption is unnecessary. For example, the Hermite distribution is analogous to mixing a Poisson with an untruncated Gaussian and can be derived using generating functions so long as constraints on the natural parameter are satisfied. I will give a general characterization of this unusual class as well as several concrete examples, including an apparently novel generalization of the discrete stable family. I will also briefly present some applied work in wastewater-based epidemiology examining spatiotemporal variation of the pepper mild mottle virus biomarker.
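The Hermite example can be checked numerically with a short Python sketch. It assumes the standard two-parameter Hermite distribution with probability generating function G(s) = exp(a1(s-1) + a2(s^2-1)); the parametrization used in the talk may differ. Formally evaluating E[exp(L(s-1))] for a Gaussian rate L with mean mu and variance sigma2 gives exp(mu(s-1) + sigma2(s-1)^2/2), which regroups to a1 = mu - sigma2 and a2 = sigma2/2, so under this reading the constraint is mu >= sigma2. The function names are hypothetical.

```python
import math


def hermite_params(mu, sigma2):
    """Map Gaussian mixing parameters (mean, variance) to Hermite parameters (a1, a2)."""
    if sigma2 < 0 or mu < sigma2:
        raise ValueError("need 0 <= sigma2 <= mu for a valid distribution")
    return mu - sigma2, sigma2 / 2.0


def hermite_pmf(a1, a2, n_max):
    """Probabilities p_0..p_{n_max} via the recurrence implied by
    G'(s) = (a1 + 2*a2*s) * G(s):  (n+1) p_{n+1} = a1 p_n + 2 a2 p_{n-1}."""
    p = [math.exp(-(a1 + a2))]          # p_0 = G(0)
    for n in range(n_max):
        prev = p[n - 1] if n >= 1 else 0.0
        p.append((a1 * p[n] + 2.0 * a2 * prev) / (n + 1))
    return p


a1, a2 = hermite_params(mu=3.0, sigma2=1.0)
pmf = hermite_pmf(a1, a2, n_max=80)
mean = sum(k * q for k, q in enumerate(pmf))
var = sum(k * k * q for k, q in enumerate(pmf)) - mean ** 2
```

With these inputs the mean and variance come out as mu and mu + sigma2 up to truncation error, matching what the usual mixed Poisson moment formulas would give if the Gaussian were a legitimate rate distribution.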

To join this seminar virtually, please request Zoom connection details from hr.ops@stat.ubc.ca.

Two MSc student presentations (Zhili Jiang & Zachary Lau)

Presentation 1

Time: 11:00am - 11:30am

Speaker: Zhili Jiang, UBC Statistics MSc student

Title: A Joint Model for Longitudinal and Survival Data with Nonlinear Trajectories and Interval-Censored Dropout, with Application to HIV Vaccine Studies

Abstract: Joint modeling of longitudinal biomarkers and time-to-event outcomes provides an important framework for understanding vaccine-induced immune responses and their relationship with clinical outcomes. In vaccine trials, dropout is often assumed to be non-informative and exactly observed, which may lead to biased inference when these assumptions are violated. In this study, we extend existing joint modeling approaches by incorporating a biologically motivated nonlinear mixed-effects model for longitudinal antibody trajectories and modeling dropout under both right- and interval-censored settings. The proposed framework provides a more realistic characterization of the association between immune dynamics and dropout through shared random effects. The method is applied to data from the VAX004 HIV-1 vaccine trial. The results suggest that dropout is associated with the underlying longitudinal antibody processes through shared random effects, supporting the presence of informative dropout under the proposed joint modeling framework. This association is consistently observed across Cox right-censored, Weibull right-censored, and Weibull interval-censored specifications. For the longitudinal component, the exponential-decay model provides a substantially better representation of antibody dynamics than linear and power-law alternatives. Simulation studies demonstrate reliable parameter estimation, although Hessian-based standard errors may underestimate uncertainty for parameters associated with the nonlinear component of the model. Overall, the proposed framework provides a flexible and biologically interpretable approach for joint modeling in vaccine studies, offering improved handling of realistic dropout mechanisms and the potential for extension to more complex longitudinal and survival settings.

Presentation 2

Time: 11:30am - 12:00pm

Speaker: Zachary Lau, UBC Statistics MSc student

Title: Scalable Gaussian Processes and Active Learning for Emulator Design in Solar Wind Simulation

Abstract: In this work, we discuss emulator design for solar wind simulators. Our work focuses on two areas. Firstly, we focus on scaling Gaussian Process regression to work well on simulator grids with millions of points. We accomplish this by extending existing work on Kronecker product covariance based algorithms to work efficiently with a dataset larger than working memory. Secondly, we implement and experiment with existing acquisition functions for active learning in the large data regime found in simulators. We find encouraging, though not definitive, results in favour of the Expected Predictive Information Gain acquisition function, particularly when it targets a prior concentrated in a particular part of the search space. To the best of our knowledge, this work is the first time that Gaussian Process Regression has been applied at this scale in Solar Wind modelling, the first time that these acquisition functions have been implemented at this scale for Gaussian Process models, and the first time that active learning has been applied to the problem of emulator design for the solar wind.

To join these seminars virtually, please request Zoom connection details from hr.ops@stat.ubc.ca.

Manifold Sampling with Automatic Tuning

Many statistical and applied problems involve sampling from distributions constrained to curved lower-dimensional spaces, or manifolds. Standard MCMC methods are inapplicable in these settings because they do not naturally respect the constraint geometry, while existing manifold samplers can be highly sensitive to step-size tuning.

Our main contribution is an automatically tuned manifold sampler with a local step-size selection procedure that adapts to the geometry of the manifold. Under regularity conditions, we use the involutive MCMC framework to show that our method leaves the target distribution invariant. We further implement a contour-based sampling method with automatic tuning that achieves strong performance in terms of effective sample size per second while maintaining stable acceptance rates on several challenging target distributions. Empirical results show that automatic tuning can make manifold sampling more reliable and less sensitive to step-size choice for constrained and contour-based inference problems.

To join this seminar virtually, please request Zoom connection details from hr.ops@stat.ubc.ca.

Introduction to the Computer-Based Testing Facility (CBTF)

The Computer-Based Testing Facility (CBTF) aims to solve a key problem in teaching and learning: helping instructors run digital assessments at scale securely and equitably. The easiest way to describe the CBTF is to imagine it as a network-filtered computer lab dedicated to running digital assessments in 50-minute increments throughout the day, invigilated by trained proctors. Students are typically given a multi-day window to write their tests at a time and location convenient for them. Through network filtering, centralized invigilation, a distributed exam model, and flexibility for students and instructors, the CBTF aims to spur pedagogical innovation at the university in a broad range of classes, programs, and departments. Though this was not the initial motivation, the CBTF has recently also been used to maintain exam integrity in the face of modern AI tools, particularly for computer-based exams that involve programming and must run in a controlled environment. This session will be useful for a range of people, including faculty members teaching courses, administrators, IT staff, and grad students, as well as anyone with an interest in pedagogy and innovative teaching methods. We’ll also hear about the experience of several Statistics faculty members in using the CBTF in their courses as pilots in previous terms. There will be plenty of time for Q&A and a larger conversation around migration to computer-based testing and different learning technologies. The CBTF currently supports a variety of assessment options including Canvas, PrairieLearn, MTA and others. We will also discuss the advantages and disadvantages of different learning technologies in the CBTF from a pedagogical, logistical, and financial perspective.

To join this seminar virtually, please request Zoom connection details from hr.ops@stat.ubc.ca.

UBC Statistics Department Colloquium: Statistical methods for single-cell and spatial data science

The UBC Statistics Department Colloquium Series features talks that are broad, accessible, and engaging - and open to everyone!

The third talk of our series will take place on Monday, June 8th, when we will welcome Stephanie Hicks, Associate Professor of Biomedical Engineering and Biostatistics at Johns Hopkins University.

Date: Monday, June 8, 2026
Time: 3 - 4 PM
Location: ESB 5104/5106

Title: Statistical methods for single-cell and spatial data science

Abstract: Genomics is going through a data revolution where we can now profile gene expression at a single-cell or 2D spatial resolution. However, these data present unique challenges that have required the development of specialized statistical and computational methods and software infrastructure to successfully derive biological insights. Compared to bulk RNA-seq, the number of observations (or cells) measured is far larger, and the data are sparser, with a higher fraction of observed zeros. Furthermore, as single-cell technologies mature, the increasing complexity and volume of data require fundamental changes in data access, management, and infrastructure alongside specialized methods to facilitate scalable analyses. I will discuss some challenges in the analysis of these data and present some of the solutions we have developed to address them.

This colloquium series is sponsored in part by the Constance van Eeden Endowment.

Nonlinear difference-in-differences with cocycles: idea and some simulation experiments

Recovering counterfactual distributions is an important goal in causal inference. In recent years, there has been a growing literature on transport-based models that directly model transport maps between outcome distributions. These methods avoid specifying full data-generating processes and are therefore more robust to mis-specification. The multivariate nonlinear difference-in-differences (DiD) model is one such example. It recovers counterfactual distributions using optimal transport when the treatment is discrete. This approach, however, does not readily generalize to continuous treatments. In this talk, we extend the nonlinear DiD to a continuous treatment setting using cocycles, which are constructed using a different class of transport maps. We propose an estimator for the average treatment effect on the treated and conduct simulation experiments to empirically study its convergence. We also investigate whether having anchoring treatment groups can result in faster convergence and answer that in the negative based on the simulation results.

To join this seminar virtually, please request Zoom connection details from hr.ops@stat.ubc.ca.

Advanced statistical methods for uncovering complex latent processes in animal movement

Telemetry data offer unprecedented opportunities to study wildlife behaviour, but extracting ecological insights from these complex processes requires advanced statistical methods. Using narwhal (Monodon monoceros) movement data as a case study, my PhD develops novel statistical methods to address three key challenges in animal movement analysis. First, while hidden Markov models (HMMs) provide a natural and powerful framework for inferring latent behavioural states from movement data, selecting the number of hidden states in such models is a notoriously difficult task. Common information criteria perform poorly in selecting the number of states under model misspecifications. I build upon a double penalized maximum likelihood estimator (DPMLE) for simultaneous estimation of the number of states and parameters of non-stationary HMMs. Through simulations and the narwhal case study, I show that the DPMLE outperforms traditional methods under misspecifications and enables more realistic modelling of movement data. Second, as human activities expand across wildlife habitats, quantifying behavioural responses to disturbances is crucial for conservation. I introduce a lasso-penalized threshold HMM that jointly estimates the distance at which animals react to a stimulus and ensures this distance threshold corresponds to a meaningful behavioural shift. Results suggest that narwhal react to vessels up to 4 km away, reducing movement persistence and spending more time in deeper waters. To my knowledge, this is the first model-based estimate of a disturbance threshold in movement ecology. Third, understanding habitat selection requires methods robust to location error inherent in animal tracking data. I extend the Langevin diffusion habitat selection model to accommodate error-prone observations, using automatic differentiation and the Laplace approximation for efficient maximum-likelihood estimation, providing the first Template Model Builder (TMB) implementation capable of handling covariates depending on latent variables. Simulations indicate that the proposed method performs better than conventional two-step procedures, which tend to produce estimates biased towards zero. Application to narwhal data reveals a stronger selection signal towards deeper water under my approach. Together, the methods developed in my dissertation advance the statistical toolkit for movement ecology and beyond, as the frameworks developed here are broadly applicable to time series analysis across a wide range of fields.

To join this seminar virtually, please request Zoom connection details from hr.ops@stat.ubc.ca.

Bridging prediction and causality: evaluating algorithms that predict treatment benefit

A treatment benefit predictor is a function that maps patient characteristics to a putative treatment benefit for that patient. Such predictors support the optimization of individualized treatment decisions, a central idea of precision medicine. However, evaluating the predictive performance of a treatment benefit predictor is challenging, as we often cannot observe each individual's treatment benefit. This work theoretically underpins common predictive metrics and demonstrates conceptual and practical evaluation of prespecified treatment benefit predictors in the target population. At a conceptual level, we define the estimands of a set of predictive performance metrics. A particular measure of discrimination is used as an illustrative example to reveal methodological concerns on multiple fronts. We describe how to evaluate a treatment benefit predictor using observational data from the target population and explore how predictive performance metrics may change when confounding is not fully controlled. In practice, we propose and implement estimation methods for evaluating the predictive performance of treatment benefit predictors, assessing their reliability through simulation studies. We illustrate their practical use in real-world observational data, including cohort construction and modeling strategies. Overall, this work helps bridge the gap between predictive modeling and causal inference, providing a framework for evaluating treatment benefit predictors using predictive performance metrics.

To join this seminar virtually, please request Zoom connection details from hr.ops@stat.ubc.ca.
