Statistics Seminar

Making a Training Dataset from Multiple Data Distributions

Over time we may accumulate large amounts of data from several different populations: for example, the spread of a virus across different countries. Yet what we wish to model is often not any one of these populations. We might instead want a model of the virus's spread that is robust across countries, or that is predictive for a new location for which we have only limited data. We overview and formalize the objectives such goals imply for mixing different distributions into a training dataset, objectives that have historically been hard to optimize. We show that, under the assumption that models are trained near "optimal" for their training distribution, these objectives reduce to convex objectives, and we provide methods for optimizing the reduced problems. Experimental results show improvements across language modeling, bio-assay, and census data tasks.
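
A minimal numerical sketch of the kind of reduced convex problem described above, under the simplifying assumption (ours, not the talk's) that a near-optimally trained model's loss on each population is approximately linear in the mixture weights; minimizing the worst-case population loss over the probability simplex is then a small linear program. The loss matrix below is hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical setup: losses[g, j] is (approximately) the loss a near-optimally
# trained model attains on population g when trained purely on source j.
# Under a linearity assumption, the loss on population g for mixture weights w
# is losses[g] @ w, so the robust objective
#     min_w  max_g  losses[g] @ w,   with w on the probability simplex,
# is a linear program in the variables (w, t).
losses = np.array([
    [0.20, 0.90, 0.70],
    [0.80, 0.25, 0.60],
    [0.75, 0.85, 0.30],
])
G, J = losses.shape

c = np.zeros(J + 1)          # variables x = [w_1, ..., w_J, t]; minimize t
c[-1] = 1.0

A_ub = np.hstack([losses, -np.ones((G, 1))])   # losses @ w - t <= 0 for each g
b_ub = np.zeros(G)
A_eq = np.hstack([np.ones((1, J)), np.zeros((1, 1))])   # sum(w) = 1
b_eq = np.array([1.0])
bounds = [(0.0, 1.0)] * J + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
w = res.x[:J]
print("mixture weights:", np.round(w, 3))
print("worst-case population loss:", round(res.x[-1], 3))
```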

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

Reassessing the Statistical Evidence in Clinical Trials with Extended Approximate Objective Bayes Factors

Bayesian hypothesis testing using the Bayes factor is an alternative to hypothesis testing based on the p-value. Bayes factors are especially useful if one considers the p-postulate (the notion that equal p-values, irrespective of sample size, represent equal evidence against a null hypothesis) to be false. Bayes factors can, however, be computationally intensive and require a prior distribution. We define an extension of Jeffreys's approximate objective Bayes factor (eJAB) based on a generalization of the unit information prior. Its computation requires nothing more than the p-value and the sample size, and it provides a measure of evidence that allows one to interpret the p-value in light of the associated sample size, through the lens of an approximate Bayes factor corresponding to an objective prior. We apply eJAB to reexamine the evidence from 71,130 clinical trial findings, with particular attention to contradictions between Bayes factors and NHST, i.e., instances of the Jeffreys–Lindley paradox (JLP). Our findings add to the growing evidence in the literature of problematic clinical trial design and results.
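
The eJAB formula itself is not reproduced in this abstract. As a rough illustration of the idea that a p-value and a sample size are enough to get an approximate objective Bayes factor, the sketch below uses the standard BIC/unit-information approximation BF01 ≈ sqrt(n) · exp(-z²/2), with z the z-statistic implied by a two-sided p-value; this is an assumed stand-in, not the eJAB definition from the talk.

```python
from math import exp, sqrt
from scipy.stats import norm

def approx_bf01(p_value: float, n: int) -> float:
    """BIC/unit-information style approximation to a Bayes factor in favour of
    the null, from a two-sided p-value and a sample size. Illustrative stand-in
    only; NOT the eJAB formula from the talk."""
    z = norm.isf(p_value / 2.0)        # z-statistic implied by the p-value
    return sqrt(n) * exp(-z * z / 2.0)

# The same p-value carries very different evidence at different sample sizes,
# which is why reading p-values through a Bayes-factor lens can flag
# Jeffreys-Lindley-type contradictions with NHST.
for n in (50, 500, 5000, 50000):
    print(f"p = 0.04, n = {n:6d}:  BF01 ~ {approx_bf01(0.04, n):6.2f}")
```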

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

AI, BI & SI—Artificial, Biological and Statistical Intelligences

Artificial Intelligence (AI) is clearly one of the hottest subjects these days. Basically, AI employs a huge number of inputs (training data), super-efficient computing power and memory, and smart algorithms to produce its intelligence. In contrast, Biological Intelligence (BI) is a natural intelligence that requires very little or even no input. This talk will first discuss the fundamental issue of input (training data) for AI. After all, not-so-informative inputs (even if they are huge) will result in a not-so-intelligent AI. Specifically, three issues will be discussed: (1) input bias, (2) data right vs. right data, and (3) sample vs. population. Finally, the importance of Statistical Intelligence (SI) will be introduced. SI sits somewhere between AI and BI: it employs important sample data, solid and theoretically proven statistical inference and models, and natural intelligence. In my view, AI will become more and more powerful in many senses, but it will never replace BI. After all, it is said that “The truth is stranger than fiction, because fiction must make sense.” The ultimate goal of this study is to find out how humans can use AI, BI, and SI together to do things better.

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

Dr. Dennis K. J. Lin is a Distinguished Professor of Statistics at Purdue University. He served as the Department Head during 2020-2022. Prior to his current position, he was a University Distinguished Professor of Supply Chain Management and Statistics at Penn State, where he worked for 25 years. His research interests are data quality, industrial statistics, statistical inference, and data science. He has published nearly 300 SCI/SSCI papers in a wide variety of journals. He currently serves or has served as an associate editor for more than 10 professional journals and was a co-editor of Applied Stochastic Models in Business and Industry. Dr. Lin is an elected fellow of ASA, IMS, ASQ, and RSS, an elected member of ISI, and a lifetime member of ICSA. He is an honorary chair professor at various universities, including Fudan University and National Taiwan Normal University, and a Chang-Jiang Scholar at Renmin University of China. His recent awards include the Youden Address (ASQ, 2010), the Shewell Award (ASQ, 2010), the Don Owen Award (ASA, 2011), the Loutit Address (SSC, 2011), the Hunter Award (ASQ, 2014), the Shewhart Medal (ASQ, 2015), and the SPES Award (ASA, 2016). He won the Deming Lecturer Award at the 2020 JSM. His most recent award is the 2022 Distinguished Alumni Award (National Tsing Hua University, Taiwan).

Locally Equivalent Weights for Bayesian Multilevel Regression and Poststratification

Multilevel Regression with Poststratification (MrP) has become a workhorse method for estimating population quantities using non-probability surveys, and is the primary alternative to traditional survey calibration weights, e.g., as computed by raking. For simple linear regression models, MrP methods admit “equivalent weights”, allowing for direct comparisons between MrP and traditional calibration weights (Gelman 2006). In the present paper, we develop a more general framework for computing and interpreting “MrP approximate weights” (MrPaw), which admit direct comparison with calibration weights in terms of important diagnostic quantities such as covariate balance, frequentist sampling variability, and partial pooling. MrPaw is based on a local equivalent weighting approximation, which we show in theory and practice to be accurate. Importantly, MrPaw can be easily computed based on existing MCMC samples and conveniently wraps standard MrP software implementations. We illustrate our approach for several canonical studies that use MrP, including for the binary outcome of vote choice, showing a high degree of variability in the performance of MrP models in terms of frequentist diagnostics relative to raking.
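
The MrPaw construction itself is not given in this abstract. As a toy illustration of the underlying "equivalent weights" idea in the simplest setting (ordinary least squares followed by poststratification, not the Bayesian multilevel case), the sketch below shows that an estimator which is linear in the outcomes can be rewritten as a weighted sum of sample responses, so implied weights can be read off and inspected like calibration weights. All data are simulated and the covariate targets are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# A simulated non-probability sample: intercept plus two covariates.
n = 200
X = np.column_stack([np.ones(n),
                     rng.integers(0, 2, n),       # e.g. a group indicator
                     rng.normal(0.0, 1.0, n)])    # e.g. a centred covariate
y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(0.0, 1.0, n)

# Known population covariate means (the poststratification targets).
x_pop = np.array([1.0, 0.40, 0.10])

# OLS + poststratification: estimate = x_pop' beta_hat = w' y,
# so the implied ("equivalent") weights are w = X (X'X)^{-1} x_pop.
XtX_inv = np.linalg.inv(X.T @ X)
w = X @ XtX_inv @ x_pop

est_via_beta = x_pop @ XtX_inv @ (X.T @ y)
est_via_weights = w @ y
print("two routes agree:", np.isclose(est_via_beta, est_via_weights))

# The implied weights behave like survey weights: they sum to one (because of
# the intercept) and exactly balance the covariates to the population targets.
print("sum of weights:", round(w.sum(), 6))
print("weighted covariate means:", np.round(w @ X, 3), "targets:", x_pop)
```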

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

Bayesian Modeling for Functional Neuroimaging Data

Functional neuroimaging data, such as electroencephalography (EEG) and functional magnetic resonance imaging (fMRI), often exhibit rich temporal, spatial, and spectral structure, posing unique challenges for statistical modeling. This talk presents Bayesian modeling approaches for functional neuroimaging data, focusing on time-frequency representations of EEG signals from multi-condition experiments. In such experiments, brain activity is recorded as subjects engage in various tasks or are exposed to different stimuli. The resulting data often exhibit smooth variation across time and frequency and can be naturally represented as two-way functional data, with conditions nested within subjects. To jointly account for the data’s multilevel structure, functional nature, and subject-level covariates, we propose a Bayesian mixed-effects model incorporating covariate-dependent fixed effects and multilevel random effects. For interpretability and parsimony, we introduce a novel decomposition of the fixed effects with marginally interpretable time and frequency patterns, along with a sparsity-inducing prior for rank selection. The proposed method is evaluated through extensive simulations and applied to EEG data collected to investigate the effects of alcoholism on cognitive processing in response to visual stimuli.  Extensions to modeling dynamic functional connectivity and other Bayesian methods developed for fMRI data will also be discussed. 
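
As a toy illustration of marginally interpretable time and frequency patterns (not the prior or the full Bayesian mixed-effects model from the talk), the sketch below builds a rank-2 time-frequency surface from separable time and frequency components and recovers them with a truncated SVD; in the proposed model the number of such components would instead be governed by a sparsity-inducing rank-selection prior. All settings are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
T, F = 100, 40                       # time points and frequency bins
t = np.linspace(0, 1, T)
f = np.linspace(4, 40, F)            # e.g. 4-40 Hz

# Rank-2 "fixed effect": each term is a time pattern times a frequency pattern.
u1, v1 = np.sin(2 * np.pi * t), np.exp(-0.5 * ((f - 10) / 3) ** 2)
u2, v2 = np.exp(-((t - 0.6) / 0.1) ** 2), np.exp(-0.5 * ((f - 25) / 4) ** 2)
surface = np.outer(u1, v1) + 0.5 * np.outer(u2, v2)
noisy = surface + 0.05 * rng.normal(size=(T, F))

# A truncated SVD recovers separable time/frequency components.
U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
rank2 = (U[:, :2] * s[:2]) @ Vt[:2]
rel_err = np.linalg.norm(rank2 - surface) / np.linalg.norm(surface)
print("relative error of rank-2 reconstruction:", round(rel_err, 3))
print("leading singular values:", np.round(s[:4], 2))
```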

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

A Practical Introduction to LLMs, Chatbots, and Dashboards

LLMs have a lot of hype around them these days. Let’s demystify how they work and see how we can put them in context for data science use. As data scientists, we want to make sure our results are inspectable, reliable, reproducible, and replicable. We already have many tools to help us on this front. LLMs, however, pose a new challenge: we may not always get the same results back from the same query. This means working out the areas where LLMs excel and using those behaviours in our data science artifacts. This talk will introduce you to LLMs and to the ellmer and chatlas packages for R and Python, and show how they can be integrated into a Shiny application to create an AI-powered dashboard. We’ll see how we can leverage the tasks LLMs are good at to improve our data science products.
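
A minimal sketch of driving an LLM from Python, assuming the chatlas package's ChatOpenAI class, its system_prompt argument, and its .chat() method (verify against the package documentation), with an OpenAI API key available in the environment; ellmer offers the analogous interface in R, and the resulting chat object is what you would wire into a Shiny chat component.

```python
# Hedged sketch: constructor arguments and method names follow chatlas's
# documented interface at the time of writing; check the docs before relying
# on them. Requires an OPENAI_API_KEY in the environment.
from chatlas import ChatOpenAI

chat = ChatOpenAI(
    model="gpt-4o-mini",   # hypothetical model choice
    system_prompt="You are a concise assistant embedded in a data science dashboard.",
)

# Responses are not guaranteed to be identical across calls, which is why any
# LLM-backed step in an analysis should be logged and kept inspectable.
reply = chat.chat("Summarise what a violin plot shows, in one sentence.")
print(reply)
```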

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

Copula-based Non-Gaussian Time Series Models

There are many non-Gaussian time series models available in the literature. Copula-based time series models are particularly relevant as they can handle serial tail dependence or the clustering of extreme observations. To date, mainly copula-based Markov time series models that extend the autoregressive time series model have been studied and applied. In this talk, I will consider non-Markovian copula-based time series models that can be viewed as an extension of Gaussian autoregressive moving average (ARMA) models. I derive distributional properties and discuss conditions for stationarity, as well as the asymptotic properties of the maximum-likelihood estimators. Finally, the probabilistic forecasting performance is evaluated.
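
As a small, standard illustration of the copula idea (a Markov construction, simpler than the non-Markovian copula ARMA models discussed in the talk), the sketch below simulates a stationary series with exponential margins whose serial dependence comes from a Gaussian copula: a latent Gaussian AR(1) is pushed through the Gaussian CDF and then through the target quantile function. All parameter values are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, phi = 2000, 0.7                    # series length and latent AR(1) coefficient

# Latent Gaussian AR(1) with unit stationary variance.
z = np.empty(n)
z[0] = rng.normal()
innov_sd = np.sqrt(1.0 - phi ** 2)
for t in range(1, n):
    z[t] = phi * z[t - 1] + innov_sd * rng.normal()

# Copula step: uniform scores from the latent process, then the target
# marginal quantile function (here exponential with rate 1).
u = stats.norm.cdf(z)
x = stats.expon.ppf(u)

# The margin is exponential while the serial dependence is that of the
# Gaussian copula; the rank autocorrelation reflects the latent phi.
rho, _ = stats.spearmanr(x[:-1], x[1:])
print("marginal mean (should be near 1):", round(x.mean(), 3))
print("lag-1 Spearman rho:", round(rho, 3))
```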

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

Generative Data Mining with Longtail-Guided Diffusion

It is difficult to anticipate the myriad challenges that a predictive model will encounter once deployed. Common practice entails a reactive, cyclical approach: model deployment, data mining, and retraining. We instead develop a proactive longtail discovery process by imagining additional data during training. In particular, we develop general model-based longtail signals, including a differentiable, single forward pass formulation of epistemic uncertainty that does not impact model parameters or predictive performance but can flag rare or hard inputs. We leverage these signals as guidance to generate additional training data from a latent diffusion model in a process we call Longtail Guidance (LTG). Crucially, we can perform LTG without retraining the diffusion model or the predictive model, and we do not need to expose the predictive model to intermediate diffusion states. Data generated by LTG exhibit semantically meaningful variation, yield significant generalization improvements on numerous image classification benchmarks, and can be analyzed by a vision-language model (VLM) to proactively discover, textually explain, and address conceptual gaps in a deployed predictive model.
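
The specific longtail signals used in this work are not spelled out in the abstract. As a loose illustration of a single-forward-pass signal that can flag rare or hard inputs without touching model parameters, the sketch below computes an energy score (the negative log-sum-exp of a classifier's logits), a common choice in the out-of-distribution detection literature and an assumption here, not necessarily the formulation used in Longtail Guidance.

```python
import numpy as np

def energy_score(logits: np.ndarray) -> np.ndarray:
    """Negative log-sum-exp of the logits, one value per input.

    Higher values tend to indicate inputs the classifier finds unfamiliar.
    This is a generic single-pass signal, not the paper's formulation.
    """
    m = logits.max(axis=1, keepdims=True)              # for numerical stability
    return -(m.squeeze(1) + np.log(np.exp(logits - m).sum(axis=1)))

# Hypothetical logits: the first input is confidently classified,
# the second produces a diffuse (harder / rarer) prediction.
logits = np.array([[9.0, 0.5, 0.2, 0.1],
                   [1.1, 0.9, 1.0, 0.8]])
print(np.round(energy_score(logits), 3))   # larger value = more "longtail"
```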

Bio

David Hayden leads Perception AI Research at Cruise, where he focuses on generative and world models, foundation model alignment and guidance, longtail robustness, uncertainty quantification, and synthetic data. He has consulted on machine learning and computer vision for diverse industries including pharmaceuticals, retail, and competitive sports. His work has shipped to hundreds of driverless cars, run live in stadiums of 40,000 people, supported seed and Series A rounds, and been published in top conferences and journals including ICML, CVPR, NeurIPS, and Nature. He previously founded Essistive Technologies, where he developed and licensed discreet note-taking technology for individuals with limited vision. David received his PhD from MIT, working on interpretable machine learning and computer vision, with an emphasis on behavior analysis, multi-object tracking, Bayesian nonparametrics for time series, distributions on manifolds, and uncertainty to guide decision making.

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

A Computational Theory for Black-Box Variational Inference

Variational inference with stochastic gradients, commonly called black-box variational inference (BBVI) or stochastic gradient variational inference, is the workhorse of probabilistic inference in the large-data, large-model regime. For a decade, however, the computational properties of VI have remained largely unknown. For instance, under what conditions is BBVI guaranteed to converge, and is it provably efficient? In this talk, I will present recent theoretical results on VI in the form of quantitative non-asymptotic convergence guarantees for obtaining a variational posterior. Following this, I will demonstrate the usefulness of the theoretical framework by investigating the theoretical properties of various design choices and algorithmic modifications, such as parametrizations of the variational approximation, variance-reduced gradient estimators such as sticking-the-landing, structured variational families, and beyond.
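
A minimal numerical sketch of the kind of procedure such guarantees concern: stochastic gradient ascent on the ELBO for a mean-field Gaussian, with reparameterized (pathwise) gradients. The target, step size, and Monte Carlo budget below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Unnormalized target: a Gaussian with mean 3 and standard deviation 2, chosen
# so the optimal variational parameters are known and easy to check.
def grad_log_p(z):
    return -(z - 3.0) / 4.0

# Mean-field Gaussian q(z) = N(mu, exp(log_sigma)^2).
mu, log_sigma = 0.0, 0.0
step, n_mc, n_iters = 0.05, 32, 2000

for _ in range(n_iters):
    sigma = np.exp(log_sigma)
    eps = rng.normal(size=n_mc)
    z = mu + sigma * eps                               # reparameterization trick
    g = grad_log_p(z)
    grad_mu = g.mean()                                 # pathwise gradient w.r.t. mu
    grad_log_sigma = (g * sigma * eps).mean() + 1.0    # + entropy gradient
    mu += step * grad_mu
    log_sigma += step * grad_log_sigma

# Expect values close to the target's mean 3 and standard deviation 2.
print("fitted mean:", round(mu, 2), " fitted sd:", round(float(np.exp(log_sigma)), 2))
```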

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca.

First Passage Time Distributions for Jump-Diffusion Processes and Flexible Boundaries

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca.

Abstract: The first passage time (FPT) is a useful tool in the stochastic modeling of many biological, physical, social, and economic processes evolving over time. It refers to the time at which a random process first crosses a threshold, e.g., when the population of an endangered species reaches a certain critical level, or when the number of individuals infected with a disease reaches a limit. Other examples include the survival time of a cancer patient, the failure time of a mechanical system, and the default time of a business.

We study the boundary crossing problem for jump-diffusion processes over a discontinuous boundary and provide a complete characterization of the FPT distributions. We derive new formulas for piecewise linear boundary crossing probabilities and densities of Brownian motion with general random jumps. These formulas can be used to approximate the boundary crossing distributions for general nonlinear boundaries. The method can be extended to more general diffusion processes, such as geometric Brownian motion and Ornstein-Uhlenbeck processes with jumps. The numerical computation can be done by Monte Carlo integration, which is straightforward and easy to implement. Some numerical examples are presented for illustration.
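
As a crude illustration of the Monte Carlo route mentioned above (direct path simulation on a time grid, rather than the exact piecewise-linear-boundary formulas derived in the talk), the sketch below estimates the probability that a Brownian motion with compound Poisson jumps crosses a linear boundary before a fixed horizon. All parameter values are arbitrary, and the grid-based check slightly underestimates the true crossing probability.

```python
import numpy as np

rng = np.random.default_rng(7)

# Model: X_t = sigma * W_t + compound Poisson jumps (rate lam, normal sizes).
# Boundary: b(t) = a + c * t, one piece of a piecewise-linear boundary.
# We estimate P( X_t >= b(t) for some t <= T ) by simulating paths on a grid.
T, n_steps, n_paths = 1.0, 500, 5000
sigma, lam, jump_sd = 1.0, 2.0, 0.5
a, c = 1.5, 0.5

dt = T / n_steps
t_grid = np.linspace(dt, T, n_steps)
boundary = a + c * t_grid

# Per-step increments: diffusion part plus the sum of a Poisson number of
# normal jumps (given k jumps, their sum is normal with sd jump_sd * sqrt(k)).
diff_inc = sigma * np.sqrt(dt) * rng.normal(size=(n_paths, n_steps))
n_jumps = rng.poisson(lam * dt, size=(n_paths, n_steps))
jump_inc = jump_sd * np.sqrt(n_jumps) * rng.normal(size=(n_paths, n_steps))

paths = np.cumsum(diff_inc + jump_inc, axis=1)
crossed = (paths >= boundary).any(axis=1)

p_hat = crossed.mean()
se = np.sqrt(p_hat * (1.0 - p_hat) / n_paths)
print(f"P(cross before T) ~ {p_hat:.4f} (Monte Carlo standard error {se:.4f})")
```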