Statistics Seminar

Making a Training Dataset from Multiple Data Distributions

Over time, we may accumulate large amounts of data from several different populations: for example, the spread of a virus across different countries. Yet what we wish to model is often not any one of these populations. One might want a model of viral spread that is robust across countries, or one that is predictive for a new location with only limited data. We overview and formalize the objectives these goals present for mixing different distributions into a single training dataset, objectives that have historically been hard to optimize. We show that, under the assumption that models are trained to near-optimality on the training distribution, these objectives simplify to convex ones, and we provide methods to optimize the reduced objectives. Experimental results show improvements across language modeling, bio-assay, and census data tasks.
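Purely as a toy illustration of how a mixing objective can become convex in the mixture proportions, the sketch below assumes each source domain's loss is linear in the mixture weights (an assumption made here for illustration, not the talk's actual reduction). The resulting min-max problem is convex over the probability simplex and can be solved with exponentiated-gradient (mirror) descent; the function name `minimax_mixture` is hypothetical.

```python
import numpy as np

def minimax_mixture(M, steps=2000, lr=1.0):
    """Toy sketch: choose mixture proportions w over K source distributions
    to minimize the worst domain's loss, assuming domain k's loss is linear
    in w, loss_k(w) = M[k] @ w.  Solved by exponentiated-gradient descent
    on the probability simplex; returns the averaged iterate."""
    K = M.shape[1]
    w = np.full(K, 1 / K)
    avg = np.zeros(K)
    for t in range(1, steps + 1):
        k = np.argmax(M @ w)                # worst-off domain (subgradient of the max)
        w = w * np.exp(-(lr / t) * M[k])    # mirror-descent step in the simplex geometry
        w /= w.sum()                        # renormalize onto the simplex
        avg += (w - avg) / t                # running average of the iterates
    return avg

# Two domains whose losses trade off symmetrically: the balanced mixture
# w = (0.5, 0.5) minimizes the worse of the two domain losses.
M = np.array([[1.0, 0.2],
              [0.2, 1.0]])
w_star = minimax_mixture(M)
```

The averaged iterate converges to the balanced mixture here because the two domains penalize each other's data symmetrically.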

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

Distributional Balancing for Causal Inference: A Unified Framework via Characteristic Function Distance

Weighting methods are essential tools for estimating causal effects in observational studies, with the goal of balancing pre-treatment covariates across treatment groups. Traditional approaches pursue this objective indirectly, for example, via inverse propensity score weighting or by matching a finite number of covariate moments, and therefore do not guarantee balance of the full joint covariate distributions. Recently, distributional balancing methods have emerged as robust, nonparametric alternatives that directly target alignment of entire covariate distributions, but they lack a unified framework, formal theoretical guarantees, and valid inferential procedures. We introduce a unified framework for nonparametric distributional balancing based on the characteristic function distance (CFD) and show that widely used discrepancy measures, including the maximum mean discrepancy and energy distance, arise as special cases. Our theoretical analysis establishes conditions under which the resulting CFD-based weighting estimator achieves root-N consistency. Since the standard bootstrap may fail for this estimator, we propose subsampling as a valid alternative for inference. We further extend our approach to an instrumental variable setting to address potential unmeasured confounding. Finally, we evaluate the performance of our method through simulation studies and a real-world application, where the proposed estimator performs well and exhibits results consistent with our theoretical predictions.
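One concrete member of the family the talk unifies is the energy distance, which can be computed directly between (possibly weighted) treated and control covariate samples as a balance diagnostic. The sketch below is illustrative only; the function name `weighted_energy_distance` is hypothetical and the weights default to uniform.

```python
import numpy as np

def weighted_energy_distance(X, Y, wx=None, wy=None):
    """Energy distance between weighted empirical distributions of X and Y
    (rows = units).  A special case of the characteristic function distance
    family; wx and wy must be probability weights summing to 1."""
    X, Y = np.atleast_2d(X), np.atleast_2d(Y)
    wx = np.full(len(X), 1 / len(X)) if wx is None else np.asarray(wx)
    wy = np.full(len(Y), 1 / len(Y)) if wy is None else np.asarray(wy)

    def mean_dist(A, B, wa, wb):
        # Weighted average pairwise Euclidean distance between rows of A and B
        d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
        return wa @ d @ wb

    return (2 * mean_dist(X, Y, wx, wy)
            - mean_dist(X, X, wx, wx)
            - mean_dist(Y, Y, wy, wy))

# Imbalanced groups (treated covariates shifted) vs. two samples from the
# same distribution: the distance is larger in the imbalanced case.
rng = np.random.default_rng(0)
treated = rng.normal(1.0, 1.0, size=(200, 2))
control = rng.normal(0.0, 1.0, size=(300, 2))
before = weighted_energy_distance(treated, control)
same = weighted_energy_distance(control[:150], control[150:])
```

Because the empirical measures are probability measures and the energy distance is a metric between them, the statistic is always nonnegative and is exactly zero when the two weighted samples coincide.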

The paper is available at https://arxiv.org/abs/2601.15449

Bio:

Dr. Chan Park is an assistant professor at the University of Illinois Urbana-Champaign. His research focuses on causal inference in complex settings, including dependence among units and omitted variables. He specializes in applying nonparametric methods and semiparametric theory to address these challenges.

To join this seminar virtually, please request Zoom connection details from hr.ops@stat.ubc.ca.

Variational Inference for Variable Selection in Scalar-on-Function Regression

In practical regression applications, multiple covariates are often measured, but not all may be associated with the response variable. Identifying and including only the relevant covariates in the model is crucial for improving prediction accuracy. In this work, we develop a variational inference approach for estimation and variable selection in scalar-on-function regression, involving only functional covariates, and in partially functional regression models that also include scalar covariates. Specifically, we develop a variational expectation–maximization algorithm, with a variational Bayes procedure implemented in the E-step to obtain approximate marginal posterior distributions for most model parameters, except for the regularization parameters, which are updated in the M-step. Our method accurately identifies relevant covariates while maintaining strong predictive performance, as demonstrated through extensive simulation studies across diverse scenarios. Compared with alternative approaches, including BGLSS (Bayesian Group Lasso with Spike-and-Slab priors), grLASSO (group Least Absolute Shrinkage and Selection Operator), grMCP (group Minimax Concave Penalty), and grSCAD (group Smoothly Clipped Absolute Deviation), our approach achieves a superior balance between goodness-of-fit and sparsity in most scenarios. We further illustrate its practical utility through real-data applications involving spectral analysis of sugar samples and weather measurements from Japan.

To join this seminar virtually, please request Zoom connection details from hr.ops@stat.ubc.ca. 

UBC Statistics Department Colloquium Series: A Debiased Machine Learning Single-Imputation Framework for Item Nonresponse in Surveys

Machine learning methods are now increasingly studied and used in National Statistical Offices, in particular to handle item nonresponse, where some survey respondents answer certain questions but leave others missing. In most surveys, item nonresponse affects key study variables, and imputation is routinely used to handle the resulting missing data. Standard parametric imputation methods can support rigorous inference when their modeling assumptions are approximately correct. However, when the imputation model is misspecified, the resulting inferences may be misleading. Machine learning offers a flexible alternative by learning complex relationships between variables from the data, which can reduce the risk of misspecification. At the same time, this flexibility introduces new challenges for survey inference, since modern learning algorithms may converge more slowly than classical parametric models and may not automatically deliver valid uncertainty quantification. In this talk, I will present a survey sampling extension of the double/debiased machine learning framework of Chernozhukov et al. (2018). The proposed approach combines machine learning-based imputation with design-based survey weighting and an orthogonalized estimation strategy, leading to root-n consistent and asymptotically normal estimation of population means under realistic conditions. We also develop a consistent variance estimator, yielding asymptotically valid confidence intervals while allowing the use of a wide range of machine learning algorithms. I will briefly discuss aggregation procedures and conclude with simulation results illustrating the performance of the proposed methodology.
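The orthogonalized strategy can be illustrated in a simplified, non-survey setting (no design weights) with the standard AIPW score and cross-fitting. Everything below is a sketch rather than the talk's estimator: the nuisance models are deliberately crude (a least-squares outcome model and a binned response propensity), the data are simulated, and the function name `dml_mean` is hypothetical.

```python
import numpy as np

def dml_mean(x, y, r, n_folds=2, seed=0):
    """Debiased (AIPW-style) estimate of E[Y] under item nonresponse.
    r[i] = 1 if y[i] is observed.  Cross-fitting: the nuisances used on
    each fold are fit on the remaining folds, as in double/debiased ML."""
    n = len(x)
    fold = np.random.default_rng(seed).integers(0, n_folds, n)
    psi = np.empty(n)
    bins = np.quantile(x, [0.25, 0.5, 0.75])          # quartile cells for the propensity
    for k in range(n_folds):
        tr, te = fold != k, fold == k
        # Outcome model: least squares on observed units in the training folds
        obs = tr & (r == 1)
        slope, intercept = np.polyfit(x[obs], y[obs], 1)
        m = slope * x[te] + intercept
        # Propensity model: response rate within quartile bins of x
        cell_tr = np.digitize(x[tr], bins)
        rate = np.array([r[tr][cell_tr == j].mean() for j in range(4)])
        pi = rate[np.digitize(x[te], bins)]
        # Orthogonal score: model prediction plus inverse-propensity-weighted residual
        resid = np.where(r[te] == 1, y[te] - m, 0.0)
        psi[te] = m + r[te] * resid / pi
    return psi.mean()

# Simulated MAR nonresponse: response probability increases with x,
# so the complete-case mean overshoots the true mean E[Y] = 2.
rng = np.random.default_rng(2)
n = 4000
x = rng.normal(size=n)
y_full = 2 + 3 * x + rng.normal(size=n)
r = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + x))))
y = np.where(r == 1, y_full, np.nan)

naive = np.nanmean(y)         # biased complete-case estimate
debiased = dml_mean(x, y, r)  # close to the true mean
```

The orthogonal score makes the estimate first-order insensitive to errors in either nuisance, which is what permits slower-converging ML nuisance estimators in the full framework.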

This talk is part of the UBC Statistics Colloquium Series, which features broad and accessible seminars throughout the term.

Functional State Space Models and the Kalman Filter

In this talk, we propose a state space model for functional time series data, which extends many classical time series models to the realm of functional data. Most notably, we introduce the Functional ARMAX process (FARMAX), which is developed in the fully functional setting, i.e., without relying on projection onto a finite number of basis functions. These models are fit via our fully functional variants of the Kalman filter and smoother. The theoretical soundness of this approach is proven using tools from the theory of Gaussian measures in locally convex spaces. As an application, we consider signal data collected from small wearable medical tri-axial accelerometers affixed to a patient's wrists or ankles. Each device collects three time series (x, y, and z directions) at 100 Hz and can record continuously for 14 days.
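The talk's contribution is the fully functional analogue of the Kalman recursions; as background, the classical finite-dimensional filter those methods generalize can be sketched in a few lines. The model matrices and data below are illustrative, not from the talk.

```python
import numpy as np

def kalman_filter(y, A, C, Q, R, x0, P0):
    """Classical (finite-dimensional) Kalman filter for the state space model
        x_t = A x_{t-1} + w_t,  w_t ~ N(0, Q)
        y_t = C x_t     + v_t,  v_t ~ N(0, R).
    Returns the filtered state means; the talk develops the functional
    analogue of these predict/update recursions."""
    x, P = x0, P0
    filtered = []
    for yt in y:
        # Predict: propagate mean and covariance through the state equation
        x = A @ x
        P = A @ P @ A.T + Q
        # Update: correct with the new observation
        S = C @ P @ C.T + R                 # innovation covariance
        K = P @ C.T @ np.linalg.inv(S)      # Kalman gain
        x = x + K @ (yt - C @ x)
        P = (np.eye(len(x)) - K @ C) @ P
        filtered.append(x.copy())
    return np.array(filtered)

# Noisy observations of a slowly drifting scalar state: filtering recovers
# the state much more accurately than the raw observations.
rng = np.random.default_rng(1)
truth = np.cumsum(rng.normal(0, 0.1, 100))
obs = truth + rng.normal(0, 1.0, 100)
est = kalman_filter(obs.reshape(-1, 1), np.eye(1), np.eye(1),
                    0.01 * np.eye(1), 1.0 * np.eye(1),
                    np.zeros(1), np.eye(1))
```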

Note: the updated time is 11 AM, March 10th.

To join this seminar virtually, please request Zoom connection details from hr.ops@stat.ubc.ca. 

An Economical Approach to Design Posterior Analyses

To design Bayesian studies, criteria for the operating characteristics of posterior analyses, such as power and the Type I error rate, are often assessed by estimating sampling distributions of posterior probabilities via simulation. In this work, we propose an economical method to determine optimal sample sizes and decision criteria for such studies. Using our theoretical results that model posterior probabilities as a function of the sample size, we assess operating characteristics throughout the sample size space given simulations conducted at only two sample sizes. These theoretical results are used to construct bootstrap confidence intervals for the sample sizes and decision criteria that reflect the stochastic nature of simulation-based design. The broad applicability and wide impact of our methodology are illustrated using two clinical examples.

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

Category Tree Gaussian Process for Computer Experiments with Many-Category Qualitative Factors and Application to Cooling System Design

In computer experiments, Gaussian process (GP) models are widely employed for emulation. However, when both qualitative and quantitative factors are involved, especially when qualitative factors have many categories, GP-based emulation becomes challenging, and existing methods can become unwieldy due to the curse of dimensionality. Motivated by computer experiments for the design of a cooling system, we introduce a new tree-based GP model for emulating computer codes with high-cardinality qualitative factors, referred to as the category tree GP (ctGP). The proposed approach incorporates a tree structure to partition the categories of the qualitative factors, after which GP or mixed-input GP models are fitted to the simulation outputs within the leaf nodes. The splitting rule is designed to reflect the cross-correlations among the categories of the qualitative factors, which a recent theoretical study has identified as a key component for improving prediction accuracy, and a pruning procedure based on cross-validation error is introduced to further ensure strong predictive performance. An application to the design of a cooling system demonstrates that the proposed method not only yields substantial computational gains and accurate predictions, but also offers meaningful insights into the system by uncovering an interpretable tree structure. Furthermore, in this cooling system design problem, the computer code is capable of generating multiple responses in addition to a single objective response; to accommodate this, we extend the ctGP framework to handle multiple responses by introducing an additional categorical variable that indicates which response is associated with each experimental point. Finally, we complete the cooling system design study by addressing the corresponding global optimization problem using Bayesian optimization with ctGP and an expected-improvement-type criterion.

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

Bio: Ray-Bing Chen is a Professor in the Institute of Statistics and Data Science at National Tsing Hua University. He received his Ph.D. in Statistics from the University of California, Los Angeles in 2003. Prof. Chen’s research interests include statistical and machine learning, statistical modeling, computer experiments, and optimal design. His work has been published in leading journals such as the Annals of Applied Statistics, Journal of Computational and Graphical Statistics, Statistics and Computing, Technometrics, Journal of Quality Technology, and Computational Statistics and Data Science. In recognition of his contributions to the field, he became an Elected Member of the International Statistical Institute in 2020.


Reassessing the Statistical Evidence in Clinical Trials with Extended Approximate Objective Bayes Factors

Bayesian hypothesis testing using the Bayes factor is an alternative to hypothesis testing based on the p-value. Bayes factors are especially useful if one considers the p-postulate, which holds that equal p-values, irrespective of sample size, represent equal evidence against a null hypothesis, to be false. Bayes factors can, however, be computationally intensive and require a prior distribution. We define an extension of Jeffreys' approximate objective Bayes factor (eJAB) based on a generalization of the unit information prior. Its computation requires nothing more than the p-value and the sample size, and it provides a measure of evidence that allows one to interpret the p-value in light of the associated sample size through the lens of an approximate Bayes factor corresponding to an objective prior. We apply eJAB to reexamine the evidence from 71,130 clinical trial findings, with particular attention to contradictions between Bayes factors and NHST, i.e., instances of the Jeffreys–Lindley paradox (JLP). Our findings reflect increasing evidence in the literature of problematic clinical trial design and results.
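The exact form of eJAB is the paper's contribution, but the classical baseline it extends, the BIC-style approximation under a unit information prior (roughly BF01 ≈ √n · exp(−z²/2) for a one-parameter test), already needs only the p-value and the sample size, and already exhibits the Jeffreys–Lindley paradox. A sketch of that baseline, not of eJAB itself:

```python
from math import exp, sqrt
from statistics import NormalDist

def jab01(p, n):
    """BIC-style approximate objective Bayes factor BF01 (evidence for the
    null over a one-parameter alternative) from a two-sided p-value and
    sample size n.  This is the classical unit-information approximation
    that eJAB generalizes, not the paper's estimator."""
    z = NormalDist().inv_cdf(1 - p / 2)   # |z|-score implied by the p-value
    return sqrt(n) * exp(-z * z / 2)

# Jeffreys-Lindley paradox in miniature: the same p = 0.04 that rejects H0
# under NHST favours H0 (BF01 > 1) once the sample size is large enough.
small_n = jab01(0.04, 50)       # modest n: BF01 < 1, evidence leans to H1
large_n = jab01(0.04, 100000)   # huge n: BF01 > 1, evidence favours H0
```

This is exactly the sample-size dependence that makes the p-postulate untenable: the same p-value carries different evidential weight at different n.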

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

AI, BI & SI—Artificial, Biological and Statistical Intelligences

Artificial Intelligence (AI) is clearly one of the hottest subjects these days. Basically, AI employs a huge number of inputs (training data), super-efficient computer power/memory, and smart algorithms to produce its intelligence. In contrast, Biological Intelligence (BI) is a natural intelligence that requires very little or even no input. This talk will first discuss the fundamental issue of input (training data) for AI. After all, not-so-informative inputs (even if they are huge) will result in a not-so-intelligent AI. Specifically, three issues will be discussed: (1) input bias, (2) data right vs. right data, and (3) sample vs. population. Finally, the importance of Statistical Intelligence (SI) will be introduced. SI sits somewhere between AI and BI. It employs important sample data, solid, theoretically proven statistical inference/models, and natural intelligence. In my view, AI will become more and more powerful in many senses, but it will never replace BI. After all, it is said that “The truth is stranger than fiction, because fiction must make sense.” The ultimate goal of this study is to find out “how can humans use AI, BI, and SI together to do things better.”

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca. 

Dr. Dennis K. J. Lin is a Distinguished Professor of Statistics at Purdue University. He served as the Department Head during 2020-2022. Prior to his current position, he was a University Distinguished Professor of Supply Chain Management and Statistics at Penn State, where he worked for 25 years. His research interests are data quality, industrial statistics, statistical inference, and data science. He has published nearly 300 SCI/SSCI papers in a wide variety of journals. He currently serves or has served as an associate editor for more than 10 professional journals and was a co-editor for Applied Stochastic Models for Business and Industry. Dr. Lin is an elected fellow of ASA, IMS, ASQ, and RSS, an elected member of ISI, and a lifetime member of ICSA. He is an honorary chair professor at various universities, including Fudan University and National Taiwan Normal University, and a Chang-Jiang Scholar at Renmin University of China. His recent awards include the Youden Address (ASQ, 2010), the Shewell Award (ASQ, 2010), the Don Owen Award (ASA, 2011), the Loutit Address (SSC, 2011), the Hunter Award (ASQ, 2014), the Shewhart Medal (ASQ, 2015), and the SPES Award (ASA, 2016). He won the Deming Lecturer Award at the 2020 JSM. His most recent award is the 2022 Distinguished Alumni Award (National Tsing Hua University, Taiwan).

Locally Equivalent Weights for Bayesian Multilevel Regression and Poststratification

Multilevel Regression with Post-stratification (MrP) has become a workhorse method for estimating population quantities using non-probability surveys, and is the primary alternative to traditional survey calibration weights, e.g., as computed by raking. For simple linear regression models, MrP methods admit “equivalent weights”, allowing for direct comparisons between MrP and traditional calibration weights (Gelman 2006). In the present paper, we develop a more general framework for computing and interpreting “MrP approximate weights” (MrPaw), which admit direct comparison with calibration weights in terms of important diagnostic quantities such as covariate balance, frequentist sampling variability, and partial pooling. MrPaw is based on a local equivalent weighting approximation, which we show in theory and practice to be accurate. Importantly, MrPaw can be easily computed based on existing MCMC samples and conveniently wraps standard MrP software implementations. We illustrate our approach for several canonical studies that use MrP, including for the binary outcome of vote choice, showing a high degree of variability in the performance of MrP models in terms of frequentist diagnostics relative to raking.

To join this seminar virtually, please request Zoom connection details from ea@stat.ubc.ca.