Statistics Seminar
Causal Inference with Cocycles
To join this seminar virtually: Please request Zoom connection details from ea@stat.ubc.ca.
Abstract: Many interventions in causal inference can be represented as transformations of the variables of interest. Abstracting interventions in this way allows us to identify a local symmetry property exhibited by many causal models under interventions. Where present, this symmetry can be characterized by a type of map called a cocycle, an object that is central to dynamical systems theory. We show that such cocycles exist under general conditions and are sufficient to identify interventional distributions and, under suitable assumptions, counterfactual distributions. We use these results to derive cocycle-based estimators for causal estimands and show that they achieve semiparametric efficiency under standard conditions. Since entire families of distributions can share the same cocycle, these estimators can make causal inference robust to mis-specification by sidestepping superfluous modelling assumptions. We demonstrate both robustness and state-of-the-art performance in several simulations, and apply our method to estimate the effects of 401(k) pension plan eligibility on asset accumulation using a real dataset.
Joint work with Hugh Dance (UCL/Gatsby Unit): https://arxiv.org/abs/2405.13844
Online Kernel-Based Mode Learning
To join this seminar virtually: Please request Zoom connection details from ea@stat.ubc.ca.
Abstract: Big data, characterized by exceptionally large sample sizes, often bring the challenges of outliers and heavy-tailed distributions. An online learning method that is robust to outliers and does not require retaining the full historical data is therefore needed to obtain robust and efficient estimators. In this talk, we introduce an innovative online learning approach based on a mode kernel-based objective function, specifically designed to address outliers and heavy-tailed distributions in the context of big data. The approach embeds mode regression in an online learning framework that operates on data subsets, so that the estimate built from historical data is continuously updated with pertinent information extracted from each new data subset. We demonstrate that the resulting estimator is asymptotically equivalent to the mode estimator calculated using the entire dataset. Monte Carlo simulations and an empirical study illustrate the finite-sample performance of the proposed estimator.
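The abstract does not spell out the objective function, so as a toy illustration only (not the speakers' estimator), one can sketch mode learning from streaming data subsets as stochastic gradient ascent on a Gaussian-kernel mode objective with a decaying step size; the bandwidth, step-size schedule, and heavy-tailed data stream below are all assumed for the example:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy sketch: stochastic-approximation mode learning with a Gaussian kernel,
# updated one data subset at a time.  All settings here are illustrative.
h = 0.5          # kernel bandwidth (assumed fixed)
theta = 0.0      # running mode estimate
for k in range(1, 501):
    # new data subset: heavy-tailed Student-t(2) stream centred at 1.0
    batch = 1.0 + rng.standard_t(df=2, size=100)
    # ascend the kernel objective (1/n) * sum_i exp(-(y_i - theta)^2 / (2 h^2))
    r = (batch - theta) / h
    grad = np.mean(r * np.exp(-0.5 * r ** 2)) / h
    theta += (2.0 / k) * grad          # decaying Robbins-Monro step size

print("estimated mode:", round(theta, 3))   # true mode is 1.0
```

Because the kernel downweights observations far from the current estimate, the heavy tails of the stream contribute almost nothing to the update, which is the intuition behind the anti-outlier behaviour described in the abstract.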
van Eeden seminar: The four pillars of machine learning
Registration
To join this seminar, please register via Zoom. Once your registration is approved, you'll receive an email with details on how to join the meeting.
If you have any questions about your registration or the seminar, please contact headsec@stat.ubc.ca.
Title
The four pillars of machine learning
Abstract
I will present a unified perspective on the field of machine learning research, following the structure of my recent book, "Probabilistic Machine Learning: Advanced Topics" (https://probml.github.io/book2). In particular, I will discuss various models and algorithms for tackling the following four key tasks, which I call the "pillars of ML": prediction, control, discovery and generation. For each of these tasks, I will also briefly summarize a few of my own contributions, including methods for robust prediction under distribution shift, statistically efficient online decision making, discovering hidden regimes in high-dimensional time series data, and for generating high-resolution images.
van Eeden speakers
Dr. Kevin Patrick Murphy has been invited by our department's graduate students to be this year's van Eeden speaker. A van Eeden speaker is a prominent statistician who is chosen by our graduate students each year to give a lecture, supported by the Constance van Eeden Fund.
Meta-Analytic Inference for the COVID-19 Infection Fatality Rate
To join via Zoom: To join this seminar, please request Zoom connection details from pims@uvic.ca
Title: Meta-Analytic Inference for the COVID-19 Infection Fatality Rate
Abstract: Estimating the COVID-19 infection fatality rate (IFR) has proven to be challenging, since data on deaths and data on the number of infections are subject to various biases. I will describe some joint work with Harlan Campbell and others on both methodological and applied aspects of meeting this challenge, in a meta-analytic framework of combining data from different populations. I will start with the easier case when the infection data are obtained via random sampling. Then I will discuss drawing in additional infection data obtained in a decidedly non-random manner.
Adjusting for Bias Induced by Informative Dose Selection Procedures
Many fields, such as acute toxicity studies, Phase I cancer trials, sensory studies, and psychometric testing, use informative dose-allocation procedures. In this talk, we explain how such adaptive designs induce bias, and in the context of dose-finding designs we show how to modify frequency data to adjust for this bias.
To provide context, we start the talk with a general discussion of issues in inference following adaptive designs. Then, we assume a binary response Y has a monotonically increasing response probability in a stimulus or treatment X, and we consider designs that sequentially select X values for new subjects in a way that concentrates treatments in a certain region of interest under the dose-response curve. We discuss how data analysis at the end of a study is affected by choosing the stimulus value for each subject sequentially according to some informative sampling rule.
Without loss of generality, we call a positive response a toxicity and the stimulus a dose. For simplicity, we restrict this talk to the case of a univariate treatment X and binary Y, and further assume that treatments are limited to a finite set {d_1, d_2, ..., d_M} of M values we call doses. Now suppose n subjects receive treatments that were sequentially selected (according to some rule using data from prior subjects) from the restricted set of M doses. Let N_m and T_m denote the number of subjects receiving treatment d_m and the number of toxicities observed on treatment d_m, respectively. Define F_m = P{Y = 1 | X = d_m} = E[Y | X = d_m].
Then it is often said that the distribution of T_m given N_m is Binomial with parameters (F_m, N_m). But taking N_m as fixed is not the same as conditioning on this random variable, and conditioning on informative dose assignments is not the same as conditioning on summary dose frequencies. Indeed, it is easy to show that the observed dose-specific toxicity rate, T_m/N_m, is biased for F_m. From first principles, we obtain
E[T_m / N_m] = F_m - Cov(T_m / N_m, N_m) / E[N_m].
The observed toxicity rate is biased for F_m because adaptive allocations, by design, induce a correlation between toxicity rates and allocation frequencies.
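The size of this bias is easy to see by simulation. The sketch below uses a hypothetical five-dose grid with assumed toxicity probabilities and a classic up-and-down rule (step down after a toxicity, up otherwise); these choices are illustrative and not the specific designs analysed in the talk. It compares the Monte Carlo average of T_m/N_m with the true F_m:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical example: M = 5 doses with assumed true toxicity probabilities.
F = np.array([0.05, 0.15, 0.30, 0.50, 0.70])
M, n_subjects, n_reps = len(F), 20, 20000

sum_rates = np.zeros(M)  # running sum of observed rates T_m / N_m
n_used = np.zeros(M)     # replicates in which dose m was assigned at least once

for _ in range(n_reps):
    T = np.zeros(M)
    N = np.zeros(M)
    m = 2                                # start at the middle dose
    for _ in range(n_subjects):
        y = rng.random() < F[m]          # binary toxicity response at dose m
        N[m] += 1
        T[m] += y
        # classic up-and-down rule: step down after a toxicity, up otherwise
        m = max(m - 1, 0) if y else min(m + 1, M - 1)
    used = N > 0
    sum_rates[used] += T[used] / N[used]
    n_used[used] += 1

rates = sum_rates / n_used               # Monte Carlo estimate of E[T_m / N_m]
print("true F_m      :", F)
print("mean T_m/N_m  :", rates.round(3))
print("bias          :", (rates - F).round(3))
```

The nonzero bias column reflects exactly the covariance term in the identity above: the adaptive rule makes the number of assignments to a dose depend on the toxicities seen there.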
This bias impacts inference procedures: isotonic regression methods use dose-specific toxicity rates directly, while standard likelihood-based methods mask the bias through first-order linear approximations. We illustrate these biases using isotonic and likelihood-based regression methods in some well-known (small-sample) adaptive designs, including selected up-and-down designs, interval designs, and the continual reassessment method. Then we propose a bias adjustment inspired by Firth (1993).
Nancy Flournoy, University of Missouri – http://web.missouri.edu/flournoyn/
flournoyn@missouri.edu – https://en.wikipedia.org/wiki/Nancy_Flournoy
Methods for Preferential Sampling in Geostatistics
Preferential sampling in geostatistics refers to the instance in which the process that determines the sampling locations may depend on the spatial process that is being modelled. If ignored, this dependency can result in biased parameter estimates and may affect the resulting spatial prediction. Recent research on correcting for preferential sampling bias has been limited to stationary sampling locations, such as air-quality monitoring sites. We propose a flexible framework for inference on preferentially sampled fields, which can be used to expand preferential sampling methodology to the case in which the preferentially sampled locations are obtained from a process moving in space and time. An example of such data, the preferential sampling of ocean temperature by tagged marine mammals, is presented.
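As a toy numerical illustration of why ignoring this dependency biases estimates (with an assumed squared-exponential covariance and sampling weights proportional to exp(field); this is not the talk's model), one can compare preferential and completely random site selection on a simulated field:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy setting: a 1-D Gaussian random field on a grid, observed preferentially
# where the field itself is high.  All settings here are illustrative.
x = np.linspace(0.0, 1.0, 200)
K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.1) ** 2)   # squared-exp cov
L = np.linalg.cholesky(K + 1e-8 * np.eye(200))
fields = rng.standard_normal((2000, 200)) @ L.T              # 2000 realizations

# preferential design: site i observed with prob proportional to exp(field_i)
w = np.exp(fields)
pref_vals = np.array([
    f[rng.choice(200, size=20, p=wi / wi.sum())]
    for f, wi in zip(fields, w)
])
rand_vals = fields[:, rng.choice(200, size=20)]              # non-preferential

print("preferential sample mean:", pref_vals.mean().round(2))  # biased upward
print("random sample mean:      ", rand_vals.mean().round(2))  # near 0
```

The field has mean zero, yet the preferentially sampled observations average well above zero, which is the kind of bias the proposed framework is designed to correct.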
Modelling under-reported data through INAR-hidden Markov chains - June 22, 2017 at 4pm in ESB 4192
Interest in the analysis of count time series has grown in recent years, and many models have been considered in the literature (Al-Osh and Alzaid, 1987, J Time Series Analysis). The main reason for this increasing popularity is the limited performance of classical time series methods when dealing with discrete-valued series. With the introduction of discrete time series analysis techniques, several challenges appeared, such as unobserved heterogeneity, periodicity, and under-reporting, among others. Much effort has been devoted to introducing seasonality into these models (Morina et al., 2011, Statistics in Medicine) and to coping with unobserved heterogeneity. However, the study of under-reported data is still at a quite early stage in many different fields. This phenomenon is very common in contexts such as epidemiological and biomedical research; it can lead to potentially biased inferences and may invalidate the main assumptions of classical models. In the public health context especially, it is well known that several diseases have traditionally been under-reported (occupation-related diseases, food-exposure diseases, ...). The model we will present in this work considers two discrete time series: the observed series of counts Y_t, which may be under-reported, and an underlying series X_t with an INAR(1) structure X_t = alpha*X_{t-1} + W_t, where 0 < alpha < 1 is a fixed parameter and the innovations W_t are Poisson(lambda) distributed. The binomial thinning operator (or binomial subsampling) is defined as alpha*X_{t-1} = \sum_{i=1}^{X_{t-1}} Z_i(alpha), where the Z_i are i.i.d. Bernoulli random variables with probability of success equal to alpha. We allow the observed process Y_t to be under-reported by defining Y_t to be X_t with probability 1 - omega, or q*X_t with probability omega.
Under this definition, the observed Y_t coincides with the underlying series X_t, and therefore the count at time t is not under-reported, with probability 1 - omega. Several applications in the field of public health will be discussed, using real data on incidence and mortality attributable to diseases related to occupational and environmental exposures and known toxic agents, which are traditionally under-reported. Full details of the work can be found in Fernandez-Fontelo et al. (2016, Statistics in Medicine, v 35, pp 4875-4890).
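A minimal simulation sketch of this model makes the effect of under-reporting visible. The parameter values below are hypothetical, and q*X_t is read as binomial thinning with intensity q, using the thinning operator defined in the abstract:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical parameter values, for illustration only.
alpha, lam = 0.5, 2.0     # thinning parameter and Poisson innovation mean
omega, q = 0.3, 0.4       # frequency and intensity of under-reporting
n = 5000

X = np.zeros(n, dtype=int)
X[0] = rng.poisson(lam / (1 - alpha))    # start near the stationary mean
for t in range(1, n):
    # binomial thinning: alpha * X_{t-1} is a sum of X_{t-1} Bernoulli(alpha) draws
    X[t] = rng.binomial(X[t - 1], alpha) + rng.poisson(lam)

# under-reporting: Y_t = X_t w.p. 1 - omega, a q-thinned X_t w.p. omega
hidden = rng.random(n) < omega
Y = np.where(hidden, rng.binomial(X, q), X)

print("mean of X:", X.mean().round(2), "(stationary mean lam/(1-alpha) = 4.0)")
print("mean of Y:", Y.mean().round(2), "((1 - omega + omega*q) * 4 = 3.28)")
```

The observed series Y_t systematically undershoots the latent X_t, which is why naive analysis of the reported counts is biased and why the latent INAR(1) structure must be modelled explicitly.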