Previous seminars 2024

Autumn

Seminar 2024-10-23: Planetary Causal Inference: Understanding Environment, Society and Economy through Earth Observation and Computer Vision

Speaker: Adel Daoud, Institute for Analytical Sociology, Linköping University

Time and place: 2024-10-23 at 10:15 - 11:30, Ekonomikum, Room H317

Topic: Planetary Causal Inference: Understanding Environment, Society and Economy through Earth Observation and Computer Vision

Abstract

Planetary Causal Inference explores how social science can benefit from Earth observation (EO) data to advance understanding of humans as a species and their impact on their environment, society, and economy. Traditional methods relying on tabular data, like surveys and national statistics, are costly and sometimes limited in scope, hindering planetary-scale analysis. EO data gathered via satellites offer a complementary approach that captures global, real-time information, enabling researchers to study phenomena like urbanization, poverty, conflict, and deforestation at fine spatial and temporal resolutions. We introduce the emerging field of causally-oriented EO-based machine learning (ML), where spatial data derived from images are analyzed using advanced ML models to create proxies for social science metrics and for use in causal inference pipelines. We discuss how these planetary causal inference methods can produce high-resolution insights about global social issues, providing new ways to assess conflict, sustainable development, and a range of other phenomena. By combining insights from geography, history, and multi-scale analysis, Planetary Causal Inference lays a foundation for researchers to address broad, integrated questions across household, neighborhood, regional, and global scales with a principled approach to inference.

Seminar 2024-10-16: From Stories to Sonnets: Data-Centered NLP for Creative Works

Speaker: Maria Antoniak, Pioneer Centre for AI, University of Copenhagen

Time and place: 2024-10-16 at 10:15 - 11:00, Ekonomikum, Room H317

Topic: From Stories to Sonnets: Data-Centered NLP for Creative Works

Abstract

In this talk, I'll share two recent studies that use natural language processing (NLP) techniques to model creative works like stories and poetry. In the first part of the talk, I'll discuss NLP approaches for story detection and analysis, focusing on how NLP methods can help us study storytelling at large scales and across diverse contexts. In the second part, I'll discuss the poetic capabilities of large language models (LLMs), focusing on audits of the vast pretraining datasets used to build these models. Both studies will highlight the challenges in creating open evaluation datasets for creative works and the importance of interdisciplinary collaboration between NLP and the humanities.

Seminar 2024-09-04: Transformer assisted survey sampling for efficient finite population statistics in highly imbalanced textual data: public hate crime estimation

Speaker: Hannes Waldetoft, Department of Statistics, Uppsala University

Time and place: 2024-09-04 at 10:15 - 11:45, Ekonomikum, Room H317

Abstract

Estimating population parameters in finite populations of text documents can be challenging in cases where obtaining the labels for the target variable requires manual annotation. To address this problem, we combine predictions from a transformer encoder neural network with well-established survey sampling estimators. This is done by training a classifier and then using the model predictions as an auxiliary variable in the estimators. The applicability is demonstrated on Swedish hate crime statistics, which are based on Swedish police reports, of which approximately 1.5 million are filed annually. Estimates of the yearly number of hate crimes, the police's under-reporting, and proportions of specific hate crime types are derived using the Hansen-Hurwitz estimator, regression estimation, and stratified random sampling. We conclude that if labeled training data is available, the proposed method can provide efficient estimates with reduced time spent on manual annotation.
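As a rough illustration of how classifier predictions can act as an auxiliary variable, the sketch below draws documents with replacement, with probability proportional to a classifier score, and applies the Hansen-Hurwitz estimator of a population total. It is not the speaker's implementation: the function names, the score floor, and the synthetic population are all assumptions made for this example, and using the scores as a size measure is only one possible way to use an auxiliary variable.

```python
import random


def selection_probabilities(scores, floor=0.01):
    """Turn classifier scores into with-replacement draw probabilities.

    The floor is an assumption of this sketch: it keeps every document
    selectable even when its score is (close to) zero.
    """
    sizes = [max(s, floor) for s in scores]
    total = sum(sizes)
    return [s / total for s in sizes]


def hansen_hurwitz_total(labels, probs):
    """Hansen-Hurwitz estimate of a total: the mean of y_i / p_i over the draws."""
    n = len(labels)
    return sum(y / p for y, p in zip(labels, probs)) / n


def estimate_total(scores, annotate, n, rng):
    """Draw n documents (with replacement), annotate them, estimate the total."""
    probs = selection_probabilities(scores)
    drawn = rng.choices(range(len(scores)), weights=probs, k=n)
    labels = [annotate(i) for i in drawn]  # manual annotation happens only here
    return hansen_hurwitz_total(labels, [probs[i] for i in drawn])


# Synthetic, highly imbalanced population (hypothetical data, not police reports).
rng = random.Random(0)
scores = [rng.betavariate(0.2, 5.0) for _ in range(20000)]
truth = [1 if rng.random() < s else 0 for s in scores]

estimate = estimate_total(scores, lambda i: truth[i], n=500, rng=rng)
print("true total:", sum(truth), "estimated total:", round(estimate))
```

Only the 500 drawn documents are annotated here. The better the scores track the true labels, the less the ratios y_i / p_i vary, and the more precise the estimate becomes.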

Spring

Seminar 2024-05-29: Frequentist Oracle Properties of Bayesian Stacking Estimators

Speaker: Valentin Zulj, Department of Statistics, Uppsala University

Time and place: 2024-05-29 at 10:15-12, Ekonomikum, Room H317

Abstract

Compromise estimation entails using a weighted average of outputs from several candidate models, and is a viable alternative to model selection when the choice of model is not obvious. As such, it is a tool used by both frequentists and Bayesians, and in both cases, the literature is vast and includes studies of performance in simulations and applied examples. However, frequentist researchers often prove oracle properties, showing that a proposed average asymptotically performs at least as well as any other average comprising the same candidates. On the Bayesian side, such oracle properties are yet to be established. This paper considers Bayesian stacking estimators, and evaluates their performance using frequentist asymptotics. Oracle properties are derived for estimators stacking Bayesian linear and logistic regression models, and combined with Monte Carlo experiments that show Bayesian stacking may outperform the best candidate model included in the stack. Thus, the result is not only a frequentist motivation of a fundamentally Bayesian procedure, but also an extended range of methods available to frequentist practitioners.

Seminar 2024-03-27: A computed 95% confidence interval does cover the true value with probability 0.95 – epistemically interpreted

Speaker: Dan Hedlin, Department of Statistics, Stockholm University

Time and place: 2024-03-27 at 10:15-12, Ekonomikum, Room H317

Abstract

How to interpret and make use of a computed (realised) confidence interval has been debated for a long time. I advocate that a realised 95% confidence interval covers the parameter with probability 0.95. Suppose a sample of batteries in routine use is taken and their lifetimes are measured. A 95% confidence interval, expressed in days, is computed. The standard interpretation is that if we repeatedly draw samples of the same size and compute confidence intervals, 95% of the intervals will cover the unknown true lifetime. What can be said about the realised interval has not been clear in the literature. I shall give contradictory examples of interpretations. To appreciate the claim made here, we need to have a look at philosophy of probability and the Fisherian concept of relevant subsets.
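The standard repeated-sampling interpretation mentioned above can be checked with a small simulation. The sketch below covers only that frequentist reading, not the epistemic interpretation argued for in the talk. It assumes normally distributed battery lifetimes with a known standard deviation, so that a simple z-interval applies; all numerical values and function names are hypothetical.

```python
import random
from statistics import NormalDist, fmean


def z_interval(sample, sigma, level=0.95):
    """Confidence interval for a mean when the standard deviation is known."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / len(sample) ** 0.5
    centre = fmean(sample)
    return centre - half_width, centre + half_width


def coverage(true_mean, sigma, n, reps, rng, level=0.95):
    """Share of repeated-sample intervals that cover the true mean."""
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        low, high = z_interval(sample, sigma, level)
        hits += low <= true_mean <= high
    return hits / reps


# Hypothetical battery lifetimes in days: mean 500, standard deviation 50.
rng = random.Random(1)
print("empirical coverage:", coverage(500.0, 50.0, n=25, reps=4000, rng=rng))
```

The empirical coverage should land close to 0.95. The simulation is silent on what can be said about any single realised interval, which is the question the talk addresses.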

Seminar 2024-03-13: The Cambridge Law Corpus: A Corpus for Legal AI Research

Speaker: Andreas Östling, Department of Statistics, Uppsala University

Time and place: 2024-03-13 at 10:15-11:30, Ekonomikum, Room H317

Abstract

We introduce the Cambridge Law Corpus (CLC), a corpus for legal AI research. It consists of over 250 000 court cases from the UK. Most cases are from the 21st century, but the corpus includes cases as old as the 16th century. This paper presents the first release of the corpus, containing the raw text and meta-data. Together with the corpus, we provide annotations on case outcomes for 638 cases, done by legal experts. Using our annotated data, we have trained and evaluated case outcome extraction with GPT-3, GPT-4 and RoBERTa models to provide benchmarks. We include an extensive legal and ethical discussion to address the potentially sensitive nature of this material. As a consequence, the corpus will only be released for research purposes under certain restrictions. (Joint work with Holli Sargeant, Huiyuan Xie, Ludwig Bull, Alexander Terenin, Leif Jonsson, Måns Magnusson, Felix Steffek)

Seminar 2024-02-21: Robust Estimation for Multivariate Contaminated Time Series

Speaker: Yukai Yang, Department of Statistics, Uppsala University

Time and place: 2024-02-21 at 10:15-11:30, Ekonomikum, Room H317

Abstract

This paper addresses the challenging problem of robust estimation in multivariate contaminated time series, where the presence of unobservable additive and innovative outliers poses a significant hurdle. We elucidate the inherent difficulties associated with handling outliers in time series models. We introduce a novel algorithm that can be integrated with any subsample-based estimator, yielding a new robust estimator. We establish sufficient conditions under which the algorithm and the new estimator achieve robust estimation and consistency in the presence of contamination, and under which robustness holds asymptotically. The simulation studies demonstrate its superior performance and robustness in finite samples. Two practical applications further illustrate the efficacy of the algorithm and the resulting robust estimator in real-world scenarios. (Joint work with Sebastian Ankargren and Rickard Sandberg)

Seminar 2024-02-14: Measures of Explained Standard Deviation

Speaker: Mathias Berggren, Department of Psychology, Uppsala University

Time and place: 2024-02-14 at 10:15-11:30, Ekonomikum, Room H317

Abstract

The coefficient of determination, $R^2$, also called the explained variance, is often taken as a proportional measure of the relative determination of model on outcome. However, while $R^2$ has some attractive statistical properties, its reliance on squared variations (variances) may limit its use as an easily interpretable descriptive statistic of that determination. Here, properties of this coefficient on the squared scale are discussed and generalized to three relative measures on the original scale (in terms of standard deviations). These generalizations can all be expressed as transformations of $R^2$, and alternatives can therefore also be calculated by plugging in related estimates, such as the adjusted $R^2$. It is argued that the third alternative, new for this article, and here termed the CoDSD (the coefficient of determination in terms of standard deviations), which equals $\sqrt{R^2}/(\sqrt{R^2}+\sqrt{1-R^2})$, most usefully captures the relative determination of the model. When the contribution of the error is $c$ times that of the model, the CoDSD equals $1/(1+c)$, while $R^2$ equals $1/(1+c^2)$.
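The closing relation in the abstract can be verified numerically. The sketch below computes the CoDSD from a coefficient of determination using the formula stated above, and checks that an error contribution of c times the model contribution gives 1 / (1 + c). The function names are chosen for this example and are not taken from the article.

```python
import math


def codsd(r2):
    """CoDSD as a transformation of the coefficient of determination."""
    if not 0.0 <= r2 <= 1.0:
        raise ValueError("the coefficient of determination must lie in [0, 1]")
    model_sd = math.sqrt(r2)        # model contribution on the SD scale
    error_sd = math.sqrt(1.0 - r2)  # error contribution on the SD scale
    return model_sd / (model_sd + error_sd)


def r2_from_error_ratio(c):
    """Coefficient of determination when the error SD is c times the model SD."""
    return 1.0 / (1.0 + c * c)


for c in (0.5, 1.0, 2.0, 3.0):
    r2 = r2_from_error_ratio(c)
    print(f"c={c}: R2={r2:.4f}  CoDSD={codsd(r2):.4f}  1/(1+c)={1 / (1 + c):.4f}")
```

With c = 1, the model and the error contribute equally and both measures equal one half. They move apart as c departs from 1, because the coefficient of determination works on the squared scale.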

Seminar 2024-02-07: Modification Index for ARMA Models

Speaker: Viktor Eriksson, Department of Statistics, Uppsala University

Time and place: 2024-02-07 at 10:15-11:30, Ekonomikum, Room H317

Abstract

Model selection procedures often involve estimation of similar models. A modification index is a direct way of estimating how to extend an estimated model without the need to re-estimate it. Modification indices are commonly used in structural equation modelling and factor analysis, but have as yet not been applied within the context of ARMA modelling of time series. The modification index is analytically derived and programmed in R. An empirical illustration exemplifies its use. Monte Carlo simulations are used to investigate small sample properties.

Seminar 2024-01-17: Uniformly valid causal inference with high-dimensional nuisance parameters

Speaker: Xavier de Luna, Department of Statistics, Umeå University

Time and place: 2024-01-17 at 10:15-11:30, Ekonomikum, Room H317

Abstract

Important advances have recently been achieved in developing procedures yielding uniformly valid inference for a low dimensional causal parameter when high-dimensional nuisance models must be estimated. In this talk, we review the literature on uniformly valid causal inference and discuss the costs and benefits of using uniformly valid inference procedures. Naive estimation strategies based on regularization, machine learning, or a preliminary model selection stage for the nuisance models have finite sample distributions which are badly approximated by their asymptotic distributions. To solve this serious problem, estimators which converge uniformly in distribution over a class of data generating mechanisms have been proposed in the literature. In order to obtain uniformly valid results in high-dimensional situations, sparsity conditions for the nuisance models typically need to be made, although a double robustness property holds, whereby if one of the nuisance models is more sparse, the other nuisance model is allowed to be less sparse. While uniformly valid inference is a highly desirable property, uniformly valid procedures pay a high price in terms of inflated variability. Our discussion of this dilemma is illustrated by the study of a double-selection outcome regression estimator, which we show is uniformly asymptotically unbiased, but is less variable than uniformly valid estimators in the numerical experiments conducted. Justification with real-world data and sensitivity analysis to untestable ignorability assumptions will be presented if time allows. (Joint work with Niloofar Moosavi and Jenny Häggström.)
