Ahmed Ruby: Modeling Implicit Discourse Relations Across Modalities and Languages

Date
27 April 2026, 14:00
Venue
Humanistiska Teatern, Thunbergsvägen 3C, Uppsala
Type
Doctoral defence
Respondent
Ahmed Ruby
Opponent
Philippe Muller
Supervisors
Sara Stymne, Christian Hardmeier
Research subject
Computational Linguistics
Publication
https://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-581125

Abstract

Ideas in communication do not stand in isolation, but are linked to each other through discourse relations such as cause, contrast, and elaboration. While some of these relations are explicitly marked by connectives, such as "so", "but", and "then", many are left implicit. Identifying these implicit relations is particularly challenging as it requires inferring meaning from context. This context is not always captured by text, since cues may be distributed across modalities. This dissertation therefore focuses on modeling implicit discourse relations across modalities and languages to better capture the contextual information needed for identification.

In this dissertation, I present a controlled approach to studying how prosody relates to implicit discourse relations. I construct a dataset of ambiguous implicit discourse relations in text and speech for English and Egyptian Arabic. I then conduct a controlled experiment to examine the impact of prosody when context is absent. I find that speakers prosodically distinguish between causal and concessive relations, using features such as pause duration and pitch variation. 

To explore this at scale, I introduce a novel method for automatically constructing implicit discourse relation datasets across text, speech, and video modalities in four languages. This method identifies implicit relation instances by leveraging connective explicitation in translation, where translators insert explicit connectives for relations that remain implicit in the source. Using these datasets, I present modeling approaches for implicit discourse relation classification across text, speech, and video modalities and their combinations. I evaluate these approaches in four languages under both monolingual and multilingual settings. I find that text-based models outperform audio and video models. While adding audio or visual cues to text can improve performance, simply combining all modalities shows no improvement or even degrades performance compared to text alone. However, controlled fusion, which integrates text, speech, and visual cues through learned gating, consistently outperforms both single-modality and simple combination models, with substantial improvements for low-resource scenarios in the multilingual setting.
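The gated fusion described above can be sketched in broad strokes: a small gating network looks at all modality embeddings jointly and assigns each modality a weight before summing. The function names, dimensions, and scalar-per-modality gating below are illustrative assumptions, not the dissertation's actual architecture, which may gate per feature and train the gates jointly with the classifier.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(text_vec, audio_vec, video_vec, W, b):
    # Concatenate the three modality embeddings and compute one
    # sigmoid gate per modality from the joint representation.
    joint = np.concatenate([text_vec, audio_vec, video_vec])
    gates = sigmoid(W @ joint + b)  # shape (3,), each value in (0, 1)
    # Weight each modality by its learned gate and sum into one vector.
    fused = (gates[0] * text_vec
             + gates[1] * audio_vec
             + gates[2] * video_vec)
    return fused, gates

# Toy usage with random 4-dimensional embeddings (W, b are stand-ins
# for weights that would normally be learned during training).
rng = np.random.default_rng(0)
d = 4
t, a, v = rng.standard_normal(d), rng.standard_normal(d), rng.standard_normal(d)
W = rng.standard_normal((3, 3 * d))
b = np.zeros(3)
fused, gates = gated_fusion(t, a, v, W, b)
```

The key design point is that the gates are computed from all modalities together, so the model can learn to down-weight an uninformative audio or video signal instead of letting it drown out the text, which matches the finding that naive concatenation can degrade performance.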
