Giulia Polverini: Vision-language model performance on physics tasks: Exploring AI capabilities on research-based assessments involving visual representations
- Date: 9 June 2026, 13:00
- Location: Polhemsalen, Å10134, Ångströmlaboratoriet, Regementsvägen 1, Uppsala
- Type: Thesis defence
- Thesis author: Giulia Polverini
- External reviewer: Tor Ole Bigton Odden
- Supervisor: Bor Gregorcic
- Research subject: Physics with specialization in Physics Education
- Publication: https://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-584529
Abstract
This thesis examines how vision-language models (VLMs) perform on research-based physics tasks that require the interpretation and coordination of visual representations. It is situated in Physics Education Research (PER), a field that has long argued that correctness alone is not enough to show understanding and that physics competence depends strongly on working with representations such as graphs, diagrams, and field sketches. From this perspective, generative AI is treated not as a separate topic from PER, but as a new case entering tasks and assessment settings that PER has studied for decades.
Methodologically, the thesis evaluates publicly available VLMs as deployed systems rather than as fixed model architectures. Across the included studies, the models are tested on standardized, research-based concept inventories using multimodal inputs in their original visual form, minimal prompting, and repeated independent runs. The empirical work covers kinematics, electricity and magnetism, and broader cross-inventory benchmarks. It combines quantitative scoring with qualitative analysis of model explanations where needed, especially to distinguish failures of physics reasoning from failures of visual interpretation.
The results show that current VLMs can perform strongly on conceptual physics assessments: in some cases they reach or exceed typical student performance, and on some benchmarks they approach expert-level performance. At the same time, this performance is uneven and often fragile. Across the studies, a recurring pattern is that models can state plausible or even correct physics strategies while misreading the visual representation that the task depends on. Graph interpretation, spatial coordination, and the use of embodied procedures such as the right-hand rule remain recurring sources of error. Performance also varies substantially across models, versions, access tiers, inventories, representation types, languages, and cost conditions.
The thesis argues that these findings matter for both PER and educational uses of AI. First, strong overall performance does not justify broad claims that these models understand physics in the educational sense used in PER. Second, the findings show the value of a PER lens for AI evaluation, since representation-rich physics tasks reveal weaknesses that are easy to miss in text-only benchmarks. Third, they suggest that VLMs should not be treated as reliable tutors, accessibility tools, or assessment supports on representation-rich physics tasks without careful validation in the specific setting of use. More broadly, the thesis shows how AI performance on physics tasks can be interpreted more carefully by attending not only to correctness, but also to representation, assessment, and the limits of what model output can be taken to show.
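As an illustration of the quantitative scoring over repeated independent runs mentioned in the abstract, the following is a minimal sketch and not the thesis's actual analysis code. The function name `score_runs`, the item identifiers, and the toy answers are all hypothetical; the thesis does not specify its data format.

```python
def score_runs(runs, answer_key):
    """Score repeated runs of a multiple-choice inventory against an answer key.

    runs: list of dicts, one per independent run, mapping item id -> chosen option
    answer_key: dict mapping item id -> correct option
    (Both structures are assumptions made for this sketch.)
    """
    items = list(answer_key)
    # Fraction of items answered correctly within each run.
    per_run = [
        sum(run.get(item) == answer_key[item] for item in items) / len(items)
        for run in runs
    ]
    # Fraction of runs in which each item was answered correctly.
    per_item = {
        item: sum(run.get(item) == answer_key[item] for run in runs) / len(runs)
        for item in items
    }
    return {
        "per_run": per_run,
        "mean": sum(per_run) / len(per_run),
        "per_item": per_item,
    }


# Hypothetical toy data: three items, two runs.
answer_key = {"Q1": "B", "Q2": "D", "Q3": "A"}
runs = [
    {"Q1": "B", "Q2": "D", "Q3": "C"},
    {"Q1": "B", "Q2": "A", "Q3": "C"},
]
result = score_runs(runs, answer_key)
```

Reporting per-item accuracy alongside the overall mean is one way to make uneven performance visible, for example when an item that depends on reading a graph is missed in every run while the aggregate score still looks strong.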