Learning, reasoning, and compositional generalisation in multimodal language models
Umeå University, Faculty of Science and Technology, Department of Computing Science. ORCID iD: 0000-0002-1112-2981
2024 (English) Doctoral thesis, monograph (Other academic)
Alternative title: Inlärning, resonemang, och kompositionalitet i multimodala språkmodeller (Swedish)
Abstract [en]

We humans learn language and how to interact with the world through our different senses, grounding our language in what we can see, touch, hear, and smell. We call these streams of information different modalities, and our efficient processing and synthesis of the interactions between different modalities is a cornerstone of our intelligence. It is therefore important to study how we can build multimodal language models, where machine learning models learn from more than just text. This is particularly important in the era of large language models (LLMs), whose general capabilities remain unclear and unreliable. This thesis investigates learning and reasoning in multimodal language models, and their capability to compositionally generalise in visual question answering tasks. Compositional generalisation is the process by which we produce and understand novel sentences, systematically combining words and phrases to uncover meaning in language, and it has proven a challenge for neural networks. Previously, the literature has focused on compositional generalisation in text-only language models. One of the main contributions of this work is an extensive investigation of text-image language models. The experiments in this thesis compare three neural network-based models and one neuro-symbolic method, and operationalise language grounding as the ability to reason with relevant functions over object affordances.

To better understand the capabilities of multimodal models, this thesis introduces CLEVR-Math as a synthetic benchmark of visual mathematical reasoning. The CLEVR-Math dataset involves tasks such as adding and removing objects from 3D scenes based on textual instructions, such as "Remove all blue cubes. How many objects are left?", and is given as a curriculum of tasks of increasing complexity. The evaluation set of CLEVR-Math includes extensive testing of different functional and object-attribute generalisations. We open up the internal representations of these models using a technique called probing, where linear classifiers are trained to recover concepts such as colours or named entities from the internal embeddings of input data. The results show that while models are fairly good at generalisation with attributes (i.e., solving tasks involving never-before-seen objects), it is a big challenge to generalise over functions and to learn abstractions such as categories. The results also show that complexity in the training data is a driver of generalisation, where an extended curriculum improves general performance across tasks and generalisation tests. Furthermore, it is shown that training from scratch versus transfer learning has significant effects on compositional generalisation in models.
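The probing technique described above can be sketched in a few lines. The sketch below is illustrative only, assuming simulated embeddings and colour labels rather than data or models from the thesis: a linear classifier is fit on frozen embedding vectors, and its held-out accuracy indicates how linearly decodable the concept is.

```python
# Illustrative sketch of probing: train a linear classifier to recover a
# concept (here, colour) from embedding vectors. All data is simulated.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, dim, n_colours = 1000, 64, 3

# Simulated embeddings: each colour class shifts the embedding mean,
# standing in for the internal representations of a multimodal model.
labels = rng.integers(0, n_colours, size=n)
centres = rng.normal(size=(n_colours, dim))
embeddings = centres[labels] + 0.5 * rng.normal(size=(n, dim))

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2, random_state=0)

# The probe itself is just a linear model on frozen embeddings.
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = probe.score(X_test, y_test)

# Accuracy well above chance (1/3 here) suggests the concept is
# linearly decodable from the embeddings.
print(f"probe accuracy: {accuracy:.2f}")
```

In practice the embeddings would come from a trained model's intermediate layers, and low probe accuracy can mean the concept is absent or merely not linearly encoded.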

The results identify several aspects of how current methods can be improved in the future, and highlight general challenges in multimodal language models. A thorough investigation of compositional generalisation suggests that pre-training allows models access to inductive biases that can be useful for solving new tasks. In contrast, models trained from scratch show much lower overall performance on the synthetic tasks at hand, but exhibit smaller relative generalisation gaps. In the conclusions and outlook, we discuss the implications of these results as well as future research directions.

Place, publisher, year, edition, pages
Umeå: Umeå University, 2024. 192 pp.
Serie
Report / UMINF, ISSN 0348-0542 ; 24.07
Keywords [en]
multimodal, language models, compositional, generalisation, generalization, reasoning, probing, grounding
National subject category
Natural Language Processing
Research subject
Computing Science
Identifiers
URN: urn:nbn:se:umu:diva-224571
ISBN: 9789180704175 (print)
ISBN: 9789180704182 (digital)
OAI: oai:DiVA.org:umu-224571
DiVA, id: diva2:1858932
Public defence
2024-06-13, Aula Biologica, Biologihuset, Umeå, 13:00 (English)
Available from: 2024-05-23 Created: 2024-05-20 Last updated: 2025-02-07 Bibliographically approved

Open Access in DiVA

fulltext (9558 kB), 951 downloads
File information
File name: FULLTEXT01.pdf
File size: 9558 kB
Checksum (SHA-512):
58456a8d9d2cc4891ae1408eeac63c0781fd1fa9c5180b21fc077ece4b46009272ecef4ecbe3c3d9483cf91d6e4b92cbbe972b74752d7120e3d21523fa83862d
Type: fulltext
Mimetype: application/pdf
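A downloaded copy of the full text can be verified against the SHA-512 checksum listed above. The sketch below assumes the PDF has been saved locally under its repository file name; the path is an assumption to adjust as needed.

```python
# Sketch: verify a downloaded file against the SHA-512 checksum from the
# record. The expected digest is the one listed in the file information.
import hashlib
from pathlib import Path

EXPECTED_SHA512 = (
    "58456a8d9d2cc4891ae1408eeac63c0781fd1fa9c5180b21fc077ece4b4600"
    "9272ecef4ecbe3c3d9483cf91d6e4b92cbbe972b74752d7120e3d21523fa83862d"
)

def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so large PDFs need little memory."""
    digest = hashlib.sha512()
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    # Assumed local path; change to wherever the PDF was saved.
    path = Path("FULLTEXT01.pdf")
    if path.exists():
        ok = sha512_of(path) == EXPECTED_SHA512
        print("checksum OK" if ok else "checksum MISMATCH")
```

Streaming the file in chunks avoids loading the whole 9.5 MB PDF into memory at once, and the same function works for the spikblad file with its own digest.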
spikblad (81 kB), 60 downloads
File information
File name: SPIKBLAD01.pdf
File size: 81 kB
Checksum (SHA-512):
7f85082dcd15fb6d0966aef8306bdf81611b503fa8ef161f4d16663b0ea9ee057399c8f1e5d2d7b832f7446acf6d8dca66445f5fa03a24654d87f9953d854b2c
Type: spikblad (defence announcement)
Mimetype: application/pdf

Author
Dahlgren Lindström, Adam
Organisation
Department of Computing Science

Total: 951 downloads
The number of downloads is the sum of downloads for all full texts. It may include, for example, earlier versions that are no longer available.
