General Purpose Vector Representation for Swedish Documents: An application of Neural Language Models
Umeå University, Faculty of Science and Technology, Department of Physics.
2019 (English). Independent thesis, Advanced level (professional degree), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

This thesis is a proof of concept for embedding Swedish documents as continuous vectors. These vectors can be used as input to any subsequent task and serve as an alternative to discrete bag-of-words vectors. The differences go beyond dimensionality: the continuous vectors also encode contextual information. This means that documents with no shared vocabulary can still be identified as contextually similar, which is impossible with bag-of-words vectors. The continuous vectors are produced by neural language models together with algorithms that pool the model output into document-level representations. This thesis surveys the latest research on such models, starting from the Word2Vec algorithms.
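The limitation of bag-of-words vectors mentioned above can be seen directly: two documents with disjoint vocabularies always have zero cosine similarity, no matter how related their topics are. A minimal sketch (not taken from the thesis; the toy documents are invented for illustration):

```python
# Two documents with no shared tokens always score 0.0 under
# bag-of-words cosine similarity, regardless of topical relatedness.
from collections import Counter
import math

def bow_cosine(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between whitespace-tokenized bag-of-words vectors."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

doc1 = "kungen besökte slottet"     # "the king visited the castle"
doc2 = "monarken bodde i palatset"  # "the monarch lived in the palace"
print(bow_cosine(doc1, doc2))  # 0.0 -- no overlapping tokens
```

A continuous embedding, by contrast, can place "kungen" and "monarken" near each other in vector space, so the two documents end up similar despite sharing no tokens.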

A wide variety of neural language models were selected, together with algorithms for pooling word and sentence vectors into document vectors. For the training of the neural language models we assembled a corpus spanning 1.2 billion Swedish words. The trained neural language models were then paired with pooling algorithms to produce an array of document vector models. The document vector models were evaluated on five classification tasks and compared against the baseline bag-of-words vectors. A few models trained directly on the evaluation data were also included as reference. For each evaluation task the setup was held constant, which ensured that any difference in performance came from the quality of the document representations.
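One common pooling algorithm of the kind described above is the average of a document's word vectors. A hedged sketch (the 3-dimensional toy embeddings below are invented for illustration; the thesis trains real models on the 1.2-billion-word corpus):

```python
# Mean pooling: average the word embeddings of a document's tokens
# to obtain a single fixed-size document vector.
import numpy as np

# Toy embeddings, invented for this example.
embeddings = {
    "kungen":  np.array([0.9, 0.1, 0.0]),
    "besökte": np.array([0.2, 0.8, 0.1]),
    "slottet": np.array([0.7, 0.2, 0.3]),
}

def mean_pool(tokens, emb, dim=3):
    """Average the embeddings of in-vocabulary tokens; zeros if none found."""
    vecs = [emb[t] for t in tokens if t in emb]  # skip out-of-vocabulary words
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

doc_vec = mean_pool("kungen besökte slottet".split(), embeddings)
print(doc_vec)  # the element-wise average of the three word vectors
```

The resulting fixed-size vector can then be fed to any downstream classifier, which is what makes the constant evaluation setup described above possible.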

The results show that the continuous document vectors outperform the baseline on topic and text-format classification tasks. The best performance was achieved when a document vector model was trained directly on the evaluation data, but this result was only marginally better than that of the best general document vector models. In conclusion, the proof of concept was successful, though improvements remain to be made, such as optimizing the composition of the training corpus. Due to its simplicity and overall performance, we recommend a general Sent2Vec model as a new baseline for future projects.

Place, publisher, year, edition, pages
2019.
National Category
Probability Theory and Statistics; Computer Sciences
Identifiers
URN: urn:nbn:se:umu:diva-160109
OAI: oai:DiVA.org:umu-160109
DiVA id: diva2:1323994
External cooperation
Advectas
Subject / course
Degree Project in Engineering Physics (Examensarbete i teknisk fysik)
Educational program
Master of Science Programme in Engineering Physics
Available from: 2019-06-19. Created: 2019-06-13. Last updated: 2019-06-19. Bibliographically approved.

Open Access in DiVA

fulltext (11917 kB), 93 downloads
File information
File name: FULLTEXT01.pdf
File size: 11917 kB
Checksum (SHA-512): a665b225310a333547ebea33af48283db454db05a320b7cb658bd80e6896d703647662e311374e5a6bbaabe760ad956b92e40d082cada072513c4fd27f951c35
Type: fulltext
Mimetype: application/pdf

By organisation
Department of Physics
Probability Theory and Statistics; Computer Sciences

Total: 93 downloads
The number of downloads is the sum of all downloads of full texts. It may include e.g. previous versions that are no longer available.

Total: 44 hits