Topic modeling by clustering language model embeddings: human validation on an industry dataset
Umeå University, Faculty of Science and Technology, Department of Computing Science. (Foundations of Language Processing) ORCID iD: 0000-0002-4366-7863
Adlede AB. ORCID iD: 0000-0001-6601-5190
2022 (English) Conference paper, oral presentation only (Refereed)
Abstract [en]

Topic models are powerful tools to get an overview of large collections of text data, a situation that is prevalent in industry applications. A rising trend within topic modeling is to directly cluster dimension-reduced embeddings created with pretrained language models. It is difficult to evaluate these models because there is no ground truth and automatic measurements may not mimic human judgment. To address this problem, we created a tool called STELLAR for interactive topic browsing which we used for human evaluation of topics created from a real-world dataset used in industry. Embeddings created with BERT were used together with UMAP and HDBSCAN to model the topics. The human evaluation found that our topic model creates coherent topics. The following discussion revolves around the requirements of industry and what research is needed for production-ready systems.
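The pipeline named in the abstract (language model embeddings, then UMAP, then HDBSCAN) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the model name `all-MiniLM-L6-v2`, all parameter values, and the function names `cluster_embeddings` and `group_by_topic` are assumptions chosen for the example, and the sketch relies on the third-party packages `sentence-transformers`, `umap-learn`, and `hdbscan`.

```python
from collections import defaultdict


def cluster_embeddings(docs, min_cluster_size=10):
    """Embed, reduce, and cluster documents; returns one integer label per document."""
    # Third-party imports are kept local so that group_by_topic below
    # can be used without these packages installed.
    from sentence_transformers import SentenceTransformer
    import umap
    import hdbscan

    # Assumed model: the abstract only says "BERT", not which checkpoint.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(docs)

    # Reduce dimensionality before clustering (parameter values are assumptions).
    # UMAP needs clearly more documents than n_neighbors to work well.
    reducer = umap.UMAP(n_neighbors=15, n_components=5, metric="cosine", random_state=42)
    reduced = reducer.fit_transform(embeddings)

    # HDBSCAN assigns the label -1 to documents it treats as noise.
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    return clusterer.fit_predict(reduced).tolist()


def group_by_topic(docs, labels):
    """Group documents by cluster label, skipping noise (label -1)."""
    topics = defaultdict(list)
    for doc, label in zip(docs, labels):
        if label != -1:
            topics[int(label)].append(doc)
    return dict(topics)
```

Calling `group_by_topic(docs, cluster_embeddings(docs))` would then give one list of documents per topic, which is the kind of grouping a human evaluator could browse and judge for coherence.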

Place, publisher, year, edition, pages
2022.
Keywords [en]
topic modeling, dynamic topic modeling, topic modeling evaluation, pretrained language model, document clustering
National subject category
Language Technology (Computational Linguistics)
Identifiers
URN: urn:nbn:se:umu:diva-202499; OAI: oai:DiVA.org:umu-202499; DiVA, id: diva2:1725711
Conference
The 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, December 7-11, 2022
Funder
Swedish Foundation for Strategic Research (SSF), ID190055
Available from: 2023-01-11 Created: 2023-01-11 Last updated: 2023-01-11 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Person

Eklund, Anton

Search further in DiVA

By author/editor
Eklund, Anton; Forsman, Mona