umu.se Publications
1 - 6 of 6
  • 1.
    Devinney, Hannah
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Umeå University, Faculty of Social Sciences, Umeå Centre for Gender Studies (UCGS).
    Eklund, Anton
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Ryazanov, Igor
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Cai, Jingwen
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Developing a multilingual corpus of Wikipedia biographies. 2023. In: International Conference Recent Advances in Natural Language Processing 2023: Large Language Models for Natural Language Processing: Proceedings / [ed] Ruslan Mitkov; Maria Kunilovskaya; Galia Angelova. Shoumen, Bulgaria: Incoma Ltd., 2023, article id 2023.ranlp-1.32. Conference paper (Refereed)
    Abstract [en]

    For many languages, Wikipedia is the most accessible source of biographical information. Studying how Wikipedia describes the lives of people can provide insights into societal biases, as well as cultural differences more generally. We present a method for extracting datasets of Wikipedia biographies. The accompanying codebase is adapted to English, Swedish, Russian, Chinese, and Farsi, and is extendable to other languages. We present an exploratory analysis of biographical topics and gendered patterns in four languages using topic modelling and embedding clustering. We find similarities across languages in the types of categories present, with the distribution of biographies concentrated in the language’s core regions. Masculine terms are over-represented and spread out over a wide variety of topics. Feminine terms are less frequent and linked to more constrained topics. Non-binary terms are nearly non-represented.

  • 2.
    Eklund, Anton
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Forsman, Mona
    Adlede AB.
    Topic modeling by clustering language model embeddings: human validation on an industry dataset. 2022. Conference paper (Refereed)
    Abstract [en]

    Topic models are powerful tools to get an overview of large collections of text data, a situation that is prevalent in industry applications. A rising trend within topic modeling is to directly cluster dimension-reduced embeddings created with pretrained language models. It is difficult to evaluate these models because there is no ground truth and automatic measurements may not mimic human judgment. To address this problem, we created a tool called STELLAR for interactive topic browsing which we used for human evaluation of topics created from a real-world dataset used in industry. Embeddings created with BERT were used together with UMAP and HDBSCAN to model the topics. The human evaluation found that our topic model creates coherent topics. The following discussion revolves around the requirements of industry and what research is needed for production-ready systems.

  • 3.
    Eklund, Anton
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Adlede AB, Umeå, Sweden.
    Forsman, Mona
    Adlede AB, Umeå, Sweden.
    Topic modeling by clustering language model embeddings: human validation on an industry dataset. 2022. In: EMNLP 2022 Industry Track: Proceedings of the Conference, Association for Computational Linguistics (ACL), 2022, p. 645-653. Conference paper (Refereed)
    Abstract [en]

    Topic models are powerful tools to get an overview of large collections of text data, a situation that is prevalent in industry applications. A rising trend within topic modeling is to directly cluster dimension-reduced embeddings created with pretrained language models. It is difficult to evaluate these models because there is no ground truth and automatic measurements may not mimic human judgment. To address this problem, we created a tool called STELLAR for interactive topic browsing which we used for human evaluation of topics created from a real-world dataset used in industry. Embeddings created with BERT were used together with UMAP and HDBSCAN to model the topics. The human evaluation found that our topic model creates coherent topics. The following discussion revolves around the requirements of industry and what research is needed for production-ready systems.
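
    The pipeline named in the abstract above (BERT embeddings, UMAP dimension reduction, HDBSCAN clustering) can be sketched as follows. This is an illustrative stand-in, not the authors' code: random vectors play the role of BERT output, a truncated SVD stands in for UMAP, and a minimal k-means stands in for HDBSCAN; a real run would use the sentence-transformers, umap-learn, and hdbscan packages instead.

```python
import numpy as np

def reduce_dims(embeddings, n_components):
    """Stand-in for UMAP: project onto the top singular vectors (truncated SVD)."""
    centered = embeddings - embeddings.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def kmeans(points, k, n_iter=50, seed=0):
    """Stand-in for HDBSCAN: plain k-means returning a cluster label per point."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster went empty.
        for j in range(k):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Pipeline shape: document embeddings -> dimension reduction -> clustering.
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(200, 768))  # pretend BERT output for 200 documents
reduced = reduce_dims(embeddings, n_components=5)
topics = kmeans(reduced, k=4)             # one topic label per document
print("first ten topic labels:", topics[:10].tolist())
```

    The point of the sketch is only the shape of the method: each document becomes a vector, the vectors are compressed, and clusters of compressed vectors are read as topics.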

  • 4.
    Eklund, Anton
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Adlede, Umeå, Sweden.
    Forsman, Mona
    Adlede, Umeå, Sweden.
    Drewes, Frank
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    An empirical configuration study of a common document clustering pipeline. 2023. In: Northern European Journal of Language Technology (NEJLT), ISSN 2000-1533, Vol. 9, no 1. Article in journal (Refereed)
    Abstract [en]

    Document clustering is frequently used in applications of natural language processing, e.g. to classify news articles or create topic models. In this paper, we study document clustering with the common clustering pipeline that includes vectorization with BERT or Doc2Vec, dimension reduction with PCA or UMAP, and clustering with K-Means or HDBSCAN. We discuss the interactions of the different components in the pipeline, parameter settings, and how to determine an appropriate number of dimensions. The results suggest that BERT embeddings combined with UMAP dimension reduction to no less than 15 dimensions provide a good basis for clustering, regardless of the specific clustering algorithm used. Moreover, while UMAP performed better than PCA in our experiments, tuning the UMAP settings showed little impact on the overall performance. Hence, we recommend configuring UMAP so as to optimize its time efficiency. According to our topic model evaluation, the combination of BERT and UMAP, also used in BERTopic, performs best. A topic model based on this pipeline typically benefits from a large number of clusters.
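
    The abstract's recommendation of reducing to no fewer than 15 dimensions can be illustrated with a quick explained-variance check. This is a hypothetical example on synthetic data, with PCA (via SVD) standing in for the UMAP used in the paper: when documents have more latent directions than the target dimensionality, reducing too aggressively discards structure the clusterer needs.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical document vectors: 300 docs of dim 768 with ~20 dominant directions.
signal = rng.normal(size=(300, 20)) @ rng.normal(size=(20, 768))
noise = 0.1 * rng.normal(size=(300, 768))
X = signal + noise

# PCA via SVD: how much variance do the first d components retain?
centered = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
var = s ** 2 / (s ** 2).sum()
for d in (2, 5, 15, 50):
    print(f"{d:>3} dims: {var[:d].sum():.2%} of variance retained")
```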

  • 5.
    Eklund, Anton
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Adlede AB, Umeå, Sweden.
    Forsman, Mona
    Adlede AB, Umeå, Sweden.
    Drewes, Frank
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Dynamic topic modeling by clustering embeddings from pretrained language models: a research proposal. 2022. In: Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Student Research Workshop / [ed] Yan Hanqi; Yang Zonghan; Sebastian Ruder; Wan Xiaojun. Association for Computational Linguistics, 2022, p. 84-91. Conference paper (Refereed)
    Abstract [en]

    A new trend in topic modeling research is to do Neural Topic Modeling by Clustering document Embeddings (NTM-CE) created with a pretrained language model. Studies have evaluated static NTM-CE models and found them performing comparably to, or even better than other topic models. An important extension of static topic modeling is making the models dynamic, allowing the study of topic evolution over time, as well as detecting emerging and disappearing topics. In this research proposal, we present two research questions to understand dynamic topic modeling with NTM-CE theoretically and practically. To answer these, we propose four phases with the aim of establishing evaluation methods for dynamic topic modeling, finding NTM-CE-specific properties, and creating a framework for dynamic NTM-CE. For evaluation, we propose to use both quantitative measurements of coherence and human evaluation supported by our recently developed tool.

  • 6.
    Hatefi, Arezoo
    et al.
    Umeå University, Faculty of Science and Technology, Department of Computing Science.
    Eklund, Anton
    Umeå University, Faculty of Science and Technology, Department of Computing Science. Aeterna Labs, Umeå, Sweden.
    Forsman, Mona
    Aeterna Labs, Umeå, Sweden.
    PromptStream: self-supervised news story discovery using topic-aware article representations. 2024. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) / [ed] Nicoletta Calzolari; Min-Yen Kan; Veronique Hoste; Alessandro Lenci; Sakriani Sakti; Nianwen Xue. ELRA Language Resource Association, 2024, p. 13222-13232. Conference paper (Refereed)
    Abstract [en]

    Given the importance of identifying and monitoring news stories within the continuous flow of news articles, this paper presents PromptStream, a novel method for unsupervised news story discovery. In order to identify coherent and comprehensive stories across the stream, it is crucial to create article representations that incorporate as much topic-related information from the articles as possible. PromptStream constructs these article embeddings using cloze-style prompting. These representations continually adjust to the evolving context of the news stream through self-supervised learning, employing a contrastive loss and a memory of the most confident article-story assignments from the most recent days. Extensive experiments with real news datasets highlight the notable performance of our model, establishing a new state of the art. Additionally, we delve into selected news stories to reveal how the model’s structuring of the article stream aligns with story progression.
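
    The two ingredients named in the abstract above — cloze-style prompting and a contrastive loss — can be sketched as follows. This is a hypothetical illustration, not the PromptStream implementation: the prompt template and the NumPy InfoNCE-style loss are our own stand-ins, and in the real system a language model would fill the masked slot to produce the article embedding.

```python
import numpy as np

def cloze_prompt(article_text):
    """Wrap an article in a cloze-style template; a language model would be
    asked to fill the [MASK], and its hidden state yields the embedding."""
    return f"{article_text} This article is about [MASK]."

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Contrastive loss: pull the anchor toward its positive (same story),
    push it away from negatives (articles from other stories)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    # Positive at index 0, then the negatives; cross-entropy against index 0.
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    logits /= temperature
    return -logits[0] + np.log(np.exp(logits).sum())

rng = np.random.default_rng(1)
anchor = rng.normal(size=16)
positive = anchor + 0.1 * rng.normal(size=16)   # same story, slightly perturbed
negatives = [rng.normal(size=16) for _ in range(5)]
print(cloze_prompt("Floods hit the coast on Monday."))
print(f"loss: {info_nce_loss(anchor, positive, negatives):.3f}")
```

    The loss shrinks as the anchor and its positive become more similar, which is how the self-supervised updates make embeddings of same-story articles drift together over the stream.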
