Publications (9 of 9)
Eklund, A., Forsman, M. & Drewes, F. (2025). Comparing human-perceived cluster characteristics through the lens of CIPHE: measuring coherence beyond keywords. Journal of Data Mining and Digital Humanities, NLP4DH, Article ID 32.
2025 (English). In: Journal of Data Mining and Digital Humanities, E-ISSN 2416-5999, Vol. NLP4DH, article id 32. Article in journal (Refereed). Published.
Abstract [en]

A frequent problem in document clustering and topic modeling is the lack of ground truth. Models are typically intended to reflect some aspect of how human readers view texts (the general theme, sentiment, emotional response, etc.), but it can be difficult to assess whether they actually do. The only real ground truth is human judgement. To enable researchers and practitioners to collect such judgement in a cost-efficient standardized way, we have developed the crowdsourcing solution CIPHE -- Cluster Interpretation and Precision from Human Exploration. CIPHE is an adaptable framework which systematically gathers and evaluates data on the human perception of a set of document clusters, where participants read sample texts from the cluster. In this article, we use CIPHE to study the limitations that keyword-based methods pose in topic modeling coherence evaluation. Keyword methods, including word intrusion, are compared with the outcome of the more thorough CIPHE on scoring and characterizing clusters. The results show how the abstraction of keywords skews the cluster interpretation for almost half of the compared instances, meaning that many important cluster characteristics are missed. Further, we present a case study where CIPHE is used to (a) provide insights into the UK news domain and (b) find out how the evaluated clustering model should be tuned to better suit the intended application. The experiments provide evidence that CIPHE characterizes clusters in a predictable manner and has the potential to be a valuable framework for using human evaluation in the pursuit of nuanced research aims.
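
For orientation on the keyword-based baselines mentioned above, the sketch below scores the standard word-intrusion task: the share of annotators who spot the intruder word planted among a topic's top keywords. The function and toy data are illustrative only; they are not taken from the paper or the CIPHE code.

# Hedged sketch: scoring a word-intrusion task, one of the keyword-based
# coherence measures compared against CIPHE above. Toy data for illustration.

def word_intrusion_scores(responses, intruders):
    """responses: {topic_id: list of words picked by annotators}
    intruders: {topic_id: the word that was actually planted}
    Returns per-topic precision = share of annotators who found the intruder."""
    return {
        topic_id: sum(pick == intruders[topic_id] for pick in picks) / len(picks)
        for topic_id, picks in responses.items()
    }

# Toy example: three annotators judged topic 0, two judged topic 1.
responses = {0: ["banana", "banana", "goal"], 1: ["parliament", "parliament"]}
intruders = {0: "banana", 1: "parliament"}
print(word_intrusion_scores(responses, intruders))  # {0: 0.666..., 1: 1.0}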

Place, publisher, year, edition, pages
Centre pour la Communication Scientifique Directe (CCSD), 2025
Keywords
document clustering, topic modeling, topic modeling evaluation, news clustering, topic coherence, human evaluation methods, crowdsourced cluster validation, BERTopic, CIPHE
National Category
Natural Language Processing
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-236229 (URN), 10.46298/jdmdh.15044 (DOI)
Note

The code for the CIPHE platform is available at https://github.com/antoneklund/CIPHE/

Available from: 2025-03-09. Created: 2025-03-09. Last updated: 2025-03-11. Bibliographically approved.
Eklund, A. (2025). Evaluating document clusters through human interpretation. (Doctoral dissertation). Umeå: Umeå University
2025 (English). Doctoral thesis, comprehensive summary (Other academic).
Alternative title [sv]
Utvärdering av dokumentkluster genom mänsklig tolkning
Abstract [en]

Document clustering is a technique for organizing and discovering patterns in large collections of text, often used in applications such as news aggregation and contextual advertising. An example is the automatic grouping of news articles by theme, which is the focus of this thesis. For a clustering to be successful, the resulting clusters typically need to appear interpretable and coherent to a human. However, there is a lack of efficient methods to reliably assess the quality of a clustering in terms of human-perceived coherence, which is essential for ensuring its usefulness in real-world applications.

To address the lack of evaluation methods for document clustering focusing on human interpretation, we introduced Cluster Interpretation and Precision from Human Exploration (CIPHE). CIPHE tasks human evaluators with exploring document samples from a cluster and collects their interpretations. The interpretations are collected through a standardized survey and then processed with the framework's metrics to yield cluster precision and characteristics. This thesis presents and discusses the development process of CIPHE. The feasibility of performing the exploratory tasks of CIPHE in a crowdsourcing environment was investigated, which yielded insights into how to formulate instructions. Additionally, CIPHE was confirmed to identify characteristics beyond the main theme, such as a negative emotional response.

CIPHE was paired with a standard clustering pipeline to evaluate its capabilities and limitations. The pipeline is widely applied for its adaptability and conceptual simplicity, and it is also part of the popular topic model BERTopic. The empirical results of applying CIPHE suggest that the pipeline, when integrated with a Transformer-based language model, generally yields coherent clusters.

Additionally, topic models share an aim with document clustering: to automate corpus processing and present the underlying themes to a human. Topic modeling has a rich body of research on the human interpretation of topic coherence. In the thesis, the human interpretations collected with CIPHE were related to this established research and, specifically, used to highlight limitations of the keyword representations that topic coherence evaluation relies on.

Place, publisher, year, edition, pages
Umeå: Umeå University, 2025. p. 44
Series
Report / UMINF, ISSN 0348-0542 ; 25.03
Keywords
document clustering, topic modeling, information retrieval, human evaluation, human-in-the-loop, news clustering, natural language processing, topic coherence, human interpretation
National Category
Natural Language Processing
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-236295 (URN), 9789180706469 (ISBN), 9789180706476 (ISBN)
Public defence
2025-04-03, Lindellhallen 3 (UB.A.230), Samhällsvetarhuset, Umeå, 13:15 (English)
Funder
Swedish Foundation for Strategic Research, ID19-0055
Available from: 2025-03-13. Created: 2025-03-10. Last updated: 2025-03-12. Bibliographically approved.
Eklund, A., Forsman, M. & Drewes, F. (2024). CIPHE: A Framework for Document Cluster Interpretation and Precision from Human Exploration. In: Mika Hämäläinen; Emily Öhman; So Miyagawa; Khalid Alnajjar; Yuri Bizzoni (Ed.), Proceedings of the 4th international conference on natural language processing for digital humanities. Paper presented at 4th International Conference on Natural Language Processing for Digital Humanities, Miami, USA, November 15-16, 2024 (pp. 536-548). Association for Computational Linguistics
2024 (English). In: Proceedings of the 4th international conference on natural language processing for digital humanities / [ed] Mika Hämäläinen; Emily Öhman; So Miyagawa; Khalid Alnajjar; Yuri Bizzoni, Association for Computational Linguistics, 2024, p. 536-548. Conference paper, Published paper (Refereed).
Abstract [en]

Document clustering models serve unique application purposes, which turns model quality into a property that depends on the needs of the individual investigator. We propose a framework, Cluster Interpretation and Precision from Human Exploration (CIPHE), for collecting and quantifying human interpretations of cluster samples. CIPHE tasks survey participants with exploring actual document texts from cluster samples and records their perceptions. It also includes a novel inclusion task that is used to calculate the cluster precision in an indirect manner. A case study on news clusters shows that CIPHE reveals which clusters have multiple interpretation angles, aiding the investigator in their exploration.
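
Purely as an illustration of the inclusion idea (the actual CIPHE metrics are defined in the paper and its code repository), the snippet below shows one way judgments from an inclusion task could be turned into an indirect precision estimate: participants decide, for a mix of documents drawn from the cluster and planted outside documents, whether each text belongs to the cluster they explored.

# Hypothetical illustration only -- not the CIPHE metric. See
# https://github.com/antoneklund/CIPHE/ for the actual implementation.

def indirect_precision(judgments):
    """judgments: list of (is_from_cluster, judged_as_belonging) pairs,
    one per participant-document decision (booleans or 0/1)."""
    cluster_docs = [j for from_cluster, j in judgments if from_cluster]
    intruder_docs = [j for from_cluster, j in judgments if not from_cluster]
    hit_rate = sum(cluster_docs) / len(cluster_docs)        # cluster docs accepted
    false_alarms = sum(intruder_docs) / len(intruder_docs)  # intruders accepted
    # Penalize clusters whose perceived theme also admits unrelated texts.
    return max(0.0, hit_rate - false_alarms)

toy = [(True, 1), (True, 1), (True, 0), (False, 0), (False, 1)]
print(indirect_precision(toy))  # 2/3 - 1/2 ≈ 0.17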

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2024
Keywords
document clustering, topic modeling, clustering, human evaluation, CIPHE, news articles
National Category
Natural Language Processing
Research subject
computational linguistics
Identifiers
urn:nbn:se:umu:diva-231697 (URN), 2-s2.0-85216576924 (Scopus ID), 979-8-89176-181-0 (ISBN)
Conference
4th International Conference on Natural Language Processing for Digital Humanities, Miami, USA, November 15-16, 2024
Available from: 2024-11-11. Created: 2024-11-11. Last updated: 2025-04-02. Bibliographically approved.
Hatefi, A., Eklund, A. & Forsman, M. (2024). PromptStream: self-supervised news story discovery using topic-aware article representations. In: Nicoletta Calzolari; Min-Yen Kan; Veronique Hoste; Alessandro Lenci; Sakriani Sakti; Nianwen Xue (Ed.), Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Paper presented at The 2024 joint international conference on computational linguistics, language resources and evaluation (LREC-COLING 2024), Torino, Italy, May 20-25, 2024 (pp. 13222-13232). ELRA Language Resource Association
2024 (English). In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) / [ed] Nicoletta Calzolari; Min-Yen Kan; Veronique Hoste; Alessandro Lenci; Sakriani Sakti; Nianwen Xue, ELRA Language Resource Association, 2024, p. 13222-13232. Conference paper, Published paper (Refereed).
Abstract [en]

Given the importance of identifying and monitoring news stories within the continuous flow of news articles, this paper presents PromptStream, a novel method for unsupervised news story discovery. In order to identify coherent and comprehensive stories across the stream, it is crucial to create article representations that incorporate as much topic-related information from the articles as possible. PromptStream constructs these article embeddings using cloze-style prompting. These representations continually adjust to the evolving context of the news stream through self-supervised learning, employing a contrastive loss and a memory of the most confident article-story assignments from the most recent days. Extensive experiments with real news datasets highlight the notable performance of our model, establishing a new state of the art. Additionally, we delve into selected news stories to reveal how the model’s structuring of the article stream aligns with story progression.
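
A minimal sketch of the cloze-style prompting step described above, using the Hugging Face transformers library; the prompt template, model choice, and pooling are assumptions made for illustration, not the paper's exact configuration.

# Hedged sketch of cloze-style prompting for topic-aware article embeddings.
# Template, model, and pooling are illustrative assumptions, not the
# PromptStream configuration.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased", output_hidden_states=True)

def cloze_embedding(article_text):
    # Prepend a cloze prompt so the [MASK] position aggregates topical cues.
    prompt = f"Topic: [MASK]. {article_text}"
    inputs = tokenizer(prompt, truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
    # The hidden state at the [MASK] position serves as the article embedding.
    return outputs.hidden_states[-1][0, mask_pos]

embedding = cloze_embedding("The central bank raised interest rates again on Thursday ...")
print(embedding.shape)  # torch.Size([768])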

Place, publisher, year, edition, pages
ELRA Language Resource Association, 2024
Series
International conference on computational linguistics, ISSN 2951-2093
Keywords
news story discovery, online clustering, data stream, contrastive learning, cloze-style prompting, article embedding
National Category
Computer and Information Sciences
Research subject
computational linguistics; Computer Science
Identifiers
urn:nbn:se:umu:diva-222533 (URN), 2-s2.0-85195905320 (Scopus ID), 978-2-493814-10-4 (ISBN)
Conference
The 2024 joint international conference on computational linguistics, language resources and evaluation (LREC-COLING 2024), Torino, Italy, May 20-25, 2024
Note

Also part of series: LREC proceedings, ISSN: 2522-2686

Available from: 2024-03-20. Created: 2024-03-20. Last updated: 2024-06-25. Bibliographically approved.
Eklund, A., Forsman, M. & Drewes, F. (2023). An empirical configuration study of a common document clustering pipeline. Northern European Journal of Language Technology (NEJLT), 9(1)
2023 (English). In: Northern European Journal of Language Technology (NEJLT), ISSN 2000-1533, Vol. 9, no 1. Article in journal (Refereed). Published.
Abstract [en]

Document clustering is frequently used in applications of natural language processing, e.g. to classify news articles or create topic models. In this paper, we study document clustering with the common clustering pipeline that includes vectorization with BERT or Doc2Vec, dimension reduction with PCA or UMAP, and clustering with K-Means or HDBSCAN. We discuss the interactions of the different components in the pipeline, parameter settings, and how to determine an appropriate number of dimensions. The results suggest that BERT embeddings combined with UMAP dimension reduction to no less than 15 dimensions provide a good basis for clustering, regardless of the specific clustering algorithm used. Moreover, while UMAP performed better than PCA in our experiments, tuning the UMAP settings showed little impact on the overall performance. Hence, we recommend configuring UMAP so as to optimize its time efficiency. According to our topic model evaluation, the combination of BERT and UMAP, also used in BERTopic, performs best. A topic model based on this pipeline typically benefits from a large number of clusters.
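
A minimal sketch of the pipeline studied above, assuming the sentence-transformers, umap-learn, hdbscan, and scikit-learn packages; the model name and parameter values are illustrative defaults rather than the exact configurations from the experiments.

# Hedged sketch of the common clustering pipeline: embed -> reduce -> cluster.
# Model name and parameters are illustrative, not the paper's exact settings.

from sentence_transformers import SentenceTransformer
import umap
import hdbscan
from sklearn.cluster import KMeans

def cluster_documents(docs, n_dims=15, use_hdbscan=True, n_clusters=50):
    # 1) Vectorize documents with a BERT-based sentence encoder.
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)
    # 2) Reduce dimensionality with UMAP; the paper recommends no fewer
    #    than 15 output dimensions.
    reduced = umap.UMAP(n_components=n_dims, metric="cosine").fit_transform(embeddings)
    # 3) Cluster the reduced vectors with HDBSCAN or K-Means.
    if use_hdbscan:
        return hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(reduced)
    return KMeans(n_clusters=n_clusters).fit_predict(reduced)

# labels = cluster_documents(corpus)  # corpus: a list of article strings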

Place, publisher, year, edition, pages
Linköping University Electronic Press, 2023
Keywords
document clustering, topic modeling, dimension reduction, clustering, BERT, doc2vec, UMAP, PCA, K-Means, HDBSCAN
National Category
Natural Language Processing
Identifiers
urn:nbn:se:umu:diva-214455 (URN), 10.3384/nejlt.2000-1533.2023.4396 (DOI)
Funder
Swedish Foundation for Strategic Research, ID19-0055
Available from: 2023-09-15. Created: 2023-09-15. Last updated: 2025-03-10. Bibliographically approved.
Devinney, H., Eklund, A., Ryazanov, I. & Cai, J. (2023). Developing a multilingual corpus of Wikipedia biographies. In: Ruslan Mitkov; Maria Kunilovskaya; Galia Angelova (Ed.), International conference. Recent advances in natural language processing 2023, large language models for natural language processing: proceedings. Paper presented at 14th international conference on Recent Advances in Natural Language Processing 2023, Varna, Bulgaria, September 4-6, 2023. Shoumen, Bulgaria: Incoma ltd., Article ID 2023.ranlp-1.32.
2023 (English). In: International conference. Recent advances in natural language processing 2023, large language models for natural language processing: proceedings / [ed] Ruslan Mitkov; Maria Kunilovskaya; Galia Angelova, Shoumen, Bulgaria: Incoma ltd., 2023, article id 2023.ranlp-1.32. Conference paper, Published paper (Refereed).
Abstract [en]

For many languages, Wikipedia is the most accessible source of biographical information. Studying how Wikipedia describes the lives of people can provide insights into societal biases, as well as cultural differences more generally. We present a method for extracting datasets of Wikipedia biographies. The accompanying codebase is adapted to English, Swedish, Russian, Chinese, and Farsi, and is extendable to other languages. We present an exploratory analysis of biographical topics and gendered patterns in four languages using topic modelling and embedding clustering. We find similarities across languages in the types of categories present, with the distribution of biographies concentrated in the language's core regions. Masculine terms are over-represented and spread out over a wide variety of topics. Feminine terms are less frequent and linked to more constrained topics. Non-binary terms are nearly non-represented.

Place, publisher, year, edition, pages
Shoumen, Bulgaria: Incoma ltd., 2023
Series
International conference Recent advances in natural language processing, ISSN 2603-2813 ; 2023
National Category
Natural Language Processing
Research subject
computational linguistics
Identifiers
urn:nbn:se:umu:diva-213781 (URN), 10.26615/978-954-452-092-2_032 (DOI), 2-s2.0-85179178058 (Scopus ID), 978-954-452-092-2 (ISBN)
Conference
14th international conference on Recent Advances in Natural Language Processing 2023, Varna, Bulgaria, September 4-6, 2023.
Available from: 2023-11-10. Created: 2023-11-10. Last updated: 2025-02-07. Bibliographically approved.
Eklund, A., Forsman, M. & Drewes, F. (2022). Dynamic topic modeling by clustering embeddings from pretrained language models: a research proposal. In: Yan Hanqi; Yang Zonghan; Sebastian Ruder; Wan Xiaojun (Ed.), Proceedings of the 2nd conference of the Asia-Pacific chapter of the Association for computational linguistics and the 12th international joint conference on natural language processing: student research workshop. Paper presented at The 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, Online, November 21-24, 2022 (pp. 84-91). Association for Computational Linguistics (ACL)
2022 (English). In: Proceedings of the 2nd conference of the Asia-Pacific chapter of the Association for computational linguistics and the 12th international joint conference on natural language processing: student research workshop / [ed] Yan Hanqi; Yang Zonghan; Sebastian Ruder; Wan Xiaojun, Association for Computational Linguistics (ACL), 2022, p. 84-91. Conference paper, Published paper (Refereed).
Abstract [en]

A new trend in topic modeling research is to do Neural Topic Modeling by Clustering document Embeddings (NTM-CE) created with a pretrained language model. Studies have evaluated static NTM-CE models and found them to perform comparably to, or even better than, other topic models. An important extension of static topic modeling is making the models dynamic, allowing the study of topic evolution over time, as well as detecting emerging and disappearing topics. In this research proposal, we present two research questions to understand dynamic topic modeling with NTM-CE theoretically and practically. To answer these, we propose four phases with the aim of establishing evaluation methods for dynamic topic modeling, finding NTM-CE-specific properties, and creating a framework for dynamic NTM-CE. For evaluation, we propose to use both quantitative measurements of coherence and human evaluation supported by our recently developed tool.

Place, publisher, year, edition, pages
Association for Computational Linguistics (ACL), 2022
Keywords
topic modeling, dynamic topic modeling, topic modeling evaluation, research proposal, pretrained language model
National Category
Natural Language Processing
Identifiers
urn:nbn:se:umu:diva-202486 (URN), 10.18653/v1/2022.aacl-srw.12 (DOI), 9781955917568 (ISBN)
Conference
The 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, Online, November 21-24, 2022
Funder
Swedish Foundation for Strategic Research, ID19-0055
Available from: 2023-01-11. Created: 2023-01-11. Last updated: 2025-03-11. Bibliographically approved.
Eklund, A. & Forsman, M. (2022). Topic modeling by clustering language model embeddings: human validation on an industry dataset. In: Yunyao Li; Angeliki Lazaridou (Ed.), EMNLP 2022 industry track: proceedings of the conference. Paper presented at 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, UAE, December 7-11, 2022 (pp. 645-653). Association for Computational Linguistics (ACL)
2022 (English). In: EMNLP 2022 industry track: proceedings of the conference / [ed] Yunyao Li; Angeliki Lazaridou, Association for Computational Linguistics (ACL), 2022, p. 645-653. Conference paper, Published paper (Refereed).
Abstract [en]

Topic models are powerful tools for getting an overview of large collections of text data, which are prevalent in industry applications. A rising trend within topic modeling is to directly cluster dimension-reduced embeddings created with pretrained language models. It is difficult to evaluate these models because there is no ground truth and automatic measurements may not mimic human judgment. To address this problem, we created a tool called STELLAR for interactive topic browsing, which we used for human evaluation of topics created from a real-world dataset used in industry. Embeddings created with BERT were used together with UMAP and HDBSCAN to model the topics. The human evaluation found that our topic model creates coherent topics. The following discussion revolves around the requirements of industry and what research is needed for production-ready systems.

Place, publisher, year, edition, pages
Association for Computational Linguistics (ACL), 2022
National Category
Natural Language Processing
Identifiers
urn:nbn:se:umu:diva-207702 (URN), 10.18653/v1/2022.emnlp-industry.65 (DOI), 2-s2.0-85152892684 (Scopus ID), 9781952148255 (ISBN)
Conference
2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, UAE, December 7-11, 2022
Funder
Swedish Foundation for Strategic Research, 19-0055
Available from: 2023-04-28. Created: 2023-04-28. Last updated: 2025-03-11. Bibliographically approved.
Eklund, A., Nordström, A., Drewes, F. & Forsman, M. Industry quality control for efficient continuous human validation of deployed text classification systems.
(English). Manuscript (preprint) (Other academic).
Abstract [en]

Validating the accuracy of a classifier in its application environment is a labor-intensive and often subjective task. Adapting large language models (LLMs) to classification tasks makes reliable human evaluation increasingly important due to their generative capabilities. Acceptance Sampling is a method long used in traditional production lines to validate products before shipment. We adapt Acceptance Sampling concepts to the validation of text classifiers and conduct a case study focusing on news. We show how Acceptance Sampling can efficiently optimize the required amount of manual labor and discuss how the subjective elements of classifying texts can be handled by defining acceptance criteria.
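
For background on the sampling idea, the sketch below evaluates a classical single acceptance-sampling plan (inspect n items, accept the batch if at most c are defective) with the binomial distribution; the plan parameters and error rates are illustrative numbers, not values from the manuscript.

# Hedged sketch of a classical single acceptance-sampling plan (n, c).
# Numbers are illustrative and not taken from the manuscript.

from scipy.stats import binom

def acceptance_probability(n, c, defect_rate):
    """P(accept) = P(X <= c) with X ~ Binomial(n, defect_rate)."""
    return binom.cdf(c, n, defect_rate)

# For a text classifier, a "defect" could be a misclassified document and a
# batch could be one day's worth of classifier output.
n, c = 50, 2
for p in (0.01, 0.05, 0.10):
    print(f"true error rate {p:.0%}: P(accept) = {acceptance_probability(n, c, p):.3f}")
# true error rate 1%: P(accept) = 0.986
# true error rate 5%: P(accept) = 0.541
# true error rate 10%: P(accept) = 0.112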

National Category
Natural Language Processing
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-236230 (URN)
Note

The code for the algorithm and simulations can be found at https://github.com/antoneklund/IntegratedAS

Available from: 2025-03-09. Created: 2025-03-09. Last updated: 2025-03-11. Bibliographically approved.
Identifiers
ORCID iD: orcid.org/0000-0002-4366-7863
