Comparing human-perceived cluster characteristics through the lens of CIPHE: measuring coherence beyond keywords
Eklund, Anton: Umeå University, Faculty of Science and Technology, Department of Computing Science; Aeterna Labs, Sweden. ORCID iD: 0000-0002-4366-7863
Forsman, Mona: Swedish University of Agricultural Sciences, Sweden. ORCID iD: 0000-0001-6601-5190
Drewes, Frank: Umeå University, Faculty of Science and Technology, Department of Computing Science. ORCID iD: 0000-0001-7349-7693
2025 (English). In: Journal of Data Mining and Digital Humanities, E-ISSN 2416-5999, Vol. NLP4DH, article id 32. Article in journal (Refereed), Published
Abstract [en]

A frequent problem in document clustering and topic modeling is the lack of ground truth. Models are typically intended to reflect some aspect of how human readers view texts (the general theme, sentiment, emotional response, etc.), but it can be difficult to assess whether they actually do. The only real ground truth is human judgement. To enable researchers and practitioners to collect such judgement in a cost-efficient, standardized way, we have developed the crowdsourcing solution CIPHE (Cluster Interpretation and Precision from Human Exploration). CIPHE is an adaptable framework that systematically gathers and evaluates data on the human perception of a set of document clusters, where participants read sample texts from each cluster. In this article, we use CIPHE to study the limitations that keyword-based methods pose in topic modeling coherence evaluation. Keyword methods, including word intrusion, are compared with the outcome of the more thorough CIPHE on scoring and characterizing clusters. The results show how the abstraction of keywords skews the cluster interpretation for almost half of the compared instances, meaning that many important cluster characteristics are missed. Further, we present a case study where CIPHE is used to (a) provide insights into the UK news domain and (b) find out how the evaluated clustering model should be tuned to better suit the intended application. The experiments provide evidence that CIPHE characterizes clusters in a predictable manner and has the potential to be a valuable framework for using human evaluation in the pursuit of nuanced research aims.
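For readers unfamiliar with word intrusion, the keyword-based baseline mentioned above, the following is a minimal sketch of how such a task is commonly scored: annotators see a topic's top keywords plus one "intruder" word from another topic, and the score is the fraction of annotators who spot the intruder. This is a generic illustration, not the CIPHE metric or the authors' code; the function name `intrusion_precision`, the keyword list, and the annotator answers are all hypothetical.

```python
def intrusion_precision(intruder: str, picks: list[str]) -> float:
    """Return the fraction of annotators who picked the true intruder word."""
    if not picks:
        raise ValueError("need at least one annotator pick")
    return sum(pick == intruder for pick in picks) / len(picks)


# Hypothetical topic: five keywords plus one intruder ("guitar").
shown_words = ["election", "parliament", "minister", "vote", "guitar", "policy"]

# Made-up answers from four annotators; one of them misses the intruder.
picks = ["guitar", "guitar", "policy", "guitar"]

precision = intrusion_precision("guitar", picks)
print(f"word intrusion precision: {precision:.2f}")
```

A score of this kind depends only on the keyword list, which is the abstraction the article argues can skew cluster interpretation compared with reading sample documents.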

Place, publisher, year, edition, pages
Centre pour la Communication Scientifique Directe (CCSD), 2025. Vol. NLP4DH, article id 32
Keywords [en]
document clustering, topic modeling, topic modeling evaluation, news clustering, topic coherence, human evaluation methods, crowdsourced cluster validation, BERTopic, CIPHE
National Category
Natural Language Processing
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:umu:diva-236229
DOI: 10.46298/jdmdh.15044
OAI: oai:DiVA.org:umu-236229
DiVA, id: diva2:1943168
Note

The code for the CIPHE platform is uploaded at https://github.com/antoneklund/CIPHE/

Available from: 2025-03-09. Created: 2025-03-09. Last updated: 2025-03-11. Bibliographically approved
In thesis
1. Evaluating document clusters through human interpretation
2025 (English). Doctoral thesis, comprehensive summary (Other academic)
Alternative title [sv]
Utvärdering av dokumentkluster genom mänsklig tolkning
Abstract [en]

Document clustering is a technique for organizing and discovering patterns in large collections of text, often used in applications such as news aggregation and contextual advertising. An example is the automatic grouping of news articles by theme, which is the focus of this thesis. For a clustering to be successful, typically the resulting clusters need to appear interpretable and coherent to a human. However, there is a lack of efficient methods to reliably assess the quality of a clustering in terms of human-perceived coherence, which is essential for ensuring its usefulness in real-world applications.

To address the lack of evaluation methods for document clustering that focus on human interpretation, we introduced Cluster Interpretation and Precision from Human Exploration (CIPHE). CIPHE tasks human evaluators with exploring document samples from a cluster and collects their interpretation. The interpretation is collected through a standardized survey and then processed with the framework metrics to yield the cluster precision and characteristics. This thesis presents and discusses the development process of CIPHE. The feasibility of performing the exploratory tasks of CIPHE in a crowdsourcing environment was investigated, which resulted in insights on how to formulate instructions. Additionally, CIPHE was confirmed to identify characteristics other than the main theme, such as a negative emotional response.

CIPHE was paired with a standard clustering pipeline to evaluate its capabilities and limitations. The pipeline is widely applied for its adaptability and conceptual simplicity, and it is also part of the popular topic model BERTopic. The empirical results of applying CIPHE suggest that the pipeline, when integrated with a Transformer-based language model, generally yields coherent clusters.

Additionally, topic models have an aim similar to that of document clustering, namely to automate corpus processing and present the underlying themes to a human. Topic modeling has a rich body of research on the human interpretation of topic coherence. In the thesis, the human interpretation collected with CIPHE was related to established research in topic coherence. Specifically, it was used to highlight limitations of the keyword representations that topic coherence evaluation relies on.

Place, publisher, year, edition, pages
Umeå: Umeå University, 2025. p. 44
Series
Report / UMINF, ISSN 0348-0542 ; 25.03
Keywords
document clustering, topic modeling, information retrieval, human evaluation, human-in-the-loop, news clustering, natural language processing, topic coherence, human interpretation
National Category
Natural Language Processing
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-236295 (URN)
9789180706469 (ISBN)
9789180706476 (ISBN)
Public defence
2025-04-03, Lindellhallen 3 (UB.A.230), Samhällsvetarhuset, Umeå, 13:15 (English)
Opponent
Supervisors
Funder
Swedish Foundation for Strategic Research, ID19-0055
Available from: 2025-03-13. Created: 2025-03-10. Last updated: 2025-03-12. Bibliographically approved

Open Access in DiVA

fulltext (2210 kB), 35 downloads
File information
File name: FULLTEXT01.pdf
File size: 2210 kB
Checksum (SHA-512): 80e0e189738c3fde927f835e6d535a1e715f0f0d02968ae3a1d11952b7446ec0e4f85e217c17a72edd5ca6b6b3cf15c4dcda4a1d9f6abe11be56f6781fd2512e
Type: fulltext
Mimetype: application/pdf

Authority records

Eklund, Anton; Drewes, Frank

Search in DiVA

By author/editor
Eklund, Anton; Forsman, Mona; Drewes, Frank