2025 (English) In: Journal of Data Mining and Digital Humanities, E-ISSN 2416-5999, Vol. NLP4DH, article id 32. Article in journal (Refereed) Published
Abstract [en] A frequent problem in document clustering and topic modeling is the lack of ground truth. Models are typically intended to reflect some aspect of how human readers view texts (the general theme, sentiment, emotional response, etc.), but it can be difficult to assess whether they actually do. The only real ground truth is human judgement. To enable researchers and practitioners to collect such judgement in a cost-efficient, standardized way, we have developed the crowdsourcing solution CIPHE -- Cluster Interpretation and Precision from Human Exploration. CIPHE is an adaptable framework that systematically gathers and evaluates data on the human perception of a set of document clusters by having participants read sample texts from each cluster. In this article, we use CIPHE to study the limitations that keyword-based methods pose in topic modeling coherence evaluation. Keyword methods, including word intrusion, are compared with the outcome of the more thorough CIPHE on scoring and characterizing clusters. The results show that the abstraction of keywords skews the cluster interpretation for almost half of the compared instances, meaning that many important cluster characteristics are missed. Further, we present a case study where CIPHE is used to (a) provide insights into the UK news domain and (b) find out how the evaluated clustering model should be tuned to better suit the intended application. The experiments provide evidence that CIPHE characterizes clusters in a predictable manner and has the potential to be a valuable framework for using human evaluation in the pursuit of nuanced research aims.
Place, publisher, year, edition, pages
Centre pour la Communication Scientifique Directe (CCSD), 2025
Keywords document clustering, topic modeling, topic modeling evaluation, news clustering, topic coherence, human evaluation methods, crowdsourced cluster validation, BERTopic, CIPHE
National Category
Natural Language Processing
Research subject
Computer Science
Identifiers urn:nbn:se:umu:diva-236229 (URN) 10.46298/jdmdh.15044 (DOI)
Note The code for the CIPHE platform is uploaded at https://github.com/antoneklund/CIPHE/
2025-03-09 2025-03-09 2025-03-11 Bibliographically approved