Evaluating document clusters through human interpretation
2025 (English) Doctoral thesis, comprehensive summary (Other academic)
Alternative title
Utvärdering av dokumentkluster genom mänsklig tolkning (Swedish)
Abstract [en]
Document clustering is a technique for organizing and discovering patterns in large collections of text, often used in applications such as news aggregation and contextual advertising. An example is the automatic grouping of news articles by theme, which is the focus of this thesis. For a clustering to be successful, the resulting clusters typically need to appear interpretable and coherent to a human. However, there is a lack of efficient methods for reliably assessing the quality of a clustering in terms of human-perceived coherence, which is essential for ensuring its usefulness in real-world applications.
To address the lack of evaluation methods for document clustering that focus on human interpretation, we introduce Cluster Interpretation and Precision from Human Exploration (CIPHE). CIPHE tasks human evaluators with exploring document samples from a cluster and collects their interpretations. The interpretations are collected through a standardized survey and then processed with the framework's metrics to yield the cluster's precision and characteristics. This thesis presents and discusses the development process of CIPHE. The feasibility of performing the exploratory tasks of CIPHE in a crowdsourcing environment was investigated, which resulted in insights into how to formulate instructions. Additionally, CIPHE was confirmed to identify characteristics other than the main theme, such as a negative emotional response.
CIPHE was paired with a standard clustering pipeline to evaluate its capabilities and limitations. The pipeline is widely applied for its adaptability and conceptual simplicity, and it is also part of the popular topic model BERTopic. The empirical results of applying CIPHE suggest that the pipeline, when integrated with a Transformer-based language model, generally yields coherent clusters.
Additionally, topic models share an aim with document clustering: to automate corpus processing and present the underlying themes to a human. Topic modeling has a rich body of research on the human interpretation of topic coherence. In the thesis, the human interpretation collected with CIPHE was related to established research on topic coherence. Specifically, it was used to highlight limitations of the keyword representations that topic coherence evaluation relies on.
Place, publisher, year, edition, pages
Umeå: Umeå University, 2025. p. 44
Series
Report / UMINF, ISSN 0348-0542 ; 25.03
Keywords [en]
document clustering, topic modeling, information retrieval, human evaluation, human-in-the-loop, news clustering, natural language processing, topic coherence, human interpretation
National Category
Natural Language Processing
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:umu:diva-236295
ISBN: 9789180706469 (print)
ISBN: 9789180706476 (electronic)
OAI: oai:DiVA.org:umu-236295
DiVA, id: diva2:1943258
Public defence
2025-04-03, Lindellhallen 3 (UB.A.230), Samhällsvetarhuset, Umeå, 13:15 (English)
Opponent
Supervisors
Funder
Swedish Foundation for Strategic Research, ID19-0055
2025-03-13 2025-03-10 2025-03-12 Bibliographically approved
List of papers