An empirical configuration study of a common document clustering pipeline
Umeå University, Faculty of Science and Technology, Department of Computing Science; Adlede, Umeå, Sweden. ORCID iD: 0000-0002-4366-7863
Adlede, Umeå, Sweden. ORCID iD: 0000-0001-6601-5190
Umeå University, Faculty of Science and Technology, Department of Computing Science. ORCID iD: 0000-0001-7349-7693
2023 (English). In: Northern European Journal of Language Technology (NEJLT), ISSN 2000-1533, Vol. 9, no. 1. Journal article (peer-reviewed). Published.
Abstract [en]

Document clustering is frequently used in applications of natural language processing, e.g., to classify news articles or create topic models. In this paper, we study document clustering with the common clustering pipeline that includes vectorization with BERT or Doc2Vec, dimension reduction with PCA or UMAP, and clustering with K-Means or HDBSCAN. We discuss the interactions of the different components in the pipeline, parameter settings, and how to determine an appropriate number of dimensions. The results suggest that BERT embeddings combined with UMAP dimension reduction to no fewer than 15 dimensions provide a good basis for clustering, regardless of the specific clustering algorithm used. Moreover, while UMAP performed better than PCA in our experiments, tuning the UMAP settings had little impact on the overall performance. Hence, we recommend configuring UMAP so as to optimize its time efficiency. According to our topic model evaluation, the combination of BERT and UMAP, also used in BERTopic, performs best. A topic model based on this pipeline typically benefits from a large number of clusters.
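
The configuration space named in the abstract can be sketched as a small grid. The following is a minimal illustration, not the authors' experimental code: the component names come from the abstract, while the `PipelineConfig` class, the helper functions, and the candidate dimension values other than 15 are assumptions made here for the example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class PipelineConfig:
    """One point in the configuration grid (hypothetical helper type)."""
    vectorizer: str
    reducer: str
    clusterer: str
    n_dims: int


# Component choices listed in the abstract.
VECTORIZERS = ("BERT", "Doc2Vec")
REDUCERS = ("PCA", "UMAP")
CLUSTERERS = ("K-Means", "HDBSCAN")

# Illustrative target dimensionalities; only the threshold of 15 is
# taken from the abstract, the other values are assumptions.
N_DIMS = (5, 15, 50)


def all_configs():
    """Enumerate every combination of vectorizer, reducer, clusterer, dims."""
    return [
        PipelineConfig(v, r, c, d)
        for v, r, c, d in product(VECTORIZERS, REDUCERS, CLUSTERERS, N_DIMS)
    ]


def matches_recommendation(cfg):
    """True if cfg matches the abstract's suggestion: BERT embeddings,
    UMAP reduction to at least 15 dimensions, any clustering algorithm."""
    return cfg.vectorizer == "BERT" and cfg.reducer == "UMAP" and cfg.n_dims >= 15


configs = all_configs()
recommended = [cfg for cfg in configs if matches_recommendation(cfg)]

for cfg in recommended:
    print(cfg)
```

Each printed configuration would then be mapped to concrete library components and evaluated on a corpus; that step is omitted here because the record does not specify the implementation.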

Place, publisher, year, edition, pages
Linköping University Electronic Press, 2023. Vol. 9, no. 1
Keywords [en]
document clustering, topic modeling, dimension reduction, clustering, BERT, doc2vec, UMAP, PCA, K-Means, HDBSCAN
National subject category
Language Processing and Computational Linguistics
Identifiers
URN: urn:nbn:se:umu:diva-214455
DOI: 10.3384/nejlt.2000-1533.2023.4396
OAI: oai:DiVA.org:umu-214455
DiVA, id: diva2:1797692
Funder
Stiftelsen för strategisk forskning (SSF), ID19-0055
Available from: 2023-09-15 Created: 2023-09-15 Last updated: 2025-03-10. Bibliographically approved
Part of thesis
1. Evaluating document clusters through human interpretation
2025 (English). Doctoral thesis, comprising papers (Other academic)
Alternative title [sv]
Utvärdering av dokumentkluster genom mänsklig tolkning
Abstract [en]

Document clustering is a technique for organizing and discovering patterns in large collections of text, often used in applications such as news aggregation and contextual advertising. An example is the automatic grouping of news articles by theme, which is the focus of this thesis. For a clustering to be successful, typically the resulting clusters need to appear interpretable and coherent to a human. However, there is a lack of efficient methods to reliably assess the quality of a clustering in terms of human-perceived coherence, which is essential for ensuring its usefulness in real-world applications.

To address the lack of evaluation methods for document clustering that focus on human interpretation, we introduced Cluster Interpretation and Precision from Human Exploration (CIPHE). CIPHE tasks human evaluators with exploring document samples from a cluster and collects their interpretations. The interpretations are collected through a standardized survey and then processed with the framework metrics to yield the cluster precision and characteristics. This thesis presents and discusses the development process of CIPHE. The feasibility of performing the exploratory tasks of CIPHE in a crowdsourcing environment was investigated, which resulted in insights on how to formulate instructions. Additionally, CIPHE was confirmed to identify characteristics other than the main theme, such as a negative emotional response.

CIPHE was paired with a standard clustering pipeline to evaluate its capabilities and limitations. The pipeline is widely applied for its adaptability and conceptual simplicity, and it is also part of the popular topic model BERTopic. The empirical results of applying CIPHE suggest that the pipeline, when integrated with a Transformer-based language model, generally yields coherent clusters.

Additionally, topic models share an aim with document clustering: to automate corpus processing and present the underlying themes to a human. There is a rich body of research on the human interpretation of topic coherence in topic modeling. In the thesis, the human interpretation collected with CIPHE was related to established research on topic coherence. Specifically, it was used to highlight limitations of the keyword representations that topic coherence evaluation relies on.

Place, publisher, year, edition, pages
Umeå: Umeå University, 2025. p. 44
Series
Report / UMINF, ISSN 0348-0542 ; 25.03
Keywords
document clustering, topic modeling, information retrieval, human evaluation, human-in-the-loop, news clustering, natural language processing, topic coherence, human interpretation
National subject category
Language Processing and Computational Linguistics
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-236295 (URN)
9789180706469 (ISBN)
9789180706476 (ISBN)
Public defence
2025-04-03, Lindellhallen 3 (UB.A.230), Samhällsvetarhuset, Umeå, 13:15 (English)
Opponent
Supervisors
Funder
Stiftelsen för strategisk forskning (SSF), ID19-0055
Available from: 2025-03-13 Created: 2025-03-10 Last updated: 2025-03-12. Bibliographically approved

Open Access in DiVA

Full text (2698 kB), 477 downloads
File information
File name: FULLTEXT01.pdf. File size: 2698 kB. Checksum: SHA-512
ece569a5bf1ec48d0ca8a50d3ae983ab9adb8b1397fbb7f20a34eba36d8976d53e15c38dd449d953b36b551ceca09ed1c22b4c59d8d12d44ba19e4f6bbd66bdb
Type: fulltext. MIME type: application/pdf


Person

Eklund, Anton; Drewes, Frank

Search further in DiVA

By author/editor
Eklund, Anton; Forsman, Mona; Drewes, Frank