An empirical configuration study of a common document clustering pipeline
Umeå University, Faculty of Science and Technology, Department of Computing Science; Adlede, Umeå, Sweden. ORCID iD: 0000-0002-4366-7863
Adlede, Umeå, Sweden. ORCID iD: 0000-0001-6601-5190
Umeå University, Faculty of Science and Technology, Department of Computing Science. ORCID iD: 0000-0001-7349-7693
2023 (English). In: Northern European Journal of Language Technology (NEJLT), ISSN 2000-1533, Vol. 9, no. 1. Journal article (peer-reviewed). Published.
Abstract [en]

Document clustering is frequently used in applications of natural language processing, e.g. to classify news articles or create topic models. In this paper, we study document clustering with the common clustering pipeline that includes vectorization with BERT or Doc2Vec, dimension reduction with PCA or UMAP, and clustering with K-Means or HDBSCAN. We discuss the interactions of the different components in the pipeline, parameter settings, and how to determine an appropriate number of dimensions. The results suggest that BERT embeddings combined with UMAP dimension reduction to no fewer than 15 dimensions provide a good basis for clustering, regardless of the specific clustering algorithm used. Moreover, while UMAP performed better than PCA in our experiments, tuning the UMAP settings had little impact on the overall performance. Hence, we recommend configuring UMAP so as to optimize its time efficiency. According to our topic model evaluation, the combination of BERT and UMAP, also used in BERTopic, performs best. A topic model based on this pipeline typically benefits from a large number of clusters.
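The configuration space described in the abstract can be sketched as a small grid over the three pipeline stages. This is only an illustration, not the authors' experimental code: the component names come from the abstract, while the `TARGET_DIMS` values, the `PipelineConfig` class, and the helper functions are hypothetical names chosen here. The filter encodes the abstract's stated recommendation (BERT embeddings, UMAP reduction, at least 15 dimensions).

```python
from dataclasses import dataclass
from itertools import product

# Components named in the abstract.
VECTORIZERS = ("BERT", "Doc2Vec")
REDUCERS = ("PCA", "UMAP")
CLUSTERERS = ("K-Means", "HDBSCAN")

# Hypothetical grid of target dimensions, chosen for illustration only;
# the abstract does not list the values that were actually tested.
TARGET_DIMS = (2, 5, 15, 50)


@dataclass(frozen=True)
class PipelineConfig:
    """One point in the configuration space (hypothetical helper type)."""
    vectorizer: str
    reducer: str
    clusterer: str
    n_dims: int


def enumerate_configs(dims=TARGET_DIMS):
    """Return every vectorizer/reducer/clusterer/dimension combination."""
    return [
        PipelineConfig(v, r, c, d)
        for v, r, c, d in product(VECTORIZERS, REDUCERS, CLUSTERERS, dims)
    ]


def follows_abstract_recommendation(cfg):
    """True for BERT + UMAP with at least 15 dimensions, any clusterer."""
    return cfg.vectorizer == "BERT" and cfg.reducer == "UMAP" and cfg.n_dims >= 15


configs = enumerate_configs()
recommended = [cfg for cfg in configs if follows_abstract_recommendation(cfg)]

for cfg in recommended:
    print(cfg)
```

In a real study, each `PipelineConfig` would be mapped to concrete library objects and scored with a clustering metric; that mapping is deliberately left out here because the abstract does not specify it.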

Place, publisher, year, edition, pages
Linköping University Electronic Press, 2023. Vol. 9, no. 1
Keywords [en]
document clustering, topic modeling, dimension reduction, clustering, BERT, doc2vec, UMAP, PCA, K-Means, HDBSCAN
Identifiers
URN: urn:nbn:se:umu:diva-214455
DOI: 10.3384/nejlt.2000-1533.2023.4396
OAI: oai:DiVA.org:umu-214455
DiVA, id: diva2:1797692
Research funder
Swedish Foundation for Strategic Research, ID19-0055
Available from: 2023-09-15. Created: 2023-09-15. Last updated: 2023-09-15. Bibliographically checked.

Open Access in DiVA

Full text (2698 kB), 156 downloads
File information
File: FULLTEXT01.pdf. File size: 2698 kB. Checksum (SHA-512):
ece569a5bf1ec48d0ca8a50d3ae983ab9adb8b1397fbb7f20a34eba36d8976d53e15c38dd449d953b36b551ceca09ed1c22b4c59d8d12d44ba19e4f6bbd66bdb
Type: fulltext. MIME type: application/pdf

Authors
Eklund, Anton; Forsman, Mona; Drewes, Frank