Incremental Re-tokenization in BPE-trained SentencePiece Models
Umeå University, Faculty of Science and Technology, Department of Computing Science.
2024 (English). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
Abstract [en]

This bachelor's thesis in Computer Science examines the efficiency of an incremental re-tokenization algorithm for BPE-trained SentencePiece models used in natural language processing (NLP). It first underscores the critical role of tokenization in NLP, highlighting the complications that arise when already tokenized text is modified. It then presents an incremental re-tokenization algorithm, describes its development, and evaluates its performance against full re-tokenization of the text. Experimental results show that the incremental approach is more time-efficient than full re-tokenization, with the advantage most evident on large text datasets. This efficiency is attributed to the algorithm's localized re-tokenization strategy, which limits processing to the regions of text around each modification. The thesis concludes that incremental re-tokenization could significantly improve the responsiveness and resource efficiency of text-based applications such as chatbots and virtual assistants. Future work may focus on predictive models that anticipate how text changes affect token stability, and on optimizing the algorithm for different text contexts.
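The idea of limiting work to the area around a modification can be illustrated with a minimal Python sketch. This is not the algorithm from the thesis and it does not use the SentencePiece library. It is a toy BPE tokenizer with a hypothetical merge table (`MERGES`), and it rests on one simplifying assumption stated loudly here: tokens never span a space, so an edit can only change the tokens of the words it touches. The function names `bpe_word`, `tokenize`, and `retokenize_edit` are illustrative only.

```python
from typing import Dict, List, Tuple

# Hypothetical toy merge table (assumption): lower rank = merged earlier.
MERGES: Dict[Tuple[str, str], int] = {
    ("l", "o"): 0,
    ("lo", "w"): 1,
    ("e", "r"): 2,
    ("low", "er"): 3,
}


def bpe_word(word: str) -> List[str]:
    """Tokenize one word by repeatedly merging the lowest-ranked adjacent pair."""
    symbols = list(word)
    while len(symbols) > 1:
        ranked = [
            (MERGES.get((a, b), float("inf")), i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
        ]
        rank, i = min(ranked)  # ties on rank resolve to the leftmost pair
        if rank == float("inf"):
            break  # no known merge applies any more
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


def tokenize(text: str) -> List[List[str]]:
    """Full tokenization: one token list per space-separated word."""
    return [bpe_word(word) for word in text.split(" ")]


def retokenize_edit(
    text: str,
    tokens: List[List[str]],
    start: int,
    end: int,
    replacement: str,
) -> Tuple[str, List[List[str]]]:
    """Replace text[start:end] and re-tokenize only the words the edit touches.

    Assumes `tokens == tokenize(text)` and that tokens never cross a space.
    """
    # Expand the edited span outwards to the nearest word boundaries.
    word_start = text.rfind(" ", 0, start) + 1
    word_end = text.find(" ", end)
    if word_end == -1:
        word_end = len(text)

    # Indices of the first and last old words covered by the expanded span.
    first = text.count(" ", 0, start)
    last = first + text.count(" ", start, end)

    # Only this region is passed through BPE again.
    region = text[word_start:start] + replacement + text[end:word_end]
    new_text = text[:word_start] + region + text[word_end:]
    new_tokens = tokens[:first] + tokenize(region) + tokens[last + 1:]
    return new_text, new_tokens


if __name__ == "__main__":
    old_text = "low lower slow"
    old_tokens = tokenize(old_text)
    new_text, new_tokens = retokenize_edit(old_text, old_tokens, 4, 9, "slower low")
    print(new_text)
    print(new_tokens == tokenize(new_text))
```

In this sketch the cost of an edit depends on the length of the touched words rather than on the length of the whole text, which is the kind of locality the abstract describes. A real SentencePiece model may need a wider or differently determined window, so the word-boundary rule used here should be read as an assumption of the toy setup only.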

Place, publisher, year, edition, pages
2024. p. 22
Series
UMNAD ; 1452
Keywords [en]
BPE, Byte Pair Encoding, SentencePiece, NLP, Natural Language Processing, Tokenization, Re-tokenization
National Category
Natural Language Processing
Identifiers
URN: urn:nbn:se:umu:diva-221890
OAI: oai:DiVA.org:umu-221890
DiVA, id: diva2:1843124
Educational program
Bachelor of Science Programme in Computing Science
Available from: 2024-03-08. Created: 2024-03-07. Last updated: 2025-02-07. Bibliographically approved

Author
Hellsten, Simon