Incremental Re-tokenization in BPE-trained SentencePiece Models
Umeå University, Faculty of Science and Technology, Department of Computing Science.
2024 (English). Independent thesis, basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis
Abstract [en]

This bachelor's thesis in Computer Science examines the efficiency of an incremental re-tokenization algorithm for BPE-trained SentencePiece models used in natural language processing (NLP). It first outlines the central role of tokenization in NLP and the complications that arise when already-tokenized text is modified. It then presents an incremental re-tokenization algorithm, describes its development, and evaluates its performance against full re-tokenization of the text. Experimental results show that the incremental approach is more time-efficient than full re-tokenization, with the advantage most evident on large text datasets. This efficiency is attributed to the algorithm's localized re-tokenization strategy, which limits processing to the regions of text around each modification. The thesis concludes that incremental re-tokenization could significantly improve the responsiveness and resource efficiency of text-based applications such as chatbots and virtual assistants. Future work may focus on predictive models that anticipate how text changes affect token stability, and on optimizing the algorithm for different text contexts.
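To make the idea of localized re-tokenization concrete, the following is a minimal sketch and not the algorithm from the thesis. It assumes a toy BPE tokenizer in which whitespace-delimited chunks are tokenized independently, so an edit can only change the tokens of the chunks it touches. The merge table `MERGES` and the helper names `bpe_word`, `tokenize`, and `retokenize_incremental` are hypothetical and do not come from SentencePiece or from the thesis.

```python
import re

# Hypothetical toy merge table, highest priority first (not from a real model).
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("n", "e"), ("ne", "w")]
RANK = {pair: i for i, pair in enumerate(MERGES)}
CHUNK = re.compile(r"\S+")  # assumption: tokens never cross whitespace


def bpe_word(word):
    """Apply BPE merges to one chunk, always merging the best-ranked pair."""
    parts = list(word)
    while len(parts) > 1:
        ranked = [(RANK.get(pair, float("inf")), i)
                  for i, pair in enumerate(zip(parts, parts[1:]))]
        rank, i = min(ranked)
        if rank == float("inf"):  # no known merge applies any more
            break
        parts[i:i + 2] = [parts[i] + parts[i + 1]]
    return parts


def tokenize(text, lo=0, hi=None):
    """Tokenize text[lo:hi]; return (start, end, token) spans in text offsets."""
    hi = len(text) if hi is None else hi
    spans = []
    for m in CHUNK.finditer(text, lo, hi):
        pos = m.start()
        for tok in bpe_word(m.group()):
            spans.append((pos, pos + len(tok), tok))
            pos += len(tok)
    return spans


def retokenize_incremental(old_text, old_spans, start, end, replacement):
    """Replace old_text[start:end] and re-tokenize only the affected chunks."""
    new_text = old_text[:start] + replacement + old_text[end:]
    delta = len(replacement) - (end - start)
    # Grow the window outwards to the nearest whitespace (or text boundary).
    ws, we = start, end
    while ws > 0 and not old_text[ws - 1].isspace():
        ws -= 1
    while we < len(old_text) and not old_text[we].isspace():
        we += 1
    before = [s for s in old_spans if s[1] <= ws]          # untouched prefix
    after = [(a + delta, b + delta, t)                      # shifted suffix
             for a, b, t in old_spans if a >= we]
    middle = tokenize(new_text, ws, we + delta)             # local re-run
    return new_text, before + middle + after


text = "low lower newer"
spans = tokenize(text)
# Replace "lower" (offsets 4..9) with "new low"; only that window is re-run.
new_text, new_spans = retokenize_incremental(text, spans, 4, 9, "new low")
assert new_spans == tokenize(new_text)
```

Under the stated whitespace assumption, the incremental result matches a full re-tokenization while processing only the window around the edit. A tokenizer whose merges can cross chunk boundaries would need a different rule for deciding how far the window must extend.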

Place, publisher, year, edition, pages
2024. p. 22
Series
UMNAD ; 1452
Keywords [en]
BPE, Byte Pair Encoding, SentencePiece, NLP, Natural Language Processing, Tokenization, Re-tokenization
National Category
Identifiers
URN: urn:nbn:se:umu:diva-221890
OAI: oai:DiVA.org:umu-221890
DiVA, id: diva2:1843124
Educational program
Bachelor of Science Programme in Computing Science
Supervisor
Examiner
Available from: 2024-03-08. Created: 2024-03-07. Last updated: 2025-02-07. Bibliographically approved

Open Access in DiVA

fulltext (1784 kB), 415 downloads
File information
File: FULLTEXT01.pdf. File size: 1784 kB. Checksum: SHA-512
4b3a4aee7555a34817eb8294bb2bd65c7325ab3097e5f4cadba865f0ffa53a0609045740eba887ed3499cade54d17aac1cdb5afa35bece57e33e4ea76398aaa0
Type: fulltext. Mimetype: application/pdf

Author/editor
Hellsten, Simon
Total: 415 downloads
The number of downloads is the sum of all downloads of all full texts. It may include, for example, earlier versions that are no longer available.

Total: 603 hits