Formalizing BPE Tokenization
Umeå University, Faculty of Science and Technology, Department of Computing Science. ORCID iD: 0000-0002-3692-6994
Department of Computer Science, Stellenbosch University, Stellenbosch, South Africa.
2023 (English). In: 13th International Workshop on Non-Classical Models of Automata and Applications (NCMA 2023) / [ed] Nagy B., Freund R., Open Publishing Association, 2023, p. 16-27. Conference paper, Published paper (Refereed)
Abstract [en]

In this paper, we formalize practical byte pair encoding (BPE) tokenization as it is used in large language models and other NLP systems. In particular, we formally define and investigate the semantics of the SentencePiece and HuggingFace tokenizers, and how they relate to each other depending on how the tokenization rules are constructed. Beyond this, we consider how tokenization can be performed incrementally, as well as left-to-right using an amount of memory that is constant in the length of the string, which enables, for example, the use of a finite-state string-to-string transducer.

Place, publisher, year, edition, pages
Open Publishing Association, 2023. p. 16-27
Series
Electronic Proceedings in Theoretical Computer Science, EPTCS, ISSN 2075-2180
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:umu:diva-215226
DOI: 10.4204/EPTCS.388.4
Scopus ID: 2-s2.0-85173029200
OAI: oai:DiVA.org:umu-215226
DiVA, id: diva2:1805813
Conference
13th International Workshop on Non-Classical Models of Automata and Applications, NCMA 2023, 18-19 September, 2023, Famagusta, Cyprus
Available from: 2023-10-18 Created: 2023-10-18 Last updated: 2023-10-18 Bibliographically approved

Authority records

Berglund, Martin
