Umeå University's logo

umu.sePublications
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Byte Pair Encoding Dictionary Equivalence
Umeå University, Faculty of Science and Technology, Department of Computing Science.
2025 (English)Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE creditsStudent thesis
Abstract [en]

This thesis proposes a rewrite procedure for determining the equivalence of a subset of byte-pair encoding (BPE) dictionaries, a data structure essential to the tokenization process in modern large language models. We present a normalization procedure that constructs a normal form for proper BPE dictionaries, where a dictionary is proper if every token in a merge rule of length greater than one has been created by a higher-priority rule. We prove that two proper dictionaries containing the same rules are equivalent if and only if their normal forms are identical. The existing approach determines dictionary equivalence by transforming dictionaries into deterministic finite automata (DFAs) and checking for DFA equivalence. This approach is computationally expensive for large vocabularies, where the performance depends on the size of the token alphabet. In contrast, the new normalization procedure is more efficient in practice, since the time complexity of the normalization procedure depends on the length of the tokens instead, which are typically considerably smaller than the size of the token alphabet. This results in a more scalable solution for equivalence checking in real-world scenarios.

Place, publisher, year, edition, pages
2025.
Series
UMNAD ; 1562
Keywords [en]
Byte Pair, Byte Pair Encoding, Byte Pair Dictionary, Byte Pair Encoding Dictionary, BPE, BPE Dictionary, Large Language Model, LLM, Dictionary Equivalence, Equivalence, Normal Form, Tokenization, Proper Dictionary, Proper Dictionaries, Normalization, Normalization procedure, Natural Language Processing, NLP
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:umu:diva-240756OAI: oai:DiVA.org:umu-240756DiVA, id: diva2:1973545
Educational program
Bachelor of Science Programme in Computing Science
Presentation
2025-06-03, NAT.D.440, Universitetsvägen, Umeå, 09:20 (English)
Supervisors
Examiners
Available from: 2025-06-23 Created: 2025-06-19 Last updated: 2025-06-23Bibliographically approved

Open Access in DiVA

bilaga(2424 kB)122 downloads
File information
File name ATTACHMENT01.pdfFile size 2424 kBChecksum SHA-512
5a7d2cad5cff2553ece4755d4a31c56d473ac94bd7067fcb96ddff69dd8d9fb5856a6ce3a921449814fcab05aecf2964c4e7ec00060bdcf41be7bf2773e5052c
Type attachmentMimetype application/pdf

Search in DiVA

By author/editor
Bergmark, Anton
By organisation
Department of Computing Science
Computer Sciences

Search outside of DiVA

GoogleGoogle Scholar
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

urn-nbn

Altmetric score

urn-nbn
Total: 530 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf