Byte Pair Encoding Dictionary Equivalence
2025 (English)Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits
Student thesis
Abstract [en]
This thesis proposes a rewrite procedure for determining the equivalence of a subset of byte-pair encoding (BPE) dictionaries, a data structure essential to the tokenization process in modern large language models. We present a normalization procedure that constructs a normal form for proper BPE dictionaries, where a dictionary is proper if every token in a merge rule of length greater than one has been created by a higher-priority rule. We prove that two proper dictionaries containing the same rules are equivalent if and only if their normal forms are identical. The existing approach determines dictionary equivalence by transforming dictionaries into deterministic finite automata (DFAs) and checking for DFA equivalence. This approach is computationally expensive for large vocabularies, where the performance depends on the size of the token alphabet. In contrast, the new normalization procedure is more efficient in practice, since the time complexity of the normalization procedure depends on the length of the tokens instead, which are typically considerably smaller than the size of the token alphabet. This results in a more scalable solution for equivalence checking in real-world scenarios.
Place, publisher, year, edition, pages
2025.
Series
UMNAD ; 1562
Keywords [en]
Byte Pair, Byte Pair Encoding, Byte Pair Dictionary, Byte Pair Encoding Dictionary, BPE, BPE Dictionary, Large Language Model, LLM, Dictionary Equivalence, Equivalence, Normal Form, Tokenization, Proper Dictionary, Proper Dictionaries, Normalization, Normalization procedure, Natural Language Processing, NLP
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:umu:diva-240756OAI: oai:DiVA.org:umu-240756DiVA, id: diva2:1973545
Educational program
Bachelor of Science Programme in Computing Science
Presentation
2025-06-03, NAT.D.440, Universitetsvägen, Umeå, 09:20 (English)
Supervisors
Examiners
2025-06-232025-06-192025-06-23Bibliographically approved