Byte Pair Encoding Dictionary Equivalence
2025 (Engelska)Självständigt arbete på grundnivå (kandidatexamen), 10 poäng / 15 hp
Studentuppsats (Examensarbete)
Abstract [en]
This thesis proposes a rewrite procedure for determining the equivalence of a subset of byte-pair encoding (BPE) dictionaries, a data structure essential to the tokenization process in modern large language models. We present a normalization procedure that constructs a normal form for proper BPE dictionaries, where a dictionary is proper if every token in a merge rule of length greater than one has been created by a higher-priority rule. We prove that two proper dictionaries containing the same rules are equivalent if and only if their normal forms are identical. The existing approach determines dictionary equivalence by transforming dictionaries into deterministic finite automata (DFAs) and checking for DFA equivalence. This approach is computationally expensive for large vocabularies, where the performance depends on the size of the token alphabet. In contrast, the new normalization procedure is more efficient in practice, since the time complexity of the normalization procedure depends on the length of the tokens instead, which are typically considerably smaller than the size of the token alphabet. This results in a more scalable solution for equivalence checking in real-world scenarios.
Ort, förlag, år, upplaga, sidor
2025.
Serie
UMNAD ; 1562
Nyckelord [en]
Byte Pair, Byte Pair Encoding, Byte Pair Dictionary, Byte Pair Encoding Dictionary, BPE, BPE Dictionary, Large Language Model, LLM, Dictionary Equivalence, Equivalence, Normal Form, Tokenization, Proper Dictionary, Proper Dictionaries, Normalization, Normalization procedure, Natural Language Processing, NLP
Nationell ämneskategori
Datavetenskap (datalogi)
Identifikatorer
URN: urn:nbn:se:umu:diva-240756OAI: oai:DiVA.org:umu-240756DiVA, id: diva2:1973545
Utbildningsprogram
Kandidatprogrammet i Datavetenskap
Presentation
2025-06-03, NAT.D.440, Universitetsvägen, Umeå, 09:20 (Engelska)
Handledare
Examinatorer
2025-06-232025-06-192025-06-23Bibliografiskt granskad