Umeå University's logo

umu.sePublikasjoner
Endre søk
RefereraExporteraLink to record
Permanent link

Direct link
Referera
Referensformat
  • apa
  • ieee
  • vancouver
  • Annet format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annet språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf
Byte Pair Encoding Dictionary Equivalence
Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap.
2025 (engelsk)Independent thesis Basic level (degree of Bachelor), 10 poäng / 15 hpOppgave
Abstract [en]

This thesis proposes a rewrite procedure for determining the equivalence of a subset of byte-pair encoding (BPE) dictionaries, a data structure essential to the tokenization process in modern large language models. We present a normalization procedure that constructs a normal form for proper BPE dictionaries, where a dictionary is proper if every token in a merge rule of length greater than one has been created by a higher-priority rule. We prove that two proper dictionaries containing the same rules are equivalent if and only if their normal forms are identical. The existing approach determines dictionary equivalence by transforming dictionaries into deterministic finite automata (DFAs) and checking for DFA equivalence. This approach is computationally expensive for large vocabularies, where the performance depends on the size of the token alphabet. In contrast, the new normalization procedure is more efficient in practice, since the time complexity of the normalization procedure depends on the length of the tokens instead, which are typically considerably smaller than the size of the token alphabet. This results in a more scalable solution for equivalence checking in real-world scenarios.

sted, utgiver, år, opplag, sider
2025.
Serie
UMNAD ; 1562
Emneord [en]
Byte Pair, Byte Pair Encoding, Byte Pair Dictionary, Byte Pair Encoding Dictionary, BPE, BPE Dictionary, Large Language Model, LLM, Dictionary Equivalence, Equivalence, Normal Form, Tokenization, Proper Dictionary, Proper Dictionaries, Normalization, Normalization procedure, Natural Language Processing, NLP
HSV kategori
Identifikatorer
URN: urn:nbn:se:umu:diva-240756OAI: oai:DiVA.org:umu-240756DiVA, id: diva2:1973545
Utdanningsprogram
Bachelor of Science Programme in Computing Science
Presentation
2025-06-03, NAT.D.440, Universitetsvägen, Umeå, 09:20 (engelsk)
Veileder
Examiner
Tilgjengelig fra: 2025-06-23 Laget: 2025-06-19 Sist oppdatert: 2025-06-23bibliografisk kontrollert

Open Access i DiVA

bilaga(2424 kB)122 nedlastinger
Filinformasjon
Fil ATTACHMENT01.pdfFilstørrelse 2424 kBChecksum SHA-512
5a7d2cad5cff2553ece4755d4a31c56d473ac94bd7067fcb96ddff69dd8d9fb5856a6ce3a921449814fcab05aecf2964c4e7ec00060bdcf41be7bf2773e5052c
Type attachmentMimetype application/pdf

Søk i DiVA

Av forfatter/redaktør
Bergmark, Anton
Av organisasjonen

Søk utenfor DiVA

GoogleGoogle Scholar
Antall nedlastinger er summen av alle nedlastinger av alle fulltekster. Det kan for eksempel være tidligere versjoner som er ikke lenger tilgjengelige

urn-nbn

Altmetric

urn-nbn
Totalt: 530 treff
RefereraExporteraLink to record
Permanent link

Direct link
Referera
Referensformat
  • apa
  • ieee
  • vancouver
  • Annet format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annet språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf