Constructing a BPE Tokenization DFA
Umeå University, Faculty of Science and Technology, Department of Computing Science. ORCID iD: 0000-0002-3692-6994
Umeå University, Faculty of Science and Technology, Department of Computing Science.
Department of Computer Science, Stellenbosch University, Stellenbosch, South Africa.
2024 (English). In: Implementation and Application of Automata: 28th International Conference, CIAA 2024, Akita, Japan, September 3–6, 2024, Proceedings / [ed] Szilárd Zsolt Fazekas, Springer, 2024, Vol. 15015, p. 66-78
Conference paper, Published paper (Refereed)
Abstract [en]

Many natural language processing systems operate over tokenizations of text to address the open-vocabulary problem. In this paper, we give and analyze an algorithm for the efficient construction of deterministic finite automata (DFA) designed to operate directly on tokenizations produced by the popular byte pair encoding (BPE) technique. This makes it possible to apply many existing techniques and algorithms to the tokenized case, such as pattern matching, equivalence checking of tokenization dictionaries, and composing tokenized languages in various ways.

Place, publisher, year, edition, pages
Springer, 2024. Vol. 15015, p. 66-78
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 15015
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:umu:diva-229437
DOI: 10.1007/978-3-031-71112-1_5
Scopus ID: 2-s2.0-85204399907
ISBN: 978-3-031-71111-4 (print)
ISBN: 978-3-031-71112-1 (electronic)
OAI: oai:DiVA.org:umu-229437
DiVA, id: diva2:1896031
Conference
Implementation and Application of Automata, CIAA 2024, Akita, Japan, September 3-6, 2024
Available from: 2024-09-09 Created: 2024-09-09 Last updated: 2024-09-27 Bibliographically approved

Authority records

Berglund, Martin; Martens, Willeke
