Umeå University's logo

umu.sePublications
Planned maintenance
A system upgrade is planned for 24/9-2024, at 12:00-14:00. During this time DiVA will be unavailable.
Change search
Link to record
Permanent link

Direct link
Publications (10 of 29) Show all publications
Berglund, M., Martens, W. & van der Merwe, B. (2024). Constructing a BPE Tokenization DFA. In: Szilárd Zsolt Fazekas (Ed.), Implementation and Application of Automata: 28th International Conference, CIAA 2024, Akita, Japan, September 3–6, 2024, Proceedings. Paper presented at Implementation and Application of Automata, CIAA 2024, Akita, Japan, September 3-6, 2024 (pp. 66-78). Springer, 15015
Open this publication in new window or tab >>Constructing a BPE Tokenization DFA
2024 (English)In: Implementation and Application of Automata: 28th International Conference, CIAA 2024, Akita, Japan, September 3–6, 2024, Proceedings / [ed] Szilárd Zsolt Fazekas, Springer, 2024, Vol. 15015, p. 66-78Conference paper, Published paper (Refereed)
Abstract [en]

Many natural language processing systems operate over tokenizations of text to address the open-vocabulary problem. In this paper, we give and analyze an algorithm for the efficient construction of deterministic finite automata (DFA) designed to operate directly on tokenizations produced by the popular byte pair encoding (BPE) technique. This makes it possible to apply many existing techniques and algorithms to the tokenized case, such as pattern matching, equivalence checking of tokenization dictionaries, and composing tokenized languages in various ways.

Place, publisher, year, edition, pages
Springer, 2024
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 15015
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-229437 (URN)10.1007/978-3-031-71112-1_5 (DOI)978-3-031-71111-4 (ISBN)978-3-031-71112-1 (ISBN)
Conference
Implementation and Application of Automata, CIAA 2024, Akita, Japan, September 3-6, 2024
Available from: 2024-09-09 Created: 2024-09-09 Last updated: 2024-09-12Bibliographically approved
Berglund, M., Björklund, H. & Björklund, J. (2024). Parsing unranked tree languages, folded once. Algorithms, 17(6), Article ID 268.
Open this publication in new window or tab >>Parsing unranked tree languages, folded once
2024 (English)In: Algorithms, E-ISSN 1999-4893, Vol. 17, no 6, article id 268Article in journal (Refereed) Published
Abstract [en]

A regular unranked tree folding consists of a regular unranked tree language and a folding operation that merges (i.e., folds) selected nodes of a tree to form a graph; the combination is a formal device for representing graph languages. If, in the process of folding, the order among edges is discarded so that the result is an unordered graph, then two applications of a fold operation are enough to make the associated parsing problem NP-complete. However, if the order is kept, then the problem is solvable in non-uniform polynomial time. In this paper, we address the remaining case, where only one fold operation is applied, but the order among the edges is discarded. We show that, under these conditions, the problem is solvable in non-uniform polynomial time.

Place, publisher, year, edition, pages
MDPI, 2024
Keywords
graphs, transducers, trees, vector addition systems
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-227569 (URN)10.3390/a17060268 (DOI)2-s2.0-85196886791 (Scopus ID)
Funder
Swedish Research Council, 2020-03852Wallenberg AI, Autonomous Systems and Software Program (WASP)Knut and Alice Wallenberg Foundation
Note

This paper is an extended version of a paper published in International Symposium on Fundamentals of Computation Theory, Trier, Germany, 18–21 September.

Available from: 2024-07-02 Created: 2024-07-02 Last updated: 2024-07-02Bibliographically approved
Berglund, M. & van der Merwe, B. (2023). Formalizing BPE Tokenization. In: Nagy B., Freund R. (Ed.), 13th International Workshop onNon-Classical Models of Automata and Applications (NCMA 2023): . Paper presented at 13th International Workshop on Non-Classical Models of Automata and Applications, NCMA 2023, 18-19 September, 2023, Famagusta, Cyprus (pp. 16-27). Open Publishing Association
Open this publication in new window or tab >>Formalizing BPE Tokenization
2023 (English)In: 13th International Workshop onNon-Classical Models of Automata and Applications (NCMA 2023) / [ed] Nagy B., Freund R., Open Publishing Association , 2023, p. 16-27Conference paper, Published paper (Refereed)
Abstract [en]

In this paper, we formalize practical byte pair encoding tokenization as it is used in large language models and other NLP systems, in particular we formally define and investigate the semantics of the SentencePiece and HuggingFace tokenizers, in particular how they relate to each other, depending on how the tokenization rules are constructed. Beyond this we consider how tokenization can be performed in an incremental fashion, as well as doing it left-to-right using an amount of memory constant in the length of the string, enabling e.g. using a finite state string-to-string transducer.

Place, publisher, year, edition, pages
Open Publishing Association, 2023
Series
Electronic Proceedings in Theoretical Computer Science, EPTCS, ISSN 2075-2180
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-215226 (URN)10.4204/EPTCS.388.4 (DOI)2-s2.0-85173029200 (Scopus ID)
Conference
13th International Workshop on Non-Classical Models of Automata and Applications, NCMA 2023, 18-19 September, 2023, Famagusta, Cyprus
Available from: 2023-10-18 Created: 2023-10-18 Last updated: 2023-10-18Bibliographically approved
Berglund, M., Björklund, H. & Björklund, J. (2023). Parsing unranked tree languages, folded once. In: Henning Fernau; Klaus Jansen (Ed.), Fundamentals of computation theory: 24th International Symposium, FCT 2023, Trier, Germany, September 18–21, 2023, Proceedings. Paper presented at 24th International Symposium on Fundamentals of Computation Theory, FCT 2023, Trier, Germany, September 18–21, 2023 (pp. 60-73). Springer Nature
Open this publication in new window or tab >>Parsing unranked tree languages, folded once
2023 (English)In: Fundamentals of computation theory: 24th International Symposium, FCT 2023, Trier, Germany, September 18–21, 2023, Proceedings / [ed] Henning Fernau; Klaus Jansen, Springer Nature, 2023, p. 60-73Conference paper, Published paper (Refereed)
Abstract [en]

A regular unranked tree folding consists of a regular unranked tree language and a folding operation that merges, i.e., folds, selected nodes of a tree to form a graph; the combination is a formal device for representing graph languages. If, in the process of folding, the order among edges is discarded so that the result is an unordered graph, then two applications of a fold operation is enough to make the associated parsing problem NP-complete. However, if the order is kept, then the problem is solvable in non-uniform polynomial time. In this paper we address the remaining case where only one fold operation is applied, but the order among edges is discarded. We show that under these conditions, the problem is solvable in non-uniform polynomial time.

Place, publisher, year, edition, pages
Springer Nature, 2023
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 14292
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-215936 (URN)10.1007/978-3-031-43587-4_5 (DOI)2-s2.0-85174590997 (Scopus ID)9783031435867 (ISBN)
Conference
24th International Symposium on Fundamentals of Computation Theory, FCT 2023, Trier, Germany, September 18–21, 2023
Available from: 2023-11-02 Created: 2023-11-02 Last updated: 2023-11-02Bibliographically approved
Berglund, M. & van der Merwe, B. (2023). Re-examining regular expressions with backreferences. Theoretical Computer Science, 940, 66-80
Open this publication in new window or tab >>Re-examining regular expressions with backreferences
2023 (English)In: Theoretical Computer Science, ISSN 0304-3975, E-ISSN 1879-2294, Vol. 940, p. 66-80Article in journal (Refereed) Published
Abstract [en]

Most modern regular expression matching libraries (one of the rare exceptions being Google's RE2) allow backreferences, operations which bind a substring to a variable, allowing it to be matched again verbatim. However, both real-world implementations and definitions in the literature use different syntactic restrictions and have differences in the semantics of the matching of backreferences. Our aim is to compare these various flavors by considering the classes of formal languages that each can describe, establishing, as a result, a hierarchy of language classes. Beyond the hierarchy itself, some complexity results are given, and as part of the effort on comparing language classes new pumping lemmas are established, old classes are extended to new ones, and several incidental results on the nature of these language classes are given.

Place, publisher, year, edition, pages
Elsevier, 2023
Keywords
Regular expressions, Backreferences
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-200951 (URN)10.1016/j.tcs.2022.10.041 (DOI)2-s2.0-85141509500 (Scopus ID)
Available from: 2022-11-11 Created: 2022-11-11 Last updated: 2023-04-24Bibliographically approved
Berglund, M., Björklund, H., Björklund, J. & Boiret, A. (2023). Transduction from trees to graphs through folding. Information and Computation, 295, Article ID 105111.
Open this publication in new window or tab >>Transduction from trees to graphs through folding
2023 (English)In: Information and Computation, ISSN 0890-5401, E-ISSN 1090-2651, Vol. 295, article id 105111Article in journal (Refereed) Published
Abstract [en]

We introduce a fold operation that realises a tree-to-graph transduction by merging selected nodes in the input tree to form a possibly cyclic output graph. The work is motivated by the increasing use of graph-based representations in semantic parsing. We show that a suitable class of graphs languages can be generated by applying the fold operation to regular unranked tree languages. We investigate two versions of the fold operation, one that preserves a depth-first ordering between the edges, and one that does not. Finally, we demonstrate that the time complexity for the associated non-uniform membership problem is solvable in polynomial time for the order-preserving version, and NP-complete for the order-cancelling one.

Place, publisher, year, edition, pages
Elsevier, 2023
Keywords
Graphs, Semantic representations, Tranducers, Trees
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-216195 (URN)10.1016/j.ic.2023.105111 (DOI)2-s2.0-85175145940 (Scopus ID)
Funder
Swedish Research Council, 2020-03852
Available from: 2023-11-08 Created: 2023-11-08 Last updated: 2023-11-08Bibliographically approved
van der Merwe, B. & Berglund, M. (2022). Ordered Context-Free Grammars. In: Pascal Caron; Ludovic Mignot (Ed.), Implementation and Application of Automata: 26th International Conference, CIAA 2022, Rouen, France, June 28 – July 1, 2022, Proceedings. Paper presented at 26th International Conference on Implementation and Application of Automata, CIAA 2022, Rouen, France, June 28-July 1, 2022. (pp. 53-66). Springer Science+Business Media B.V.
Open this publication in new window or tab >>Ordered Context-Free Grammars
2022 (English)In: Implementation and Application of Automata: 26th International Conference, CIAA 2022, Rouen, France, June 28 – July 1, 2022, Proceedings / [ed] Pascal Caron; Ludovic Mignot, Springer Science+Business Media B.V., 2022, p. 53-66Conference paper, Published paper (Refereed)
Abstract [en]

We propose a new unambiguous grammar formalism, referred to as ordered context-free grammars, which is identical to context-free grammars, apart from the property that it also places an order on parse trees. Since only a minor modification to ordered context-free grammars is required to obtain parsing expression grammars, the relationship between context-free grammars and parsing expression grammars becomes more evident. By preserving how ordered context-free grammars support left-recursion, parsing expression grammars is modified to support left recursion in ways much more natural than current approaches.

Place, publisher, year, edition, pages
Springer Science+Business Media B.V., 2022
Series
Lecture Notes in Computer Science, ISSN 03029743, E-ISSN 16113349 ; 13266
Keywords
Ordered context-free grammars, Parsing expression grammars, unambiguous grammar formalisms
National Category
Language Technology (Computational Linguistics)
Identifiers
urn:nbn:se:umu:diva-197501 (URN)10.1007/978-3-031-07469-1_4 (DOI)000876366600004 ()2-s2.0-85131934265 (Scopus ID)9783031074684 (ISBN)978-3-031-07469-1 (ISBN)
Conference
26th International Conference on Implementation and Application of Automata, CIAA 2022, Rouen, France, June 28-July 1, 2022.
Available from: 2022-06-30 Created: 2022-06-30 Last updated: 2023-09-05Bibliographically approved
Berglund, M., Bester, W. & van der Merwe, B. (2021). Formalising and implementing Boost POSIX regular expression matching. Theoretical Computer Science, 857, 147-165
Open this publication in new window or tab >>Formalising and implementing Boost POSIX regular expression matching
2021 (English)In: Theoretical Computer Science, ISSN 0304-3975, E-ISSN 1879-2294, Vol. 857, p. 147-165Article in journal (Refereed) Published
Abstract [en]

Whereas Perl-compatible regular expression matchers typically exhibit some variation of leftmost-greedy semantics, those conforming to the posix standard are prescribed leftmost-longest semantics. However, the posix standard leaves some room for interpretation, and Fowler and Kuklewicz have done experimental work to confirm differences between various posix matchers. The Boost library has an interesting take on the posix standard, where it maximises the leftmost match not with respect to subexpressions of the regular expression pattern, but rather, with respect to capturing groups. In our work, we provide the first formalisation of Boost semantics, analyze the complexity of regular expression matching when using Boost semantics, and provide efficient algorithms for both online and multipass matching.

Place, publisher, year, edition, pages
Elsevier, 2021
Keywords
Regular expression matching, Posix, Boost
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-192864 (URN)10.1016/j.tcs.2021.01.010 (DOI)000613450400012 ()2-s2.0-85099463759 (Scopus ID)
Available from: 2022-03-02 Created: 2022-03-02 Last updated: 2022-03-03Bibliographically approved
van der Merwe, B., Mouton, J., van Litsenborgh, S. & Berglund, M. (2021). Memoized Regular Expressions. In: Sebastian Maneth (Ed.), Implementation and Application of Automata: 25th International Conference, CIAA 2021, Virtual Event, July 19-22, 2021, Proceedings. Paper presented at 25th International Conference, CIAA 2021, Virtual Event, July 19-22, 2021 (pp. 39-52). Springer
Open this publication in new window or tab >>Memoized Regular Expressions
2021 (English)In: Implementation and Application of Automata: 25th International Conference, CIAA 2021, Virtual Event, July 19-22, 2021, Proceedings / [ed] Sebastian Maneth, Springer, 2021, p. 39-52Conference paper, Published paper (Refereed)
Abstract [en]

We extend non-deterministic finite automata (NFAs) and regular expressions (regexes) by adding memoization to these formalisms. These extensions are aimed at improving the matching time of backtracking regex matchers. Additionally, we discuss how to extend the concept of ambiguity in order to be applicable to memoized extensions of regexes and NFAs. These more general notions of ambiguity can be used to analyze the matching time of backtracking regex matchers enhanced with memoization.

Place, publisher, year, edition, pages
Springer, 2021
Series
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 12803
Keywords
Ambiguity, Regular expression matching, Memoization
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-192865 (URN)10.1007/978-3-030-79121-6_4 (DOI)2-s2.0-85112599570 (Scopus ID)978-3-030-79120-9 (ISBN)978-3-030-79121-6 (ISBN)
Conference
25th International Conference, CIAA 2021, Virtual Event, July 19-22, 2021
Note

Also part of the Theoretical Computer Science and General Issues book sub series (LNTCS, volume 12803)

Available from: 2022-03-02 Created: 2022-03-02 Last updated: 2022-03-03Bibliographically approved
Berglund, M., van der Merwe, B. & van Litsenborgh, S. (2021). Regular Expressions with Lookahead. Journal of universal computer science (Online), 27(4), 324-340
Open this publication in new window or tab >>Regular Expressions with Lookahead
2021 (English)In: Journal of universal computer science (Online), ISSN 0948-695X, E-ISSN 0948-6968, Vol. 27, no 4, p. 324-340Article in journal (Refereed) Published
Abstract [en]

This paper investigates regular expressions which in addition to the standard operators of union, concatenation, and Kleene star, have lookaheads. We show how to translate regular expressions with lookaheads (REwLA) to equivalent Boolean automata having at most 3 states more than the length of the REwLA. We also investigate the state complexity when translating REwLA to equivalent deterministic finite automata (DFA).

Keywords
Regular expressions, Lookahead expressions, Boolean automata
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-192863 (URN)10.3897/jucs.66330 (DOI)000689213900001 ()2-s2.0-85106941209 (Scopus ID)
Available from: 2022-03-02 Created: 2022-03-02 Last updated: 2022-03-03Bibliographically approved
Organisations
Identifiers
ORCID iD: ORCID iD iconorcid.org/0000-0002-3692-6994

Search in DiVA

Show all publications