Umeå University's logo

umu.sePublikasjoner
Endre søk
Link to record
Permanent link

Direct link
Publikasjoner (10 av 29) Visa alla publikasjoner
Berglund, M., Martens, W. & van der Merwe, B. (2024). Constructing a BPE Tokenization DFA. In: Szilárd Zsolt Fazekas (Ed.), Implementation and Application of Automata: 28th International Conference, CIAA 2024, Akita, Japan, September 3–6, 2024, Proceedings. Paper presented at Implementation and Application of Automata, CIAA 2024, Akita, Japan, 3-6 September, 2024. (pp. 66-78). Springer, 15015
Åpne denne publikasjonen i ny fane eller vindu >>Constructing a BPE Tokenization DFA
2024 (engelsk)Inngår i: Implementation and Application of Automata: 28th International Conference, CIAA 2024, Akita, Japan, September 3–6, 2024, Proceedings / [ed] Szilárd Zsolt Fazekas, Springer, 2024, Vol. 15015, s. 66-78Konferansepaper, Publicerat paper (Fagfellevurdert)
Abstract [en]

Many natural language processing systems operate over tokenizations of text to address the open-vocabulary problem. In this paper, we give and analyze an algorithm for the efficient construction of deterministic finite automata (DFA) designed to operate directly on tokenizations produced by the popular byte pair encoding (BPE) technique. This makes it possible to apply many existing techniques and algorithms to the tokenized case, such as pattern matching, equivalence checking of tokenization dictionaries, and composing tokenized languages in various ways.

sted, utgiver, år, opplag, sider
Springer, 2024
Serie
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 15015
HSV kategori
Forskningsprogram
datalogi
Identifikatorer
urn:nbn:se:umu:diva-229437 (URN)10.1007/978-3-031-71112-1_5 (DOI)978-3-031-71111-4 (ISBN)
Konferanse
Implementation and Application of Automata, CIAA 2024, Akita, Japan, 3-6 September, 2024.
Tilgjengelig fra: 2024-09-09 Laget: 2024-09-09 Sist oppdatert: 2024-09-10
Berglund, M., Björklund, H. & Björklund, J. (2024). Parsing unranked tree languages, folded once. Algorithms, 17(6), Article ID 268.
Åpne denne publikasjonen i ny fane eller vindu >>Parsing unranked tree languages, folded once
2024 (engelsk)Inngår i: Algorithms, E-ISSN 1999-4893, Vol. 17, nr 6, artikkel-id 268Artikkel i tidsskrift (Fagfellevurdert) Published
Abstract [en]

A regular unranked tree folding consists of a regular unranked tree language and a folding operation that merges (i.e., folds) selected nodes of a tree to form a graph; the combination is a formal device for representing graph languages. If, in the process of folding, the order among edges is discarded so that the result is an unordered graph, then two applications of a fold operation are enough to make the associated parsing problem NP-complete. However, if the order is kept, then the problem is solvable in non-uniform polynomial time. In this paper, we address the remaining case, where only one fold operation is applied, but the order among the edges is discarded. We show that, under these conditions, the problem is solvable in non-uniform polynomial time.

sted, utgiver, år, opplag, sider
MDPI, 2024
Emneord
graphs, transducers, trees, vector addition systems
HSV kategori
Identifikatorer
urn:nbn:se:umu:diva-227569 (URN)10.3390/a17060268 (DOI)2-s2.0-85196886791 (Scopus ID)
Forskningsfinansiär
Swedish Research Council, 2020-03852Wallenberg AI, Autonomous Systems and Software Program (WASP)Knut and Alice Wallenberg Foundation
Merknad

This paper is an extended version of a paper published in International Symposium on Fundamentals of Computation Theory, Trier, Germany, 18–21 September.

Tilgjengelig fra: 2024-07-02 Laget: 2024-07-02 Sist oppdatert: 2024-07-02bibliografisk kontrollert
Berglund, M. & van der Merwe, B. (2023). Formalizing BPE Tokenization. In: Nagy B., Freund R. (Ed.), 13th International Workshop onNon-Classical Models of Automata and Applications (NCMA 2023): . Paper presented at 13th International Workshop on Non-Classical Models of Automata and Applications, NCMA 2023, 18-19 September, 2023, Famagusta, Cyprus (pp. 16-27). Open Publishing Association
Åpne denne publikasjonen i ny fane eller vindu >>Formalizing BPE Tokenization
2023 (engelsk)Inngår i: 13th International Workshop onNon-Classical Models of Automata and Applications (NCMA 2023) / [ed] Nagy B., Freund R., Open Publishing Association , 2023, s. 16-27Konferansepaper, Publicerat paper (Fagfellevurdert)
Abstract [en]

In this paper, we formalize practical byte pair encoding tokenization as it is used in large language models and other NLP systems, in particular we formally define and investigate the semantics of the SentencePiece and HuggingFace tokenizers, in particular how they relate to each other, depending on how the tokenization rules are constructed. Beyond this we consider how tokenization can be performed in an incremental fashion, as well as doing it left-to-right using an amount of memory constant in the length of the string, enabling e.g. using a finite state string-to-string transducer.

sted, utgiver, år, opplag, sider
Open Publishing Association, 2023
Serie
Electronic Proceedings in Theoretical Computer Science, EPTCS, ISSN 2075-2180
HSV kategori
Identifikatorer
urn:nbn:se:umu:diva-215226 (URN)10.4204/EPTCS.388.4 (DOI)2-s2.0-85173029200 (Scopus ID)
Konferanse
13th International Workshop on Non-Classical Models of Automata and Applications, NCMA 2023, 18-19 September, 2023, Famagusta, Cyprus
Tilgjengelig fra: 2023-10-18 Laget: 2023-10-18 Sist oppdatert: 2023-10-18bibliografisk kontrollert
Berglund, M., Björklund, H. & Björklund, J. (2023). Parsing unranked tree languages, folded once. In: Henning Fernau; Klaus Jansen (Ed.), Fundamentals of computation theory: 24th International Symposium, FCT 2023, Trier, Germany, September 18–21, 2023, Proceedings. Paper presented at 24th International Symposium on Fundamentals of Computation Theory, FCT 2023, Trier, Germany, September 18–21, 2023 (pp. 60-73). Springer Nature
Åpne denne publikasjonen i ny fane eller vindu >>Parsing unranked tree languages, folded once
2023 (engelsk)Inngår i: Fundamentals of computation theory: 24th International Symposium, FCT 2023, Trier, Germany, September 18–21, 2023, Proceedings / [ed] Henning Fernau; Klaus Jansen, Springer Nature, 2023, s. 60-73Konferansepaper, Publicerat paper (Fagfellevurdert)
Abstract [en]

A regular unranked tree folding consists of a regular unranked tree language and a folding operation that merges, i.e., folds, selected nodes of a tree to form a graph; the combination is a formal device for representing graph languages. If, in the process of folding, the order among edges is discarded so that the result is an unordered graph, then two applications of a fold operation is enough to make the associated parsing problem NP-complete. However, if the order is kept, then the problem is solvable in non-uniform polynomial time. In this paper we address the remaining case where only one fold operation is applied, but the order among edges is discarded. We show that under these conditions, the problem is solvable in non-uniform polynomial time.

sted, utgiver, år, opplag, sider
Springer Nature, 2023
Serie
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 14292
HSV kategori
Identifikatorer
urn:nbn:se:umu:diva-215936 (URN)10.1007/978-3-031-43587-4_5 (DOI)2-s2.0-85174590997 (Scopus ID)9783031435867 (ISBN)
Konferanse
24th International Symposium on Fundamentals of Computation Theory, FCT 2023, Trier, Germany, September 18–21, 2023
Tilgjengelig fra: 2023-11-02 Laget: 2023-11-02 Sist oppdatert: 2023-11-02bibliografisk kontrollert
Berglund, M. & van der Merwe, B. (2023). Re-examining regular expressions with backreferences. Theoretical Computer Science, 940, 66-80
Åpne denne publikasjonen i ny fane eller vindu >>Re-examining regular expressions with backreferences
2023 (engelsk)Inngår i: Theoretical Computer Science, ISSN 0304-3975, E-ISSN 1879-2294, Vol. 940, s. 66-80Artikkel i tidsskrift (Fagfellevurdert) Published
Abstract [en]

Most modern regular expression matching libraries (one of the rare exceptions being Google's RE2) allow backreferences, operations which bind a substring to a variable, allowing it to be matched again verbatim. However, both real-world implementations and definitions in the literature use different syntactic restrictions and have differences in the semantics of the matching of backreferences. Our aim is to compare these various flavors by considering the classes of formal languages that each can describe, establishing, as a result, a hierarchy of language classes. Beyond the hierarchy itself, some complexity results are given, and as part of the effort on comparing language classes new pumping lemmas are established, old classes are extended to new ones, and several incidental results on the nature of these language classes are given.

sted, utgiver, år, opplag, sider
Elsevier, 2023
Emneord
Regular expressions, Backreferences
HSV kategori
Identifikatorer
urn:nbn:se:umu:diva-200951 (URN)10.1016/j.tcs.2022.10.041 (DOI)2-s2.0-85141509500 (Scopus ID)
Tilgjengelig fra: 2022-11-11 Laget: 2022-11-11 Sist oppdatert: 2023-04-24bibliografisk kontrollert
Berglund, M., Björklund, H., Björklund, J. & Boiret, A. (2023). Transduction from trees to graphs through folding. Information and Computation, 295, Article ID 105111.
Åpne denne publikasjonen i ny fane eller vindu >>Transduction from trees to graphs through folding
2023 (engelsk)Inngår i: Information and Computation, ISSN 0890-5401, E-ISSN 1090-2651, Vol. 295, artikkel-id 105111Artikkel i tidsskrift (Fagfellevurdert) Published
Abstract [en]

We introduce a fold operation that realises a tree-to-graph transduction by merging selected nodes in the input tree to form a possibly cyclic output graph. The work is motivated by the increasing use of graph-based representations in semantic parsing. We show that a suitable class of graphs languages can be generated by applying the fold operation to regular unranked tree languages. We investigate two versions of the fold operation, one that preserves a depth-first ordering between the edges, and one that does not. Finally, we demonstrate that the time complexity for the associated non-uniform membership problem is solvable in polynomial time for the order-preserving version, and NP-complete for the order-cancelling one.

sted, utgiver, år, opplag, sider
Elsevier, 2023
Emneord
Graphs, Semantic representations, Tranducers, Trees
HSV kategori
Identifikatorer
urn:nbn:se:umu:diva-216195 (URN)10.1016/j.ic.2023.105111 (DOI)2-s2.0-85175145940 (Scopus ID)
Forskningsfinansiär
Swedish Research Council, 2020-03852
Tilgjengelig fra: 2023-11-08 Laget: 2023-11-08 Sist oppdatert: 2023-11-08bibliografisk kontrollert
van der Merwe, B. & Berglund, M. (2022). Ordered Context-Free Grammars. In: Pascal Caron; Ludovic Mignot (Ed.), Implementation and Application of Automata: 26th International Conference, CIAA 2022, Rouen, France, June 28 – July 1, 2022, Proceedings. Paper presented at 26th International Conference on Implementation and Application of Automata, CIAA 2022, Rouen, France, June 28-July 1, 2022. (pp. 53-66). Springer Science+Business Media B.V.
Åpne denne publikasjonen i ny fane eller vindu >>Ordered Context-Free Grammars
2022 (engelsk)Inngår i: Implementation and Application of Automata: 26th International Conference, CIAA 2022, Rouen, France, June 28 – July 1, 2022, Proceedings / [ed] Pascal Caron; Ludovic Mignot, Springer Science+Business Media B.V., 2022, s. 53-66Konferansepaper, Publicerat paper (Fagfellevurdert)
Abstract [en]

We propose a new unambiguous grammar formalism, referred to as ordered context-free grammars, which is identical to context-free grammars, apart from the property that it also places an order on parse trees. Since only a minor modification to ordered context-free grammars is required to obtain parsing expression grammars, the relationship between context-free grammars and parsing expression grammars becomes more evident. By preserving how ordered context-free grammars support left-recursion, parsing expression grammars is modified to support left recursion in ways much more natural than current approaches.

sted, utgiver, år, opplag, sider
Springer Science+Business Media B.V., 2022
Serie
Lecture Notes in Computer Science, ISSN 03029743, E-ISSN 16113349 ; 13266
Emneord
Ordered context-free grammars, Parsing expression grammars, unambiguous grammar formalisms
HSV kategori
Identifikatorer
urn:nbn:se:umu:diva-197501 (URN)10.1007/978-3-031-07469-1_4 (DOI)000876366600004 ()2-s2.0-85131934265 (Scopus ID)9783031074684 (ISBN)978-3-031-07469-1 (ISBN)
Konferanse
26th International Conference on Implementation and Application of Automata, CIAA 2022, Rouen, France, June 28-July 1, 2022.
Tilgjengelig fra: 2022-06-30 Laget: 2022-06-30 Sist oppdatert: 2023-09-05bibliografisk kontrollert
Berglund, M., Bester, W. & van der Merwe, B. (2021). Formalising and implementing Boost POSIX regular expression matching. Theoretical Computer Science, 857, 147-165
Åpne denne publikasjonen i ny fane eller vindu >>Formalising and implementing Boost POSIX regular expression matching
2021 (engelsk)Inngår i: Theoretical Computer Science, ISSN 0304-3975, E-ISSN 1879-2294, Vol. 857, s. 147-165Artikkel i tidsskrift (Fagfellevurdert) Published
Abstract [en]

Whereas Perl-compatible regular expression matchers typically exhibit some variation of leftmost-greedy semantics, those conforming to the posix standard are prescribed leftmost-longest semantics. However, the posix standard leaves some room for interpretation, and Fowler and Kuklewicz have done experimental work to confirm differences between various posix matchers. The Boost library has an interesting take on the posix standard, where it maximises the leftmost match not with respect to subexpressions of the regular expression pattern, but rather, with respect to capturing groups. In our work, we provide the first formalisation of Boost semantics, analyze the complexity of regular expression matching when using Boost semantics, and provide efficient algorithms for both online and multipass matching.

sted, utgiver, år, opplag, sider
Elsevier, 2021
Emneord
Regular expression matching, Posix, Boost
HSV kategori
Identifikatorer
urn:nbn:se:umu:diva-192864 (URN)10.1016/j.tcs.2021.01.010 (DOI)000613450400012 ()2-s2.0-85099463759 (Scopus ID)
Tilgjengelig fra: 2022-03-02 Laget: 2022-03-02 Sist oppdatert: 2022-03-03bibliografisk kontrollert
van der Merwe, B., Mouton, J., van Litsenborgh, S. & Berglund, M. (2021). Memoized Regular Expressions. In: Sebastian Maneth (Ed.), Implementation and Application of Automata: 25th International Conference, CIAA 2021, Virtual Event, July 19-22, 2021, Proceedings. Paper presented at 25th International Conference, CIAA 2021, Virtual Event, July 19-22, 2021 (pp. 39-52). Springer
Åpne denne publikasjonen i ny fane eller vindu >>Memoized Regular Expressions
2021 (engelsk)Inngår i: Implementation and Application of Automata: 25th International Conference, CIAA 2021, Virtual Event, July 19-22, 2021, Proceedings / [ed] Sebastian Maneth, Springer, 2021, s. 39-52Konferansepaper, Publicerat paper (Fagfellevurdert)
Abstract [en]

We extend non-deterministic finite automata (NFAs) and regular expressions (regexes) by adding memoization to these formalisms. These extensions are aimed at improving the matching time of backtracking regex matchers. Additionally, we discuss how to extend the concept of ambiguity in order to be applicable to memoized extensions of regexes and NFAs. These more general notions of ambiguity can be used to analyze the matching time of backtracking regex matchers enhanced with memoization.

sted, utgiver, år, opplag, sider
Springer, 2021
Serie
Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349 ; 12803
Emneord
Ambiguity, Regular expression matching, Memoization
HSV kategori
Identifikatorer
urn:nbn:se:umu:diva-192865 (URN)10.1007/978-3-030-79121-6_4 (DOI)2-s2.0-85112599570 (Scopus ID)978-3-030-79120-9 (ISBN)978-3-030-79121-6 (ISBN)
Konferanse
25th International Conference, CIAA 2021, Virtual Event, July 19-22, 2021
Merknad

Also part of the Theoretical Computer Science and General Issues book sub series (LNTCS, volume 12803)

Tilgjengelig fra: 2022-03-02 Laget: 2022-03-02 Sist oppdatert: 2022-03-03bibliografisk kontrollert
Berglund, M., van der Merwe, B. & van Litsenborgh, S. (2021). Regular Expressions with Lookahead. Journal of universal computer science (Online), 27(4), 324-340
Åpne denne publikasjonen i ny fane eller vindu >>Regular Expressions with Lookahead
2021 (engelsk)Inngår i: Journal of universal computer science (Online), ISSN 0948-695X, E-ISSN 0948-6968, Vol. 27, nr 4, s. 324-340Artikkel i tidsskrift (Fagfellevurdert) Published
Abstract [en]

This paper investigates regular expressions which in addition to the standard operators of union, concatenation, and Kleene star, have lookaheads. We show how to translate regular expressions with lookaheads (REwLA) to equivalent Boolean automata having at most 3 states more than the length of the REwLA. We also investigate the state complexity when translating REwLA to equivalent deterministic finite automata (DFA).

Emneord
Regular expressions, Lookahead expressions, Boolean automata
HSV kategori
Identifikatorer
urn:nbn:se:umu:diva-192863 (URN)10.3897/jucs.66330 (DOI)000689213900001 ()2-s2.0-85106941209 (Scopus ID)
Tilgjengelig fra: 2022-03-02 Laget: 2022-03-02 Sist oppdatert: 2022-03-03bibliografisk kontrollert
Organisasjoner
Identifikatorer
ORCID-id: ORCID iD iconorcid.org/0000-0002-3692-6994