umu.sePublications
Change search
Link to record
Permanent link

Direct link
BETA
Berglund, Martin
Publications (10 of 16) Show all publications
Berglund, M. & van der Merwe, B. (2017). On the semantics of regular expression parsing in the wild. Theoretical Computer Science, 679, 69-82
Open this publication in new window or tab >>On the semantics of regular expression parsing in the wild
2017 (English)In: Theoretical Computer Science, ISSN 0304-3975, E-ISSN 1879-2294, Vol. 679, p. 69-82Article in journal (Refereed) Published
Abstract [en]

We introduce prioritized transducers to formalize capturing groups in regular expression matching in a way that permits straightforward modeling of capturing in Java's 1 regular expression library. The broader questions of parsing semantics and performance are also considered. In addition, the complexity of deciding equivalence of regular expressions with capturing groups is investigated.

Place, publisher, year, edition, pages
Elsevier, 2017
Keywords
Regular expression matchers, Capturing groups, Prioritized transducers
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-137402 (URN)10.1016/j.tcs.2016.09.006 (DOI)000403125700006 ()
Available from: 2017-07-05 Created: 2017-07-05 Last updated: 2018-06-09Bibliographically approved
Weideman, N., van der Merwe, B., Berglund, M. & Watson, B. (2016). Analyzing Matching Time Behavior of Backtracking Regular Expression Matchers by Using Ambiguity of NFA. In: Implementation and Application of Automata: . Paper presented at 21st International Conference on Implementation and Application of Automata (CIAA), JUL 19-22, 2016, Seoul, SOUTH KOREA (pp. 322-334). Springer
Open this publication in new window or tab >>Analyzing Matching Time Behavior of Backtracking Regular Expression Matchers by Using Ambiguity of NFA
2016 (English)In: Implementation and Application of Automata, Springer, 2016, p. 322-334Conference paper, Published paper (Refereed)
Abstract [en]

We apply results from ambiguity of non-deterministic finite automata to the problem of determining the asymptotic worst-case matching time, as a function of the length of the input strings, when attempting to match input strings with a given regular expression, where the matcher being used is a backtracking regular expression matcher.

Place, publisher, year, edition, pages
Springer, 2016
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 9705
Keywords
Regular expression, Backtracking matcher, Ambiguity
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-130123 (URN)10.1007/978-3-319-40946-7_27 (DOI)000389401500027 ()978-3-319-40946-7; 978-3-319-40945-0 (ISBN)
Conference
21st International Conference on Implementation and Application of Automata (CIAA), JUL 19-22, 2016, Seoul, SOUTH KOREA
Available from: 2017-01-13 Created: 2017-01-11 Last updated: 2018-06-09Bibliographically approved
Björklund, H., Berglund, M. & Petter, E. (2016). Uniform vs. Nonuniform Membership for Mildly Context-Sensitive Languages: A Brief Survey. Algorithms, 9(2), Article ID 32.
Open this publication in new window or tab >>Uniform vs. Nonuniform Membership for Mildly Context-Sensitive Languages: A Brief Survey
2016 (English)In: Algorithms, ISSN 1999-4893, E-ISSN 1999-4893, Vol. 9, no 2, article id 32Article in journal (Refereed) Published
Abstract [en]

Parsing for mildly context-sensitive language formalisms is an important area within natural language processing. While the complexity of the parsing problem for some such formalisms is known to be polynomial, this is not the case for all of them. This article presents a series of results regarding the complexity of parsing for linear context-free rewriting systems and deterministic tree-walking transducers. We discuss the difference between uniform and nonuniform complexity measures and how parameterized complexity theory can be used to investigate how different aspects of the formalisms influence how hard the parsing problem is. The main results we survey are all hardness results and indicate that parsing is hard even for relatively small values of parameters such as rank and fan-out in a rewriting system.

Place, publisher, year, edition, pages
MDPI, 2016
Keywords
mildly context sensitive languages, parsing, formal languages, parameterized complexity
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-120238 (URN)10.3390/a9020032 (DOI)000379833700011 ()
Projects
Parameterized Natural Language Parsing
Funder
Swedish Research Council, 621-2011-6080
Available from: 2016-05-12 Created: 2016-05-12 Last updated: 2018-06-07Bibliographically approved
Berglund, M., Drewes, F. & van der Merwe, B. (2014). Analyzing Catastrophic Backtracking Behavior in Practical Regular Expression Matching. In: : . Paper presented at 14th International Conference Automata and Formal Languages.
Open this publication in new window or tab >>Analyzing Catastrophic Backtracking Behavior in Practical Regular Expression Matching
2014 (English)Conference paper, Published paper (Refereed)
Abstract [en]

We consider in some detail how regular expression matching happens in Java, as a popular representative of the category of regex-directed matching engines. We extract a slightly idealized algorithm for this scenario. Next we define an automata model which captures all the aspects needed to perform matching, of the Java style, in a formal way. Finally, two types of static analysis, which take a regular expression and tells whether there exists a family of strings which make Java-style matching run in exponential time, are done.

Keywords
regular expressions, catastrophic backtracking, static analysis
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-88232 (URN)
Conference
14th International Conference Automata and Formal Languages
Available from: 2014-04-28 Created: 2014-04-28 Last updated: 2018-06-07Bibliographically approved
Berglund, M. (2014). Characterizing Non-Regularity. Umeå: Umeå universitet
Open this publication in new window or tab >>Characterizing Non-Regularity
2014 (English)Report (Other academic)
Abstract [en]

This paper considers a characterization of the context-free non-regular languages, conjecturing that there for all such languages exists a fixed string thatcan be pumped to exhibit infinitely many equivalence classes. A proof is given only for a special case, but the general statement is conjectured to hold. The conjecture is then shown to imply that the shuffle of two context-free languagesis not context-free.

Place, publisher, year, edition, pages
Umeå: Umeå universitet, 2014. p. 9
Series
Report / UMINF, ISSN 0348-0542 ; 14.12
Keywords
context-free languages, characterization, regular languages, pumping lemma, shuffle
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-88231 (URN)
Available from: 2014-04-28 Created: 2014-04-28 Last updated: 2018-06-07Bibliographically approved
Berglund, M. (2014). Complexities of Order-Related Formal Language Extensions. (Doctoral dissertation). Umeå: Umeå Universitet
Open this publication in new window or tab >>Complexities of Order-Related Formal Language Extensions
2014 (English)Doctoral thesis, comprehensive summary (Other academic)
Alternative title[sv]
Komplexiteter hos ordnings-relaterade utökningar av formella språk
Abstract [en]

The work presented in this thesis discusses various formal language formalisms that extend classical formalisms like regular expressions and context-free grammars with additional abilities, most relating to order. This is done while focusing on the impact these extensions have on the efficiency of parsing the languages generated. That is, rather than taking a step up on the Chomsky hierarchy to the context-sensitive languages, which makes parsing very difficult, a smaller step is taken, adding some mechanisms which permit interesting spatial (in)dependencies to be modeled.

The most immediate example is shuffle formalisms, where existing language formalisms are extended by introducing operators which generate arbitrary interleavings of argument languages. For example, introducing a shuffle operator to the regular expressions does not make it possible to recognize context-free languages like anbn, but it does capture some non-context-free languages like the language of all strings containing the same number of as, bs and cs. The impact these additions have on parsing has many facets. Other than shuffle operators we also consider formalisms enforcing repeating substrings, formalisms moving substrings around, and formalisms that restrict which substrings may be concatenated. The formalisms studied here all have a number of properties in common.

  1. They are closely related to existing regular and context-free formalisms. They operate in a step-wise fashion, deriving strings by sequences of rule applications of individually limited power.
  2. Each step generates a constant number of symbols and does not modify parts that have already been generated. That is, strings are built in an additive fashion that does not explode in size (in contrast to e.g. Lindenmayer systems). All languages here will have a semi-linear Parikh image.
  3. They feature some interesting characteristic involving order or other spatial constraints. In the example of the shuffle multiple derivations are in a sense interspersed in a way that each is unaware of.
  4. All of the formalisms are intended to be limited enough to make an efficient parsing algorithm at least for some cases a reasonable goal.

This thesis will give intuitive explanations of a number of formalisms fulfilling these requirements, and will sketch some results relating to the parsing problem for them. This should all be viewed as preparation for the more complete results and explanations featured in the papers given in the appendices.

Abstract [sv]

Denna avhandling diskuterar utökningar av klassiska formalismer inom formella språk, till exempel reguljära uttryck och kontextfria grammatiker. Utökningarna handlar på ett eller annat sätt omordning, och ett särskilt fokus ligger på att göra utökningarna på ett sätt som dels har intressanta spatiala/ordningsrelaterade effekter och som dels bevarar den effektiva parsningen som är möjlig för de ursprungliga klassiska formalismerna. Detta står i kontrast till att ta det större steget upp i Chomsky-hierarkin till de kontextkänsliga språken, vilket medför ett svårt parsningsproblem.

Ett omedelbart exempel på en sådan utökning är s.k. shuffle-formalismer. Dessa utökar existerande formalismer genom att introducera operatorer som godtyckligt sammanflätar strängar från argumentspråk. Om shuffle-operator introduceras till de reguljära uttrycken ger det inte förmågan att känna igen t.ex. det kontextfria språket anbn, men det fångar istället vissa språk som inte är kontextfria, till exempel språket som består av alla strängar som innehåller lika många a:n, b:n och c:n. Sättet på vilket dessa utökningar påverkar parsningsproblemet är mångfacetterat. Utöver dessa shuffle-operatorer tas också formalismer där delsträngar kan upprepas, formalismer där delsträngar flyttas runt, och formalismer som begränsar hur delsträngar får konkateneras upp. Formalismerna som tas upp här har dock vissa egenskaper gemensamma.

  1. De är nära besläktade med de klassiska reguljära och kontextfria formalismerna. De arbetar stegvis, och konstruerar strängar genom successiva applikationer av individuellt enkla regler.
  2. Varje steg genererar ett konstant antal symboler och modifierar inte det som redan genererats. Det vill säga, strängar byggs additivt och längden på dem kan inte explodera (i kontrast till t.ex. Lindenmayer-system). Alla språk som tas upp kommer att ha en semi-linjär Parikh-avbildning.
  3. De har någon instressant spatial/ordningsrelaterad egenskap. Exempelvis sättet på vilket shuffle-operatorer sammanflätar annars oberoende deriveringar.
  4. Alla formalismerna är tänkta att vara begränsade nog att det är resonabelt att ha effektiv parsning som mål.

Denna avhandling kommer att ge intuitiva förklaring av ett antal formalismer som uppfyller ovanstående krav, och kommer att skissa en blandning av resultat relaterade till parsningsproblemet för dem. Detta bör ses som förberedande inför läsning av de mer djupgående och komplexa resultaten och förklaringarna i de artiklar som finns inkluderade som appendix.

Place, publisher, year, edition, pages
Umeå: Umeå Universitet, 2014. p. 60
Series
Report / UMINF, ISSN 0348-0542 ; 2014:13
Keywords
complexity, automata, languages, order, shuffle, regular expressions, context-free, mildly context-sensitive, parsing, membership, algorithm analysis
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-88235 (URN)978-91-7601-047-1 (ISBN)
Public defence
2014-05-21, Naturvetarhuset, N320, Umeå Universitet, Umeå, 13:15 (English)
Opponent
Supervisors
Available from: 2014-04-30 Created: 2014-04-28 Last updated: 2018-06-07Bibliographically approved
Berglund, M., Björklund, H., Drewes, F., van der Merwe, B. & Watson, B. (2013). Cuts in Regular Expressions. In: Marie-Pierre Béal, Olivier Carton (Ed.), Developments in Language Theory: 17th International Conference, DLT 2013, Marne-la-Vallée, France, June 18-21, 2013. Proceedings. Paper presented at 17th International Conference on Developments in Language Theory, DLT 2013, 18 June 2013 through 21 June 2013, Marne-la-Vallee (pp. 70-81). Springer Berlin/Heidelberg
Open this publication in new window or tab >>Cuts in Regular Expressions
Show others...
2013 (English)In: Developments in Language Theory: 17th International Conference, DLT 2013, Marne-la-Vallée, France, June 18-21, 2013. Proceedings / [ed] Marie-Pierre Béal, Olivier Carton, Springer Berlin/Heidelberg, 2013, p. 70-81Conference paper, Published paper (Refereed)
Abstract [en]

Most software packages with regular expression matching engines offer operators that extend the classical regular expressions, such as counting, intersection, complementation, and interleaving. Some of the most popular engines, for example those of Java and Perl, also provide operators that are intended to control the nondeterminism inherent in regular expressions. We formalize this notion in the form of the cut and iterated cut operators. They do not extend the class of languages that can be defined beyond the regular, but they allow for exponentially more succinct representation of some languages. Membership testing remains polynomial, but emptiness testing becomes PSPACE-hard. 

Place, publisher, year, edition, pages
Springer Berlin/Heidelberg, 2013
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 7907
Keywords
regular expression, finite automaton, cut operator
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-67738 (URN)10.1007/978-3-642-38771-5_8 (DOI)978-3-642-38770-8 (ISBN)978-3-642-38771-5 (ISBN)
Conference
17th International Conference on Developments in Language Theory, DLT 2013, 18 June 2013 through 21 June 2013, Marne-la-Vallee
Available from: 2013-03-28 Created: 2013-03-28 Last updated: 2018-06-08Bibliographically approved
Berglund, M., Björklund, H. & Drewes, F. (2013). On the Parameterized Complexity of Linear Context-Free Rewriting Systems. In: Proceedings of the 13th Meeting on the Mathematics of Language (MoL 13): . Paper presented at 13th Meeting on the Mathematics of Language (MoL 13), Sofia, Bulgaria, August 9, 2013 (pp. 21-29). Association for Computational Linguistics
Open this publication in new window or tab >>On the Parameterized Complexity of Linear Context-Free Rewriting Systems
2013 (English)In: Proceedings of the 13th Meeting on the Mathematics of Language (MoL 13), Association for Computational Linguistics, 2013, p. 21-29Conference paper, Published paper (Other academic)
Abstract [en]

We study the complexity of uniform membership for Linear Context-Free RewritingSystems, i.e., the problem where we aregiven a string w and a grammar G and areasked whether w ∈ L(G). In particular,we use parameterized complexity theoryto investigate how the complexity dependson various parameters. While we focusprimarily on rank and fan-out, derivationlength is also considered.

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2013
Keywords
parameterized complexity, linear context-free rewriting system
National Category
Language Technology (Computational Linguistics) Computer Sciences
Research subject
computational linguistics; Computer Science
Identifiers
urn:nbn:se:umu:diva-71671 (URN)978-1-937284-65-7 (ISBN)
Conference
13th Meeting on the Mathematics of Language (MoL 13), Sofia, Bulgaria, August 9, 2013
Projects
Parameterized Natural Language Parsing
Funder
Swedish Research Council, 2011-6080
Available from: 2013-06-04 Created: 2013-06-04 Last updated: 2018-06-08Bibliographically approved
Berglund, M., Björklund, H. & Björklund, J. (2013). Shuffled languages: representation and recognition. Theoretical Computer Science, 489-490, 1-20
Open this publication in new window or tab >>Shuffled languages: representation and recognition
2013 (English)In: Theoretical Computer Science, ISSN 0304-3975, E-ISSN 1879-2294, Vol. 489-490, p. 1-20Article in journal (Refereed) Published
Abstract [en]

Language models that use interleaving, or shuffle, operators have applications in various areas of computer science, including system verification, plan recognition, and natural language processing. We study the complexity of the membership problem for such models, in other words, how difficult it is to determine if a string belongs to a language or not. In particular, we investigate how interleaving can be introduced into models that capture the context-free languages.

Place, publisher, year, edition, pages
Elsevier, 2013
Keywords
Interleaving, Shuffle languages, Membership problems
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-79714 (URN)10.1016/j.tcs.2013.04.022 (DOI)000320973900001 ()
Projects
Parameterized Natural Language Parsing
Funder
Swedish Research Council, 621-2011-6080
Available from: 2013-08-29 Created: 2013-08-29 Last updated: 2018-06-08Bibliographically approved
Berglund, M. (2012). Complexities of Parsing in the Presence of Reordering. (Licentiate dissertation). Umeå: Department of Computing Science, Umeå University
Open this publication in new window or tab >>Complexities of Parsing in the Presence of Reordering
2012 (English)Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]

The work presented in this thesis discusses various formalisms for representing the addition of order-controlling and order-relaxing mechanisms to existing formal language models. An immediate example is shuffle expressions, which can represent not only all regular languages (a regular expression is a shuffle expression), but also features additional operations that generate arbitrary interleavings of its argument strings. This defines a language class which, on the one hand, does not contain all context-free languages, but, on the other hand contains an infinite number of languages that are not context-free. Shuffle expressions are, however, not themselves the main interest of this thesis. Instead we consider several formalisms that share many of their properties, where some are direct generalisations of shuffle expressions, while others feature very different methods of controlling order. Notably all formalisms that are studied here

  • have a semi-linear Parikh image,
  • are structured so that each derivation step generates at most a constant number of symbols (as opposed to the parallel derivations in for example Lindenmayer systems),
  • feature interesting ordering characteristics, created either by derivation steps that may generate symbols in multiple places at once, or by multiple generating processes that produce output independently in an interleaved fashion, and
  • are all limited enough to make the question of efficient parsing an interesting and reasonable goal.

This vague description already hints towards the formalisms considered; the different classes of mildly context-sensitive devices and concurrent finite-state automata.

This thesis will first explain and discuss these formalisms, and will then primarily focus on the associated membership problem (or parsing problem). Several parsing results are discussed here, and the papers in the appendix give a more complete picture of these problems and some related ones.

Place, publisher, year, edition, pages
Umeå: Department of Computing Science, Umeå University, 2012. p. 100
Series
Report / UMINF, ISSN 0348-0542 ; 12.10
Keywords
parsing, membership problems, complexity theory, reordering, shuffle, mildly context-sensitive, formal languages
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-54643 (URN)978-91-7459-435-5 (ISBN)
Presentation
2012-05-18, Department of Computing Science, 13:00 (English)
Opponent
Supervisors
Available from: 2012-05-29 Created: 2012-05-03 Last updated: 2018-06-08Bibliographically approved
Organisations

Search in DiVA

Show all publications