Millstream systems have recently been proposed as a formalization of the linguistic idea that natural language should be described as a combination of different modules related by interfaces. In this paper we investigate algorithmic properties of Millstream systems having regular tree grammars as modules and MSO logic as interface logic. We focus on the so-called completion problem: Given trees generated by a subset of the modules, can they be completed into a valid configuration of the Millstream system?
This paper considers a characterization of the context-free non-regular languages, conjecturing that there for all such languages exists a fixed string thatcan be pumped to exhibit infinitely many equivalence classes. A proof is given only for a special case, but the general statement is conjectured to hold. The conjecture is then shown to imply that the shuffle of two context-free languagesis not context-free.
A regular unranked tree folding consists of a regular unranked tree language and a folding operation that merges, i.e., folds, selected nodes of a tree to form a graph; the combination is a formal device for representing graph languages. If, in the process of folding, the order among edges is discarded so that the result is an unordered graph, then two applications of a fold operation is enough to make the associated parsing problem NP-complete. However, if the order is kept, then the problem is solvable in non-uniform polynomial time. In this paper we address the remaining case where only one fold operation is applied, but the order among edges is discarded. We show that under these conditions, the problem is solvable in non-uniform polynomial time.
A regular unranked tree folding consists of a regular unranked tree language and a folding operation that merges (i.e., folds) selected nodes of a tree to form a graph; the combination is a formal device for representing graph languages. If, in the process of folding, the order among edges is discarded so that the result is an unordered graph, then two applications of a fold operation are enough to make the associated parsing problem NP-complete. However, if the order is kept, then the problem is solvable in non-uniform polynomial time. In this paper, we address the remaining case, where only one fold operation is applied, but the order among the edges is discarded. We show that, under these conditions, the problem is solvable in non-uniform polynomial time.
Language models that use interleaving, or shuffle, operators have applications in various areas of computer science, including system verification, plan recognition, and natural language processing. We study the complexity of the membership problem for such models, in other words, how difficult it is to determine if a string belongs to a language or not. In particular, we investigate how interleaving can be introduced into models that capture the context-free languages.
We introduce a fold operation that realises a tree-to-graph transduction by merging selected nodes in the input tree to form a possibly cyclic output graph. The work is motivated by the increasing use of graph-based representations in semantic parsing. We show that a suitable class of graphs languages can be generated by applying the fold operation to regular unranked tree languages. We investigate two versions of the fold operation, one that preserves a depth-first ordering between the edges, and one that does not. Finally, we demonstrate that the time complexity for the associated non-uniform membership problem is solvable in polynomial time for the order-preserving version, and NP-complete for the order-cancelling one.
We study the complexity of uniform membership for Linear Context-Free RewritingSystems, i.e., the problem where we aregiven a string w and a grammar G and areasked whether w ∈ L(G). In particular,we use parameterized complexity theoryto investigate how the complexity dependson various parameters. While we focusprimarily on rank and fan-out, derivationlength is also considered.
In a recent survey (Drewes, 2017) of results on DAG automata some open problems are formulated for the case where the DAG language accepted by a DAG automaton A is restricted to DAGs with a single root, denoted by L(A)u. Here we consider each of those problems, demonstrating that: (i) the finiteness of L(A)u is decidable, (ii) the path languages of L(A)u can be characterized in terms of the string languages accepted by partially blind multicounter automata, and (iii) the Parikh image of L(A)u is semilinear.
Most software packages with regular expression matching engines offer operators that extend the classical regular expressions, such as counting, intersection, complementation, and interleaving. Some of the most popular engines, for example those of Java and Perl, also provide operators that are intended to control the nondeterminism inherent in regular expressions. We formalize this notion in the form of the cut and iterated cut operators. They do not extend the class of languages that can be defined beyond the regular, but they allow for exponentially more succinct representation of some languages. Membership testing remains polynomial, but emptiness testing becomes PSPACE-hard.
Language models that use interleaving, or shuffle, operators have applications in various areas of computer science, including system verification, plan recognition, and natural language processing. We study the complexity of the membership problem for such models, i.e., how difficult it is to determine if a string belongs to a language or not. In particular, we investigate how interleaving can be introduced into models that capture the context-free languages.
Parsing for mildly context-sensitive language formalisms is an important area within natural language processing. While the complexity of the parsing problem for some such formalisms is known to be polynomial, this is not the case for all of them. This article presents a series of results regarding the complexity of parsing for linear context-free rewriting systems and deterministic tree-walking transducers. We discuss the difference between uniform and nonuniform complexity measures and how parameterized complexity theory can be used to investigate how different aspects of the formalisms influence how hard the parsing problem is. The main results we survey are all hardness results and indicate that parsing is hard even for relatively small values of parameters such as rank and fan-out in a rewriting system.
MICO is an open-source platform for cross-media analysis, querying, and recommendation. It is the major outcome of the European research project Media in Context, and has been contributed to by academic and industrial partners from Germany, Austria, Sweden, Italy, and the UK. A central idea is to group sets of related media objects into multimodal content items, and to process and store these as logical units. The platform is designed to be easy to extend and adapt, and this makes it a useful building block for a diverse set of multimedia applications. To promote the platform and demonstrate its potential, we describe our work on a Kaldi-based speech-recognition pipeline.
Order-preserving DAG grammars (OPDGs) is a formalism for processing semantic infor- mation in natural languages [5, 4]. OPDGs are sufficiently expressive to model abstract meaning representations, a graph-based form of semantic representation in which nodes en- code objects and edges relations. At the same time, they allow for efficient parsing in the uniform setting, where both the grammar and subject graph are taken as part of the input.
In this article, we introduce an initial algebra semantic for OPDGs, which allows us to view them as regular tree grammars. This makes it possible to transfer a number of results from that domain to OPDGs, both in the unweighted and the weighted case. In particular, we show that deterministic OPDGs can be minimised efficiently, and that they are learnable in the so-called MAT setting. To conclude, we show that the languages generated by OPDGs are MSO-definable.
Order-Preserving DAG Grammars (OPDGs) is a subclass of Hyper-Edge Replacement Grammars that can be parsed in polynomial time. Their associated class of languages is known as Ordered DAG Lan- guages, and the graphs they generate are characterised by being acyclic, rooted, and having a natural order on their nodes. OPDGs are useful in natural-language processing to model abstract meaning representa- tions. We state and prove a Myhill-Nerode theorem for ordered DAG languages, and translate it into a MAT-learning algorithm for the same class. The algorithm infers a minimal OPDG G for the target language in time polynomial in G and the samples provided by the MAT oracle.
Order-preserving DAG grammars (OPDGs) is a formalism for representing languages of structurally restricted graphs. As demonstrated in [17], they are sufficiently expressive to model abstract meaning representations in natural language processing, a graph-based form of semantic representation in which nodes encode objects and edges relations. At the same time, they can be parsed in O (n2 + nm) , where m and n are the sizes of the grammar and the input graph, respectively. In this work, we provide an initial algebra semantic for OPDGs, which allows us to view them as regular tree grammars under an equivalence theory. This makes it possible to transfer results from the field of formal tree languages to the domain of OPDGs, both in the unweighted and the weighted case. In particular, we show that deterministic OPDGs can be minimised efficiently, and that they are learnable under the \minimal adequeate teacher" paradigm, that is, by querying an oracle for equivalence between languages, and membership of individual graphs. To conclude, we demonstrate that the languages generated by OPDGs are definable in monadic second-order logic.
Several linear-time algorithms for automata-based pattern matching rely on failure transitions for efficient back-tracking. Like epsilon transitions, failure transition do not consume input symbols, but unlike them, they may only be taken when no other transition is applicable. At a semantic level, this conveniently models catch-all clauses and allows for compact language representation.
This work investigates the transition-reduction problem for deterministic finite-state automata (DFA). The input is a DFA A and an integer k. The question is whether k or more transitions can be saved by replacing regular transitions with failure transitions. We show that while the problem is NP-complete, there are approximation techniques and heuristics that mitigate the computational complexity. We conclude by demonstrating the computational difficulty of two related minimisation problems, thereby cancelling the ongoing search for efficient algorithms.
Part of Speech (POS) taggers for Swedish routinely fail for the third person gender-neutral pronoun hen, despite the fact that it has been a well-established part of the Swedish language since at least 2014. In addition to simply being a form of gender bias, this failure can have negative effects on other tasks relying on POS information. We demonstrate the usefulness of semi-synthetic augmented datasets in a case study, retraining a POS tagger to correctly recognize hen as a personal pronoun. We evaluate our retrained models for both tag accuracy and on a downstream task (dependency parsing) in a classicial NLP pipeline.
Our results show that adding such data works to correct for the disparity in performance. The accuracy rate for identifying hen as a pronoun can be brought up to acceptable levels with only minor adjustments to the tagger’s vocabulary files. Performance parity to gendered pronouns can be reached after retraining with only a few hundred examples. This increase in POS tag accuracy also results in improvements for dependency parsing sentences containing hen.
Despite the fact that the gender-neutral pro-noun hen was officially added to the Swedish language in 2014, state of the art part of speech taggers still routinely fail to identify it as a pronoun. We retrain both efselab and spaCy models with augmented (semi-synthetic) data, where instances of gendered pronouns are replaced by hen to correct for the lack of representation in the original training data. Our results show that adding such data works to correct for the disparity in performance
Motivated by applications in natural language processing, we study the uniform membership problem for hyperedge-replacement grammars that generate directed acyclic graphs. Our major result is a low-degree polynomial-time algorithm that solves the uniform membership problem for a restricted type of such grammars. We motivate the necessity of the restrictions by two different NP-completeness results.
We introduce a weighted extension of the recently proposed notion oforder-preserving hyperedge-replacement grammars and prove that the weightof a graph according to such a weighted graph grammar can be computeduniformly in quadratic time (under assumptions made precise in the paper).
We introduce a weighted extension of the recently proposed notion of order-preserving hyperedge-replacement grammars and prove that the weight of a graph according to such a weighted graph grammar can be computed uniformly in quadratic time (under assumptions made precise in the paper).
It is well known that hyperedge-replacement grammars can generate NP-complete graph languages even under seemingly harsh restrictions. This means that the parsing problem is difficult even in the non-uniform setting, in which the grammar is considered to be fixed rather than being part of the input. Little is known about restrictions under which truly uniform polynomial parsing is possible. In this paper we propose a low-degree polynomial-time algorithm that solves the uniform parsing problem for a restricted type of hyperedge-replacement grammars which we expect to be of interest for practical applications.
We study the problem of incrementally maintaining an XPath query on an XML database under updates. The updates we consider are node insertion, node deletion, and node relabeling. Our main results are that downward XPath queries can be incrementally maintained in time O(depth(D) · poly(Q)) and conjunctive forward XPath queries in time O(depth(D)· log(width(D))·poly(Q)), where D is the size of the database, Q the size of the query, and depth(D) and width(D) are the nesting depth and maximum number of siblings in the database, respectively. The auxiliary data structures for maintenance are linear in D and polynomial in Q in all these cases.
Incremental view maintenance for XPath queries asks to maintain a materialized XPath view over an XML database. It assumes an underlying XML database D and a query Q. One is given a sequence of updates U to D, and the problem is to compute the result of Q(U(D)): the result of evaluating query Q on database D after having applied updates U. This article initiates a systematic study of the Boolean version of this problem. In the Boolean version, one only wants to know whether Q(U(D)) is empty or not.
In order to quickly answer this question, we are allowed to maintain an auxiliary data structure. The complexity of the maintenance algorithms is measured in, (1) the size of the auxiliary data structure, (2) the worst-case time per update needed to compute Q(U(D)), and (3) the worst-case time per update needed to bring the auxiliary data structure up to date. We allow three kinds of updates: node insertion, node deletion, and node relabeling. Our main results are that downward XPath queries can be incrementally maintained in time O(depth(D)·poly(|Q|)) per update and conjunctive forward XPath queries in time O(depth(D) · log(width(D))·poly(|Q|)) per update, where |Q| is the size of the query, and depth(D) and width(D) are the nesting depth and maximum number of siblings in database D, respectively. The auxiliary data structures for maintenance are linear in |D| and polynomial in |Q| in all these cases.
We prove that minimizing finite automata is NP-hard for almost all classes of automata that extend the class of deterministic finite automata. More specifically, we show that minimization is NP-hard for all finite automata classes that subsume the class of δNFAs which accept strings of length at most three. Here, δNFAs are the finite automata that are unambiguous, allow at most one state q with a non-deterministic transition for at most one alphabet symbol a, and are allowed to visit state q at most once in a run. As a corollary, we also obtain that the same result holds for all finite automata classes that subsume that class of finite automata that are unambiguous, have at most two initial states, and accept strings of length at most two.
The complexity of containment and satisfiability of conjunctive queries over finite, unranked, labeled trees is studied with respect to the axes Child, NextSibling, their transitive and reflexive closures, andFollowing. For the containment problem a trichotomy is presented, classifying the problems as in PTIME, coNP-complete, or -complete. For the satisfiability problem most problems are classified as either in PTIME or NP-complete.
We study the containment, satisfiability, and validity problems for conjunctive queries over trees with respect to a schema. We show that conjunctive query containment and validity are 2EXPTIME -complete with respect to a schema, in both cases where the schema is given as a DTD or as a tree automaton. Furthermore, we show that satisfiability for conjunctive queries with respect to a schema can be decided in NP . The problem is NP -hard already for queries using only one kind of axis. Finally, we consider conjunctive queries that can test for equalities and inequalities of data values. Here, satisfiability and validity are decidable, but containment is undecidable, even without schema information. On the other hand, containment with respect to a schema becomes decidable again if the "larger" query is not allowed to use both equalities and inequalities.
We prove that various containment and validity problems for tree pattern queries with respect to a schema are EXPTIME-complete. When one does not require the root of a tree pattern query to match the root of atree, validity of a non-branching tree pattern query with respect to a Relax NG schema or W3C XML Schema is already EXPTIME-hard when the query does not branch and uses only child axes. These hardness results already hold when the alphabet size is fixed. Validity with respect to a DTD is proved to be EXPTIME-hard already when the query only uses child axes and is allowed to branch only once.
Regular expressions are omnipresent in database applications. They form the structural core of schema languages for XML, they are a fundamental ingredient for navigational queries in graph databases, and are being considered in languages for upcoming technologies such as schema- and transformation languages for tabular data on the Web. In this paper we study the usage and effectiveness of the counting operator (or: limited repetition) in regular expressions. The counting operator is a popular extension which is part of the POSIX standard and therefore also present in regular expressions in grep, Java, Python, Perl, and Ruby. In a database context, expressions with counting appear in XML Schema and languages for querying graphs such as SPARQL 1.1 and Cypher.
We first present a practical study that suggests that counters are extensively used in practice. We then investigate evaluation methods for such expressions and develop a new algorithm for efficient incremental evaluation. Finally, we conduct an extensive benchmark study that shows that exploiting counting operators can lead to speed-ups of several orders of magnitude in a wide range of settings: normal and incremental evaluation on synthetic and real expressions.
Active context-free games are two-player games on strings over finite alphabets with one player trying to rewrite the input string to match a target specification. These games have been investigated in the context of exchanging Active XML (AXML) data. While it was known that the rewriting problem is undecidable in general, it is shown here that it is EXPSPACE-complete to decide for a given contextfree game, whether all safely rewritable strings can be safely rewritten in a left-to-right manner, a problem that was previously considered by Abiteboul et al. Furthermore, it is shown that the corresponding problem for games with finite replacement languages is EXPTIME-complete.
Data words are an extension of traditional strings that have at each position, besides a symbol from some finite alphabet, a data value from some infinite domain. Class-memory automata constitute an automata model for data words with a decidable emptiness problem. The paper improves the previous result that class-memory automata are strictly more expressive than register automata, another automata model for data words. More specifically, it shows that weak class-memory automata, a restriction of class-memory automata introduced by Cotton-Barratt, Murawski, and Ong are strictly more powerful than the extension of register automata by a data-guessing facility. While weak deterministic class-memory automata yield a restriction of class-memory automata which is closed under Boolean operations, the paper also proposes an extension of deterministic class-memory automata with this property.
With motivation from considerations in XML database theory and model checking, data strings have been introduced as an extension of finite alphabet strings which carry, at each position, a symbol and a data value from an infinite domain. Previous work has shown that it is difficult to come up with an expressive yet decidable automaton model for data languages. Recently, such a model, data automata, was introduced. This paper introduces a simpler but equivalent model and investigates its expressive power, algorithmic and closure properties, and some extensions.
Gender bias in natural language processing (NLP) tools, deriving from implicit human bias embedded in language data, is an important and complicated problem on the road to fair algorithms. We leverage topic modeling to retrieve documents associated with particular gendered categories, and discuss how exploring these documents can inform our understanding of the corpora we may use to train NLP tools. This is a starting point for challenging the systemic power structures and producing a justice-focused approach to NLP.
Gender bias has been identified in many models for Natural Language Processing, stemming from implicit biases in the text corpora used to train the models. Such corpora are too large to closely analyze for biased or stereotypical content. Thus, we argue for a combination of quantitative and qualitative methods, where the quantitative part produces a view of the data of a size suitable for qualitative analysis. We investigate the usefulness of semi-supervised topic modeling for the detection and analysis of gender bias in three corpora (mainstream news articles in English and Swedish, and LGBTQ+ web content in English). We compare differences in topic models for three gender categories (masculine, feminine, and nonbinary or neutral) in each corpus. We find that in all corpora, genders are treated differently and that these differences tend to correspond to hegemonic ideas of gender.
The rise of concern around Natural Language Processing (NLP) technologies containing and perpetuating social biases has led to a rich and rapidly growing area of research. Gender bias is one of the central biases being analyzed, but to date there is no comprehensive analysis of how “gender” is theorized in the field. We survey nearly 200 articles concerning gender bias in NLP to discover how the field conceptualizes gender both explicitly (e.g. through definitions of terms) and implicitly (e.g. through how gender is operationalized in practice). In order to get a better idea of emerging trajectories of thought, we split these articles into two sections by time.
We find that the majority of the articles do not make their theo- rization of gender explicit, even if they clearly define “bias.” Almost none use a model of gender that is intersectional or inclusive of non- binary genders; and many conflate sex characteristics, social gender, and linguistic gender in ways that disregard the existence and expe- rience of trans, nonbinary, and intersex people. There is an increase between the two time-sections in statements acknowledging that gender is a complicated reality, however, very few articles manage to put this acknowledgment into practice. In addition to analyzing these findings, we provide specific recommendations to facilitate interdisciplinary work, and to incorporate theory and methodol- ogy from Gender Studies. Our hope is that this will produce more inclusive gender bias research in NLP.
Despite concerns that Large Language Models (LLMs) are vectors for reproducing and ampli- fying social biases such as sexism, transpho- bia, islamophobia, and racism, there is a lack of work qualitatively analyzing how such pat- terns of bias are generated by LLMs. We use mixed-methods approaches and apply a femi- nist, intersectional lens to the problem across two language domains, Swedish and English, by generating narrative texts using LLMs. We find that hegemonic norms are consistently re- produced; dominant identities are often treated as ‘default’; and discussion of identity itself may be considered ‘inappropriate’ by the safety features applied to some LLMs. Due to the dif- fering behaviors of models, depending both on their design and the language they are trained on, we observe that strategies of identifying “bias” must be adapted to individual models and their socio-cultural contexts.
We investigate the problem of predicting the competence of users of the crowd-sourcing platform Zooniverse by analyzing their chat texts. Zooniverse is an online platform where objects of different types are displayed to volunteer users to classify. Our research focuses on the Zoonivers Galaxy Zoo project, where users classify the images of galaxies and discuss their classifications in text. We apply natural language processing methods to extract linguistic features including syntactic categories, bag-of-words, and punctuation marks. We trained three supervised machine-learning classifiers on the resulting dataset: k-nearest neighbors, decision trees (with gradient boosting) and naive Bayes. They are evaluated (regarding accuracy and F-measure) with two different but related domain datasets. The performance of the classifiers varies across the feature set configurations designed during the training phase. A challenging part of this research is to compute the competence of the users without ground truth data available. We implemented a tool that estimates the proficiency of users and annotates their text with computed competence. Our evaluation results show that the trained classifier models give results that are significantly better than chance and can be deployed for other crowd-sourcing projects as well.