The abundance of digital content requires cost-effective technologies to extract the hidden meaning from media objects. However, current approaches do not adequately address cross-media analysis, metadata publishing, querying, and recommendation, all of which are needed to meet this demand. In this paper, we describe the EU project MICO (Media in Context), which aims to provide the necessary technologies based on open-source software (OSS) core components.
Millstream systems have recently been proposed as a formalization of the linguistic idea that natural language should be described as a combination of different modules related by interfaces. In this paper we investigate algorithmic properties of Millstream systems having regular tree grammars as modules and MSO logic as interface logic. We focus on the so-called completion problem: Given trees generated by a subset of the modules, can they be completed into a valid configuration of the Millstream system?
We introduce Millstream systems, a mathematical framework in the tradition of the Theory of Computation that uses logic to formalize the interfaces between different aspects of language, the latter being described by any number of independent modules. Unlike other approaches that serve a similar goal, Millstream systems neither presuppose nor establish a particular linguistic theory or focus, but can be instantiated in various ways to accommodate different points of view.
Millstream systems have been proposed as a non-hierarchical method for modelling natural language. Millstream configurations represent and connect multiple structural aspects of sentences. We present a method by which the Millstream configurations corresponding to a sentence are constructed. The construction is incremental, that is, it proceeds as the sentence is being read and is complete when the end of the sentence is reached. It is based on graph transformations and a lexicon which associates words with rules for the graph transformations.
Millstream systems have been proposed as a non-hierarchical method for modelling natural language. Millstream configurations represent and connect multiple structural aspects of sentences. We present a method by which the Millstream configurations corresponding to a sentence are constructed. The construction is incremental, that is, it proceeds as the sentence is being read and is complete when the end of the sentence is reached. It is based on graph transformations and a lexicon which associates words with graph transformation rules that implement the incremental construction process.
This paper considers a characterization of the context-free non-regular languages, conjecturing that for every such language there exists a fixed string that can be pumped to exhibit infinitely many equivalence classes. A proof is given only for a special case, but the general statement is conjectured to hold. The conjecture is then shown to imply that the shuffle of two context-free languages is not context-free.
We study the complexity of uniform membership for Linear Context-Free Rewriting Systems, i.e., the problem where we are given a string w and a grammar G and are asked whether w ∈ L(G). In particular, we use parameterized complexity theory to investigate how the complexity depends on various parameters. While we focus primarily on rank and fan-out, derivation length is also considered.
In a recent survey (Drewes, 2017) of results on DAG automata some open problems are formulated for the case where the DAG language accepted by a DAG automaton A is restricted to DAGs with a single root, denoted by L(A)u. Here we consider each of those problems, demonstrating that: (i) the finiteness of L(A)u is decidable, (ii) the path languages of L(A)u can be characterized in terms of the string languages accepted by partially blind multicounter automata, and (iii) the Parikh image of L(A)u is semilinear.
Most software packages with regular expression matching engines offer operators that extend the classical regular expressions, such as counting, intersection, complementation, and interleaving. Some of the most popular engines, for example those of Java and Perl, also provide operators that are intended to control the nondeterminism inherent in regular expressions. We formalize this notion in the form of the cut and iterated cut operators. They do not extend the class of languages that can be defined beyond the regular, but they allow for exponentially more succinct representation of some languages. Membership testing remains polynomial, but emptiness testing becomes PSPACE-hard.
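As a rough illustration of the cut operator, the Python sketch below assumes that a cut of r1 and r2 accepts a string w when the longest prefix of w matching r1 is followed by a remainder that matches r2, with no backtracking into shorter prefixes. This reading, the helper name `cut_match`, and the brute-force prefix search are assumptions made for illustration only; they are not the constructions studied in the paper.

```python
import re


def cut_match(r1: str, r2: str, word: str) -> bool:
    """Hypothetical helper: match `word` against "r1 cut r2".

    Assumed semantics: commit to the LONGEST prefix of `word` that r1
    matches in full, then require r2 to match the remainder in full.
    """
    # Try prefixes from longest to shortest and commit to the first hit.
    for i in range(len(word), -1, -1):
        if re.fullmatch(r1, word[:i]):
            return re.fullmatch(r2, word[i:]) is not None
    return False  # r1 matches no prefix at all


# Ordinary concatenation a*ab accepts "aab" by giving one 'a' back to "ab".
concat_accepts = re.fullmatch("a*ab", "aab") is not None
# Under the cut, a* keeps both leading 'a's, so "ab" cannot match the rest.
cut_rejects = not cut_match("a*", "ab", "aab")
# When the remainder fits, the cut accepts as usual.
cut_accepts = cut_match("a*", "b", "aab")
```

Under this reading, cutting `a*` against `ab` accepts no string at all, although the plain concatenation `a*ab` is non-empty, which gives some intuition for why emptiness testing is harder in the presence of cuts.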
We investigate the problem of extracting the k best strings from a nondeterministic weighted automaton over a semiring S. This problem, which has been considered earlier in the literature, is more difficult than extracting the k best runs, since distinct runs may not correspond to distinct strings. Unsurprisingly, the computational complexity of the problem depends on the semiring S used. We study three different cases, namely the tropical and complex tropical semirings, and the semiring of positive real numbers. For the first case, we establish a polynomial algorithm. For the second and third cases, NP-completeness and undecidability results are shown.
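The gap between best runs and best strings can be made concrete with a small Python sketch over the tropical semiring (weights are added along a run, and the weight of a string is the minimum over its runs). It is a naive best-first search, not the polynomial algorithm established in the paper. The function name `k_best_strings`, the transition encoding, the zero final weights, the non-negative transition weights, and the `max_len` cut-off are all assumptions made for this example.

```python
import heapq


def k_best_strings(transitions, initial, finals, k, max_len=20):
    """Naive best-first search for the k lightest distinct strings.

    transitions: dict mapping state -> list of (symbol, weight, next_state);
    weights are assumed non-negative, final weights are assumed to be 0.
    """
    heap = [(0.0, "", initial)]   # (weight so far, string so far, state)
    expanded = set()              # (string, state) pairs already expanded
    emitted = set()               # strings already reported
    best = []
    while heap and len(best) < k:
        weight, string, state = heapq.heappop(heap)
        if (string, state) in expanded:
            continue              # a lighter run reached this pair earlier
        expanded.add((string, state))
        if state in finals and string not in emitted:
            emitted.add(string)   # first accepting pop is the lightest run
            best.append((string, weight))
        if len(string) < max_len:
            for symbol, w, nxt in transitions.get(state, []):
                heapq.heappush(heap, (weight + w, string + symbol, nxt))
    return best


# Three accepting runs, but only two distinct strings: "ab" has two runs.
T = {
    0: [("a", 1.0, 1), ("a", 3.0, 2), ("b", 2.0, 2)],
    1: [("b", 4.0, 3)],
    2: [("b", 1.0, 3)],
    3: [],
}
two_best = k_best_strings(T, 0, {3}, 2)
three_best = k_best_strings(T, 0, {3}, 3)
```

In this toy automaton the three lightest runs spell "bb", "ab", and "ab" again, so asking for three best strings yields only two results; a run-based extraction would report "ab" twice.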
Modern regular expression matching software features many extensions, some general, while some are very narrowly specified. Here we consider the generalization of adding a class of operators which can be described by, e.g. finite-state transducers. Combined with backreferences, they enable new classes of languages to be matched. The addition of finite-state transducers is shown to make membership testing undecidable. Following this result, we study the complexity of membership testing for various restricted cases of the model.
The output size problem, for a string-to-tree transducer, is to determine the asymptotic behavior of the function describing the maximum size of output trees, with respect to the length of input strings. We show that the problem of determining, for a given regular expression, the worst-case matching time of a backtracking regular expression matcher can be reduced to the output size problem. The latter can, in turn, be solved by determining the degree of ambiguity of a non-deterministic finite automaton.
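To give a feel for what the degree of ambiguity measures, the following Python sketch counts the accepting runs of a small non-deterministic finite automaton on inputs of growing length. The two toy automata, the function name `count_accepting_runs`, and the transition encoding are assumptions chosen for illustration; the sketch only evaluates ambiguity on concrete strings and does not decide the asymptotic degree, which is what the reduction in the paper relies on.

```python
def count_accepting_runs(transitions, initial, finals, word):
    """Count accepting runs of an NFA on `word` by dynamic programming.

    transitions: dict mapping (state, symbol) -> list of target states.
    """
    counts = {initial: 1}          # number of runs reaching each state
    for symbol in word:
        nxt = {}
        for state, c in counts.items():
            for target in transitions.get((state, symbol), ()):
                nxt[target] = nxt.get(target, 0) + c
        counts = nxt
    return sum(c for s, c in counts.items() if s in finals)


# Polynomially ambiguous: the run may switch from state 0 to 1 at any step.
POLY = {(0, "a"): [0, 1], (1, "a"): [1]}
# Exponentially ambiguous: every step offers two choices from either state.
EXP = {(0, "a"): [0, 1], (1, "a"): [0, 1]}

poly_counts = [count_accepting_runs(POLY, 0, {1}, "a" * n) for n in range(1, 6)]
exp_counts = [count_accepting_runs(EXP, 0, {0}, "a" * n) for n in range(1, 6)]
```

On inputs of length 1 to 5 the first automaton has 1, 2, 3, 4, 5 accepting runs and the second has 1, 2, 4, 8, 16, which mirrors the linear versus exponential growth that a backtracking matcher may have to explore.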
We consider in some detail how regular expression matching happens in Java, as a popular representative of the category of regex-directed matching engines. We extract a slightly idealized algorithm for this scenario. Next, we define an automata model that formally captures all the aspects needed to perform Java-style matching. Finally, we present two types of static analysis, each of which takes a regular expression and tells whether there exists a family of strings that makes Java-style matching run in exponential time.
Motivated by applications in natural language processing, we study the uniform membership problem for hyperedge-replacement grammars that generate directed acyclic graphs. Our major result is a low-degree polynomial-time algorithm that solves the uniform membership problem for a restricted type of such grammars. We motivate the necessity of the restrictions by two different NP-completeness results.
We introduce a weighted extension of the recently proposed notion of order-preserving hyperedge-replacement grammars and prove that the weight of a graph according to such a weighted graph grammar can be computed uniformly in quadratic time (under assumptions made precise in the paper).
It is well known that hyperedge-replacement grammars can generate NP-complete graph languages even under seemingly harsh restrictions. This means that the parsing problem is difficult even in the non-uniform setting, in which the grammar is considered to be fixed rather than being part of the input. Little is known about restrictions under which truly uniform polynomial parsing is possible. In this paper we propose a low-degree polynomial-time algorithm that solves the uniform parsing problem for a restricted type of hyperedge-replacement grammars which we expect to be of interest for practical applications.
We propose a formal model for translating unranked syntactic trees, such as dependency trees, into semantic graphs. These tree-to-graph transducers can serve as a formal basis of transition systems for semantic parsing which recently have been shown to perform very well, yet hitherto lack formalization. Our model features "extended" rules and an arc-factored normal form, comes with an efficient translation algorithm, and can be equipped with weights in a straightforward manner.
We conduct a comparative study of two state-of-the-art algorithms for extracting the N best trees from a weighted tree automaton (wta). The algorithms are Best Trees, which uses a priority queue to structure the search space, and Filtered Runs, which is based on an algorithm by Huang and Chiang that extracts N best runs, implemented as part of the Tiburon wta toolkit. The experiments are run on four data sets, each consisting of a sequence of wtas of increasing sizes. Our conclusion is that Best Trees can be recommended when the input wtas exhibit a high or unpredictable degree of nondeterminism, whereas Filtered Runs is the better option when the input wtas are large but essentially deterministic.
We propose an algorithm for computing the N best vertices in a weighted acyclic hypergraph over a nice semiring. A semiring is nice if it is finitely-generated, idempotent, and has 1 as its minimal element. We then apply the algorithm to the problem of computing the N best trees with respect to a weighted tree automaton, and complement theoretical correctness and complexity arguments with experimental data. The algorithm has several practical applications in natural language processing, for example, to derive the N most likely parse trees with respect to a probabilistic context-free grammar.
We propose an algorithm for computing the $N$ best roots of a weighted hypergraph, in which the weight function is given over an idempotent and multiplicatively monotone semiring. We give a set of conditions that ensures that the weight function is well-defined and that solutions exist. Under these conditions, we prove that the proposed algorithm is correct. This generalizes a previous result for weighted tree automata, and in doing so, broadens the practical applications.
Unranked tree languages are valuable in natural language processing for modelling dependency trees. We introduce a new type of automaton for unranked tree languages, called Z-automaton, that is tailored for this particular application. The Z-automaton offers a compact form of representation, and unlike the closely related notion of stepwise automata, does not require a binary encoding of its input. We establish an arc-factored normal form, and prove the membership problem of Z-automata in normal form to be in O(mn), where m is the size of the transition table of the Z-automaton and n is the size of the input tree.
We generalise a search algorithm by Mohri and Riley from strings to trees. The original algorithm takes as input a weighted automaton M over the tropical semiring, together with an integer N, and outputs N strings of minimal weight with respect to M. In our setting, M defines a weighted tree language, again over the tropical semiring, and the output is a set of N trees with minimal weight. We prove that the algorithm is correct, and that its time complexity is a low polynomial in N and the relevant size parameters of M.
We generalise a search algorithm by Mohri and Riley from strings to trees. The original algorithm takes as input a nondeterministic weighted automaton M over the tropical semiring and an integer N, and outputs N strings of minimal weight with respect to M. In our setting, M is a weighted tree automaton, again over the tropical semiring, and the output is a set of N trees with minimal weight in this language. We prove that the algorithm is correct, and that its time complexity is a low polynomial in N and the relevant size parameters of M.
We study sets of directed acyclic graphs, called regular DAG languages, which are accepted by a recently introduced type of DAG automata motivated by current developments in natural language processing. We prove (or disprove) closure properties, establish pumping lemmata, characterize finite regular DAG languages, and show that "unfolding" turns regular DAG languages into regular tree languages, which implies a linear growth property and the regularity of the path languages of regular DAG languages. Further, we give polynomial decision algorithms for the emptiness and finiteness problems, and show that deterministic DAG automata can be minimized and tested for equivalence in polynomial time.
A DAG is a directed acyclic graph. We study the properties of DAG automata and their languages, called regular DAG languages. In particular, we prove results resembling pumping lemmas and show that the finiteness problem for regular DAG languages is in P.
Graphs have a variety of uses in natural language processing, particularly as representations of linguistic meaning. A deficit in this area of research is a formal framework for creating, combining, and using models involving graphs that parallels the frameworks of finite automata for strings and finite tree automata for trees. A possible starting point for such a framework is the formalism of directed acyclic graph (DAG) automata, defined by Kamimura and Slutzki and extended by Quernheim and Knight. In this article, we study the latter in depth, demonstrating several new results, including a practical recognition algorithm that can be used for inference and learning with models defined on DAG automata. We also propose an extension to graphs with unbounded node degree and show that our results carry over to the extended formalism.
The relationship between Term Graph Rewriting and Term Rewriting is well understood: a single term graph reduction may correspond to several term reductions, due to sharing. It is also known that if term graphs are allowed to contain cycles, then one term graph reduction may correspond to infinitely many term reductions. We stress that this fact can be interpreted in two ways. According to the "sequential interpretation", a term graph reduction corresponds to an infinite sequence of term reductions, as formalized by Kennaway et al. using strongly converging derivations over the complete metric space of infinite terms. According to the "parallel interpretation", by contrast, a term graph reduction corresponds to the parallel reduction of an infinite set of redexes in a rational term. We formalize the latter notion by exploiting the complete partial order of infinite and possibly partial terms, and we stress that this interpretation allows us to explain the result of reducing circular redexes in several approaches to term graph rewriting.
Languages of directed acyclic graphs (DAGs) are of interest in Natural Language Processing because they can be used to capture the structure of semantic graphs like those of Abstract Meaning Representation. This paper gives an overview of recent results on a family of automata recognizing such DAG languages.
We introduce a generalization of tree-based generators called delegation networks. These make it possible to generate objects such as strings, trees, graphs, and pictures in a modular way by combining tree-based generators of several types. Our main result states that, if all underlying tree generators generate regular tree languages (or finite tree languages), then the tree-generating power of delegation networks is the same as that of context-free tree grammars working in IO mode.
Alan Turing [5] suggested regarding the ability to communicate in human language as an indication of true intelligence. If a computer were able to engage in such communication with human beings without them being able to identify the computer, the computer should be considered to be intelligent. Although it is debatable whether this conclusion could really be drawn from the Turing test (see also [6]), it shows how complex human language is and how many facets it has. Some of the most important dimensions of language are phonology, morphology, syntax, semantics, and pragmatics. Pragmatics includes the whole field of contextual and ontological knowledge. These dimensions are not orthogonal, but are intertwined in many ways. Even if we restrict ourselves to text input and output, thus disregarding phonology, this creates an amazingly complex structure. While computational linguists try to make progress in understanding the relation between the various dimensions, we usually restrict ourselves to syntax in natural language processing, sometimes extended by limited attempts to make a semantic interpretation or to make use of ontological knowledge. The reason for this is, of course, the descriptional and computational complexity of the models required.
We review recent results regarding DAG automata and regular DAG languages and point out some open problems that may be interesting to work on. Moreover, a notion of DAG transducers is suggested.
We consider collage grammars whose rules subdivide the unit square into smaller and smaller rectangles. The decidability status of selected decision problems for this type of grammar is surveyed: the membership problem, the emptiness and finiteness problems, connectedness and disconnectedness of the generated pictures, and the question of whether a generated collage contains a rectangle whose lower-left corner is a point on the diagonal.
The conceptual ideas that are intended to become the basis for the tree automata workbench Marbles are sketched. The goal is to design and implement an extensible system that facilitates experiments with virtually any kind of algorithm on tree automata. Moreover, the system will be released with a library and an application programmer's interface to make it accessible to anyone who wants to apply tree automata algorithms in research and development.