The abundance of digital content requires cost-effective technologies to extract the hidden meaning from media objects. However, current approaches fail to deal with the challenges related to cross-media analysis, metadata publishing, querying and recommendation that are necessary to overcome this challenge. In this paper, we describe the EU project MICO (Media in Context) which aims to provide the necessary technologies based on open-source software (OSS) core components.
We introduce LOVELACE, a tool for creating corpora of semantic graphs. The system uses graph expansion grammar as a representational language, thus allowing users to craft a grammar that describes a corpus with desired properties. When given such a grammar as input, the system generates a set of output graphs that are well-formed according to the grammar, i.e., a graph bank. The generation process can be controlled via a number of configurable parameters that allow the user to, for example, specify a range of desired output graph sizes. Central use cases are the creation of synthetic data to augment existing corpora, and use as a pedagogical tool for teaching formal language theory.
Millstream systems have recently been proposed as a formalization of the linguistic idea that natural language should be described as a combination of different modules related by interfaces. In this paper we investigate algorithmic properties of Millstream systems having regular tree grammars as modules and MSO logic as interface logic. We focus on the so-called completion problem: Given trees generated by a subset of the modules, can they be completed into a valid configuration of the Millstream system?
We introduce Millstream systems, a mathematical framework in the tradition of the Theory of Computation that uses logic to formalize the interfaces between different aspects of language, the latter being described by any number of independent modules. Unlike other approaches that serve a similar goal, Millstream systems neither presuppose nor establish a particular linguistic theory or focus, but can be instantiated in various ways to accommodate different points of view.
Millstream systems have been proposed as a non-hierarchical method for modelling natural language. Millstream configurations represent and connect multiple structural aspects of sentences. We present a method by which the Millstream configurations corresponding to a sentence are constructed. The construction is incremental, that is, it proceeds as the sentence is being read and is complete when the end of the sentence is reached. It is based on graph transformations and a lexicon which associates words with rules for the graph transformations.
Millstream systems have been proposed as a non-hierarchical method for modelling natural language. Millstream configurations represent and connect multiple structural aspects of sentences. We present a method by which the Millstream configurations corresponding to a sentence are constructed. The construction is incremental, that is, it proceeds as the sentence is being read and is complete when the end of the sentence is reached. It is based on graph transformations and a lexicon which associates words with graph transformation rules that implement the incremental construction process.
This paper considers a characterization of the context-free non-regular languages, conjecturing that for every such language there exists a fixed string that can be pumped to exhibit infinitely many equivalence classes. A proof is given only for a special case, but the general statement is conjectured to hold. The conjecture is then shown to imply that the shuffle of two context-free languages is not context-free.
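For readers unfamiliar with the shuffle operation mentioned above, the following minimal sketch computes the shuffle (the set of all interleavings) of two strings and lifts it to finite languages. The function names `shuffle` and `shuffle_languages` are illustrative assumptions and do not come from the paper; the sketch only shows the operation itself, not the conjecture or its proof.

```python
def shuffle(u, v):
    """Return the set of all interleavings of the strings u and v."""
    if not u:
        return {v}
    if not v:
        return {u}
    # The first symbol of an interleaving comes either from u or from v.
    from_u = {u[0] + w for w in shuffle(u[1:], v)}
    from_v = {v[0] + w for w in shuffle(u, v[1:])}
    return from_u | from_v


def shuffle_languages(lang1, lang2):
    """Shuffle of two finite languages: the union over all pairs of strings."""
    result = set()
    for u in lang1:
        for v in lang2:
            result |= shuffle(u, v)
    return result


print(sorted(shuffle("ab", "c")))
```

Each interleaving preserves the relative order of the symbols within each argument, which is what distinguishes shuffle from arbitrary permutation.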
We study the complexity of uniform membership for Linear Context-Free Rewriting Systems, i.e., the problem where we are given a string w and a grammar G and are asked whether w ∈ L(G). In particular, we use parameterized complexity theory to investigate how the complexity depends on various parameters. While we focus primarily on rank and fan-out, derivation length is also considered.
In a recent survey (Drewes, 2017) of results on DAG automata some open problems are formulated for the case where the DAG language accepted by a DAG automaton A is restricted to DAGs with a single root, denoted by L(A)u. Here we consider each of those problems, demonstrating that: (i) the finiteness of L(A)u is decidable, (ii) the path languages of L(A)u can be characterized in terms of the string languages accepted by partially blind multicounter automata, and (iii) the Parikh image of L(A)u is semilinear.
Most software packages with regular expression matching engines offer operators that extend the classical regular expressions, such as counting, intersection, complementation, and interleaving. Some of the most popular engines, for example those of Java and Perl, also provide operators that are intended to control the nondeterminism inherent in regular expressions. We formalize this notion in the form of the cut and iterated cut operators. They do not extend the class of languages that can be defined beyond the regular, but they allow for exponentially more succinct representation of some languages. Membership testing remains polynomial, but emptiness testing becomes PSPACE-hard.
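As a rough illustration of how such an operator can control nondeterminism, the sketch below assumes one plausible reading of a cut: the left expression must consume the longest prefix it can match, and the right expression must then match the remainder, with no backtracking into the left part. This reading, and the helper name `cut_match`, are assumptions made for illustration and are not taken from the abstract; the code relies only on Python's standard `re.fullmatch`.

```python
import re


def cut_match(left, right, word):
    """Match `word` against an assumed cut of `left` and `right`.

    `left` is committed to the longest prefix of `word` that it matches;
    `right` must then match the rest. No shorter prefix is ever retried.
    """
    for i in range(len(word), -1, -1):
        if re.fullmatch(left, word[:i]):
            # Commit to this (longest) prefix, whatever happens next.
            return re.fullmatch(right, word[i:]) is not None
    return False


# Ordinary concatenation a*ab accepts "aab" by letting a* give back one "a".
print(re.fullmatch("a*ab", "aab") is not None)
# Under the assumed cut semantics, a* keeps "aa", so "ab" cannot match "b".
print(cut_match("a*", "ab", "aab"))
```

The contrast between the two calls shows the intended effect: the cut removes the backtracking that concatenation permits.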
We investigate the problem of extracting the k best strings from a nondeterministic weighted automaton over a semiring S. This problem, which has been considered earlier in the literature, is more difficult than extracting the k best runs, since distinct runs may not correspond to distinct strings. Unsurprisingly, the computational complexity of the problem depends on the semiring S used. We study three different cases, namely the tropical and complex tropical semirings, and the semiring of positive real numbers. For the first case, we establish a polynomial algorithm. For the second and third cases, NP-completeness and undecidability results are shown.
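The difference between best runs and best strings can be made concrete with a small sketch over the tropical semiring (min, +). The code below is a naive best-first enumeration, not the algorithm from the paper: it pops partial runs in order of increasing weight and reports a string only the first time it is accepted. The automaton encoding, the function name `k_best_strings`, the zero final weights, and the `max_pops` safety bound are all assumptions made for illustration; weights are assumed to be non-negative.

```python
import heapq


def k_best_strings(transitions, initial, finals, k, max_pops=10_000):
    """Return up to k (weight, string) pairs with distinct strings.

    transitions maps a state to a list of (symbol, weight, next_state)
    triples with non-negative weights. The weight of a string is the
    minimum weight over all of its accepting runs.
    """
    heap = [(0.0, "", initial)]
    seen = set()
    best = []
    pops = 0
    while heap and len(best) < k and pops < max_pops:
        weight, string, state = heapq.heappop(heap)
        pops += 1
        # Runs are popped in non-decreasing weight order, so the first
        # accepting run seen for a string is a cheapest one.
        if state in finals and string not in seen:
            seen.add(string)
            best.append((weight, string))
        for symbol, w, nxt in transitions.get(state, []):
            heapq.heappush(heap, (weight + w, string + symbol, nxt))
    return best


# Hypothetical automaton: the string "a" has two runs (weights 1 and 3).
example = {
    0: [("a", 1.0, 1), ("a", 3.0, 2), ("b", 2.0, 2)],
    1: [("b", 4.0, 1)],
    2: [],
}
print(k_best_strings(example, 0, {1, 2}, 3))
```

In this example the three cheapest runs spell "a", "b", and "a" again, so they yield only two distinct strings; the third best string is "ab", found only after the duplicate run has been skipped.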
Modern regular expression matching software features many extensions, some general, while some are very narrowly specified. Here we consider the generalization of adding a class of operators which can be described by, e.g. finite-state transducers. Combined with backreferences, they enable new classes of languages to be matched. The addition of finite-state transducers is shown to make membership testing undecidable. Following this result, we study the complexity of membership testing for various restricted cases of the model.
The output size problem, for a string-to-tree transducer, is to determine the asymptotic behavior of the function describing the maximum size of output trees, with respect to the length of input strings. We show that the problem of determining, for a given regular expression, the worst-case matching time of a backtracking regular expression matcher can be reduced to the output size problem. The latter can, in turn, be solved by determining the degree of ambiguity of a non-deterministic finite automaton.
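To give a feel for what the degree of ambiguity measures, the sketch below counts the accepting runs of a non-deterministic finite automaton on a given word by dynamic programming. It is only an illustration of ambiguity counting, not the reduction described above; the function name `count_runs` and the transition encoding are assumptions. The example automaton corresponds to the expression (a|a)*, which has a single state with two parallel a-loops.

```python
from collections import Counter


def count_runs(transitions, initial, finals, word):
    """Count accepting runs of an NFA on `word`.

    transitions is a list of (source, symbol, target) triples; repeated
    triples are treated as distinct transitions.
    """
    counts = Counter({initial: 1})  # number of runs reaching each state
    for symbol in word:
        nxt = Counter()
        for src, sym, tgt in transitions:
            if sym == symbol and counts[src]:
                nxt[tgt] += counts[src]
        counts = nxt
    return sum(c for state, c in counts.items() if state in finals)


# Two parallel loops on "a": every input symbol doubles the number of runs.
loops = [("q", "a", "q"), ("q", "a", "q")]
for n in range(5):
    print(n, count_runs(loops, "q", {"q"}, "a" * n))
```

Here the number of runs on a word of length n is 2^n, i.e., the ambiguity grows exponentially; a backtracking matcher that has to rule out every such run before reporting failure would do exponentially much work on inputs of this shape.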
We consider in some detail how regular expression matching happens in Java, as a popular representative of the category of regex-directed matching engines. We extract a slightly idealized algorithm for this scenario. Next we define an automata model which captures, in a formal way, all the aspects needed to perform Java-style matching. Finally, we perform two types of static analysis, each of which takes a regular expression and tells whether there exists a family of strings that makes Java-style matching run in exponential time.
Motivated by applications in natural language processing, we study the uniform membership problem for hyperedge-replacement grammars that generate directed acyclic graphs. Our major result is a low-degree polynomial-time algorithm that solves the uniform membership problem for a restricted type of such grammars. We motivate the necessity of the restrictions by two different NP-completeness results.
We introduce a weighted extension of the recently proposed notion of order-preserving hyperedge-replacement grammars and prove that the weight of a graph according to such a weighted graph grammar can be computed uniformly in quadratic time (under assumptions made precise in the paper).
It is well known that hyperedge-replacement grammars can generate NP-complete graph languages even under seemingly harsh restrictions. This means that the parsing problem is difficult even in the non-uniform setting, in which the grammar is considered to be fixed rather than being part of the input. Little is known about restrictions under which truly uniform polynomial parsing is possible. In this paper we propose a low-degree polynomial-time algorithm that solves the uniform parsing problem for a restricted type of hyperedge-replacement grammars which we expect to be of interest for practical applications.
We develop a finite-state transducer for translating unranked trees into general graphs. This work is motivated by recent progress in semantic parsing for natural language, where sentences are first mapped into tree-shaped syntactic representations, and then these trees are translated into graph semantic representations. We investigate formal properties of our tree-to-graph transducers and develop a polynomial time algorithm for translating a weighted language of input trees into a packed representation, from which best-score graphs can be efficiently recovered.
We propose a formal model for translating unranked syntactic trees, such as dependency trees, into semantic graphs. These tree-to-graph transducers can serve as a formal basis of transition systems for semantic parsing which recently have been shown to perform very well, yet hitherto lack formalization. Our model features "extended" rules and an arc-factored normal form, comes with an efficient translation algorithm, and can be equipped with weights in a straightforward manner.
There is a growing consensus that surface form alone does not enable models to learn meaning and gain language understanding. This warrants an interest in hybrid systems that combine the strengths of neural and symbolic methods. We favour triadic systems consisting of neural networks, knowledge bases, and inference engines. The network provides perception, that is, the interface between the system and its environment. The knowledge base provides explicit memory and thus immediate access to established facts. Finally, inference capabilities are provided by the inference engine which reflects on the perception, supported by memory, to reason and discover new facts. In this work, we probe six popular language models for semantic relations and outline a future line of research to study how the constituent subsystems can be jointly realised and integrated.
The N-best extraction problem consists in selecting the N highest ranking hypotheses from a set of hypotheses, with respect to a given ranking system. In our setting, the hypotheses and ranking are jointly represented by a weighted tree automaton (wta) over the tropical semiring: the hypotheses are trees, or runs on trees, and the ranking is decided by the weight assigned to them. In previous work, we presented an algorithm for N-best extraction that combines techniques to restrict the search space, and proved it to be correct and efficient. The algorithm is now implemented in the software Betty, allowing us to complement the deductive study with an empirical investigation. In particular, we compare our algorithm to the state-of-the-art algorithm for extracting the N best runs, implemented in the software toolkit Tiburon. The data sets used in the experiments are wtas resulting from real-world natural language processing tasks, as well as artificially created wtas with varying degrees of nondeterminism. We find that Betty outperforms Tiburon on all tested data sets.
We conduct a comparative study of two state-of-the-art algorithms for extracting the N best trees from a weighted tree automaton (wta). The algorithms are Best Trees, which uses a priority queue to structure the search space, and Filtered Runs, which is based on an algorithm by Huang and Chiang that extracts N best runs, implemented as part of the Tiburon wta toolkit. The experiments are run on four data sets, each consisting of a sequence of wtas of increasing sizes. Our conclusion is that Best Trees can be recommended when the input wtas exhibit a high or unpredictable degree of nondeterminism, whereas Filtered Runs is the better option when the input wtas are large but essentially deterministic.
We show that a previously proposed algorithm for the N-best trees problem – not to be confused with the easier N-best runs problem – can be made more efficient by changing how it arranges and explores the search space. Given an integer N and a weighted tree automaton (wta) M over the tropical semiring, the algorithm computes N trees of minimal weight with respect to M. Compared to the original algorithm, the modifications increase the laziness of the evaluation strategy, which makes the new algorithm asymptotically more efficient than its predecessor.
We propose an algorithm for computing the N best vertices in a weighted acyclic hypergraph over a nice semiring. A semiring is nice if it is finitely-generated, idempotent, and has 1 as its minimal element. We then apply the algorithm to the problem of computing the N best trees with respect to a weighted tree automaton, and complement theoretical correctness and complexity arguments with experimental data. The algorithm has several practical applications in natural language processing, for example, to derive the N most likely parse trees with respect to a probabilistic context-free grammar.
Graph-based semantic representations are popular in natural language processing (NLP), where it is often convenient to model linguistic concepts as nodes and relations as edges between them. Several attempts have been made to find a generative device that is sufficiently powerful to describe languages of semantic graphs, while at the same time allowing efficient parsing. We contribute to this line of work by introducing graph extension grammar, a variant of the contextual hyperedge replacement grammars proposed by Hoffmann et al. Contextual hyperedge replacement can generate graphs with non-structural reentrancies, a type of node-sharing that is very common in formalisms such as abstract meaning representation, but which context-free types of graph grammars cannot model. To provide our formalism with a way to place reentrancies in a linguistically meaningful way, we endow rules with logical formulas in counting monadic second-order logic. We then present a parsing algorithm and show as our main result that this algorithm runs in polynomial time on graph languages generated by a subclass of our grammars, the so-called local graph extension grammars.
We show that a previously proposed algorithm for the N-best trees problem can be made more efficient by changing how it arranges and explores the search space. Given an integer N and a weighted tree automaton (wta) M over the tropical semiring, the algorithm computes N trees of minimal weight with respect to M. Compared with the original algorithm, the modifications increase the laziness of the evaluation strategy, which makes the new algorithm asymptotically more efficient than its predecessor. The algorithm is implemented in the software BETTY, and compared to the state-of-the-art algorithm for extracting the N best runs, implemented in the software toolkit TIBURON. The data sets used in the experiments are wtas resulting from real-world natural language processing tasks, as well as artificially created wtas with varying degrees of nondeterminism. We find that BETTY outperforms TIBURON on all tested data sets with respect to running time, while TIBURON seems to be the more memory-efficient choice.
We propose an algorithm for computing the $N$ best roots of a weighted hypergraph, in which the weight function is given over an idempotent and multiplicatively monotone semiring. We give a set of conditions that ensures that the weight function is well-defined and that solutions exist. Under these conditions, we prove that the proposed algorithm is correct. This generalizes a previous result for weighted tree automata, and in doing so, broadens the practical applications.
Graph-based semantic representations are valuable in natural language processing, where it is often simple and effective to represent linguistic concepts as nodes, and relations as edges between them. Several attempts have been made to find a generative device that is sufficiently powerful to represent languages of semantic graphs, while at the same time allowing efficient parsing. We add to this line of work by introducing graph extension grammar, which consists of an algebra over graphs together with a regular tree grammar that generates expressions over the operations of the algebra. Due to the design of the operations, these grammars can generate graphs with non-structural reentrancies, a type of node-sharing that is excessively common in formalisms such as abstract meaning representation, but for which existing devices offer little support. We provide a parsing algorithm for graph extension grammars, which is proved to be correct and to run in polynomial time.
We outline the initial ideas for a representational framework for capturing temporal aspects in semantic parsing of multimodal data. As a starting point, we take the Abstract Meaning Representations of Banarescu et al. and propose a way of extending them to cover sequential progressions of events. The first modality to be considered is text, but the long-term goal is to also incorporate information from visual and audio modalities, as well as contextual information.
Unranked tree languages are valuable in natural language processing for modelling dependency trees. We introduce a new type of automaton for unranked tree languages, called Z-automaton, that is tailored for this particular application. The Z-automaton offers a compact form of representation, and unlike the closely related notion of stepwise automata, does not require a binary encoding of its input. We establish an arc-factored normal form, and prove the membership problem of Z-automata in normal form to be in O(mn), where m is the size of the transition table of the Z-automaton and n is the size of the input tree.
We generalise a search algorithm by Mohri and Riley from strings to trees. The original algorithm takes as input a weighted automaton M over the tropical semiring, together with an integer N, and outputs N strings of minimal weight with respect to M. In our setting, M defines a weighted tree language, again over the tropical semiring, and the output is a set of N trees with minimal weight. We prove that the algorithm is correct, and that its time complexity is a low polynomial in N and the relevant size parameters of M.
We generalise a search algorithm by Mohri and Riley from strings to trees. The original algorithm takes as input a nondeterministic weighted automaton M over the tropical semiring and an integer N, and outputs N strings of minimal weight with respect to M. In our setting, M is a weighted tree automaton, again over the tropical semiring, and the output is a set of N trees with minimal weight in this language. We prove that the algorithm is correct, and that its time complexity is a low polynomial in N and the relevant size parameters of M.
We study sets of directed acyclic graphs, called regular DAG languages, which are accepted by a recently introduced type of DAG automata motivated by current developments in natural language processing. We prove (or disprove) closure properties, establish pumping lemmata, characterize finite regular DAG languages, and show that "unfolding" turns regular DAG languages into regular tree languages, which implies a linear growth property and the regularity of the path languages of regular DAG languages. Further, we give polynomial decision algorithms for the emptiness and finiteness problems, and show that deterministic DAG automata can be minimized and tested for equivalence in polynomial time.
A DAG is a directed acyclic graph. We study the properties of DAG automata and their languages, called regular DAG languages. In particular, we prove results resembling pumping lemmas and show that the finiteness problem for regular DAG languages is in P.
Graphs have a variety of uses in natural language processing, particularly as representations of linguistic meaning. A deficit in this area of research is a formal framework for creating, combining, and using models involving graphs that parallels the frameworks of finite automata for strings and finite tree automata for trees. A possible starting point for such a framework is the formalism of directed acyclic graph (DAG) automata, defined by Kamimura and Slutzki and extended by Quernheim and Knight. In this article, we study the latter in depth, demonstrating several new results, including a practical recognition algorithm that can be used for inference and learning with models defined on DAG automata. We also propose an extension to graphs with unbounded node degree and show that our results carry over to the extended formalism.
The relationship between Term Graph Rewriting and Term Rewriting is well understood: a single term graph reduction may correspond to several term reductions, due to sharing. It is also known that if term graphs are allowed to contain cycles, then one term graph reduction may correspond to infinitely many term reductions. We stress that this fact can be interpreted in two ways. According to the "sequential interpretation", a term graph reduction corresponds to an infinite sequence of term reductions, as formalized by Kennaway et al. using strongly converging derivations over the complete metric space of infinite terms. According to the "parallel interpretation", by contrast, a term graph reduction corresponds to the parallel reduction of an infinite set of redexes in a rational term. We formalize the latter notion by exploiting the complete partial order of infinite and possibly partial terms, and we stress that this interpretation allows us to explain the result of reducing circular redexes in several approaches to term graph rewriting.
Semantic embeddings have advanced the state of the art for countless natural language processing tasks, and various extensions to multimodal domains, such as visual-semantic embeddings, have been proposed. While the power of visual-semantic embeddings comes from the distillation and enrichment of information through machine learning, their inner workings are poorly understood and there is a shortage of analysis tools. To address this problem, we generalize the notion of probing tasks to the visual-semantic case. To this end, we (i) discuss the formalization of probing tasks for embeddings of image-caption pairs, (ii) define three concrete probing tasks within our general framework, (iii) train classifiers to probe for those properties, and (iv) compare various state-of-the-art embeddings under the lens of the proposed probing tasks. Our experiments reveal an up to 12% increase in accuracy on visual-semantic embeddings compared to the corresponding unimodal embeddings, which suggests that the text and image dimensions represented in the former do complement each other.
Languages of directed acyclic graphs (DAGs) are of interest in Natural Language Processing because they can be used to capture the structure of semantic graphs like those of Abstract Meaning Representation. This paper gives an overview of recent results on a family of automata recognizing such DAG languages.
We introduce a generalization of tree-based generators called delegation networks. These make it possible to generate objects such as strings, trees, graphs, and pictures in a modular way by combining tree-based generators of several types. Our main result states that, if all underlying tree generators generate regular tree languages (or finite tree languages), then the tree-generating power of delegation networks is the same as that of context-free tree grammars working in IO mode.