Best Trees Extraction and Contextual Grammars for Language Processing
Umeå University, Faculty of Science and Technology, Department of Computing Science. (Foundations of Language Processing) ORCID iD: 0000-0002-9873-4170
2021 (English) Doctoral thesis, comprehensive summary (Other academic)
Alternative title
Extrahering av optimala träd samt kontextuella grafgrammatiker för språkbearbetning (Swedish)
Abstract [en]

In natural language processing, the syntax of a sentence refers to the words used in the sentence, their grammatical role, and their order. Semantics concerns the concepts represented by the words in the sentence and their relations, i.e., the meaning of the sentence. While a human can easily analyse a sentence in a language they understand to figure out its grammatical construction and meaning, this is a difficult task for a computer. To analyse natural language, the computer needs a language model. First and foremost, the computer must have data structures that can represent syntax and semantics. Then, the computer requires some information about what is considered correct syntax and semantics – this can be provided in the form of human-annotated corpora of natural language. Computers use formal languages such as programming languages, and our goal is thus to model natural languages using formal languages. There are several ways to capture the correctness aspect of a natural language corpus in a formal language model. One strategy is to specify a formal language using a set of rules that are, in a sense, very similar to the grammatical rules of natural language. In this thesis, we only consider such rule-based formalisms.

Trees are commonly used to represent syntactic analyses of sentences, and graphs can represent the semantics of sentences. Examples of rule-based formalisms that define languages of trees and graphs are tree automata and graph grammars, respectively. When used in language processing, the rules of a formalism are normally given weights, which are then combined as specified by the formalism to assign weights to the trees or graphs in its language. The weights enable us to rank the trees and graphs by their similarity to the linguistic data in the human-annotated corpora. 
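
As a rough illustration of how rule weights can be combined into a tree weight, the sketch below evaluates a tree under a small weighted tree automaton over the tropical semiring (weights are added along a run, and the minimum is taken over all runs). The rule set, the state names `q` and `p`, and the helper names are hypothetical and chosen only for this example; they are not taken from the thesis.

```python
import math
from itertools import product

# Hypothetical toy automaton (an assumption for illustration only):
# (symbol, child states) -> list of (target state, rule weight).
RULES = {
    ("a", ()): [("q", 1.0), ("p", 0.5)],
    ("f", ("q", "q")): [("q", 2.0)],
    ("f", ("p", "q")): [("q", 0.5)],
}
FINAL = {"q"}


def state_weights(tree):
    """Map each reachable state to the minimal weight of a run on `tree`.

    A tree is a tuple (symbol, child_1, ..., child_k).
    """
    symbol, children = tree[0], tree[1:]
    child_tables = [state_weights(child) for child in children]
    best = {}
    # Try every combination of states that the children can reach.
    for combo in product(*(table.items() for table in child_tables)):
        states = tuple(state for state, _ in combo)
        below = sum(weight for _, weight in combo)
        for target, rule_weight in RULES.get((symbol, states), []):
            # Tropical semiring: add along a run, minimise over runs.
            best[target] = min(best.get(target, math.inf), below + rule_weight)
    return best


def tree_weight(tree):
    """Weight of `tree`: the cheapest run ending in a final state."""
    table = state_weights(tree)
    return min((w for s, w in table.items() if s in FINAL), default=math.inf)


leaf = ("a",)
print(tree_weight(("f", leaf, leaf)))  # cheapest run uses f(p, q) -> q
```

In this toy automaton the tree `f(a, a)` has two runs, with weights 4.0 and 2.0, so its weight is 2.0. A tree with no accepting run gets weight infinity, which places it last in any ranking.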

Because natural language is very complicated to model, research in natural language processing still has many small gaps to address. This thesis considers two separate but related problems. The first is the N-best problem: finding the N top-ranked hypotheses in a ranked hypothesis space. In our case, the hypothesis space is represented by a weighted rule-based formalism, making the hypothesis space a weighted formal language. The hypotheses themselves can, for example, take the form of weighted syntax trees. The second problem is semantic modelling, whose aim is to find a formalism expressive enough to define languages of semantic representations. The formalism cannot be too complex, however, since we still want to compute solutions to language processing tasks efficiently.

This thesis is divided into two parts according to the two problems introduced above. The first part covers the N-best problem for weighted tree automata. In this line of research, we develop and evaluate multiple versions of an efficient algorithm that solves the problem in question. Since our algorithm is the first to do so, we theoretically and experimentally evaluate it in comparison to the state-of-the-art algorithm for solving an easier version of the problem. In the second part, we study how rule-based formalisms can be used to model graphs that represent meaning, i.e., semantic graphs. We investigate an existing formalism and through this work learn what properties of that formalism are necessary for semantic modelling. Finally, we use our new-found knowledge to develop a more specialised formalism, and argue that it is better suited for the task of semantic modelling than existing formalisms.

Place, publisher, year, edition, pages
Umeå: Umeå universitet, 2021. p. 60
Series
Report / UMINF, ISSN 0348-0542 ; 21.04
Keywords [en]
Weighted tree automata, the N-best problem, efficient algorithms, semantic graph, abstract meaning representation, contextual graph grammars, hyperedge replacement, graph extensions
National Category
Computer Sciences
Identifiers
URN: urn:nbn:se:umu:diva-182989; ISBN: 978-91-7855-521-5 (print); ISBN: 978-91-7855-522-2 (electronic); OAI: oai:DiVA.org:umu-182989; DiVA, id: diva2:1554039
Public defence
2021-06-11, MA316, MIT-huset, plan 3, Umeå, 10:00 (English)
Opponent
Supervisors
Available from: 2021-05-21 Created: 2021-05-11 Last updated: 2021-05-17 Bibliographically approved
List of papers
1. Finding the N Best Vertices in an Infinite Weighted Hypergraph
2017 (English) In: Theoretical Computer Science, ISSN 0304-3975, E-ISSN 1879-2294, Vol. 682, p. 30-41. Article in journal (Refereed) Published
Abstract [en]

We propose an algorithm for computing the N best vertices in a weighted acyclic hypergraph over a nice semiring. A semiring is nice if it is finitely-generated, idempotent, and has 1 as its minimal element. We then apply the algorithm to the problem of computing the N best trees with respect to a weighted tree automaton, and complement theoretical correctness and complexity arguments with experimental data. The algorithm has several practical applications in natural language processing, for example, to derive the N most likely parse trees with respect to a probabilistic context-free grammar. 
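
To make the conditions on a nice semiring more concrete, the following sketch spot-checks two of them, idempotency and 1 being the minimal element, for the tropical semiring (min, +) on a handful of sample values. It assumes the usual natural order of an idempotent semiring (a ≤ b iff a ⊕ b = a); that convention, the sample values, and the function names are assumptions made for this example. Finite generation is not checked, and a check on samples is not a proof.

```python
import math


def plus(a, b):
    """Semiring addition in the tropical semiring: minimum."""
    return min(a, b)


def times(a, b):
    """Semiring multiplication in the tropical semiring: ordinary addition."""
    return a + b


ZERO, ONE = math.inf, 0.0  # neutral elements of plus and times


def leq(a, b):
    """Assumed natural order of an idempotent semiring: a <= b iff a + b = a."""
    return plus(a, b) == a


def passes_spot_check(samples):
    """Check idempotency and minimality of ONE on the given sample values."""
    idempotent = all(plus(a, a) == a for a in samples)
    one_is_minimal = all(leq(ONE, a) for a in samples)
    return idempotent and one_is_minimal


print(passes_spot_check([0.0, 0.5, 3.0, math.inf]))  # non-negative weights
print(passes_spot_check([-1.0, 0.0, 2.0]))           # a negative weight
```

The first call succeeds because 0 is the smallest non-negative weight. The second fails: with a negative weight present, the multiplicative unit 0 is no longer minimal.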

Place, publisher, year, edition, pages
Elsevier, 2017. p. 78
Keywords
Hypergraph, N-best problem, Idempotent semiring
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-132501 (URN); 10.1016/j.tcs.2017.03.010 (DOI); 000405062100005 (); 2-s2.0-85016174936 (Scopus ID)
Note

Special Issue: SI

Available from: 2017-03-15 Created: 2017-03-15 Last updated: 2023-03-24 Bibliographically approved
2. A Comparison of Two N-Best Extraction Methods for Weighted Tree Automata
2018 (English) In: Implementation and Application of Automata: 23rd International Conference, CIAA 2018, Charlottetown, PE, Canada, July 30 – August 2, 2018, Proceedings, Springer, 2018, p. 97-108. Conference paper, Published paper (Refereed)
Abstract [en]

We conduct a comparative study of two state-of-the-art algorithms for extracting the N best trees from a weighted tree automaton (wta). The algorithms are Best Trees, which uses a priority queue to structure the search space, and Filtered Runs, which is based on an algorithm by Huang and Chiang that extracts N best runs, implemented as part of the Tiburon wta toolkit. The experiments are run on four data sets, each consisting of a sequence of wtas of increasing sizes. Our conclusion is that Best Trees can be recommended when the input wtas exhibit a high or unpredictable degree of nondeterminism, whereas Filtered Runs is the better option when the input wtas are large but essentially deterministic.
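
The difference between best runs and best trees can be sketched as follows. In a nondeterministic wta, several runs may produce the same tree, so a list of the N best runs can contain duplicate trees. The function below assumes it is handed a stream of `(weight, tree)` pairs, one per run, already sorted by ascending weight, and keeps the first occurrence of each tree. It illustrates the filtering idea only; the function name and input format are hypothetical, and this is not the code of Tiburon or of the algorithms compared in the paper.

```python
def n_best_trees_from_runs(runs, n):
    """Return the first n distinct trees from runs sorted by ascending weight.

    `runs` is an iterable of (weight, tree) pairs, one pair per run.
    Because the runs arrive in weight order, the first time a tree is
    seen it is paired with the weight of its cheapest run.
    """
    if n <= 0:
        return []
    seen = set()
    result = []
    for weight, tree in runs:
        if tree in seen:
            continue  # a more expensive run on a tree already reported
        seen.add(tree)
        result.append((weight, tree))
        if len(result) == n:
            break
    return result


# Hypothetical run list: the tree "f(a,a)" is produced by two runs.
runs = [(2.0, "f(a,a)"), (3.0, "a"), (4.0, "f(a,a)"), (5.0, "f(f(a,a),a)")]
print(n_best_trees_from_runs(runs, 3))
```

With many duplicate runs, the filter may have to consume far more than N runs before it has found N distinct trees, which is consistent with the abstract's observation that the degree of nondeterminism matters.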

Place, publisher, year, edition, pages
Springer, 2018
Series
Lecture Notes in Computer Science
Keywords
N-best list, tree automaton
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-149994 (URN); 10.1007/978-3-319-94812-6_9 (DOI); 000469285600009 (); 2-s2.0-85051127322 (Scopus ID); 978-3-319-94812-6 (ISBN); 978-3-319-94811-9 (ISBN)
Conference
23rd International Conference on Implementation and Applications of Automata (CIAA 2018), Charlottetown, Canada, July 30-August 2, 2018
Available from: 2018-06-30 Created: 2018-06-30 Last updated: 2023-03-24 Bibliographically approved
3. Faster Computation of N-Best Lists for Weighted Tree Automata
(English) Manuscript (preprint) (Other academic)
Abstract [en]

We show that a previously proposed algorithm for the N-best trees problem – not to be confused with the easier N-best runs problem – can be made more efficient by changing how it arranges and explores the search space. Given an integer N and a weighted tree automaton (wta) M over the tropical semiring, the algorithm computes N trees of minimal weight with respect to M. Compared to the original algorithm, the modifications increase the laziness of the evaluation strategy, which makes the new algorithm asymptotically more efficient than its predecessor. 

Keywords
N-best lists, weighted tree automata, tropical semiring
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-182984 (URN)
Available from: 2021-05-11 Created: 2021-05-11 Last updated: 2021-05-11
4. A Comparative Evaluation of the Efficiency of N-Best Algorithms on Language Data
(English) Manuscript (preprint) (Other academic)
Abstract [en]

The N-best extraction problem consists in selecting the N highest-ranking hypotheses from a set of hypotheses, with respect to a given ranking system. In our setting, the hypotheses and ranking are jointly represented by a weighted tree automaton (wta) over the tropical semiring: the hypotheses are trees, or runs on trees, and the ranking is decided by the weight assigned to them. In previous work, we presented an algorithm for N-best extraction that combines techniques to restrict the search space, and proved it to be correct and efficient. The algorithm is now implemented in the software Betty, allowing us to complement the deductive study with an empirical investigation. In particular, we compare our algorithm to the state-of-the-art algorithm for extracting the N best runs, implemented in the software toolkit Tiburon. The data sets used in the experiments are wtas resulting from real-world natural language processing tasks, as well as artificially created wtas with varying degrees of nondeterminism. We find that Betty outperforms Tiburon on all tested data sets.

National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-182985 (URN)
Available from: 2021-05-11 Created: 2021-05-11 Last updated: 2021-05-11
5. Contextual Hyperedge Replacement Grammars for Abstract Meaning Representations
2017 (English) In: Proceedings of the 13th International Workshop on Tree Adjoining Grammars and Related Formalisms / [ed] M. Kuhlmann, T. Scheffler, Association for Computational Linguistics, 2017, p. 102-111. Conference paper, Published paper (Refereed)
Abstract [en]

We show how contextual hyperedge replacement grammars can be used to generate abstract meaning representations (AMRs), and argue that they are more suitable for this purpose than hyperedge replacement grammars. Contextual hyperedge replacement turns out to have two advantages over plain hyperedge replacement: it can completely cover the language of all AMRs over a given domain of concepts, and at the same time its grammars become both smaller and simpler.

Place, publisher, year, edition, pages
Association for Computational Linguistics, 2017
Keywords
Abstract Meaning Representation, DAG Language, Contextual Hyperedge-Replacement
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-137921 (URN)
Conference
13th International Workshop on Tree-Adjoining Grammar and Related Formalisms (TAG+13), Umeå, Sweden, September 4-6, 2017
Available from: 2017-07-31 Created: 2017-07-31 Last updated: 2021-05-11 Bibliographically approved
6. Polynomial Graph Parsing with Non-Structural Reentrancies
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Graph-based semantic representations are valuable in natural language processing, where it is often simple and effective to represent linguistic concepts as nodes, and relations as edges between them. Several attempts have been made to find a generative device that is sufficiently powerful to represent languages of semantic graphs, while at the same time allowing efficient parsing. We add to this line of work by introducing graph extension grammar, which consists of an algebra over graphs together with a regular tree grammar that generates expressions over the operations of the algebra. Due to the design of the operations, these grammars can generate graphs with non-structural reentrancies, a type of node-sharing that is exceedingly common in formalisms such as abstract meaning representation, but for which existing devices offer little support. We provide a parsing algorithm for graph extension grammars, which is proved to be correct and to run in polynomial time.

National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-182986 (URN)
Available from: 2021-05-11 Created: 2021-05-11 Last updated: 2021-05-11

Open Access in DiVA

fulltext (642 kB), 390 downloads
File information
File name: FULLTEXT02.pdf; File size: 642 kB; Checksum: SHA-512
11398b1bfa0f08ac1640b63841e9b1877acaf2105d656ba84dda13e529a188fe7ae73e4faa512f4abb5e91839b4e56aec539ac18d8f5da6e581f3b85b3b000a7
Type: fulltext; Mimetype: application/pdf
spikblad (142 kB), 78 downloads
File information
File name: SPIKBLAD01.pdf; File size: 142 kB; Checksum: SHA-512
d224942765f1857c859ac7db46aaa4a2aca55b260782e2c548f80b3fba518d647ee77a1c8825f60afcf81747cd0c495977a800ca1454f247f6309c0b1cb0879b
Type: spikblad; Mimetype: application/pdf
omslag (106 kB), 64 downloads
File information
File name: COVER01.pdf; File size: 106 kB; Checksum: SHA-512
9f0ce8cb4611bcdd9ce6463fae3cf5ca98e20b61941080dc585e361b76a470bff9d76febf2c6ed3627272a922f83200162d1456f497addd976b25524ea2c8af8
Type: cover; Mimetype: application/pdf

Authority records

Jonsson, Anna
