umu.sePublikationer
Ändra sökning
RefereraExporteraLänk till posten
Permanent länk

Direktlänk
Referera
Referensformat
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Annat format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annat språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf
Efficient enumeration of weighted tree languages over the tropical semiring
Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. (Foundations of Language Processing)
Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. (Foundations of Language Processing)ORCID-id: 0000-0001-7349-7693
Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap. (Foundations of Language Processing)
2019 (Engelska)Ingår i: Journal of computer and system sciences (Print), ISSN 0022-0000, E-ISSN 1090-2724, Vol. 104, s. 78s. 119-130Artikel i tidskrift (Refereegranskat) Published
Abstract [en]

We generalise a search algorithm by Mohri and Riley from strings to trees. The original algorithm takes as input a nondeterministic weighted automaton M over the tropical semiring and an integer N, and outputs N strings of minimal weight with respect to M. In our setting, M is a weighted tree automaton, again over the tropical semiring, and the output is a set of N trees with minimal weight in this language. We prove that the algorithm is correct, and that its time complexity is a low polynomial in N and the relevant size parameters of M. 

Ort, förlag, år, upplaga, sidor
2019. Vol. 104, s. 78s. 119-130
Nyckelord [en]
weighted tree automaton, N-best analysis, tropical semiring
Nationell ämneskategori
Datavetenskap (datalogi)
Forskningsämne
datalogi
Identifikatorer
URN: urn:nbn:se:umu:diva-132963DOI: 10.1016/j.jcss.2017.03.006ISI: 000472246300008OAI: oai:DiVA.org:umu-132963DiVA, id: diva2:1084722
Tillgänglig från: 2017-03-27 Skapad: 2017-03-27 Senast uppdaterad: 2019-07-12Bibliografiskt granskad
Ingår i avhandling
1. A novel approach to text classification
Öppna denna publikation i ny flik eller fönster >>A novel approach to text classification
2017 (Engelska)Doktorsavhandling, sammanläggning (Övrigt vetenskapligt)
Abstract [en]

This thesis explores the foundations of text classification, using both empirical and deductive methods, with a focus on author identification and syntactic methods. We strive for a thorough theoretical understanding of what affects the effectiveness of classification in general. 

To begin with, we systematically investigate the effects of some parameters on the accuracy of author identification. How is the accuracy affected by the number of candidate authors, and the amount of data per candidate? Are there differences in how methods react to the changes in parameters? Using the same techniques, we see indications that methods previously thought to be topic-independent might not be so, but that syntactic methods may be the best option for avoiding topic dependence. This means that previous studies may have overestimated the power of lexical methods. We also briefly look for ways of spotting which particular features might be the most effective for classification. Apart from author identification, we apply similar methods to identifying properties of the author, including age and gender, and attempt to estimate the number of distinct authors in a text sample. In all cases, the techniques are proven viable if not overwhelmingly accurate, and we see that lexical and syntactic methods give very similar results. 

In the final parts, we see some results of automata theory that can be of use for syntactic analysis and classification. First, we generalise a known algorithm for finding a list of the best-ranked strings according to a weighted automaton, to doing the same with trees and a tree automaton. This result can be of use for speeding up parsing, which often runs in several steps, where each step needs several trees from the previous as input. Second, we use a compressed version of deterministic finite automata, known as failure automata, and prove that finding the optimal compression is NP-complete, but that there are efficient algorithms for finding good approximations. Third, we find and prove the derivatives of regular expressions with cuts. Derivatives are an operation on expressions to calculate the remaining expression after reading a given symbol, and cuts are an extension to regular expressions found in many programming languages. Together, these findings may be able to improve on the syntactic analysis which we have seen is a valuable tool for text classification.

Ort, förlag, år, upplaga, sidor
Umeå: Umeå universitet, 2017. s. 176
Serie
Report / UMINF, ISSN 0348-0542 ; 17.16
Nyckelord
Text classification, natural language processing, automata
Nationell ämneskategori
Språkteknologi (språkvetenskaplig databehandling) Datavetenskap (datalogi)
Forskningsämne
datalogi
Identifikatorer
urn:nbn:se:umu:diva-138917 (URN)978-91-7601-740-1 (ISBN)
Disputation
2017-09-29, N430, Naturvetarhuset, Umeå, 13:00 (Engelska)
Opponent
Handledare
Tillgänglig från: 2017-09-04 Skapad: 2017-09-03 Senast uppdaterad: 2018-06-09Bibliografiskt granskad

Open Access i DiVA

Fulltext saknas i DiVA

Övriga länkar

Förlagets fulltext

Personposter BETA

Björklund, JohannaDrewes, FrankZechner, Niklas

Sök vidare i DiVA

Av författaren/redaktören
Björklund, JohannaDrewes, FrankZechner, Niklas
Av organisationen
Institutionen för datavetenskap
I samma tidskrift
Journal of computer and system sciences (Print)
Datavetenskap (datalogi)

Sök vidare utanför DiVA

GoogleGoogle Scholar

doi
urn-nbn

Altmetricpoäng

doi
urn-nbn
Totalt: 907 träffar
RefereraExporteraLänk till posten
Permanent länk

Direktlänk
Referera
Referensformat
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Annat format
Fler format
Språk
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Annat språk
Fler språk
Utmatningsformat
  • html
  • text
  • asciidoc
  • rtf