umu.sePublications
Change search
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf
Derivatives of regular expressions with cuts
Umeå University, Faculty of Science and Technology, Department of Computing Science. (Foundations of Language Processing)
2017 (English)Report (Other academic)
Resource type
Text
Abstract [en]

Derivatives of regular expressions are an operation which for a given expression produces an expression for what remains after a specific symbol has been read. This can be used as a step in transforming an expression into a finite string automaton. Cuts are an extension of the ordinary regular expressions; the cut operator is essentially a concatenation without backtracking, formalising a behaviour found in many programming languages. Just as for concatenation, we can also define an iterated cut operator. We show and derive expressions for the derivatives of regular expressions with cuts and iterated cuts.

Place, publisher, year, edition, pages
2017. , p. 7
Series
Report / UMINF, ISSN 0348-0542 ; 17.03
Keywords [en]
regular expression, derivative, cut expression
National Category
Language Technology (Computational Linguistics)
Research subject
computational linguistics
Identifiers
URN: urn:nbn:se:umu:diva-138916OAI: oai:DiVA.org:umu-138916DiVA, id: diva2:1137968
Available from: 2017-09-03 Created: 2017-09-03 Last updated: 2018-06-09Bibliographically approved
In thesis
1. A novel approach to text classification
Open this publication in new window or tab >>A novel approach to text classification
2017 (English)Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This thesis explores the foundations of text classification, using both empirical and deductive methods, with a focus on author identification and syntactic methods. We strive for a thorough theoretical understanding of what affects the effectiveness of classification in general. 

To begin with, we systematically investigate the effects of some parameters on the accuracy of author identification. How is the accuracy affected by the number of candidate authors, and the amount of data per candidate? Are there differences in how methods react to the changes in parameters? Using the same techniques, we see indications that methods previously thought to be topic-independent might not be so, but that syntactic methods may be the best option for avoiding topic dependence. This means that previous studies may have overestimated the power of lexical methods. We also briefly look for ways of spotting which particular features might be the most effective for classification. Apart from author identification, we apply similar methods to identifying properties of the author, including age and gender, and attempt to estimate the number of distinct authors in a text sample. In all cases, the techniques are proven viable if not overwhelmingly accurate, and we see that lexical and syntactic methods give very similar results. 

In the final parts, we see some results of automata theory that can be of use for syntactic analysis and classification. First, we generalise a known algorithm for finding a list of the best-ranked strings according to a weighted automaton, to doing the same with trees and a tree automaton. This result can be of use for speeding up parsing, which often runs in several steps, where each step needs several trees from the previous as input. Second, we use a compressed version of deterministic finite automata, known as failure automata, and prove that finding the optimal compression is NP-complete, but that there are efficient algorithms for finding good approximations. Third, we find and prove the derivatives of regular expressions with cuts. Derivatives are an operation on expressions to calculate the remaining expression after reading a given symbol, and cuts are an extension to regular expressions found in many programming languages. Together, these findings may be able to improve on the syntactic analysis which we have seen is a valuable tool for text classification.

Place, publisher, year, edition, pages
Umeå: Umeå universitet, 2017. p. 176
Series
Report / UMINF, ISSN 0348-0542 ; 17.16
Keywords
Text classification, natural language processing, automata
National Category
Language Technology (Computational Linguistics) Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-138917 (URN)978-91-7601-740-1 (ISBN)
Public defence
2017-09-29, N430, Naturvetarhuset, Umeå, 13:00 (English)
Opponent
Supervisors
Available from: 2017-09-04 Created: 2017-09-03 Last updated: 2018-06-09Bibliographically approved

Open Access in DiVA

fulltext(165 kB)128 downloads
File information
File name FULLTEXT01.pdfFile size 165 kBChecksum SHA-512
ff23c929cfc72998f3b503fc721c91606722a7c03d562192f741370c9226eaef0f6ad41b2c3d19910b246b5ec19cb4caf0bc8add0a575abe8afa23b7f2195c48
Type fulltextMimetype application/pdf

Authority records BETA

Zechner, Niklas

Search in DiVA

By author/editor
Zechner, Niklas
By organisation
Department of Computing Science
Language Technology (Computational Linguistics)

Search outside of DiVA

GoogleGoogle Scholar
Total: 128 downloads
The number of downloads is the sum of all downloads of full texts. It may include eg previous versions that are now no longer available

urn-nbn

Altmetric score

urn-nbn
Total: 363 hits
CiteExportLink to record
Permanent link

Direct link
Cite
Citation style
  • apa
  • ieee
  • modern-language-association-8th-edition
  • vancouver
  • Other style
More styles
Language
  • de-DE
  • en-GB
  • en-US
  • fi-FI
  • nn-NO
  • nn-NB
  • sv-SE
  • Other locale
More languages
Output format
  • html
  • text
  • asciidoc
  • rtf