Zechner, Niklas
Publications (8 of 8)
Björklund, J., Drewes, F. & Zechner, N. (2019). Efficient enumeration of weighted tree languages over the tropical semiring. Journal of computer and system sciences (Print), 104, 119-130
2019 (English) In: Journal of computer and system sciences (Print), ISSN 0022-0000, E-ISSN 1090-2724, Vol. 104, p. 119-130. Article in journal (Refereed) Published
Abstract [en]

We generalise a search algorithm by Mohri and Riley from strings to trees. The original algorithm takes as input a nondeterministic weighted automaton M over the tropical semiring and an integer N, and outputs N strings of minimal weight with respect to M. In our setting, M is a weighted tree automaton, again over the tropical semiring, and the output is a set of N trees with minimal weight in this language. We prove that the algorithm is correct, and that its time complexity is a low polynomial in N and the relevant size parameters of M. 
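
To make the N-best setting concrete, here is a minimal sketch of the simpler string case only. It is not the authors' tree algorithm, and it sidesteps the nondeterminism handled by Mohri and Riley by assuming a deterministic automaton with non-negative weights, so that each string has exactly one path. The function name `n_best_strings` and the dictionary encoding of the automaton are assumptions made for illustration.

```python
import heapq

def n_best_strings(trans, start, final, n):
    """Return up to n (string, weight) pairs of minimal weight.

    Assumptions (illustrative, not from the paper):
    - trans maps state -> {symbol: (next_state, weight)}, i.e. deterministic
    - weights are non-negative and are added along a path (tropical semiring)
    - final states carry final weight 0
    """
    # Heap entries: (weight so far, string length, string, state).
    # The length breaks ties so zero-weight cycles cannot starve other strings.
    heap = [(0.0, 0, "", start)]
    best = []
    while heap and len(best) < n:
        w, _, s, q = heapq.heappop(heap)
        if q in final:
            best.append((s, w))
        for a, (r, c) in trans.get(q, {}).items():
            heapq.heappush(heap, (w + c, len(s) + 1, s + a, r))
    return best

# Hypothetical two-state automaton over {a, b}; state 1 is final.
example = {0: {"a": (1, 1.0), "b": (0, 2.0)}, 1: {"a": (1, 3.0)}}
print(n_best_strings(example, 0, {1}, 3))
```

Because weights never decrease along a path, strings leave the priority queue in order of non-decreasing weight, which is what makes the best-first search sufficient in this restricted case.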

Publisher
p. 78
Keywords
weighted tree automaton, N-best analysis, tropical semiring
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-132963 (URN); 10.1016/j.jcss.2017.03.006 (DOI); 000472246300008 (); 2-s2.0-85048737102 (Scopus ID)
Available from: 2017-03-27 Created: 2017-03-27 Last updated: 2023-03-23. Bibliographically approved
Zechner, N. (2017). A novel approach to text classification. (Doctoral dissertation). Umeå: Umeå universitet
2017 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

This thesis explores the foundations of text classification, using both empirical and deductive methods, with a focus on author identification and syntactic methods. We strive for a thorough theoretical understanding of what affects the effectiveness of classification in general. 

To begin with, we systematically investigate the effects of some parameters on the accuracy of author identification. How is the accuracy affected by the number of candidate authors, and the amount of data per candidate? Are there differences in how methods react to the changes in parameters? Using the same techniques, we see indications that methods previously thought to be topic-independent might not be so, but that syntactic methods may be the best option for avoiding topic dependence. This means that previous studies may have overestimated the power of lexical methods. We also briefly look for ways of spotting which particular features might be the most effective for classification. Apart from author identification, we apply similar methods to identifying properties of the author, including age and gender, and attempt to estimate the number of distinct authors in a text sample. In all cases, the techniques are proven viable if not overwhelmingly accurate, and we see that lexical and syntactic methods give very similar results. 

In the final parts, we see some results of automata theory that can be of use for syntactic analysis and classification. First, we generalise a known algorithm for finding a list of the best-ranked strings according to a weighted automaton, to doing the same with trees and a tree automaton. This result can be of use for speeding up parsing, which often runs in several steps, where each step needs several trees from the previous as input. Second, we use a compressed version of deterministic finite automata, known as failure automata, and prove that finding the optimal compression is NP-complete, but that there are efficient algorithms for finding good approximations. Third, we find and prove the derivatives of regular expressions with cuts. Derivatives are an operation on expressions to calculate the remaining expression after reading a given symbol, and cuts are an extension to regular expressions found in many programming languages. Together, these findings may be able to improve on the syntactic analysis which we have seen is a valuable tool for text classification.

Place, publisher, year, edition, pages
Umeå: Umeå universitet, 2017. p. 176
Series
Report / UMINF, ISSN 0348-0542 ; 17.16
Keywords
Text classification, natural language processing, automata
National Category
Natural Language Processing; Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-138917 (URN); 978-91-7601-740-1 (ISBN)
Public defence
2017-09-29, N430, Naturvetarhuset, Umeå, 13:00 (English)
Available from: 2017-09-04 Created: 2017-09-03 Last updated: 2025-02-01. Bibliographically approved
Zechner, N. (2017). Derivatives of regular expressions with cuts.
2017 (English) Report (Other academic)
Abstract [en]

Derivatives of regular expressions are an operation which for a given expression produces an expression for what remains after a specific symbol has been read. This can be used as a step in transforming an expression into a finite string automaton. Cuts are an extension of the ordinary regular expressions; the cut operator is essentially a concatenation without backtracking, formalising a behaviour found in many programming languages. Just as for concatenation, we can also define an iterated cut operator. We show and derive expressions for the derivatives of regular expressions with cuts and iterated cuts.
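
As an illustration of the derivative operation on ordinary regular expressions, the sketch below implements the classical (Brzozowski-style) derivative in Python. It does not cover the cut or iterated cut operators that are the subject of the report, and the class and function names (`Sym`, `Cat`, `deriv`, `matches`, and so on) are assumptions chosen for this example.

```python
from dataclasses import dataclass

# A tiny regular-expression syntax tree (names are illustrative).
@dataclass(frozen=True)
class Empty:          # matches no string
    pass

@dataclass(frozen=True)
class Eps:            # matches only the empty string
    pass

@dataclass(frozen=True)
class Sym:            # matches the single symbol c
    c: str

@dataclass(frozen=True)
class Alt:            # l | r
    l: object
    r: object

@dataclass(frozen=True)
class Cat:            # l followed by r
    l: object
    r: object

@dataclass(frozen=True)
class Star:           # zero or more repetitions of e
    e: object

def nullable(e):
    """True if e accepts the empty string."""
    if isinstance(e, (Eps, Star)):
        return True
    if isinstance(e, (Empty, Sym)):
        return False
    if isinstance(e, Alt):
        return nullable(e.l) or nullable(e.r)
    return nullable(e.l) and nullable(e.r)  # Cat

def deriv(e, a):
    """Expression for what remains of e after reading symbol a."""
    if isinstance(e, (Empty, Eps)):
        return Empty()
    if isinstance(e, Sym):
        return Eps() if e.c == a else Empty()
    if isinstance(e, Alt):
        return Alt(deriv(e.l, a), deriv(e.r, a))
    if isinstance(e, Star):
        return Cat(deriv(e.e, a), e)
    # Cat: a is read by the left part, or by the right part if the left is nullable
    first = Cat(deriv(e.l, a), e.r)
    return Alt(first, deriv(e.r, a)) if nullable(e.l) else first

def matches(e, s):
    """Match by taking one derivative per input symbol."""
    for a in s:
        e = deriv(e, a)
    return nullable(e)

ab_star = Star(Cat(Sym("a"), Sym("b")))   # (ab)*
print(matches(ab_star, "abab"), matches(ab_star, "aba"))
```

Repeatedly taking derivatives and treating each distinct resulting expression as a state is the step that turns an expression into a finite string automaton; the sketch skips the simplification of expressions that such a construction needs in order to keep the state set finite.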

Publisher
p. 7
Series
Report / UMINF, ISSN 0348-0542 ; 17.03
Keywords
regular expression, derivative, cut expression
National Category
Natural Language Processing
Research subject
computational linguistics
Identifiers
urn:nbn:se:umu:diva-138916 (URN)
Available from: 2017-09-03 Created: 2017-09-03 Last updated: 2025-02-07. Bibliographically approved
Björklund, J. & Zechner, N. (2017). Syntactic methods for topic-independent authorship attribution. Natural Language Engineering, 23(5), 789-806
2017 (English) In: Natural Language Engineering, ISSN 1351-3249, E-ISSN 1469-8110, Vol. 23, no 5, p. 789-806. Article in journal (Refereed) Published
Abstract [en]

The efficacy of syntactic features for topic-independent authorship attribution is evaluated, taking a feature set of frequencies of words and punctuation marks as baseline. The features are 'deep' in the sense that they are derived by parsing the subject texts, in contrast to 'shallow' syntactic features for which a part-of-speech analysis is enough. The experiments are made on two corpora of online texts and one corpus of novels written around the year 1900. The classification tasks include classical closed-world authorship attribution, identification of separate texts among the works of one author, and cross-topic authorship attribution. In the first tasks, the feature sets were fairly evenly matched, but for the last task, the syntax-based feature set outperformed the baseline feature set. These results suggest that, compared to lexical features, syntactic features are more robust to changes in topic.

Place, publisher, year, edition, pages
Cambridge University Press, 2017
National Category
Natural Language Processing
Identifiers
urn:nbn:se:umu:diva-139621 (URN); 10.1017/S1351324917000249 (DOI); 000407573100006 (); 2-s2.0-85028053319 (Scopus ID)
Available from: 2017-10-04 Created: 2017-10-04 Last updated: 2025-02-07. Bibliographically approved
Björklund, J. & Zechner, N. (2016). My name is legion: estimating author counts based on stylistic diversity. In: Brynielsson J., Johansson F. (Ed.), 2016 European intelligence and security informatics conference (EISIC). Paper presented at Conference on European Intelligence and Security Informatics Conference (EISIC), AUG 17-19, 2016, Uppsala, Sweden (pp. 108-111). IEEE
2016 (English) In: 2016 European intelligence and security informatics conference (EISIC) / [ed] Brynielsson J., Johansson F., IEEE, 2016, p. 108-111. Conference paper, Published paper (Refereed)
Abstract [en]

Online propaganda is a growing concern. Fraudulent users write under multiple signatures to give the impression that the opinions they promote are more widespread than they really are, or held by a different demography. The problem as such is not new, but it is becoming increasingly organised and therefore has effects on a larger scale. In this work, we develop methods for assessing the true number of authors of a body of work, to detect artificially inflated user sets. The assessments are based on stylistic richness, here measured as the number of unique features (e.g., words or syntactic fragments) divided by the sum of all features. Initial results suggest that the order of magnitude can be reliably estimated. It is, for example, possible to differentiate the works of hundreds and thousands of writers.
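
The richness measure described in the abstract (unique features divided by the total number of feature occurrences) can be sketched in a few lines of Python. Using whitespace-separated words as the features, and the function name `stylistic_richness`, are assumptions made for this illustration; the paper also considers syntactic fragments, and how the measure is turned into an author-count estimate is not shown here.

```python
from collections import Counter

def stylistic_richness(features):
    """Number of distinct features divided by the total number of feature occurrences."""
    counts = Counter(features)
    total = sum(counts.values())
    return len(counts) / total if total else 0.0

# Words as features: 4 distinct words out of 5 tokens.
print(stylistic_richness("the cat saw the dog".split()))
```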

Place, publisher, year, edition, pages
IEEE, 2016
Series
European Intelligence and Security Informatics Conference, ISSN 2572-3723
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-140667 (URN); 10.1109/EISIC.2016.028 (DOI); 000411272300018 (); 2-s2.0-85017207880 (Scopus ID); 978-1-5090-2857-3 (ISBN)
Conference
Conference on European Intelligence and Security Informatics Conference (EISIC), AUG 17-19, 2016, Uppsala, Sweden
Available from: 2017-10-16 Created: 2017-10-16 Last updated: 2023-03-23. Bibliographically approved
Björklund, J., Drewes, F. & Zechner, N. (2015). An efficient best-trees algorithm for weighted tree automata over the tropical semiring. In: Adrian-Horia Dediu, Enrico Formenti, Carlos Martín-Vide, and Bianca Truthe (Ed.), Proc. 9th International Conference on Language and Automata Theory and Applications. Paper presented at 9th International Conference on Language and Automata Theory and Applications, LATA 2015, Nice, France, March 2-6, 2015 (pp. 97-108). Springer
2015 (English) In: Proc. 9th International Conference on Language and Automata Theory and Applications / [ed] Adrian-Horia Dediu, Enrico Formenti, Carlos Martín-Vide, and Bianca Truthe, Springer, 2015, p. 97-108. Conference paper, Published paper (Refereed)
Abstract [en]

We generalise a search algorithm by Mohri and Riley from strings to trees. The original algorithm takes as input a weighted automaton M over the tropical semiring, together with an integer N, and outputs N strings of minimal weight with respect to M. In our setting, M defines a weighted tree language, again over the tropical semiring, and the output is a set of N trees with minimal weight. We prove that the algorithm is correct, and that its time complexity is a low polynomial in N and the relevant size parameters of M.

Place, publisher, year, edition, pages
Springer, 2015
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 8977
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-96680 (URN); 10.1007/978-3-319-15579-1_7 (DOI); 2-s2.0-84928788022 (Scopus ID)
Conference
9th International Conference on Language and Automata Theory and Applications, LATA 2015, Nice, France, March 2-6, 2015
Projects
MICO: Media in Context
Funder
EU, FP7, Seventh Framework Programme, 610480
Available from: 2014-11-25 Created: 2014-11-25 Last updated: 2023-03-23. Bibliographically approved
Björklund, H., Björklund, J. & Zechner, N. (2014). Compression of finite-state automata through failure transitions. Theoretical Computer Science, 557, 87-100
2014 (English) In: Theoretical Computer Science, ISSN 0304-3975, E-ISSN 1879-2294, Vol. 557, p. 87-100. Article in journal (Refereed) Published
Abstract [en]

Several linear-time algorithms for automata-based pattern matching rely on failure transitions for efficient back-tracking. Like epsilon transitions, failure transitions do not consume input symbols, but unlike them, they may only be taken when no other transition is applicable. At a semantic level, this conveniently models catch-all clauses and allows for compact language representation.

This work investigates the transition-reduction problem for deterministic finite-state automata (DFA). The input is a DFA A and an integer k. The question is whether k or more transitions can be saved by replacing regular transitions with failure transitions. We show that while the problem is NP-complete, there are approximation techniques and heuristics that mitigate the computational complexity. We conclude by demonstrating the computational difficulty of two related minimisation problems, thereby cancelling the ongoing search for efficient algorithms.
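
The behaviour of failure transitions described in this abstract can be illustrated with a small Python sketch. The dictionary encoding, the function names `step` and `accepts`, and the example automaton are assumptions made for this illustration; the sketch only shows how a failure automaton is run, not how transitions are chosen for replacement.

```python
def step(delta, fail, q, a):
    """Read symbol a from state q.

    A failure transition is followed only when q has no regular
    transition on a, and following it consumes no input.
    Assumes the failure transitions contain no cycles.
    """
    while (q, a) not in delta:
        if q not in fail:
            return None          # no applicable transition at all
        q = fail[q]
    return delta[(q, a)]

def accepts(delta, fail, start, final, word):
    q = start
    for a in word:
        q = step(delta, fail, q, a)
        if q is None:
            return False
    return q in final

# Hypothetical automaton over {a, b, c} accepting words that end in "ab".
# As a plain DFA it needs 3 states x 3 symbols = 9 transitions; here states
# 1 and 2 fall back to state 0, leaving 4 regular and 2 failure transitions.
delta = {(0, "a"): 1, (0, "b"): 0, (0, "c"): 0, (1, "b"): 2}
fail = {1: 0, 2: 0}

print(accepts(delta, fail, 0, {2}, "cab"), accepts(delta, fail, 0, {2}, "abc"))
```

In this example state 0 plays the role of the catch-all clause: any symbol that states 1 and 2 do not handle themselves is delegated to it.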

Place, publisher, year, edition, pages
Elsevier, 2014
Keywords
failure automata, pattern matching, automata minimisation
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:umu:diva-93329 (URN); 10.1016/j.tcs.2014.09.007 (DOI); 000343784800008 (); 2-s2.0-84926428941 (Scopus ID)
Funder
Swedish Research Council, 621-2011-6080
Available from: 2014-09-17 Created: 2014-09-17 Last updated: 2023-03-24. Bibliographically approved
Zechner, N. & Lingas, A. (2014). Efficient algorithms for subgraph listing. Algorithms, 7(2), 243-252
2014 (English) In: Algorithms, E-ISSN 1999-4893, Vol. 7, no 2, p. 243-252. Article in journal (Refereed) Published
Abstract [en]

Subgraph isomorphism is a fundamental problem in graph theory. In this paper we focus on listing subgraphs isomorphic to a given pattern graph. First, we look at the algorithm due to Chiba and Nishizeki for listing complete subgraphs of fixed size, and show that it cannot be extended to general subgraphs of fixed size. Then, we consider the algorithm due to Gąsieniec et al. for finding multiple witnesses of a Boolean matrix product, and use it to design a new output-sensitive algorithm for listing all triangles in a graph. As a corollary, we obtain an output-sensitive algorithm for listing subgraphs and induced subgraphs isomorphic to an arbitrary fixed pattern graph.
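
For readers unfamiliar with the triangle-listing task, the following Python sketch lists each triangle of an undirected graph exactly once. It is a simple degree-ordering approach, not the output-sensitive algorithm of the paper (which builds on witnesses of Boolean matrix products); the function name `list_triangles` and the edge-list input format are assumptions made for illustration.

```python
def list_triangles(edges):
    """List every triangle of an undirected graph once, as a sorted vertex triple."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue                      # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Rank vertices by (degree, label) and orient each edge towards the higher rank.
    order = sorted(adj, key=lambda v: (len(adj[v]), v))
    rank = {v: i for i, v in enumerate(order)}
    out = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}
    triangles = []
    for u in adj:
        for v in out[u]:
            # A common higher-ranked neighbour of u and v closes a triangle.
            for w in out[u] & out[v]:
                triangles.append(tuple(sorted((u, v, w))))
    return sorted(triangles)

# Hypothetical graph: a 4-clique on {1, 2, 3, 4} with the edge (1, 4) removed.
print(list_triangles([(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]))
```

Orienting the edges means each triangle is discovered only from its lowest-ranked vertex, so no duplicate filtering is needed.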

Place, publisher, year, edition, pages
MDPI, 2014
Keywords
Clique, Output-sensitive algorithm, Subgraph, Subgraph isomorphism, Subgraph listing, Time complexity, Triangle
National Category
Computer Sciences
Identifiers
urn:nbn:se:umu:diva-211926 (URN); 10.3390/a7020243 (DOI); 000214154600006 (); 2-s2.0-84907096435 (Scopus ID)
Available from: 2023-07-12 Created: 2023-07-12 Last updated: 2023-07-12. Bibliographically approved