Syntaxbaserad författarigenkänning (Syntax-based author recognition)
Umeå University, Faculty of Science and Technology, Department of Computing Science.
2010 (Swedish). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis.
Abstract [en]

The writing style of a particular author can be divided into many subfeatures, for example use of words, language and syntax. Focusing on the last of these, this study aims to show how well syntactic information alone can attribute the correct author to a document. Syntactic information is defined as overlapping syntactic subtrees of height one (1) for all sentences of all included documents. The performance is compared to that of the previously very successful method of comparing stop word frequencies. Stop words are words normally excluded from search engine queries, because they are present in all sorts of texts regardless of topic. However, this property is a positive feature when it comes to authorship attribution, as it allows for context-free comparisons of texts. Training and test data are obtained from the ICWSM 2009 corpus, which contains some 200 gigabytes of blog posts and news articles. This data is automatically filtered to create a reasonably large collection (about 250,000 documents) that remains manageable by an automatic natural language parser (Stanford NLP) within the constraints of time. The filtering process guarantees that every text used for comparison has texts by the same author within the training portion of the data. Indexing and searching are done using Latent Semantic Indexing (LSI). Each document is represented by a vector in a multidimensional space, and together these vectors form a matrix of document vectors. Search documents are then matched with those in the matrix by calculating the angles between document vectors, returning those with the smallest angular difference to the query document. The process of creating a document matrix and search documents is repeated multiple times, creating a new document matrix of randomly selected authors every time. The performance of the different methods is measured by comparing average scores for each created document matrix.
The results show that, on average, syntactic information is more successful at correct authorship recognition than both chance and stop word frequency analysis. These results hold for all tested numbers of authors present within the index matrix, ranging from ten to one hundred unique authors.
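
The pipeline described in the abstract (height-one subtrees as features, an LSI index, and matching by angle) can be illustrated with a minimal sketch. It is not the thesis implementation: the function names, the toy parse trees, the choice to skip word-level leaves, and the use of a plain truncated SVD from NumPy are all assumptions made for illustration. Input is assumed to be Penn-style bracketed parses such as those a constituency parser can emit.

```python
from collections import Counter
import numpy as np


def parse_bracketed(text):
    """Parse a Penn-style bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read_node()


def height_one_subtrees(tree):
    """Yield 'PARENT -> CHILD ...' for each node and its immediate child nodes.

    Assumption: words are skipped, so only syntactic labels are used.
    """
    label, children = tree
    kids = [c for c in children if isinstance(c, tuple)]
    if kids:
        yield label + " -> " + " ".join(k[0] for k in kids)
        for k in kids:
            yield from height_one_subtrees(k)


def rule_counts(parsed_sentences):
    """Count height-one subtrees over all sentences of one document."""
    counts = Counter()
    for sentence in parsed_sentences:
        counts.update(height_one_subtrees(parse_bracketed(sentence)))
    return counts


def to_vector(counts, vocab):
    """Turn a Counter into a dense vector over a fixed feature vocabulary."""
    return np.array([counts.get(rule, 0) for rule in vocab], dtype=float)


def lsi_fit(matrix, k):
    """Truncated SVD of a feature-by-document matrix; returns (U_k, doc_vecs)."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))
    return u[:, :k], vt[:k, :].T * s[:k]   # one row per document


def rank_by_angle(doc_vecs, query_vec):
    """Return document indices ordered by increasing angle to the query."""
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    sims = doc_vecs @ query_vec / np.where(norms == 0, 1.0, norms)
    return np.argsort(-sims)


# Hypothetical toy data: one training document per author.
train = {
    "author_a": ["(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))",
                 "(S (NP (DT a) (NN dog)) (VP (VBZ barks)))"],
    "author_b": ["(S (VP (VB run)))",
                 "(S (VP (VB stop) (ADVP (RB now))))"],
}
authors = list(train)
counts = [rule_counts(train[a]) for a in authors]
vocab = sorted(set().union(*counts))
matrix = np.column_stack([to_vector(c, vocab) for c in counts])

u_k, doc_vecs = lsi_fit(matrix, k=2)
query = to_vector(rule_counts(["(S (NP (DT the) (NN bird)) (VP (VBZ sings)))"]), vocab)
ranking = rank_by_angle(doc_vecs, query @ u_k)   # project query into the latent space
best_author = authors[ranking[0]]
```

The query is projected with `query @ u_k`, which places it in the same scaled latent space as the training documents; other fold-in conventions exist. Swapping `height_one_subtrees` for a stop word counter would give the baseline the abstract compares against, with the rest of the sketch unchanged.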

Place, publisher, year, edition, pages
UMNAD, 853
National Category
Computer Science
URN: urn:nbn:se:umu:diva-38112
OAI: diva2:372288
Educational program
Bachelor of Science Programme in Computing Science
Available from: 2010-11-25. Created: 2010-11-25. Last updated: 2012-06-19. Bibliographically approved.