Syntaxbaserad författarigenkänning [Syntax-based author recognition]
Umeå University, Faculty of Science and Technology, Department of Computing Science.
2010 (Swedish). Independent thesis, Basic level (degree of Bachelor), 10 credits / 15 HE credits. Student thesis.
Abstract [en]

The writing style of a particular author can be divided into many subfeatures, for example word choice, language, and syntax. Focusing on the last of these, this study aims to show how well syntactic information alone can attribute a document to its correct author. Syntactic information is defined as the overlapping syntactic subtrees of height one (1) of all sentences in all included documents. Performance is compared with that of a previously very successful method, the comparison of stop word frequencies. Stop words are words normally excluded from search engine queries because they occur in all sorts of texts regardless of topic. This property is an advantage in authorship attribution, however, as it allows context-free comparisons of texts. Training and test data are obtained from the ICWSM 2009 corpus, which contains some 200 gigabytes of blog posts and news articles. This data is automatically filtered to create a reasonably large collection (about 250,000 documents) that remains manageable by an automatic natural language parser (Stanford NLP) within the time constraints. The filtering process guarantees that every text used for comparison has texts by the same author within the training portion of the data. Indexing and searching are done using Latent Semantic Indexing (LSI). Each document is represented by a vector in a multidimensional space, so the collection forms a matrix of document vectors. Search documents are then matched against those in the matrix by calculating the angles between document vectors and returning the documents with the smallest angular difference to the query document. The process of creating a document matrix and search documents is repeated multiple times, with a new document matrix of randomly selected authors created each time. The performance of the different methods is measured by comparing average scores across the created document matrices.
The results show that, on average, syntactic information is more successful at correct authorship recognition than both chance and stop word frequency analysis. These results hold for all tested numbers of authors present in the index matrix, ranging from ten to one hundred unique authors.
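
As a rough illustration of the feature extraction and angle-based matching described in the abstract, the following minimal sketch may help. It is not the thesis implementation. The nested-tuple tree format, the function names, and the two toy sentences are all hypothetical, and skipping tag-over-word nodes (so that no words enter the features) is an assumption. The dimensionality reduction step of LSI is omitted, so the cosine similarity here is computed directly on raw subtree counts.

```python
from collections import Counter
from math import sqrt

# Assumed representation: a parse tree is a nested tuple
# (label, child, child, ...); a leaf word is a plain string.

def height_one_subtrees(tree):
    """Yield one 'PARENT -> CHILD ...' string per internal node."""
    label, *children = tree
    phrases = [c for c in children if isinstance(c, tuple)]
    if not phrases:
        return  # tag directly over a word: skipped (an assumption)
    yield f"{label} -> {' '.join(c[0] for c in phrases)}"
    for child in phrases:
        yield from height_one_subtrees(child)

def profile(trees):
    """Count height-one subtrees over all sentences of a document."""
    return Counter(r for t in trees for r in height_one_subtrees(t))

def cosine(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_authors(query, indexed):
    """Sort indexed author names by decreasing similarity to the query."""
    return sorted(indexed, key=lambda name: cosine(query, indexed[name]), reverse=True)

# Toy parse trees (hypothetical, hand-written rather than parser output).
t1 = ("S", ("NP", ("DT", "the"), ("NN", "dog")), ("VP", ("VBZ", "barks")))
t2 = ("S", ("NP", ("PRP", "it")),
      ("VP", ("VBZ", "sees"), ("NP", ("DT", "a"), ("NN", "cat"))))

indexed = {"author_a": profile([t1]), "author_b": profile([t2])}
ranking = rank_authors(profile([t1]), indexed)
```

The smallest angle corresponds to the largest cosine, which is why the ranking sorts in descending order. A fuller version would apply a truncated singular value decomposition to the document matrix before comparing vectors, as LSI does.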

Place, publisher, year, edition, pages
2010.
Series
UMNAD, 853
National Category
Computer Science
Identifiers
URN: urn:nbn:se:umu:diva-38112
OAI: oai:DiVA.org:umu-38112
DiVA: diva2:372288
External cooperation
CodeMill
Educational program
Bachelor of Science Programme in Computing Science
Uppsok
Technology
Supervisors
Examiners
Available from: 2010-11-25. Created: 2010-11-25. Last updated: 2012-06-19. Bibliographically approved.

Open Access in DiVA

Full text (225 kB), 304 downloads
File information
File name: FULLTEXT01.pdf
File size: 225 kB
Checksum (SHA-512): 7bf1e76d3abf4c1cd377c0cec79517a3c44a1d1c4b6bec84dc90199164a528c627aee62a16f09e3bdc8e401232e7c92ba679d7be6bb758dc7b5a92581a914144
Type: fulltext
Mimetype: application/pdf
