Independent thesis Basic level (degree of Bachelor), 10 credits / 15 HE credits
Student thesis
The writing style of a particular author can be broken down into many sub-features, for example word choice, language, and syntax. Focusing on syntax, this study aims to show how well syntactic information alone can attribute the correct author to a document. Syntactic information is defined as the overlapping syntactic subtrees of height one (1) for all sentences of all included documents. The performance is compared to that of a previously very successful method, the comparison of stop word frequencies. Stop words are words normally excluded from search engine queries because they are present in all sorts of texts regardless of topic. For authorship attribution, however, this property is an advantage, as it allows for context-free comparisons of texts. Training and test data are obtained from the ICWSM 2009 corpus, which contains some 200 gigabytes of blog posts and news articles. This data is automatically filtered to create a collection that is reasonably large (about 250,000 documents) while remaining manageable by an automatic natural language parser (Stanford NLP) within the available time. The filtering process guarantees that every text used for comparison has texts by the same author within the training portion of the data. Indexing and searching are done using Latent Semantic Indexing (LSI). Each document is represented by a vector in a multidimensional space, and together these vectors form a matrix of document vectors. Search documents are then matched with those in the matrix by calculating the angles between document vectors, returning the documents with the smallest angular difference to the query document. The process of creating a document matrix and search documents is repeated multiple times, each time creating a new document matrix from randomly selected authors. The performance of the different methods is measured by comparing average scores for each created document matrix.
The results show that, on average, syntactic information attributes the correct author more often than both chance and stop word frequency analysis. These results hold for all tested numbers of authors present within the index matrix, ranging from ten to one hundred unique authors.
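The feature extraction and matching steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the nested-tuple tree format, the function names, the raw-count weighting, and the choice of `k` are all assumptions made for illustration, and the parse trees are written by hand rather than produced by the Stanford parser.

```python
from collections import Counter

import numpy as np


def height_one_subtrees(tree):
    """Yield 'PARENT -> CHILD ...' for each node that has non-leaf children.

    Assumed format: a tree is a nested tuple (label, child, ...);
    a leaf (a word) is a plain string.
    """
    label, *children = tree
    kids = [c for c in children if not isinstance(c, str)]
    if kids:
        yield label + " -> " + " ".join(k[0] for k in kids)
    for k in kids:
        # Each child is also the root of its own subtree, hence "overlapping".
        yield from height_one_subtrees(k)


def count_matrix(docs, vocab=None):
    """Turn lists of feature strings into a documents x features count matrix."""
    counters = [Counter(d) for d in docs]
    if vocab is None:
        vocab = sorted(set().union(*counters))
    matrix = np.array([[c[f] for f in vocab] for c in counters], dtype=float)
    return matrix, vocab


def lsi_basis(matrix, k):
    """Return the top-k left singular vectors of the feature x document matrix."""
    u, _, _ = np.linalg.svd(matrix.T, full_matrices=False)
    return u[:, :k]


def rank_by_angle(index_vecs, query_vec):
    """Return index rows ordered from smallest to largest angle to the query."""
    norms = np.linalg.norm(index_vecs, axis=1) * np.linalg.norm(query_vec)
    cosines = index_vecs @ query_vec / np.where(norms == 0, 1.0, norms)
    return np.argsort(-cosines)


# Hand-written toy trees (hypothetical data), one known document per author.
tree_a = ("S", ("NP", ("DT", "the"), ("NN", "cat")), ("VP", ("VB", "sleeps")))
tree_b = ("S", ("VP", ("VB", "run"), ("ADVP", ("RB", "fast"))))
query_tree = ("S", ("NP", ("DT", "a"), ("NN", "dog")), ("VP", ("VB", "barks")))

authors = ["A", "B"]
train_docs = [list(height_one_subtrees(t)) for t in (tree_a, tree_b)]
index_matrix, vocab = count_matrix(train_docs)

basis = lsi_basis(index_matrix, k=2)      # k is an arbitrary choice here
index_vecs = index_matrix @ basis         # documents in the latent space

query_counts, _ = count_matrix([list(height_one_subtrees(query_tree))], vocab)
ranking = rank_by_angle(index_vecs, query_counts[0] @ basis)
best_author = authors[ranking[0]]
```

The same pipeline would accept stop word counts instead of subtree strings, since `count_matrix` only sees lists of feature strings; that makes it straightforward to compare the two feature sets under identical indexing and matching.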
Place, publisher, year, edition, pages
UMNAD, 853
Identifiers
URN: urn:nbn:se:umu:diva-38112
OAI: oai:DiVA.org:umu-38112
DiVA: diva2:372288
Bachelor of Science Programme in Computing Science