Authorship attribution is an important problem with many practical applications. A principal constraint in dealing with it is the type of text being written and the purpose for which it is written, that is, its context. The context of a text affects the style of the resulting text.
This thesis presents an approach that attempts to avoid the effects of context by analyzing grammatical structures, in practice dependency structures derived by parsing software. Latent semantic indexing is employed for classification. Results are presented as a performance comparison with a similar approach based on phrase structure trees.
The corpus used in these experiments is a subset of the ICWSM2009 corpus, provided by the International Conference on Weblogs and Social Media. The subset contains only blog posts and shows a high degree of variance in a number of respects, such as author attributes and the textual content itself.
In conclusion, the dependency-based approach to authorship attribution appears to be significantly weaker than its phrase-structure counterpart. The outcome is discussed further, and possible approaches beyond the realm of authorship attribution are identified.