Extracting content from online news sites
Independent thesis Advanced level (degree of Master (One Year)), 20 credits / 30 HE credits
Student thesis
Society produces more data every year. The number of unique URLs indexed by Google recently surpassed the one-trillion mark. To benefit fully from this surge in data, we need efficient algorithms for searching and extracting information. A popular approach is the vector space model (VSM), which organises documents according to the terms they contain. This thesis contributes to an investigation of how adding syntactic information to the VSM affects search results. It focuses on techniques for content extraction from online news sources, and describes the implementation and evaluation of a selection of these techniques. The extracted content serves as test data for search evaluation. The implementation is generic and thus easily adapted to new data sources, and although it lacks precision, its performance is sufficient for evaluating the syntax-based version of the VSM.
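As an illustration of the vector space model mentioned in the abstract, the following is a minimal sketch and not the implementation described in the thesis. It assumes raw term counts as vector weights and cosine similarity as the ranking measure; the function names (term_vector, cosine_similarity, rank) are hypothetical and chosen only for this example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a document as a bag-of-words vector of raw term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document) pairs, most similar to the query first."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Calling rank("markets fall", ["stock markets fall", "football match result"]) places the first document ahead of the second, since only it shares terms with the query. A syntax-aware variant of the kind the thesis investigates would extend the vector with additional features; that extension is not shown here.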
Place, publisher, year, edition, pages
UMNAD, 861
Identifiers
URN: urn:nbn:se:umu:diva-39558
OAI: oai:DiVA.org:umu-39558
DiVA: diva2:393829