Extracting content from online news sites
Umeå University, Faculty of Science and Technology, Department of Computing Science.
2011 (English). Independent thesis, Advanced level (degree of Master (One Year)), 20 credits / 30 HE credits. Student thesis.
Abstract [en]

Society produces more data every year. The number of unique URLs indexed by Google recently surpassed the one-trillion mark. To benefit fully from this surge in data, we need efficient algorithms for searching and extracting information. A popular approach is the so-called vector space model (VSM), which organises documents according to the terms that they contain. This thesis contributes to an investigation of how adding syntactical information to VSM affects search results. The thesis focuses on techniques for content extraction from online news sources, and describes the implementation and evaluation of a selection of these techniques. The extracted content serves as test data for search evaluation. The implementation is generic and thus easily adapted to new data sources, and although it lacks precision, its performance is sufficient for evaluating the syntax-based version of VSM.
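
As an illustration of the vector space model mentioned in the abstract, the following is a minimal sketch, not the thesis implementation. It assumes raw term counts as vector weights and whitespace tokenisation, and the function names (`term_vector`, `cosine_similarity`, `rank`) are hypothetical. The syntax-based variant investigated in the thesis is not shown.

```python
import math
from collections import Counter


def term_vector(text):
    """Map a document to a sparse vector of raw term counts.

    Assumption: lowercase whitespace tokenisation; a real system would
    likely use a proper tokenizer and a weighting scheme such as TF-IDF.
    """
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document shares no terms with anything
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return (score, document) pairs, most similar to the query first."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(d)), d) for d in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

Calling `rank("election news", docs)` on a small list of strings places documents that share terms with the query first, and documents with no terms in common receive a score of 0.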

Place, publisher, year, edition, pages
UMNAD, 861
National Category
Computer Science
URN: urn:nbn:se:umu:diva-39558
OAI: diva2:393829
Available from: 2011-02-01. Created: 2011-02-01. Last updated: 2011-02-01. Bibliographically approved.

Open Access in DiVA

fulltext (736 kB), 402 downloads
File information
File name: FULLTEXT01.pdf
File size: 736 kB
Checksum: SHA-512
Type: fulltext
Mimetype: application/pdf


Total: 402 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 223 hits