Extracting content from online news sites
Umeå University, Faculty of Science and Technology, Department of Computing Science.
2011 (English)
Independent thesis Advanced level (degree of Master (One Year)), 20 credits / 30 HE credits
Student thesis
Abstract [en]

Society produces more data every year. The number of unique URLs indexed by Google recently surpassed the one-trillion mark. To fully benefit from this surge in data, we need efficient algorithms for searching and extracting information. A popular approach is the so-called vector space model (VSM), which organises documents according to the terms they contain. This thesis contributes to an investigation of how adding syntactical information to VSM affects search results. It focuses on techniques for content extraction from online news sources, and describes the implementation and evaluation of a selection of these techniques. The extracted content serves as test data for search evaluation. The implementation is generic and thus easily adapted to new data sources, and although it lacks precision, its performance is sufficient for evaluating the syntax-based version of VSM.

Place, publisher, year, edition, pages
2011.
Series
UMNAD, 861
National Category
Computer Science
Identifiers
URN: urn:nbn:se:umu:diva-39558
OAI: oai:DiVA.org:umu-39558
DiVA: diva2:393829
Uppsok
Technology
Supervisors
Examiners
Available from: 2011-02-01 Created: 2011-02-01 Last updated: 2011-02-01 Bibliographically approved

Open Access in DiVA

fulltext (736 kB), 417 downloads
File information
File name: FULLTEXT01.pdf
File size: 736 kB
Checksum (SHA-512): d17b433d563c676f719eccc883df922d385878680f7a3d80b2ae9a463eb7378841cc057421286a620138acc385262cc4dd6a59b815a9c13ca9b0bb624eebbc00
Type: fulltext
Mimetype: application/pdf

Total: 417 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 240 hits