S²ProT: Rank Allocation by Superpositioned Propagation of Topic-Relevance
Umeå University, Faculty of Science and Technology, Department of Computing Science.
2008 (English) In: International Journal of Web Information Systems, ISSN 1744-0084, Vol. 4, no 4, 416-440 p. Article in journal (Refereed) Published
Abstract [en]

Purpose – The purpose of this paper is to assign topic-specific ratings to web pages.

Design/methodology/approach – The paper uses power iteration to assign topic-specific rating values (called relevance) to web pages, creating a ranking or partial order among these pages for each topic. This approach depends on a set of pages that are initially assumed to be relevant for a specific topic; the spatial link structure of the web pages; and a net-specific decay factor designated ξ.
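The kind of propagation described can be sketched as follows. This is a minimal toy illustration of power-iteration-style, seed-driven relevance propagation; the graph, function name, update rule and parameter defaults are assumptions for this sketch, not the paper's actual S²ProT formulation:

```python
# Toy sketch: propagate topic relevance from an initially relevant seed set
# along outgoing links, attenuated by a decay factor xi. Illustrative only.

def propagate_relevance(links, seeds, xi=0.5, iterations=20):
    """links: {page: [pages it links to]}; seeds: pages assumed on-topic."""
    relevance = {p: (1.0 if p in seeds else 0.0) for p in links}
    for _ in range(iterations):
        # Re-inject relevance at the seeds, then spread a xi-attenuated
        # share of each page's current relevance along its out-links.
        nxt = {p: (1.0 if p in seeds else 0.0) for p in links}
        for page, targets in links.items():
            share = xi * relevance[page] / max(len(targets), 1)
            for t in targets:
                if t in nxt:
                    nxt[t] += share
        relevance = nxt
    return relevance

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = propagate_relevance(links, seeds={"a"})
ranking = sorted(scores, key=scores.get, reverse=True)
```

With ξ < 1 this toy update amounts to a geometric series in the link matrix, so the iterates converge quickly; pages reachable from the seed set accumulate relevance, while pages with no incoming relevance (here `d`) stay at zero.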

Findings – The paper finds that this approach exhibits desirable properties: fast convergence, stability, and relevant answer sets. The first property is shown using theoretical proofs, while the others are evaluated through stability experiments and assessments of real-world data in comparison with already established algorithms.

Research limitations/implications – In the assessment, all pages that a web spider was able to find in the Nordic countries were used. It is also important to note that entities that use domains outside the Nordic countries are not present in the paper's datasets, even though they reside logically within one or more of the Nordic countries. This is quite a large dataset, but still small in comparison with the entire worldwide web. Moreover, the execution speed of some of the algorithms unfortunately prohibited the use of a large test dataset in the stability tests.

Practical implications – It is not only possible, but also reasonable, to perform ranking of web pages without using Markov chain approaches. This means that the work of generating answer sets for complex questions could (at least in theory) be divided into smaller parts that are later summed up to give the final answer.
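The "divided into smaller parts" observation can be illustrated directly: a propagation update of the kind described is linear in the seed vector, so relevance computed from a combined seed set equals the elementwise sum of relevance computed from each seed separately. A hypothetical toy sketch, not the paper's implementation:

```python
# Toy sketch of superposition: the update below is linear in the seed
# vector, so one combined computation equals the sum of per-seed parts.

def propagate(links, seeds, xi=0.5, iterations=30):
    r = {p: (1.0 if p in seeds else 0.0) for p in links}
    for _ in range(iterations):
        nxt = {p: (1.0 if p in seeds else 0.0) for p in links}
        for page, outs in links.items():
            share = xi * r[page] / max(len(outs), 1)
            for t in outs:
                if t in nxt:
                    nxt[t] += share
        r = nxt
    return r

links = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}
combined = propagate(links, {"a", "d"})              # one big computation
parts = [propagate(links, {s}) for s in ("a", "d")]  # two small ones
summed = {p: sum(part[p] for part in parts) for p in links}
assert all(abs(combined[p] - summed[p]) < 1e-9 for p in links)
```

The per-seed parts can be computed independently (and in parallel) and summed afterwards, which is the practical point made above.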

Originality/value – This paper contributes to the research on internet search engines.

Place, publisher, year, edition, pages
2008. Vol. 4, no 4, 416-440 p.
Keyword [en]
Information retrieval, Search engines, Spatial data structures, Worldwide web
National Category
Computer Science
URN: urn:nbn:se:umu:diva-11236
DOI: 10.1108/17440080810919477
OAI: diva2:150907
Available from: 2008-12-01 Created: 2008-12-01 Last updated: 2013-01-17 Bibliographically approved
In thesis
1. Finding, extracting and exploiting structure in text and hypertext
2009 (English) Doctoral thesis, comprehensive summary (Other academic)
Alternative title [sv]
Att finna, extrahera och utnyttja strukturer i text och hypertext
Abstract [en]

Data mining is a fast-developing field of study that uses computation to either describe or predict large amounts of data. The amount of data produced grows each year, requiring ever more efficient algorithms in order to find interesting information within a given time.

In this thesis, we study methods for extracting information from semi-structured data, for finding structure within large sets of discrete data, and for efficiently ranking web pages in a topic-sensitive way.

The information extraction research focuses on support for keeping documentation and source code up to date at the same time. Our approach to this problem is to embed parts of the documentation within strategic comments in the source code and then extract them using a dedicated tool.
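As an illustration of the idea (the `#doc:`/`#end` marker syntax below is invented for this sketch; the actual tool has its own conventions):

```python
# Hypothetical sketch of extracting documentation that is embedded in
# "strategic" source-code comments. Marker syntax is illustrative only.
import re

SOURCE = '''\
#doc: binary_search
# Returns the index of x in the sorted list xs, or -1 if absent.
#end
def binary_search(xs, x):
    ...
'''

def extract_docs(text):
    """Collect documentation blocks delimited by '#doc: <name>' ... '#end'."""
    blocks = {}
    for match in re.finditer(r"#doc:\s*(\w+)\n(.*?)#end", text, re.S):
        name, body = match.group(1), match.group(2)
        lines = [ln.lstrip("# ").rstrip() for ln in body.splitlines() if ln.strip()]
        blocks[name] = " ".join(lines)
    return blocks

docs = extract_docs(SOURCE)
```

Keeping the prose next to the code it documents means a single edit updates both; the extractor then regenerates the external documentation from the source.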

The structures that our structure mining algorithms find among crisp data (such as keywords) take the form of subsumptions, i.e. one keyword is a more general form of another. Since subsumptions are transitive, we can use them to build larger structures in the form of hierarchies or lattices. Our tool has been used mainly to provide input to data mining systems and to visualise data-sets.
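Because subsumption is transitive, pairwise subsumptions can be closed into a larger partial order. A minimal sketch, with made-up keywords and a naive closure rather than the thesis algorithm:

```python
# Sketch: close pairwise "A subsumes B" facts under transitivity to obtain
# a larger partial order (a hierarchy or lattice). Illustrative only.

def transitive_closure(pairs):
    """pairs: iterable of (general, specific) edges; returns the closed set."""
    closed = set(pairs)
    changed = True
    while changed:
        changed = False
        for g, s in list(closed):
            for g2, s2 in list(closed):
                if s == g2 and (g, s2) not in closed:
                    closed.add((g, s2))  # g subsumes s, and s subsumes s2
                    changed = True
    return closed

# "animal" subsumes "mammal" and "mammal" subsumes "dog", so the closure
# adds the implied ("animal", "dog") edge.
edges = transitive_closure({("animal", "mammal"), ("mammal", "dog")})
```

The closed edge set can then be drawn as a hierarchy, with each keyword placed below the keywords that subsume it.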

The main part of the research has been on ranking web pages in such a way that both the link structure between pages and the content of each page matter. We have created a number of algorithms and compared them to other algorithms in use today. Our focus in these comparisons has been on convergence rate, algorithm stability, and how relevant the answer sets produced by the algorithms are according to real-world users.

The research has focused on the development of efficient algorithms for gathering and handling large data-sets of discrete and textual data. A proposed system of tools is described, all operating on a common database containing "fingerprints" and meta-data about items. This data could be searched by various algorithms to increase its usefulness or to find the real data more efficiently.

All of the methods described handle data in a crisp manner, i.e. a word or a hyperlink either is or is not part of a record or web page. This means that we can model its existence very efficiently, and all of the methods and algorithms we describe make use of this fact.
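The efficiency point can be made concrete: crisp membership means each page can be modelled as a bitmask over a vocabulary, so set operations cost a single machine-word operation each. A toy sketch (the vocabulary and helper names are assumptions):

```python
# Sketch of the crisp-membership point: a word either is or is not in a page,
# so each page can be modelled as a bitmask over a fixed vocabulary, and set
# operations become single bitwise operations. Illustrative only.

vocabulary = ["ranking", "web", "lattice", "mining"]
index = {word: i for i, word in enumerate(vocabulary)}

def to_bits(words):
    """Encode a set of vocabulary words as an integer bitmask."""
    mask = 0
    for word in words:
        mask |= 1 << index[word]
    return mask

page_a = to_bits({"ranking", "web"})
page_b = to_bits({"web", "mining"})
shared = page_a & page_b                   # keywords both pages contain
a_subsumed = (page_a & page_b) == page_a   # crisp subset test for page_a
```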

Abstract [sv]

Data mining (the English term is commonly used in Swedish as well) is a research field in constant development. It is about using computers to find patterns in large amounts of data, or to predict future data from the data already available. Since more and more data is produced every year, this places ever higher demands on the efficiency of the algorithms used to find or use the information within a reasonable time.

This thesis is about extracting information from semi-structured data, finding structures in large discrete data sets, and efficiently ranking web pages from a topic-based perspective.

The information extraction described concerns support for keeping both the documentation and the source code up to date at the same time. Our solution to this problem is to place parts of the documentation (mainly the algorithm descriptions) as block comments in the source code and to extract them automatically with a tool.

The structures found by our structure-extraction algorithms take the form of subsumptions, for example that a certain keyword is more general than another. Since the subsumptions are transitive, these relationships can be used to create larger structures in the form of hierarchies or directed graphs. The tool we have developed has mainly been used to create input for a data mining system and to visualise that input.

The main part of the research described in this thesis has, however, concerned ranking web pages based on both their content and the links between them. We have created a number of algorithms and shown how they behave in comparison with other algorithms in use today. These comparisons have mainly concerned convergence rate, the stability of the algorithms given uncertain data, and finally how relevant the algorithms' answer sets have been judged to be from the users' perspective.

The research has focused on efficient algorithms for gathering and handling large data sets of discrete or text-based data. The thesis also presents a proposal for a system of tools that operate together on a common database consisting of "fingerprints" and other meta-data about the items indexed in the database. This data can then be used by various algorithms to increase the value of what is stored in the database or to find the right information efficiently.

Place, publisher, year, edition, pages
Umeå: Umeå Universitet, 2009. 217 p.
Report / UMINF, ISSN 0348-0542 ; 09.12
Keyword [en]
Automatic propagation, CHiC, Data mining, Discrete data, Extraction, Hierarchies, ProT, Rank distribution, S²ProT, Spatial linking, Web mining, Web searching
National Category
Computer Science
Research subject
Computing Science
urn:nbn:se:umu:diva-22352 (URN), 978-91-7264-799-2 (ISBN)
Public defence
2009-06-05, MA 121, MIT, Umeå Universitet, Umeå, 13:15 (English)
AlgExt, CHiC, ProT
Available from: 2009-05-14 Created: 2009-05-06 Last updated: 2009-06-01 Bibliographically approved

By author/editor: Ågren, Ola
By organisation: Department of Computing Science
