Structural Information and Hidden Markov Models for Biological Sequence Analysis
Umeå University, Faculty of Science and Technology, Department of Computing Science.
2008 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Bioinformatics is a fast-developing field that uses computational methods to analyse and structure biological data. An important branch of bioinformatics is structure and function prediction of proteins, which is often based on finding relationships to already characterized proteins. It is known that two proteins with very similar sequences also share the same 3D structure. However, there are many proteins with similar structures that have no clear sequence similarity, which makes it difficult to find these relationships.

In this thesis, two methods for annotating protein domains are presented, one aiming at assigning the correct domain family or families to a protein sequence, and the other aiming at fold recognition. Both methods use hidden Markov models (HMMs) to find related proteins, and they both exploit the fact that structure is more conserved than sequence, but in two different ways.

Most of the research presented in the thesis focuses on the structure-anchored HMMs, saHMMs. For each domain family, an saHMM is constructed from a multiple structure alignment of carefully selected representative domains, the saHMM-members. These saHMM-members are collected in the so-called "midnight ASTRAL set", and are chosen so that all saHMM-members within the same family have mutual sequence identities below a threshold of about 20%.
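The selection criterion above (all members of a family mutually below about 20% sequence identity) can be illustrated with a small sketch. This is not the thesis pipeline: it assumes the domains are already aligned to equal length with `-` as the gap character, it uses a simple greedy pass rather than the algorithms evaluated in the thesis, and the names `sequence_identity` and `select_representatives` are hypothetical.

```python
def sequence_identity(a: str, b: str) -> float:
    """Fraction of identical residues over columns where neither sequence has a gap.

    Assumes `a` and `b` are rows of the same alignment (equal length, '-' = gap).
    """
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def select_representatives(aligned: dict[str, str], threshold: float = 0.20) -> list[str]:
    """Greedy selection: keep a domain only if its identity to every
    already-kept domain is below `threshold` (illustrative, order-dependent)."""
    kept: list[str] = []
    for name, seq in aligned.items():
        if all(sequence_identity(seq, aligned[k]) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy family of four aligned domains (made-up sequences).
family = {
    "d1": "ACDEFGHIKL",
    "d2": "ACDEFGHIKM",  # 90% identical to d1, so it is rejected
    "d3": "WYWYWYWYWY",
    "d4": "AQWTTTTTTT",
}
print(select_representatives(family))
```

A greedy pass depends on the input order and need not return the largest possible set; it only guarantees that every pair of kept domains is below the threshold.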

In order to construct the midnight ASTRAL set and the saHMMs, a pipeline of software tools was developed. The saHMMs are shown to detect the correct family relationships with very high accuracy, and to perform better than the standard tool Pfam in assigning the correct domain families to new domain sequences. We also introduce the FI-score, which is used to measure the performance of the saHMMs, in order to select the optimal model for each domain family.

The saHMMs are made available for searching through the FISH server, and can be used for assigning family relationships to protein sequences.

The other approach presented in the thesis is secondary structure HMMs (ssHMMs). These HMMs are designed to use both the sequence and the predicted secondary structure of a query protein when scoring it against the model.

A rigorous benchmark is used, which shows that HMMs made from multiple sequences result in better fold recognition than those based on single sequences. Adding secondary structure information to the HMMs improves the ability of fold recognition further, both when using true and predicted secondary structures for the query sequence.
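As a rough illustration of scoring a query on both its residues and its secondary structure, the sketch below sums per-position log-odds for the two tracks along a gapless match path. It is an assumption-laden toy, not the ssHMM formulation of the thesis: it ignores insert and delete states, treats residue and secondary-structure emissions as independent, and the function name, argument names, and all probabilities are hypothetical.

```python
import math


def combined_log_odds(seq, ss, aa_emit, ss_emit, aa_bg, ss_bg):
    """Log-odds (base 2) of a query under a toy profile with two emission tracks.

    seq, ss  : residue string and secondary-structure string of equal length
    aa_emit  : per-position dicts of residue emission probabilities
    ss_emit  : per-position dicts of secondary-structure emission probabilities
    aa_bg, ss_bg : background probabilities for residues and structure states
    Assumes one match state per query position (no gaps) and independence
    between the two tracks; both are simplifications made for illustration.
    """
    if not (len(seq) == len(ss) == len(aa_emit) == len(ss_emit)):
        raise ValueError("query and model must have the same length")
    score = 0.0
    for i, (aa, s) in enumerate(zip(seq, ss)):
        score += math.log2(aa_emit[i][aa] / aa_bg[aa])
        score += math.log2(ss_emit[i][s] / ss_bg[s])
    return score


# Two-position toy model over a two-letter residue alphabet and helix/strand states.
aa_bg = {"A": 0.5, "G": 0.5}
ss_bg = {"H": 0.5, "E": 0.5}
aa_emit = [{"A": 0.75, "G": 0.25}, {"A": 0.25, "G": 0.75}]
ss_emit = [{"H": 0.75, "E": 0.25}, {"H": 0.75, "E": 0.25}]

helix_score = combined_log_odds("AG", "HH", aa_emit, ss_emit, aa_bg, ss_bg)
strand_score = combined_log_odds("AG", "EE", aa_emit, ss_emit, aa_bg, ss_bg)
print(helix_score > strand_score)
```

With the same residues, the query whose secondary structure agrees with the model scores higher, which is the effect the added structural track is meant to capture.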

Abstract [sv]

Bioinformatics is a field in which methods from computer science and statistics are used to analyse and structure biological data. An important area within bioinformatics attempts to predict the three-dimensional structure and function of a protein from its amino acid sequence and/or its similarities to other, already characterized, proteins. It is known that two proteins with similar amino acid sequences also have similar three-dimensional structures. However, the fact that two proteins have similar structures does not necessarily mean that their sequences are similar, which can make it difficult to find structural similarities from a protein's amino acid sequence.

This thesis describes two methods for finding similarities between proteins. One focuses on determining which family of protein domains with known 3D structure a given sequence belongs to, while the other attempts to predict a protein's fold, i.e. to give a rough picture of the protein's structure.

Both methods use hidden Markov models (HMMs), a statistical method that can be used, among other things, to describe protein families. With the help of an HMM, one can predict whether a given protein sequence belongs to the family the model represents. Both methods also use structural information to improve the models' ability to recognize related sequences, but in different ways.

Most of the work in the thesis concerns structure-anchored HMMs (saHMMs). To build the saHMMs, structure-based sequence alignments are used, which are generated from how the protein domains can be superimposed in space rather than from which amino acids occur in their sequences. In each protein family, only a special, representative selection of domains is used. These are chosen so that, when the sequences are compared pairwise, no pair within the family has a sequence identity above about 20%. This selection is made to obtain as wide a spread as possible among the sequences within the family. A suite of software tools has been developed to select representatives for each family and then build saHMMs based on them.

The saHMMs turn out to find the correct family for a large proportion of the tested sequences, with almost no errors. They are also better than the widely used method Pfam at finding the correct family for entirely new protein sequences.

The saHMMs are available through the FISH server, which anyone can use over the Internet to find which family a protein of interest may belong to.

The other method presented in the thesis is secondary structure HMMs (ssHMMs), which are built from ordinary multiple sequence alignments, but also from information about the secondary structures of the protein sequences in the family. When a protein sequence is compared with the ssHMM, a prediction of its secondary structure is used, and the computed probability that the sequence belongs to the family is based on both the amino acid sequence and the secondary structure.

In a comparison, HMMs based on several sequences turn out to be better than those based on a single sequence at finding the correct fold for a protein sequence. The HMMs become even better if the secondary structure is also taken into account, both when the true secondary structure is used and when a theoretically predicted one is used.

Place, publisher, year, edition, pages
Umeå: Datavetenskap, 2008. 92 p.
Series
Report / UMINF, ISSN 0348-0542 ; 08.05
Keyword [en]
HMM, structure alignment, protein structure, secondary structure, remote homologue, annotation, domain family, protein family, protein superfamily, protein fold recognition
National Category
Bioinformatics (Computational Biology)
Identifiers
URN: urn:nbn:se:umu:diva-1629; ISBN: 978-91-7264-369-7 (print); OAI: oai:DiVA.org:umu-1629; DiVA: diva2:141606
Public defence
2008-05-23, MA121, MIT-huset, Umeå universitet, 901 87 Umeå, 10:15 (English)
Note
Jeanette Hargbo. Available from: 2008-04-29. Created: 2008-04-29. Last updated: 2009-08-13. Bibliographically approved.
List of papers
1. Accurate Domain Identification with Structure-Anchored Hidden Markov Models, saHMMs
2009 (English). In: Proteins: Structure, Function, and Genetics, ISSN 0887-3585, E-ISSN 1097-0134, Vol. 76, no 2, 343-352 p. Article in journal (Refereed). Published
Abstract [en]

The ever increasing speed of DNA sequencing widens the discrepancy between the number of known gene products, and the knowledge of their function and structure. Proper annotation of protein sequences is therefore crucial if the missing information is to be deduced from sequence-based similarity comparisons. These comparisons become exceedingly difficult as the pairwise identities drop to very low values. To improve the accuracy of domain identification, we exploit the fact that the three-dimensional structures of domains are much more conserved than their sequences. Based on structure-anchored multiple sequence alignments of low identity homologues we constructed 850 structure-anchored hidden Markov models (saHMMs), each representing one domain family. Since the saHMMs are highly family specific, they can be used to assign a domain to its correct family and clearly distinguish it from domains belonging to other families, even within the same superfamily. This task is not trivial and becomes particularly difficult if the unknown domain is distantly related to the rest of the domain sequences within the family. In a search with full length protein sequences, harbouring at least one domain as defined by the structural classification of proteins database (SCOP), version 1.71, versus the saHMM database based on SCOP version 1.69, we achieve an accuracy of 99.0%. All of the few hits outside the family fall within the correct superfamily. Compared to Pfam_ls HMMs, the saHMMs obtain about 11% higher coverage. A comparison with BLAST and PSI-BLAST demonstrates that the saHMMs have consistently fewer errors per query at a given coverage. Within our recommended E-value range, the same is true for a comparison with SUPERFAMILY. Furthermore, we are able to annotate 232 proteins with 530 nonoverlapping domains belonging to 102 different domain families among human proteins labelled unknown in the NCBI protein database. 
Our results demonstrate that the saHMM database represents a versatile and reliable tool for identification of domains in protein sequences. With the aid of saHMMs, homology on the family level can be assigned, even for distantly related sequences. Due to the construction of the saHMMs, the hits they provide are always associated with high quality crystal structures. The saHMM database can be accessed via the FISH server at http://babel.ucmp.umu.se/fish/.
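The comparison of "errors per query at a given coverage" can be made concrete with a short sketch. The definitions used here are assumptions for illustration, not taken from the paper: coverage is the fraction of queries with at least one correct hit at or below an E-value cut-off, and errors per query is the number of incorrect hits at or below that cut-off divided by the number of queries. The function name and the hit tuples are hypothetical.

```python
def coverage_and_errors(hits, n_queries, e_cutoff):
    """Return (coverage, errors_per_query) at a given E-value cut-off.

    hits : iterable of (query_id, e_value, is_correct) tuples
    Assumed definitions (illustrative only):
      coverage         = queries with >= 1 correct accepted hit / n_queries
      errors_per_query = incorrect accepted hits / n_queries
    """
    accepted = [h for h in hits if h[1] <= e_cutoff]
    covered = {query for query, _, ok in accepted if ok}
    errors = sum(1 for _, _, ok in accepted if not ok)
    return len(covered) / n_queries, errors / n_queries


# Made-up search results for four queries.
hits = [
    ("q1", 1e-10, True),
    ("q1", 0.05, False),
    ("q2", 1e-3, True),
    ("q3", 0.5, True),
    ("q4", 2.0, False),
]

for cutoff in (0.01, 1.0):
    print(cutoff, coverage_and_errors(hits, n_queries=4, e_cutoff=cutoff))
```

Sweeping the cut-off traces out a curve of errors per query against coverage, and two methods can then be compared at equal coverage.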

Keyword
protein domain, structure alignment, remote homologue, sequence annotation, protein family, protein superfamily
National Category
Biochemistry and Molecular Biology; Biophysics
Identifiers
urn:nbn:se:umu:diva-24700 (URN); 10.1002/prot.22349 (DOI)
Note
Jeanette Hargbo. Available from: 2009-07-10. Created: 2009-07-10. Last updated: 2012-04-11. Bibliographically approved.
2. FISH-Family identification of sequence homologues using structure anchored hidden Markov models
2006 (English). In: Nucleic Acids Research, ISSN 0305-1048, E-ISSN 1362-4962, Vol. 34, no Web Server issue, W10-W14 p. Article in journal (Refereed). Published
Abstract [en]

The FISH server is highly accurate in identifying the family membership of domains in a query protein sequence, even in the case of very low sequence identities to known homologues. A performance test using SCOP sequences and an E-value cut-off of 0.1 showed that 99.3% of the top hits are to the correct family saHMM. Matches to a query sequence provide the user not only with an annotation of the identified domains and hence a hint to their function, but also with probable 2D and 3D structures, as well as with pairwise and multiple sequence alignments to homologues with low sequence identity. In addition, the FISH server allows users to upload and search their own protein sequence collection or to query public protein sequence databases with individual saHMMs. The FISH server can be accessed at http://babel.ucmp.umu.se/fish/.
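A top-hit accuracy figure of the kind quoted above can be computed along the following lines. This is a minimal sketch under stated assumptions: each hit is a (query, family, E-value) tuple, the top hit is the one with the lowest E-value at or below the cut-off, and queries with no accepted hit are left out of the denominator. The function name, the family labels, and the data are hypothetical and do not reproduce the published test.

```python
def top_hit_accuracy(hits, true_family, e_cutoff=0.1):
    """Fraction of queries whose best accepted hit is to the correct family.

    hits        : iterable of (query_id, family_id, e_value)
    true_family : dict mapping query_id to its known family_id
    Queries without any hit at or below `e_cutoff` are not counted
    (an assumption made for this illustration).
    """
    best = {}
    for query, family, e_value in hits:
        if e_value <= e_cutoff and (query not in best or e_value < best[query][1]):
            best[query] = (family, e_value)
    if not best:
        return None
    correct = sum(1 for q, (family, _) in best.items() if family == true_family[q])
    return correct / len(best)


# Made-up hits against family models, labelled with SCOP-style identifiers.
hits = [
    ("q1", "a.1.1", 1e-20),
    ("q1", "a.1.2", 1e-3),
    ("q2", "b.2.1", 0.05),
    ("q2", "b.2.2", 0.01),   # better E-value but wrong family
    ("q3", "c.3.1", 0.5),    # above the cut-off, so q3 has no accepted hit
]
truth = {"q1": "a.1.1", "q2": "b.2.1", "q3": "c.3.1"}
print(top_hit_accuracy(hits, truth))
```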

Keyword
Bioinformatics, databases, protein structure, internet, structure-anchored Hidden Markov Models, saHMM, protein structure alignment, protein structure superimposition, sequence annotation, sequence homology, amino acid, software, FISH-server
National Category
Bioinformatics and Systems Biology
Identifiers
urn:nbn:se:umu:diva-13966 (URN); 10.1093/nar/gkl330 (DOI); 16844969 (PubMedID)
Note
Jeanette Hargbo. Available from: 2007-11-23. Created: 2007-11-23. Last updated: 2010-05-10. Bibliographically approved.
3. Design, construction and use of the FISH server
2007 (English). In: Applied parallel computing: state of the art in scientific computing, Springer Link, 2007, 647-657 p. Chapter in book (Other academic)
Abstract [en]

At the core of the FISH (Family Identification with Structure anchored Hidden Markov models, saHMMs) server lies the midnight ASTRAL set. It is a collection of protein domains with low mutual sequence identity within homologous families, according to the structural classification of proteins, SCOP. Here, we evaluate two algorithms for creating the midnight ASTRAL set. The algorithm that limits the number of structural comparisons is about an order of magnitude faster than the all-against-all algorithm. We therefore choose the faster algorithm, although it produces slightly fewer domains in the set. We use the midnight ASTRAL set to construct the structure-anchored Hidden Markov Model data base, saHMM-db, where each saHMM represents one family. Sequence searches using saHMMs provide information about protein function, domain organization, the probable 2D and 3D structure, and can lead to the discovery of homologous domains in remotely related sequences.

Place, publisher, year, edition, pages
Springer Link, 2007
Series
Lecture Notes in Computer Science, ISSN 0302-9743 (Print) 1611-3349 (Online) ; Volume 4699/2010
Keyword
Bioinformatics, structure-anchored Hidden Markov Model, saHMM, protein structure alignment, protein structure superimposition, remote homologue, FISH-server, protein annotation
National Category
Computer Science; Bioinformatics and Systems Biology
Identifiers
urn:nbn:se:umu:diva-3126 (URN); 10.1007/978-3-540-75755-9_78 (DOI); 978-3-540-75754-2 (ISBN)
Note
Jeanette Hargbo. Available from: 2008-04-29. Created: 2008-04-29. Last updated: 2010-05-10. Bibliographically approved.
4. Combinatorial Selection Improves Hidden Markov Model Performance
(English). Manuscript (preprint) (Other (popular science, discussion, etc.))
Identifiers
urn:nbn:se:umu:diva-3127 (URN)
Note
Jeanette Hargbo. Available from: 2008-04-29. Created: 2008-04-29. Last updated: 2010-01-14. Bibliographically approved.
5. Hidden Markov Models That Use Predicted Secondary Structures For Fold Recognition
1999 (English). In: Proteins: Structure, Function, and Genetics, ISSN 0887-3585, Vol. 36, no 1, 68-76 p. Article in journal (Refereed). Published
Abstract [en]

There are many proteins that share the same fold but have no clear sequence similarity. To predict the structure of these proteins, so-called protein fold recognition methods have been developed. During the last few years, improvements of protein fold recognition methods have been achieved through the use of predicted secondary structures (Rice and Eisenberg, J Mol Biol 1997;267:1026-1038), as well as by using multiple sequence alignments in the form of hidden Markov models (HMM) (Karplus et al., Proteins Suppl 1997;1:134-139). To test the performance of different fold recognition methods, we have developed a rigorous benchmark where representatives for all proteins of known structure are matched against each other. Using this benchmark, we have compared the performance of automatically created hidden Markov models with standard sequence search methods. Further, we combine the use of predicted secondary structures and multiple sequence alignments into a combined method that performs better than methods that do not use this combination of information. Using only single sequences, the correct fold of a protein was detected for 10% of the test cases in our benchmark. Including multiple sequence information increased this number to 16%, and when predicted secondary structure information was included as well, the fold was correctly identified in 20% of the cases. Moreover, if the correct secondary structure was used, 27% of the proteins could be correctly matched to a fold. For comparison, blast2, fasta, and ssearch identify the fold correctly in 13-17% of the cases. Thus, standard pairwise sequence search methods perform almost as well as hidden Markov models in our benchmark. This is probably because the automatically created multiple sequence alignments used in this study do not contain enough diversity and because the current generation of hidden Markov models does not perform very well when built from a few sequences. Proteins 1999;36:68-76

Keyword
protein structure, HMM, Scop, HSSP, threading, blast, fasta, ssearch, protein fold recognition
Identifiers
urn:nbn:se:umu:diva-3128 (URN); 10.1002/(SICI)1097-0134(19990701)36:1<68::AID-PROT6>3.0.CO;2-1 (DOI)
Note
Also under the name Jeanette Tångrot. Available from: 2008-04-29. Created: 2008-04-29. Bibliographically approved.
