From protein sequence to structural instability and disease
Umeå universitet, Teknisk-naturvetenskapliga fakulteten, Kemiska institutionen.
2010 (English). Doctoral thesis, comprising papers (Other academic)
Abstract [en]

A great challenge in bioinformatics is to accurately predict a protein's structure and function from its amino acid sequence, including annotating protein domains, identifying disordered protein regions and detecting changes in protein stability that result from amino acid mutations. The combination of bioinformatics, genomics and proteomics has become essential for investigating the biological, cellular and molecular aspects of disease, and can therefore contribute greatly to understanding protein structures and to facilitating drug discovery.

In this thesis, a PREDICTOR is presented that consists of three machine learning methods applied to three different but related structural bioinformatics tasks: using profile Hidden Markov Models (HMMs) to identify remote sequence homologues on the basis of protein domains; predicting order and disorder in proteins using Conditional Random Fields (CRFs); and applying Support Vector Machines (SVMs) to detect protein stability changes due to single mutations.

To facilitate structural instability and disease studies, these methods are implemented in three web servers: FISH, OnD-CRF and ProSMS, respectively.

For FISH, most of the work presented in the thesis focuses on the design and construction of the web server. The server is based on a collection of structure-anchored hidden Markov models (saHMMs), which are used to identify structural similarity at the protein domain level.

For the order and disorder prediction server, OnD-CRF, I implemented two schemes to alleviate the imbalance between ordered and disordered amino acids in the training dataset. One prunes the protein sequences in order to obtain a balanced training dataset. The other tries to find the optimal p-value cut-off for discriminating between ordered and disordered amino acids. Both schemes enhance the sensitivity of detecting disordered amino acids in proteins. In addition, the output from the OnD-CRF web server can be used to identify flexible regions and to predict the effect of mutations on protein stability.
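The second scheme, searching for a cut-off that separates ordered from disordered residues, can be illustrated with a minimal sketch. It assumes the cut-off is applied to a per-residue disorder probability and that the cut-off is chosen to maximise balanced accuracy; neither the selection criterion nor the function names `balanced_accuracy` and `best_cutoff` come from the thesis.

```python
def balanced_accuracy(labels, probs, cutoff):
    """Mean of sensitivity and specificity when residues with
    probability >= cutoff are called disordered (label 1)."""
    tp = fn = tn = fp = 0
    for y, p in zip(labels, probs):
        predicted_disordered = p >= cutoff
        if y and predicted_disordered:
            tp += 1
        elif y:
            fn += 1
        elif predicted_disordered:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return (sensitivity + specificity) / 2


def best_cutoff(labels, probs, steps=100):
    """Sweep cut-offs 0, 1/steps, ..., 1 and return the first one
    that reaches the highest balanced accuracy (assumed criterion)."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda c: balanced_accuracy(labels, probs, c))
```

With few disordered residues in the data, a default cut-off of 0.5 may miss all of them, whereas a swept cut-off can trade a little specificity for much higher sensitivity.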

For ProSMS, we propose, after careful evaluation with different methods, a homology-clustered model and a non-clustered model for a three-state classification of protein stability changes due to single amino acid mutations. Results for the non-clustered model reveal that the prediction accuracy based on sequence alone is comparable to the accuracy based on protein 3D structure information. In the case of the clustered model, however, the prediction accuracy is significantly improved when protein tertiary structure information, in the form of local environmental conditions, is included. Comparing the prediction accuracies of the two models indicates that predicting the mutation stability of proteins that are not homologous is still a challenging task.

Benchmarking results show that, as stand-alone programs, these predictors can be comparable or superior to previously established predictors. Combined into a program package, these mutually complementary predictors will facilitate the understanding of structural instability and disease from protein sequence.

Place, publisher, year, edition, pages
Umeå: Kemiska institutionen, 2010, p. 67
Keywords [en]
protein domain, remote homologue, intrinsically disordered/unstructured proteins, protein function, point mutation, protein family, protein stability, HMMs, CRFs, SVMs
Identifiers
URN: urn:nbn:se:umu:diva-33845; ISBN: 978-91-7459-016-6 (print); OAI: oai:DiVA.org:umu-33845; DiVA, id: diva2:318269
Public defence
2010-05-31, KB3B1, KBC-huset, Umeå University, 10:00 (English)
Available from: 2010-05-10. Created: 2010-05-07. Last updated: 2018-06-08. Bibliographically approved.
List of papers
1. FISH-Family identification of sequence homologues using structure anchored hidden Markov models
2006 (English). In: Nucleic Acids Research, ISSN 0305-1048, E-ISSN 1362-4962, Vol. 34, no. Web Server issue, p. W10-W14. Article in journal (Refereed). Published.
Abstract [en]

The FISH server is highly accurate in identifying the family membership of domains in a query protein sequence, even in the case of very low sequence identities to known homologues. A performance test using SCOP sequences and an E-value cut-off of 0.1 showed that 99.3% of the top hits are to the correct family saHMM. Matches to a query sequence provide the user not only with an annotation of the identified domains, and hence a hint to their function, but also with probable 2D and 3D structures, as well as with pairwise and multiple sequence alignments to homologues with low sequence identity. In addition, the FISH server allows users to upload and search their own protein sequence collection or to query public protein sequence databases with individual saHMMs. The FISH server can be accessed at http://babel.ucmp.umu.se/fish/.
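As an illustration of how an E-value cut-off and a top hit interact, the following minimal sketch filters a list of hits and returns the best one. The cut-off of 0.1 is the value quoted in the abstract; the `(family, evalue)` tuple layout and the function name `top_hit` are assumptions made for this example and do not describe the FISH server's actual interface.

```python
def top_hit(hits, max_evalue=0.1):
    """Return the hit with the lowest E-value among those at or
    below max_evalue, or None if no hit passes the cut-off.

    hits: list of (family_name, evalue) tuples (hypothetical layout).
    """
    significant = [hit for hit in hits if hit[1] <= max_evalue]
    if not significant:
        return None
    return min(significant, key=lambda hit: hit[1])
```

A lower E-value means the match is less likely to have arisen by chance, so the top hit is the one with the smallest value among those that pass the cut-off.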

Keywords
Bioinformatics, databases, protein structure, internet, structure-anchored Hidden Markov Models, saHMM, protein structure alignment, protein structure superimposition, sequence annotation, sequence homology, amino acid, software, FISH-server
Identifiers
urn:nbn:se:umu:diva-13966 (URN); 10.1093/nar/gkl330 (DOI); 16844969 (PubMedID); 2-s2.0-33747830260 (Scopus ID)
Note
Jeanette Hargbo.
Available from: 2007-11-23. Created: 2007-11-23. Last updated: 2025-02-07. Bibliographically approved.
2. Design, construction and use of the FISH server
2007 (English). In: Applied parallel computing: state of the art in scientific computing, Springer Link, 2007, p. 647-657. Chapter in book, part of anthology (Other academic).
Abstract [en]

At the core of the FISH (Family Identification with Structure-anchored Hidden Markov models, saHMMs) server lies the midnight ASTRAL set. It is a collection of protein domains with low mutual sequence identity within homologous families, according to the structural classification of proteins, SCOP. Here, we evaluate two algorithms for creating the midnight ASTRAL set. The algorithm that limits the number of structural comparisons is about an order of magnitude faster than the all-against-all algorithm. We therefore choose the faster algorithm, although it produces slightly fewer domains in the set. We use the midnight ASTRAL set to construct the structure-anchored Hidden Markov Model database, saHMM-db, where each saHMM represents one family. Sequence searches using saHMMs provide information about protein function, domain organization and the probable 2D and 3D structure, and can lead to the discovery of homologous domains in remotely related sequences.
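The trade-off between an all-against-all comparison and one that limits the number of comparisons can be sketched with a generic greedy selection. This is not the algorithm evaluated in the paper: the toy `identity` function (fraction of matching positions in pre-aligned, equal-length strings), the `max_identity` threshold and the function names are all assumptions chosen only to show why stopping at the first conflict saves comparisons.

```python
def identity(a, b):
    """Toy pairwise identity: fraction of matching positions in two
    pre-aligned strings (a stand-in for a real structural comparison)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_low_identity_set(domains, max_identity):
    """Keep a domain only if its identity to every already selected
    domain is at most max_identity. Each candidate is compared with
    selected representatives only, stopping at the first conflict.

    Returns (selected_domains, number_of_comparisons)."""
    selected = []
    comparisons = 0
    for candidate in domains:
        redundant = False
        for representative in selected:
            comparisons += 1
            if identity(candidate, representative) > max_identity:
                redundant = True
                break
        if not redundant:
            selected.append(candidate)
    return selected, comparisons
```

For five toy domains an all-against-all scheme needs 10 pairwise comparisons, while the greedy variant above can finish with fewer. The result of a greedy pass depends on input order, which is one way such a scheme can end up with a slightly smaller set.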

Place, publisher, year, edition, pages
Springer Link, 2007
Series
Lecture Notes in Computer Science, ISSN 0302-9743 (Print), 1611-3349 (Online); Volume 4699/2010
Keywords
Bioinformatics, structure-anchored Hidden Markov Model, saHMM, protein structure alignment, protein structure superimposition, remote homologue, FISH-server, protein annotation
Identifiers
urn:nbn:se:umu:diva-3126 (URN); 10.1007/978-3-540-75755-9_78 (DOI); 2-s2.0-38049019189 (Scopus ID); 978-3-540-75754-2 (ISBN)
Note
Jeanette Hargbo.
Available from: 2008-04-29. Created: 2008-04-29. Last updated: 2025-02-05. Bibliographically approved.
3. OnD-CRF: predicting order and disorder in proteins using [corrected] conditional random fields
2008 (English). In: Bioinformatics, ISSN 1367-4803, E-ISSN 1367-4811, Vol. 24, no. 11, p. 1401-1402. Article in journal (Refereed). Published.
Abstract [en]

MOTIVATION: Order and Disorder prediction using Conditional Random Fields (OnD-CRF) is a new method for accurately predicting the transition between structured and mobile or disordered regions in proteins. OnD-CRF applies CRFs that rely on features generated from the amino acid sequence and from secondary structure prediction. Benchmarking results based on CASP7 targets, and evaluation with respect to several CASP criteria, rank the OnD-CRF model highest among the fully automatic server group. AVAILABILITY: http://babel.ucmp.umu.se/ond-crf/

Place, publisher, year, edition, pages
Oxford: Oxford University Press, 2008
Identifiers
urn:nbn:se:umu:diva-32263 (URN); 10.1093/bioinformatics/btn132 (DOI); 18430742 (PubMedID); 2-s2.0-44349150513 (Scopus ID)
Available from: 2010-03-05. Created: 2010-03-05. Last updated: 2025-02-05. Bibliographically approved.
4. Prediction of protein stability changes due to single amino acid mutations
(English). Manuscript (preprint) (Other academic).
Abstract [en]

Accurate prediction of the change in protein stability due to single amino acid mutations is important for guiding site-directed mutagenesis and other protein-engineering techniques. Recently, several two-state predictors have become available that aim to predict whether a point mutation stabilizes or destabilizes a protein. Considering experimental errors and the tolerance of proteins to mutations, we realized that neutral mutations, which only slightly affect a protein's stability, must be considered as well. Here, we present a new classification scheme for a three-state predictor (destabilizing, neutral and stabilizing mutations) based on multi-class support vector machines (SVMs). We have created a refined training dataset of single amino acid mutations and evaluated the predictive ability of models trained on homology-clustered and non-clustered training data using two different cross-validation procedures. The experimental results reveal a significant difference in prediction accuracy between the evaluation procedures. Furthermore, we demonstrate that, for the non-clustered model, the prediction accuracy based on protein sequence information alone is comparable to the prediction accuracy based on protein structure information. For the clustered model, on the other hand, the prediction ability is significantly improved when protein tertiary structure information is included. The comparison of prediction accuracy for the two models reveals that predicting mutation stability for clustered proteins is still a challenging task. Moreover, benchmarking on previously published datasets demonstrates that our method has improved prediction performance over many established methods.
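Two ideas from this abstract can be sketched in a few lines: assigning one of three stability classes to a mutation, and building cross-validation folds that keep homologous proteins together. The sketch is illustrative only. The neutral band of 0.5 kcal/mol, the sign convention (negative change means destabilizing) and the round-robin fold assignment are assumptions, and the names `stability_class` and `cluster_folds` are hypothetical; the manuscript's actual thresholds and procedures are not given here.

```python
def stability_class(ddg, neutral_band=0.5):
    """Map a stability change (kcal/mol) to one of three classes.
    Assumed convention: negative values destabilize the protein;
    changes within +/- neutral_band count as neutral."""
    if ddg < -neutral_band:
        return "destabilizing"
    if ddg > neutral_band:
        return "stabilizing"
    return "neutral"


def cluster_folds(cluster_ids, n_folds):
    """Assign each sample a fold index so that all samples from the
    same homology cluster land in the same fold. Clusters are dealt
    to folds round-robin in order of first appearance."""
    fold_of_cluster = {}
    folds = []
    for cluster_id in cluster_ids:
        if cluster_id not in fold_of_cluster:
            fold_of_cluster[cluster_id] = len(fold_of_cluster) % n_folds
        folds.append(fold_of_cluster[cluster_id])
    return folds
```

Keeping each cluster inside a single fold means a model is never tested on a close homologue of a training protein, which is one plausible reason why accuracy measured this way can be lower than with a random split.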

Identifiers
urn:nbn:se:umu:diva-33769 (URN)
Available from: 2010-05-06. Created: 2010-05-05. Last updated: 2025-02-07. Bibliographically approved.

Open Access in DiVA

Full text (2272 kB), 1199 downloads
File information
File: FULLTEXT01.pdf. File size: 2272 kB. Checksum: SHA-512
60ca4311f04c0ff6801f4663638e774da93f79ef454a5f65a98062973ef74f0ed16527a98e5b9f8f9cebd352de84b286c85949eb0c811a120652f25473eb4281
Type: fulltext. Mimetype: application/pdf

Person

Wang, Lixiao

Total: 1276 downloads
The number of downloads is the sum of all downloads of all full texts. It may include, for example, earlier versions that are no longer available.

Total: 962 hits