Latent variable based computational methods for applications in life sciences: Analysis and integration of omics data sets
Umeå University, Faculty of Science and Technology, Chemistry.
2008 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

With the increasing availability of high-throughput systems for parallel monitoring of multiple variables, e.g. levels of large numbers of transcripts in functional genomics experiments, massive amounts of data are being collected even from single experiments. Extracting useful information from such data is a non-trivial task that requires powerful computational methods to identify common trends and to help detect the underlying biological patterns. This thesis deals with the general computational problems of classifying and integrating high-dimensional empirical data using a latent variable based modeling approach. The underlying principle of this approach is that a complex system can be described by a few independent components that capture the systematic properties of the system. Such a strategy is well suited for handling noisy, multivariate data sets with strong multicollinearity structures, such as those typically encountered in many biological and chemical applications.
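The latent variable principle can be illustrated with a small simulation. The sketch below is not taken from the thesis: it assumes NumPy, uses plain principal component analysis via the singular value decomposition as a stand-in for a latent variable model, and the function name `explained_variance` and all simulation settings are hypothetical. It generates many collinear variables from two hidden factors and checks how much of the variance the leading components capture.

```python
import numpy as np


def explained_variance(X):
    """Fraction of total variance captured by each principal component.

    X is column-centered inside the function, so the squared singular
    values are proportional to the component variances.
    """
    X_centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    return singular_values**2 / np.sum(singular_values**2)


# Simulated data (illustrative assumption): 200 observed variables that are
# all driven by only 2 latent factors, plus a small amount of noise.
rng = np.random.default_rng(0)
n_samples, n_variables, n_latent = 30, 200, 2
scores = rng.standard_normal((n_samples, n_latent))
loadings = rng.standard_normal((n_latent, n_variables))
noise = 0.01 * rng.standard_normal((n_samples, n_variables))
X = scores @ loadings + noise

fractions = explained_variance(X)
print("variance captured by the first two components:", fractions[:2].sum())
```

Although the simulated matrix has 200 columns, almost all of its systematic variation lies in two directions, which is the situation in which latent variable methods are expected to work well.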

The main foci of the studies this thesis is based upon are applications and extensions of the orthogonal projections to latent structures (OPLS) method in life science contexts. OPLS is a latent variable based regression method that separately describes systematic sources of variation that are related and unrelated to the modeling aim (for instance, classifying two different categories of samples). This separation of sources of variation can be used to pre-process data, but also has distinct advantages for model interpretation, as exemplified throughout the work. For classification cases, a probabilistic framework for OPLS has been developed that allows the incorporation of both variance and covariance into classification decisions. This can be seen as a unification of two historical classification paradigms based on either variance or covariance. In addition, a non-linear reformulation of the OPLS algorithm is outlined, which is useful for particularly complex regression or classification tasks.
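The separation of related and unrelated variation can be sketched for the simplest case. The code below is a minimal illustration, assuming NumPy, a single response vector `y`, column-centered data and exactly one orthogonal component; it follows the commonly described single-response OPLS filtering steps, and the function name and simulated data are hypothetical rather than taken from the thesis.

```python
import numpy as np


def opls_one_orthogonal_component(X, y):
    """Remove one y-orthogonal component from column-centered X.

    Returns the filtered matrix together with the orthogonal score vector
    t_o and loading vector p_o. Assumes X has variation that is not
    aligned with y (otherwise w_o below would have zero length).
    """
    # Predictive weight vector: direction in X that covaries with y.
    w = X.T @ y
    w = w / np.linalg.norm(w)
    t = X @ w                      # predictive scores
    p = X.T @ t / (t @ t)          # predictive loadings

    # The part of the loading that is orthogonal to w captures systematic
    # variation in X that is unrelated to y.
    w_o = p - (w @ p) * w
    w_o = w_o / np.linalg.norm(w_o)
    t_o = X @ w_o                  # y-orthogonal scores
    p_o = X.T @ t_o / (t_o @ t_o)  # y-orthogonal loadings

    X_filtered = X - np.outer(t_o, p_o)
    return X_filtered, t_o, p_o


# Simulated example (illustrative assumption): X contains one source of
# variation related to y and one unrelated, structured source.
rng = np.random.default_rng(1)
n = 40
y = rng.standard_normal(n)
y = y - y.mean()
unrelated = rng.standard_normal(n)
X = (np.outer(y, rng.standard_normal(10))
     + np.outer(unrelated, rng.standard_normal(10))
     + 0.1 * rng.standard_normal((n, 10)))
X = X - X.mean(axis=0)

X_filtered, t_o, p_o = opls_one_orthogonal_component(X, y)
print("covariance between orthogonal scores and y:", t_o @ y)
```

By construction the orthogonal scores carry no covariance with `y`, so removing them changes the data without discarding the variation used for prediction. Further orthogonal components would be obtained by repeating the steps on the filtered matrix.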

The general trend in functional genomics studies in the post-genomics era is to perform increasingly comprehensive characterizations of organisms in order to study the associations between their molecular and cellular components in greater detail. Frequently, abundances of all transcripts, proteins and metabolites are measured simultaneously in an organism at a given state or over time. In this work, a generalization of OPLS is described for the analysis of multiple data sets. It is shown that this method can be used to integrate data in functional genomics experiments by separating the systematic variation that is common to all data sets considered from sources of variation that are specific to each data set.

Abstract [sv]

Functional genomics is a research field whose ultimate goal is to characterize all genes in the genome of an organism. This includes studies of how DNA is transcribed into mRNA, how the mRNA is then translated into proteins, and how these proteins interact and affect the biochemical processes of the organism. The traditional approach has been to study the function, regulation and translation of one gene at a time. New technology in the field has, however, made it possible to study how thousands of transcripts, proteins and small molecules behave jointly in an organism at a given time point or over time. In practice, this also means that large amounts of data are generated even from small, isolated experiments. Finding global trends and extracting useful information from such data sets is a non-trivial computational problem that requires advanced and interpretable mathematical models.

This thesis describes the development and application of various computational methods for classifying and integrating large amounts of empirical (measured) data. Common to all the methods is that they are based on latent variables: variables that are not measured directly but are calculated from other, observed variables. This concept is well suited to studies of complex systems that can be described by a few independent factors that characterize the main properties of the system, which is typical of many chemical and biological systems. The methods described in the thesis are general but have mainly been developed for and applied to data from biological experiments.

The thesis demonstrates how these methods can be used to find complex relationships between measured data and other factors of interest without losing the properties of the method that are critical for interpreting the results. The methods are applied to find shared and unique properties of the regulation of transcripts, and how these are affected by and affect small molecules, in the poplar tree. In addition, a larger experiment in poplar is described in which the relationship between levels of transcripts, proteins and small molecules is investigated using the developed methods.

Place, publisher, year, edition, pages
Umeå: Kemi, 2008. 50 p.
Keyword [en]
Chemometrics, orthogonal projections to latent structures, OPLS, O2PLS, K-OPLS, kernel-based, non-linear, regression, classification
Keyword [la]
Populus
National Category
Chemical Sciences
Identifiers
URN: urn:nbn:se:umu:diva-1616; ISBN: 978-91-7264-541-7 (print); OAI: oai:DiVA.org:umu-1616; DiVA: diva2:141568
Public defence
2008-05-09, KB3A9, KBC-huset, Umeå Universitet, Umeå, 10:00 (English)
Available from: 2008-04-17 Created: 2008-04-17 Last updated: 2009-06-18. Bibliographically approved
List of papers
1. OPLS discriminant analysis: combining the strengths of PLS-DA and SIMCA classification
2006 (Swedish). In: Journal of Chemometrics, ISSN 0886-9383, E-ISSN 1099-128X, Vol. 20, no 8-10, p. 341-351. Article in journal (Refereed). Published
Abstract [en]

The characteristics of the OPLS method have been investigated for the purpose of discriminant analysis (OPLS-DA). We demonstrate how class-orthogonal variation can be exploited to augment classification performance in cases where the individual classes exhibit divergence in within-class variation, in analogy with soft independent modelling of class analogy (SIMCA) classification. The prediction results will be largely equivalent to traditional supervised classification using PLS-DA if no such variation is present in the classes. A discriminatory strategy is thus outlined, combining the strengths of PLS-DA and SIMCA classification within the framework of the OPLS-DA method. Furthermore, resampling methods have been employed to generate distributions of predicted classification results and subsequently assess classification belief. This enables utilisation of the class-orthogonal variation in a proper statistical context. The proposed decision rule is compared to common decision rules and is shown to produce comparable or less class-biased classification results.

Keyword
OPLS-DA, orthogonal, multivariate, classification, PLS-DA, SIMCA
National Category
Biological Sciences
Identifiers
urn:nbn:se:umu:diva-13140 (URN); doi:10.1002/cem.1006 (DOI)
Available from: 2007-05-03 Created: 2007-05-03 Last updated: 2017-12-14
2. Kernel-based orthogonal projections to latent structures (K-OPLS)
2007 (English). In: Journal of Chemometrics, ISSN 0886-9383, E-ISSN 1099-128X, Vol. 21, no 7-9, p. 379-385. Article in journal (Refereed). Published
Abstract [en]

The orthogonal projections to latent structures (OPLS) method has been successfully applied in various chemical and biological systems for modeling and interpretation of linear relationships between a descriptor matrix and response matrix. A kernel-based reformulation of the original OPLS algorithm is presented where the kernel Gram matrix is utilized as a replacement for the descriptor matrix. This enables usage of the kernel trick to efficiently transform the data into a higher-dimensional feature space where predictive and response-orthogonal components are calculated. This strategy has the capacity to improve predictive performance considerably in situations where strong non-linear relationships exist between descriptor and response variables while retaining the OPLS model framework. We put particular focus on describing properties of the rearranged algorithm in relation to the original OPLS algorithm. Four separate problems, two simulated and two real spectroscopic data sets, are employed to illustrate how the algorithm enables separate modeling of predictive and response-orthogonal variation in the feature space. This separation can be highly beneficial for model interpretation purposes while providing a flexible framework for supervised regression.
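The idea of using a kernel Gram matrix in place of the descriptor matrix can be shown in a few lines. The sketch below is not the K-OPLS algorithm itself; it assumes NumPy, and the helper names, the Gaussian kernel width `sigma` and the random data are hypothetical. It only demonstrates that, for a linear kernel, a score vector computed from the Gram matrix equals the one computed from the descriptor matrix, and that a non-linear kernel can be substituted without changing the rest of the calculation.

```python
import numpy as np


def linear_kernel(X):
    """Gram matrix of inner products between all pairs of samples."""
    return X @ X.T


def gaussian_kernel(X, sigma=1.0):
    """Gaussian (RBF) Gram matrix; sigma is the kernel width."""
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against rounding below zero
    return np.exp(-sq_dists / (2.0 * sigma**2))


def center_kernel(K):
    """Center the data in feature space using only the Gram matrix."""
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return J @ K @ J


rng = np.random.default_rng(2)
X = rng.standard_normal((20, 5))
X = X - X.mean(axis=0)
y = rng.standard_normal(20)
y = y - y.mean()

# With a linear kernel the (unnormalized) predictive scores X X^T y can be
# computed from the Gram matrix alone, without access to X.
t_from_X = X @ (X.T @ y)
t_from_K = linear_kernel(X) @ y

# A non-linear kernel is swapped in the same way; the scores then live in
# an implicit higher-dimensional feature space.
K_rbf = center_kernel(gaussian_kernel(X, sigma=2.0))
t_nonlinear = K_rbf @ y
print("linear scores agree:", np.allclose(t_from_X, t_from_K))
```

Because every later step needs only inner products between samples, the feature space never has to be constructed explicitly, which is what makes the kernel trick computationally attractive.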

Keyword
K-OPLS, kernel methods, non-linear, OSC, OPLS, SVM, Kernel PLS
National Category
Biological Sciences
Identifiers
urn:nbn:se:umu:diva-16181 (URN); doi:10.1002/cem.1071 (DOI)
Available from: 2007-08-28 Created: 2007-08-28 Last updated: 2017-12-14
3. Orthogonal projections to latent structures as a strategy for microarray data normalization
2007 (English). In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 8, no 207. Article in journal (Refereed). Published
Abstract [en]

Background

During generation of microarray data, various forms of systematic biases are frequently introduced, which limit the accuracy and precision of the results. In order to properly estimate biological effects, these biases must be identified and discarded.

Results

We introduce a normalization strategy for multi-channel microarray data based on orthogonal projections to latent structures (OPLS), a multivariate regression method. The effect of applying the normalization methodology on single-channel Affymetrix data as well as dual-channel cDNA data is illustrated. We provide a parallel comparison to a wide range of commonly employed normalization methods with diverse properties and strengths based on sensitivity and specificity from external (spike-in) controls. On the illustrated data sets, the OPLS normalization strategy exhibits leading average true negative and true positive rates in comparison to other evaluated methods.

Conclusions

The OPLS methodology identifies joint variation within biological samples to enable the removal of sources of variation that are non-correlated (orthogonal) to the within-sample variation. This ensures that structured variation related to the underlying biological samples is separated from the remaining, bias-related sources of systematic variation. As a consequence, the methodology does not require any explicit knowledge regarding the presence or characteristics of certain biases. Furthermore, there is no underlying assumption that the majority of elements should be non-differentially expressed, making it applicable to specialized boutique arrays.

National Category
Biological Sciences
Identifiers
urn:nbn:se:umu:diva-14868 (URN); doi:10.1186/1471-2105-8-207 (DOI)
Available from: 2007-08-14 Created: 2007-08-14 Last updated: 2017-12-14
4. Data integration in plant biology: the O2PLS method for combined modeling of transcript and metabolite data
2007 (English). In: The Plant Journal, ISSN 0960-7412, E-ISSN 1365-313X, Vol. 52, no 6, p. 1181-1191. Article in journal (Refereed). Published
Abstract [en]

The technological advances in the instrumentation employed in life sciences have enabled the collection of a virtually unlimited quantity of data from multiple sources. By gathering data from several analytical platforms, with the aim of parallel monitoring of, e.g. transcriptomic, metabolomic or proteomic events, one hopes to answer and understand biological questions and observations. This 'systems biology' approach typically involves advanced statistics to facilitate the interpretation of the data. In the present study, we demonstrate that the O2PLS multivariate regression method can be used for combining 'omics' types of data. With this methodology, systematic variation that overlaps across analytical platforms can be separated from platform-specific systematic variation. A study of Populus tremula x Populus tremuloides, investigating short-day-induced effects at transcript and metabolite levels, is employed to demonstrate the benefits of the methodology. We show how the models can be validated and interpreted to identify biologically relevant events, and discuss the results in relation to a pairwise univariate correlation approach and principal component analysis.
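The notion of variation that overlaps across platforms can be sketched as follows. This is a simplified illustration of the joint part only, assuming NumPy and that joint directions are taken from the singular value decomposition of the cross-product matrix between two column-centered blocks; the estimation of platform-specific orthogonal components, which a full O2PLS model includes, is omitted. The function name and the random example data are hypothetical.

```python
import numpy as np


def joint_components(X, Y, n_joint=1):
    """Extract joint score vectors from two column-centered data blocks.

    The leading singular vectors of X^T Y give weight vectors W (for X)
    and C (for Y) whose scores T = XW and U = YC have maximal covariance.
    The residuals hold what is left in each block after projecting onto
    the joint weights: block-specific structure plus noise.
    """
    left, singular_values, right_t = np.linalg.svd(X.T @ Y, full_matrices=False)
    W = left[:, :n_joint]
    C = right_t[:n_joint].T
    T = X @ W
    U = Y @ C
    X_residual = X - T @ W.T
    Y_residual = Y - U @ C.T
    return T, U, W, C, singular_values[:n_joint], X_residual, Y_residual


# Illustrative data: two blocks measured on the same 25 samples, e.g. a
# transcript block with 8 variables and a metabolite block with 6.
rng = np.random.default_rng(3)
shared = rng.standard_normal(25)
X = np.outer(shared, rng.standard_normal(8)) + rng.standard_normal((25, 8))
Y = np.outer(shared, rng.standard_normal(6)) + rng.standard_normal((25, 6))
X = X - X.mean(axis=0)
Y = Y - Y.mean(axis=0)

T, U, W, C, s, X_residual, Y_residual = joint_components(X, Y, n_joint=1)
print("cross-product of joint scores:", (T.T @ U).item())
print("leading singular value:", s[0])
```

The cross-product of the first pair of joint scores equals the leading singular value, which is the sense in which these directions describe the variation the two platforms share.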

Keyword
combined profiling, O2PLS, chemometrics, Populus, multivariate regression
National Category
Biological Sciences
Identifiers
urn:nbn:se:umu:diva-17156 (URN); doi:10.1111/j.1365-313X.2007.03293.x (DOI); 17931352 (PubMedID)
Available from: 2007-12-13 Created: 2007-12-13 Last updated: 2017-12-14
5. Integrated analysis of transcript, protein and metabolite data to study lignin biosynthesis in hybrid aspen
2009 (English). In: Journal of Proteome Research, ISSN 1535-3893, E-ISSN 1535-3907, Vol. 8, no 1, p. 199-210. Article in journal (Refereed). Published
Abstract [en]

Tree biotechnology will soon reach a mature state where it will influence the overall supply of fiber, energy and wood products. We are now ready to make the transition from identifying candidate genes, controlling important biological processes, to discovering the detailed molecular function of these genes on a broader, more holistic, systems biology level. In this paper, a strategy is outlined for informative data generation and integrated modeling of systematic changes in transcript, protein and metabolite profiles measured from hybrid aspen samples. The aim is to study characteristics of common changes in relation to genotype-specific perturbations affecting the lignin biosynthesis and growth. We show that a considerable part of the systematic effects in the system can be tracked across all platforms and that the approach has a high potential value in functional characterization of candidate genes.

Keyword
Combined profiling, O2PLS, Chemometrics, Populus, Lignin biosynthesis
National Category
Bioinformatics and Systems Biology
Identifiers
urn:nbn:se:umu:diva-11303 (URN); 10.1021/pr800298s (DOI)
Available from: 2009-01-13 Created: 2009-01-13 Last updated: 2017-12-14
