Orthogonal projections to latent structures as a strategy for microarray data normalization
Umeå University, Faculty of Science and Technology, Department of Chemistry.
Umeå University, Faculty of Science and Technology, Umeå Plant Science Centre (UPSC).
Umeå University, Faculty of Science and Technology, Department of Plant Physiology. Umeå University, Faculty of Science and Technology, Umeå Plant Science Centre (UPSC).
Umeå University, Faculty of Science and Technology, Department of Plant Physiology. Umeå University, Faculty of Science and Technology, Umeå Plant Science Centre (UPSC). ORCID iD: 0000-0002-7906-6891
2007 (English) In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 8, no 207. Article in journal (Refereed). Published
Abstract [en]

Background

During the generation of microarray data, various forms of systematic bias are frequently introduced, limiting the accuracy and precision of the results. To estimate biological effects properly, these biases must be identified and removed.

Results

We introduce a normalization strategy for multi-channel microarray data based on orthogonal projections to latent structures (OPLS), a multivariate regression method. The effect of applying the normalization methodology to single-channel Affymetrix data as well as dual-channel cDNA data is illustrated. We provide a parallel comparison to a wide range of commonly employed normalization methods with diverse properties and strengths, based on sensitivity and specificity estimated from external (spike-in) controls. On the illustrated data sets, the OPLS normalization strategy exhibits leading average true negative and true positive rates in comparison to the other evaluated methods.
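
As a rough illustration of the evaluation criteria mentioned above, sensitivity (true positive rate) and specificity (true negative rate) can be computed from spike-in controls whose differential expression status is known by design. The sketch below is not taken from the paper; the function name `tpr_tnr` and the toy labels are hypothetical and serve only to show the arithmetic.

```python
def tpr_tnr(called, truth):
    """Return (true positive rate, true negative rate).

    called: iterable of bools, True if a spike-in was called differentially expressed
    truth:  iterable of bools, True if the spike-in really is differentially expressed
    """
    pairs = list(zip(called, truth))
    tp = sum(1 for c, t in pairs if c and t)          # correctly called changes
    fn = sum(1 for c, t in pairs if not c and t)      # missed changes
    tn = sum(1 for c, t in pairs if not c and not t)  # correctly called non-changes
    fp = sum(1 for c, t in pairs if c and not t)      # false calls
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy example: 3 spiked-in changes, 5 unchanged controls.
truth = [True, True, True, False, False, False, False, False]
called = [True, True, False, False, False, False, True, False]
sensitivity, specificity = tpr_tnr(called, truth)
print(sensitivity, specificity)
```

Averaging these two rates over data sets gives the kind of summary figure on which the compared normalization methods are ranked.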

Conclusions

The OPLS methodology identifies joint variation within biological samples to enable the removal of sources of variation that are non-correlated (orthogonal) to the within-sample variation. This ensures that structured variation related to the underlying biological samples is separated from the remaining, bias-related sources of systematic variation. As a consequence, the methodology does not require any explicit knowledge regarding the presence or characteristics of certain biases. Furthermore, there is no underlying assumption that the majority of elements should be non-differentially expressed, making it applicable to specialized boutique arrays.
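
The separation described above can be sketched with a single-response OPLS filter, following the commonly described OPLS steps: estimate the direction in the data that covaries with the response, then remove structured variation orthogonal to it. This is a minimal sketch under stated assumptions, not the authors' implementation. The function name `opls_filter`, the single response vector `y` (a stand-in for the sample design) and the simulated data are all hypothetical.

```python
import numpy as np


def opls_filter(X, y, n_ortho=1):
    """Remove n_ortho y-orthogonal components from X (single-response OPLS sketch).

    X: (samples x variables) data matrix; y: (samples,) response vector.
    Returns the filtered, column-centred X and the orthogonal score vectors.
    """
    X = X - X.mean(axis=0)            # column-centre the data
    y = y - y.mean()                  # centre the response
    w = X.T @ y                       # predictive weight vector
    w = w / np.linalg.norm(w)
    ortho_scores = []
    for _ in range(n_ortho):
        t = X @ w                     # predictive scores
        p = X.T @ t / (t @ t)         # predictive loadings
        w_o = p - (w @ p) * w         # part of the loadings orthogonal to w
        w_o = w_o / np.linalg.norm(w_o)
        t_o = X @ w_o                 # y-orthogonal scores
        p_o = X.T @ t_o / (t_o @ t_o) # y-orthogonal loadings
        X = X - np.outer(t_o, p_o)    # deflate: remove the orthogonal component
        ortho_scores.append(t_o)
    return X, np.column_stack(ortho_scores)


# Simulated toy data (assumption): 20 samples in two groups, 50 variables,
# one structured bias unrelated to the grouping, plus noise.
rng = np.random.default_rng(0)
y = np.repeat([0.0, 1.0], 10)
bias = rng.normal(size=20)
X = (np.outer(y, rng.normal(size=50))
     + np.outer(bias, rng.normal(size=50))
     + 0.1 * rng.normal(size=(20, 50)))
X_filtered, T_ortho = opls_filter(X, y, n_ortho=1)
```

The removed score vectors are orthogonal to the centred response by construction, which is why no prior knowledge of the bias source is needed in this sketch.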

Place, publisher, year, edition, pages
2007. Vol. 8, no 207
National Category
Biological Sciences
Identifiers
URN: urn:nbn:se:umu:diva-14868
DOI: 10.1186/1471-2105-8-207
OAI: oai:DiVA.org:umu-14868
DiVA: diva2:154540
Available from: 2007-08-14 Created: 2007-08-14 Last updated: 2017-12-14
In thesis
1. Populus transcriptomics: from noise to biology
2007 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [sv] (English translation)

Microarrays are nowadays not just about generating gene expression data at a rapid pace; it is at least as important to handle the information efficiently afterwards. This thesis presents a workflow for measuring, storing and analysing gene expression data in aspen and poplar (Populus spp.). A Populus microarray database, UPSC-BASE, available to everyone interested, was developed to collect and store gene expression data. Several analysis tools were made available at the same time, to enable a smooth workflow from raw data to biological conclusions.

One of the major challenges in microarray analysis is to distinguish noise from valuable biological information. Studying trees that grow outdoors is complex because they interact with their surroundings to a much greater extent than is the case in the controlled environment of the greenhouse. This work shows that it is possible, with the help of advanced statistics and good experimental design, to follow and compare gene expression in leaves from aspens growing outdoors over several years and to draw valuable conclusions about gene regulation.

The biological information stored in UPSC-BASE is intended to be a valuable resource for the plant field at large. The database contains almost a hundred different experiments, covering everything from young leaves to wood, insect attack, cold and drought, as well as studies of genetically modified plants. The information can be used both for comparative studies across different aspen experiments and for comparison with other plants. To illustrate the possibilities, genes in leaves were studied and grouped based on how they behave across all these experiments. These groupings were then used to define genes that are important in leaf development. In summary, the work presented in this thesis provides access to tools and knowledge for large-scale studies of gene expression, and the stored information has proven to be a valuable resource for more in-depth studies of gene regulation.

Abstract [en]

DNA microarray analysis today is not just the generation of high-throughput data; much more attention is paid to the subsequent efficient handling of the generated information. In this thesis, a pipeline to generate, store and analyse Populus transcriptional data is presented. A public Populus microarray database, UPSC-BASE, was developed to gather and store transcriptomic data. In addition, several tools were provided to facilitate microarray analysis without requirements for expert-level knowledge. The aim has been to streamline the workflow from raw data through to biological interpretation.

Differentiating noise from valuable biological information is one of the challenges in DNA microarray analysis. Studying gene regulation in free-growing aspen trees represents a complex analysis scenario as the trees are exposed to, and interacting with, the environment to a much higher extent than under highly controlled conditions in the greenhouse. This work shows that, by using multivariate statistics and experimental planning, it is possible to follow and compare gene expression in leaves from multiple growing seasons, and draw valuable conclusions about gene expression from field-grown samples.

The biological information in UPSC-BASE is intended to be a valuable transcriptomic resource for the wider plant community as well. The database provides information from almost a hundred different experiments, spanning different developmental stages, tissue types, abiotic and biotic stresses and mutants. The information can potentially be used both for cross-experiment analysis and for comparisons against other plants, such as Arabidopsis or rice. As a demonstration of this, microarray experiments performed on Populus leaves were merged, and genes preferentially expressed in leaves were organised into regulons of co-regulated genes. Those regulons were used to define genes of importance in leaf development in Populus. Taken together, the work presented in this thesis provides tools and knowledge for large-scale transcriptional studies, and the stored gene expression information has proven to be a valuable resource for in-depth studies of gene regulation.

Place, publisher, year, edition, pages
Umeå: Fysiologisk botanik, 2007. 50 p.
Keyword
Populus, microarray, transcriptomics, UPSC-BASE, leaf development
National Category
Bioinformatics (Computational Biology)
Identifiers
URN: urn:nbn:se:umu:diva-1423
ISBN: 978-91-7264-442-7
Public defence
2007-11-30, Hörsal, Arbetslivsinstitutet, Umeå universitet, 10:00 (English)
Available from: 2007-11-07 Created: 2007-11-07 Last updated: 2009-12-03 Bibliographically approved
2. Latent variable based computational methods for applications in life sciences: Analysis and integration of omics data sets
2008 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

With the increasing availability of high-throughput systems for parallel monitoring of multiple variables, e.g. levels of large numbers of transcripts in functional genomics experiments, massive amounts of data are being collected even from single experiments. Extracting useful information from such systems is a non-trivial task that requires powerful computational methods to identify common trends and to help detect the underlying biological patterns. This thesis deals with the general computational problems of classifying and integrating high-dimensional empirical data using a latent variable based modeling approach. The underlying principle of this approach is that a complex system can be characterized by a few independent components that characterize the systematic properties of the system. Such a strategy is well suited for handling noisy, multivariate data sets with strong multicollinearity structures, such as those typically encountered in many biological and chemical applications.

The main foci of the studies this thesis is based upon are applications and extensions of the orthogonal projections to latent structures (OPLS) method in life science contexts. OPLS is a latent variable based regression method that separately describes systematic sources of variation that are related and unrelated to the modeling aim (for instance, classifying two different categories of samples). This separation of sources of variation can be used to pre-process data, but also has distinct advantages for model interpretation, as exemplified throughout the work. For classification cases, a probabilistic framework for OPLS has been developed that allows the incorporation of both variance and covariance into classification decisions. This can be seen as a unification of two historical classification paradigms based on either variance or covariance. In addition, a non-linear reformulation of the OPLS algorithm is outlined, which is useful for particularly complex regression or classification tasks.

The general trend in functional genomics studies in the post-genomics era is to perform increasingly comprehensive characterizations of organisms in order to study the associations between their molecular and cellular components in greater detail. Frequently, abundances of all transcripts, proteins and metabolites are measured simultaneously in an organism at a given point in time or over time. In this work, a generalization of OPLS is described for the analysis of multiple data sets. It is shown that this method can be used to integrate data in functional genomics experiments by separating the systematic variation that is common to all data sets considered from sources of variation that are specific to each data set.

Abstract [sv] (English translation)

Functional genomics is a research field with the ultimate goal of characterizing all genes in the genome of an organism. This includes studies of how DNA is transcribed into mRNA, how it is then translated into proteins and how these proteins interact and affect the organism's biochemical processes. The traditional approach has been to study the function, regulation and translation of one gene at a time. New technology in the field has, however, made it possible to study how thousands of transcripts, proteins and small molecules behave jointly in an organism at a given point in time or over time. In practice, this also means that large amounts of data are generated even from small, isolated experiments. Finding global trends and extracting useful information from such data sets is a non-trivial computational problem that requires advanced and interpretable mathematical models.

This thesis describes the development and application of various computational methods for classifying and integrating large amounts of empirical (measured) data. Common to all the methods is that they are based on latent variables: variables that have not been measured directly but have been calculated from other, observed variables. This concept is well suited to studies of complex systems that can be described by a small number of independent factors that characterize the main properties of the system, which is typical of many chemical and biological systems. The methods described in the thesis are general but were mainly developed for and applied to data from biological experiments.

The thesis demonstrates how these methods can be used to find complex relationships between measured data and other factors of interest, without losing the properties of the method that are critical for interpreting the results. The methods are applied to find common and unique properties of the regulation of transcripts and how these are affected by, and affect, small molecules in the poplar tree. In addition, a larger experiment in poplar is described in which the relationship between levels of transcripts, proteins and small molecules is investigated using the developed methods.

Place, publisher, year, edition, pages
Umeå: Kemi, 2008. 50 p.
Keyword
Chemometrics, orthogonal projections to latent structures, OPLS, O2PLS, K-OPLS, kernel-based, non-linear, regression, classification, Populus
National Category
Chemical Sciences
Identifiers
URN: urn:nbn:se:umu:diva-1616
ISBN: 978-91-7264-541-7
Public defence
2008-05-09, KB3A9, KBC-huset, Umeå Universitet, Umeå, 10:00 (English)
Available from: 2008-04-17 Created: 2008-04-17 Last updated: 2009-06-18 Bibliographically approved

By author/editor
Bylesjö, Max; Sjödin, Andreas; Jansson, Stefan; Trygg, Johan