Normalization and analysis of high-dimensional genomics data
Umeå University, Faculty of Science and Technology, Department of Mathematics and Mathematical Statistics.
2012 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In the mid-1990s, microarray technology was introduced, allowing genome-wide analysis of gene expression in a single experiment. Since then, similar high-throughput methods have been developed in other fields of molecular biology. These high-throughput methods provide measurements for hundreds up to millions of variables in a single experiment, and rigorous data analysis is necessary to answer the underlying biological questions.

Further complications arise because the complexity of the experimental procedures introduces technical variation into the data. This technical variation needs to be removed in order to draw relevant biological conclusions, and the process of removing it is referred to as normalization or pre-processing. During the last decade, a large number of normalization and data analysis methods have been proposed.

In this thesis, data from two types of high-throughput methods are used to evaluate the effect pre-processing methods have on downstream analyses. In areas where problems with current methods are identified, novel normalization methods are proposed. The evaluations of known and novel methods are performed on simulated data, real data, and data from an in-house produced spike-in experiment.

Place, publisher, year, edition, pages
Umeå: Umeå Universitet, 2012. 43 p.
Keyword [en]
normalization, pre-processing, microarray, downstream analysis, evaluation, sensitivity, bias, genomics data, gene expression, spike-in data, ChIP-chip
National Category
Bioinformatics and Systems Biology; Probability Theory and Statistics
Research subject
Mathematical Statistics
Identifiers
URN: urn:nbn:se:umu:diva-53486; ISBN: 978-91-7459-402-7 (print); OAI: oai:DiVA.org:umu-53486; DiVA: diva2:512711
Public defence
2012-04-20, MA121, Mit-huset, Umeå Universitet, Umeå, 19:25 (English)
Available from: 2012-03-30. Created: 2012-03-28. Last updated: 2012-04-02. Bibliographically approved.
List of papers
1. Evaluation of microarray data normalization procedures using spike-in experiments
2006 (English). In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 7, no. 300, 17 p. Article in journal (Refereed). Published.
Abstract [en]

Background: Recently, a large number of methods for the analysis of microarray data have been proposed but there are few comparisons of their relative performances. By using so-called spike-in experiments, it is possible to characterize the analyzed data and thereby enable comparisons of different analysis methods.

Results: A spike-in experiment using eight in-house produced arrays was used to evaluate established and novel methods for filtration, background adjustment, scanning, channel adjustment, and censoring. The S-plus package EDMA, a stand-alone tool providing characterization of analyzed cDNA-microarray data obtained from spike-in experiments, was developed and used to evaluate 252 normalization methods. For all analyses, the sensitivities at low false positive rates were observed together with estimates of the overall bias and the standard deviation. In general, there was a trade-off between the ability of the analyses to identify differentially expressed genes (i.e. their sensitivities) and their ability to provide unbiased estimators of the desired ratios. Virtually all analyses underestimated the magnitude of the regulations; often less than 50% of the true regulation was observed. Moreover, the bias depended on the underlying mRNA concentration; low concentrations resulted in high bias. Many of the analyses had relatively low sensitivities, but analyses that used either the constrained model (i.e. a procedure that combines data from several scans) or partial filtration (a novel method for treating data from so-called not-found spots) had, with few exceptions, high sensitivities. These methods gave considerably higher sensitivities than some commonly used analysis methods.

Conclusion: The use of spike-in experiments is a powerful approach for evaluating microarray preprocessing procedures. Analyzed data are characterized by properties of the observed log-ratios and by the analysis' ability to detect differentially expressed genes. If bias is not a major problem, we recommend the use of either the CM-procedure or partial filtration.
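The evaluation criteria used above (sensitivity at a low false positive rate, and bias of the estimated log-ratios) can be computed directly when the true regulations are known, as in a spike-in experiment. The following is a minimal sketch of such a computation; it is not the EDMA package, and the ranking by absolute log-ratio is an assumption made purely for illustration.

```python
import numpy as np

def sensitivity_at_fpr(log_ratios, is_spiked, fpr=0.01):
    """Fraction of truly regulated (spiked-in) genes found among the top-ranked
    genes when the list is cut so that at most `fpr` of the unregulated genes
    are called. Ranking by |log-ratio| is an illustrative assumption."""
    order = np.argsort(-np.abs(log_ratios))           # strongest apparent regulation first
    spiked = np.asarray(is_spiked, bool)[order]
    false_pos = np.cumsum(~spiked)                    # unregulated genes called so far
    allowed = int(fpr * (~spiked).sum())              # false-positive budget
    cutoff = np.searchsorted(false_pos, allowed + 1)  # longest list within the budget
    return spiked[:cutoff].sum() / spiked.sum()

def bias(log_ratios, true_log_ratios, is_spiked):
    """Average difference between observed and known (spiked) log2 ratios;
    negative values mean the regulation is underestimated."""
    m = np.asarray(is_spiked, bool)
    return np.mean(np.asarray(log_ratios)[m] - np.asarray(true_log_ratios)[m])
```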

 

Place, publisher, year, edition, pages
London: BioMed Central Ltd, 2006
Keyword
microarray, data analysis, normalization, evaluation, spike-in experiments
National Category
Computational Mathematics
Research subject
Mathematical Statistics
Identifiers
urn:nbn:se:umu:diva-30806 (URN); 10.1186/1471-2105-7-300 (DOI)
Available from: 2010-01-19. Created: 2010-01-18. Last updated: 2017-12-12. Bibliographically approved.
2. MC-normalization: a novel method for dye-normalization of two-channel microarray data
2009 (English). In: Statistical Applications in Genetics and Molecular Biology, ISSN 1544-6115, E-ISSN 1544-6115, Vol. 8, no. 1, Article 42. Article in journal (Refereed). Published.
Abstract [en]

Motivation: Pre-processing plays a vital role in two-color microarray data analysis. An analysis is characterized by its ability to identify differentially expressed genes (its sensitivity) and its ability to provide unbiased estimators of the true regulation (its bias). It has been shown that microarray experiments regularly underestimate the true regulation of differentially expressed genes. We introduce the MC-normalization, where C stands for channel-wise normalization, which has considerably lower bias than the commonly used standard methods.

Methods: The idea behind the MC-normalization is that the channels' individual intensities determine the correction, rather than the average intensity, which is the case for the widely used MA-normalization. The two methods were evaluated using spike-in data from an in-house produced cDNA experiment and a publicly available Agilent experiment. The methods were applied to background-corrected and non-background-corrected data. For the cDNA experiment, the methods were either applied separately to the data from each of the print-tips or applied to the complete array data. Altogether 24 analyses were evaluated. For each analysis the sensitivity, the bias and two variance measures were estimated.

Results: We prove that the MC-normalization has lower bias than the MA-normalization. The spike-in data confirmed the theoretical result and suggest that the difference is significant. Furthermore, the empirical data suggest that the MC- and MA-normalizations have similar sensitivity. A striking result is that print-tip normalizations had considerably higher sensitivity than analyses using the complete array data.
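As a rough illustration of the distinction drawn above, the sketch below contrasts a standard MA-type lowess normalization (where the correction is a function of the average intensity A) with a channel-wise correction in which each channel is adjusted against its own intensity. The channel-wise function is an assumed stand-in for the idea behind the MC-normalization, not the published algorithm; statsmodels' lowess is used for the trend fits.

```python
import numpy as np
import statsmodels.api as sm

def ma_normalize(red, green, frac=0.3):
    """Classic intensity-dependent (lowess) normalization on the MA scale:
    the correction of M depends on the average intensity A."""
    m = np.log2(red) - np.log2(green)
    a = 0.5 * (np.log2(red) + np.log2(green))
    trend = sm.nonparametric.lowess(m, a, frac=frac, return_sorted=False)
    return m - trend                                   # corrected log-ratios

def channelwise_normalize(red, green, frac=0.3):
    """Illustrative channel-wise correction (NOT the published MC algorithm):
    each channel's deviation from the spot average is modelled as a function of
    that channel's own intensity and removed."""
    r, g = np.log2(red), np.log2(green)
    target = 0.5 * (r + g)
    r_c = r - sm.nonparametric.lowess(r - target, r, frac=frac, return_sorted=False)
    g_c = g - sm.nonparametric.lowess(g - target, g, frac=frac, return_sorted=False)
    return r_c - g_c                                   # corrected log-ratios
```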

Place, publisher, year, edition, pages
Berkeley: The Berkeley Electronic Press (bepress), 2009
Keyword
microarray analysis, dye-normalization, background correction, gene expression, spike-in data, agilent
National Category
Bioinformatics (Computational Biology)
Research subject
Mathematical Statistics; Statistics
Identifiers
urn:nbn:se:umu:diva-26566 (URN); 10.2202/1544-6115.1459 (DOI)
Available from: 2009-12-15. Created: 2009-10-15. Last updated: 2017-12-12. Bibliographically approved.
3. Challenges in microarray class discovery: a comprehensive examination of normalization, gene selection and clustering
2010 (English). In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 11, 503. Article in journal (Refereed). Published.
Abstract [en]

Background: Cluster analysis, and in particular hierarchical clustering, is widely used to extract information from gene expression data. The aim is to discover new classes, or sub-classes, of either individuals or genes. Performing a cluster analysis commonly involves decisions on how to handle missing values, standardize the data and select genes. In addition, pre-processing, involving various types of filtration and normalization procedures, can affect the ability to discover biologically relevant classes. Here we consider cluster analysis in a broad sense and perform a comprehensive evaluation that covers several aspects of cluster analyses, including normalization.

Result: We evaluated 2780 cluster analysis methods on seven publicly available 2-channel microarray data sets with common reference designs. Each cluster analysis method differed in data normalization (5 normalizations were considered), missing value imputation (2), standardization of data (2), gene selection (19) or clustering method (11). The cluster analyses are evaluated using known classes, such as cancer types, and the adjusted Rand index. The performances of the different analyses vary between the data sets and it is difficult to give general recommendations. However, normalization, gene selection and clustering method are all variables that have a significant impact on the performance. In particular, gene selection is important, and it is generally necessary to include a relatively large number of genes in order to get good performance. Selecting genes with high standard deviation or using principal component analysis are shown to be the preferred gene selection methods. Hierarchical clustering using Ward's method, k-means clustering and Mclust are the clustering methods considered in this paper that achieve the highest adjusted Rand index. Normalization can have a significant positive impact on the ability to cluster individuals, and there are indications that background correction is preferable, in particular if the gene selection is successful. However, this is an area that needs to be studied further in order to draw any general conclusions.

Conclusions: The choice of cluster analysis, and in particular gene selection, has a large impact on the ability to cluster individuals correctly based on expression profiles. Normalization has a positive effect, but the relative performance of different normalizations is an area that needs more research. In summary, although clustering, gene selection and normalization are considered standard methods in bioinformatics, our comprehensive analysis shows that selecting the right methods, and the right combinations of methods, is far from trivial and that much is still unexplored in what is considered to be the most basic analysis of genomic data.
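A single arm of such an evaluation can be sketched as follows: select the genes with the highest standard deviation, cluster the samples with Ward's hierarchical method, and score the partition against the known classes using the adjusted Rand index. The data matrix, the number of selected genes and the number of clusters below are placeholders, not the settings used in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import adjusted_rand_score

def evaluate_clustering(expr, known_classes, n_genes=1000):
    """expr: samples x genes matrix of normalized values (missing values imputed);
    known_classes: one known label (e.g. cancer type) per sample."""
    sd = expr.std(axis=0)
    top = np.argsort(sd)[::-1][:n_genes]            # high-variance gene selection
    sub = expr[:, top]
    z = linkage(sub, method="ward")                 # Ward hierarchical clustering (Euclidean)
    labels = fcluster(z, t=len(set(known_classes)), criterion="maxclust")
    return adjusted_rand_score(known_classes, labels)
```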

Place, publisher, year, edition, pages
BioMed Central, 2010
National Category
Bioinformatics and Systems Biology
Identifiers
urn:nbn:se:umu:diva-37843 (URN); 10.1186/1471-2105-11-503 (DOI); 000283353200001 (ISI)
Available from: 2010-11-16. Created: 2010-11-16. Last updated: 2017-12-12. Bibliographically approved.
4. Classification of microarrays: synergistic effects between normalization, gene selection and machine learning
2011 (English). In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 12, no. 1, 390. Article in journal (Refereed). Published.
Abstract [en]

BACKGROUND: Machine learning is a powerful approach for describing and predicting classes in microarray data. Although several comparative studies have investigated the relative performance of various machine learning methods, these often do not account for the fact that performance (e.g. error rate) is a result of a series of analysis steps, of which the most important are data normalization, gene selection and machine learning.

RESULTS: In this study, we used seven previously published cancer-related microarray data sets to compare the effects on classification performance of five normalization methods, three gene selection methods with 21 different numbers of selected genes, and eight machine learning methods. Performance in terms of error rate was rigorously estimated by repeatedly employing a double cross-validation approach. Since performance varies greatly between data sets, we devised an analysis method that first compares methods within individual data sets and then visualizes the comparisons across data sets. We discovered both well performing individual methods and synergies between different methods.

CONCLUSION: Support Vector Machines with a radial basis kernel, linear kernel or polynomial kernel of degree 2 all performed consistently well across data sets. We show that there is a synergistic relationship between these methods and gene selection based on the T-test and the selection of a relatively high number of genes. Also, we find that these methods benefit significantly from using normalized data, although it is hard to draw general conclusions about the relative performance of different normalization procedures.
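The double cross-validation idea can be sketched as a nested loop in which gene selection and parameter tuning are confined to the training folds, so the outer folds yield an approximately unbiased error estimate. The fold counts, the ANOVA-F gene ranking used as a stand-in for the T-test, the number of selected genes and the kernel grid below are illustrative assumptions, not the paper's settings.

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def double_cv_error(expr, labels):
    """expr: samples x genes matrix; labels: class per sample."""
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(f_classif, k=200)),    # gene selection inside training folds only
        ("svm", SVC()),
    ])
    grid = {"svm__kernel": ["linear", "rbf", "poly"], "svm__degree": [2], "svm__C": [1, 10]}
    inner = GridSearchCV(pipe, grid, cv=3)            # inner loop: choose settings
    acc = cross_val_score(inner, expr, labels, cv=5)  # outer loop: estimate performance
    return 1.0 - acc.mean()                           # error rate
```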

Place, publisher, year, edition, pages
BioMed Central, 2011
Keyword
statistical methods, expression, bioinformatics, features, tumors, cell
National Category
Biochemistry and Molecular Biology
Identifiers
urn:nbn:se:umu:diva-50285 (URN); 10.1186/1471-2105-12-390 (DOI); 000297641600001 (ISI); 21982277 (PubMedID)
Available from: 2011-12-06. Created: 2011-12-02. Last updated: 2017-12-08. Bibliographically approved.
5. Normalization of high dimensional genomics data where the distribution of the altered variables is skewed
2011 (English). In: PLoS ONE, ISSN 1932-6203, E-ISSN 1932-6203, Vol. 6, no. 11, e27942. Article in journal (Refereed). Published.
Abstract [en]

Genome-wide analysis of gene expression or protein binding patterns using different array or sequencing based technologies is now routinely performed to compare different populations, such as treatment and reference groups. It is often necessary to normalize the data obtained to remove technical variation introduced in the course of conducting experimental work, but standard normalization techniques are not capable of eliminating technical bias in cases where the distribution of the truly altered variables is skewed, i.e. when a large fraction of the variables are either positively or negatively affected by the treatment. However, several experiments are likely to generate such skewed distributions, including ChIP-chip experiments for the study of chromatin, gene expression experiments for the study of apoptosis, and SNP-studies of copy number variation in normal and tumour tissues. A preliminary study using spike-in array data established that the capacity of an experiment to identify altered variables and generate unbiased estimates of the fold change decreases as the fraction of altered variables and the skewness increases. We propose the following work-flow for analyzing high-dimensional experiments with regions of altered variables: (1) Pre-process raw data using one of the standard normalization techniques. (2) Investigate if the distribution of the altered variables is skewed. (3) If the distribution is not believed to be skewed, no additional normalization is needed. Otherwise, re-normalize the data using a novel HMM-assisted normalization procedure. (4) Perform downstream analysis. Here, ChIP-chip data and simulated data were used to evaluate the performance of the work-flow. It was found that skewed distributions can be detected by using the novel DSE-test (Detection of Skewed Experiments). Furthermore, applying the HMM-assisted normalization to experiments where the distribution of the truly altered variables is skewed results in considerably higher sensitivity and lower bias than can be attained using standard and invariant normalization methods.
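The four-step work-flow can be written out as a skeleton. The callables `dse_test` and `hmm_normalize` passed in below stand for the paper's DSE-test and HMM-assisted normalization and are not implemented here; `crude_skewness_check` is only a naive illustration of the kind of question step (2) asks, not the DSE-test itself.

```python
import numpy as np

def analyze_experiment(raw_data, standard_normalize, dse_test, hmm_normalize, downstream):
    """Skeleton of the work-flow; the novel steps are supplied as callables."""
    data = standard_normalize(raw_data)   # (1) standard pre-processing
    if dse_test(data):                    # (2) is the altered-variable distribution skewed?
        data = hmm_normalize(data)        # (3) re-normalize only if skewness is detected
    return downstream(data)               # (4) downstream analysis

def crude_skewness_check(log_ratios, threshold=0.5):
    """Naive symmetry check on the log-ratio distribution (illustration only):
    compares how far the upper and lower tails sit from the median."""
    q05, q50, q95 = np.percentile(log_ratios, [5, 50, 95])
    return abs((q95 - q50) - (q50 - q05)) > threshold * (q95 - q05)
```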

Keyword
skewness, normalization, genomics, array analysis, epigenetics
National Category
Bioinformatics and Systems Biology; Bioinformatics (Computational Biology); Genetics
Research subject
Biology
Identifiers
urn:nbn:se:umu:diva-50313 (URN); 10.1371/journal.pone.0027942 (DOI); 000297792400024 (ISI); 22132175 (PubMedID)
Available from: 2011-12-06. Created: 2011-12-06. Last updated: 2017-12-08. Bibliographically approved.

Open Access in DiVA

fulltext (1055 kB)
File information
File name: INSIDE01.pdf
File size: 1055 kB
Checksum (SHA-512): 681a37ff723879111773413156d6f57026468a5c0c98d6ae1b44b7caa18f92aa24596749c760756a03cc19e15043f1cdb3724e51f962ae501bd21ec04dee47bf
Type: fulltext
Mimetype: application/pdf

Authority records

Landfors, Mattias
