Challenges in microarray class discovery: a comprehensive examination of normalization, gene selection and clustering
Umeå University, Faculty of Medicine, Department of Clinical Microbiology, Clinical Bacteriology.
Umeå University, Faculty of Medicine, Department of Clinical Microbiology, Clinical Bacteriology. Umeå University, Faculty of Science and Technology, Department of Mathematics and Mathematical Statistics.
Umeå University, Faculty of Science and Technology, Department of Plant Physiology. Umeå University, Faculty of Science and Technology, Umeå Plant Science Centre (UPSC).
Umeå University, Faculty of Science and Technology, Department of Plant Physiology. Umeå University, Faculty of Science and Technology, Umeå Plant Science Centre (UPSC). ORCID iD: 0000-0001-6097-2539
2010 (English). In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 11, 503. Article in journal (Refereed), Published
Abstract [en]

Background: Cluster analysis, and in particular hierarchical clustering, is widely used to extract information from gene expression data. The aim is to discover new classes, or sub-classes, of either individuals or genes. Performing a cluster analysis commonly involves decisions on how to handle missing values, standardize the data and select genes. In addition, pre-processing, involving various types of filtration and normalization procedures, can affect the ability to discover biologically relevant classes. Here we consider cluster analysis in a broad sense and perform a comprehensive evaluation that covers several aspects of cluster analyses, including normalization.

Results: We evaluated 2780 cluster analysis methods on seven publicly available 2-channel microarray data sets with common reference designs. The cluster analysis methods differed in data normalization (5 normalizations were considered), missing value imputation (2), standardization of data (2), gene selection (19) or clustering method (11). The cluster analyses were evaluated using known classes, such as cancer types, and the adjusted Rand index. The performance of the different analyses varies between the data sets, and it is difficult to give general recommendations. However, normalization, gene selection and clustering method are all variables that have a significant impact on performance. In particular, gene selection is important, and it is generally necessary to include a relatively large number of genes in order to get good performance. Selecting genes with high standard deviation or using principal component analysis are shown to be the preferred gene selection methods. Hierarchical clustering using Ward's method, k-means clustering and Mclust are the clustering methods considered in this paper that achieve the highest adjusted Rand index. Normalization can have a significant positive impact on the ability to cluster individuals, and there are indications that background correction is preferable, in particular if the gene selection is successful. However, this is an area that needs to be studied further before any general conclusions can be drawn.
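
To make two of the ingredients named above concrete, the following is a minimal sketch of gene selection by high standard deviation and of the adjusted Rand index used to score a clustering against known classes. It is not the authors' pipeline: the function names `select_high_sd_genes` and `adjusted_rand_index` and the toy expression matrix are hypothetical and chosen only for illustration, and the clustering step itself (for example Ward's method or k-means) is left out, so the predicted labels are simply supplied by hand.

```python
from collections import Counter
from math import comb
from statistics import pstdev


def select_high_sd_genes(expr, n_genes):
    """Return the row indices of the n_genes genes with the largest
    standard deviation across samples (rows = genes, columns = samples)."""
    sds = [pstdev(row) for row in expr]
    ranked = sorted(range(len(expr)), key=lambda i: sds[i], reverse=True)
    return sorted(ranked[:n_genes])


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same samples."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("at least two samples are required")
    # Pairs of samples placed together in both partitions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together within each partition separately.
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_true * sum_pred / comb(n, 2)
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # both partitions are trivial and identical
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy data: 4 genes measured on 6 samples from two known classes.
toy_expr = [
    [1.0, 1.1, 0.9, 5.0, 5.2, 4.9],  # varies strongly between the classes
    [2.0, 2.0, 2.1, 2.0, 1.9, 2.0],  # nearly constant
    [0.0, 0.2, 0.1, 3.0, 3.1, 2.9],  # varies strongly between the classes
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.1],  # nearly constant
]
known_classes = [0, 0, 0, 1, 1, 1]

selected = select_high_sd_genes(toy_expr, 2)
# A clustering that recovers the classes under different label names
# still scores 1.0, because the index ignores how clusters are named.
perfect_score = adjusted_rand_index(known_classes, [1, 1, 1, 0, 0, 0])
```

In practice the predicted labels would come from a clustering method applied to the selected genes only, and the score would be compared across normalizations, gene selection methods and clustering methods.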

Conclusions: The choice of cluster analysis, and in particular gene selection, has a large impact on the ability to cluster individuals correctly based on expression profiles. Normalization has a positive effect, but the relative performance of different normalizations is an area that needs more research. In summary, although clustering, gene selection and normalization are considered standard methods in bioinformatics, our comprehensive analysis shows that selecting the right methods, and the right combinations of methods, is far from trivial and that much is still unexplored in what is considered to be the most basic analysis of genomic data.

Place, publisher, year, edition, pages
BioMed Central, 2010. Vol. 11, 503
National Category
Bioinformatics and Systems Biology
Identifiers
URN: urn:nbn:se:umu:diva-37843; DOI: 10.1186/1471-2105-11-503; ISI: 000283353200001; OAI: oai:DiVA.org:umu-37843; DiVA: diva2:370372
Available from: 2010-11-16 Created: 2010-11-16 Last updated: 2017-12-12. Bibliographically approved
In thesis
1. Normalization and analysis of high-dimensional genomics data
2012 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In the mid-1990s, microarray technology was introduced. The technology allowed genome-wide analysis of gene expression in a single experiment. Since its introduction, similar high-throughput methods have been developed in other fields of molecular biology. These high-throughput methods provide measurements for hundreds up to millions of variables in a single experiment, and rigorous data analysis is necessary in order to answer the underlying biological questions.

Further complications arise in data analysis because the complexity of the experimental procedures introduces technical variation into the data. This technical variation needs to be removed in order to draw relevant biological conclusions from the data. The process of removing the technical variation is referred to as normalization or pre-processing. During the last decade, a large number of normalization and data analysis methods have been proposed.

In this thesis, data from two types of high-throughput methods are used to evaluate the effect that pre-processing methods have on further analyses. In areas where problems in current methods are identified, novel normalization methods are proposed. The evaluations of known and novel methods are performed on simulated data, real data and data from an in-house produced spike-in experiment.

Place, publisher, year, edition, pages
Umeå: Umeå Universitet, 2012. 43 p.
Keyword
normalization, pre-processing, microarray, downstream analysis, evaluation, sensitivity, bias, genomics data, gene expression, spike-in data, ChIP-chip
National Category
Bioinformatics and Systems Biology Probability Theory and Statistics
Research subject
Mathematical Statistics
Identifiers
urn:nbn:se:umu:diva-53486 (URN), 978-91-7459-402-7 (ISBN)
Public defence
2012-04-20, MA121, Mit-huset, Umeå Universitet, Umeå, 19:25 (English)
Available from: 2012-03-30 Created: 2012-03-28 Last updated: 2012-04-02. Bibliographically approved

Author/editor
Freyhult, Eva; Landfors, Mattias; Önskog, Jenny; Hvidsten, Torgeir R.; Rydén, Patrik
Organisation
Clinical Bacteriology; Department of Mathematics and Mathematical Statistics; Department of Plant Physiology; Umeå Plant Science Centre (UPSC); Department of Statistics