Classification of microarrays: synergistic effects between normalization, gene selection and machine learning
Umeå University, Faculty of Science and Technology, Department of Plant Physiology. (Computational Life Science Cluster (CLiC), Umeå University, Umeå, Sweden)
Umeå University, Faculty of Medicine, Department of Clinical Microbiology, Clinical Bacteriology. (Department of Medical Sciences, Uppsala University, Academic Hospital, Uppsala, Sweden; Computational Life Science Cluster (CLiC), Umeå University, Umeå, Sweden)
Umeå University, Faculty of Science and Technology, Department of Mathematics and Mathematical Statistics. Umeå University, Faculty of Medicine, Department of Clinical Microbiology, Clinical Bacteriology. (Computational Life Science Cluster (CLiC), Umeå University, Umeå, Sweden)
Umeå University, Faculty of Science and Technology, Department of Mathematics and Mathematical Statistics. (Computational Life Science Cluster (CLiC), Umeå University, Umeå, Sweden)
2011 (English). In: BMC Bioinformatics, ISSN 1471-2105, E-ISSN 1471-2105, Vol. 12, no 1, 390. Article in journal (Refereed), Published
Abstract [en]

BACKGROUND: Machine learning is a powerful approach for describing and predicting classes in microarray data. Although several comparative studies have investigated the relative performance of various machine learning methods, these often do not account for the fact that performance (e.g. error rate) is a result of a series of analysis steps of which the most important are data normalization, gene selection and machine learning.

RESULTS: In this study, we used seven previously published cancer-related microarray data sets to compare the effects on classification performance of five normalization methods, three gene selection methods with 21 different numbers of selected genes and eight machine learning methods. Performance in terms of error rate was rigorously estimated by repeatedly employing a double cross validation approach. Since performance varies greatly between data sets, we devised an analysis method that first compares methods within individual data sets and then visualizes the comparisons across data sets. We discovered both well-performing individual methods and synergies between different methods.
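
The double cross validation idea can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the inner loop picks the number of genes, the outer loop estimates the error rate, and gene selection by t-statistic is redone inside every training fold so that test samples never influence it. The nearest-centroid classifier, the simulated data, and all function and parameter names (`double_cv_error`, `k_grid`, `simulate`, and so on) are assumptions made for the example.

```python
import random
from statistics import mean, variance

def t_scores(X, y):
    """Absolute Welch t-statistic of every gene (column) for classes 0 and 1."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, c in zip(X, y) if c == 0]
        b = [row[j] for row, c in zip(X, y) if c == 1]
        se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
        scores.append(abs(mean(a) - mean(b)) / se if se > 0 else 0.0)
    return scores

def fit_predict(X_tr, y_tr, X_te, k):
    """Select the top-k genes on training data only, then classify test
    samples by nearest class centroid (a stand-in for the paper's learners)."""
    scores = t_scores(X_tr, y_tr)
    genes = sorted(range(len(scores)), key=lambda j: -scores[j])[:k]
    centroids = {}
    for c in (0, 1):
        rows = [r for r, lab in zip(X_tr, y_tr) if lab == c]
        centroids[c] = [mean(r[j] for r in rows) for j in genes]
    preds = []
    for r in X_te:
        dist = {c: sum((r[j] - m) ** 2 for j, m in zip(genes, centroids[c]))
                for c in (0, 1)}
        preds.append(min(dist, key=dist.get))
    return preds

def split(idx, n_folds):
    """Partition a list of sample indices into n_folds disjoint folds."""
    return [idx[i::n_folds] for i in range(n_folds)]

def errors_on(X, y, train, test, k):
    """Number of misclassified test samples for one train/test split."""
    preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                        [X[i] for i in test], k)
    return sum(p != y[i] for p, i in zip(preds, test))

def cv_error(X, y, idx, k, n_folds):
    """Plain cross-validated error rate on the samples in idx."""
    wrong = 0
    for test in split(idx, n_folds):
        train = [i for i in idx if i not in test]
        wrong += errors_on(X, y, train, test, k)
    return wrong / len(idx)

def double_cv_error(X, y, k_grid, outer=5, inner=4, seed=0):
    """Outer loop estimates error; inner loop (training part only) tunes k."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    wrong = 0
    for test in split(idx, outer):
        train = [i for i in idx if i not in test]
        best_k = min(k_grid, key=lambda k: cv_error(X, y, train, k, inner))
        wrong += errors_on(X, y, train, test, best_k)
    return wrong / len(idx)

def simulate(n_per_class=20, n_genes=50, n_informative=5, shift=3.0, seed=1):
    """Toy expression matrix: the first n_informative genes differ by class."""
    rng = random.Random(seed)
    X, y = [], []
    for c in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(shift * c if j < n_informative else 0.0, 1.0)
                      for j in range(n_genes)])
            y.append(c)
    return X, y
```

Calling `double_cv_error(*simulate(), k_grid=[2, 5, 20])` returns one error-rate estimate; repeating the call with different `seed` values and averaging mirrors the "repeatedly employing" part of the description.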

CONCLUSION: Support Vector Machines with a radial basis kernel, linear kernel or polynomial kernel of degree 2 all performed consistently well across data sets. We show that there is a synergistic relationship between these methods and gene selection based on the T-test and the selection of a relatively high number of genes. Also, we find that these methods benefit significantly from using normalized data, although it is hard to draw general conclusions about the relative performance of different normalization procedures.
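
For readers unfamiliar with the three kernels named above, the sketch below writes them out in their common textbook forms. The parameters `gamma` and `coef0` and their default values are assumptions for illustration; the abstract does not state which settings were used.

```python
import math

def dot(x, z):
    """Inner product of two equal-length expression vectors."""
    return sum(a * b for a, b in zip(x, z))

def linear_kernel(x, z):
    """K(x, z) = <x, z>"""
    return dot(x, z)

def poly2_kernel(x, z, gamma=1.0, coef0=1.0):
    """Polynomial kernel of degree 2: K(x, z) = (gamma * <x, z> + coef0)^2"""
    return (gamma * dot(x, z) + coef0) ** 2

def rbf_kernel(x, z, gamma=1.0):
    """Radial basis kernel: K(x, z) = exp(-gamma * ||x - z||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)
```

Each function maps a pair of samples to a similarity value; a Support Vector Machine only ever accesses the data through such pairwise values, which is why swapping the kernel changes the shape of the decision boundary without changing the training procedure.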

Place, publisher, year, edition, pages
BioMed Central, 2011. Vol. 12, no 1, 390
Keyword [en]
statistical methods, expression, bioinformatics, features, tumors, cell
National Category
Biochemistry and Molecular Biology
Identifiers
URN: urn:nbn:se:umu:diva-50285; DOI: 10.1186/1471-2105-12-390; ISI: 000297641600001; PubMedID: 21982277; OAI: oai:DiVA.org:umu-50285; DiVA: diva2:462055
Available from: 2011-12-06 Created: 2011-12-02 Last updated: 2017-12-08 (Bibliographically approved)
In thesis
1. Normalization and analysis of high-dimensional genomics data
2012 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

In the mid-1990s, microarray technology was introduced. The technology allowed for genome-wide analysis of gene expression in one experiment. Since its introduction, similar high-throughput methods have been developed in other fields of molecular biology. These high-throughput methods provide measurements for hundreds up to millions of variables in a single experiment, and rigorous data analysis is necessary in order to answer the underlying biological questions.

Further complications arise in data analysis because the complexity of the experimental procedures introduces technical variation into the data. This technical variation needs to be removed in order to draw relevant biological conclusions from the data. The process of removing the technical variation is referred to as normalization or pre-processing. During the last decade, a large number of normalization and data analysis methods have been proposed.
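
As a concrete illustration of what a normalization step can look like, the sketch below implements quantile normalization, one widely used method for expression arrays. It is chosen here only as an example; the abstract does not say which methods the thesis evaluates, and the function name and data layout (one list of intensities per array) are assumptions.

```python
from statistics import mean

def quantile_normalize(arrays):
    """Force every array to share the same intensity distribution.

    arrays: list of equal-length lists, one list of gene intensities per
    array. Ties within an array are broken by gene order (a simplification).
    """
    n_genes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the r-th smallest value across arrays.
    reference = [mean(s[r] for s in sorted_arrays) for r in range(n_genes)]
    normalized = []
    for a in arrays:
        # Gene indices ordered by increasing intensity within this array.
        order = sorted(range(n_genes), key=lambda g: a[g])
        out = [0.0] * n_genes
        for rank, g in enumerate(order):
            out[g] = reference[rank]
        normalized.append(out)
    return normalized
```

After the call, each array contains exactly the same set of values, so systematic array-to-array differences in overall intensity are removed while the rank order of genes within each array is preserved.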

In this thesis, data from two types of high-throughput methods are used to evaluate the effect that pre-processing methods have on downstream analyses. In areas where problems in current methods are identified, novel normalization methods are proposed. The evaluations of known and novel methods are performed on simulated data, real data and data from a spike-in experiment produced in-house.

Place, publisher, year, edition, pages
Umeå: Umeå Universitet, 2012. 43 p.
Keyword
normalization, pre-processing, microarray, downstream analysis, evaluation, sensitivity, bias, genomics data, gene expression, spike-in data, ChIP-chip
National Category
Bioinformatics and Systems Biology; Probability Theory and Statistics
Research subject
Mathematical Statistics
Identifiers
URN: urn:nbn:se:umu:diva-53486; ISBN: 978-91-7459-402-7
Public defence
2012-04-20, MA121, Mit-huset, Umeå Universitet, Umeå, 19:25 (English)
Available from: 2012-03-30 Created: 2012-03-28 Last updated: 2012-04-02 (Bibliographically approved)

Authors
Önskog, Jenny; Freyhult, Eva; Landfors, Mattias; Rydén, Patrik; Hvidsten, Torgeir R