Normalization of high dimensional genomics data where the distribution of the altered variables is skewed
Umeå University, Faculty of Science and Technology, Department of Mathematics and Mathematical Statistics. Umeå University, Faculty of Medicine, Department of Clinical Microbiology, Clinical Bacteriology. (Patrik Rydén)
Umeå University, Faculty of Science and Technology, Department of Molecular Biology (Faculty of Science and Technology). (Per Stenberg) ORCID iD: 0000-0001-8752-0794
Umeå University, Faculty of Science and Technology, Department of Mathematics and Mathematical Statistics. (Patrik Rydén)
Umeå University, Faculty of Science and Technology, Department of Molecular Biology (Faculty of Science and Technology). (Computational Life Science Cluster (CLiC))
2011 (English). In: PLoS ONE, ISSN 1932-6203, E-ISSN 1932-6203, Vol. 6, no 11, e27942. Article in journal (Refereed), Published
Abstract [en]

Genome-wide analysis of gene expression or protein binding patterns using different array or sequencing based technologies is now routinely performed to compare different populations, such as treatment and reference groups. It is often necessary to normalize the data obtained to remove technical variation introduced in the course of conducting experimental work, but standard normalization techniques are not capable of eliminating technical bias in cases where the distribution of the truly altered variables is skewed, i.e. when a large fraction of the variables are either positively or negatively affected by the treatment. However, several experiments are likely to generate such skewed distributions, including ChIP-chip experiments for the study of chromatin, gene expression experiments for the study of apoptosis, and SNP-studies of copy number variation in normal and tumour tissues. A preliminary study using spike-in array data established that the capacity of an experiment to identify altered variables and generate unbiased estimates of the fold change decreases as the fraction of altered variables and the skewness increases. We propose the following work-flow for analyzing high-dimensional experiments with regions of altered variables: (1) Pre-process raw data using one of the standard normalization techniques. (2) Investigate if the distribution of the altered variables is skewed. (3) If the distribution is not believed to be skewed, no additional normalization is needed. Otherwise, re-normalize the data using a novel HMM-assisted normalization procedure. (4) Perform downstream analysis. Here, ChIP-chip data and simulated data were used to evaluate the performance of the work-flow. It was found that skewed distributions can be detected by using the novel DSE-test (Detection of Skewed Experiments). Furthermore, applying the HMM-assisted normalization to experiments where the distribution of the truly altered variables is skewed results in considerably higher sensitivity and lower bias than can be attained using standard and invariant normalization methods.
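The central claim of the abstract, that standard normalization leaves a bias when the altered variables are skewed, can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' procedure: the data are synthetic log-ratios, global median centring stands in for "one of the standard normalization techniques", and all function names and parameter values (`simulate_log_ratios`, `frac_altered`, `technical_shift`, and so on) are hypothetical. The "oracle" variant uses the true labels, which are of course unknown in a real experiment.

```python
import random
import statistics


def simulate_log_ratios(n=10000, frac_altered=0.3, effect=1.0,
                        technical_shift=0.5, noise_sd=0.2, seed=1):
    """Simulate log-ratios (treatment vs. reference) for n variables.

    Every altered variable is shifted in the same direction, so the
    distribution of the altered variables is fully skewed. All values
    here are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    observed, is_altered = [], []
    for _ in range(n):
        altered = rng.random() < frac_altered
        signal = effect if altered else 0.0
        observed.append(signal + technical_shift + rng.gauss(0.0, noise_sd))
        is_altered.append(altered)
    return observed, is_altered


def median_center(values, reference=None):
    """Subtract the median of `reference` (default: all values)."""
    centre = statistics.median(values if reference is None else reference)
    return [v - centre for v in values]


def group_mean(values, mask, wanted):
    """Mean of the values whose mask entry equals `wanted`."""
    return statistics.fmean([v for v, m in zip(values, mask) if m == wanted])


observed, is_altered = simulate_log_ratios()

# Standard approach: centre on the median of all variables.
global_norm = median_center(observed)

# Oracle approach: centre on the truly non-altered variables only.
invariant = [v for v, a in zip(observed, is_altered) if not a]
oracle_norm = median_center(observed, reference=invariant)

# Non-altered variables should average zero after normalization;
# altered variables should average the true effect (1.0 here).
bias_global = group_mean(global_norm, is_altered, False)
bias_oracle = group_mean(oracle_norm, is_altered, False)
fold_global = group_mean(global_norm, is_altered, True)
fold_oracle = group_mean(oracle_norm, is_altered, True)

print(f"bias on non-altered: global={bias_global:.3f}, oracle={bias_oracle:.3f}")
print(f"estimated effect:    global={fold_global:.3f}, oracle={fold_oracle:.3f}")
```

In this toy setting the global median is pulled towards the altered variables, so the non-altered variables end up below zero and the effect on the altered variables is underestimated. Raising `frac_altered` makes both distortions larger, which matches the trend the abstract reports for the spike-in data.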

Place, publisher, year, edition, pages
2011. Vol. 6, no 11, e27942.
Keyword [en]
skewness, normalization, genomics, array analysis, epigenetics
National Category
Bioinformatics and Systems Biology; Bioinformatics (Computational Biology); Genetics
Research subject
biology
Identifiers
URN: urn:nbn:se:umu:diva-50313; DOI: 10.1371/journal.pone.0027942; ISI: 000297792400024; PubMedID: 22132175; OAI: oai:DiVA.org:umu-50313; DiVA: diva2:462049
Available from: 2011-12-06 Created: 2011-12-06 Last updated: 2017-12-08 (Bibliographically approved)
In thesis
1. Normalization and analysis of high-dimensional genomics data
2012 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Microarray technology was introduced in the mid-1990s. The technology allowed for genome-wide analysis of gene expression in one experiment. Since its introduction, similar high-throughput methods have been developed in other fields of molecular biology. These high-throughput methods provide measurements for hundreds up to millions of variables in a single experiment, and a rigorous data analysis is necessary in order to answer the underlying biological questions.

Further complications arise in data analysis because technical variation is introduced in the data, due to the complexity of the experimental procedures in these experiments. This technical variation needs to be removed in order to draw relevant biological conclusions from the data. The process of removing the technical variation is referred to as normalization or pre-processing. During the last decade a large number of normalization and data analysis methods have been proposed.

In this thesis, data from two types of high-throughput methods are used to evaluate the effect pre-processing methods have on further analyses. In areas where problems in current methods are identified, novel normalization methods are proposed. The evaluations of known and novel methods are performed on simulated data, real data and data from an in-house produced spike-in experiment.

Place, publisher, year, edition, pages
Umeå: Umeå Universitet, 2012. 43 p.
Keyword
normalization, pre-processing, microarray, downstream analysis, evaluation, sensitivity, bias, genomics data, gene expression, spike-in data, ChIP-chip
National Category
Bioinformatics and Systems Biology; Probability Theory and Statistics
Research subject
Mathematical Statistics
Identifiers
urn:nbn:se:umu:diva-53486 (URN); 978-91-7459-402-7 (ISBN)
Public defence
2012-04-20, MA121, Mit-huset, Umeå Universitet, Umeå, 19:25 (English)
Available from: 2012-03-30 Created: 2012-03-28 Last updated: 2012-04-02 (Bibliographically approved)
2. Mining DNA elements involved in targeting of chromatin modifiers
2014 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Background: In all higher organisms, the nuclear DNA is condensed into nucleosomes that consist of DNA wrapped around a core of highly conserved histone proteins. DNA bound to histones and other structural proteins forms the chromatin. Generally, only a few regions of DNA are accessible, and most of the time RNA polymerase and other DNA-binding proteins have to overcome this compaction to initiate transcription. Several proteins are involved in making the chromatin more compact or more open. Such chromatin-modifying proteins make distinct post-translational modifications of histones, especially in the histone tails, to alter their affinity to DNA.

Aim: The main aim of my thesis work is to study the targeting of chromatin modifiers important for correct gene expression in Drosophila melanogaster (fruit flies). Primary DNA sequences, chromatin-associated proteins, transcription, and non-coding RNAs are all likely to be involved in targeting mechanisms. This thesis work involves the development of new computational methods for identification of DNA motifs and protein factors involved in the targeting of chromatin modifiers. Targeting and functional analysis of two chromatin modifiers, namely the male-specific lethal (MSL) complex and CREB-binding protein (CBP), are specifically studied. The MSL complex is a protein complex that mediates dosage compensation in flies. CBP is known as a transcriptional co-regulator in metazoans; it has histone acetyltransferase activity and has been used to predict novel enhancers.

Results: My studies of the binding sites of the MSL complex show that promoters and coding sequences of MSL-bound genes on the X-chromosome of Drosophila melanogaster can influence the spreading of the complex along the X-chromosome. Analysis of MSL binding sites when two non-coding roX RNAs are mutated shows that MSL-complex recruitment to high-affinity sites on the X-chromosome is independent of roX, and that the role of roX RNAs is to prevent binding to repeats in autosomal sites. Functional analysis of MSL-bound genes using their dosage compensation status shows that the function of the MSL complex is to enhance the expression of short housekeeping genes, but MSL-independent mechanisms exist to achieve complete dosage compensation. Studies of the binding sites of the CBP protein show that, in early embryos, Dorsal in cooperation with GAGA factor (GAF) and factors like Medea and Dichaete target CBP to its binding sites. In the S2 cell line, GAF is identified as the targeting factor of CBP at promoters and enhancers, and GAF and CBP together are found to induce high levels of polymerase II pausing at promoters. In another study using integrated data analysis, CBP binding sites could be classified into polycomb protein binding sites, repressed enhancers, insulator protein-bound regions, active promoters, and active enhancers, and this suggested different potential roles for CBP. A new approach was also developed to eliminate technical bias in skewed experiments. Our study shows that in the case of skewed datasets it is always better to identify non-altered variables and to normalize the data using only such variables.
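The closing recommendation of this abstract, to identify non-altered variables and normalize using only those, can be sketched for data in which altered variables occur in contiguous regions. This record does not describe the HMM-assisted procedure in detail, so the code below is only a toy illustration under loudly stated assumptions: a two-state Gaussian HMM with fixed, hand-picked parameters (`effect_guess`, `sd`, `stay_prob` are hypothetical), decoded with the Viterbi algorithm, alternating with median centring on the variables labelled non-altered. It should not be read as the published method.

```python
import math
import random
import statistics


def viterbi_two_state(values, means, sd, stay_prob=0.95):
    """Most likely state path (0 = non-altered, 1 = altered) for a
    two-state Gaussian HMM with fixed, user-supplied parameters."""
    log_stay = math.log(stay_prob)
    log_move = math.log(1.0 - stay_prob)

    def log_emit(x, state):
        # Shared sd, so the normalizing constant cancels between states.
        return -0.5 * ((x - means[state]) / sd) ** 2

    score = [math.log(0.5) + log_emit(values[0], s) for s in (0, 1)]
    backpointers = []
    for x in values[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            candidates = [score[p] + (log_stay if p == s else log_move)
                          for p in (0, 1)]
            best = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best)
            new_score.append(candidates[best] + log_emit(x, s))
        score = new_score
        backpointers.append(pointers)

    # Trace the best path backwards from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def hmm_assisted_center(values, effect_guess=1.0, sd=0.3, n_iter=5):
    """Alternate between labelling regions and re-centring on the
    variables labelled non-altered (hypothetical toy procedure)."""
    centre = statistics.median(values)  # start from standard centring
    states = [0] * len(values)
    for _ in range(n_iter):
        centred = [v - centre for v in values]
        states = viterbi_two_state(centred, means=(0.0, effect_guess), sd=sd)
        unaltered = [v for v, s in zip(values, states) if s == 0]
        if not unaltered:
            break
        centre = statistics.median(unaltered)
    return [v - centre for v in values], states


# Synthetic data: 3000 ordered positions, every third block of 100 is
# altered upwards (skewed), plus a technical shift of 0.5 everywhere.
rng = random.Random(7)
truth = [1 if (i // 100) % 3 == 0 else 0 for i in range(3000)]
raw = [0.5 + 1.0 * t + rng.gauss(0.0, 0.3) for t in truth]

global_median = statistics.median(raw)
standard = [v - global_median for v in raw]
renormalized, states = hmm_assisted_center(raw)


def mean_non_altered(values):
    return statistics.fmean([v for v, t in zip(values, truth) if t == 0])


bias_standard = mean_non_altered(standard)
bias_hmm = mean_non_altered(renormalized)
accuracy = sum(s == t for s, t in zip(states, truth)) / len(truth)
print(f"bias standard={bias_standard:.3f}, bias HMM-assisted={bias_hmm:.3f}, "
      f"labelling accuracy={accuracy:.3f}")
```

The design choice worth noting is that the HMM is used only to decide which variables enter the normalization; the normalization itself stays a plain median centring. Because neighbouring positions tend to share a state, the region labels remain stable even when the initial centring is biased, which is what lets the re-centring step recover.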

Place, publisher, year, edition, pages
Umeå: Umeå University, 2014. 76 p.
Keyword
nucleosome, histone, chromatin, chromatin modifiers, targeting, DNA motifs, protein factors, MSL
National Category
Bioinformatics and Systems Biology; Biochemistry and Molecular Biology; Genetics
Research subject
Molecular Biology
Identifiers
urn:nbn:se:umu:diva-92979 (URN); 978-91-7601-118-8 (ISBN)
Public defence
2014-10-03, E 04 Unod R1, Norrland University Hospital, Umeå, 09:00 (English)
Available from: 2014-09-12 Created: 2014-09-09 Last updated: 2014-09-23 (Bibliographically approved)

By author/editor
Landfors, Mattias; Philip, Philge; Rydén, Patrik; Stenberg, Per