Cluster analysis on high dimensional RNA-seq data with applications to cancer research: An evaluation study
Umeå University, Faculty of Science and Technology, Department of Mathematics and Mathematical Statistics.
Umeå University, Faculty of Science and Technology, Department of Mathematics and Mathematical Statistics. Umeå University, Faculty of Social Sciences, Umeå School of Business and Economics (USBE), Statistics.
Umeå University, Faculty of Science and Technology, Department of Mathematics and Mathematical Statistics.
2019 (English). In: PLOS ONE, E-ISSN 1932-6203, Vol. 14, no. 12, article id e0219102. Article in journal (Refereed). Published
Abstract [en]

Background: Clustering of gene expression data is widely used to identify novel subtypes of cancer. Plenty of clustering approaches have been proposed, but there is a lack of knowledge regarding their relative merits and how data characteristics influence the performance. We evaluate how cluster analysis choices affect the performance by studying four publicly available human cancer data sets: breast, brain, kidney and stomach cancer. In particular, we focus on how the sample size, distribution of subtypes and sample heterogeneity affect the performance.

Results: In general, increasing the sample size had a limited effect on the clustering performance; for the breast cancer data, for example, similar performance was obtained for n = 40 as for n = 330. The relative distribution of the subtypes had a noticeable effect on the ability to identify the disease subtypes, and data with disproportionate cluster sizes turned out to be difficult to cluster. Both the choice of clustering method and the choice of selection method affected the ability to identify the subtypes, but the relative performance varied between data sets, making it difficult to rank the approaches. For some data sets, the performance was substantially higher when the clustering was based on data from only one sex compared to data from a mixed population. This suggests that homogeneous data are easier to cluster than heterogeneous data and that clustering males and females individually may be beneficial and increase the chance of detecting novel subtypes. It was also observed that the performance often differed substantially between females and males.

Conclusions: The number of samples seems to have a limited effect on the performance, while the heterogeneity, at least with respect to sex, is important for the performance. Hence, when males and females are analyzed separately, the possible loss caused by having fewer samples could be outweighed by the benefit of more homogeneous data.
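
The abstract speaks of clustering "performance" without naming the measure in this record. As a minimal sketch, assuming agreement between a clustering and the known subtype labels is scored with the adjusted Rand index (an assumption made here for illustration, not a statement of the authors' method), the score can be computed as follows. The function name and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, cluster_labels):
    """Adjusted Rand index between two partitions of the same samples.

    1.0 means identical partitions (up to label names); values near 0
    mean agreement no better than expected by chance.
    """
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)
    if n < 2:
        return 1.0  # fewer than two samples: no pairs to disagree on

    # Counts for the contingency table cells and its row/column margins.
    cells = Counter(zip(true_labels, cluster_labels))
    rows = Counter(true_labels)
    cols = Counter(cluster_labels)

    # Number of sample pairs placed together in both / in each partition.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both partitions are trivial
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy labels: four samples, two known subtypes,
# and a clustering that splits the second subtype in two.
known_subtypes = [0, 0, 1, 1]
clustering = [0, 0, 1, 2]
score = adjusted_rand_index(known_subtypes, clustering)
```

To compare a pooled analysis with sex-specific analyses in the spirit of the abstract, the same function could be applied once to all samples and once to each subset, each time with the labels restricted to the samples that were clustered.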

Place, publisher, year, edition, pages
San Francisco: Public Library of Science, 2019. Vol. 14, no. 12, article id e0219102
Keywords [en]
Cancer, cluster analysis
National Category
Identifiers
URN: urn:nbn:se:umu:diva-167274
DOI: 10.1371/journal.pone.0219102
ISI: 000534009700002
PubMed ID: 31805048
Scopus ID: 2-s2.0-85076157692
OAI: oai:DiVA.org:umu-167274
DiVA, id: diva2:1385558
Available from: 2020-01-14 Created: 2020-01-14 Last updated: 2025-02-05 Bibliographically approved
In thesis
1. Cancer subtype identification using cluster analysis on high-dimensional omics data
2020 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Identification and prediction of cancer subtypes are important parts in the development towards personalized medicine. By tailoring treatments, it is possible to decrease unnecessary suffering and reduce costs. Since the introduction of next generation sequencing techniques, the amount of data available for medical research has increased rapidly. The high dimensional omics data produced by various techniques requires statistical methods to transform data into information and knowledge.

All papers in this thesis are related to distinguishing disease subtypes in patients with cancer using omics data. The high dimension and the complexity of sequencing data from tumor samples make it necessary to pre-process the data. We carry out comparisons of feature selection methods and clustering methods used for identification of cancer subtypes. In addition, we evaluate the effect that certain characteristics of the data have on the ability to identify cancer subtypes. The results show that no method outperforms the others in all cases and that the relative ranking of methods is very dependent on the data. We also show that the benefit of obtaining more homogeneous data by analyzing genders separately can outweigh the possible drawbacks caused by smaller sample sizes. One of the major challenges when dealing with omics data from tumor samples is that the patients are generally a very heterogeneous group. Factors that lead to heterogeneity include age, gender, ethnicity and stage of disease. The effect size of each of these factors might affect the ability to identify the subgroups of interest.

In omics data, the feature space is often large, and the number of features that are informative for the factors of interest will also affect the complexity of the problem. We present a novel clustering approach that can identify different clusters in different subsets of the feature space, which is applied to methylation data to create new potential biomarkers. It is shown that by combining clinical data with methylation data for patients with clear cell renal carcinoma, it is possible to improve the currently used prediction model for disease progression.

Using unsupervised clustering techniques, we identify three molecular subtypes of prostate cancer bone metastases based on gene expression profiles. The robustness of the identified subtypes is confirmed by applying several clustering algorithms with very similar results.

Place, publisher, year, edition, pages
Umeå: Umeå universitet, 2020. p. 22
Series
Research report in mathematical statistics, ISSN 1653-0829 ; 70/20
Keywords
cluster analysis, cancer, classification
National Category
Identifiers
URN: urn:nbn:se:umu:diva-167275
ISBN: 978-91-7855-172-9
ISBN: 978-91-7855-173-6
Public defence
2020-02-07, N460, Naturvetarhuset, Umeå, 09:15 (English)
Opponent
Supervisor
Available from: 2020-01-17 Created: 2020-01-14 Last updated: 2021-10-19 Bibliographically approved

Open Access in DiVA

fulltext (2717 kB), 372 downloads
File information
File name: FULLTEXT01.pdf
File size: 2717 kB
Checksum (SHA-512): 0e50a57b87ff3e6c319d5e6560c3d2579e752c7e3d1244b9fe425e0522fbe9478d24241fcda5d487b0ec4b240c73f06e0c55304cf3518f5afd9f407bb44f0ce3
Type: fulltext
MIME type: application/pdf


Authors

Vidman, Linda; Källberg, David; Rydén, Patrik
