Comparison of methods for variable selection in clustering of high-dimensional RNA-sequencing data to identify cancer subtypes
Umeå University, Faculty of Social Sciences, Umeå School of Business and Economics (USBE), Statistics.
Umeå University, Faculty of Science and Technology, Department of Mathematics and Mathematical Statistics.
Umeå University, Faculty of Science and Technology, Department of Mathematics and Mathematical Statistics.
(English) Manuscript (preprint) (Other academic)
Keywords [en]
feature selection, clustering, RNA-seq, cancer
National Category
Probability Theory and Statistics
Identifiers
URN: urn:nbn:se:umu:diva-167264; OAI: oai:DiVA.org:umu-167264; DiVA, id: diva2:1385477
Available from: 2020-01-14 Created: 2020-01-14 Last updated: 2020-01-16 Bibliographically approved
In thesis
1. Cancer subtype identification using cluster analysis on high-dimensional omics data
2020 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Identification and prediction of cancer subtypes are important steps in the development of personalized medicine. By tailoring treatments, it is possible to decrease unnecessary suffering and reduce costs. Since the introduction of next-generation sequencing techniques, the amount of data available for medical research has increased rapidly. The high-dimensional omics data produced by these techniques require statistical methods to transform data into information and knowledge.

All papers in this thesis concern distinguishing disease subtypes in patients with cancer using omics data. The high dimension and complexity of sequencing data from tumor samples make it necessary to pre-process the data. We compare feature selection methods and clustering methods used for identification of cancer subtypes. In addition, we evaluate the effect that certain characteristics of the data have on the ability to identify cancer subtypes. The results show that no method outperforms the others in all cases and that the relative ranking of methods depends strongly on the data. We also show that the benefit of obtaining more homogeneous data by analyzing genders separately can outweigh the possible drawbacks caused by smaller sample sizes. One of the major challenges when dealing with omics data from tumor samples is that the patients are generally a very heterogeneous group. Factors that lead to heterogeneity include age, gender, ethnicity and stage of disease. The effect size of each of these factors might affect the ability to identify the subgroups of interest.

In omics data, the feature space is often large, and the number of features that are informative for the factors of interest also affects the complexity of the problem. We present a novel clustering approach that can identify different clusters in different subsets of the feature space, and apply it to methylation data to create new potential biomarkers. It is shown that by combining clinical data with methylation data for patients with clear cell renal carcinoma, it is possible to improve the currently used prediction model for disease progression.

Using unsupervised clustering techniques, we identify three molecular subtypes of prostate cancer bone metastases based on gene expression profiles. The robustness of the identified subtypes is confirmed by applying several clustering algorithms, which give very similar results.

Place, publisher, year, edition, pages
Umeå: Umeå universitet, 2020. p. 22
Series
Research report in mathematical statistics, ISSN 1653-0829 ; 70/20
Keywords
cluster analysis, cancer, classification
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:umu:diva-167275 (URN); 978-91-7855-172-9 (ISBN); 978-91-7855-173-6 (ISBN)
Public defence
2020-02-07, N460, Naturvetarhuset, Umeå, 09:15 (English)
Opponent
Supervisors
Available from: 2020-01-17 Created: 2020-01-14 Last updated: 2021-10-19 Bibliographically approved

Open Access in DiVA

No full text in DiVA

Authority records

Källberg, David; Vidman, Linda; Rydén, Patrik
