Cancer subtype identification using cluster analysis on high-dimensional omics data
Umeå University, Faculty of Science and Technology, Department of Mathematics and Mathematical Statistics.
2020 (English). Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

Identification and prediction of cancer subtypes are important parts in the development towards personalized medicine. By tailoring treatments, it is possible to decrease unnecessary suffering and reduce costs. Since the introduction of next generation sequencing techniques, the amount of data available for medical research has increased rapidly. The high dimensional omics data produced by various techniques requires statistical methods to transform data into information and knowledge.

All papers in this thesis concern distinguishing disease subtypes in patients with cancer using omics data. The high dimension and complexity of sequencing data from tumor samples make it necessary to pre-process the data. We compare feature selection methods and clustering methods used for identification of cancer subtypes. In addition, we evaluate the effect that certain characteristics of the data have on the ability to identify cancer subtypes. The results show that no method outperforms the others in all cases and that the relative ranking of methods depends strongly on the data. We also show that the benefit of obtaining more homogeneous data by analyzing genders separately can outweigh the possible drawbacks caused by smaller sample sizes. One of the major challenges when dealing with omics data from tumor samples is that the patients are generally a very heterogeneous group. Factors that lead to heterogeneity include age, gender, ethnicity and stage of disease. The effect size of each of these factors might affect the ability to identify the subgroups of interest.

In omics data, the feature space is often large, and the number of features that are informative for the factors of interest also affects the complexity of the problem. We present a novel clustering approach that can identify different clusters in different subsets of the feature space, and apply it to methylation data to create new potential biomarkers. We show that combining clinical data with methylation data for patients with clear cell renal cell carcinoma makes it possible to improve the currently used prediction model for disease progression.

Using unsupervised clustering techniques, we identify three molecular subtypes of prostate cancer bone metastases based on gene expression profiles. The robustness of the identified subtypes is confirmed by applying several clustering algorithms with very similar results.
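One common way to quantify how similar the results of several clustering algorithms are is the adjusted Rand index (ARI) between their partitions. The sketch below is an illustration under that assumption; the abstract does not state which agreement measure the thesis uses, and the label vectors are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same samples."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same samples")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two samples are required")
    # Contingency counts: samples sharing each (cluster_a, cluster_b) pair.
    pair_counts = Counter(zip(labels_a, labels_b))
    sum_cells = sum(comb(c, 2) for c in pair_counts.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions consist of a single cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical cluster labels for six samples from three algorithms.
algorithm_1 = [0, 0, 0, 1, 1, 2]
algorithm_2 = [1, 1, 1, 0, 0, 2]  # same grouping, different label names
algorithm_3 = [0, 0, 1, 1, 1, 2]  # one sample assigned differently

print(adjusted_rand_index(algorithm_1, algorithm_2))  # 1.0
print(adjusted_rand_index(algorithm_1, algorithm_3))
```

An ARI of 1 means the two partitions are identical up to relabeling, and values near 0 indicate agreement no better than chance.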

 

Place, publisher, year, edition, pages
Umeå: Umeå universitet, 2020, p. 22
Series
Research report in mathematical statistics, ISSN 1653-0829 ; 70/20
Keywords [en]
cluster analysis, cancer, classification
National Category
Probability Theory and Statistics
Identifiers
URN: urn:nbn:se:umu:diva-167275
ISBN: 978-91-7855-172-9 (print)
ISBN: 978-91-7855-173-6 (electronic)
OAI: oai:DiVA.org:umu-167275
DiVA, id: diva2:1385586
Public defence
2020-02-07, N460, Naturvetarhuset, Umeå, 09:15 (English)
Opponent
Supervisors
Available from: 2020-01-17 Created: 2020-01-14 Last updated: 2021-10-19 Bibliographically approved
List of papers
1. Cluster analysis on high dimensional RNA-seq data with applications to cancer research: An evaluation study
2019 (English) In: PLOS ONE, E-ISSN 1932-6203, Vol. 14, no 12, article id e0219102. Article in journal (Refereed) Published
Abstract [en]

Background: Clustering of gene expression data is widely used to identify novel subtypes of cancer. Plenty of clustering approaches have been proposed, but there is a lack of knowledge regarding their relative merits and how data characteristics influence the performance. We evaluate how cluster analysis choices affect the performance by studying four publicly available human cancer data sets: breast, brain, kidney and stomach cancer. In particular, we focus on how the sample size, distribution of subtypes and sample heterogeneity affect the performance.

Results: In general, increasing the sample size had a limited effect on the clustering performance, e.g. for the breast cancer data similar performance was obtained for n = 40 as for n = 330. The relative distribution of the subtypes had a noticeable effect on the ability to identify the disease subtypes, and data with disproportionate cluster sizes turned out to be difficult to cluster. Both the choice of clustering method and the choice of feature selection method affected the ability to identify the subtypes, but the relative performance varied between data sets, making it difficult to rank the approaches. For some data sets, the performance was substantially higher when the clustering was based on data from only one sex compared to data from a mixed population. This suggests that homogeneous data are easier to cluster than heterogeneous data and that clustering males and females individually may be beneficial and increase the chance of detecting novel subtypes. It was also observed that the performance often differed substantially between females and males.

Conclusions: The number of samples seems to have a limited effect on the performance, while the heterogeneity, at least with respect to sex, is important for the performance. Hence, by analyzing the genders separately, the possible loss caused by having fewer samples could be outweighed by the benefit of more homogeneous data.
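To study the effect of sample size without also changing the distribution of subtypes, one option is to draw stratified subsamples that keep each subtype's share roughly constant. The sketch below illustrates that idea only; it is not taken from the paper, and the function name, the proportional-allocation rule and the example labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, n_total, seed=0):
    """Return sorted sample indices, drawing about n_total samples while
    keeping each group's share close to its share in the full data."""
    if not 0 < n_total <= len(labels):
        raise ValueError("n_total must be between 1 and the number of samples")
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for index, label in enumerate(labels):
        by_group[label].append(index)
    chosen = []
    for label in sorted(by_group):
        members = by_group[label]
        # Proportional allocation, rounded, with at least one sample per group.
        quota = max(1, round(n_total * len(members) / len(labels)))
        chosen.extend(rng.sample(members, min(quota, len(members))))
    return sorted(chosen)


# Hypothetical subtype labels for 100 samples (60/30/10 split).
subtype_labels = ["A"] * 60 + ["B"] * 30 + ["C"] * 10
subset = stratified_subsample(subtype_labels, n_total=40, seed=1)
print(len(subset))
```

Because of rounding, the returned subset can differ slightly in size from `n_total` for other splits. The same function can be applied to one sex at a time by first filtering the samples on a sex annotation.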

Place, publisher, year, edition, pages
San Francisco: Public Library of Science, 2019
Keywords
Cancer, cluster analysis
National Category
Probability Theory and Statistics; Bioinformatics and Computational Biology
Identifiers
urn:nbn:se:umu:diva-167274 (URN)
10.1371/journal.pone.0219102 (DOI)
000534009700002 ()
31805048 (PubMedID)
2-s2.0-85076157692 (Scopus ID)
Available from: 2020-01-14 Created: 2020-01-14 Last updated: 2025-02-05 Bibliographically approved
2. Comparison of methods for variable selection in clustering of high-dimensional RNA-sequencing data to identify cancer subtypes
(English) Manuscript (preprint) (Other academic)
Keywords
feature selection, clustering, RNA-seq, cancer
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:umu:diva-167264 (URN)
Available from: 2020-01-14 Created: 2020-01-14 Last updated: 2020-01-16 Bibliographically approved
3. Gene expression profiles define molecular subtypes of prostate cancer bone metastases with different outcomes and morphology traceable back to the primary tumor
2019 (English) In: Molecular Oncology, ISSN 1574-7891, E-ISSN 1878-0261, Vol. 13, no 8, p. 1763-1777. Article in journal (Refereed) Published
Abstract [en]

Bone metastasis is the lethal end-stage of prostate cancer (PC), but the biology of bone metastases is poorly understood. The overall aim of this study was therefore to explore molecular variability in PC bone metastases of potential importance for therapy. Specifically, genome-wide expression profiles of bone metastases from untreated patients (n = 12) and patients treated with androgen-deprivation therapy (ADT, n = 60) were analyzed in relation to patient outcome and to morphological characteristics in metastases and paired primary tumors. Principal component analysis and unsupervised classification were used to identify sample clusters based on mRNA profiles. Clusters were characterized by gene set enrichment analysis and related to histological and clinical parameters using univariate and multivariate statistics. Selected proteins were analyzed by immunohistochemistry in metastases and matched primary tumors (n = 52) and in transurethral resected prostate (TUR-P) tissue of a separate cohort (n = 59). Three molecular subtypes of bone metastases (MetA-C) characterized by differences in gene expression pattern, morphology, and clinical behavior were identified. MetA (71% of the cases) showed increased expression of androgen receptor-regulated genes, including prostate-specific antigen (PSA), and glandular structures indicating a luminal cell phenotype. MetB (17%) showed expression profiles related to cell cycle activity and DNA damage, and a pronounced cellular atypia. MetC (12%) exhibited enriched stroma-epithelial cell interactions. MetB patients had the lowest serum PSA levels and the poorest prognosis after ADT. Combined analysis of PSA and Ki67 immunoreactivity (proliferation) in bone metastases, paired primary tumors, and TUR-P samples was able to differentiate MetA-like (high PSA, low Ki67) from MetB-like (low PSA, high Ki67) tumors and demonstrate their different prognosis. 
In conclusion, bone metastases from PC patients are separated based on gene expression profiles into molecular subtypes with different morphology, biology, and clinical outcome. These findings deserve further exploration with the purpose of improving treatment of metastatic PC.
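The PSA/Ki67 combination described above can be read as a simple two-marker rule. The sketch below shows that reading only; the abstract gives no cutoffs or scoring scale, so the cutoff parameters, the example scores and the "unclassified" fallback are hypothetical.

```python
def classify_met_like(psa_score, ki67_score, psa_cutoff, ki67_cutoff):
    """Label a tumor MetA-like, MetB-like, or unclassified from two
    immunoreactivity scores. Cutoffs are supplied by the caller because
    the abstract does not specify them."""
    high_psa = psa_score >= psa_cutoff
    high_ki67 = ki67_score >= ki67_cutoff
    if high_psa and not high_ki67:
        return "MetA-like"  # high PSA, low Ki67
    if not high_psa and high_ki67:
        return "MetB-like"  # low PSA, high Ki67
    return "unclassified"   # high/high or low/low: not covered by the rule


# Placeholder scores and cutoffs, chosen only to exercise each branch.
print(classify_met_like(psa_score=8, ki67_score=5, psa_cutoff=6, ki67_cutoff=15))
print(classify_met_like(psa_score=2, ki67_score=30, psa_cutoff=6, ki67_cutoff=15))
```

Tumors that are high or low on both markers fall outside the two patterns named in the abstract, which is why the sketch returns a separate label for them instead of forcing an assignment.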

Place, publisher, year, edition, pages
John Wiley & Sons, 2019
Keywords
bone metastasis, gene expression, gene set enrichment analysis, morphology, survival, unsupervised cluster analysis
National Category
Cancer and Oncology
Identifiers
urn:nbn:se:umu:diva-162668 (URN)
10.1002/1878-0261.12526 (DOI)
000478600200009 ()
31162796 (PubMedID)
2-s2.0-85068158741 (Scopus ID)
Available from: 2019-09-05 Created: 2019-09-05 Last updated: 2023-03-24 Bibliographically approved
4. Combining epigenetic and clinicopathological variables improves prognostic prediction in clear cell Renal Cell Carcinoma
(English) Manuscript (preprint) (Other academic)
Keywords
DNA methylation, cancer, cluster analysis, classification, clear cell renal cell carcinoma
National Category
Cancer and Oncology; Probability Theory and Statistics
Identifiers
urn:nbn:se:umu:diva-167269 (URN)
Available from: 2020-01-14 Created: 2020-01-14 Last updated: 2020-01-31 Bibliographically approved

Open Access in DiVA

fulltext (724 kB), 689 downloads
File information
File name: FULLTEXT01.pdf
File size: 724 kB
Checksum SHA-512: 4ea940d91cd259f4cb3b4bb06d1c71b4bc9a62f24ed2ad66d384efcb4d0a98fa1c88150f6d4d9b28a924a5818a6a38cdaf38a3bdd6c0fbc8a1749de53b5e90d2
Type: fulltext
Mimetype: application/pdf
spikblad (310 kB), 70 downloads
File information
File name: SPIKBLAD01.pdf
File size: 310 kB
Checksum SHA-512: c519416c9c00b81af9d66dd3e84ba4b932148bb3d467e43e30c6e912eb663aa9dca94b55d25e19269ae7d8b4341f0506d3a22dd8b852468e2d0e85878eb6119f
Type: spikblad
Mimetype: application/pdf

Authority records

Vidman, Linda
