Källberg, David

Cluster analysis on high dimensional RNA-seq data with applications to cancer research: An evaluation study
Vidman, Linda
Källberg, David
Rydén, Patrik

2019 (English). In: PLoS ONE, E-ISSN 1932-6203, Vol. 14, no 12, article id e0219102. Article in journal (Refereed), Published
##### Abstract [en]

##### Place, publisher, year, edition, pages

San Francisco: Public Library of Science, 2019
##### Keywords

Cancer, cluster analysis
##### National Category

Probability Theory and Statistics; Bioinformatics and Systems Biology
##### Identifiers

urn:nbn:se:umu:diva-167274 (URN); 10.1371/journal.pone.0219102 (DOI); 31805048 (PubMedID)
Available from: 2020-01-14. Created: 2020-01-14. Last updated: 2020-01-15. Bibliographically approved

Background: Clustering of gene expression data is widely used to identify novel subtypes of cancer. Plenty of clustering approaches have been proposed, but there is a lack of knowledge regarding their relative merits and how data characteristics influence the performance. We evaluate how cluster analysis choices affect the performance by studying four publicly available human cancer data sets: breast, brain, kidney and stomach cancer. In particular, we focus on how the sample size, distribution of subtypes and sample heterogeneity affect the performance.

Results: In general, increasing the sample size had limited effect on the clustering performance, e.g. for the breast cancer data similar performance was obtained for n = 40 as for n = 330. The relative distribution of the subtypes had a noticeable effect on the ability to identify the disease subtypes and data with disproportionate cluster sizes turned out to be difficult to cluster. Both the choice of clustering method and selection method affected the ability to identify the subtypes, but the relative performance varied between data sets, making it difficult to rank the approaches. For some data sets, the performance was substantially higher when the clustering was based on data from only one sex compared to data from a mixed population. This suggests that homogeneous data are easier to cluster than heterogeneous data and that clustering males and females individually may be beneficial and increase the chance to detect novel subtypes. It was also observed that the performance often differed substantially between females and males.

Conclusions: The number of samples seems to have a limited effect on the performance, while the heterogeneity, at least with respect to sex, is important for the performance. Hence, by analyzing the genders separately, the possible loss caused by having fewer samples could be outweighed by the benefit of more homogeneous data.
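
The abstract above speaks of clustering performance without naming a measure. One common way to score a clustering against known subtype labels is the adjusted Rand index (ARI). The sketch below assumes that measure purely for illustration; the function name and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(true_labels, cluster_labels):
    """Adjusted Rand index between two labelings of the same samples."""
    if len(true_labels) != len(cluster_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)
    if n < 2:
        raise ValueError("need at least two samples")

    # Contingency counts: samples sharing both a subtype and a cluster.
    cell_counts = Counter(zip(true_labels, cluster_labels))
    row_counts = Counter(true_labels)
    col_counts = Counter(cluster_labels)

    # Number of sample pairs placed together, per cell / row / column.
    index = sum(comb(c, 2) for c in cell_counts.values())
    row_pairs = sum(comb(c, 2) for c in row_counts.values())
    col_pairs = sum(comb(c, 2) for c in col_counts.values())

    expected = row_pairs * col_pairs / comb(n, 2)  # agreement expected by chance
    maximum = (row_pairs + col_pairs) / 2          # largest attainable agreement
    if maximum == expected:
        # Both partitions are trivial and identical (one cluster or all singletons).
        return 1.0
    return (index - expected) / (maximum - expected)


if __name__ == "__main__":
    # Hypothetical subtype labels and a clustering that misplaces one sample.
    subtypes = ["A", "A", "A", "B", "B", "B"]
    clusters = [0, 0, 1, 1, 1, 1]
    print(adjusted_rand_index(subtypes, clusters))
```

An ARI of 1 means the clustering reproduces the subtypes exactly (up to relabeling), while values near 0 indicate agreement no better than chance. Under this assumed measure, comparing scores across subsamples of different size or composition is one way to probe the effects the abstract describes.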

A moment-distance hybrid method for estimating a mixture of two symmetric densities
Källberg, David
Belyaev, Yuri
Rydén, Patrik

2018 (English). In: Modern Stochastics: Theory and Applications, ISSN 2351-6054, Vol. 5, no 1, p. 1-36. Article in journal (Refereed), Published
##### Abstract [en]

##### Keywords

inference for mixtures, method of moments, minimum distance, model-based clustering
##### National Category

Probability Theory and Statistics
##### Research subject

Mathematical Statistics
##### Identifiers

urn:nbn:se:umu:diva-144644 (URN); 10.15559/17-VMSTA93 (DOI); 000434875200001 ()
##### Funder

Swedish Research Council, 340-2013-5185
Available from: 2018-02-08. Created: 2018-02-08. Last updated: 2018-09-19. Bibliographically approved

In clustering of high-dimensional data, variable selection is commonly applied to obtain an accurate grouping of the samples. For two-class problems this selection may be carried out by fitting a mixture distribution to each variable. We propose a hybrid method for estimating a parametric mixture of two symmetric densities. The estimator combines the method of moments with the minimum distance approach. An evaluation study including both extensive simulations and gene expression data from acute leukemia patients shows that the hybrid method outperforms a maximum-likelihood estimator in model-based clustering. The hybrid estimator is flexible and also performs well under imprecise model assumptions, suggesting that it is robust and suited for real problems.

The HRD-Algorithm: a general method for parametric estimation of two-component mixture models
Rydén, Patrik
Källberg, David
Belyaev, Yu. K.

2017 (English). In: Lecture Notes in Computer Science, ISBN 978-3-319-71504-9, Vol. 10684, p. 497-508. Article in journal (Refereed), Published
##### Abstract [en]

##### Place, publisher, year, edition, pages

Springer, 2017
##### Keywords

Mixture models, Parameter estimation, Method of moments, Grid-approach, Resampling, Cluster analysis, Variable selection, High-dimensional data
##### National Category

Natural Sciences; Probability Theory and Statistics
##### Research subject

Mathematical Statistics
##### Identifiers

urn:nbn:se:umu:diva-144646 (URN); 10.1007/978-3-319-71504-9_41 (DOI)
##### Conference

Analytical and Computational Methods in Probability Theory. ACMPT 2017.
##### Funder

Swedish Research Council, 340-2013-5185
Available from: 2018-02-08. Created: 2018-02-08. Last updated: 2018-10-15. Bibliographically approved

We introduce a novel approach to estimate the parameters of a mixture of two distributions. The method combines a grid approach with the method of moments and can be applied to a wide range of two-component mixture models. The grid approach enables the use of parallel computing and the method can easily be combined with resampling techniques. We derive the method for the special cases when the data are described by the mixture of two Weibull distributions or the mixture of two normal distributions, and apply the method to gene expression data from 409 ER+ breast cancer patients.

Estimation of entropy-type integral functionals
Källberg, David
Seleznjev, Oleg

2016 (English). In: Communications in Statistics - Theory and Methods, ISSN 0361-0926, E-ISSN 1532-415X, Vol. 45, no 4, p. 887-905. Article in journal (Other academic), Published
##### Abstract [en]

##### Keywords

Divergence estimation, asymptotic normality, U-statistics, inter-point distances, quadratic functional, entropy estimation
##### National Category

Probability Theory and Statistics
##### Research subject

Mathematical Statistics
##### Identifiers

urn:nbn:se:umu:diva-60993 (URN); 10.1080/03610926.2013.853789 (DOI); 000370612900005 ()
Available from: 2012-11-06. Created: 2012-11-06. Last updated: 2018-06-08. Bibliographically approved

Entropy-type integral functionals of densities are widely used in mathematical statistics, information theory, and computer science. Examples include measures of closeness between distributions (e.g., density power divergence) and uncertainty characteristics for a random variable (e.g., Rényi entropy). In this paper, we study *U*-statistic estimators for a class of such functionals. The estimators are based on ε-close vector observations in the corresponding independent and identically distributed samples. We prove asymptotic properties of the estimators (consistency and asymptotic normality) under mild integrability and smoothness conditions for the densities. The results can be applied in diverse problems in mathematical statistics and computer science (e.g., distribution identification problems, approximate matching for random databases, two-sample problems).
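
The ε-close construction mentioned in the abstract can be illustrated for the simplest case: the quadratic functional q = ∫ p(x)² dx and the quadratic Rényi entropy h₂ = −log q. The sketch below assumes i.i.d. observations and estimates q by the fraction of sample pairs within Euclidean distance ε, divided by the volume of an ε-ball. Function names, sample size, and the value of ε are illustrative choices rather than details from the paper, and no bias or variance analysis is attempted.

```python
import math
import random
from itertools import combinations


def ball_volume(radius, dim):
    """Volume of a Euclidean ball of the given radius in `dim` dimensions."""
    return math.pi ** (dim / 2) / math.gamma(dim / 2 + 1) * radius ** dim


def quadratic_functional_estimate(sample, eps):
    """Estimate q = integral of p(x)^2 from the share of eps-close pairs."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    dim = len(sample[0])
    # Count unordered pairs of observations within distance eps.
    close_pairs = sum(
        1 for x, y in combinations(sample, 2) if math.dist(x, y) <= eps
    )
    pair_fraction = close_pairs / math.comb(n, 2)  # U-statistic with an indicator kernel
    return pair_fraction / ball_volume(eps, dim)


def quadratic_renyi_entropy_estimate(sample, eps):
    """Plug-in estimate of h_2 = -log q."""
    q_hat = quadratic_functional_estimate(sample, eps)
    if q_hat <= 0:
        raise ValueError("no eps-close pairs found; increase eps or the sample size")
    return -math.log(q_hat)


if __name__ == "__main__":
    rng = random.Random(0)
    # Observations are tuples so the same code handles vector data.
    sample = [(rng.gauss(0.0, 1.0),) for _ in range(1000)]
    print(quadratic_functional_estimate(sample, eps=0.1))
    print(quadratic_renyi_entropy_estimate(sample, eps=0.1))
```

For a one-dimensional standard normal sample the target value is q = 1/(2√π) ≈ 0.282, which gives a rough check on the output. The pairwise loop costs O(n²) distance evaluations, so this form is only practical for moderate sample sizes.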

Statistical estimation of quadratic Rényi entropy for a stationary *m*-dependent sequence
Källberg, David
Leonenko, Nikolaj
Seleznjev, Oleg

2014 (English). In: Journal of nonparametric statistics (Print), ISSN 1048-5252, E-ISSN 1029-0311, Vol. 26, no 2, p. 385-411. Article in journal (Refereed), Published
##### Abstract [en]

##### Place, publisher, year, edition, pages

Taylor & Francis, 2014
##### Keywords

entropy estimation, quadratic Rényi entropy, stationary m-dependent sequence, U-statistics, inter-point distances
##### National Category

Probability Theory and Statistics
##### Research subject

Mathematical Statistics
##### Identifiers

urn:nbn:se:umu:diva-79958 (URN); 10.1080/10485252.2013.854438 (DOI); 000334160600011 ()
##### Note

The Rényi entropy is a generalization of the Shannon entropy and is widely used in mathematical statistics and applied sciences for quantifying the uncertainty in a probability distribution. We consider estimation of the quadratic Rényi entropy and related functionals for the marginal distribution of a stationary *m*-dependent sequence. The *U*-statistic estimators under study are based on the number of ε-close vector observations in the corresponding sample. A variety of asymptotic properties for these estimators are obtained (e.g., consistency, asymptotic normality, Poisson convergence). The results can be used in diverse statistical and computer science problems whenever the conventional independence assumption is too strong (e.g., ε-keys in time series databases, distribution identification problems for dependent samples).

Included in thesis 2013 in submitted form.

Nonparametric Statistical Inference for Entropy-type Functionals
Källberg, David

2013 (English). Doctoral thesis, comprehensive summary (Other academic)
##### Alternative title[sv]

Icke-parametrisk statistisk inferens för entropirelaterade funktionaler
##### Abstract [en]

##### Place, publisher, year, edition, pages

Umeå: Umeå universitet, 2013. p. 21
##### Keywords

entropy estimation, Rényi entropy, divergence estimation, quadratic density functional, U-statistics, consistency, asymptotic normality, Poisson convergence, stationary m-dependent sequence, inter-point distances, entropy maximizing distribution, two-sample problem, approximate matching
##### National Category

Probability Theory and Statistics
##### Research subject

Mathematical Statistics
##### Identifiers

urn:nbn:se:umu:diva-79976 (URN); 978-91-7459-701-1 (ISBN)
##### Public defence

2013-09-27, MIT-huset, MA121, Umeå universitet, Umeå, 10:00 (English)
##### Opponent

### Koski, Timo

##### Supervisors

Available from: 2013-09-06. Created: 2013-09-04. Last updated: 2018-06-08. Bibliographically approved

In this thesis, we study statistical inference for entropy, divergence, and related functionals of one or two probability distributions. Asymptotic properties of particular nonparametric estimators of such functionals are investigated. We consider estimation from both independent and dependent observations. The thesis consists of an introductory survey of the subject and some related theory and four papers (A-D).

In Paper A, we consider a general class of entropy-type functionals which includes, for example, integer order Rényi entropy and certain Bregman divergences. We propose *U*-statistic estimators of these functionals based on the coincident or epsilon-close vector observations in the corresponding independent and identically distributed samples. We prove some asymptotic properties of the estimators such as consistency and asymptotic normality. Applications of the obtained results related to entropy maximizing distributions, stochastic databases, and image matching are discussed.

In Paper B, we provide some important generalizations of the results for continuous distributions in Paper A. The consistency of the estimators is obtained under weaker density assumptions. Moreover, we introduce a class of functionals of quadratic order, including both entropy and divergence, and prove normal limit results for the corresponding estimators which are valid even for densities of low smoothness. The asymptotic properties of a divergence-based two-sample test are also derived.

In Paper C, we consider estimation of the quadratic Rényi entropy and some related functionals for the marginal distribution of a stationary *m*-dependent sequence. We investigate asymptotic properties of the *U*-statistic estimators for these functionals introduced in Papers A and B when they are based on a sample from such a sequence. We prove consistency, asymptotic normality, and Poisson convergence under mild assumptions for the stationary *m*-dependent sequence. Applications of the results to time-series databases and entropy-based testing for dependent samples are discussed.

In Paper D, we further develop the approach for estimation of quadratic functionals with *m*-dependent observations introduced in Paper C. We consider quadratic functionals for one or two distributions. The consistency and rate of convergence of the corresponding *U*-statistic estimators are obtained under weak conditions on the stationary *m*-dependent sequences. Additionally, we propose estimators based on incomplete *U*-statistics and show their consistency properties under more general assumptions.

Statistical inference for Rényi entropy functionals
Källberg, David
Leonenko, Nikolaj
Seleznjev, Oleg

2012 (English). In: Conceptual modelling and its theoretical foundations / [ed] Antje Düsterhöft, Meike Klettke, Klaus-Dieter Schewe, Springer Berlin/Heidelberg, 2012, p. 36-51. Chapter in book (Refereed)
##### Abstract [en]

##### Place, publisher, year, edition, pages

Springer Berlin/Heidelberg, 2012
##### Series

Lecture Notes in Computer Science, ISSN 0302-9743 ; 7260/2012
##### Keywords

entropy estimation, Rényi entropy, U-statistics, approximate matching, asymptotic normality
##### National Category

Probability Theory and Statistics
##### Identifiers

urn:nbn:se:umu:diva-60788 (URN); 10.1007/978-3-642-28279-9_5 (DOI); 978-3-642-28278-2 (ISBN); 978-3-642-28279-9 (ISBN)
Available from: 2012-11-14. Created: 2012-10-29. Last updated: 2018-06-08. Bibliographically approved

Numerous entropy-type characteristics (functionals) generalizing Rényi entropy are widely used in mathematical statistics, physics, information theory, and signal processing for characterizing uncertainty in probability distributions and distribution identification problems. We consider estimators of some entropy (integral) functionals for discrete and continuous distributions based on the number of epsilon-close vector records in the corresponding independent and identically distributed samples from two distributions. The proposed estimators are generalized *U*-statistics. We show the asymptotic properties of these estimators (e.g., consistency and asymptotic normality). The results can be applied in various problems in computer science and mathematical statistics (e.g., approximate matching for random databases, record linkage, image matching).

Combining epigenetic and clinicopathological variables improves prognostic prediction in clear cell Renal Cell Carcinoma
Andersson-Evelönn, Emma
Vidman, Linda
Källberg, David
Landfors, Mattias
Liu, Xijia
Ljungberg, Börje
Hultdin, Magnus
Degerman, Sofie
Rydén, Patrik

Umeå University, Faculty of Science and Technology, Department of Mathematics and Mathematical Statistics.

(English). Manuscript (preprint) (Other academic)
##### Keywords

DNA methylation, cancer, cluster analysis, classification, clear cell renal cell carcinoma
##### National Category

Cancer and Oncology; Probability Theory and Statistics
##### Identifiers

urn:nbn:se:umu:diva-167269 (URN)
Available from: 2020-01-14. Created: 2020-01-14. Last updated: 2020-01-16. Bibliographically approved

Comparison of methods for variable selection in clustering of high-dimensional RNA-sequencing data to identify cancer subtypes
Källberg, David
Vidman, Linda
Rydén, Patrik

Umeå University, Faculty of Science and Technology, Department of Mathematics and Mathematical Statistics.

(English). Manuscript (preprint) (Other academic)
##### Keywords

feature selection, clustering, RNA-seq, cancer
##### National Category

Probability Theory and Statistics
##### Identifiers

urn:nbn:se:umu:diva-167264 (URN)
Available from: 2020-01-14. Created: 2020-01-14. Last updated: 2020-01-16. Bibliographically approved

Estimation of quadratic density functionals under *m*-dependence
Källberg, David
Seleznjev, Oleg

(English). Manuscript (preprint) (Other academic)
##### Abstract [en]

##### Keywords

quadratic density functional, entropy estimation, divergence estimation, stationary m-dependent sequences, Rényi entropy, incomplete U-statistics
##### National Category

Probability Theory and Statistics
##### Research subject

Mathematical Statistics
##### Identifiers

urn:nbn:se:umu:diva-79956 (URN)
Available from: 2013-09-04. Created: 2013-09-04. Last updated: 2018-06-08. Bibliographically approved

In this paper, we study estimation of certain integral functionals of one or two densities with samples from stationary *m*-dependent sequences. We consider two types of *U*-statistic estimators for these functionals that are functions of the number of epsilon-close vector observations in the samples. We show that the estimators are consistent and obtain their rates of convergence under weak distributional assumptions. In particular, we propose estimators based on incomplete *U*-statistics which have favorable consistency properties even when *m*-dependence is the only dependence condition that can be imposed on the stationary sequences. The results can be used for divergence and entropy estimation, and thus find many applications in statistics and applied sciences.