Publications (10 of 14)
Källberg, D. & Waernbaum, I. (2023). Large sample properties of entropy balancing estimators of average causal effects. Econometrics and Statistics
2023 (English). In: Econometrics and Statistics, ISSN 2452-3062. Article in journal (Refereed). Epub ahead of print
Abstract [en]

Weighting methods are used in observational studies to adjust for covariate imbalances between treatment and control groups. Entropy balancing (EB) is an alternative to inverse probability weighting with an estimated propensity score. The EB weights are constructed to satisfy balance constraints and optimized towards stability. Large sample properties of EB estimators of the average causal treatment effect, based on the Kullback-Leibler and quadratic Rényi relative entropies, are described. Additionally, estimators of their asymptotic variances are proposed. Even though the objective of EB is to reduce model dependence, the estimators are generally not consistent unless implicit parametric assumptions for the propensity score or conditional outcomes are met. The finite sample properties of the estimators are investigated through a simulation study. The average causal effect of smoking on blood lead levels is estimated using data from the National Health and Nutrition Examination Survey.

Place, publisher, year, edition, pages
Elsevier, 2023
Keywords
calibration weighting, entropy balancing, three-way balance, minimum relative entropy
National Category
Probability Theory and Statistics
Research subject
Statistics; Mathematical Statistics
Identifiers
urn:nbn:se:umu:diva-217989 (URN); 10.1016/j.ecosta.2023.11.004 (DOI); 2-s2.0-85183542892 (Scopus ID)
Funder
Swedish Research Council, 2016-00703; Marianne and Marcus Wallenberg Foundation, 2015.0060
Available from: 2023-12-14 Created: 2023-12-14 Last updated: 2025-12-11
Källberg, D., Vidman, L. & Rydén, P. (2021). Comparison of Methods for Feature Selection in Clustering of High-Dimensional RNA-Sequencing Data to Identify Cancer Subtypes. Frontiers in Genetics, 12, Article ID 632620.
2021 (English). In: Frontiers in Genetics, E-ISSN 1664-8021, Vol. 12, article id 632620. Article in journal (Refereed). Published
Abstract [en]

Cancer subtype identification is important to facilitate cancer diagnosis and select effective treatments. Clustering of cancer patients based on high-dimensional RNA-sequencing data can be used to detect novel subtypes, but only a subset of the features (e.g., genes) contains information related to the cancer subtype. Therefore, it is reasonable to assume that the clustering should be based on a set of carefully selected features rather than all features. Several feature selection methods have been proposed, but how and when to use these methods are still poorly understood. Thirteen feature selection methods were evaluated on four human cancer data sets, all with known subtypes (gold standards), which were only used for evaluation. The methods were characterized by considering mean expression and standard deviation (SD) of the selected genes, the overlap with other methods and their clustering performance, obtained by comparing the clustering result with the gold standard using the adjusted Rand index (ARI). The results were compared to a supervised approach as a positive control and two negative controls in which either a random selection of genes or all genes were included. For all data sets, the best feature selection approach outperformed the negative control and for two data sets the gain was substantial with ARI increasing from (−0.01, 0.39) to (0.66, 0.72), respectively. No feature selection method completely outperformed the others but using the dip-test statistic to select 1000 genes was overall a good choice. The commonly used approach, where genes with the highest SDs are selected, did not perform well in our study.

Place, publisher, year, edition, pages
Frontiers Media S.A., 2021
Keywords
cancer subtypes, feature selection, gene selection, high-dimensional, RNA-seq
National Category
Probability Theory and Statistics; Bioinformatics and Computational Biology
Identifiers
urn:nbn:se:umu:diva-181727 (URN); 10.3389/fgene.2021.632620 (DOI); 000626903100001 (); 2-s2.0-85102373666 (Scopus ID)
Available from: 2021-03-24 Created: 2021-03-24 Last updated: 2025-02-05 Bibliographically approved
Andersson-Evelönn, E., Vidman, L., Källberg, D., Landfors, M., Liu, X., Ljungberg, B., . . . Degerman, S. (2020). Combining epigenetic and clinicopathological variables improves specificity in prognostic prediction in clear cell renal cell carcinoma. Journal of Translational Medicine, 18(1), Article ID 435.
2020 (English). In: Journal of Translational Medicine, E-ISSN 1479-5876, Vol. 18, no 1, article id 435. Article in journal (Refereed). Published
Abstract [en]

Background: Metastasized clear cell renal cell carcinoma (ccRCC) is associated with a poor prognosis. Almost one-third of patients with non-metastatic tumors at diagnosis will later progress to metastatic disease. These patients need to be identified at diagnosis so that closer follow-up and/or adjuvant treatment can be undertaken. Today, clinicopathological variables are used to risk classify patients, but molecular biomarkers are needed to improve risk classification and identify the high-risk patients who will benefit most from modern adjuvant therapies. DNA methylation profiling has emerged as a promising prognostic biomarker in ccRCC. This study aimed to derive a model for prediction of tumor progression after nephrectomy in non-metastatic ccRCC by combining DNA methylation profiling with clinicopathological variables.

Methods: A novel cluster analysis approach (Directed Cluster Analysis) was used to identify molecular biomarkers from genome-wide methylation array data. These novel DNA methylation biomarkers, together with previously identified CpG-site biomarkers and clinicopathological variables, were used to derive predictive classifiers for tumor progression.

Results: The “triple classifier” which included both novel and previously identified DNA methylation biomarkers together with clinicopathological variables predicted tumor progression more accurately than the currently used Mayo scoring system, by increasing the specificity from 50% in Mayo to 64% in our triple classifier at 85% fixed sensitivity. The cumulative incidence of progress (pCIP5yr) was 7.5% in low-risk vs 44.7% in high-risk in M0 patients classified by the triple classifier at diagnosis.

Conclusions: The triple classifier panel that combines clinicopathological variables with genome-wide methylation data has the potential to improve specificity in prognosis prediction for patients with non-metastatic ccRCC.

Keywords
Clear cell renal cell carcinoma, Classification, DNA methylation, Prognosis, Directed cluster analysis
National Category
Clinical Medicine
Identifiers
urn:nbn:se:umu:diva-176921 (URN); 10.1186/s12967-020-02608-1 (DOI); 000594136300002 (); 33187526 (PubMedID); 2-s2.0-85095955809 (Scopus ID)
Funder
The Kempe Foundations; Swedish Research Council; Region Västerbotten
Available from: 2020-11-19 Created: 2020-11-19 Last updated: 2025-02-18 Bibliographically approved
Vidman, L., Källberg, D. & Rydén, P. (2019). Cluster analysis on high dimensional RNA-seq data with applications to cancer research: An evaluation study. PLOS ONE, 14(12), Article ID e0219102.
2019 (English). In: PLOS ONE, E-ISSN 1932-6203, Vol. 14, no 12, article id e0219102. Article in journal (Refereed). Published
Abstract [en]

Background: Clustering of gene expression data is widely used to identify novel subtypes of cancer. Plenty of clustering approaches have been proposed, but there is a lack of knowledge regarding their relative merits and how data characteristics influence the performance. We evaluate how cluster analysis choices affect the performance by studying four publicly available human cancer data sets: breast, brain, kidney and stomach cancer. In particular, we focus on how the sample size, distribution of subtypes and sample heterogeneity affect the performance.

Results: In general, increasing the sample size had limited effect on the clustering performance, e.g. for the breast cancer data similar performance was obtained for n = 40 as for n = 330. The relative distribution of the subtypes had a noticeable effect on the ability to identify the disease subtypes and data with disproportionate cluster sizes turned out to be difficult to cluster. Both the choice of clustering method and selection method affected the ability to identify the subtypes, but the relative performance varied between data sets, making it difficult to rank the approaches. For some data sets, the performance was substantially higher when the clustering was based on data from only one sex compared to data from a mixed population. This suggests that homogeneous data are easier to cluster than heterogeneous data and that clustering males and females individually may be beneficial and increase the chance to detect novel subtypes. It was also observed that the performance often differed substantially between females and males.

Conclusions: The number of samples seems to have a limited effect on the performance, while the heterogeneity, at least with respect to sex, is important for the performance. Hence, by analyzing the sexes separately, the possible loss caused by having fewer samples could be outweighed by the benefit of more homogeneous data.

Place, publisher, year, edition, pages
San Francisco: Public Library of Science, 2019
Keywords
Cancer, cluster analysis
National Category
Probability Theory and Statistics; Bioinformatics and Computational Biology
Identifiers
urn:nbn:se:umu:diva-167274 (URN); 10.1371/journal.pone.0219102 (DOI); 000534009700002 (); 31805048 (PubMedID); 2-s2.0-85076157692 (Scopus ID)
Available from: 2020-01-14 Created: 2020-01-14 Last updated: 2025-02-05 Bibliographically approved
Källberg, D., Belyaev, Y. & Rydén, P. (2018). A moment-distance hybrid method for estimating a mixture of two symmetric densities. Modern Stochastics: Theory and Applications, 5(1), 1-36
2018 (English). In: Modern Stochastics: Theory and Applications, ISSN 2351-6054, Vol. 5, no 1, p. 1-36. Article in journal (Refereed). Published
Abstract [en]

In clustering of high-dimensional data a variable selection is commonly applied to obtain an accurate grouping of the samples. For two-class problems this selection may be carried out by fitting a mixture distribution to each variable. We propose a hybrid method for estimating a parametric mixture of two symmetric densities. The estimator combines the method of moments with the minimum distance approach. An evaluation study including both extensive simulations and gene expression data from acute leukemia patients shows that the hybrid method outperforms a maximum-likelihood estimator in model-based clustering. The hybrid estimator is flexible and performs well also under imprecise model assumptions, suggesting that it is robust and suited for real problems.

Keywords
inference for mixtures, method of moments, minimum distance, model-based clustering
National Category
Probability Theory and Statistics
Research subject
Mathematical Statistics
Identifiers
urn:nbn:se:umu:diva-144644 (URN); 10.15559/17-VMSTA93 (DOI); 000434875200001 (); 2-s2.0-85061938713 (Scopus ID)
Funder
Swedish Research Council, 340-2013-5185
Available from: 2018-02-08 Created: 2018-02-08 Last updated: 2023-03-24 Bibliographically approved
Rydén, P., Källberg, D. & Belyaev, Y. K. K. (2017). The HRD-Algorithm: a general method for parametric estimation of two-component mixture models. Paper presented at Analytical and Computational Methods in Probability Theory, ACMPT 2017. Lecture Notes in Computer Science, 10684, 497-508
2017 (English). In: Lecture Notes in Computer Science, ISSN 0302-9743, E-ISSN 1611-3349, Vol. 10684, p. 497-508. Article in journal (Refereed). Published
Abstract [en]

We introduce a novel approach to estimate the parameters of a mixture of two distributions. The method combines a grid approach with the method of moments and can be applied to a wide range of two-component mixture models. The grid approach enables the use of parallel computing and the method can easily be combined with resampling techniques. We derive the method for the special cases when the data are described by the mixture of two Weibull distributions or the mixture of two normal distributions, and apply the method on gene expression data from 409 ER+ breast cancer patients.

Place, publisher, year, edition, pages
Springer, 2017
Keywords
Mixture models, Parameter estimation, Method of moments, Grid-approach, Resampling, Cluster analysis, Variable selection, High-dimensional data
National Category
Probability Theory and Statistics
Research subject
Mathematical Statistics
Identifiers
urn:nbn:se:umu:diva-144646 (URN); 10.1007/978-3-319-71504-9_41 (DOI); 2-s2.0-85039429074 (Scopus ID)
Conference
Analytical and Computational Methods in Probability Theory, ACMPT 2017.
Funder
Swedish Research Council, 340-2013-5185
Available from: 2018-02-08 Created: 2018-02-08 Last updated: 2023-03-24 Bibliographically approved
Källberg, D. & Seleznjev, O. (2016). Estimation of entropy-type integral functionals. Communications in Statistics - Theory and Methods, 45(4), 887-905
2016 (English). In: Communications in Statistics - Theory and Methods, ISSN 0361-0926, E-ISSN 1532-415X, Vol. 45, no 4, p. 887-905. Article in journal (Other academic). Published
Abstract [en]

Entropy-type integral functionals of densities are widely used in mathematical statistics, information theory, and computer science. Examples include measures of closeness between distributions (e.g., density power divergence) and uncertainty characteristics for a random variable (e.g., Rényi entropy). In this paper, we study U-statistic estimators for a class of such functionals. The estimators are based on ε-close vector observations in the corresponding independent and identically distributed samples. We prove asymptotic properties of the estimators (consistency and asymptotic normality) under mild integrability and smoothness conditions for the densities. The results can be applied in diverse problems in mathematical statistics and computer science (e.g., distribution identification problems, approximate matching for random databases, two-sample problems).

Keywords
Divergence estimation, asymptotic normality, U-statistics, inter-point distances, quadratic functional, entropy estimation
National Category
Probability Theory and Statistics
Research subject
Mathematical Statistics
Identifiers
urn:nbn:se:umu:diva-60993 (URN); 10.1080/03610926.2013.853789 (DOI); 000370612900005 (); 2-s2.0-84958073239 (Scopus ID)
Available from: 2012-11-06 Created: 2012-11-06 Last updated: 2023-03-23 Bibliographically approved
Källberg, D., Leonenko, N. & Seleznjev, O. (2014). Statistical estimation of quadratic Rényi entropy for a stationary m-dependent sequence. Journal of nonparametric statistics (Print), 26(2), 385-411
2014 (English). In: Journal of nonparametric statistics (Print), ISSN 1048-5252, E-ISSN 1029-0311, Vol. 26, no 2, p. 385-411. Article in journal (Refereed). Published
Abstract [en]

The Rényi entropy is a generalization of the Shannon entropy and is widely used in mathematical statistics and applied sciences for quantifying the uncertainty in a probability distribution. We consider estimation of the quadratic Rényi entropy and related functionals for the marginal distribution of a stationary m-dependent sequence. The U-statistic estimators under study are based on the number of ε-close vector observations in the corresponding sample. A variety of asymptotic properties for these estimators are obtained (e.g., consistency, asymptotic normality, Poisson convergence). The results can be used in diverse statistical and computer science problems whenever the conventional independence assumption is too strong (e.g., ε-keys in time series databases, distribution identification problems for dependent samples).

Place, publisher, year, edition, pages
Taylor & Francis, 2014
Keywords
entropy estimation, quadratic Rényi entropy, stationary m-dependent sequence, U-statistics, inter-point distances
National Category
Probability Theory and Statistics
Research subject
Mathematical Statistics
Identifiers
urn:nbn:se:umu:diva-79958 (URN); 10.1080/10485252.2013.854438 (DOI); 000334160600011 (); 2-s2.0-84893178465 (Scopus ID)
Note

Included in thesis 2013 in submitted form.

Available from: 2013-09-04 Created: 2013-09-04 Last updated: 2023-03-24 Bibliographically approved
Källberg, D. (2013). Nonparametric Statistical Inference for Entropy-type Functionals. (Doctoral dissertation). Umeå: Umeå universitet
2013 (English). Doctoral thesis, comprehensive summary (Other academic)
Alternative title[sv]
Icke-parametrisk statistisk inferens för entropirelaterade funktionaler
Abstract [en]

In this thesis, we study statistical inference for entropy, divergence, and related functionals of one or two probability distributions. Asymptotic properties of particular nonparametric estimators of such functionals are investigated. We consider estimation from both independent and dependent observations. The thesis consists of an introductory survey of the subject and some related theory and four papers (A-D).

In Paper A, we consider a general class of entropy-type functionals which includes, for example, integer order Rényi entropy and certain Bregman divergences. We propose U-statistic estimators of these functionals based on the coincident or epsilon-close vector observations in the corresponding independent and identically distributed samples. We prove some asymptotic properties of the estimators such as consistency and asymptotic normality. Applications of the obtained results related to entropy maximizing distributions, stochastic databases, and image matching are discussed.

In Paper B, we provide some important generalizations of the results for continuous distributions in Paper A. The consistency of the estimators is obtained under weaker density assumptions. Moreover, we introduce a class of functionals of quadratic order, including both entropy and divergence, and prove normal limit results for the corresponding estimators which are valid even for densities of low smoothness. The asymptotic properties of a divergence-based two-sample test are also derived.

In Paper C, we consider estimation of the quadratic Rényi entropy and some related functionals for the marginal distribution of a stationary m-dependent sequence. We investigate asymptotic properties of the U-statistic estimators for these functionals introduced in Papers A and B when they are based on a sample from such a sequence. We prove consistency, asymptotic normality, and Poisson convergence under mild assumptions for the stationary m-dependent sequence. Applications of the results to time-series databases and entropy-based testing for dependent samples are discussed.

In Paper D, we further develop the approach for estimation of quadratic functionals with m-dependent observations introduced in Paper C. We consider quadratic functionals for one or two distributions. The consistency and rate of convergence of the corresponding U-statistic estimators are obtained under weak conditions on the stationary m-dependent sequences. Additionally, we propose estimators based on incomplete U-statistics and show their consistency properties under more general assumptions.

Place, publisher, year, edition, pages
Umeå: Umeå universitet, 2013. p. 21
Keywords
entropy estimation, Rényi entropy, divergence estimation, quadratic density functional, U-statistics, consistency, asymptotic normality, Poisson convergence, stationary m-dependent sequence, inter-point distances, entropy maximizing distribution, two-sample problem, approximate matching
National Category
Probability Theory and Statistics
Research subject
Mathematical Statistics
Identifiers
urn:nbn:se:umu:diva-79976 (URN); 978-91-7459-701-1 (ISBN)
Public defence
2013-09-27, MIT-huset, MA121, Umeå universitet, Umeå, 10:00 (English)
Available from: 2013-09-06 Created: 2013-09-04 Last updated: 2018-06-08 Bibliographically approved
Källberg, D., Leonenko, N. & Oleg, S. (2012). Statistical inference for Rényi entropy functionals. In: Antje Düsterhöft, Meike Klettke, Klaus-Dieter Schewe (Ed.), Conceptual modelling and its theoretical foundations (pp. 36-51). Springer Berlin/Heidelberg
2012 (English). In: Conceptual modelling and its theoretical foundations / [ed] Antje Düsterhöft, Meike Klettke, Klaus-Dieter Schewe, Springer Berlin/Heidelberg, 2012, p. 36-51. Chapter in book (Refereed)
Abstract [en]

Numerous entropy-type characteristics (functionals) generalizing Rényi entropy are widely used in mathematical statistics, physics, information theory, and signal processing for characterizing uncertainty in probability distributions and distribution identification problems. We consider estimators of some entropy (integral) functionals for discrete and continuous distributions based on the number of epsilon-close vector records in the corresponding independent and identically distributed samples from two distributions. The proposed estimators are generalized U-statistics. We show the asymptotic properties of these estimators (e.g., consistency and asymptotic normality). The results can be applied in various problems in computer science and mathematical statistics (e.g., approximate matching for random databases, record linkage, image matching).

Place, publisher, year, edition, pages
Springer Berlin/Heidelberg, 2012
Series
Lecture Notes in Computer Science, ISSN 0302-9743 ; 7260/2012
Keywords
entropy estimation, Rényi entropy, U-statistics, approximate matching, asymptotic normality
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:umu:diva-60788 (URN); 10.1007/978-3-642-28279-9_5 (DOI); 2-s2.0-84857553447 (Scopus ID); 978-3-642-28278-2 (ISBN); 978-3-642-28279-9 (ISBN)
Available from: 2012-11-14 Created: 2012-10-29 Last updated: 2023-03-24 Bibliographically approved
Identifiers
ORCID iD: orcid.org/0000-0003-2386-930x
