Publications (10 of 156)
Wallmark, J., Wiberg, M. & Eriksson, M. (2026). Developing and validating a frailty score based on patient-reported outcome 3 months after stroke: A Riksstroke-based study. PLOS ONE, 21(2), e0343249, Article ID e0343249.
2026 (English). In: PLOS ONE, E-ISSN 1932-6203, Vol. 21, no 2, p. e0343249, article id e0343249. Article in journal (Refereed). Published
Abstract [en]

BACKGROUND: Frailty is common after stroke and linked to poor outcomes, but many measures are clinician-rated, time-consuming, and not suited to patient-reported data. To address these issues, we developed and validated a frailty score from the Swedish Stroke Register (Riksstroke) three-month follow-up questionnaire.

METHODS: We analyzed responses from 19,470 stroke survivors to nine patient-reported items in the 2021-2022 Riksstroke questionnaire covering function, mood, fatigue, pain, and general health. Dimensionality was assessed with Mokken Scale Analysis and exploratory factor analysis. Item response theory (IRT) was used for score computation. Competing graded response IRT models (unidimensional, correlated-factor, bifactor) were compared, and measurement fairness was examined using differential item functioning (DIF) across age, sex, and education. Prognostic validity was tested with Kaplan-Meier curves and Cox regression for all-cause mortality.

RESULTS: In the Mokken Scale Analysis, all items met the scalability criteria. Factor analysis suggested two correlated, interpretable facets (Physical Functioning; Well-being/Mental Health). A bifactor IRT model provided the best fit to the data, comprising a general frailty dimension while accommodating the strong correlation between the facets. DIF was minimal for sex and education, with modest age-related effects. Higher frailty scores were associated with increased mortality in adjusted Cox models and Kaplan-Meier curves. Tools for computing frailty scores are available at https://github.com/joakimwallmark/frailty-irt-scores.

CONCLUSIONS: A robust, fair, and prognostically meaningful frailty score can be derived from patient-reported items in Riksstroke. More broadly, the study demonstrates how routinely collected patient-reported outcome measures can be leveraged to build scalable frailty scores, offering efficient, cost-effective tools for monitoring outcomes and guiding quality improvement in stroke care.
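
As a concrete illustration of the scoring machinery described above, here is a minimal R sketch of expected a posteriori (EAP) scoring under a unidimensional graded response model, the model family the paper compares. All item parameters and responses below are hypothetical, and the paper's bifactor structure is not reproduced; the authors' actual scoring tools are at the GitHub link in the abstract.

grm_cat_probs <- function(theta, a, b) {
  # cumulative P(X >= k) for thresholds b, padded with 1 and 0
  cum <- c(1, plogis(a * (theta - b)), 0)
  -diff(cum)                          # category probabilities P(X = 0..K-1)
}

eap_score <- function(responses, a, b_list,
                      nodes = seq(-4, 4, length.out = 61)) {
  prior <- dnorm(nodes)               # standard normal prior on theta
  lik <- sapply(nodes, function(th)
    prod(mapply(function(x, ai, bi) grm_cat_probs(th, ai, bi)[x + 1],
                responses, a, b_list)))
  sum(nodes * lik * prior) / sum(lik * prior)  # posterior mean of theta
}

# three hypothetical 4-category items, answered 2, 0, 3
a <- c(1.2, 0.8, 1.5)
b_list <- list(c(-1, 0, 1), c(-0.5, 0.5, 1.5), c(-1.5, -0.5, 0.5))
eap_score(c(2, 0, 3), a, b_list)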

National Category
Medical Biostatistics
Research subject
Statistics; Epidemiology
Identifiers
urn:nbn:se:umu:diva-250646 (URN), 10.1371/journal.pone.0343249 (DOI), 41719319 (PubMedID)
Funder
Swedish Research Council, 2022-02046; Swedish Research Council, 2024-02846
Available from: 2026-03-04 Created: 2026-03-04 Last updated: 2026-03-04
Wallin, G. & Wiberg, M. (2026). Propensity score methods for local test score equating: stratification and inverse probability weighting. Journal of Educational and Behavioral Statistics
2026 (English). In: Journal of Educational and Behavioral Statistics, ISSN 1076-9986, E-ISSN 1935-1054. Article in journal (Refereed). Epub ahead of print
Abstract [en]

In test equating, ensuring score comparability across different test forms is crucial but particularly challenging when test groups are non-equivalent and no anchor test is available. Local test equating aims to satisfy Lord’s equity requirement by conditioning equating transformations on individual-level information, typically using anchor test scores as proxies for latent ability. However, anchor tests are not always available in practice. This paper introduces two novel propensity score-based methods for local equating: stratification and inverse probability weighting (IPW). These methods use covariates to account for group differences, with propensity scores serving as proxies for latent ability differences between test groups. The stratification method partitions examinees into comparable groups based on similar propensity scores, while IPW assigns weights inversely proportional to the probability of group membership. We evaluate these methods through empirical analysis and simulation studies. Results indicate both methods can effectively adjust for group differences, with their relative performance depending on the strength of covariate-ability correlations. The study extends local equating methodology to cases where only covariate information is available, providing testing programs with new tools for ensuring fair score comparability.
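
To make the two approaches concrete, here is a minimal R sketch (with hypothetical data and covariate names) of the propensity-score step: a logistic model for group membership, followed by stratification on the estimated scores and inverse probability weights. The local equating transformations that condition on these quantities are not reproduced here.

set.seed(1)
n <- 2000
dat <- data.frame(group = rbinom(n, 1, 0.5),      # which test form was taken
                  grade = rnorm(n),               # hypothetical covariates
                  edu   = rbinom(n, 1, 0.4))
ps_fit <- glm(group ~ grade + edu, data = dat, family = binomial)
dat$ps <- fitted(ps_fit)                          # P(group = 1 | covariates)

# (i) stratification: examinees with similar propensity scores are compared
dat$stratum <- cut(dat$ps, quantile(dat$ps, 0:5 / 5), include.lowest = TRUE)

# (ii) IPW: weight by the inverse probability of the group actually observed
dat$w <- ifelse(dat$group == 1, 1 / dat$ps, 1 / (1 - dat$ps))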

Place, publisher, year, edition, pages
Sage Publications, 2026
Keywords
inverse probability weighting, local equating, propensity scores, stratification, test equating
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:umu:diva-248590 (URN), 10.3102/10769986251397558 (DOI), 001654667800001 (), 2-s2.0-105026732163 (Scopus ID)
Funder
Marianne and Marcus Wallenberg Foundation, MMW 2019.0129
Available from: 2026-01-21 Created: 2026-01-21 Last updated: 2026-01-21
Wallin, L., Svedin, C.-G., Wiberg, M. & Dennhag, I. (2026). The compassionate engagement and action scale for youths: psychometric properties in a clinical psychiatric Swedish sample. Frontiers in Psychology, 16, Article ID 1653979.
2026 (English). In: Frontiers in Psychology, E-ISSN 1664-1078, Vol. 16, article id 1653979. Article in journal (Refereed). Published
Abstract [en]

Background: Compassion contributes to wellbeing and serves as a protective factor against mental health problems in young people. The Compassionate Engagement and Action Scale for Youth—Swedish version (CEASY-SE) is a self-report questionnaire that assesses compassion across six competencies, organized into three scales: Self-Compassion, Compassion for Others, and Compassion from Others. While previously validated in a Swedish school sample, its applicability in clinical psychiatric populations remains unexplored. This study aimed to evaluate the psychometric properties of CEASY-SE in a clinical psychiatric sample of Swedish youth aged 16–22.

Methods: A cross-sectional study was conducted, collecting self-reported data from young people (N = 355) receiving care in child and adolescent psychiatry and primary care. We assessed the CEASY-SE's factor structure, reliability, and validity (internal consistency, convergent, divergent, construct, and criterion-related validity). Sex and age differences were also analyzed, along with comparisons of total scores across diagnostic groups.

Results: Confirmatory factor analyses supported a two-factor model within each of the three scales. Internal consistency was good to excellent across all scales (α ranging from 0.75 to 0.92), except for the Self-Compassion Engagement subscale among males (α = 0.62; ω = 0.60). Convergent and divergent validity were satisfactory. Among the three CEASY-SE scales, Self-Compassion showed the strongest correlations with mental health outcomes. Sex differences primarily affected Compassion for Others, and a sex-by-age interaction was found for both Compassion for Others and Self-Compassion. A total score of 40 was associated with an 8.41-fold increase in the predicted probability of depression symptoms compared to a score of 100. The lowest scores were found among patients with eating disorders and depression.

Conclusions: The CEASY-SE demonstrates acceptable to excellent psychometric properties in a clinical psychiatric sample of youth. It is a promising tool for clinicians and researchers to assess and promote compassion in young people, with potential relevance for interventions aimed at improving mental health outcomes.
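
For orientation, a minimal R sketch of coefficient alpha, the internal-consistency index reported above, computed from a hypothetical item-score matrix; the omega values in the paper require a fitted factor model and are not shown.

cronbach_alpha <- function(X) {
  # X: respondents x items numeric matrix
  k <- ncol(X)
  k / (k - 1) * (1 - sum(apply(X, 2, var)) / var(rowSums(X)))
}

set.seed(7)
common <- rnorm(200)                               # hypothetical shared trait
X <- sapply(1:6, function(j) common + rnorm(200))  # six items loading on it
cronbach_alpha(X)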

Place, publisher, year, edition, pages
Frontiers Media S.A., 2026
Keywords
clinical youth sample, compassion, confirmatory factor analysis, psychometrics, reliability, validity
National Category
Psychiatry; Social Work
Identifiers
urn:nbn:se:umu:diva-249158 (URN), 10.3389/fpsyg.2025.1653979 (DOI), 001663706500001 (), 41561593 (PubMedID), 2-s2.0-105027862287 (Scopus ID)
Funder
Region Västerbotten; Norrbotten County Council
Available from: 2026-01-30 Created: 2026-01-30 Last updated: 2026-01-30. Bibliographically approved
Ramsay, J. O., Li, J., Wallmark, J. & Wiberg, M. (2025). An information manifold perspective for analyzing test data. Applied Psychological Measurement, 49(3), 90-108
2025 (English). In: Applied Psychological Measurement, ISSN 0146-6216, E-ISSN 1552-3497, Vol. 49, no 3, p. 90-108. Article in journal (Refereed). Published
Abstract [en]

Modifications of current psychometric models for analyzing test data are proposed that produce an additive scale measure of information. This information measure is a one-dimensional space curve or curved surface manifold that is invariant across varying manifold indexing systems. The arc length along a curve manifold is used as it is an additive metric having a defined zero and a version of the bit as a unit. This property, referred to here as the scope of the test or an item, facilitates the evaluation of graphs and numerical summaries. The measurement power of the test is defined by the length of the manifold, and the performance or experiential level of a person by a position along the curve. In this study, we also use all information from the items including the information from the distractors. Test data from a large-scale college admissions test are used to illustrate the test information manifold perspective and to compare it with the well-known item response theory nominal model. It is illustrated that the use of information theory opens a vista of new ways of assessing item performance and inter-item dependency, as well as test takers' knowledge.
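
A loose R sketch of the arc-length idea, using a toy two-parameter logistic test in place of the paper's spline-based curves: the surprisal curves S_j(theta) = -log2 P_j(theta) are treated as coordinates of a space curve and the curve's length is accumulated in bits. Only the correct-response categories are used here, so the resulting "scope" value is illustrative, not the paper's measure.

a <- c(1.0, 1.5, 0.7); b <- c(-0.5, 0.2, 0.8)   # toy 2PL item parameters
dS <- function(th) {
  P <- plogis(a * (th - b))
  -a * (1 - P) / log(2)       # d/dtheta of -log2 P_j(theta) for a 2PL item
}
speed <- function(th) sqrt(sum(dS(th)^2))       # norm of the tangent vector
arc_length <- integrate(Vectorize(speed), -3, 3)$value  # test "scope" in bits
arc_length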

Place, publisher, year, edition, pages
Sage Publications, 2025
Keywords
entropy, expected sum score, nominal model, scope, score index, spline functions, surprisal, test information, TestGardener
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:umu:diva-233717 (URN), 10.1177/01466216241310600 (DOI), 001380542200001 (), 39713764 (PubMedID), 2-s2.0-105001648191 (Scopus ID)
Funder
Wallenberg Foundations, MMW 2019.012
Available from: 2025-01-09 Created: 2025-01-09 Last updated: 2025-04-29. Bibliographically approved
Wiberg, M. & Laukaityte, I. (2025). Calculating bias in test score equating in a NEAT design. Applied Psychological Measurement, 49(7), 350-366
2025 (English). In: Applied Psychological Measurement, ISSN 0146-6216, E-ISSN 1552-3497, Vol. 49, no 7, p. 350-366. Article in journal (Refereed). Published
Abstract [en]

Test score equating is used to make scores from different test forms comparable, even when groups differ in ability. In practice, the non-equivalent groups with anchor test (NEAT) design is commonly used. The overall aim was to compare the amount of bias under different conditions when using either chained equating or frequency estimation with five different criterion functions: the identity function, linear equating, equipercentile equating, chained equating, and frequency estimation. We used real test data from a multiple-choice, binary-scored college admissions test to illustrate that the choice of criterion function matters. Further, we simulated data in line with the empirical data to examine differences in ability between groups, in item difficulty, in anchor and regular test form length, in correlations between anchor and regular test forms, and in sample size. The results indicate that how bias is defined heavily affects the conclusions drawn about which equating method is preferable in different scenarios. Practical implications for standardized tests are given, together with recommendations on how to calculate bias when evaluating equating transformations.
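
To fix ideas, a minimal R sketch of the quantity at stake: bias of an estimated equating transformation measured against a chosen criterion function, weighted by the score distribution. All functions below are hypothetical stand-ins; swapping the criterion changes the bias, which is the paper's point.

scores <- 0:40
freq <- dbinom(scores, 40, 0.55)             # stand-in score distribution
eq_hat <- function(x) x + 0.8                # estimated equating function
crit_identity <- function(x) x               # identity criterion
crit_linear   <- function(x) 1.02 * x + 0.3  # a linear-equating criterion

bias <- function(eq, crit) sum(freq * (eq(scores) - crit(scores)))
c(identity = bias(eq_hat, crit_identity),
  linear   = bias(eq_hat, crit_linear))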

Place, publisher, year, edition, pages
Sage Publications, 2025
Keywords
criterion function, frequency estimation, chained equating
National Category
Probability Theory and Statistics
Research subject
Statistics
Identifiers
urn:nbn:se:umu:diva-236964 (URN), 10.1177/01466216251330305 (DOI), 001450757600001 (), 40162326 (PubMedID), 2-s2.0-105001869754 (Scopus ID)
Funder
Marianne and Marcus Wallenberg Foundation, 2019.0129
Available from: 2025-03-26 Created: 2025-03-26 Last updated: 2025-09-22. Bibliographically approved
Laukaityte, I., Wallin, G. & Wiberg, M. (2025). Combining propensity scores and common items for test score equating. Applied Psychological Measurement, Article ID 01466216251363240.
2025 (English). In: Applied Psychological Measurement, ISSN 0146-6216, E-ISSN 1552-3497, article id 01466216251363240. Article in journal (Refereed). Epub ahead of print
Abstract [en]

Ensuring that test scores are fair and comparable across different test forms and different test groups is a significant statistical challenge in educational testing. Methods to achieve score comparability, a process known as test score equating, often rely on including common test items or assuming that test-taker groups are similar in key characteristics. This study explores a novel approach that combines propensity scores, based on test takers' background covariates, with information from common items using kernel smoothing techniques for binary-scored test items. An empirical analysis using data from a high-stakes college admissions test evaluates the standard errors and differences in adjusted test scores. A simulation study examines the impact of factors such as the number of test takers, the number of common items, and the correlation between covariates and test scores on the method's performance. The findings demonstrate that integrating propensity scores with common-item information reduces standard errors and bias more effectively than using either source alone. This suggests that balancing the groups on the test takers' covariates enhances the fairness and accuracy of test score comparisons across different groups. The proposed method highlights the benefits of using all the collected data to improve score comparability.
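
A minimal R sketch of the kernel-smoothing device referred to above: Gaussian kernel continuization of a discrete score distribution, in the standard kernel equating form. In the paper's approach the relative frequencies r would come from propensity-score-weighted, common-item-adjusted estimates; here they are a hypothetical binomial stand-in.

kernel_cdf <- function(x, scores, r, h) {
  mu <- sum(r * scores)
  s2 <- sum(r * (scores - mu)^2)
  a  <- sqrt(s2 / (s2 + h^2))      # preserves the mean and variance of r
  sapply(x, function(xx)
    sum(r * pnorm((xx - a * scores - (1 - a) * mu) / (a * h))))
}

scores <- 0:20
r <- dbinom(scores, 20, 0.5)       # hypothetical score probabilities
kernel_cdf(c(5, 10, 15), scores, r, h = 0.6)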

Place, publisher, year, edition, pages
Sage Publications, 2025
Keywords
academic admission, educational testing, equating, fairness, nonequivalent groups with anchor test design
National Category
Probability Theory and Statistics; Psychology (Excluding Applied Psychology)
Identifiers
urn:nbn:se:umu:diva-243387 (URN), 10.1177/01466216251363240 (DOI), 001539586300001 (), 40757034 (PubMedID), 2-s2.0-105012775902 (Scopus ID)
Funder
Marianne and Marcus Wallenberg Foundation, 2019.0129
Available from: 2025-08-20 Created: 2025-08-20 Last updated: 2025-08-20
Wiberg, M., González, J. & von Davier, A. A. (2025). Generalized kernel equating with applications in R (1st ed.). Boca Raton: CRC Press
2025 (English). Book (Refereed)
Abstract [en]

Generalized Kernel Equating is a comprehensive guide for statisticians, psychometricians, and educational researchers aiming to master test score equating. This book introduces the Generalized Kernel Equating (GKE) framework, providing the necessary tools and methodologies for accurate and fair score comparisons.

The book presents test score equating as a statistical problem and covers all commonly used data collection designs. It details the five steps of the GKE framework: presmoothing, estimating score probabilities, continuization, equating transformation, and evaluating the equating transformation. Various presmoothing strategies are explored, including log-linear models, item response theory models, beta4 models, and discrete kernel estimators. The estimation of score probabilities when using IRT models is described, and Gaussian kernel continuization is extended to other kernels such as uniform, logistic, Epanechnikov, and adaptive kernels. Several bandwidth selection methods are described. The kernel equating transformation and variants of it are defined, and both equating-specific and statistical measures for evaluating equating transformations are included. Real data examples guide readers through the GKE steps, with detailed R code and explanations. Readers are equipped with advanced knowledge and practical skills for implementing test score equating methods.
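
As a taste of the first GKE step, a minimal R sketch of log-linear presmoothing: a Poisson model with polynomial score terms fitted to observed frequencies, one of the presmoothing strategies the book covers. The data are simulated stand-ins; the book's own R code walks through the full five-step pipeline.

set.seed(3)
scores <- 0:40
obs <- as.vector(table(factor(
  sample(scores, 1500, replace = TRUE, prob = dbinom(scores, 40, 0.6)),
  levels = scores)))                          # observed score frequencies
fit <- glm(obs ~ poly(scores, 4), family = poisson)
r_smoothed <- fitted(fit) / sum(fitted(fit))  # presmoothed score probabilities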

Place, publisher, year, edition, pages
Boca Raton: CRC Press, 2025. p. 235. Edition: 1
Series
Chapman & Hall/CRC Statistics in the Social and Behavioral Sciences
National Category
Probability Theory and Statistics
Research subject
Statistics
Identifiers
urn:nbn:se:umu:diva-231970 (URN), 10.1201/9781315283777 (DOI), 2-s2.0-85208298899 (Scopus ID), 9781138196988 (ISBN), 9781032904955 (ISBN), 9781315283777 (ISBN)
Available from: 2024-11-19 Created: 2024-11-19 Last updated: 2024-11-22. Bibliographically approved
Franco, V. R. & Wiberg, M. (2025). How to score respondents? A Monte Carlo study comparing three different procedures: [Como pontuar respondentes? Um estudo de Monte Carlo comparando três procedimentos diferentes] [¿Cómo calificar a los respondientes? Un estudio de Monte Carlo que compara tres procedimientos diferentes]. Avaliacao Psicologica, 24, Article ID e22224.
2025 (English). In: Avaliacao Psicologica, ISSN 1677-0471, Vol. 24, article id e22224. Article in journal (Refereed). Published
Abstract [en]

This study aimed to empirically compare the effectiveness of Likert, Thurstonian, and Expected a Posteriori (EAP) scoring methods. A computational simulation of the two-parameter logistic model was employed under various conditions, including different sample sizes, numbers of items, scale levels, item extremeness, and varying discriminations. Effectiveness was assessed through correlation with the true score, root mean squared errors, bias, and the accurate recovery of effect sizes. The results indicated that Likert scores exhibit greater bias than EAP and Thurstonian scores for extreme scores; however, they strongly correlate with both the true score and EAP scores. Likert scores were slightly more effective in recovering mean differences between two groups, correlation estimates, and regression parameters. Overall, Likert scores should be avoided when ordering or thresholding individuals at the extremes of scales is the primary objective. However, they are preferable in situations where Thurstonian and EAP scores may fail to converge. The study also recommends that future research explore conditions involving more complex data-generating processes.
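
A minimal R sketch in the spirit of the simulation: generate two-parameter logistic responses, score respondents by raw sum scores (the "Likert" procedure) and by EAP, and compare their correlations with the generating trait. All settings are illustrative and far smaller than the study's design.

set.seed(42)
n <- 1000; p <- 20
theta <- rnorm(n)                            # true scores
a <- runif(p, 0.8, 2); b <- rnorm(p)         # discriminations, difficulties
prob <- plogis(sweep(outer(theta, b, "-"), 2, a, "*"))
X <- matrix(rbinom(n * p, 1, prob), n, p)    # simulated item responses

sum_score <- rowSums(X)                      # "Likert" (raw sum) scoring
nodes <- seq(-4, 4, length.out = 61); prior <- dnorm(nodes)
eap <- apply(X, 1, function(x) {             # EAP scoring by quadrature
  lik <- sapply(nodes, function(th) prod(dbinom(x, 1, plogis(a * (th - b)))))
  sum(nodes * lik * prior) / sum(lik * prior)
})
c(likert = cor(sum_score, theta), eap = cor(eap, theta))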


Keywords
Factor scores, Monte Carlo simulation, Psychometric theory
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:umu:diva-242190 (URN), 10.15689/ap.2025.24.e22224 (DOI), 2-s2.0-105009830365 (Scopus ID)
Available from: 2025-07-14 Created: 2025-07-14 Last updated: 2025-07-14. Bibliographically approved
Wallin, G. & Wiberg, M. (2025). Smoothing of bivariate test score distributions: model selection targeting test score equating. Journal of Educational and Behavioral Statistics
2025 (English). In: Journal of Educational and Behavioral Statistics, ISSN 1076-9986, E-ISSN 1935-1054. Article in journal (Refereed). Epub ahead of print
Abstract [en]

Observed-score test equating is a vital part of every testing program, aiming to make test scores comparable across test administrations. Central to this process is the equating function, typically estimated by composing distribution functions of the scores to be equated. An integral part of this estimation is presmoothing, where statistical models are fit to observed score frequencies to mitigate sampling variability. This study evaluates the impact of commonly used model fit indices on bivariate presmoothing model-selection accuracy in both item response theory (IRT) and non-IRT settings. It also introduces a new model-selection criterion that, in contrast to existing methods, directly targets the equating function. The study focuses on the non-equivalent groups with anchor test design, estimating bivariate score distributions based on real and simulated data. Results show that the choice of presmoothing model and model fit criterion influences the equated scores. In non-IRT contexts, a combination of the proposed model-selection criterion and the Bayesian information criterion exhibited superior performance, balancing bias and variance of the equated scores. For IRT models, high selection accuracy and minimal equating error were achieved across all scenarios.
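
For the non-IRT setting, a minimal R sketch of one ingredient above: choosing the polynomial degree of a log-linear presmoothing model by BIC. The paper's proposed criterion, which targets the equating function itself, is not reproduced; the data are simulated stand-ins.

select_degree <- function(obs, scores, degrees = 1:6) {
  bics <- sapply(degrees, function(d)
    BIC(glm(obs ~ poly(scores, d), family = poisson)))
  degrees[which.min(bics)]                   # degree minimizing BIC
}

set.seed(5)
scores <- 0:40
obs <- as.vector(table(factor(
  sample(scores, 2000, replace = TRUE, prob = dbinom(scores, 40, 0.45)),
  levels = scores)))
select_degree(obs, scores)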

Place, publisher, year, edition, pages
Sage Publications, 2025
Keywords
item response theory, log-linear models, model-selection, smoothing, test score equating
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:umu:diva-240948 (URN), 10.3102/10769986241300523 (DOI), 001388830400001 (), 2-s2.0-105007825362 (Scopus ID)
Funder
Swedish Research Council, 2020-06484; Marianne and Marcus Wallenberg Foundation, MMW
Available from: 2025-07-01 Created: 2025-07-01 Last updated: 2025-07-01
Wallmark, J. & Wiberg, M. (2025). The bit scale: a metric score scale for unidimensional item response theory models. Psychometrika
2025 (English). In: Psychometrika, ISSN 0033-3123, E-ISSN 1860-0980. Article in journal (Refereed). Epub ahead of print
Abstract [en]

In Item Response Theory (IRT), the conventional latent trait scale (θ) is inherently arbitrary, lacking a fixed unit or origin and often tied to specific population distributional assumptions (e.g., standard normal). This limits the direct comparability and interpretability of scores across different tests, populations, or model estimation methods. This paper introduces the “bit scale,” a novel metric transformation for unidimensional IRT scores derived from fundamental principles of information theory, specifically surprisal and entropy. Bit scores are anchored to the properties of the test items rather than the test-taker population. This item-based anchoring ensures the scale’s invariance to population assumptions and provides a consistent metric for comparing latent trait levels. We illustrate the utility of the bit scale through empirical examples: demonstrating consistent scoring when fitting models with different θ scale assumptions, and using anchor items to directly link scores from different test administrations. A simulation study confirms the desirable statistical properties (low bias, accurate standard errors) of Maximum Likelihood estimated bit scores and their robustness to extreme scores. The bit scale offers a theoretically grounded, interpretable, and comparable metric for reporting and analyzing IRT-based assessment results. Software implementations in R (bitscale) and Python (IRTorch) are available and practical implications are discussed.
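
For intuition, a minimal R sketch of the information-theoretic primitives named above, surprisal and entropy in bits, evaluated on a hypothetical 2PL item; the paper's full item-anchored bit-score transformation is implemented in the bitscale and IRTorch packages it references.

surprisal <- function(p) -log2(p)               # bits of surprise in an outcome
entropy   <- function(p) sum(p * surprisal(p))  # expected surprisal, in bits

entropy(c(0.5, 0.3, 0.2))                       # entropy of a 3-category item
theta <- seq(-3, 3, by = 1)
surprisal(plogis(1.2 * (theta - 0.3)))          # correct-response surprisal curve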

Place, publisher, year, edition, pages
Cambridge University Press, 2025
National Category
Probability Theory and Statistics
Identifiers
urn:nbn:se:umu:diva-248215 (URN), 10.1017/psy.2025.10071 (DOI), 41367367 (PubMedID), 2-s2.0-105025666005 (Scopus ID)
Funder
Marianne and Marcus Wallenberg Foundation, 2019-0129; Swedish Research Council, 2022-02046
Available from: 2026-01-09 Created: 2026-01-09 Last updated: 2026-01-09
Projects
Analysis and modelling of Swedish students' performance in TIMSS and PISA in an international perspective [2008-04027_VR]; Umeå University
New statistical methods to ascertain high quality over time in standardized achievement tests [2014-00578_VR]; Umeå University
Innovative methods to compare standardized achievement tests over time [2019-03493_VR]; Umeå University
Identifiers
ORCID iD: orcid.org/0000-0001-5549-8262
