Item response theory (IRT) models the relationship between the possible scores on a test item against a test taker’s attainment of the latent trait that the item is intended to measure. In this study, we compare two models for tests with polytomously scored items: the optimal scoring (OS) model, a nonparametric IRT model based on the principles of information theory, and the generalized partial credit (GPC) model, a widely used parametric alternative. We evaluate these models using both simulated and real test data. In the real data examples, the OS model demonstrates superior model fit compared to the GPC model across all analyzed datasets. In our simulation study, the OS model outperforms the GPC model in terms of bias, but at the cost of larger standard errors for the probabilities along the estimated item response functions. Furthermore, we illustrate how surprisal arc length, an IRT scale invariant measure of ability with metric properties, can be used to put scores from vastly different types of IRT models on a common scale. We also demonstrate how arc length can be a viable alternative to sum scores for scoring test takers.