
Open Access 09-12-2024 | Original Research

Old but Gold or New and Shiny? Comparing Tree Ensembles for Ordinal Prediction with a Classic Parametric Approach

Authors: Philip Buczak, Daniel Horn, Markus Pauly

Published in: Journal of Classification


Abstract

Ordinal data are frequently encountered, e.g., in the life and social sciences. Predicting ordinal outcomes can inform important decisions, e.g., in medicine or education. Two methodological streams tackle the prediction of ordinal outcomes: traditional parametric models, e.g., the proportional odds model (POM), and machine learning-based tree ensemble (TE) methods. A promising TE approach involves selecting the best performing set from several randomly generated sets of numeric scores assigned to the ordinal response categories (ordinal forest; Hornung, 2019). We propose a new method, the ordinal score optimization algorithm, that takes a similar approach but selects scores through non-linear optimization. We compare these and other TE methods with the computationally much less expensive POM. Despite selective efforts, the literature lacks an encompassing simulation-based comparison. Aiming to fill this gap, we find that while TE approaches outperform the POM for strong non-linear effects, the latter is competitive for small sample sizes even under medium non-linear effects.
Notes

Supplementary Information

The online version contains supplementary material available at https://doi.org/10.1007/s00357-024-09497-9.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

Ordinal response data are often encountered in biomedical and psychological applications, e.g., when assessing a person's health status, ratings of a set of choices, or agreement with given statements. Despite the common assignment of numeric values, ordinal responses are categorical variables whose categories are not necessarily equidistant but (in contrast to nominal responses) carry a natural order. Traditionally, ordinal responses have been modeled through parametric models such as the cumulative model (particularly through its special case, the proportional odds model). For a general overview of the cumulative and further parametric models, we refer to Tutz (2022).
Apart from parametric models, there has also been an increasing use of non-parametric machine learning (ML) methods based on recursive partitioning, e.g., classification and regression trees (CART; Breiman et al., 1984) as well as ensemble methods such as random forest (RF; Breiman, 2001). While these methods do not inherently account for the ordinal nature of the response, several variations of trees and RF tailored towards ordinal prediction have been proposed. Piccarreta (2007) extended the Gini-Simpson split criterion to the ordinal case, while Archer (2010) and Galimberti et al. (2012) used the generalized Gini impurity with misclassification costs to incentivize respecting the ordered structure of the response. Buri and Hothorn (2020) proposed model-based RFs for detecting changes in proportional odds. Several further approaches are based on assigning numeric scores to the ordinal response categories, e.g., Kramer et al. (2000) used regression trees based on the numeric scores, while Janitza et al. (2016) used conditional inference forest (CF; Hothorn et al., 2006). Despite also using numeric scores, Hornung (2019) avoids pre-specification of the scores in their ordinal forest (OF) through a two-stage approach based on regression RF, in which the numeric scores to be assigned are first optimized w.r.t. the predictive performance achieved when using them.
Another possibility of avoiding the predicament of score assignment is re-formulating the ordinal prediction task as a series of binary prediction tasks and aggregating the predictions of the individual binary models into a combined prediction for the ordinal prediction task (Frank & Hall, 2001). Similarly, Tutz (2021) proposed split-based ordinal RF (RFSp), a score-free framework for using classifiers such as RF in a binarized fashion aimed at resembling parametric models.
As such, one can identify two streams of methodology for ordinal classification: traditional parametric models, such as the cumulative model, and the adaptation of modern tree ensemble (TE) methods that have displayed high predictive power in other application contexts (see, e.g., Grinsztajn et al., 2022), but are also characterized by an increased computational complexity. While the recent literature on ordinal classification has focused increasingly on the latter stream, parametric models have received less attention and are often overlooked when evaluating ML-based prediction approaches (Tutz, 2021). This shortcoming in the literature impedes the development of recommendations for researchers and practitioners as to which method may be preferable for a given prediction application. While the works mentioned above benchmarked their methods to some capacity, there is a lack of an encompassing simulation-based comparison of the ordinal TE methods with parametric methods in different scenarios to assess under which circumstances using TE methods over a parametric model leads to improved predictive performance and, as such, is worth the increased computational complexity. Furthermore, the current literature offers no extensive comparison of the individual TE methods that would help obtain more general guidelines among the set of TE methods. In particular, it would be of interest to study whether employing a computationally demanding score optimization procedure as in OF justifies its computational cost when compared to other TE methods such as regression RFs with non-optimized scores, classification RF, CF, and RFSp.
The aspects above have only been partially covered in the existing literature on ordinal classification. Janitza et al. (2016) only compared score-based CF with a regular classification RF. They did not find notable performance differences in their study with simulated and real datasets. Furthermore, they also did not identify a difference regarding the choice of numeric scores. However, they only compared the default scores \(1, 2, \dots , k\) for k categories with the squared scores \(1^2, 2^2, \dots , k^2\). Hornung (2019) compared their OF with a naive OF variant using only default scores, classical RF, and a cumulative model (with probit link) on five real datasets. The parametric model performed best for one dataset, while it lagged severely behind for two datasets and was competitive for the remaining two. Regarding the TE methods, for two out of the five datasets, OF could outperform naive OF and classical RF, while for the other three datasets, the three TE methods performed similarly. In an additional simulation study, Hornung (2019) further compared OF, naive OF, and RF, however, without including a parametric model. Regarding the benefit of optimizing scores, they found that OF could improve the most over naive OF in scenarios where the middle categories were distinctly more populated than the margin categories. The most complete comparison, to our knowledge, was performed by Tutz (2021), who compared parametric models, RFSp, OF, and CF, as well as ensembles composed of combinations of these methods. In that study, the author found that performance gaps usually only occurred between the group of parametric models and the group of TE methods, while the different TE approaches performed similarly when compared only among themselves. However, these comparisons were only conducted on real datasets, where one cannot directly control and manipulate the data generating processes. This makes deriving more general recommendations difficult because the actual effect structure present in the data is unknown. As such, it is generally recommended to use both simulated and real data for evaluating and comparing different methods (see, e.g., Friedrich & Friede, 2024).
We contribute to the literature above in two ways. First, we perform an extensive simulation study including differing data generating processes with increasingly non-linear effects and varying class distribution patterns to obtain recommendations for researchers and practitioners as to when a parametric model or TE methods are preferable. Further, we study whether optimizing numeric scores within an RF framework is worth the associated computational burden. To our knowledge, OF is currently the only method that optimizes its numeric scores. However, its optimization procedure separates the generation of score sets from their evaluation, i.e., all candidate score sets are first generated and evaluated afterwards. This means that the optimization procedure cannot react to the performance of any given score set, whereas iterative optimization approaches could take the performance of previously evaluated candidate score sets into account and specifically focus on exploring promising regions. To address this, we additionally add to the literature on ordinal classification with score-based RFs by proposing the ordinal score optimization algorithm (OSOA). Similar to OF, OSOA optimizes the numeric scores of the ordinal categories. However, it aims to enhance the optimization procedure of OF by employing a non-linear optimization algorithm based on the popular Nelder-Mead method (Nelder & Mead, 1965).
After introducing the investigated methods in Sect. 2, we describe the setup of our simulation study as well as present our results in Sect. 3. Further, we analyze the performance of all methods for real data examples in Sect. 4. We close with a discussion of our findings, potential avenues for future research, and a set of practical recommendations regarding the different methods in Sect. 5.

2 Methods

In the following, we consider an ordinal classification problem where the aim is to predict a response Y with ordinal categories \(1, 2, \dots , k\). We assume that a dataset with n observations and p covariates is available that further contains the ground truth. We will now present the methods considered for our comparison as well as introduce our newly proposed OSOA. All methods are implemented in R (R Core Team, 2023). We will refer to their individual implementations in the respective method description.

2.1 Cumulative Model

The cumulative model (McCullagh, 1980) is a parametric model which assumes that the observed ordinal response is the manifestation of an underlying continuous latent variable that can only be observed via thresholds defining the response categories. It models the probability \(P(Y \le r | \varvec{x})\) that the ordinal outcome variable Y for an observation with covariate vector \(\varvec{x}\) takes at most category \(r = 1, \dots , k\) as
$$ P(Y \le r | \varvec{x}) = F\left( \gamma _r + \varvec{x}^{\top } \varvec{\beta }\right) , $$
where \(-\infty < \gamma_1 < \dots < \gamma_k = \infty\) denote the thresholds and F is a strictly increasing distribution function (Tutz, 2022). A common choice for F is the logistic function, resulting in the proportional odds model (Tutz, 2022), i.e.,
$$\begin{aligned} P(Y \le r | \varvec{x}) = \frac{\exp (\gamma _r + \varvec{x}^{\top } \varvec{\beta })}{1 + \exp (\gamma _r + \varvec{x}^{\top } \varvec{\beta })} \end{aligned}$$
(1)
which will also be used in this work. For our simulation, we fitted the proportional odds model using the clm function from the ordinal package (Christensen, 2022). As such, we will refer to proportional odds models in the context of ordinal prediction as CLM (cumulative link model) in the remainder of this work.
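For illustration, a minimal usage sketch with clm follows; the training data frame df (with ordered factor response y) and the test covariates newx are hypothetical placeholders, not objects from the study.

```r
library(ordinal)

# Hedged usage sketch: `df` (training data with ordered factor response y)
# and `newx` (test covariates without the response column) are placeholders.
fit <- clm(y ~ ., data = df, link = "logit")
summary(fit)  # estimated thresholds gamma_r and coefficients beta

# Predicted category probabilities and class labels for unseen data
probs <- predict(fit, newdata = newx, type = "prob")$fit
preds <- predict(fit, newdata = newx, type = "class")$fit
```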

2.2 Random Forest

RF (Breiman, 2001) is an ML ensemble method combining a multitude of classification or regression trees into an aggregated model. In this work, RF was used as a standalone method for multi-class classification as well as a building block of further methods introduced below such as OF, OSOA, and RFSp. Popular implementations of RF in R include the ranger package (Wright & Ziegler, 2017) and the randomForest package (Liaw & Wiener, 2002).
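A brief hedged usage sketch with ranger, using the same placeholder names as above:

```r
library(ranger)

# Hedged usage sketch: a factor-valued response makes ranger() grow
# classification trees; `df` and `newx` are placeholders as above.
fit   <- ranger(y ~ ., data = df, num.trees = 500)
preds <- predict(fit, data = newx)$predictions
```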

2.3 Conditional Inference Forest

In contrast to RF, CF relies on conditional inference trees (CTs; Hothorn et al., 2006) as its base component. CTs determine splits by performing permutation tests of the association between the outcome variable and a given covariate. The covariate with the strongest association is selected as the split variable, and the concrete split value is computed in a second step. Through their conditional inference framework based on permutation tests, CTs support nominal, ordinal, and metric response types. For ordinal outcomes, numeric scores are mapped to the ordinal categories; by default, the scores \(1, 2, \dots , k\) are used. CFs have been investigated for use in ordinal classification in detail by Janitza et al. (2016). They are implemented in the partykit package (Hothorn & Zeileis, 2015).
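A hedged usage sketch with partykit, again with placeholder data names:

```r
library(partykit)

# Hedged usage sketch: with an ordered factor response, cforest() applies
# the conditional inference framework to the ordinal outcome (mapping the
# default scores 1, ..., k); `df` and `newx` are placeholders as above.
fit   <- cforest(y ~ ., data = df, ntree = 500)
preds <- predict(fit, newdata = newx, type = "response")
```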

2.4 Split-Based Ordinal Random Forest

RFSp (Tutz, 2021) is based on reformulating the ordinal response problem as considered in the cumulative model into a series of binary response models that hold simultaneously. To this end, \(k-1\) binary classification RFs are trained where each aims to classify whether observations belong to categories \(1, \dots , r-1\) or to categories \(r, \dots , k\), respectively, with \(r = 2, \dots , k\). The cumulative probabilities are then computed by aggregating the probabilities obtained from the individual RF models. This allows for using RF while following the logic of cumulative models and without having to rely on numeric scores. For our simulations, we used the implementation from https://github.com/GerhardTutz/ScoreFreeTrees. There, the individual RF models are trained using the RF implementation from randomForest (Liaw & Wiener, 2002).
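The binarization logic can be sketched as follows. This is a simplified illustration of the split-based idea with function names of our own choosing, not the cited implementation.

```r
library(randomForest)

# Simplified sketch of the split-based idea: k-1 binary RFs estimate
# P(Y >= r | x) for r = 2, ..., k (function names are ours).
fitRFSp <- function(x, y, k, ntree = 500) {
  lapply(2:k, function(r) {
    bin <- factor(as.integer(as.integer(y) >= r))
    randomForest(x = x, y = bin, ntree = ntree)
  })
}

predictRFSp <- function(models, newx, k) {
  # P(Y >= r), padded with P(Y >= 1) = 1 and P(Y >= k+1) = 0
  pge <- cbind(1, sapply(models, function(m)
    predict(m, newx, type = "prob")[, "1"]), 0)
  # Class probabilities as differences of adjacent cumulative terms;
  # monotonicity across r is not enforced in this simplified sketch
  probs <- pge[, 1:k, drop = FALSE] - pge[, 2:(k + 1), drop = FALSE]
  max.col(probs)  # predicted category 1, ..., k
}
```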

2.5 Ordinal Forest

OF (Hornung, 2019) is a score-based RF method for ordinal prediction. Though it internally relies on numeric scores for the regression forest, the scores for the ordinal response categories do not have to be specified beforehand, as the method tries to find optimal scores in a preliminary optimization step. To this end, OF generates partitions of the [0, 1] interval into k sub-intervals corresponding to the ordinal categories. The associated numeric scores are in turn determined as the midpoints of the respective sub-intervals and are used to fit a regression RF (using the implementation from ranger; Wright & Ziegler, 2017). Prediction of unseen data is achieved by obtaining the numeric predictions from the RF fit and checking into which class they fall based on the respective borders of the class intervals. The different partitions are evaluated w.r.t. the out-of-bag (OOB) performance achieved when using them. While the OF implementation in the ordinalForest package (Hornung, 2022) offers a choice of performance measures, a balanced version of Youden's Index J (Youden, 1950) is used by default, where for a binary classification task
$$ J = \frac{\text {TP}}{\text {TP} + \text {FN}} + \frac{\text {TN}}{\text {TN} + \text {FP}} - 1 = \text {sensitivity} + \text {specificity} - 1. $$
Here, TP denotes the number of true positives, TN the number of true negatives, FP the number of false positives, and FN the number of false negatives. In the balanced case, Youden's J is computed for each class and aggregated as a simple average; thus, all classes have the same weight irrespective of their individual sizes. The best performing score sets are combined into a single final score set by averaging, which is then used to fit the final RF model. For studying the benefit of score optimization, we have also included a naive OF variant in our simulations that fits a regression forest to the default scores \(1, 2, \dots , k\) with class borders \(0.5, 1.5, \dots , \frac{2k+1}{2}\).
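A minimal sketch of this balanced aggregation in base R (the function name is ours, not the package's):

```r
# Minimal sketch of the balanced Youden's J: J is computed one-vs-rest
# for each class r and averaged with equal weights (function name is ours).
balancedYouden <- function(pred, truth, k) {
  mean(sapply(seq_len(k), function(r) {
    tp <- sum(pred == r & truth == r)
    fn <- sum(pred != r & truth == r)
    tn <- sum(pred != r & truth != r)
    fp <- sum(pred == r & truth != r)
    tp / (tp + fn) + tn / (tn + fp) - 1  # sensitivity + specificity - 1
  }))
}
```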

2.6 Ordinal Score Optimization Algorithm

Similar to OF, OSOA also assumes that the observed k response categories are a coarsened version of an underlying numeric variable. To approximate this latent variable, OSOA follows OF in partitioning the [0, 1] interval into class-specific sub-intervals \([b_r, b_{r+1}), r = 1, \dots , k,\) where each sub-interval is defined by its class borders \(b_r\) and \(b_{r+1}\) with \(0 = b_1 < b_2 < \dots < b_{k+1} = 1\). Each ordinal response category is represented by the midpoint of its respective sub-interval, i.e., by the numeric score \(s_r = \frac{b_r + b_{r+1}}{2}, r = 1, \dots , k\). As in OF, these numeric scores are used to train a regression RF. To determine optimal choices for the class borders and thus for the class scores, OSOA follows OF in performing an optimization procedure, but employs a different optimization approach. Because OF relies on pre-generating partitions of the [0, 1] interval and then assessing their optimality, the procedure cannot react to the performance of specific partitions during the optimization process. As such, it cannot iteratively explore the space of possible partitions and focus on promising regions. To address this shortcoming, we propose OSOA, which uses a non-linear optimization algorithm for finding its class borders and scores. The method is described in pseudocode in Algorithm 1. The general idea is to optimize the helper function \(\texttt{evaluateBorders}\) (Algorithm 2) that takes a set of inner class borders \(b_2, \dots , b_k\) (the outer borders \(b_1\) and \(b_{k+1}\) are fixed as 0 and 1, respectively) as input and returns the OOB performance achieved with them. Internally, evaluateBorders derives the numeric scores for the response categories from the provided class borders and fits a regression RF using the numeric scores as the target variable and the corresponding covariates as predictors. As in OF, the numeric scores are first transformed with the quantile function of the standard normal distribution \(\Phi^{-1}\). For fitting RFs, we use the implementation from the ranger package (Wright & Ziegler, 2017). From the RF fit, OOB predictions are obtained and in turn converted into class labels by using the transformed class borders. Finally, the predicted class labels derived from the OOB predictions can be compared with the true class labels to compute the balanced version of Youden's J used in OF. The evaluateBorders function can be optimized using any derivative-free non-linear optimization algorithm. In this work, we have used the Sbplx algorithm from the NLopt library (Johnson, 2007), which is based on the Subplex algorithm by Rowan (1990), a variant of the Nelder-Mead algorithm (Nelder & Mead, 1965). Since our algorithm follows Hornung (2019) in determining an optimal partition of the [0, 1] interval, we also restrict candidate class borders during the optimization through a lower bound of 0 and an upper bound of 1. As the class borders relate to the ordinal categories, they need to be sorted such that they match the order of the original categories. Hence, only sorted borders should be considered. This can either be enforced through inequality constraints, if they are supported by the given optimizer, or by disincentivizing unsorted solutions via penalization in the evaluation step. As starting values for the optimization, we use \(\frac{1}{k}, \dots , \frac{k-1}{k}\), i.e., a partition with equally wide class intervals.
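A hedged sketch of the evaluateBorders helper under the assumptions stated above follows; it reuses the balancedYouden() sketch from Sect. 2.5, and all names are illustrative rather than the authors' exact implementation.

```r
library(ranger)

# Hedged sketch of evaluateBorders (cf. Algorithm 2): `x` (covariate data
# frame) and `y` (ordinal factor with k levels) are placeholders.
evaluateBorders <- function(inner_borders, x, y, k) {
  if (is.unsorted(inner_borders)) return(-1)  # penalize unsorted proposals
  borders <- c(0, inner_borders, 1)
  scores  <- (head(borders, -1) + tail(borders, -1)) / 2  # interval midpoints
  z       <- qnorm(scores)             # transform scores with Phi^{-1}
  fit <- ranger(y = z[as.integer(y)], x = x, num.trees = 500)
  # Convert OOB predictions back to class labels via the transformed borders
  cls <- cut(fit$predictions, breaks = qnorm(borders),
             labels = FALSE, include.lowest = TRUE)
  balancedYouden(cls, as.integer(y), k)  # OOB performance
}
```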
The optimizer will run until a pre-specified termination condition is fulfilled, such as reaching a maximum number of function evaluations max.eval or failing to exceed a minimum performance improvement \(\varepsilon \). For our simulations, we set max.eval = 300 and \(\varepsilon = 1 \times 10^{-4}\). Smaller values for \(\varepsilon \) would allow for finding finer differences but negatively impact the runtime, while larger values for \(\varepsilon \) would speed up the optimization process but lead to a potentially less precise result. Optimal settings for \(\varepsilon \) and max.eval depend on the given application context and should be chosen w.r.t. the desired precision and the computational power available. The values selected here were chosen for showcasing the method and were not optimized further. Once the optimization algorithm determines a solution, the respective scores are used to fit the final RF model, and both the model and the final borders are returned. For unseen data, predicted (numeric) values can be obtained from the model's individual trees and converted into class labels using the (transformed) borders. The overall class prediction for a given observation can then be determined by majority voting. This prediction procedure is identical to the procedure employed in OF (cf. Hornung, 2022).
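The optimization step itself might then be sketched as follows, assuming the sbplx wrapper from the nloptr package; the control settings mirror those reported above.

```r
library(nloptr)

# Hedged sketch of the OSOA optimization loop: starting from equally wide
# class intervals, Sbplx searches over the k-1 inner borders; sbplx()
# minimizes, hence the negated objective.
k   <- 5
x0  <- seq_len(k - 1) / k  # starting values 1/k, ..., (k-1)/k
res <- sbplx(x0, function(b) -evaluateBorders(b, x, y, k),
             lower = rep(0, k - 1), upper = rep(1, k - 1),
             control = list(maxeval = 300, ftol_abs = 1e-4))
final_borders <- c(0, res$par, 1)  # used to fit and apply the final RF
```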

3 Simulation Study

3.1 Simulation Setup

To compare the predictive performance of the different prediction methods, we performed a simulation study in which we generated datasets of varying characteristics. All data were generated from proportional odds models, as this is a common choice for simulating ordinal data (e.g., Janitza et al., 2016; Hornung, 2019) and aligns with the ordinal prediction methods compared in this work, most of which implicitly or explicitly assume proportional odds settings. We varied the number of observations, n, between 250, 750, and 2500. As many ordinal data applications originate from (bio-)medicine, psychology, or the social sciences, we aimed for a mix of medium sample sizes to create realistic scenarios (see, e.g., Shen et al., 2011, for a review of sample sizes in psychology). The number of covariates, p, was either 10 or 35. While the former setting reflects common application scenarios (cf. the real data examples studied further below), the latter setting represents a compromise between including a higher number of covariates and not putting the CLM, which is known to suffer from high dimensionality (Zahid & Tutz, 2013), at too much of a disadvantage. We would expect values such as 35 to be more commonly encountered when analyzing, e.g., large-scale assessment studies (see, e.g., Immekus et al., 2022). As increasing the number of covariates to 35 only introduced further noise variables, the two settings also distinguished between a setting with a high signal-to-noise ratio (SNR) and a setting with a low SNR. The number of categories k was varied between 3, 5, and 7, as these commonly occur, e.g., in questionnaires using Likert scale items. Out of the p covariates generated, seven had an effect on the outcome. The influential covariates consisted of five normally distributed variables \(X_1, \dots , X_5 \sim \mathcal {N}(0,1)\) and two binary variables \(X_6, X_7 \sim \text {Bin}(1, 0.5)\). The remaining \(p-7\) covariates were normally distributed noise, i.e., \(X_8, \dots , X_p \sim \mathcal {N}(0,1)\). All covariates were simulated as uncorrelated. This design choice was made to limit the scope of the simulation study, in which we placed more focus on different effect structures and category response distributions as explained below. Our simulation design was loosely inspired by Janitza et al. (2016). However, apart from the normally distributed covariates, we additionally included binary covariates as well as non-linear effects, whereas Janitza et al. (2016) only studied linear effects. We simulated the outcome using the following three different data generating processes (DGPs):
$$\begin{aligned} \text {DGP 1:} \ &\varvec{x}^{\top } \varvec{\beta } = 3x_1 + x_2 + 2x_3 + x_4 + x_5 + x_6 + x_7,\\ \text {DGP 2:} \ &\varvec{x}^{\top } \varvec{\beta } = 3x_1 + x_2 + 2x_3 + {\left\{ \begin{array}{ll} 3, & x_4 \in (-1, 1]\\ -1, & x_4 \notin (-1, 1] \end{array}\right. } + 2 \times \mathbbm {1}_{x_3 \le 0.5 \wedge x_5 > 0.5} + x_6 + x_7,\\ \text {DGP 3:} \ &\varvec{x}^{\top } \varvec{\beta } = 3x_1 + x_2 + 2x_3 + x_3^2 + {\left\{ \begin{array}{ll} 3, & x_4 \in (-1, 1]\\ -1, & x_4 \notin (-1, 1] \end{array}\right. } + 2 \times \mathbbm {1}_{x_3 \le 0.5 \wedge x_5 > 0.5} + x_6 + x_7. \end{aligned}$$
The three DGPs were characterized by an increasing amount of non-linear effects. While for DGP 1 all effects were linear, DGP 2 replaced two linear effects by non-linear effects. DGP 3 added an additional quadratic effect. Introducing an increasing amount of non-linear effects allowed for investigating up to which point the parametric model still sufficed compared to the TE methods and at which point the TE methods started to outperform the parametric model and should, hence, be preferred. Apart from varying the DGPs, we followed Hornung (2019) in simulating the data according to different class distribution patterns, as the author found this to impact the methods studied. To extend the approach of using different class distributions to more ordinal classification methods as well as to create more diverse scenarios (Hornung, 2019, originally used equally distributed and randomly distributed classes), we included three different class distribution patterns in our simulation: first, a pattern where the classes were distributed approximately equally; second, a pattern where the middle categories were more populated than the margin categories; third, a pattern where the margin categories were more populated than the middle categories. Table 1 contains an overview of the relative frequencies per class that we targeted for the different class distribution patterns. They were derived by attaching linear weights to the categories depending on the distribution pattern, e.g., for the pattern "wide middle" with five categories, the weights for the categories were 1, 2, 3, 2, and 1. After dividing by the sum of all weights, one arrives at 0.11, 0.22, 0.33, 0.22, and 0.11. The class distributions were obtained by selecting specific values for the thresholds \(\gamma _1, \dots , \gamma _k\) that approximately resulted in the intended relative frequencies for each combination of DGP, number of categories, and class distribution pattern in a dataset of size 100 000. Since this was a heuristic approach, the true class probabilities did not necessarily match the values in Table 1 exactly, hence the term "targeted relative frequencies." The threshold values chosen for each scenario are listed in Table 3 in Appendix A. Plugging the simulated linear predictor values and the respective thresholds into Eq. 1, the cumulative probabilities were computed, transformed into class probabilities, and used for generating class labels from a multinomial distribution.
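For concreteness, a hedged sketch of one simulation draw for DGP 1 is given below; the threshold vector gam stands in for the values listed in Table 3 (Appendix A).

```r
# Hedged sketch of one simulation draw for DGP 1: covariates as described
# above, labels drawn via Eq. 1; `gam` (length k-1, increasing) is an
# illustrative placeholder for the thresholds in Table 3 (Appendix A).
simulateDGP1 <- function(n, p = 10, k = 5, gam) {
  x <- cbind(matrix(rnorm(n * 5), n, 5),            # X1, ..., X5 ~ N(0,1)
             matrix(rbinom(n * 2, 1, 0.5), n, 2),   # X6, X7 ~ Bin(1, 0.5)
             matrix(rnorm(n * (p - 7)), n, p - 7))  # noise covariates
  eta  <- 3 * x[, 1] + x[, 2] + 2 * x[, 3] + x[, 4] + x[, 5] + x[, 6] + x[, 7]
  cum  <- sapply(gam, function(g) plogis(g + eta))  # P(Y <= r | x), r < k
  prob <- cbind(cum, 1) - cbind(0, cum)             # class probabilities
  yy   <- apply(prob, 1, function(pr) sample.int(k, 1, prob = pr))
  data.frame(y = factor(yy, ordered = TRUE), x)
}
```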
Table 1
Targeted relative frequencies \(\pi_r, r = 1, \dots, k\) w.r.t. distribution pattern and number of categories k

Pattern       k   \(\pi_1\)  \(\pi_2\)  \(\pi_3\)  \(\pi_4\)  \(\pi_5\)  \(\pi_6\)  \(\pi_7\)
Equal         3   0.33   0.33   0.33
              5   0.20   0.20   0.20   0.20   0.20
              7   0.14   0.14   0.14   0.14   0.14   0.14   0.14
Wide middle   3   0.25   0.50   0.25
              5   0.11   0.22   0.33   0.22   0.11
              7   0.06   0.13   0.19   0.25   0.19   0.13   0.06
Wide margins  3   0.40   0.20   0.40
              5   0.27   0.18   0.09   0.18   0.27
              7   0.21   0.16   0.11   0.05   0.11   0.16   0.21
Fully crossing the settings of DGP, class distribution pattern, and numbers of observations, covariates, and categories resulted in 162 conditions for evaluating the seven methods considered here. We performed 1000 replications per condition. To evaluate the classification performance, we split each generated dataset into a training set containing \(\frac{2}{3}\) of the observations for fitting the model and a test set with the remaining \(\frac{1}{3}\) of the observations for validating the model. The data partitions were determined by class-stratified sampling. As performance measures, we used the weighted Kappa coefficient \(\kappa _w\) (Cohen, 1968) with linear and quadratic weights as well as Kendall's \(\tau _B\) (Kendall, 1945) and Spearman's tie-corrected \(\rho \) (Kendall, 1948). All of these measures are specifically suited for assessing ordinal predictions. Cohen's weighted Kappa with linear and quadratic weights has frequently been used for evaluating ordinal classification performance (see, e.g., Hornung, 2019; Ben-David, 2008). It is given by
$$\begin{aligned} \kappa _w = \frac{\sum \limits _{r=1}^k\sum \limits _{s=1}^k w_{rs}p^o_{rs} - \sum \limits _{r=1}^k\sum \limits _{s=1}^k w_{rs}p^c_{rs}}{1 - \sum \limits _{r=1}^k\sum \limits _{s=1}^k w_{rs}p^c_{rs}}, \end{aligned}$$
where \(p^o_{rs}\) is the observed proportion of instances for which r is the true category and s the predicted category, while \(p^c_{rs}\) is the analogous proportion that is expected by chance (Cohen, 1968). The respective weights are denoted by \(w_{rs}\), where \(w^{\text {lin}}_{rs} = 1 - \frac{|r - s|}{k-1}\) is chosen for linear weights and \(w^{\text {quad}}_{rs} = 1 - \frac{|r - s|^2}{(k-1)^2}\) for quadratic weights. The different weights represent different strategies for penalizing the distances between predicted and true categories. For linear weights, instances where the predicted categories are equal or close to the true categories are associated with higher weights. For quadratic weights, relatively more weight is attributed to predictions further away from the true category and less weight to predictions close to the true category as compared to linear weights (Hornung, 2019).
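A hedged base R sketch of this measure (the function name is ours):

```r
# Hedged sketch of the weighted Kappa: observed proportions p^o from the
# confusion table, chance proportions p^c from its marginals, combined as
# in the formula above (function name is ours).
weightedKappa <- function(pred, truth, k, quadratic = FALSE) {
  po <- table(factor(truth, levels = 1:k),
              factor(pred,  levels = 1:k)) / length(truth)
  pc <- outer(rowSums(po), colSums(po))  # proportions expected by chance
  d  <- abs(outer(1:k, 1:k, "-"))
  w  <- if (quadratic) 1 - d^2 / (k - 1)^2 else 1 - d / (k - 1)
  (sum(w * po) - sum(w * pc)) / (1 - sum(w * pc))
}
```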
To limit the computational burden of the simulation study, we did not perform hyperparameter tuning for the RF-based methods and used default values instead, as RF has been shown to be relatively robust regarding parameter choices (Probst et al., 2019). This decision is in line with previous works from the field (Janitza et al., 2016; Hornung, 2019; Tutz, 2021). For consistency, we set the number of trees to 500 for all RF methods, as this is a common default, e.g., in the ranger and randomForest packages, thereby overriding the ordinalForest package's default of 5000 trees for the final forest. In all other cases, the default values remained unchanged.

3.2 Simulation Results

In the following, the results from the simulation study will be presented. From the 1 134 000 simulation runs (across all 162 conditions, 7 prediction methods, and 1000 replications), we had to exclude 138 runs that led to non-computable correlation values. All of these runs occurred for \(n=250\) when the wide middle class pattern was used. In 135 of these runs, CF was used as the learner, while naive OF (referred to as nOF in the following result figures) was used in the remaining three. The correlation values were not computable because the model predicted the same class for all observations from the test set, resulting in zero variability. As such, these runs were excluded from the analysis. Since for all DGPs and performance measures the findings were consistent across all numbers of categories considered, we only show the results for \(k = 5\) categories. The results for \(k = 3\) and \(k = 7\) categories are included in the Online Supplement. For the same reason, we only show results for the weighted Kappa with linear weights and the Kendall rank correlation scores, as the weighted Kappa with quadratic weights and the Spearman rank correlation scores led to similar findings, respectively. Results for the latter two performance measures are also included in the Online Supplement.

3.2.1 Results for DGP 1

Figure 1 shows the linearly weighted Kappa values the seven methods achieved for DGP 1, which only included linear effects. It can be seen that the CLM outperformed the TE methods in all scenarios. With an increasing number of observations, the RF-based learners and CF were able to catch up slightly, but still lagged behind the CLM notably. Overall, increasing the number of observations reduced the variability of the Kappa values. An increase in the number of variables led to a performance decrease that seems to have affected the TE methods more than the CLM. When only comparing the TE learners, their performance was mostly similar. For class distributions of equal size and wide margins, only minor differences between the RF-based learners and CF emerged. Only when the class distribution was characterized by a wide middle did the performances become more discernible. For this pattern, RFSp mostly performed best among the TE methods. Its performance advantage diminished, however, with an increasing number of observations. For \(n = 2500\) and \(p=10\), the remaining TE methods achieved similar Kappa values again. On the other hand, CF and classical RF lagged behind for most of the wide middle scenarios. Among OF, naive OF, and OSOA, OF and OSOA performed similarly, while naive OF generally achieved slightly lower Kappa values. For the other two class distribution patterns, however, naive OF was on par with and even slightly ahead of OF in some scenarios.
Looking at the Kendall rank correlation scores for DGP 1 (Fig. 6 in Appendix B), similar findings resulted. The CLM achieved the highest correlation scores in all scenarios. The TE methods performed mostly similarly when classes were distributed equally or with wide margins. When the middle categories were more populated than the margin categories, more differences between the TE methods became visible, especially when the number of observations was 250 or 750 and the number of variables was high. In these cases, RFSp performed best among the TE learners and CF. Compared to using the weighted Kappa with linear weights, CF did not lag behind as much in the wide middle scenarios and was mostly on par with the other TE methods. Further, OF could not achieve noticeable gains over naive OF. The two OF methods, along with OSOA, performed mostly similarly in all scenarios.

3.2.2 Results for DGP 2

While DGP 1 only included linear effects, DGP 2 replaced some of these linear effects with non-linear effects. Figure 2 shows the linearly weighted Kappa values achieved by the seven learners for DGP 2. Compared to DGP 1, the performance of the learners was more similar and no longer as dominated by the CLM. While for low observation counts the CLM was still ahead of the other learners, the differences were not as strong. For an increasing number of observations, the CLM fell slightly behind most TE methods when the number of variables was low. When the number of variables was high, the CLM benefitted from the performance loss suffered by the TE learners. While the results for the equal and wide margins class distribution patterns display similar trends, the wide middle pattern reveals more differences between the learners, especially for a high number of variables. RFSp performed consistently well in all wide middle scenarios (even under a high number of variables). The CLM was competitive in cases where the number of observations was low or the number of variables was high. OF performed well for higher numbers of observations, slightly but consistently outperforming naive OF in the wide middle scenarios. On the other hand, naive OF performed slightly better than OF for the other two class distribution patterns. For equally distributed classes, naive OF was among the best performing methods for observation counts greater than 250. OSOA was close in performance to OF in all scenarios. While the differences between the RF-based methods were overall rather subtle, the performance of CF fell off in a number of scenarios, especially for class distributions with a wide middle.
Figure 7 (Appendix B), showing the Kendall rank correlation scores achieved by the seven learners, echoes the findings for the weighted Kappa with linear weights. For the most part, the CLM and all RF-based learners performed similarly. Using CF usually led to the lowest correlation scores. In spite of the existence of non-linear effects, the CLM was competitive in cases where observation counts were 250 or 750. For \(n = 2500\), the CLM fell behind for all class distribution patterns in the case of \(p=10\) covariates. For \(p = 35\), the CLM was competitive again since it did not suffer as much from the increase in dimensionality as the TE methods. Overall, the differences between the learners were even less pronounced than when using the weighted Kappa with linear weights as the performance measure. Regarding OF and naive OF, OF could only outperform naive OF in the wide middle scenario with 250 observations and 35 covariates. In the remaining scenarios, naive OF was either on par with OF or very slightly ahead, as in the case of classes that had wide margins or were distributed equally. OSOA matched the performance of the former two methods in all scenarios without deviating notably in any direction.

3.2.3 Results for DGP 3

Figure 3 shows the performance of the seven learners as measured by the weighted Kappa with linear weights for DGP 3 where an additional quadratic effect was introduced. It can be seen that while the CLM was still competitive for small sample sizes (\(n = 250\)), it fell slightly behind all RF-based learners for 750 observations and even more for 2500 observations, especially when the number of covariates was low. However, it was still ahead of CF in most scenarios. As before, the results for equally distributed classes and classes with wide margins were similar, while the scenarios with a wide middle revealed more differences between the learners. For the latter class pattern, RFSp, OF, and OSOA were frequently among the best performing methods. As for the two other data generating processes, OF and OSOA could improve upon naive OF for class distributions with a wide middle, while naive OF performed similarly or (slightly) better for classes that were populated equally or more strongly in the margin categories. For equally distributed classes, naive OF was one of the best performing learners. However, the differences between the RF-based learners (apart from CF) were again mostly subtle.
Regarding the Kendall rank correlation scores, Fig. 8 (Appendix B) generally mirrors the results obtained for the weighted Kappa. The CLM was competitive when the number of observations was low, but fell behind with increasing sample sizes. Similarly, CF was commonly outperformed by the RF-based methods and was on par with the CLM, or in some scenarios even behind. For the RF-based methods, the differences were once more subtle. As for the previous data generating processes, OF and OSOA could not notably improve upon naive OF.

3.3 Robustness of Data Generation

To study whether the results were sensitive to the generative model, we additionally reran parts of the simulation for a generative model that created data according to a linear regression model and binned the outcome into ordinal response categories. As such, this generative model mimicked the approach, commonly used in practice, of transforming numeric outcomes into ordinal outcomes. To this end, we generated the linear predictor term according to DGP 2 and added standard normal noise. The resulting numeric outcomes were binned such that the targeted relative class frequencies in Table 1 were obtained. This was achieved by approximating the empirical distribution function of the numeric outcome using 100 000 simulated observations. As binning values, the respective quantiles leading to the targeted class distribution pattern were selected. For instance, for the wide middle example mentioned above with \(k = 5\) categories and targeted relative frequencies of 0.11, 0.22, 0.33, 0.22, and 0.11, respectively, we selected the 11.11%, 33.33%, 66.66%, and 88.88% quantiles as binning values. This approach was inspired by the simulation study in Hornung (2019). Using the alternative generative model for DGP 2, we obtained results (see Online Supplement) that were consistent with the results described above, which indicates that our findings are robust w.r.t. the generative model used for simulation.
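A hedged sketch of this binning step, assuming y_num holds the simulated numeric outcomes:

```r
# Hedged sketch of the quantile-based binning: cumulative targeted class
# frequencies define the cut points; `y_num` is a placeholder for the
# simulated numeric outcomes (linear predictor plus standard normal noise).
target <- c(0.11, 0.22, 0.33, 0.22, 0.11)  # wide middle pattern, k = 5
cuts   <- quantile(y_num, probs = cumsum(target)[-length(target)])
y_ord  <- cut(y_num, breaks = c(-Inf, cuts, Inf),
              labels = FALSE)  # ordinal categories 1, ..., 5
```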

3.4 Runtime Comparison

Apart from the predictive performance, the required computation time should also be taken into account when comparing the different methods. Figure 4 shows the time needed for training and prediction with a given method relative to the time needed for the CLM. This implies that if absolute runtimes increase between conditions, but by the same factor for a given method and the CLM, the relative runtimes remain the same in both conditions. Relative runtimes have the advantage of being less machine-specific than absolute runtimes. The CLM was chosen as the reference method as it represents the classical approach for ordinal prediction as well as the computationally least expensive method. Note that the values have been log-transformed (base 10). As such, a value of 0 means that a given method needed the same time as the CLM, while a value of 2 means that its computation time was larger by a factor of 100. Since the choices of DGP, number of categories (except for RFSp, whose number of internal RF fits scales linearly with the number of categories), and class pattern had little impact on the relative runtime results, we only show the runtime comparison for data generated according to DGP 2 with five equally distributed categories. Regarding RFSp, we chose five categories as the middle ground of the different numbers of categories considered in our simulation. Since all computations were performed on a compute cluster, it cannot be guaranteed that the individual computations were performed on identical CPUs. Furthermore, we restricted all methods to a single core, which may have negatively impacted the runtime of some methods due to foregoing parallelization. Due to these limitations, our results should be interpreted as broad estimates rather than exact benchmarks. When comparing the relative runtimes, one can see that for all methods, the required runtime was higher by at least a factor of 10 compared to the CLM. Among the TE methods, the lowest runtimes were achieved by RF and naive OF, which both fit a single RF. As RFSp needed to fit four RF models, it required more runtime than RF and naive OF, as expected. In comparison to the RF-based methods, CF needed relatively higher runtimes for fitting a single model. As expected, the runtimes of OF and OSOA were quite high due to the optimization step; their computation time was between 1000 and 10,000 times higher than for the CLM. However, one should keep in mind that the runtime of OF and OSOA is directly linked to the number of optimization iterations, i.e., the number of different score sets evaluated in OF and the maximum number of evaluations in OSOA.
Regarding the impact of n and p, one can see that the relative computation times of the TE methods increase for larger sample sizes. This means that the CLM’s runtimes were less impacted by increased sample sizes than the runtimes of the TE methods. Increasing the number of covariates did not result in a similar effect, indicating that the CLM scaled similarly to the TE methods regarding the computation time.
For practical implications, however, one should also consider the actual runtimes instead of solely relying on the relative runtimes. For example, in the computationally most demanding case of \(n = 2500\) observations and \(p = 35\) covariates, the CLM needed a median time of 0.08 s, RF 3.40 s, naive OF 4.66 s, RFSp 10.26 s, CF 114.93 s, OSOA 394.76 s, and OF 479.79 s. Depending on the dataset and computational power at hand, the discrepancy between, say, a CLM and naive OF or RFSp may be negligible for practical purposes as long as one does not employ a hyperparameter tuning for the RF-based methods.

4 Real Data Examples

In addition to the simulation study, the seven methods were also evaluated on eight real datasets. For our selection of datasets, we strove to incorporate a variety of application domains and dataset characteristics. Therefore, we included datasets from psychology, (bio-)medicine, and the social sciences, as these are common application fields for ordinal prediction. The datasets also vary regarding their size and target variable properties (i.e., number of categories and their distribution). Table 2 provides an overview of the datasets. Out of these eight datasets, five (Birthweight, Boston, Hearth, Medical Care, and Wine Quality) were already analyzed in Tutz (2021), while the Mammography data were also used in Janitza et al. (2016) and Hornung (2019). The Birthweight dataset is concerned with predicting the birthweight of newborns. It was obtained from the MASS package. The original numeric target variable was categorized according to Tutz (2021). For the Boston dataset, it is of interest to predict the median value of owner-occupied homes in Boston. It was obtained from the mlbench package (Leisch & Dimitriadou, 2021). The numeric target variable was binned according to Tutz (2021). For the Hearth dataset, the goal is to predict the severity of coronary artery disease. It was taken from the ordinalForest package (Hornung, 2022). The Mammography dataset contains information about mammography experiences and was taken from the TH.data package (Hothorn, 2023).
Table 2
Description of real datasets used for evaluation

Name          Obs    Cov   Description and categories
Birthweight   189    8     Birth weight in grams
                           1: \(<2500\) \((n=59)\), 2: 2500–3000 \((n=38)\),
                           3: 3000–3500 \((n=45)\), 4: \(>3500\) \((n=47)\)
Boston        506    13    Median value of owner-occupied homes in $1000
                           1: \(<15\) \((n=97)\), 2: 15–19 \((n=78)\), 3: 19–22 \((n=109)\),
                           4: 22–25 \((n=98)\), 5: 25–32 \((n=57)\), 6: \(>32\) \((n=67)\)
Hearth        294    10    Severity of coronary artery disease
                           1: no disease \((n=188)\), 2: degree 1 \((n=37)\),
                           3: degree 2 \((n=26)\), 4: degree 3 \((n=28)\), 5: degree 4 \((n=15)\)
Mammography   412    5     Last mammography visit
                           1: Never \((n=234)\), 2: Within a year \((n=104)\),
                           3: Over a year \((n=74)\)
Medical Care  1778   10    Number of physician office visits
                           1: 0 \((n=329)\), 2: 1 \((n=183)\), 3: 2–3 \((n=362)\), 4: 4–6 \((n=398)\),
                           5: 7–8 \((n=149)\), 6: 9–11 \((n=149)\), 7: \(>11\) \((n=208)\)
Student       649    12    Final grade in Portuguese language course
                           1: 0–10 \((n=100)\), 2: 10–11 \((n=201)\), 3: 12–13 \((n=154)\),
                           4: 14–15 \((n=112)\), 5: 15–20 \((n=82)\)
Wage          3000   8     Wage of workers in Mid-Atlantic region in $1000
                           1: \(<75\) \((n=430)\), 2: 75–100 \((n=913)\), 3: 100–125 \((n=789)\),
                           4: 125–150 \((n=525)\), 5: \(>150\) \((n=343)\)
Wine Quality  4898   6     Wine quality rating
                           1: \(<5\) \((n=183)\), 2: 5 \((n=1457)\), 3: 6 \((n=2198)\),
                           4: 7 \((n=880)\), 5: \(>7\) \((n=180)\)
The Medical Care dataset originates from the US National Medical Expenditure Survey from 1987. It was obtained from the AER package (Kleiber & Zeileis, 2008). We chose the same subset of observations and covariates to predict the number of physician office visits as well as the same target variable binning as Tutz (2021). The Student dataset contains information about the final grade of students from a Portuguese language course. The data were taken from the UCI Machine Learning Repository (Cortez, 2014). We binned the target variable, originally on a 20-point scale, into five categories (see Table 2). As covariates, we selected gender, age, region (rural vs. urban), parents' cohabitation status, mother's education, father's education, weekly study time, presence of educational support from the school, presence of educational support from the family, partaking in paid extra classes, interest in pursuing higher education, as well as access to the internet at home. The Wage dataset was obtained from the ISLR package (James et al., 2021). The goal is to predict the wage of workers in the Mid-Atlantic region. The target variable was binned into five categories for our analysis (see Table 2). Lastly, the task for the Wine Quality dataset is predicting the quality score of wine. It was taken from Cortez et al. (2009). The original categories were coarsened according to Tutz (2021). None of the obtained datasets contained any missing values. To evaluate the seven learners, we performed a five-fold cross-validation with 50 replications. For the learners, we used the same settings as in the simulation study before.
Figure 5 shows the values for the weighted Kappa with linear weights achieved by the learners on the eight datasets. It can be seen that the CLM was notably outperformed on the Boston, Mammography, and Wine Quality datasets. For the Medical Care data, however, the CLM achieved the best performance of all learners. For the remaining datasets, it was competitive with the TE learners. When comparing the RF-based learners and CF, CF and the classification RF could never outperform the other learners and were either on par or lagged (slightly) behind. OF could improve upon naive OF for the Birthweight, Mammography, Medical Care, Student, Wage, and Wine Quality datasets. The predictive performance of OSOA was mostly aligned with the performance achieved by OF, except for the Medical Care data. RFSp was among the most consistently performing methods. It outperformed all other learners on the Wage dataset and was competitive for the remaining datasets. Overall, however, the performance differences between the RF-based learners were often on a small scale, apart from situational advantages or disadvantages for some methods. The fluctuating performance rankings did not reveal a general advantage for any method. When using Cohen's Kappa with quadratic weights, a similar picture emerged (see Online Supplement).
Figure 9 (Appendix B) shows the Kendall rank correlation scores achieved by the seven learners. Overall, the result patterns resembled the findings for the weighted Kappa. The CLM was outperformed on the Wine Quality and Mammography datasets as well as slightly on the Boston dataset. For the Medical Care dataset, the CLM performed best, and it was competitive for the remaining datasets. In contrast to the weighted Kappa with linear weights, CF lagged behind the RF-based learners less for Kendall's rank correlation scores. OF could not notably outperform its naive counterpart. As before, OSOA's performance mostly fell between OF and naive OF. Generally, the differences between the RF-based learners were mostly subtle. When taking all learners into account, the largest differences were mostly caused by the CLM when it either performed particularly well, as for the Medical Care data, or when it was not suitable, as for the Wine Quality and Mammography datasets. These findings were echoed when looking at the Spearman rank correlation scores; for the latter results, we refer to the Online Supplement. While we have so far compared the methods' predictive performance in relative terms, for practical purposes it is also relevant to consider the absolute predictive performance. Figures 5 and 9 show that overall, the highest predictive performance was achieved on the Boston dataset with Kendall rank correlations between 0.78 and 0.84, followed by the Wine Quality (only for the TE learners), Hearth, and Wage datasets. For the Birthweight, Student, Mammography, and Medical Care data, the predictive performance values achieved were generally lower (e.g., for the latter two datasets, rank correlations lower than 0.34 were achieved for all methods), indicating that these prediction tasks were more difficult.

5 Discussion

In this work, we provided an extensive comparison of the CLM and TE methods for ordinal classification such as RF (Breiman, 2001), CF (Hothorn et al., 2006), (naive) OF (Hornung, 2019), and RFSp (Tutz, 2021). We further contributed a new method through our proposed OSOA. OSOA employs a non-linear optimization algorithm for determining optimal numeric scores to be assigned to the ordinal response categories within a regression RF framework.
We studied all methods in a wide range of varying scenarios including three different DGPs that were characterized by an increasing amount of non-linear effects. Inspired by Hornung (2019), we further varied the class distributions using three different distribution patterns. Creating such diverse data settings helped us investigate under which circumstances traditional parametric models such as the CLM are competitive with modern, computationally more demanding TE methods, and at which point the latter offer a noticeable improvement in predictive performance. Furthermore, by including different TE approaches such as classification RFs, binarized RFs, regression RFs with unoptimized scores as well as regression RFs with optimized scores, we could study differences among the TE methods, particularly regarding the question of whether the computational cost of score optimization yields relevant benefits. Our extensive comparison has revealed several important insights, which we discuss in the following.

5.1 Finding 1: CLM Remains Competitive for Small Sample Sizes and Limited Non-linear Effects

Similar to the findings in Tutz (2021), our results were mainly characterized by cases in which the CLM either outperformed the other methods, was on par, or notably lagged behind. For the first DGP, which included linear effects only, the CLM notably outperformed all ML approaches. This was to be expected, as data generated from a proportional odds model with the all-linear effect structure represent the optimal use case for the CLM. With an increasing amount of non-linear effects, the CLM's performance suffered in comparison to the TE methods. However, it required quite strong non-linear effects for the CLM to be outperformed. Especially for small sample sizes, the CLM was regularly competitive even in the presence of non-linear effects. For larger sample sizes, the TE learners usually closed the performance gap (when they lagged behind for smaller samples) or widened it (when they were slightly ahead or on par for smaller samples), respectively. Our analysis of real datasets revealed a similar pattern, where most of the discrepancies between the different methods were caused by the performance of the CLM in relation to the TE methods.

5.2 Finding 2: TE Methods Reveal Only Small Differences Among Themselves

When comparing the TE methods, CF fell behind the RF-based learners once the DGPs increasingly included non-linear effects. This was more prevalent when using the weighted Kappa to assess the predictive performance and less so when using rank correlation scores. The differences between the RF-based methods themselves were mostly subtle. The latter finding is in line with Tutz (2021). The performance of our newly proposed OSOA mostly matched the performance of OF. When evaluating all methods on real datasets, we similarly found small differences between the TE methods. The largest performance differences were mostly caused by the CLM, which either performed particularly well on a dataset or was outperformed notably by the set of RF-based methods.
Regarding the number of covariates, we observed the RF-based learners to incur a more notable performance loss than the CLM. However, this effect could be attributed to the lack of hyperparameter tuning. Even though RFs are relatively robust regarding their parameter choices, the condition with a higher number of covariates added only noise variables, so that just 7 of the 35 covariates affected the outcome. By using the default setting (i.e., the square root of the number of covariates) for the hyperparameter mtry, which regulates how many covariates are randomly sampled for consideration in a given split, RFs potentially had to rely on noise variables only for many splits, which ultimately harmed the predictive performance.

5.3 Finding 3: Limited Benefit of Score Optimization in OF and OSOA

Regarding the question of whether optimizing scores in score-based methods such as OF and OSOA yields a benefit over naive OF, our results were mixed. In our simulation, we found that OF improved upon naive OF in cases where the distribution of classes was characterized by dominant middle categories. As such, this supports the findings in Hornung (2019). For the other distribution patterns, where classes were distributed equally or dominant margin categories were present, however, OF could not outperform naive OF. As OSOA's performance mostly aligned with OF, the findings above also hold when comparing OSOA to naive OF. For the real datasets, OF and OSOA could improve upon naive OF for six out of eight datasets, but the differences were often on a small scale. As such, the benefits of score optimization were rather situational. While it can improve the predictive performance, this is not guaranteed for any given dataset, and the high runtimes demonstrated in the runtime comparison must be kept in mind.

5.4 Limitations of Simulation Study

While we aimed to make our DGPs as diverse as possible by varying the amount of non-linear effects and the distribution of the response categories, all datasets were generated from models with identical effects across the categories. Future work could study DGPs that include category-specific effects. To this end, it would also be sensible to analyze additional parametric models such as partial proportional odds models (see, e.g., Brant, 1990; Peterson & Harrell, 1990) or the more recently proposed location-shift model by Tutz and Berger (2022). Additionally, all covariates were simulated as uncorrelated. While some correlations between covariates will have occurred randomly in individual simulation runs, systematically varying the correlation between covariates in future work may help illuminate further differences between the ordinal prediction methods studied here. Furthermore, our work focused only on datasets with relatively few covariates. For high-dimensional data, classical parametric models may run into problems (see, e.g., Zahid & Tutz, 2013). As such, it would be of interest to study how the findings from our comparison translate to high-dimensional settings. Another limitation of our work is the lack of hyperparameter tuning for the TE methods. Although this was in line with other works from the field (Janitza et al., 2016; Hornung, 2019; Tutz, 2021), and despite RF's relative robustness regarding parameter choices (Probst et al., 2019), hyperparameter tuning could have mitigated the performance loss suffered by the TE methods in the simulation scenarios with many noise variables. Further, we did not optimize the parameters of OSOA, as its inclusion in the comparison study served more as a showcase. Further tuning of OSOA's parameters may yield higher predictive performance, or limit the computational burden of the optimization procedure while sacrificing only little predictive performance.

5.5 Combining Multiple Prediction Methods as Ensembles

While this work focused on comparing individual prediction methods to assess their respective viability in different data scenarios, Tutz (2021) proposed combining multiple prediction models (e.g., CLM, OF) into a joint ensemble prediction model. To this end, the individual models are trained separately on the training data. For predicting new observations, the predicted response category probabilities of the individual models are aggregated via a weighted mean. The weights are determined in a preceding step in which the prediction methods are evaluated separately on a subset of the training data, with higher predictive performance (relative to the other methods) resulting in a higher weight. For more details on the joint ensemble approach, we refer to Tutz (2021). When running an ensemble consisting of a CLM, an OF, and an RFSp model on the real data examples, the joint ensemble achieved the highest linearly weighted Kappa values for two datasets (Fig. 10, Appendix C), indicating that this approach can yield benefits. For some datasets, the joint ensemble was partly held back when one of its prediction models performed poorly on the given dataset (e.g., the CLM for the Wine Quality data). The joint ensemble approach can thus be worthwhile to consider, but it requires several design choices whose optimal configuration is likely to vary between datasets, e.g., how many and which prediction methods to include and which weighting strategy to use. As the joint ensemble fits each prediction method multiple times (due to the weight computation), this approach also leads to extended computational runtimes. Furthermore, if interpretability is of interest, the ensemble approach forfeits the interpretability its individual models potentially provide. However, if only predictive performance is relevant for the application at hand, joint ensembles can be a sensible approach, as sketched below.
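
As a minimal base R sketch of this aggregation step (assuming each fitted model returns an n x k matrix of predicted category probabilities and that the performance-based weights have already been computed on a validation subset; all object and function names are ours, not those of Tutz, 2021):

  # prob_list: list of n x k probability matrices, one per prediction method
  # weights:   non-negative, performance-based weights, one per method
  ensemble_predict <- function(prob_list, weights) {
    weights <- weights / sum(weights)                    # normalize weights
    probs <- Reduce(`+`, Map(`*`, prob_list, weights))   # weighted mean of probabilities
    max.col(probs)                                       # index of the modal category
  }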

5.6 Avenues for Further Methodological Research

A possible explanation for the lack of consistent improvement upon naive OF could be that OF and OSOA both internally optimize a class-balanced version of Youden's J. As Youden's J reflects sensitivity and specificity, it is not a performance measure specifically tailored towards ordinal classification. Optimizing Youden's J may therefore not necessarily yield a better classification result from an ordinal perspective. One could investigate whether optimizing other performance measures, such as the weighted Kappa, would lead to more consistent improvement upon naive OF.
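For illustration, a self-contained base R implementation of the linearly weighted Kappa that could serve as such an alternative optimization target (assuming integer class labels 1, ..., k; the function name is ours):

  # Linearly weighted Cohen's Kappa for ordinal labels 1, ..., k.
  weighted_kappa <- function(truth, pred, k) {
    O <- table(factor(truth, levels = 1:k),
               factor(pred,  levels = 1:k)) / length(truth)  # observed proportions
    E <- outer(rowSums(O), colSums(O))                       # expected under independence
    W <- 1 - abs(outer(1:k, 1:k, "-")) / (k - 1)             # linear agreement weights
    (sum(W * O) - sum(W * E)) / (1 - sum(W * E))
  }

Unlike Youden's J, this measure penalizes misclassifications more strongly the further the predicted category lies from the true one.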
Specifically for OSOA, the proposed optimization procedure could be enhanced by including restarts, which could help the optimizer escape local optima and possibly find better solutions. Additionally, one could use multiple sets of starting score values for initializing the optimization instead of only the scores derived from a partition of the [0, 1] interval into sub-intervals of equal width. Furthermore, one could explore alternative optimization approaches, e.g., evolutionary algorithms (EAs) such as the covariance matrix adaptation evolution strategy (Hansen & Ostermeier, 1996). EAs keep a population of solution candidates that is continually improved by recombining and mutating current population members, or by sampling new candidates from distributions influenced by the current ones (for an introduction to EAs, see, e.g., Pétrowski & Ben-Hamida, 2017).
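A minimal sketch of such a multi-start scheme in base R, assuming a black-box objective perf() that returns the (to-be-maximized) internal performance estimate for a given score vector; perf() and all other names are placeholders rather than the actual OSOA internals:

  # Multi-start search over monotone score vectors for k categories.
  multi_start_scores <- function(perf, k, n_starts = 10, jitter_sd = 0.05) {
    base <- seq(0.5 / k, 1 - 0.5 / k, length.out = k)  # midpoints of an equal-width
                                                       # partition of [0, 1]
    best <- NULL
    for (i in seq_len(n_starts)) {
      start <- if (i == 1) base else sort(base + rnorm(k, sd = jitter_sd))
      res <- optim(start, function(s) -perf(s), method = "Nelder-Mead")
      if (is.null(best) || res$value < best$value) best <- res
    }
    best$par  # score vector achieving the highest observed performance
  }

Sorting the jittered starting values keeps each candidate score vector monotone, as required for ordinal categories.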
Another possible explanation for the lack of improvement achieved through score optimization may lie in the prediction procedure of OF and OSOA. OF and naive OF determine predictions for new observations by first obtaining the predicted numeric scores from all trees. For each tree, the predicted score is then transformed into a class label using the associated class borders, and the final class prediction is computed by aggregating all class labels via majority voting. As OSOA was modeled after OF in this regard, it follows the same prediction procedure. However, since the individual trees in a forest are usually grown without heavily restricting their complexity, the resulting terminal nodes may often be pure, i.e., contain (almost) only observations from a single category. The predicted value for observations falling into such a terminal node is then simply the numeric score assigned to the respective category, or a value very close to it if the node is not entirely pure. Since the predicted numeric score is immediately transformed into a class label and not processed further in its numeric state, the actual numeric score assigned to the respective category may have little impact on the overall prediction behavior. In future work, one could investigate whether score optimization yields greater benefits when the numeric scores from the individual trees are first aggregated by averaging, with a class then assigned to the aggregated prediction score, as contrasted below.
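The following base R sketch contrasts the two aggregation schemes for a single new observation, assuming the per-tree predicted scores and the k - 1 class borders on the score scale are given; all names are illustrative:

  # tree_scores: numeric scores predicted by the individual trees
  # borders:     class borders b_1 < ... < b_{k-1} on the score scale
  predict_vote <- function(tree_scores, borders) {
    labels <- findInterval(tree_scores, borders) + 1   # per-tree class labels
    as.integer(names(which.max(table(labels))))        # majority vote
  }
  predict_avg <- function(tree_scores, borders) {
    findInterval(mean(tree_scores), borders) + 1       # classify the averaged score
  }

In predict_avg, the scores remain on their continuous scale until after aggregation, so the chosen score values can shift the averaged prediction across a class border, whereas in predict_vote they are discretized tree by tree.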

5.7 Practical Recommendations

Practical machine learning applications often follow a workflow in which the data problem at hand is approached not by fitting a single prediction method, but by comparing several methods on a subset of the data to determine which achieves the best performance for the given data. While it would be ideal to compare as many methods as possible, this is often not feasible in practice due to limited time and computational resources. One therefore typically resorts to benchmarking only a selected set of prediction methods. Acknowledging this common compromise, we offer the following recommendations that may help arrive at a feasible set of methods to compare for a given data situation. Our work demonstrated that while the RF-based methods can outperform the CLM for non-linear effects and larger sample sizes, relatively strong non-linear effects were required to cause a performance gap. Even in the presence of non-linear effects, the CLM was competitive for small sample sizes. In line with Tutz (2021), we therefore recommend always considering a parametric model as a benchmark against which other methods are gauged. For small sample sizes and weak non-linear effects, the parametric model may already suffice and may even perform better, as the TE methods often require a certain baseline of observations before achieving satisfying performance. Small sample sizes are not uncommon in fields like psychology and the social sciences, where ordinal responses are frequently encountered. For example, in their review of 1568 samples used in psychological publications between 1995 and 2008, Shen et al. (2011) found a median sample size of 172 and a mean sample size of 690 observations.
As the TE methods often performed similarly and most differences in our comparison arose from performance gaps between the CLM and the relatively homogeneous set of TE methods, we recommend comparing the parametric benchmark model with a naive OF or RFSp; classical RF often lagged slightly behind the other TE methods. The score optimization procedures of OF and OSOA proved computationally demanding, and their benefit was mostly situational. Comparing a parametric model with a computationally less demanding naive OF or RFSp model can therefore give a first indication of whether the application at hand is more suited to parametric models or TE methods (a sketch of such a comparison follows below). Should naive OF or RFSp outperform the parametric model, and should the distribution of the response categories be characterized by dominant middle categories, employing a score-optimizing TE method may be worthwhile (if computational resources permit). Regarding the TE methods, however, one should keep in mind that despite RF's robustness regarding its hyperparameter settings (Probst et al., 2019), scenarios with an expected high rate of noisy covariates and a low rate of influential covariates may warrant hyperparameter tuning. This, in turn, increases the runtime even further and, in combination with a score optimization approach as in OF and OSOA, may be less feasible for practical purposes.
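As a hedged illustration of such a first-pass benchmark, the following R sketch compares a CLM (via MASS::polr) with a regression RF fitted on integer-coded scores (via the ranger package) on a holdout split, using Kendall's rank correlation as the evaluation measure. The regression RF merely stands in for the naive score-based TE approach; this is not the exact implementation used in our comparison, and all names are ours.

  library(MASS)    # polr(): proportional odds model
  library(ranger)  # fast random forest implementation

  # df:  data frame with an ordered factor response y and covariates
  # idx: row indices of the training subset
  benchmark_pair <- function(df, idx) {
    train <- df[idx, ]; test <- df[-idx, ]
    k <- nlevels(df$y)
    # Parametric benchmark: proportional odds model
    clm <- polr(y ~ ., data = train, Hess = TRUE)
    clm_pred <- as.integer(predict(clm, newdata = test, type = "class"))
    # Naive score-based stand-in: regression RF on integer scores 1, ..., k
    rf <- ranger(y ~ ., data = transform(train, y = as.integer(y)))
    rf_raw <- predict(rf, data = transform(test, y = as.integer(y)))$predictions
    rf_pred <- pmin(pmax(round(rf_raw), 1L), k)   # round and clamp to 1, ..., k
    truth <- as.integer(test$y)
    c(clm = cor(truth, clm_pred, method = "kendall"),
      rf  = cor(truth, rf_pred,  method = "kendall"))
  }

Whichever method clearly dominates on such a split indicates whether a parametric or a TE route is more promising for the data at hand.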
Instead of selecting a prediction model based on the recommendations above, one can also follow the joint ensemble approach from Tutz (2021), with the implications discussed earlier. Since the joint ensemble performs model selection internally, it offers a viable alternative to the benchmarking approach. In this case, our recommendations could guide which prediction models to include in the joint ensemble.

Acknowledgements

The authors would like to thank Dr. Marie Beisemann for helpful discussions and valuable feedback. This work has been partly supported by the Research Center Trustworthy Data Science and Security (https://rc-trust.ai), one of the Research Alliance centers within the UA Ruhr (https://uaruhr.de). Additionally, the authors gratefully acknowledge the computing time provided on the Linux HPC cluster at TU Dortmund University (LiDO3), partially funded in the course of the Large-Scale Equipment Initiative by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) as project 271512359.

Declarations

Conflict of Interest

The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix

Appendix A: Threshold Values for Simulation Study

Table 3
Threshold values for combinations of DGP, class distribution pattern, and k. For k response categories, the thresholds γ1, ..., γk are listed; the last threshold is always ∞, and columns beyond γk do not apply.

DGP    Class pattern  k  γ1      γ2     γ3     γ4     γ5     γ6     γ7
DGP 1  Equal          3  -3      0.75   ∞
                      5  -4.75   -2     0.25   3      ∞
                      7  -5.75   -3.5   -1.75  -0.25  1.5    3.75   ∞
       Wide margins   3  -2      0.25   ∞
                      5  -3.75   -1.75  -0.5   1.75   ∞
                      7  -4.75   -2.75  -1.5   -0.75  0.5    2.5    ∞
       Wide middle    3  -4      2      ∞
                      5  -6.25   -2.75  1      4.5    ∞
                      7  -8      -5     -2.5   0.25   2.75   5.5    ∞
DGP 2  Equal          3  -5.25   -1.25  ∞
                      5  -7      -4.25  -2     0.75   ∞
                      7  -8      -5.75  -4     -2.25  -0.5   1.75   ∞
       Wide margins   3  -4.25   -2     ∞
                      5  -6      -3.75  -2.5   -0.25  ∞
                      7  -7      -5     -3.75  -3     -1.75  0.5    ∞
       Wide middle    3  -6.25   0      ∞
                      5  -8.75   -5.25  -1.25  2.5    ∞
                      7  -10     -7.25  -4.75  -1.75  1      4      ∞
DGP 3  Equal          3  -6      -2     ∞
                      5  -8      -5.25  -3     -0.25  ∞
                      7  -9      -6.75  -5     -3.25  -1.5   0.75   ∞
       Wide margins   3  -5.25   -3     ∞
                      5  -7      -4.75  -3.5   -1.25  ∞
                      7  -8      -5.75  -4.5   -3.75  -2.5   -0.25  ∞
       Wide middle    3  -7.25   -1     ∞
                      5  -10     -6     -2     1.5    ∞
                      7  -11.25  -8.25  -5.5   -2.5   0      3      ∞

Appendix B: Kendall Rank Correlation for Simulation and Real Data Comparisons

Appendix C: Real Data Example Results for Joint Ensemble Learner

Supplementary Information

Below is the link to the electronic supplementary material.
Literature
Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on typical tabular data? In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, & A. Oh (Eds.), Advances in Neural Information Processing Systems 35 (NeurIPS 2022).
Immekus, J. C., Jeong, T.-s., & Yoo, J. E. (2022). Machine learning procedures for predictor variable selection for schoolwork-related anxiety: Evidence from PISA 2015 mathematics, reading, and science assessments. Large-scale Assessments in Education, 10(1). https://doi.org/10.1186/s40536-022-00150-8
Kendall, M. G. (1948). Rank correlation methods. London, UK: Griffin.
Kramer, S., Widmer, G., Pfahringer, B., & de Groeve, M. (2000). Prediction of ordinal classes using regression trees. In Z. W. Raś & S. Ohsuga (Eds.), Lecture Notes in Computer Science: Vol. 1932. Foundations of Intelligent Systems. ISMIS 2000 (pp. 426–434). https://doi.org/10.1007/3-540-39963-1_45
Leisch, F., & Dimitriadou, E. (2021). mlbench: Machine learning benchmark problems [Computer software manual]. (R package version 2.1-3.1)
Pétrowski, A., & Ben-Hamida, S. (2017). Evolutionary algorithms. Hoboken, NJ: Wiley.
Rowan, T. H. (1990). Functional stability analysis of numerical algorithms (Doctoral dissertation). Retrieved from ProQuest Dissertations and Theses database. (UMI No. 9031702)