Abstract
In this paper we develop a new machine learning estimator for ordered choice models based on the Random Forest. The proposed Ordered Forest flexibly estimates the conditional choice probabilities while taking the ordering information explicitly into account. In contrast to common machine learning estimators, it additionally enables the estimation of marginal effects as well as the conduct of inference and thus provides the same output as classical econometric estimators. An extensive simulation study reveals a good predictive performance, particularly in settings with nonlinearities and high correlation among covariates. An empirical application contrasts the estimation of marginal effects and their standard errors with an Ordered Logit model. A software implementation of the Ordered Forest is provided both in R and Python in the package orf available on CRAN and PyPI, respectively.
Notes
A previous version of the paper was presented at research seminars of the University of St. Gallen, at the German Statistical Week in Trier and the Statistics of Machine Learning Conference in Prague. We thank participants, in particular Francesco Audrino, Martin Biewen, Daniel Goller, Michael Knaus and David Preinerstorfer, as well as two anonymous reviewers for helpful comments and suggestions. The usual disclaimer applies. Gabriel Okasa conducted this work while affiliated with the Swiss Institute for Empirical Economic Research at the University of St. Gallen (SEW-HSG).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1 Introduction
Many empirical models deal with categorical dependent variables which have an inherent ordering. In such cases the outcome variable is measured on an ordered scale, such as level of education defined by primary, secondary and tertiary education, or income coded into low, middle and high income levels. Further examples include survey outcomes on self-assessed health status (bad, good, very good, see, for example, Case et al. 2002; or Murasko 2008), level of life satisfaction and happiness (Boes et al. 2010; and Boes and Winkelmann 2010) or political opinions (do not agree, agree, strongly agree, see, for example, Jackson and Darrow 2005; or Jackman 2009) as well as grades, scores and various ratings and valuations (see Butler et al. 1998; Hamermesh and Parker 2005; Afonso et al. 2009; or Gogas et al. 2014, for some further examples). Moreover, even sports outcomes resulting in loss, draw and win are part of such a modelling framework (e.g. Goller et al. 2021). So far, the Ordered Probit and Ordered Logit models represent the workhorse models in such cases. The main advantage of these models is the ease of estimation, usually done by maximum likelihood. However, the major disadvantage is the strong parametric assumptions, which are imposed for convenience rather than derived from any substantive knowledge about the application. Unfortunately, the desired marginal effects are sensitive to these assumptions. Although there is a large literature on how to generalize these assumptions in the case of binary choice models (Matzkin 1992; Ichimura 1993; Klein and Spady 1993) or multinomial (unordered) choice models (Lee 1995; Fox 2007), limited work has been done for ordered choice models (Lewbel 2000; Klein and Sherman 2002; also see Stewart 2005, for an overview).
In this paper, we exploit recent advances in the machine learning literature to develop an estimator for conditional choice probabilities as well as marginal effects together with inference procedures when the outcome variable has an ordered categorical nature. The proposed Ordered Forest estimator is based on the Regression Random Forest algorithm as introduced by Breiman (2001) and makes use of cumulative probability predictions based on binary indicators of the respective ordered categories to flexibly estimate the single choice probabilities of the particular ordered category, conditional on covariates. Furthermore, to analyse the relationship of the ordered choice probabilities with the covariates, the Ordered Forest exploits numerical derivative approximations for the estimation of the mean marginal effects and the marginal effects at the mean, the typical quantities of interest in the field of discrete choice models (see, for example, Greene and Hensher 2010). Finally, in order to quantify the estimation uncertainty of the above parameters, the Ordered Forest estimated with honesty, i.e. with sample splitting, adapts the weight-based inference proposed by Lechner (2018), using the asymptotic results of Wager and Athey (2018) for the consistency and normality of Random Forest predictions for the case of ordered categorical outcomes. Thus, the Ordered Forest estimator provides not only the point estimates for the conditional choice probabilities and the corresponding marginal effects, but in its honest version also an estimate of the respective standard errors. We investigate the predictive performance of the estimator by comparing it to classical and other competing methods via a large-scale Monte Carlo simulation study as well as using real datasets. The results from the synthetic simulation reveal good performance of the Ordered Forest in finite samples throughout all simulation designs, including high-dimensional settings. In particular, the superior performance of the estimator over the parametric Ordered Logit becomes apparent when dealing with nonlinear functional forms and high correlation among covariates. Furthermore, the Ordered Forest in its non-honest version, i.e. without sample splitting, outperforms the competing forest-based estimators in the most complex simulation designs. Additionally, the results from the empirical evaluation further confirm the good predictive performance of the estimator in real datasets. Lastly, an empirical application demonstrates the estimation of the marginal effects and the associated inference procedure based on the honest version of the Ordered Forest. The empirical results highlight the value of the additional flexibility in the effect estimation of relevant economic parameters. Moreover, to enable the usage of the method by applied researchers, a free software implementation of the Ordered Forest estimator has been developed in R (R Core Team 2021) as well as in Python (Van Rossum and Drake 2009) and is provided in the package orf available on the official CRAN (Lechner and Okasa 2019) and PyPI (Lechner et al. 2022) repositories.1
This paper contributes to the econometric as well as the machine learning literature in several ways. In terms of econometrics, this paper develops a new estimator of ordered choice models based on a machine learning algorithm. The proposed Ordered Forest estimator improves on the classical parametric models such as the Ordered Logit and Ordered Probit models by allowing ex-ante flexible functional forms as well as allowing for a larger covariate space. The latter is a feature of many machine learning methods, but is typically absent from standard econometrics. In terms of machine learning, this paper develops a new type of Random Forest estimator adapted to ordered categorical outcomes. As such, the proposed Ordered Forest extends the classical regression forests as developed by Breiman (2001) and Wager and Athey (2018) specifically for the estimation of ordered choice models and thus expands the forest-based estimators for particular econometric models such as, for example, the survival forest (Hothorn et al. 2004) designed for the estimation of survival models or the quantile regression forest (Meinshausen 2006) for the estimation of conditional quantiles. In addition to the above forest-based estimators, the Ordered Forest further advances machine learning methods with the estimation of marginal effects and the inference thereof, a feature of many parametric models, but generally missing in the machine learning literature. Hence, our contribution is twofold. First, with respect to the literature on parametric estimation of ordered choice models, the Ordered Forest represents a flexible estimator without any parametric assumptions, while providing essentially the same information as an ordered parametric model. Second, with respect to the machine learning literature, the Ordered Forest achieves precise estimation of ordered choice probabilities, while adding the estimation of marginal effects as well as statistical inference thereof.
This paper is organized as follows. Section 2 discusses the related literature concerning parametric and machine learning methods for the estimation of ordered choice models. Section 3 reviews the Random Forest algorithm and its theoretical properties. In Sect. 4 the Ordered Forest estimator is introduced including the estimation of the conditional choice probabilities, marginal effects and the inference procedure. The Monte Carlo simulation is presented in Sect. 5. Section 6 shows an empirical application. Section 7 concludes. Further details regarding estimation methods, the simulation study and the empirical application are provided in Appendices A, B and C, respectively.
2 Literature
In econometrics, the Ordered Probit and Ordered Logit models are widely used when there are ordered response variables (McCullagh 1980). These models build on the latent regression model assuming an underlying continuous outcome \(Y_i^*\) as a linear function of regressors \(X_i\) with unknown coefficients \(\beta \), while assuming that the latent error term \(u_i\) follows a particular distribution, i.e. the standard normal or the logistic distribution in the case of Ordered Probit and Ordered Logit, respectively. Furthermore, the ordered discrete outcome \(Y_i\) represents categories that cover a certain range of the latent continuous \(Y_i^*\) and is determined by unknown threshold parameters \(\alpha _m\). Formally, in the case of the Ordered Logit the latent model is defined as:
$$\begin{aligned} Y_i^*&= X_i'\beta + u_i \quad \text { with } \quad u_i \mid X_i \sim \text {Logistic}(0,1) , \end{aligned}$$
(2.1)
with unknown threshold parameters \(\alpha _0<\alpha _1<...<\alpha _M\) such that:
$$\begin{aligned} Y_i&= m \quad \text { if } \quad \alpha _{m-1} < Y_i^* \le \alpha _m \quad \text { for } \quad m=1,...,M , \end{aligned}$$
(2.2)
where the coefficients and the thresholds are commonly estimated via maximum likelihood, with the delta method or bootstrapping used for inference. The above latent model is also often motivated by the quantity of interest, i.e. the conditional ordered choice probabilities \(P[Y_i=m \mid X_i=x]\).
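To make the mapping from the latent model to the choice probabilities concrete, the following minimal sketch computes \(P[Y_i=m \mid X_i=x]\) as the difference of two adjacent logistic cumulative probabilities, \(\Lambda (\alpha _m - x'\beta ) - \Lambda (\alpha _{m-1} - x'\beta )\). It is a hypothetical illustration in Python with NumPy, not the paper's implementation; the function name and toy values are our own.

```python
import numpy as np

def ordered_logit_probs(x, beta, alphas):
    """Ordered Logit choice probabilities P[Y = m | X = x].

    x      : (p,) covariate vector
    beta   : (p,) coefficients, constant across classes
             (the parallel regression assumption)
    alphas : (M-1,) strictly increasing interior thresholds
    """
    # cumulative probabilities P[Y <= m | X = x] = Lambda(alpha_m - x'beta)
    cdf = 1.0 / (1.0 + np.exp(-(alphas - x @ beta)))
    # pad with P[Y <= 0] = 0 and P[Y <= M] = 1, then difference (M classes)
    cdf = np.concatenate(([0.0], cdf, [1.0]))
    return np.diff(cdf)

# toy example: M = 3 classes, 2 covariates
probs = ordered_logit_probs(np.array([0.5, -1.0]),
                            np.array([1.0, 0.5]),
                            np.array([-1.0, 1.0]))
print(probs, probs.sum())  # three class probabilities summing to 1
```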
Although such models are relatively easy to estimate, they impose strong parametric assumptions which hinder their flexibility. Apart from the assumptions about the distribution of the error term, further functional form assumptions are imposed. As is clear from (2.1), the coefficients \(\beta \) are constant across the outcome classes, which is often labelled as the parallel regression assumption (Williams 2016). This inflexibility affects both the estimation of the choice probabilities as well as the estimation of the marginal effects. For these reasons, generalizations of these models have been proposed in the literature in order to relax some of the assumptions. An example of such models is the Generalized Ordered Logit model (McCullagh and Nelder 1989), where the parallel regression assumption is abandoned. Boes and Winkelmann (2006) provide an excellent overview of several other generalized parametric models. However, all of these models retain some of the distributional assumptions which limit their modelling flexibility.
Besides the standard econometric literature on parametric specifications of ordered choice models (for an overview see Agresti 2002; or Boes and Winkelmann 2006), a new strand of literature devoted to relaxing the parametric assumptions by using novel machine learning methods is emerging. Particularly, tree-based methods have gained considerable attention. Trees (Breiman et al. 1984) and Random Forests (Breiman 2001) are highly flexible, local nonparametric prediction methods, which can effectively deal with large-dimensional settings (Biau and Scornet 2016). In particular, trees recursively split the sample into smaller, non-overlapping strata, the so-called leaves of the tree, with the goal of grouping homogeneous observations within the leaves, but heterogeneous ones across the leaves. The splits are placed by choosing a specific value of a covariate that leads to the largest reduction of the pre-specified loss, e.g. mean-squared error. The prediction rule of the tree is then based on averaging the outcomes in the leaves of the tree. While single trees have a high degree of interpretability, they suffer from unstable splits and a lack of smoothness due to the recursive, path-dependent structure (Hastie et al. 2009). An improvement in this respect is achieved by the so-called bagging of trees, i.e. bootstrap aggregation (Efron and Tibshirani 1994; Bühlmann and Yu 2002). In this case, for each bootstrap sample a single tree is estimated, and the predictions of each tree are averaged over all trees, thus stabilizing the predictions and reducing the variance (Hastie et al. 2009). In addition to randomly choosing observations via bootstrapping, randomly choosing a subset of covariates used for the splitting has led to the development of Random Forests (Breiman 2001), which have demonstrated even better prediction performance. Furthermore, the theoretical properties of Random Forests have been extensively studied, which makes them amenable to econometric applications where statistical inference is of importance (see Meinshausen 2006; Biau 2012; Wager et al. 2014; Wager 2014; Scornet et al. 2015; Mentch and Hooker 2016; Tibshirani et al. 2018, for a discussion of statistical properties of different types of Random Forests). As a result, variations of Random Forests adapted towards treatment effect estimation, the so-called Causal Forests, have been developed (Wager and Athey 2018; Lechner 2018; Athey et al. 2019) and successfully applied in several empirical studies (Athey and Wager 2019; Cockx et al. 2023; Hodler et al. 2023).
In a similar vein, we leverage the benefits of Random Forests for a flexible estimation of ordered choice models. Although the classical Random Forest algorithms introduced by Breiman (2001) are very powerful in both regression as well as classification (see Loh 2011, for a review), there is a need for adjustment when predicting ordered responses. In the case of regression, the discrete nature of the outcome is not taken into account, and in the case of classification, the ordered nature of the outcome is not taken into account. As such, appropriate modifications of the standard Random Forest algorithm are desired in order to predict conditional probabilities of discrete ordered outcomes.2 Based on the Random Forests algorithm, Hothorn et al. (2006) propose a method building on their conditional inference framework for recursive partitioning which can also deal with ordered outcomes. Similarly, Hornung (2019a) proposes an Ordinal Forest method for the prediction of ordinal response variables. While both of these approaches take the ordering information of the outcomes into account, they focus mainly on prediction and variable importance without considering the estimation of marginal effects or the associated inference for the effects, which are a fundamental part of the classical econometric ordered choice models. We propose a new estimator, the Ordered Forest, that adapts the Random Forest algorithm to provide not only predictions of conditional probabilities, but also to enable the estimation of marginal effects and the inference thereof, thus offering a flexible alternative to parametric ordered choice models without imposing strict functional form assumptions. In what follows, we formally define the underlying Random Forests and derive the Ordered Forest estimator.
3 Random Forests
Random Forests as introduced by Breiman (2001) quickly became a very popular prediction method thanks to their good prediction accuracy, while being relatively simple to tune. Further advantages of Random Forests as a nonparametric technique are the high degree of flexibility and the ability to deal with a large number of predictors, while coping better with the curse of dimensionality than classical nonparametric methods such as kernel or local linear regression (see, for example, Racine 2008). In what follows we focus on the definition of the Regression Random Forest as the building block of the Ordered Forest estimator.
Random Forests are based on bootstrap aggregation, i.e. the so-called bagging of single regression trees, where the covariates considered for each next split within a tree are selected at random. More precisely, the Random Forest algorithm draws a bootstrap sample \(Z_i^*(X_i, Y_i)\) of size N from the available training data for \(b=1,...,B\) bootstrap replications. For each bootstrapped sample, a Random Forest tree is grown by recursive partitioning until the minimum leaf size is reached. The recursive partitioning is based on finding an optimal split given by a splitting covariate and its splitting point, such that the mean-squared error is minimized. This is achieved by a greedy search over all covariates and all possible splitting points, where the predictions are based on averaging outcomes in the resulting subsets defined by the split (Hastie et al. 2009). At each of the splits, m out of p covariates chosen at random are considered. The minimum leaf size then determines how many recursive splits are conducted, i.e. how deep the trees are grown. After all B trees are grown in this fashion, the Regression Random Forest prediction \(\hat{\mu }(x)\) of the conditional mean \(E[Y_i\mid X_i=x]\) is the ensemble of the tree predictions \(\hat{\mu }_b(x)\):
$$\begin{aligned} \hat{\mu }(x)=\frac{1}{B}\sum ^{B}_{b=1}\hat{\mu }_b(x) \quad \text { with } \quad \hat{\mu }_b(x)=\frac{\sum _{i:X_i \in L_b(x)}Y_i}{\mid \{ i:X_i \in L_b(x) \} \mid } , \end{aligned}$$
(3.1)
where \(L_b(x)\) denotes a leaf containing x. Single trees, if grown sufficiently deep, have a low bias, but fairly high variance. By averaging over many single trees, with the sets of observations and splitting covariates chosen at random, the variance of the estimator is reduced substantially. First, the variance reduction is achieved through bagging. The higher the number of bootstrap replications, the lower the variance. Second, the variance is further reduced through the random selection of covariates. The lower the number of covariates considered for a split, the lower the correlation between the trees and, consequently, the bigger the variance reduction of the average (Hastie et al. 2009).
Another attractive feature of Random Forests is the weighted average representation of the final estimate of the conditional mean \(E[Y_i\mid X_i=x]\). As such we can rewrite the Random Forest prediction as follows:
$$\begin{aligned} \hat{\mu }(x)=\sum ^{N}_{i=1}\hat{w}_i(x)Y_i \quad \text { with } \quad \hat{w}_i(x)=\frac{1}{B}\sum ^{B}_{b=1}\hat{w}_{b,i}(x) \quad \text { and } \quad \hat{w}_{b,i}(x)=\frac{\textbf{1}(\{X_i \in L_{b}(x) \})}{\mid \{ i:X_i \in L_{b}(x) \} \mid } , \end{aligned}$$
(3.2)
As such the forest weights \(\hat{w}_i(x)\) are again an average over all single tree weights. These tree weights capture whether the training example \(X_i\) falls into the leaf \(L_b(x)\), scaled by the size of that leaf. Notice that the weights are locally adaptive. Intuitively, Random Forests resemble the classical nonparametric kernel regression with an adaptive, data-driven bandwidth and with a limited curse of dimensionality. One can show that in the regression case, the Random Forest estimate as defined in (3.1) is equivalent to the weighting estimate defined in (3.2). This weighting perspective of Random Forests was first suggested by Hothorn et al. (2004) and Meinshausen (2006) in the scope of survival and quantile regression, respectively. Recently, Athey et al. (2019) point out the usefulness of the Random Forest weights in various estimation tasks. In this spirit, we will later use the forest-induced weights explicitly for inference in Sect. 4.3, as has been recently suggested by Lechner (2018).
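The equivalence of (3.1) and (3.2) can be checked numerically. The following sketch recovers the forest weights from the leaf memberships of a fitted scikit-learn regression forest; it is our own illustrative construction (variable names and the use of scikit-learn are assumptions), with bootstrapping switched off so that each leaf mean is computed over the full training sample and the equivalence holds exactly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = np.sin(2 * X[:, 0]) + rng.normal(size=500)

# bootstrap=False so each leaf mean uses the full training sample and the
# weighted-average representation reproduces the prediction exactly
forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=5,
                               max_features=0.33, bootstrap=False,
                               random_state=0).fit(X, y)
x_new = rng.normal(size=(1, 5))

# recover the forest weights w_i(x): within each tree, 1/|leaf| for the
# training points sharing the leaf of x_new, then average across trees
train_leaves = forest.apply(X)      # (N, B) leaf index of each point per tree
new_leaves = forest.apply(x_new)    # (1, B)
weights = np.zeros(len(X))
for b in range(forest.n_estimators):
    in_leaf = train_leaves[:, b] == new_leaves[0, b]
    weights[in_leaf] += 1.0 / in_leaf.sum()
weights /= forest.n_estimators

# the weighted average of outcomes equals the forest prediction: (3.1) = (3.2)
print(weights @ y, forest.predict(x_new)[0])
```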
Besides the huge popularity of Random Forests for prediction, the statistical literature has focused on establishing asymptotic properties of Random Forests as well (Meinshausen 2006; Biau 2012; Scornet et al. 2015; Mentch and Hooker 2016). A major step towards formally valid inference was taken in recent work by Wager (2014) and Wager and Athey (2018), who prove consistency and asymptotic normality of Random Forest predictions under some modifications of the standard Random Forest algorithm. These modifications concern both the tree-building procedure as well as the tree-aggregation scheme. First, the tree aggregation is done using subsampling without replacement instead of bootstrapping. Second, the tree-building procedure introduces the major and crucial condition of so-called honesty as first suggested by Athey and Imbens (2016). A tree is honest if it does not use the same responses for both placing splits and estimating the within-leaf predictions. This can be achieved by so-called double-sample trees, which split the random subsample of training data \(Z_i^*(X_i, Y_i)\) into two disjoint sets of the same size, where one is used for placing splits and the other for estimating the predictions. Furthermore, for consistency it is essential that the size of the leaves L of the trees becomes small relative to the sample size as N gets large.3 This is achieved by introducing some randomness in choosing the splitting variables. Particularly, each covariate receives a minimum positive probability of being chosen for a split. A tree constructed in this way is said to be a random-split tree. Additionally, the trees are required to be \(\alpha \)-regular, meaning that after each split, both of the child nodes contain at least a fraction \(\alpha > 0\) of the training data. Also, trees have to be symmetric in the sense that the order of the training data is independent of the predictor output. Lastly, some additional regularity conditions such as i.i.d. sampling need to be satisfied for the asymptotic arguments to hold.4 Overall, apart from subsampling and honesty, the above conditions are not particularly binding and do not fundamentally deviate from the standard regression Random Forest. Under the above assumptions, the Random Forest predictions can then be shown to be (pointwise) asymptotically Gaussian and unbiased. We use this result to provide an inference procedure for the marginal effects of the Ordered Forest discussed in Sect. 4.3.
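To illustrate the honesty condition, the following sketch separates split placement from leaf-mean estimation using two disjoint sample halves. It is a stylized, assumption-laden illustration built on scikit-learn (the helper function and its interface are our own), not the double-sample tree algorithm of Wager and Athey (2018), which performs the split within each subsample used to grow each tree.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 5))
y = np.sin(2 * X[:, 0]) + rng.normal(size=1000)

# honesty: disjoint halves for placing splits vs. estimating leaf means
half = len(X) // 2
X_split, y_split = X[:half], y[:half]   # grows the trees
X_est, y_est = X[half:], y[half:]       # only fills the leaves

forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=5,
                               random_state=0).fit(X_split, y_split)

def honest_predict(forest, X_est, y_est, x_new):
    """Average over trees of leaf means recomputed on the estimation half."""
    est_leaves = forest.apply(X_est)
    new_leaves = forest.apply(x_new)
    preds = np.zeros((len(x_new), forest.n_estimators))
    for b in range(forest.n_estimators):
        for j in range(len(x_new)):
            mask = est_leaves[:, b] == new_leaves[j, b]
            # fall back to the non-honest leaf mean if the leaf is empty
            preds[j, b] = (y_est[mask].mean() if mask.any() else
                           forest.estimators_[b].predict(x_new[j:j + 1])[0])
    return preds.mean(axis=1)

print(honest_predict(forest, X_est, y_est, X[:3]))
```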
4 Ordered Forest Estimator
The general idea of the Ordered Forest estimator is to provide a flexible alternative for estimation of ordered choice models that can deal with a large-dimensional covariate space. As such, the main goal is the estimation of conditional ordered choice probabilities, i.e. \(P[Y_i=m \mid X_i=x]\), as well as marginal effects, i.e. the changes in the estimated probabilities in association with changes in covariates. Correspondingly, the variability of the estimated effects is of interest and therefore a method for conducting statistical inference is provided as well. The latter two features go beyond the traditional machine learning estimators which focus solely on the prediction exercise, and complement the prediction with the same econometric output as the traditional parametric estimators.
4.1 Conditional choice probabilities
The main idea of the estimation of the ordered choice probabilities by a Random Forest algorithm lies in the estimation of cumulative, i.e. nested probabilities based on binary indicators. Such transformations of an ordered model into multiple binary models have been previously proposed in the context of generalized linear models (e.g. Fahrmeir et al. 1994) and ordinal classification (e.g. Kwon et al. 1997; Frank and Hall 2001). This approach is universal as in principle any estimator of a conditional mean could be used for the prediction of the choice probabilities. However, we adapt this approach by using a specific version of the Random Forest algorithm that further enables not only the estimation of the conditional choice probabilities, but also the estimation of marginal effects and the inference thereof.
As such, for an i.i.d. random sample of size \(N\) \((i=1,...,N)\), consider an ordered outcome variable \(Y_i \in \{1,...,M \}\) with ordered classes m. Then the binary indicators are given as \(Y_{m,i}=\textbf{1}(Y_i \le m)\) for outcome classes \(m=1,...,M-1\). First, the ordered model is transformed into multiple overlapping binary models which are estimated by Random Forests, yielding the predictions for the cumulative probabilities, i.e. \(\hat{Y}_{m,i}=\hat{P}[Y_{m,i}=1 \mid X_i=x]\). Second, the estimated cumulative probabilities are differenced to isolate the respective class probabilities \(P_{m,i}=P[Y_i=m \mid X_i=x]\). Hence, the estimate for the conditional probability of the m-th ordered class is given by subtracting two adjacent cumulative probabilities as \(\hat{P}_{m,i}=\hat{Y}_{m,i}-\hat{Y}_{m-1,i}\).
Given that the building block of the above procedure is the estimation of the conditional probabilities \(P[Y_{m,i}=1 \mid X_i=x]\), we apply the Regression Random Forest to directly estimate these probabilities, as opposed to a Classification Random Forest. In the Regression Random Forest, this is achieved by averaging the binary outcomes in the leaves of the trees, which subsequently get averaged across the trees, resulting in valid probabilities bounded between 0 and 1 for each binary outcome. In the alternative case of a Classification Random Forest, the predictions are not directly the probabilities, but rather the predicted classes obtained by majority voting in the leaves of the trees and subsequently majority voting across the trees (Hastie et al. 2009). A valid probability prediction can be obtained only as a by-product by averaging the class predictions across the trees instead of majority voting. As such, we are interested in minimizing the squared error between the observed outcome and the estimated probability, as is the case for the Regression Random Forest, as opposed to minimizing the misclassification error, as in the case of the Classification Random Forest. Furthermore, the theoretical guarantees of asymptotic unbiasedness of predictions, asymptotic normality and consistency, which are crucial for the inference on marginal effects, are applicable exclusively to the Regression Random Forest, as pointed out by Wager and Athey (2018).
Formally, the proposed estimation procedure can be described as follows:
1.
Create \(M-1\) binary indicator variables as
$$\begin{aligned} Y_{m,i}=\textbf{1}(Y_i \le m) \quad \text { for } \quad m=1,...,M-1, \end{aligned}$$
(4.1)
where m is known and given by the definition of the dependent variable.
2.
Estimate a regression Random Forest for each of the \(M-1\) indicators as
$$\begin{aligned} \hat{E}[Y_{m,i} \mid X_i=x]=\sum ^{N}_{i=1}w_{m,i}(x)Y_{m,i} , \end{aligned}$$
(4.2)
where the forest weights are defined as \(w_{m,i}(x)=\frac{1}{B}\sum ^B_{b=1}w_{m,b,i}(x)\) with tree weights given by \(w_{m,b,i}(x)=\frac{\textbf{1}(\{X_i \in L_{b,m}(x) \})}{\mid \{ i:X_i \in L_{b,m}(x) \} \mid }\) with leaves \(L_{b,m}(x)\) for a total of B trees.
3.
Obtain forest predictions for each of the \(M-1\) indicators as
$$\begin{aligned} \hat{Y}_{m,i}=\hat{P}[Y_{m,i}=1 \mid X_i=x] \quad \text { for } \quad m=1,...,M-1 . \end{aligned}$$
(4.3)
4.
Compute the conditional choice probabilities as differences of adjacent cumulative predictions, together with the corresponding boundary, truncation and normalization steps:
$$\begin{aligned} \hat{P}_{m,i}&= \hat{Y}_{m,i}-\hat{Y}_{m-1,i} \quad \text { for } \quad m=2,...,M-1 , \end{aligned}$$
(4.4)
$$\begin{aligned} \hat{P}_{M,i}&= 1-\hat{Y}_{M-1,i} \quad \text { and } \quad \hat{P}_{1,i}=\hat{Y}_{1,i} , \end{aligned}$$
(4.5)
$$\begin{aligned} \tilde{P}_{m,i}&= \max (\hat{P}_{m,i},\, 0) , \end{aligned}$$
(4.6)
$$\begin{aligned} \hat{P}_{m,i}&= \tilde{P}_{m,i} \bigg / \sum ^{M}_{m=1}\tilde{P}_{m,i} , \end{aligned}$$
(4.7)
where Eq. (4.4) makes use of the cumulative (nested) probability feature. As such, the predicted values of two subsequent binary indicator variables \(Y_{m,i}\) are subtracted from each other to isolate the probability of the higher order class. In Eq. (4.5), the first part is given by construction, as it follows from the indicator function (4.1) that all values of \(Y_i\) fulfil the condition for \(m=M\) and from the fact that cumulative probabilities must add up to 1. The second part defines the probability of the lowest value of the ordered outcome variable. This follows directly from the Random Forest estimation, as the created indicator variable \(Y_{1,i}\) describes the very lowest value of the ordered outcome classes and as such, no modification of its predicted value is necessary to obtain a valid probability prediction. Equation (4.6) ensures that the computed probabilities from (4.4) do not become negative. This might occasionally happen, especially if the respective outcome classes comprise very few observations.5 This issue is well-known also from the Generalized Ordered Logit model where the parallel regression assumption is relaxed (see McCullagh and Nelder 1989, p. 155). However, even though it is possible in theory, growing honest trees and increasing the sample size seems to largely prevent this from happening in practice. Lastly, in case negative predictions occur and are thus set to zero, (4.7) defines a normalization step to ensure that all class probabilities sum up to 1. Notice that such an approach requires estimation of \(M-1\) forests in the training data, which might appear to be computationally expensive. However, given that most empirical problems involve a rather limited number of outcome classes (usually not exceeding 10 distinct classes) and the relatively fast estimation of a standard regression forest,6 the procedure proposed here remains computationally tractable (see Tables 29 and 30 in Appendix B.4 for a comparison with competing methods).
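The four steps above translate almost line by line into code. The following sketch uses plain scikit-learn regression forests as a stand-in for the honest forests described in Sect. 3, so it should be read as an illustration of the probability logic in (4.1)-(4.7) rather than as the orf implementation; all names are our own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def ordered_forest_probs(X_train, y_train, X_test, **forest_args):
    """Sketch of the Ordered Forest choice probabilities.

    y_train takes values in {1, ..., M}; returns an (n_test, M) matrix of
    class probabilities. Plain regression forests stand in for the honest
    forests that the paper requires for valid inference.
    """
    M = int(y_train.max())
    # steps 1 and 2: one regression forest per cumulative indicator 1(Y <= m)
    cum = [RandomForestRegressor(**forest_args)
           .fit(X_train, (y_train <= m).astype(float))
           .predict(X_test) for m in range(1, M)]
    # step 3: cumulative predictions, with P[Y <= M] = 1 by construction
    cum = np.column_stack(cum + [np.ones(len(X_test))])
    # step 4: difference adjacent cumulative predictions (4.4)-(4.5),
    # truncate negative values at zero (4.6) and renormalize (4.7)
    probs = np.clip(np.diff(cum, axis=1, prepend=0.0), 0.0, None)
    return probs / probs.sum(axis=1, keepdims=True)

# toy usage with M = 3 ordered classes
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 4))
y = 1 + (X[:, 0] > -0.5) + (X[:, 0] > 0.5)  # ordered classes 1, 2, 3
p = ordered_forest_probs(X[:400], y[:400], X[400:],
                         n_estimators=200, min_samples_leaf=5)
print(p[:3], p[:3].sum(axis=1))  # each row sums to 1
```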
4.2 Marginal effects
After estimating the conditional ordered choice probabilities, it is of interest to investigate how the estimated probabilities are associated with covariates, i.e. how changes in the covariates translate into changes in the probabilities. Typical measures for such relationships in standard nonlinear econometrics are the marginal, or partial, effects. Thus, for nonlinear models, including ordered choice models, two fundamental measures are of common interest, mean marginal effects and marginal effects at the mean of the covariates.7 These quantities are feasible also in the case of the Ordered Forest estimator. Due to the character of the ordered choice model, the marginal effects on the probabilities of all values of the ordered outcome classes are estimated, i.e. \(P[Y_i=m \mid X_i=x]\). In the following, let us define the marginal effect for an element \(x^k\) of \(X_i\) as follows:
$$\begin{aligned} ME_i^{k,m}(x)=\frac{\partial P[Y_i=m \mid X_i^k=x^k, X_i^{-k}=x^{-k}]}{\partial x^k} , \end{aligned}$$
(4.8)
with \(X_i^k\) and \(X_i^{-k}\) denoting the elements of \(X_i\) with and without the k-th element, respectively.8 Next, let us define the marginal effect for categorical variables as a discrete change in the following way:
$$\begin{aligned} ME_i^{k,m}(x)=P\big [Y_i=m \mid X_i^k=\big \lceil {x^k}\big \rceil , X_i^{-k}=x^{-k}\big ]-P\big [Y_i=m \mid X_i^k=\big \lfloor {x^k}\big \rfloor , X_i^{-k}=x^{-k}\big ] , \end{aligned}$$
(4.9)
where \(\big \lceil {\cdot }\big \rceil \) and \(\left\lfloor {\cdot }\right\rfloor \) denote upper and lower integer values, respectively, such that a difference of one unit is respected. Notice that in the case of a binary variable this leads to the respective probabilities being evaluated at \(\big \lceil {x^k}\big \rceil = 1\) and \(\left\lfloor {x^k}\right\rfloor =0\), as is usual for binary variables. Based on these definitions, the desired quantities of interest are obtained as follows: the marginal effect at the mean results from evaluating \(ME_i^{k,m}(x)\) at the population mean of \(X_i\), for which the sample mean is a natural estimator, while the mean marginal effect is obtained by taking sample averages of \(ME_i^{k,m}(x)\), i.e. \(\frac{1}{N}\sum ^{N}_{i=1}ME_i^{k,m}(x)\). Additionally, it is possible to evaluate the marginal effect for all values in the support of \(X_i\) to visualize its estimated functional form.
Having formally defined the desired marginal effects, the next issue is the estimation of these effects. For binary and categorical covariates \(X^k\), this appears straightforward as the estimated Ordered Forest model provides predicted values for all probabilities at all values \(x^k\). As such, the estimate \(\hat{ME}_i^{k,m}(x)\) of the marginal effects defined in Eq. (4.9) remains a difference of the two conditional probabilities estimated by the Ordered Forest. However, it is less obvious for continuous variables, where derivatives are needed. As the estimates of the choice probabilities are averaged leaf means, the marginal effect is not explicit and not differentiable. In the nonparametric literature, Stoker (1996) and Powell and Stoker (1996), among others, are directly concerned with estimating average derivatives. However, these methods lack convenience of estimation and have thus not been widely adopted by empirical researchers.9 Therefore, we approximate the derivative by a discrete analogue based on the definition of a derivative as follows:
$$\begin{aligned} \hat{ME}_i^{k,m}(x)=\frac{\hat{P}\big [Y_i=m \mid X_i^k=x^{kU}, X_i^{-k}=x^{-k}\big ]-\hat{P}\big [Y_i=m \mid X_i^k=x^{kL}, X_i^{-k}=x^{-k}\big ]}{x^{kU}-x^{kL}} , \end{aligned}$$
(4.10)
with \(x^{kU},x^{kL}\) defined as \(x^{kU}=x^{k}+h \cdot \sigma (x^{k})\) and \(x^{kL}=x^{k}-h \cdot \sigma (x^{k})\), while ensuring that the support of \(x^k\) is respected, and where \(\sigma (\cdot )\) denotes the standard deviation and h controls the window size for evaluating the marginal effect. We recommend setting \(h=0.1\) to achieve accurate evaluation at the margin.10 Hence, the approximation targets the marginal change in the value of the covariate \(X_i^k\). Notice that such an estimation of marginal effects is a much more demanding exercise than solely predicting the choice probabilities. Therefore, it is expected that considerably more subsampling iterations are needed for a good performance. Note that this approach to estimating marginal effects is applicable to any estimator that estimates conditional ordered choice probabilities and is not restricted only to the Ordered Forest.
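As an illustration of (4.10), the sketch below computes mean marginal effects for a continuous covariate by shifting it up and down by \(h=0.1\) standard deviations while clipping to the observed support. It assumes a generic probability predictor such as the hypothetical ordered_forest_probs from the sketch above; the function and its interface are our own illustrative constructs.

```python
import numpy as np

def mean_marginal_effects(prob_fn, X, k, h=0.1):
    """Mean marginal effects of continuous covariate k on all M class
    probabilities via the two-sided discrete approximation in (4.10).

    prob_fn maps an (n, p) covariate matrix to an (n, M) matrix of class
    probabilities, e.g. a fitted Ordered Forest as sketched above.
    """
    shift = h * X[:, k].std()
    lo, hi = X[:, k].min(), X[:, k].max()
    X_up, X_down = X.copy(), X.copy()
    # shift x^k by h standard deviations while respecting its support
    X_up[:, k] = np.clip(X[:, k] + shift, lo, hi)
    X_down[:, k] = np.clip(X[:, k] - shift, lo, hi)
    scale = (X_up[:, k] - X_down[:, k])[:, None]
    effects = (prob_fn(X_up) - prob_fn(X_down)) / scale  # (n, M) effects
    return effects.mean(axis=0)  # average over observations

# marginal effects at the mean: evaluate at the sample mean instead, e.g.
# me_at_mean = mean_marginal_effects(prob_fn, X.mean(0, keepdims=True), k)
```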
4.3 Inference
The building blocks of the Ordered Forest are the estimates of conditional probabilities such as \(P[Y_{m,i}=1\mid X_i=x]\). Particularly, the Ordered Forest makes use of linear combinations of such probability estimates made by the Random Forest for both the conditional ordered choice probabilities as well as for the corresponding marginal effects. Therefore, for conducting inference on these quantities, it is sufficient to ensure that the underlying estimates of the conditional probabilities are asymptotically normally distributed. Here, we combine the results of Wager and Athey (2018) and Lechner (2018). First, we use the asymptotic results of Wager and Athey (2018), who show that the consistency and normality of Random Forest predictions hold also when dealing with binary outcomes and thus also hold for probability predictions of the type \(P[Y_{m,i}=1\mid X_i=x]\).11 Hence, the final Ordered Forest estimates for the conditional ordered choice probabilities and the marginal effects, based on a forest algorithm respecting the conditions discussed in Sect. 3, inherit the asymptotic properties of consistency and normality. Second, we adapt the inference procedure for Random Forests as developed by Lechner (2018) to estimate the variance of the conditional ordered choice probabilities and the corresponding marginal effects.
The method proposed here for conducting approximate inference on the estimated marginal effects utilizes the weight-based representation of Random Forest predictions and adapts the weight-based inference proposed by Lechner (2018) to the case of the Ordered Forest estimator.12 The main condition for conducting weight-based inference is to ensure that the weights and the outcomes are independent. In general, the weights are functions of the covariates for the observation i and the training data. In order to estimate the variance of the marginal effects successfully, the conditioning set of the weights must be reduced. Therefore, if the observation i is not part of the training data and there is i.i.d. sampling, then the weights depend only on the observation i and are furthermore independent of the outcomes (for a formal analysis, see Lechner 2018). This is achieved through sample splitting, where one half of the sample is used to build the forest, and thus to determine the weights, and the other half to estimate the effects using the respective outcomes. Notice that this condition goes beyond honesty as defined in Wager and Athey (2018), as it requires not only estimating honest trees but estimating an honest forest as a whole. The reason for this is the fact that the weights are not based on the estimated trees, but on the estimated forest. Therefore, to ensure independence between the weights and outcomes, the honesty condition must hold with respect to the forest, and it is not sufficient to build honest trees only. This comes, however, at the expense of the efficiency of the estimator as less data are effectively used. Nevertheless, the simulation evidence in Lechner (2018) suggests that this efficiency loss is small, if present at all.13
Since the Ordered Forest estimator is based on differences of Random Forest predictions for adjacent outcome categories, the covariance term also enters the variance formula of the final estimator,14 as opposed to the Modified Causal Forests developed in Lechner (2018). Further, the estimation of marginal effects is based on differences of single Ordered Forest predictions, which also needs to be taken into account.15 Let us first rewrite the marginal effects in terms of weighted means of the outcomes as follows:
$$\begin{aligned} \hat{ME}_i^{k,m}(x)=\frac{\sum ^{N}_{i=1}\tilde{w}_{i,m}(x^{kU}x^{kL})Y_{m,i}-\sum ^{N}_{i=1}\tilde{w}_{i,m-1}(x^{kU}x^{kL})Y_{m-1,i}}{x^{kU}-x^{kL}} , \end{aligned}$$
(4.11)
where \(\tilde{w}_{i,m}(x^{kU}x^{kL})=\hat{w}_{i,m}(x^{kU})-\hat{w}_{i,m}(x^{kL})\), and \(\tilde{w}_{i,m-1}(x^{kU}x^{kL})=\hat{w}_{i,m-1}(x^{kU})-\hat{w}_{i,m-1}(x^{kL})\) are the new weights defining the marginal effect. As such, the quantity of interest for inference becomes the variance of the above expression, given as:
$$\begin{aligned} Var\big (\hat{ME}_i^{k,m}(x)\big )&= \frac{1}{(x^{kU}-x^{kL})^2}\bigg [Var\Big (\sum ^{N}_{i=1}\tilde{w}_{i,m}(x^{kU}x^{kL})Y_{m,i}\Big ) + Var\Big (\sum ^{N}_{i=1}\tilde{w}_{i,m-1}(x^{kU}x^{kL})Y_{m-1,i}\Big ) \nonumber \\&\quad - 2\,Cov\Big (\sum ^{N}_{i=1}\tilde{w}_{i,m}(x^{kU}x^{kL})Y_{m,i},\, \sum ^{N}_{i=1}\tilde{w}_{i,m-1}(x^{kU}x^{kL})Y_{m-1,i}\Big )\bigg ] , \end{aligned}$$
(4.12)
where for the marginal effects at the mean of the covariates the weights \(\tilde{w}_{i,m}(x^{kU}x^{kL})\) and the scaling factor \(1/(x^{kU}-x^{kL})^2\) are evaluated at the respective sample means, whereas for the mean marginal effects the average of the weights \(\frac{1}{N} \sum ^{N}_{i=1} \tilde{w}_{i,m}(x^{kU}x^{kL})\) and of the scaling factor \(1/\big (\frac{1}{N} \sum ^{N}_{i=1}(x^{kU}-x^{kL})\big )^2\) is used. Notice also that the scaling factor drops out in the case of categorical covariates. According to the simulation study in Lechner (2018), the weight-based inference in the case of the Modified Causal Forests tends to be rather conservative for the individual effects and rather accurate for aggregate effects. The results from the empirical application conducted here resemble this pattern, where inference for the marginal effects at the mean of the covariates is more conservative in comparison to inference for the mean marginal effects (see also Appendix C.2 for a comparison). Note that unlike the general approach to estimating marginal effects, the weight-based inference for these effects is uniquely tied to a class of weight-based, asymptotically normally distributed estimators centred at the true value. For forest-based estimators, this implies the necessary condition of honesty, as in the Ordered Forest proposed here.
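To illustrate why the covariance term matters, the following stylized sketch treats the per-observation contributions \(\tilde{w}_{i,m}(x^{kU}x^{kL})Y_{m,i}\) as summands and plugs sample moments into \(Var(A-B)=Var(A)+Var(B)-2\,Cov(A,B)\). It is our own simplification and abstracts from the exact weight-based variance estimator of Lechner (2018); all names are hypothetical.

```python
import numpy as np

def me_variance_sketch(w_m, w_m1, y_m, y_m1, dx):
    """Stylized variance of a marginal effect that is a scaled difference of
    two weighted sums, keeping the covariance term of the text explicit."""
    a = w_m * y_m     # contributions to sum_i w~_{i,m} * Y_{m,i}
    b = w_m1 * y_m1   # contributions to sum_i w~_{i,m-1} * Y_{m-1,i}
    n = len(a)
    var_a = n * a.var(ddof=1)                # Var of the first weighted sum
    var_b = n * b.var(ddof=1)                # Var of the second weighted sum
    cov_ab = n * np.cov(a, b, ddof=1)[0, 1]  # their covariance
    # Var(A - B) = Var(A) + Var(B) - 2 Cov(A, B), scaled by 1/(dx)^2
    return (var_a + var_b - 2 * cov_ab) / dx**2
```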
5 Monte Carlo simulation
In order to investigate the finite sample performance of the proposed Ordered Forest estimator, we perform a Monte Carlo simulation study comparing competing estimators for ordered choice models based on the Random Forest algorithm. As a parametric benchmark, we take the ordered logistic regression. The considered models are specifically the following: (i) Ordered Logit (McCullagh 1980), (ii) naive Ordinal Forest (Hornung 2019a), (iii) Ordinal Forest (Hornung 2019a), (iv) Conditional Forest (Hothorn et al. 2006) and (v) Ordered Forest as developed in Sect. 4. Within the simulation study the Ordered Forest estimator is analysed more closely to study its finite sample performance depending on the particular forest building schemes and on the way the ordering information is taken into account. Regarding the former, we study the Ordered Forest based on the standard Random Forest as in Breiman (2001), i.e. with bootstrapping and without honesty, as well as based on the adjusted Random Forest as in Wager and Athey (2018), i.e. with subsampling and with honesty. Regarding the latter, we study an alternative approach for estimating the conditional choice probabilities which could be labelled as a 'Multinomial' Forest. In that case, the ordering information is not taken into account and the probabilities of each category are estimated directly. The details of this approach are provided in Appendix A.1. Given this, the Ordered Forest estimator should perform better than the Multinomial Forest in terms of prediction accuracy thanks to the incorporation of the additional information from the ordering of the outcome classes. Within the simulation we investigate the accuracy of predictions for the conditional ordered choice probabilities. Given the definition of marginal effects as (scaled) differences in predictions for different values of the covariates, the simulation results in turn provide supportive evidence on the estimation of marginal effects as well.
Table 1
General settings of the simulation

Monte Carlo:
- observations in training set: 200 (800)
- observations in testing set: 10,000
- replications: 100
- covariates with effect: 15
- trees in a forest: 1,000
- randomly chosen covariates: \(\sqrt{p}\)
- minimum leaf size\(^{\text {a}}\): 5

\(^{\text {a}}\) Due to the conceptual differences of the Conditional Forests, an alternative stopping rule ensuring growing deep trees is chosen. See details in Appendix B.4
General settings regarding the sample size, the number of replications, as well as forest-specific tuning parameters for the Monte Carlo simulation are depicted in Table 1. Furthermore, a detailed description of the software implementation of the respective estimators as well as the software specific tuning parameters are discussed in Appendix B.4.
5.1 Data generating process
In terms of the data generating process, we build upon an Ordered Logit model as defined in (2.1) and (2.2). As such, we simulate the underlying continuous latent variable \(Y_i^*\) as a linear function of regressors \(X_i\), while drawing the error term \(u_i\) from the logistic distribution. Then, the continuous outcome \(Y_i^*\) is discretized into an ordered categorical outcome \(Y_i\) based on the threshold parameters \(\alpha _m\).17 Furthermore, the intercept term is fixed to zero, i.e. \(\beta _0=0\), and thus, the thresholds are relative to this value of the intercept. As a result, such a DGP captures the probability of the latent variable \(Y_i^*\) falling into a particular class given the location defined by the deterministic component of the model together with its stochastic component (Carsey and Harden 2013).
In simulations of the data generating process, different numbers of possible discrete ordered classes are considered, particularly \(M = \{3,6,9\}\), which corresponds to the simulation set-up used in Janitza et al. (2016) and Hornung (2019a). Further, both equal class widths, i.e. equally spaced threshold parameters \(\alpha _m\), as well as randomly spaced thresholds, while still preserving the monotonicity of the discrete outcome \(Y_i\), are considered. For the latter, the threshold quantiles are drawn from the uniform distribution, i.e. \(\alpha _m^q \sim U(0,1)\), and ordered afterwards. For the former, the threshold quantiles are equally spaced between 0 and 1 depending on the number of classes. The \(\beta \) coefficients are specified as having fixed coefficient size, namely \(\beta _1,...,\beta _5 = 1\), \(\beta _6,...,\beta _{10} = 0.75\) and \(\beta _{11},...,\beta _{15} = 0.5\), as is also the case in Janitza et al. (2016) and Hornung (2019a). Moreover, an option for nonlinear effects is introduced, too. As such, the covariates do not enter the functional form linearly, but are given by a sine function \(sin(2X_i)\) as, for example, in Lin et al. (2014), which is hard to model as opposed to other nonlinearities such as polynomials or interactions. The set of covariates \(X_i\) is drawn from the multivariate normal distribution with zero mean vector and a pre-specified variance-covariance matrix \(\Sigma \), i.e. \(X_i \sim {\mathcal {N}}(0,\Sigma )\), where \(\Sigma \) is specified either as an identity matrix, implying zero correlation between regressors, or with a specific correlation structure between regressors,18 inspired by the correlation structure from the simulations in Janitza et al. (2016) and Hornung (2019a). Further, an option to include additional variables with zero effect is implemented as well. As such, another 15 covariates are added to the covariate space with \(\beta _{16}=...=\beta _{30}=0\), which are again drawn from the normal distribution with zero mean and a pre-specified variance-covariance matrix \(\Sigma ^0\), i.e. \(X_{i}^0 \sim {\mathcal {N}}(0,\Sigma ^{0})\), where \(\Sigma ^{0}\) defines a declining correlation structure among the noise covariates.
As the performance of the Ordered Forest estimator in high-dimensional settings is of particular interest, due to the lack of theoretical results in such settings, we include an option for additionally enlarging the covariate space with 1000 zero effect covariates according to the same DGP as above, effectively creating a setting with \(p>>N\). In the high-dimensional case the Ordered Logit is excluded from the simulations for obvious reasons. Overall, considering all the possible combinations for specifying the DGP, we end up with 72 different DGPs.19 For all of them we simulate a training dataset of size \(N=200\) and a testing dataset of size \(N=10'000\) for evaluating the prediction performance of the considered methods. We simulate the large testing set for three main reasons. First, the large testing set enables us to reduce the prediction noise and thus provides a more reliable measure of the average out-of-sample performance of the estimators. Second, the large testing set also helps to reduce the simulation noise and thus to obtain more precise estimates of the performance measures. Third, we choose the large testing set to ensure further comparability with the simulation studies performed by Janitza et al. (2016) and Hornung (2019a). Note that such a large testing set is also a common choice in many other simulation studies (see, for example, Jacob 2020; or Knaus et al. 2021). Further, we focus more closely on the simulation designs corresponding to the least and the most complex DGPs, for which we also simulate a training set of size \(N=800\). The former DGP (labelled as simple DGP henceforth) corresponds exactly to an Ordered Logit model as in (2.1) with equal class widths, uncorrelated covariates with linear effects and without any additional zero effect variables. The latter DGP (labelled as complex DGP henceforth) features random class widths, correlated covariates with nonlinear effects and additional zero effect variables. For each replication, we estimate the model on the training set and evaluate the predictions on the testing set, for all tested methods.
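A condensed sketch of the simple DGP reads as follows. It is our own illustrative code, with thresholds set at equally spaced quantiles of the simulated latent outcome to mimic the equal class widths option; the exact threshold construction of the paper may differ.

```python
import numpy as np

def simulate_ordered_dgp(n, n_classes=3, seed=0):
    """Simple DGP: linear index of 15 covariates, logistic error and
    equally spaced threshold quantiles of the latent outcome."""
    rng = np.random.default_rng(seed)
    beta = np.array([1.0] * 5 + [0.75] * 5 + [0.5] * 5)
    X = rng.normal(size=(n, 15))                # uncorrelated covariates
    y_latent = X @ beta + rng.logistic(size=n)  # latent outcome Y*
    # thresholds at equally spaced quantiles between 0 and 1
    qs = np.linspace(0, 1, n_classes + 1)[1:-1]
    alphas = np.quantile(y_latent, qs)
    y = 1 + np.searchsorted(alphas, y_latent)   # ordered classes 1, ..., M
    return X, y

X_train, y_train = simulate_ordered_dgp(200)
X_test, y_test = simulate_ordered_dgp(10_000, seed=1)
```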
5.2 Competing methods
We consider the ordered logistic regression as a parametric benchmark method in our simulations, as the most widely used method by practitioners when dealing with ordered categorical outcome variables.20 In addition, we compare the Ordered Forest to other flexible Random Forest methods adapted towards ordinal regression: the Conditional Forest (Hothorn et al. 2006) and the Ordinal Forest (Hornung 2019a). In the case of the Conditional Forest, the difference to standard regression forests lies in a different splitting criterion using a test statistic, where the conditional distribution at each split is based on permutation tests (for details see Strasser and Weber 1999; and Hothorn et al. 2006). Their proposed ordinal regression variant assumes an underlying latent continuous response \(Y_i^*\), as is the case in standard ordered choice models. Hothorn et al. (2006) define a score vector \(s(m) \in {\mathbb {R}}^M\), with \(m=1,...,M\) observed ordered classes. These scores reflect the distances between the classes. The authors suggest setting the scores as midpoints of the intervals of \(Y_i^*\) which define the classes. As the underlying \(Y_i^*\) is unobserved, such a suggestion results in \(s(m)=m\), and the ordinal regression collapses to a standard forest regression, as pointed out by Janitza et al. (2016).21 However, although the tree-building step coincides, the prediction step differs as the estimates are the choice probabilities calculated as the proportions of the respective outcome classes falling into the same leaf instead of averages of the outcomes. As such, for each leaf within a tree, the prediction is computed for each value of the ordered categorical outcome as its share within the leaf, resulting in probability predictions between 0 and 1. This is in contrast to standard prediction procedures, which would compute an average of all values of the ordered categorical outcome. Nevertheless, after computing the single-tree predictions as the relative frequencies of the ordered outcomes, the forest estimates of the conditional choice probabilities \(\hat{P}[Y_i=m \mid X_i=x]\) are computed by taking the averages of the choice probabilities produced by each tree, i.e. the same aggregation scheme as in a regression forest. Hornung (2019a) points out that setting \(s(m)=m\) implies inherently assuming that the class widths, i.e. the adjacent intervals of the continuous outcome variable \(Y_i^*\) determining the discrete outcome \(Y_i\), are of the same length. This, however, does not have to hold in general and these intervals might not follow any particular pattern.22 In order to address this issue, Hornung (2019a) proposes an Ordinal Forest method, which optimizes these interval widths by maximizing the out-of-bag (OOB) prediction performance of the forests.23 However, in contrast to the approach of Hothorn et al. (2006), the forest algorithm used is based on the forest as developed by Breiman (2001), while the primary target is to predict the ordinal class, and the choice probabilities are obtained as relative frequencies of trees predicting the particular class. As such, each tree predicts the most probable value of the ordered categorical outcome. Thereupon, the forest prediction for the conditional choice probability is computed as the share of trees predicting the particular categorical value of the ordered outcome. This is in contrast to the estimation scheme of Hothorn et al. (2006), where the probability prediction step occurs at the level of the trees instead of at the level of the forest, as is the case here. Hornung (2019a) shows better prediction performance of such Ordinal Forests, which optimize the class widths of \(Y_i^*\), in comparison to the Conditional Forests. Without the optimization step, the author denotes such a forest as the naive Ordinal Forest.24
Although both of these methods demonstrate good predictive performance, neither of them provides theoretical guarantees with regard to the bias and distribution of the predictions. This is due to the fact that neither method grows the trees in the forest respecting the conditions laid out in Wager and Athey (2018), most notably the honesty condition, which has been shown to be crucial to ensure the asymptotic unbiasedness and asymptotic normality of the forest estimator (Wager and Athey 2018). In other words, these methods do not use separate sets of observations to grow the tree, i.e. to place the splits, and to make the predictions in the leaves of the tree. Further, it is worth mentioning that in practice both methods suffer from considerable computational costs. For a comparison of the computation time with the Conditional Forest as well as the Ordinal Forest, see Tables 29 and 30 in Appendix B.4.
5.3 Evaluation measures
In order to properly evaluate the prediction performance, we use two measures of accuracy, namely the mean-squared error (MSE) and the ranked probability score (RPS). The former evaluates the error of the estimated conditional choice probabilities as a squared difference from the true values of the conditional choice probabilities. Given our simulation design, we know these true values and hence, we can define the Monte Carlo average MSE as:
$$\begin{aligned} \text {AMSE} =\frac{1}{R}\sum ^{R}_{j=1}\frac{1}{N}\sum ^N_{i=1}\frac{1}{M}\sum ^M_{m=1}\bigg (P[Y_{i,j}= m \mid X_{i,j}=x]-\hat{P}[Y_{i,j}= m \mid X_{i,j}=x]\bigg )^2 , \end{aligned}$$
where j refers to the j-th simulation replication, with R being the total number of replications. The second measure, the RPS as developed by Epstein (1969), is arguably the preferred measure for the evaluation of probability forecasts for ordered outcomes as it takes the ordering information into account (see Gneiting and Raftery 2007; and Constantinou and Fenton 2012). The Monte Carlo average RPS can be defined as follows:
$$\begin{aligned} \text {ARPS} =\frac{1}{R}\sum ^{R}_{j=1}\frac{1}{N}\sum ^N_{i=1}\frac{1}{M-1}\sum ^M_{m=1}\bigg (P[Y_{i,j}\le m \mid X_{i,j}=x]-\hat{P}[Y_{i,j}\le m \mid X_{i,j}=x]\bigg )^2 , \end{aligned}$$
where, in contrast to the MSE, the difference between the cumulative choice probabilities is measured. The RPS can be seen as a generalization of the Brier Score (Brier 1950) for multiple, ordered outcomes. As such, it measures the discrepancy between the predicted cumulative distribution function and the true one. Nevertheless, although the ordering information is taken into account, the relative distance between the classes is not reflected, as pointed out by Janitza et al. (2016).
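For completeness, the two accuracy measures can be sketched as below for a single replication; averaging these quantities over replications then yields the ARPS and AMSE. The helper names are our own, and the \(1/M\) scaling in the MSE follows the AMSE formula above.

```python
import numpy as np

def rps(p_true, p_hat):
    """Ranked probability score between (n, M) matrices of true and
    predicted class probabilities, averaged over the n observations."""
    cum_true = np.cumsum(p_true, axis=1)  # P[Y <= m | X], m = 1, ..., M
    cum_hat = np.cumsum(p_hat, axis=1)
    M = p_true.shape[1]
    # the m = M term is always zero since both cumulatives end at 1
    return ((cum_true - cum_hat) ** 2).sum(axis=1).mean() / (M - 1)

def mse(p_true, p_hat):
    """Mean-squared error of the class probabilities themselves."""
    return ((p_true - p_hat) ** 2).sum(axis=1).mean() / p_true.shape[1]
```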
5.4 Simulation results
For the sake of brevity, here we focus mainly on the simulation results obtained for the simple and for the complex DGP, while the results for all 72 DGPs are provided in Appendix B.2. Figures 1 and 2 summarize the results for the low-dimensional setting for the simple and the complex DGP, respectively. Similarly, Figs. 3 and 4 present the results for the simple and the complex DGP for the high-dimensional setting. The upper panels of the figures show the ARPS, the preferred accuracy measure, whereas the lower panels show the AMSE as a complementary measure. Within the figures the transparent boxplots in the background show the results for the smaller sample size along with the bold boxplots in the foreground showing the results for the bigger sample size. From left to right the figures present the results for 3, 6 and 9 outcome classes, respectively. The figures compare the prediction accuracy of the Ordered Logit, naive Ordinal Forest, Ordinal Forest, Conditional Forest, Ordered Forest and the Multinomial Forest, where the asterisk \((^*)\) denotes the honest version of the last two forests considered. Further tables with more detailed results and statistical tests for mean differences in the prediction errors are listed in Appendix B.1.
Fig. 1
Simulation Results: Simple DGP & Low Dimension. Note: Figure summarizes the prediction accuracy results based on 100 simulation replications. The upper panel contains the ARPS and the lower panel contains the AMSE. The boxplots show the median and the interquartile range of the respective measure. The transparent boxplots denote the results for the small sample size, while the bold boxplots denote the results for the big sample size. From left to right the results for 3, 6 and 9 outcome classes are displayed
In the low-dimensional setting with the simple DGP, it is expected that the ordered logistic regression should perform best in terms of both the AMSE as well as the ARPS. Indeed, we do observe this result in Fig. 1, as the Ordered Logit model performs unanimously best out of the considered models, reaching almost zero prediction error. Among the flexible forest-based estimators, the proposed Ordered Forest belongs to the better performing methods in terms of both accuracy measures. The honest versions of the forests lag behind, which points to the efficiency loss due to the additional sample splitting. Overall, the ranking of the estimators stays stable with regard to the number of outcome categories. A further pattern common to all estimators is the lower prediction error and increased precision with growing sample size.
Fig. 2
Simulation Results: Complex DGP & Low Dimension. Note: Figure summarizes the prediction accuracy results based on 100 simulation replications. The upper panel contains the ARPS and the lower panel contains the AMSE. The boxplots show the median and the interquartile range of the respective measure. The transparent boxplots denote the results for the small sample size, while the bold boxplots denote the results for the big sample size. From left to right the results for 3, 6 and 9 outcome classes are displayed
In the case of the complex DGP, the performance of the flexible forest-based estimators is expected to be better in comparison to the parametric Ordered Logit. This can be seen in Fig. 2, as the Ordered Logit lags behind the majority of the flexible methods in both accuracy measures. The somewhat higher prediction errors of the naive and the Ordinal Forest compared to the other forest-based methods might be due to their different primary target, which is the ordered classes instead of the ordered probabilities as is the case for the other methods. In this respect the Conditional Forest exhibits considerably good prediction performance. The Ordered Forest outperforms the competing forest-based estimators in terms of the ARPS throughout all outcome class scenarios and also in terms of the AMSE in two scenarios, being outperformed only by the Conditional Forest in the case of 9 outcome classes. Interestingly, the Multinomial Forest performs very well across all scenarios. However, it is consistently worse than the Ordered Forest, with the discrepancy between the two growing as more outcome classes are considered. This points to the value of the ordering information and the ability of the Ordered Forest to utilize it in the estimation. With regard to the sample size, we observe the same pattern as in Fig. 1.
Fig. 3
Simulation Results: Simple DGP & High Dimension. Note: Figure summarizes the prediction accuracy results based on 100 simulation replications. The upper panel contains the ARPS and the lower panel contains the AMSE. The boxplots show the median and the interquartile range of the respective measure. The transparent boxplots denote the results for the small sample size, while the bold boxplots denote the results for the big sample size. From left to right the results for 3, 6 and 9 outcome classes are displayed
Considering the high-dimensional setting for the case of the simple DGP, we see in Fig. 3 that the Ordered Forest slightly lags behind the other methods, except in the scenarios with 3 outcome classes. In comparison, the Conditional Forest performs best in terms of both the ARPS and the AMSE. This is possibly due to the Conditional Forest's unbiased variable selection for covariates without effects, as is the case in this DGP. Also the naive and the Ordinal Forest exhibit better performance compared to the previous simulation designs. However, it should be noted that the overall differences in the magnitude of the prediction errors are much smaller within this simulation design than in the previous designs. Further, a closer look at the ARPS results of the Multinomial Forest clearly shows that in the simple ordered design, ignoring the ordering information harms the predictive performance more, the more outcome classes are considered. Additionally, it is interesting to see that the performance gain due to a bigger sample size appears much smaller for the honest versions of the forests in the high-dimensional setting as opposed to the low-dimensional setting.
Fig. 4
Simulation Results: Complex DGP & High Dimension. Note: Figure summarizes the prediction accuracy results based on 100 simulation replications. The upper panel contains the ARPS and the lower panel contains the AMSE. The boxplots show the median and the interquartile range of the respective measure. The transparent boxplots denote the results for the small sample size, while the bold boxplots denote the results for the big sample size. From left to right the results for 3, 6 and 9 outcome classes are displayed
Lastly, the case of the complex DGP in the high-dimensional setting as in Fig. 4 shows some interesting patterns. In general, all of the methods exhibit good predictive performance, as the loss in prediction accuracy due to the high-dimensional covariate space is small. Additionally, although this is the most complex design, no substantial loss in prediction accuracy can be observed in comparison to the less complex designs. This demonstrates the ability of the Random Forest algorithm as such to effectively cope with highly nonlinear functional forms even in high dimensions. Further, the role of the sample size appears particularly important in this complex design. In contrast to the previous designs, where the prediction accuracy improves by an almost constant amount for all estimators and thus does not change their relative ranking, this no longer holds here. First, some estimators seem to learn faster than others, i.e. to have a faster rate of convergence. As such, in the small sample the Ordered Forest has in some settings higher values of the ARPS as well as the AMSE than the Conditional Forest, but manages to outperform the Conditional Forest in the bigger training sample. This becomes most apparent in the case of 9 outcome classes: here, the median of the ARPS is almost the same for the two methods based on the small training sample, but significantly lower for the Ordered Forest based on the larger training sample.25 Second, for the Ordinal Forest the prediction accuracy even worsens with the bigger training sample, which might hint at possible convergence issues. This might stem from the fact that the estimator comprises multiple distinct optimization and partly nonlinear transformation steps that are tied together, but lacks formal asymptotic arguments to analyse how the estimation errors propagate into the final point estimator. Overall, the Ordered Forest achieves the lowest ARPS as well as AMSE within this design, closely followed by the Conditional and the Multinomial Forest. However, the generally good performance of the Conditional Forest might be due to a different type of stopping criterion as well as the unbiased variable selection.
In addition to the four main simulation designs discussed above, we also inspect all 72 different DGPs to analyse the performance and the sensitivity of the Ordered Forest to the particular features of the simulated DGPs (for details see Appendix B.2). In both the low-dimensional and the high-dimensional setting, the Ordered Forest performs particularly well if nonlinear effects are accompanied by high correlation of the regressors, whether on their own or together with additional noise variables or randomly spaced thresholds. Furthermore, the honest version of the Ordered Forest achieves consistently lower prediction accuracy in both settings. It seems that in small samples the increase in variance due to honesty dominates the reduction in the bias of the estimator. In order to further investigate the impact of the honesty feature in bigger samples as well as the convergence of the Ordered Forest, we quadruple the size of the training set once again and repeat the main simulation for the Ordered Forest and its honest version with \(N=3'200\) observations (see Appendix B.1 for the full results). Firstly, for both versions we observe that with growing sample size the prediction errors decrease and the precision increases. However, the rate of convergence seems to be slower than the parametric rate of \(\sqrt{N}\). Secondly, we observe the same pattern as in the smaller samples, namely slightly lower prediction accuracy for the honest version of the Ordered Forest, which stays roughly constant across all simulation designs. Hence, even in the biggest sample the additional variance dominates the bias reduction. It should be noted, however, that for a prediction exercise honesty is an optional choice, whereas honesty becomes necessary if inference is of interest.
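To illustrate what honesty entails computationally, the following sketch splits the sample in half, learns the tree structures on one half, and recomputes the leaf predictions from the other half. This is only a stylized illustration of the honesty principle in the spirit of Wager and Athey (2018), not the implementation used in the orf package; all names are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

def honest_forest_predictions(X, y, X_test, **rf_kwargs):
    # half of the sample builds the tree structures, the other half
    # is used only to fill the leaves with predictions
    X_bld, X_est, y_bld, y_est = train_test_split(X, y, test_size=0.5, random_state=0)
    rf = RandomForestRegressor(**rf_kwargs).fit(X_bld, y_bld)
    leaves_est = rf.apply(X_est)    # leaf ids, shape (n_est, n_trees)
    leaves_tst = rf.apply(X_test)   # leaf ids, shape (n_test, n_trees)
    preds = np.empty((X_test.shape[0], rf.n_estimators))
    for t in range(rf.n_estimators):
        # honest leaf prediction: mean outcome of the estimation-half
        # observations falling into the same leaf
        means = {leaf: y_est[leaves_est[:, t] == leaf].mean()
                 for leaf in np.unique(leaves_est[:, t])}
        fallback = y_est.mean()     # for leaves no estimation point reaches
        preds[:, t] = [means.get(l, fallback) for l in leaves_tst[:, t]]
    return preds.mean(axis=1)       # average over trees
```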
5.5 Empirical results
In addition to the synthetic simulations above, we explore the performance of the Ordered Forest estimator on real datasets26 previously used in Janitza et al. (2016) and Hornung (2019a). Table 2 summarizes the features of the datasets; the descriptive statistics are provided in Appendix B.3.1. We compare our estimator in terms of prediction accuracy to all the estimators used in the above Monte Carlo simulation.
Table 2
Description of the datasets

Dataset        Sample size   Outcome             Class range                   Covariates
Wine quality   4,893         Quality score       1 (moderate)–6 (high)         11
Mammography    412           Visits history      1 (never)–3 (over year)       5
Nhanes         1,914         Health status       1 (excellent)–5 (poor)        26
Vlbw           218           Physical condition  1 (threatening)–9 (optimal)   10
Support Study  798           Disability degree   1 (none)–5 (fatal)            15
Similarly to Hornung (2019a), we evaluate the prediction accuracy based on repeated cross-validation in order to reduce the dependence of the results on the particular training and test sample splits. Specifically, we perform a 10-fold cross-validation on each dataset, i.e. we randomly split the dataset into 10 equally sized folds and use 9 folds for training the model and 1 fold for validation. This process is repeated such that each fold serves as a validation set exactly once. Lastly, we repeat this whole procedure 10 times and report average accuracy measures. The results of the cross-validation exercise for the ARPS and the AMSE are summarized in Figs. 5 and 6, respectively. As for the simulation results, Appendix B.3 contains more detailed statistics.
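A minimal sketch of this evaluation scheme, assuming a model object with scikit-learn-style fit and predict_proba methods and a user-supplied accuracy function (the names are ours), could read:

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

def repeated_cv(model, X, y, score_fn, n_splits=10, n_repeats=10, seed=42):
    """10 times repeated 10-fold cross-validation: each fold serves as
    the validation set exactly once per repetition, and the accuracy
    measure is averaged over all 100 train/validation splits."""
    rkf = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    scores = []
    for train_idx, val_idx in rkf.split(X):
        model.fit(X[train_idx], y[train_idx])
        probs = model.predict_proba(X[val_idx])
        scores.append(score_fn(probs, y[val_idx]))
    return np.mean(scores)
```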
Fig. 5
Cross-Validation: ARPS. Note: Figure summarizes the prediction accuracy results in terms of the ARPS based on 10 repetitions of 10-fold cross-validation for respective datasets. The boxplots show the median and the interquartile range of the respective measure
The main difference in evaluating the prediction accuracy compared to the simulation study is that we do not observe the underlying ordered class probabilities, but only the realized ordered classes. This affects the computation of the accuracy measures, and it can be expected that the prediction errors are somewhat higher than for the simulated data, which is indeed the case here. Overall, the results imply substantial heterogeneity in the prediction accuracy across the considered datasets. On the one hand, the parametric Ordered Logit does well in small samples (vlbw), whereas the forest-based methods lag somewhat behind. This is not surprising, as lower precision in small samples is the price to pay for the additional flexibility. On the other hand, in the largest sample (winequality) the Ordered Logit is clearly the worst performing method and all forest-based methods perform substantially better. With respect to the Ordered Forest estimator, we observe relatively high prediction accuracy for three datasets (mammography, supportstudy, winequality) and relatively low prediction accuracy for two datasets (nhanes, vlbw) in comparison to the competing methods. The good performance on the winequality and the supportstudy datasets is expected due to the large samples available. In the case of the mammography dataset, despite the smaller sample size, the Ordered Forest maintains good prediction performance, with its honest version doing even better. The worse performance for the vlbw dataset might be due to the small sample size; here, however, the honest version of the Ordered Forest performs rather well. The relatively poor performance in the case of the nhanes dataset comes rather as a surprise, as the sample size is rather large. Nevertheless, the differences among all estimators are very small in magnitude here, in fact the smallest among the considered datasets. Overall, the empirical results provide evidence for a good predictive performance of the new Ordered Forest estimator, especially its non-honest version, across various real datasets.
Fig. 6
Cross-Validation: AMSE. Note: Figure summarizes the prediction accuracy results in terms of the AMSE based on 10 repetitions of 10-fold cross-validation for respective datasets. The boxplots show the median and the interquartile range of the respective measure
6 Empirical application
In order to showcase the Ordered Forest estimation of marginal effects, we revisit the question of self-assessed health status and its relationship with socio-economic characteristics as, for example, analysed previously by Case et al. (2002) and Murasko (2008). In our empirical application we analyse the dataset from the 2009 National Health Interview Survey (NHIS) used in Angrist and Pischke (2014), which includes an ordered categorical outcome indicating a self-assessed health status. The specific survey question of interest reads: 'Would you say your health in general is excellent, very good, good, fair, or poor?' and is coded on an ordered scale ranging from 1 (poor) to 5 (excellent). We examine how the ordered choice probabilities of the self-assessed health status differ for individuals with and without coverage by private health insurance (see Levy and Meltzer 2008, for a review of insurance effects on health) as well as how these probabilities vary with further socio-demographic characteristics, namely age, race and family size, and with economic characteristics, namely education, employment status and family income. The considered dataset is well-suited for demonstrating the evaluation of marginal effects for several reasons. First, the dataset features an ordered categorical outcome with five distinct ordered categories, which are unevenly distributed and thus challenging for estimating the associated marginal effects. Second, the dataset includes both continuous as well as categorical covariates, which enables an exhaustive demonstration of the evaluation of marginal effects for various variable types. Third, the dataset contains more than \(18'000\) observations, which allows for a precise estimation of the marginal effects. The descriptive statistics for the considered dataset are presented in Appendix C.1.27 We follow the data preparation of Angrist and Pischke (2014): we discard all observations with missing values and retain only individuals from single family households aged between 26 and 59 years, as these individuals do not yet qualify for the public health insurance programme Medicare.
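These sample restrictions can be expressed compactly; the sketch below assumes hypothetical column names (age, family_type), as the actual NHIS variable names differ.

```python
import pandas as pd

# hypothetical file and column names, for illustration only
nhis = pd.read_csv("nhis2009.csv")
nhis = (nhis
        .dropna()                          # discard observations with missing values
        .query("family_type == 'single'")  # keep single family households
        .query("26 <= age <= 59"))         # not yet eligible for Medicare
```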
We estimate the ordered choice probabilities for the self-reported health status conditional on having a private health insurance contract and further socio-economic characteristics using the Ordered Forest in its honest version, as defined in Sect. 4.3 and in Lechner (2018), as well as the Ordered Logit, and evaluate the corresponding marginal effects. Table 3 contains the estimated mean marginal effects for each outcome class for all covariates, together with the associated standard errors, t-values, p-values and conventional significance levels, for both the Ordered Forest and the Ordered Logit.28
In general, we see similar patterns in terms of effect sizes and effect direction for both the Ordered Forest and the Ordered Logit. However, we do observe more variability in the effect direction in the case of the Ordered Forest. This reflects the main difference to the Ordered Logit: the Ordered Forest does not use any parametric link function in the estimation of the marginal effects and as such does not impose any functional form on these estimates. As a result, the Ordered Forest neither fixes the sign of the marginal effect estimates nor reverts it exactly once within the class range, as is the case for the Ordered Logit (the so-called single crossing feature, see, for example, Boes and Winkelmann 2006; or Greene and Hensher 2010), but rather estimates the effects in a data-driven manner. Nevertheless, the Ordered Forest, like the Ordered Logit, still ensures that the marginal effects across the class range sum up to zero. In terms of the uncertainty of the effects, the level of precision estimated via the weight-based inference seems slightly lower than that of the delta method used for the Ordered Logit.
Table 3
Mean marginal effects: NHIS dataset

                            Ordered Forest                        Ordered Logit
Variable          Class   Effect  Std.Error  t-Value  p-Value   Effect  Std.Error  t-Value  p-Value
Health Insurance  1         0.23       0.08     2.89  0.00***    -0.11       0.05    -2.19  0.03**
                  2        -0.95       0.49    -1.93  0.05*      -0.49       0.22    -2.22  0.03**
                  3        -4.51       1.99    -2.27  0.02**     -1.32       0.59    -2.25  0.02**
                  4         4.44       1.80     2.47  0.01**      0.02       0.03     0.65  0.52
                  5         0.78       2.47     0.32  0.75        1.90       0.83     2.29  0.02**
Female            1        -0.19       0.12    -1.59  0.11        0.02       0.03     0.68  0.50
                  2         0.05       0.31     0.17  0.87        0.10       0.14     0.68  0.50
                  3         0.52       0.70     0.74  0.46        0.26       0.39     0.68  0.50
                  4         0.44       0.86     0.52  0.61        0.00       0.01     0.59  0.55
                  5        -0.82       1.16    -0.70  0.48       -0.39       0.57    -0.68  0.50
Non White         1         0.38       0.15     2.57  0.01**      0.36       0.05     7.02  0.00***
                  2         0.57       0.42     1.36  0.18        1.60       0.20     7.89  0.00***
                  3         5.97       1.12     5.32  0.00***     4.10       0.48     8.57  0.00***
                  4        -4.23       1.09    -3.87  0.00***    -0.26       0.08    -3.12  0.00***
                  5        -2.69       1.57    -1.72  0.09*      -5.81       0.65    -8.87  0.00***
Age               1         0.04       0.01     4.22  0.00***     0.04       0.00    12.60  0.00***
                  2         0.15       0.03     5.09  0.00***     0.20       0.01    19.77  0.00***
                  3         0.45       0.07     6.07  0.00***     0.54       0.02    23.87  0.00***
                  4        -0.01       0.09    -0.13  0.89        0.01       0.01     1.24  0.22
                  5        -0.62       0.12    -5.10  0.00***    -0.78       0.03   -24.15  0.00***
Education         1         0.00       0.00     0.41  0.68       -0.11       0.01   -11.61  0.00***
                  2        -0.01       0.00    -1.73  0.08*      -0.51       0.03   -16.80  0.00***
                  3        -0.02       0.01    -2.80  0.01***    -1.39       0.07   -18.94  0.00***
                  4         0.00       0.01     0.71  0.48       -0.02       0.02    -1.23  0.22
                  5         0.02       0.01     2.57  0.01**      2.04       0.11    18.85  0.00***
Family Size       1         0.00       0.00     0.32  0.75       -0.01       0.01    -0.81  0.42
                  2        -0.00       0.01    -0.21  0.83       -0.04       0.05    -0.81  0.42
                  3        -0.06       0.02    -3.49  0.00***    -0.12       0.14    -0.81  0.42
                  4        -0.03       0.02    -1.78  0.08*      -0.00       0.00    -0.67  0.50
                  5         0.10       0.02     4.97  0.00***     0.17       0.21     0.81  0.42
Employed          1        -3.99       0.50    -7.94  0.00***    -0.42       0.06    -7.30  0.00***
                  2        -3.81       0.73    -5.19  0.00***    -1.86       0.23    -8.21  0.00***
                  3         2.58       1.15     2.25  0.02**     -4.77       0.53    -8.98  0.00***
                  4         4.37       1.24     3.51  0.00***     0.39       0.11     3.55  0.00***
                  5         0.84       1.82     0.46  0.64        6.66       0.71     9.42  0.00***
Income            1        -0.11       0.04    -3.00  0.00***    -0.00       0.00   -12.07  0.00***
                  2        -0.46       0.14    -3.27  0.00***    -0.00       0.00   -17.73  0.00***
                  3        -0.06       0.51    -0.12  0.91       -0.00       0.00   -20.61  0.00***
                  4         0.36       0.37     0.97  0.33       -0.00       0.00    -1.24  0.21
                  5         0.27       0.45     0.61  0.54        0.00       0.00    20.96  0.00***

Significance levels correspond to: \(^{***}\) \(p<0.01\), \(^{**}\) \(p<0.05\), \(^{*}\) \(p<0.1\)
Notes: Table shows the comparison of the mean marginal effects in percentage points between the Ordered Forest and the Ordered Logit. The effects are estimated for all classes, together with the corresponding standard errors, t-values and p-values. The standard errors for the Ordered Forest are estimated using the weight-based inference; those for the Ordered Logit are obtained via the delta method
In particular, inspecting the variable of interest we immediately see the additional flexibility of the Ordered Forest. While both methods estimate positive marginal effects of having a private health insurance on the probability of being in very good or excellent health condition and negative marginal effects for being in good or fair health condition, the Ordered Forest also estimates a positive effect for being in poor health condition, whereas the Ordered Logit is forced to estimate a negative effect due to its above-mentioned single-crossing property. As such, the Ordered Forest estimates a non-monotonic effect of having a private health insurance across the class probabilities. The results suggest that, on the one hand, individuals with health insurance are 4.51 percentage points less likely to be in good health condition and 0.95 percentage points less likely to be in fair health condition, respectively. On the other hand, individuals with health insurance are 4.44 percentage points more likely to be in very good health condition and 0.78 percentage points more likely to be in excellent health condition, respectively, but they are also 0.23 percentage points more likely to be in poor health condition. As the decision to sign up for a private health insurance is not random, i.e. the data come from a non-experimental setting, it is not possible to uncover the causal effect without strong assumptions. Based on the partial correlation evidence one might, however, argue that through regular medical care and prevention the health insurance increases the likelihood of being in rather good health condition, but also that individuals in rather poor health condition are more likely to sign up for a private health insurance to cover the expected medical care costs. As can be seen, the Ordered Forest enables such a non-monotonic effects analysis, while the classical Ordered Logit (without any additional augmentation such as splines or similar) does not permit such a mechanism at all. Overall, in terms of effect sizes as well as statistical uncertainty, we observe similar results for both estimators.
Inspecting the effects of the additional conditioning variables, we see similar results for the binary covariates. Neither the Ordered Forest nor the Ordered Logit finds evidence for gender influencing the health class probabilities, while both methods estimate a lower probability of being in very good or excellent health condition for people of colour and the unemployed, with results comparable in both effect size and statistical precision. For the categorical income level variable, both methods estimate a positive relationship, i.e. individuals with higher income are less likely to be in rather bad health condition. However, in the case of the Ordered Forest the effects are sizeable, whereas in the case of the Ordered Logit the effect sizes lack substantive relevance. Lastly, for the continuous covariates, both methods estimate a higher likelihood of being in rather bad health condition with increasing age, with similar effect sizes and similar statistical precision. For education and family size, the Ordered Forest suggests non-monotonic effects, which is not the case for the Ordered Logit.
Overall, the main advantage of the estimation of the marginal effects by the Ordered Forest stems from a more flexible, data-driven approximation of possible nonlinearities in the functional form.
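For intuition, the following sketch computes mean marginal effects by finite differences on predicted choice probabilities: a discrete change for binary covariates and a small symmetric window for continuous ones. It is a simplified stand-in for the paper's procedure; the window choice and all names are our own assumptions.

```python
import numpy as np

def mean_marginal_effects(predict_proba, X, j, discrete, h=None):
    """Mean marginal effect of covariate j on each class probability."""
    X_up, X_dn = X.copy(), X.copy()
    if discrete:
        # discrete change from the minimum to the maximum value (0 to 1 for dummies)
        X_up[:, j], X_dn[:, j] = X[:, j].max(), X[:, j].min()
        scale = X[:, j].max() - X[:, j].min()
    else:
        # two-sided finite difference over a small window around observed values
        h = h if h is not None else 0.1 * X[:, j].std()
        X_up[:, j], X_dn[:, j] = X[:, j] + h, X[:, j] - h
        scale = 2 * h
    effects = (predict_proba(X_up) - predict_proba(X_dn)) / scale
    # averaging over observations gives one mean effect per outcome class;
    # effects sum to zero across classes since probabilities sum to one
    return effects.mean(axis=0)
```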
7 Conclusion
In this paper, we develop and apply a new machine learning estimator for econometric ordered choice models based on the Random Forest algorithm. The Ordered Forest estimator is a flexible alternative to parametric ordered choice models such as the Ordered Logit or the Ordered Probit which does not rely on any distributional assumptions and provides essentially the same output as the parametric models, including the estimation of marginal effects and the associated inference. The proposed estimator utilizes the flexibility of Random Forests and can thus naturally deal with nonlinearities in the data and with a large-dimensional covariate space, while taking the ordering information of the categorical outcome variable into account. Hence, the estimator flexibly estimates the conditional ordered choice probabilities without restrictive assumptions about the distribution of the error term, or other assumptions such as the single index and constant threshold assumptions imposed by the parametric ordered choice models (see Boes and Winkelmann 2006, for a discussion of these assumptions). Further, the estimator also allows the estimation of marginal effects, i.e. how the estimated conditional ordered choice probabilities vary with changes in covariates. The weighted representation of these effects together with the honesty of the forest enables the weight-based inference as suggested by Lechner (2018). The fact that the estimator comprises linear combinations of Random Forest predictions ensures that the theoretical guarantees of Wager and Athey (2018) are satisfied. Additionally, a free software implementation of the Ordered Forest estimator in both R (R Core Team 2021) and Python (Van Rossum and Drake 2009) is provided in the package orf, available on the official CRAN (Lechner and Okasa 2019) and PyPI (Lechner et al. 2022) repositories, to enable the usage of the method by applied researchers.
The performance of the Ordered Forest estimator is studied and compared to competing estimators in an extensive Monte Carlo simulation as well as on real datasets. The simulation results suggest a good performance of the estimator in finite samples, including high-dimensional settings. The advantages of machine learning estimation compared to a parametric method become apparent when dealing with high correlation among covariates and highly nonlinear functional forms. In such cases all of the considered forest-based estimators outperform the Ordered Logit in terms of prediction accuracy. Among the forest-based estimators, the Ordered Forest proposed in this paper, in its non-honest version, i.e. without sample splitting, performs well throughout all simulated DGPs and outperforms the competing methods in the most complex simulation designs. In contrast, the honest version of the Ordered Forest lags behind as the increase in variance dominates the bias reduction. These results document the trade-off between prediction performance, for which honesty is optional, and statistical inference, for which honesty is required. The empirical evidence based on real datasets supports the findings from the Monte Carlo simulation. Additionally, the estimation of the marginal effects as well as the inference procedure seems to work well in the presented empirical example.
Despite the attractive properties of the Ordered Forest estimator, many interesting questions are left open. In particular, a further extension of the Monte Carlo simulation to study the sensitivity of the Ordered Forest with respect to the tuning parameters of the underlying Random Forest as well as with respect to different simulation designs would be of interest. Similarly, the performance of the estimator with and without honesty in larger samples should be further investigated. Also, the optimal choice of the size of the window for evaluating the marginal effects would be worth exploring. Additionally, besides the theoretical guarantees for the point estimator, a rigorous asymptotic analysis of the weight-based inference procedure for the estimation of standard errors would be beneficial to establish its exact theoretical properties. Lastly, it would be of great interest to see more real data applications of the Ordered Forest estimator such as, for example, in Kim et al. (2021), especially for large samples.
Declarations
Conflict of interest
The authors did not receive support from any organization for the submitted work. The authors have no relevant financial or non-financial interests to disclose. A previous version of the manuscript has been published as a part of a doctoral thesis at the University of St.Gallen and is available online: https://www.alexandria.unisg.ch/265914/
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Multinomial Forest
Considering the Ordered Forest estimator, a possible modification for models with a categorical outcome variable without an inherent ordering appears to be straightforward. Instead of estimating cumulative probabilities and afterwards isolating the respective class probabilities, we can estimate the class probabilities \(P_{m,i}=P[Y_{i}=m \mid X_i=x]\) directly. As such, the binary outcomes are now constructed to indicate the particular outcome classes separately. Then the Random Forest predictions for each class yield the conditional choice probabilities, which afterwards need to be normalized to sum up to 1. Formally, consider an (un)ordered categorical outcome variable \(Y_i \in \{1,...,M \}\) with classes m and sample size \(N(i=1,...,N)\). Then, the estimation procedure can be described as follows:
1.
Create M binary indicator variables such as
$$\begin{aligned} Y_{m,i}=\textbf{1}(Y_i = m) \quad \text { for } \quad m=1,...,M . \end{aligned}$$
(A.1)
where m is known and given by the definition of the dependent variable.
2.
Estimate a regression Random Forest for each of the M indicators as
$$\begin{aligned} \hat{P}_{m,i}={\hat{E}}[Y_{m,i} \mid X_i=x] \quad \text { for } \quad m=1,...,M , \end{aligned}$$
(A.2)
which admits the weighted representation
$$\begin{aligned} \hat{P}_{m,i}=\sum ^N_{i=1}w_{m,i}(x)Y_{m,i} , \end{aligned}$$
(A.3)
where the forest weights are defined as \(w_{m,i}(x)=\frac{1}{B}\sum ^B_{b=1}w_{m,b,i}(x)\) with tree weights given by \(w_{m,b,i}(x)=\frac{\textbf{1}(\{X_i \in L_{b,m}(x) \})}{\mid \{ i:X_i \in L_{b,m}(x) \} \mid }\) with leaves \(L_{b,m}(x)\) for a total of B trees.
3.
Obtain the forest predictions for each of the M indicators and collect them as
$$\begin{aligned} \hat{P}_{i}=\big (\hat{P}_{1,i},...,\hat{P}_{M,i}\big ) , \end{aligned}$$
(A.4)
and normalize them as
$$\begin{aligned} \hat{P}^{norm}_{m,i}=\hat{P}_{m,i}\Big /\sum ^M_{m=1}\hat{P}_{m,i} \quad \text { for } \quad m=1,...,M , \end{aligned}$$
(A.5)
where Eq. (A.4) defines the probabilities of all M classes and the subsequent Eq. (A.5) ensures that the probabilities sum up to 1, as this might not be the case otherwise. Similarly to the Ordered Forest estimator, the Multinomial Forest is a linear combination of the respective forest predictions and as such also inherits the theoretical properties stemming from Random Forest estimation as described in Sect. 3 of the main text.
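A compact sketch of this procedure with scikit-learn regression forests (illustrative only; the orf package implements its own version) might look as follows:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def multinomial_forest_proba(X, y, X_test, classes, **rf_kwargs):
    # one regression forest per class indicator Y_{m,i} = 1(Y_i = m)
    probs = np.column_stack([
        RandomForestRegressor(**rf_kwargs)
        .fit(X, (y == m).astype(float))
        .predict(X_test)
        for m in classes
    ])
    # normalization step corresponding to Eq. (A.5): rows sum to 1
    return probs / probs.sum(axis=1, keepdims=True)
```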
Conditional Forest
The Conditional Forest as discussed in Sect. 2 of the main text is grown with the so-called conditional inference trees. The main idea is to provide an unbiased way of recursive splitting of the trees using a test statistic based on permutation tests (Strasser and Weber 1999). To describe the estimation procedure, consider an ordered categorical outcome \(Y_i \in (1,...,M)\) with ordered classes m and sample size \(N(i=1,...,N)\). Further, define binary case weights \(w_i \in \{0,1\}\) which determine if the observation is part of the current leaf. Then, the algorithm developed by Hothorn et al. (2006) can be described as follows:
1.
Test the global null hypothesis of independence between any of the P covariates and the outcome, for the particular case weights, given a bootstrap sample \(Z_b\). Afterwards, select the p-th covariate \(X_{i,p}\) with the strongest association with the outcome \(Y_i\), or stop if the null hypothesis cannot be rejected. The association is measured by a linear statistic T given as
$$\begin{aligned} T_p(Z_b,w)=\text {vec}\Big (\sum ^N_{i=1}w_i \, g_p(X_{i,p}) \, h(Y_i)^{\top }\Big ) , \end{aligned}$$
(A.6)
where \(g_p(\cdot )\) and \(h(\cdot )\) are specific transformation functions.
2.
Split the covariate sample space \({\mathcal {X}}_p\) into two disjoint sets \({\mathcal {I}}\) and \({\mathcal {J}}\) with adapted case weights \(w_i\textbf{1}(X_{i,p} \in {\mathcal {I}})\) and \(w_i\textbf{1}(X_{i,p} \in {\mathcal {J}})\) determining the observations falling into the subsets \({\mathcal {I}}\) and \({\mathcal {J}}\), respectively. Then, the split is chosen by evaluating a two-sample statistic as a special case of (A.6),
$$\begin{aligned} T^{{\mathcal {I}}}_p(Z_b,w)=\text {vec}\Big (\sum ^N_{i=1}w_i\textbf{1}(X_{i,p} \in {\mathcal {I}}) \, h(Y_i)^{\top }\Big ) , \end{aligned}$$
(A.7)
for all possible subsets \({\mathcal {I}}\) of the covariate sample space \({\mathcal {X}}_p\).
3.
Repeat steps 1 and 2 recursively with modified case weights.
Hence, the above algorithm distinguishes between variable selection (step 1) and the splitting rule (step 2), both relying on variations of the test statistic \(T_p(Z_b,w)\). In practice, however, the distribution of this statistic under the null hypothesis is unknown and depends on the joint distribution of \(Y_i\) and \(X_{i,p}\). For this reason, permutation tests are applied to abstract from this dependency by fixing the covariates and conditioning on all possible permutations of the outcomes. Then, the conditional mean and covariance of the test statistic can be derived and its asymptotic distribution can be approximated by Monte Carlo procedures, while Strasser and Weber (1999) proved its normality. Finally, variables and splits are chosen according to the lowest p-value of the test statistics \(T_p(Z_b,w)\) and \(T^{{\mathcal {I}}}_p(Z_b,w)\), respectively.
Besides the permutation tests, the choice of the transformation functions \(g_p(\cdot )\) and \(h(\cdot )\) is important and depends on the type of the variables. For continuous outcomes and covariates, the identity transformation is suggested. For the case of ordinal regression, which is of interest here, the transformation function is given through the score function s(m). If the underlying latent \(Y_i^*\) is unobserved, it is suggested to set \(s(m)=m\) and thus \(h(Y_i)=Y_i\). Hence, in the tree building the ordered outcome is treated as a continuous one (Janitza et al. 2016). The leaf predictions, however, are then the choice probabilities computed as proportions of the outcome classes falling within the leaf, instead of a fitted within-leaf constant. The final Conditional Forest predictions for the choice probabilities are the averaged conditional tree probability predictions. The choice probabilities obtained in this way are analysed in the Monte Carlo study in Sect. 5 of the main text.
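As a stylized illustration of the permutation test logic with identity transformations (\(s(m)=m\), \(h(Y_i)=Y_i\)), a Monte Carlo permutation p-value for the association between a single covariate and the ordered outcome can be sketched as follows; this is a simplification of the conditional inference framework, not the implementation of Hothorn et al. (2006):

```python
import numpy as np

def permutation_pvalue(x, y, n_perm=1000, seed=0):
    """Permutation test of independence between covariate x and ordered
    outcome y, based on the centered linear statistic sum_i x_i * y_i."""
    rng = np.random.default_rng(seed)
    xc, yc = x - x.mean(), y - y.mean()
    t_obs = abs(np.sum(xc * yc))
    # condition on the observed outcomes by permuting them
    t_perm = np.array([abs(np.sum(xc * rng.permutation(yc)))
                       for _ in range(n_perm)])
    return (1 + np.sum(t_perm >= t_obs)) / (1 + n_perm)
```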
Ordinal Forest
In the following, the algorithm for the Ordinal Forest as developed by Hornung (2019a) is described. To begin with, consider an ordered categorical outcome \(Y_i \in (1,...,M)\) with ordered classes m and sample size \(N(i=1,...,N)\). Then, for a set of optimization forests \(b=1,...,B_{sets}\):
1.
Draw \(M-1\) uniformly distributed variables \(D_{b,m} \sim U(0,1)\) and sort them according to their values. Further, set \(D_{b,1}=0\) and \(D_{b,M+1}=1\).
2.
Define a score set \(S_{b,m}=\{S_{b,1},...,S_{b,M}\}\) with scores constructed as \(S_{b,m}=\Phi ^{-1}\big (\frac{D_{b,m}+D_{b,m+1}}{2}\big )\) for \(m=1,...,M\), where \(\Phi (\cdot )\) is the cumulative distribution function of the standard normal.
3.
Create a new continuous outcome \(Z_{b,i}=(Z_{b,1},...,Z_{b,N})\) by replacing each class value m of the original ordered categorical \(Y_i\) by the m-th value of the score set \(S_{b,m}\) for all \(m=1,...,M\).
4.
Use \(Z_{b,i}\) as dependent variable and estimate a regression forest \(RF_{S_{b,m}}\) with \(B_{prior}\) trees.
5.
Obtain the out-of-bag (OOB) predictions for the continuous \(Z_{b,i}\) and transform them into predictions for \(Y_i\) as follows: \(\hat{Y}_{b,i}=m\) if \(\hat{Z}_{b,i} \in \big ]\Phi ^{-1}(D_{b,m}), \Phi ^{-1}(D_{b,m+1})\big ]\).
6.
Compute a performance measure for the given forest \(\hat{RF}_{S_{b,m}}\) based on some performance function of type \(f(Y_i,\hat{Y}_{b,i})\).
After estimating all \(B_{sets}\) optimization forests, take the \(S_{best}\) score sets which achieved the best performance according to the performance function. Then, construct the final set of uniformly distributed variables \(D_1,...,D_{M+1}\) as the average of those from the \(S_{best}\) best-performing sets for \(m=1,...,M+1\). Finally, form the optimized score set \(S_m=\{S_{1},...,S_{M}\}\) with scores constructed as \(S_{m}=\Phi ^{-1}\big (\frac{D_{m}+D_{m+1}}{2}\big )\) for \(m=1,...,M\). The continuous outcome \(Z_i=(Z_{1},...,Z_{N})\) is then, similarly as in the optimization procedure, constructed by replacing each value m of the original outcome \(Y_i\) by the m-th value of the optimized score set \(S_m\) for all \(m=1,...,M\). Finally, estimate the regression forest \(RF_{final}\) using \(Z_i\) as the dependent variable. On the one hand, the class prediction of such an Ordinal Forest is the one of the M ordered classes which has been predicted most often by the respective trees of the forest. On the other hand, the probability prediction is obtained as the relative frequency of trees predicting the particular class. These predicted choice probabilities are analysed in the conducted Monte Carlo study in Sect. 5 of the main text. Further, the so-called naive forest corresponds to the Ordinal Forest with the above-described optimization procedure omitted.
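The core of the score-set optimization (steps 1 to 6) can be sketched as follows, using out-of-bag predictions and simple classification accuracy as the performance function; this is a schematic rendering of Hornung's (2019a) algorithm under our own simplifications, not the original implementation.

```python
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import RandomForestRegressor

def draw_partition(M, rng):
    """Steps 1-2: random partition of (0,1) and the implied score set."""
    d = np.concatenate(([0.0], np.sort(rng.uniform(size=M - 1)), [1.0]))
    scores = norm.ppf((d[:-1] + d[1:]) / 2)   # S_m = Phi^{-1}((D_m + D_{m+1})/2)
    return d, scores

def score_set_performance(X, y, d, scores):
    """Steps 3-6: recode the classes as scores, fit a regression forest
    and map the OOB predictions back to classes via the interval bounds."""
    z = scores[y - 1]                          # classes are coded 1,...,M
    rf = RandomForestRegressor(oob_score=True, random_state=0).fit(X, z)
    bounds = norm.ppf(d[1:-1])                 # interior thresholds Phi^{-1}(D_m)
    y_hat = np.searchsorted(bounds, rf.oob_prediction_) + 1
    return np.mean(y_hat == y)                 # simple accuracy as performance
```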
Simulation study
Main simulation results
Tables 4, 5, 6, 7 and 8 summarize the simulation results presented in Sect. 5.4 of the main text. Each table specifies the particular simulation design as follows: the column Class indicates the number of outcome classes, Dim. specifies the dimension, DGP characterizes the data generating process as defined in the main text and Statistic contains summary statistics of the simulation results, in particular the mean of the respective accuracy measure and its standard deviation. Furthermore, the rows t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and all the other methods. The alternative hypothesis is that the mean of the Ordered Forest is less than the mean of the other method, testing whether the Ordered Forest achieves a significantly lower prediction error than the other considered methods. Furthermore, Figs. 7, 8, 9 and 10 complement the results presented in Sect. 5.4 of the main text for the simulations with the increased sample size.
Notes: Table reports the average measures of the RPS based on 100 simulation replications for the sample size of 200 observations. The first column denotes the number of outcome classes. Columns 2 and 3 specify the dimension and the DGP, respectively. The fourth column Statistic shows the mean and the standard deviation of the accuracy measure for all methods. Additionally, t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and all the other methods
Notes: Table reports the average measures of the MSE based on 100 simulation replications for the sample size of 200 observations. The first column denotes the number of outcome classes. Columns 2 and 3 specify the dimension and the DGP, respectively. The fourth column Statistic shows the mean and the standard deviation of the accuracy measure for all methods. Additionally, t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and all the other methods
Notes: Table reports the average measures of the RPS based on 100 simulation replications for the sample size of 800 observations. The first column denotes the number of outcome classes. Columns 2 and 3 specify the dimension and the DGP, respectively. The fourth column Statistic shows the mean and the standard deviation of the accuracy measure for all methods. Additionally, t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and all the other methods
Notes: Table reports the average measures of the MSE based on 100 simulation replications for the sample size of 800 observations. The first column denotes the number of outcome classes. Columns 2 and 3 specify the dimension and the DGP, respectively. The fourth column Statistic shows the mean and the standard deviation of the accuracy measure for all methods. Additionally, t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and all the other methods
Notes: Table reports the average measures of the RPS and MSE based on 100 simulation replications for the sample size of 3200 observations. The first column denotes the number of outcome classes. Columns 2 and 3 specify the dimension and the DGP, respectively. The fourth column Statistic shows the mean and the standard deviation of the accuracy measure for all methods. Additionally, t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and the honest version of the Ordered Forest
Fig. 7
Ordered Forest Simulation Results: Simple DGP & Low Dimension. Note: Figure summarizes the prediction accuracy results based on 100 simulation replications. The upper panel contains the ARPS and the lower panel contains the AMSE. The boxplots show the median and the interquartile range of the respective measure. The transparent boxplots denote the results for the small sample size, the semi-transparent ones denote the medium sample size, while the bold boxplots denote the results for the big sample size. From left to right the results for 3, 6 and 9 outcome classes are displayed
Fig. 8
Ordered Forest Simulation Results: Complex DGP & Low Dimension. Note: Figure summarizes the prediction accuracy results based on 100 simulation replications. The upper panel contains the ARPS and the lower panel contains the AMSE. The boxplots show the median and the interquartile range of the respective measure. The transparent boxplots denote the results for the small sample size, the semi-transparent ones denote the medium sample size, while the bold boxplots denote the results for the big sample size. From left to right the results for 3, 6 and 9 outcome classes are displayed
Fig. 9
Ordered Forest Simulation Results: Simple DGP & High Dimension. Note: Figure summarizes the prediction accuracy results based on 100 simulation replications. The upper panel contains the ARPS and the lower panel contains the AMSE. The boxplots show the median and the interquartile range of the respective measure. The transparent boxplots denote the results for the small sample size, the semi-transparent ones denote the medium sample size, while the bold boxplots denote the results for the big sample size. From left to right the results for 3, 6 and 9 outcome classes are displayed
Fig. 10
Ordered Forest Simulation Results: Complex DGP & High Dimension. Note: Figure summarizes the prediction accuracy results based on 100 simulation replications. The upper panel contains the ARPS and the lower panel contains the AMSE. The boxplots show the median and the interquartile range of the respective measure. The transparent boxplots denote the results for the small sample size, the semi-transparent ones denote the medium sample size, while the bold boxplots denote the results for the big sample size. From left to right the results for 3, 6 and 9 outcome classes are displayed
Tables 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19 and 20 summarize the simulation results for all 72 different DGPs, complementing the main results presented in Sect. 5.4 of the main text. Each table specifies the particular simulation design as follows: the first column DGP provides the identifier of the data generating process. Columns 2 to 5 specify the particular characteristics of the respective DGP, namely whether the DGP features additional noise variables (noise), 15 in the low-dimensional case and 1000 in the high-dimensional case, nonlinear effects (nonlin), high correlation among covariates (multi) and randomly spaced thresholds (random). The sixth column Statistic contains summary statistics of the simulation results, in particular the mean of the respective accuracy measure (mean) and its standard deviation (st.dev.). Furthermore, the rows t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and all the other methods. The alternative hypothesis is that the mean of the Ordered Forest is less than the mean of the other method, testing whether the Ordered Forest achieves a significantly lower prediction error than the other considered methods.
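The reported p-values can be reproduced in spirit with one-sided paired tests over the simulation replications; below is a sketch with synthetic per-replication errors, assuming the tests are paired across replications, which the common simulation draws suggest.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# synthetic per-replication RPS values for the Ordered Forest and one competitor
rps_ordered = rng.normal(0.110, 0.002, size=100)
rps_other = rng.normal(0.112, 0.002, size=100)

# H1: the Ordered Forest has the lower mean prediction error
print(stats.ttest_rel(rps_ordered, rps_other, alternative="less").pvalue)
print(stats.wilcoxon(rps_ordered, rps_other, alternative="less").pvalue)
```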
Notes: Table reports the average measures of the RPS based on 100 simulation replications for the sample size of 200 observations with 3 outcome classes. Columns 1 to 5 specify the DGP identifier and its features, namely 15 additional noise variables (noise), nonlinear effects (nonlin), high correlation among covariates (multi) and randomly spaced thresholds (random). The sixth column Statistic shows the mean and the standard deviation of the accuracy measure for all methods. Additionally, t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and all the other methods
Notes: Table reports the average measures of the RPS based on 100 simulation replications for the sample size of 200 observations with 6 outcome classes. Columns 1 to 5 specify the DGP identifier and its features, namely 15 additional noise variables (noise), nonlinear effects (nonlin), high correlation among covariates (multi) and randomly spaced thresholds (random). The sixth column Statistic shows the mean and the standard deviation of the accuracy measure for all methods. Additionally, t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and all the other methods
Notes: Table reports the average measures of the RPS based on 100 simulation replications for the sample size of 200 observations with 9 outcome classes. Columns 1 to 5 specify the DGP identifier and its features, namely 15 additional noise variables (noise), nonlinear effects (nonlin), high correlation among covariates (multi) and randomly spaced thresholds (random). The sixth column Statistic shows the mean and the standard deviation of the accuracy measure for all methods. Additionally, t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and all the other methods
Simulation results: Accuracy Measure = ARPS & high dimension with 3 classes

     Simulation design                       Comparison of methods
DGP  Noise  Nonlin  Multi  Random  Statistic     Naive    Ordinal  Cond.    Ordered  Ordered*  Multi    Multi*
49   ✓      ✗       ✗      ✗       mean          0.1135   0.1139   0.1112   0.1140   0.1180    0.1139   0.1179
                                   st.dev.       0.0009   0.0010   0.0009   0.0008   0.0006    0.0008   0.0006
                                   t-test        1.0000   0.7676   1.0000   –        0.0000    0.7268   0.0000
                                   wilcox-test   0.9999   0.8438   1.0000   –        0.0000    0.7191   0.0000
50   ✓      ✓       ✗      ✗       mean          0.0896   0.0899   0.0901   0.0903   0.0907    0.0901   0.0907
                                   st.dev.       0.0008   0.0010   0.0008   0.0007   0.0007    0.0007   0.0006
                                   t-test        1.0000   0.9997   0.9840   –        0.0002    0.9973   0.0004
                                   wilcox-test   1.0000   1.0000   0.9929   –        0.0000    0.9989   0.0000
51   ✓      ✗       ✓      ✗       mean          0.1534   0.1529   0.0827   0.0766   0.1082    0.0867   0.1134
                                   st.dev.       0.0011   0.0012   0.0024   0.0025   0.0029    0.0024   0.0026
                                   t-test        0.0000   0.0000   0.0000   –        0.0000    0.0000   0.0000
                                   wilcox-test   0.0000   0.0000   0.0000   –        0.0000    0.0000   0.0000
52   ✓      ✗       ✗      ✓       mean          0.1253   0.1252   0.1224   0.1248   0.1296    0.1250   0.1296
                                   st.dev.       0.0013   0.0013   0.0010   0.0009   0.0007    0.0009   0.0007
                                   t-test        0.0011   0.0115   1.0000   –        0.0000    0.1664   0.0000
                                   wilcox-test   0.0013   0.0140   1.0000   –        0.0000    0.1515   0.0000
53   ✓      ✓       ✓      ✗       mean          0.1299   0.1300   0.1048   0.1016   0.1200    0.1021   0.1202
                                   st.dev.       0.0011   0.0012   0.0034   0.0027   0.0026    0.0027   0.0025
                                   t-test        0.0000   0.0000   0.0000   –        0.0000    0.0674   0.0000
                                   wilcox-test   0.0000   0.0000   0.0000   –        0.0000    0.0494   0.0000
54   ✓      ✓       ✗      ✓       mean          0.0997   0.0996   0.0999   0.0998   0.1004    0.0997   0.1004
                                   st.dev.       0.0012   0.0013   0.0012   0.0012   0.0011    0.0011   0.0012
                                   t-test        0.5772   0.8438   0.3065   –        0.0000    0.6432   0.0000
                                   wilcox-test   0.6792   0.9705   0.2427   –        0.0000    0.7183   0.0000
55   ✓      ✗       ✓      ✓       mean          0.1678   0.1667   0.0862   0.0836   0.1167    0.0906   0.1195
                                   st.dev.       0.0015   0.0013   0.0026   0.0030   0.0029    0.0029   0.0029
                                   t-test        0.0000   0.0000   0.0000   –        0.0000    0.0000   0.0000
                                   wilcox-test   0.0000   0.0000   0.0000   –        0.0000    0.0000   0.0000
56   ✓      ✓       ✓      ✓       mean          0.1476   0.1474   0.1156   0.1102   0.1316    0.1110   0.1317
                                   st.dev.       0.0013   0.0010   0.0041   0.0029   0.0031    0.0029   0.0031
                                   t-test        0.0000   0.0000   0.0000   –        0.0000    0.0287   0.0000
                                   wilcox-test   0.0000   0.0000   0.0000   –        0.0000    0.0272   0.0000
Notes: Table reports the average measures of the RPS based on 100 simulation replications for the sample size of 200 observations with 3 outcome classes. Columns 1 to 5 specify the DGP identifier and its features, namely 1000 additional noise variables (noise), nonlinear effects (nonlin), high correlation among covariates (multi) and randomly spaced thresholds (random). The sixth column Statistic shows the mean and the standard deviation of the accuracy measure for all methods. Additionally, t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and all the other methods
Simulation results: Accuracy Measure = ARPS & high dimension with 6 classes

     Simulation design                       Comparison of methods
DGP  Noise  Nonlin  Multi  Random  Statistic     Naive    Ordinal  Cond.    Ordered  Ordered*  Multi    Multi*
57   ✓      ✗       ✗      ✗       mean          0.0974   0.0972   0.0951   0.0983   0.1012    0.0998   0.1016
                                   st.dev.       0.0006   0.0006   0.0006   0.0005   0.0004    0.0005   0.0004
                                   t-test        1.0000   1.0000   1.0000   –        0.0000    0.0000   0.0000
                                   wilcox-test   1.0000   1.0000   1.0000   –        0.0000    0.0000   0.0000
58   ✓      ✓       ✗      ✗       mean          0.0762   0.0762   0.0765   0.0773   0.0772    0.0776   0.0773
                                   st.dev.       0.0006   0.0006   0.0006   0.0005   0.0005    0.0004   0.0004
                                   t-test        1.0000   1.0000   1.0000   –        0.9803    0.0000   0.7833
                                   wilcox-test   1.0000   1.0000   1.0000   –        0.9838    0.0000   0.7449
59   ✓      ✗       ✓      ✗       mean          0.1336   0.1327   0.0747   0.0675   0.0968    0.0912   0.1152
                                   st.dev.       0.0008   0.0010   0.0013   0.0016   0.0015    0.0016   0.0017
                                   t-test        0.0000   0.0000   0.0000   –        0.0000    0.0000   0.0000
                                   wilcox-test   0.0000   0.0000   0.0000   –        0.0000    0.0000   0.0000
60   ✓      ✗       ✗      ✓       mean          0.0845   0.0845   0.0826   0.0857   0.0880    0.0872   0.0883
                                   st.dev.       0.0005   0.0005   0.0006   0.0004   0.0003    0.0004   0.0003
                                   t-test        1.0000   1.0000   1.0000   –        0.0000    0.0000   0.0000
                                   wilcox-test   1.0000   1.0000   1.0000   –        0.0000    0.0000   0.0000
61   ✓      ✓       ✓      ✗       mean          0.1091   0.1088   0.0891   0.0885   0.1026    0.1010   0.1105
                                   st.dev.       0.0009   0.0008   0.0025   0.0021   0.0018    0.0023   0.0010
                                   t-test        0.0000   0.0000   0.0547   –        0.0000    0.0000   0.0000
                                   wilcox-test   0.0000   0.0000   0.0626   –        0.0000    0.0000   0.0000
62   ✓      ✓       ✗      ✓       mean          0.0658   0.0659   0.0660   0.0669   0.0665    0.0672   0.0666
                                   st.dev.       0.0006   0.0006   0.0006   0.0006   0.0006    0.0005   0.0005
                                   t-test        1.0000   1.0000   1.0000   –        1.0000    0.0006   0.9998
                                   wilcox-test   1.0000   1.0000   1.0000   –        1.0000    0.0000   1.0000
63   ✓      ✗       ✓      ✓       mean          0.1167   0.1163   0.0682   0.0606   0.0872    0.0820   0.1052
                                   st.dev.       0.0007   0.0008   0.0014   0.0016   0.0015    0.0018   0.0015
                                   t-test        0.0000   0.0000   0.0000   –        0.0000    0.0000   0.0000
                                   wilcox-test   0.0000   0.0000   0.0000   –        0.0000    0.0000   0.0000
64   ✓      ✓       ✓      ✓       mean          0.0927   0.0927   0.0766   0.0772   0.0882    0.0898   0.0952
                                   st.dev.       0.0006   0.0005   0.0020   0.0016   0.0014    0.0018   0.0006
                                   t-test        0.0000   0.0000   0.9878   –        0.0000    0.0000   0.0000
                                   wilcox-test   0.0000   0.0000   0.9887   –        0.0000    0.0000   0.0000
Notes: Table reports the average measures of the RPS based on 100 simulation replications for the sample size of 200 observations with 6 outcome classes. Columns 1 to 5 specify the DGP identifier and its features, namely 1000 additional noise variables (noise), nonlinear effects (nonlin), high correlation among covariates (multi) and randomly spaced thresholds (random). The sixth column Statistic shows the mean and the standard deviation of the accuracy measure for all methods. Additionally, t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and all the other methods
Simulation results: Accuracy Measure = ARPS & high dimension with 9 classes

     Simulation design                       Comparison of methods
DGP  Noise  Nonlin  Multi  Random  Statistic     Naive    Ordinal  Cond.    Ordered  Ordered*  Multi    Multi*
65   ✓      ✗       ✗      ✗       mean          0.0921   0.0918   0.0900   0.0931   0.0959    0.0955   0.0964
                                   st.dev.       0.0006   0.0006   0.0006   0.0005   0.0003    0.0004   0.0003
                                   t-test        1.0000   1.0000   1.0000   –        0.0000    0.0000   0.0000
                                   wilcox-test   1.0000   1.0000   1.0000   –        0.0000    0.0000   0.0000
66   ✓      ✓       ✗      ✗       mean          0.0721   0.0720   0.0724   0.0732   0.0730    0.0739   0.0731
                                   st.dev.       0.0006   0.0005   0.0006   0.0005   0.0004    0.0004   0.0004
                                   t-test        1.0000   1.0000   1.0000   –        0.9959    0.0000   0.8717
                                   wilcox-test   1.0000   1.0000   1.0000   –        0.9991    0.0000   0.9308
67   ✓      ✗       ✓      ✗       mean          0.1268   0.1260   0.0713   0.0648   0.0926    0.0979   0.1175
                                   st.dev.       0.0008   0.0009   0.0013   0.0013   0.0014    0.0017   0.0015
                                   t-test        0.0000   0.0000   0.0000   –        0.0000    0.0000   0.0000
                                   wilcox-test   0.0000   0.0000   0.0000   –        0.0000    0.0000   0.0000
68   ✓      ✗       ✗      ✓       mean          0.0904   0.0902   0.0884   0.0915   0.0941    0.0937   0.0946
                                   st.dev.       0.0006   0.0006   0.0005   0.0005   0.0003    0.0004   0.0003
                                   t-test        1.0000   1.0000   1.0000   –        0.0000    0.0000   0.0000
                                   wilcox-test   1.0000   1.0000   1.0000   –        0.0000    0.0000   0.0000
69   ✓      ✓       ✓      ✗       mean          0.1031   0.1028   0.0838   0.0838   0.0967    0.1024   0.1061
                                   st.dev.       0.0007   0.0007   0.0021   0.0017   0.0016    0.0016   0.0005
                                   t-test        0.0000   0.0000   0.4695   –        0.0000    0.0000   0.0000
                                   wilcox-test   0.0000   0.0000   0.5044   –        0.0000    0.0000   0.0000
70   ✓      ✓       ✗      ✓       mean          0.0706   0.0707   0.0710   0.0718   0.0716    0.0724   0.0717
                                   st.dev.       0.0007   0.0007   0.0006   0.0006   0.0005    0.0005   0.0006
                                   t-test        1.0000   1.0000   1.0000   –        0.9903    0.0000   0.8186
                                   wilcox-test   1.0000   1.0000   1.0000   –        0.9983    0.0000   0.8723
71   ✓      ✗       ✓      ✓       mean          0.1246   0.1238   0.0704   0.0636   0.0911    0.0966   0.1153
                                   st.dev.       0.0007   0.0008   0.0014   0.0013   0.0014    0.0016   0.0018
                                   t-test        0.0000   0.0000   0.0000   –        0.0000    0.0000   0.0000
                                   wilcox-test   0.0000   0.0000   0.0000   –        0.0000    0.0000   0.0000
72   ✓      ✓       ✓      ✓       mean          0.1007   0.1004   0.0817   0.0819   0.0945    0.0997   0.1036
                                   st.dev.       0.0007   0.0007   0.0020   0.0017   0.0015    0.0019   0.0006
                                   t-test        0.0000   0.0000   0.7875   –        0.0000    0.0000   0.0000
                                   wilcox-test   0.0000   0.0000   0.8473   –        0.0000    0.0000   0.0000
Notes: Table reports the average measures of the RPS based on 100 simulation replications for the sample size of 200 observations with 9 outcome classes. Columns 1 to 5 specify the DGP identifier and its features, namely 1000 additional noise variables (noise), nonlinear effects (nonlin), high correlation among covariates (multi) and randomly spaced thresholds (random). The sixth column Statistic shows the mean and the standard deviation of the accuracy measure for all methods. Additionally, t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and all the other methods
Notes: Table reports the average measures of the MSE based on 100 simulation replications for the sample size of 200 observations with 3 outcome classes. Columns 1 to 5 specify the DGP identifier and its features, namely 15 additional noise variables (noise), nonlinear effects (nonlin), high correlation among covariates (multi) and randomly spaced thresholds (random). The sixth column Statistic shows the mean and the standard deviation of the accuracy measure for all methods. Additionally, t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and all the other methods
Notes: Table reports the average measures of the MSE based on 100 simulation replications for the sample size of 200 observations with 6 outcome classes. Columns 1 to 5 specify the DGP identifier and its features, namely 15 additional noise variables (noise), nonlinear effects (nonlin), high correlation among covariates (multi) and randomly spaced thresholds (random). The sixth column Statistic shows the mean and the standard deviation of the accuracy measure for all methods. Additionally, t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and all the other methods
Notes: Table reports the average measures of the MSE based on 100 simulation replications for the sample size of 200 observations with 9 outcome classes. Columns 1 to 5 specify the DGP identifier and its features, namely 15 additional noise variables (noise), nonlinear effects (nonlin), high correlation among covariates (multi) and randomly spaced thresholds (random). The sixth column Statistic shows the mean and the standard deviation of the accuracy measure for all methods. Additionally, t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and all the other methods
Simulation results: Accuracy Measure = AMSE & high dimension with 3 classes

     Simulation design                       Comparison of methods
DGP  Noise  Nonlin  Multi  Random  Statistic     Naive    Ordinal  Cond.    Ordered  Ordered*  Multi    Multi*
49   ✓      ✗       ✗      ✗       mean          0.0923   0.0931   0.0908   0.0930   0.0952    0.0926   0.0952
                                   st.dev.       0.0008   0.0013   0.0009   0.0009   0.0007    0.0007   0.0007
                                   t-test        1.0000   0.2408   1.0000   –        0.0000    0.9980   0.0000
                                   wilcox-test   1.0000   0.5433   1.0000   –        0.0000    0.9977   0.0000
50   ✓      ✓       ✗      ✗       mean          0.0692   0.0698   0.0696   0.0702   0.0699    0.0696   0.0699
                                   st.dev.       0.0009   0.0013   0.0010   0.0009   0.0009    0.0008   0.0008
                                   t-test        1.0000   0.9907   0.9999   –        0.9649    1.0000   0.9852
                                   wilcox-test   1.0000   1.0000   1.0000   –        0.9887    1.0000   0.9944
51   ✓      ✗       ✓      ✗       mean          0.1385   0.1379   0.0864   0.0752   0.1008    0.0881   0.1087
                                   st.dev.       0.0009   0.0010   0.0019   0.0021   0.0023    0.0019   0.0018
                                   t-test        0.0000   0.0000   0.0000   –        0.0000    0.0000   0.0000
                                   wilcox-test   0.0000   0.0000   0.0000   –        0.0000    0.0000   0.0000
52   ✓      ✗       ✗      ✓       mean          0.0906   0.0904   0.0884   0.0902   0.0931    0.0902   0.0931
                                   st.dev.       0.0011   0.0013   0.0008   0.0008   0.0007    0.0008   0.0007
                                   t-test        0.0006   0.0794   1.0000   –        0.0000    0.3296   0.0000
                                   wilcox-test   0.0010   0.1853   1.0000   –        0.0000    0.2606   0.0000
53   ✓      ✓       ✓      ✗       mean          0.1079   0.1083   0.0910   0.0888   0.1010    0.0892   0.1013
                                   st.dev.       0.0009   0.0011   0.0025   0.0020   0.0019    0.0019   0.0019
                                   t-test        0.0000   0.0000   0.0000   –        0.0000    0.0936   0.0000
                                   wilcox-test   0.0000   0.0000   0.0000   –        0.0000    0.0745   0.0000
54   ✓      ✓       ✗      ✓       mean          0.0706   0.0703   0.0703   0.0705   0.0706    0.0704   0.0706
                                   st.dev.       0.0010   0.0011   0.0010   0.0009   0.0009    0.0008   0.0009
                                   t-test        0.1479   0.9409   0.8941   –        0.1495    0.7655   0.1796
                                   wilcox-test   0.1712   0.9972   0.9496   –        0.0718    0.8048   0.1178
55   ✓      ✗       ✓      ✓       mean          0.1291   0.1276   0.0725   0.0678   0.0914    0.0758   0.0954
                                   st.dev.       0.0016   0.0010   0.0020   0.0021   0.0021    0.0020   0.0020
                                   t-test        0.0000   0.0000   0.0000   –        0.0000    0.0000   0.0000
                                   wilcox-test   0.0000   0.0000   0.0000   –        0.0000    0.0000   0.0000
56   ✓      ✓       ✓      ✓       mean          0.1081   0.1079   0.0863   0.0828   0.0970    0.0834   0.0971
                                   st.dev.       0.0012   0.0009   0.0028   0.0019   0.0021    0.0020   0.0021
                                   t-test        0.0000   0.0000   0.0000   –        0.0000    0.0264   0.0000
                                   wilcox-test   0.0000   0.0000   0.0000   –        0.0000    0.0364   0.0000
Notes: Table reports the average measures of the MSE based on 100 simulation replications for the sample size of 200 observations with 3 outcome classes. Columns 1 to 5 specify the DGP identifier and its features, namely 1000 additional noise variables (noise), nonlinear effects (nonlin), high correlation among covariates (multi) and randomly spaced thresholds (random). The sixth column Statistic shows the mean and the standard deviation of the accuracy measure for all methods. Additionally, t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and all the other methods
Simulation results: Accuracy Measure = AMSE & high dimension with 6 classes

| DGP | Noise | Nonlin | Multi | Random | Statistic | Naive | Ordinal | Cond. | Ordered | Ordered* | Multi | Multi* |
|-----|-------|--------|-------|--------|-----------|-------|---------|-------|---------|----------|-------|--------|
| 57 | ✓ | ✗ | ✗ | ✗ | mean | 0.0352 | 0.0352 | 0.0347 | 0.0361 | 0.0361 | 0.0360 | 0.0361 |
| | | | | | st.dev. | 0.0003 | 0.0004 | 0.0004 | 0.0004 | 0.0004 | 0.0003 | 0.0004 |
| | | | | | t-test | 1.0000 | 1.0000 | 1.0000 | – | 0.8112 | 0.9994 | 0.6394 |
| | | | | | wilcox-test | 1.0000 | 1.0000 | 1.0000 | – | 0.8788 | 0.9989 | 0.6579 |
| 58 | ✓ | ✓ | ✗ | ✗ | mean | 0.0246 | 0.0246 | 0.0246 | 0.0257 | 0.0248 | 0.0252 | 0.0248 |
| | | | | | st.dev. | 0.0003 | 0.0003 | 0.0003 | 0.0004 | 0.0003 | 0.0002 | 0.0003 |
| | | | | | t-test | 1.0000 | 1.0000 | 1.0000 | – | 1.0000 | 1.0000 | 1.0000 |
| | | | | | wilcox-test | 1.0000 | 1.0000 | 1.0000 | – | 1.0000 | 1.0000 | 1.0000 |
| 59 | ✓ | ✗ | ✓ | ✗ | mean | 0.0622 | 0.0617 | 0.0459 | 0.0383 | 0.0494 | 0.0479 | 0.0553 |
| | | | | | st.dev. | 0.0003 | 0.0003 | 0.0005 | 0.0007 | 0.0007 | 0.0006 | 0.0006 |
| | | | | | t-test | 0.0000 | 0.0000 | 0.0000 | – | 0.0000 | 0.0000 | 0.0000 |
| | | | | | wilcox-test | 0.0000 | 0.0000 | 0.0000 | – | 0.0000 | 0.0000 | 0.0000 |
| 60 | ✓ | ✗ | ✗ | ✓ | mean | 0.0339 | 0.0341 | 0.0335 | 0.0350 | 0.0347 | 0.0348 | 0.0348 |
| | | | | | st.dev. | 0.0003 | 0.0004 | 0.0004 | 0.0004 | 0.0004 | 0.0003 | 0.0004 |
| | | | | | t-test | 1.0000 | 1.0000 | 1.0000 | – | 1.0000 | 0.9993 | 0.9999 |
| | | | | | wilcox-test | 1.0000 | 1.0000 | 1.0000 | – | 1.0000 | 0.9995 | 1.0000 |
| 61 | ✓ | ✓ | ✓ | ✗ | mean | 0.0397 | 0.0397 | 0.0351 | 0.0358 | 0.0383 | 0.0380 | 0.0399 |
| | | | | | st.dev. | 0.0004 | 0.0004 | 0.0007 | 0.0006 | 0.0005 | 0.0006 | 0.0004 |
| | | | | | t-test | 0.0000 | 0.0000 | 1.0000 | – | 0.0000 | 0.0000 | 0.0000 |
| | | | | | wilcox-test | 0.0000 | 0.0000 | 1.0000 | – | 0.0000 | 0.0000 | 0.0000 |
| 62 | ✓ | ✓ | ✗ | ✓ | mean | 0.0229 | 0.0231 | 0.0229 | 0.0241 | 0.0231 | 0.0235 | 0.0231 |
| | | | | | st.dev. | 0.0004 | 0.0005 | 0.0005 | 0.0005 | 0.0005 | 0.0004 | 0.0005 |
| | | | | | t-test | 1.0000 | 1.0000 | 1.0000 | – | 1.0000 | 1.0000 | 1.0000 |
| | | | | | wilcox-test | 1.0000 | 1.0000 | 1.0000 | – | 1.0000 | 1.0000 | 1.0000 |
| 63 | ✓ | ✗ | ✓ | ✓ | mean | 0.0628 | 0.0629 | 0.0481 | 0.0405 | 0.0512 | 0.0506 | 0.0583 |
| | | | | | st.dev. | 0.0003 | 0.0004 | 0.0005 | 0.0008 | 0.0007 | 0.0006 | 0.0005 |
| | | | | | t-test | 0.0000 | 0.0000 | 0.0000 | – | 0.0000 | 0.0000 | 0.0000 |
| | | | | | wilcox-test | 0.0000 | 0.0000 | 0.0000 | – | 0.0000 | 0.0000 | 0.0000 |
| 64 | ✓ | ✓ | ✓ | ✓ | mean | 0.0383 | 0.0386 | 0.0343 | 0.0350 | 0.0367 | 0.0378 | 0.0387 |
| | | | | | st.dev. | 0.0003 | 0.0004 | 0.0006 | 0.0005 | 0.0005 | 0.0005 | 0.0004 |
| | | | | | t-test | 0.0000 | 0.0000 | 1.0000 | – | 0.0000 | 0.0000 | 0.0000 |
| | | | | | wilcox-test | 0.0000 | 0.0000 | 1.0000 | – | 0.0000 | 0.0000 | 0.0000 |
Notes: Table reports the average measures of the MSE based on 100 simulation replications for the sample size of 200 observations with 6 outcome classes. Columns 1 to 5 specify the DGP identifier and its features, namely 1000 additional noise variables (noise), nonlinear effects (nonlin), high correlation among covariates (multi) and randomly spaced thresholds (random). The sixth column Statistic shows the mean and the standard deviation of the accuracy measure for all methods. Additionally, t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and all the other methods
Simulation results: Accuracy Measure = AMSE & high dimension with 9 classes

| DGP | Noise | Nonlin | Multi | Random | Statistic | Naive | Ordinal | Cond. | Ordered | Ordered* | Multi | Multi* |
|-----|-------|--------|-------|--------|-----------|-------|---------|-------|---------|----------|-------|--------|
| 65 | ✓ | ✗ | ✗ | ✗ | mean | 0.0180 | 0.0181 | 0.0178 | 0.0189 | 0.0185 | 0.0188 | 0.0185 |
| | | | | | st.dev. | 0.0002 | 0.0002 | 0.0002 | 0.0002 | 0.0002 | 0.0002 | 0.0002 |
| | | | | | t-test | 1.0000 | 1.0000 | 1.0000 | – | 1.0000 | 1.0000 | 1.0000 |
| | | | | | wilcox-test | 1.0000 | 1.0000 | 1.0000 | – | 1.0000 | 1.0000 | 1.0000 |
| 66 | ✓ | ✓ | ✗ | ✗ | mean | 0.0123 | 0.0123 | 0.0123 | 0.0133 | 0.0124 | 0.0129 | 0.0124 |
| | | | | | st.dev. | 0.0002 | 0.0002 | 0.0002 | 0.0002 | 0.0002 | 0.0002 | 0.0002 |
| | | | | | t-test | 1.0000 | 1.0000 | 1.0000 | – | 1.0000 | 1.0000 | 1.0000 |
| | | | | | wilcox-test | 1.0000 | 1.0000 | 1.0000 | – | 1.0000 | 1.0000 | 1.0000 |
| 67 | ✓ | ✗ | ✓ | ✗ | mean | 0.0339 | 0.0337 | 0.0263 | 0.0224 | 0.0281 | 0.0284 | 0.0316 |
| | | | | | st.dev. | 0.0002 | 0.0002 | 0.0003 | 0.0005 | 0.0004 | 0.0003 | 0.0003 |
| | | | | | t-test | 0.0000 | 0.0000 | 0.0000 | – | 0.0000 | 0.0000 | 0.0000 |
| | | | | | wilcox-test | 0.0000 | 0.0000 | 0.0000 | – | 0.0000 | 0.0000 | 0.0000 |
| 68 | ✓ | ✗ | ✗ | ✓ | mean | 0.0181 | 0.0181 | 0.0179 | 0.0190 | 0.0186 | 0.0188 | 0.0186 |
| | | | | | st.dev. | 0.0002 | 0.0002 | 0.0003 | 0.0003 | 0.0003 | 0.0002 | 0.0003 |
| | | | | | t-test | 1.0000 | 1.0000 | 1.0000 | – | 1.0000 | 1.0000 | 1.0000 |
| | | | | | wilcox-test | 1.0000 | 1.0000 | 1.0000 | – | 1.0000 | 1.0000 | 1.0000 |
| 69 | ✓ | ✓ | ✓ | ✗ | mean | 0.0198 | 0.0199 | 0.0178 | 0.0187 | 0.0193 | 0.0201 | 0.0201 |
| | | | | | st.dev. | 0.0002 | 0.0002 | 0.0003 | 0.0003 | 0.0003 | 0.0002 | 0.0002 |
| | | | | | t-test | 0.0000 | 0.0000 | 1.0000 | – | 0.0000 | 0.0000 | 0.0000 |
| | | | | | wilcox-test | 0.0000 | 0.0000 | 1.0000 | – | 0.0000 | 0.0000 | 0.0000 |
| 70 | ✓ | ✓ | ✗ | ✓ | mean | 0.0124 | 0.0124 | 0.0124 | 0.0133 | 0.0125 | 0.0130 | 0.0125 |
| | | | | | st.dev. | 0.0002 | 0.0002 | 0.0002 | 0.0003 | 0.0002 | 0.0002 | 0.0002 |
| | | | | | t-test | 1.0000 | 1.0000 | 1.0000 | – | 1.0000 | 1.0000 | 1.0000 |
| | | | | | wilcox-test | 1.0000 | 1.0000 | 1.0000 | – | 1.0000 | 1.0000 | 1.0000 |
| 71 | ✓ | ✗ | ✓ | ✓ | mean | 0.0338 | 0.0337 | 0.0262 | 0.0225 | 0.0281 | 0.0285 | 0.0315 |
| | | | | | st.dev. | 0.0002 | 0.0002 | 0.0004 | 0.0005 | 0.0005 | 0.0003 | 0.0004 |
| | | | | | t-test | 0.0000 | 0.0000 | 0.0000 | – | 0.0000 | 0.0000 | 0.0000 |
| | | | | | wilcox-test | 0.0000 | 0.0000 | 0.0000 | – | 0.0000 | 0.0000 | 0.0000 |
| 72 | ✓ | ✓ | ✓ | ✓ | mean | 0.0200 | 0.0200 | 0.0178 | 0.0187 | 0.0193 | 0.0201 | 0.0202 |
| | | | | | st.dev. | 0.0002 | 0.0002 | 0.0003 | 0.0003 | 0.0003 | 0.0003 | 0.0002 |
| | | | | | t-test | 0.0000 | 0.0000 | 1.0000 | – | 0.0000 | 0.0000 | 0.0000 |
| | | | | | wilcox-test | 0.0000 | 0.0000 | 1.0000 | – | 0.0000 | 0.0000 | 0.0000 |
Notes: Table reports the average measures of the MSE based on 100 simulation replications for the sample size of 200 observations with 9 outcome classes. Columns 1 to 5 specify the DGP identifier and its features, namely 1000 additional noise variables (noise), nonlinear effects (nonlin), high correlation among covariates (multi) and randomly spaced thresholds (random). The sixth column Statistic shows the mean and the standard deviation of the accuracy measure for all methods. Additionally, t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and all the other methods
Empirical results
In this section we present more detailed and supplementary results for the empirical analysis (Sect. 5.5) discussed in the main text. In the following, the descriptive statistics for the considered datasets and the results for the prediction accuracy are summarized (Tables 21, 22, 23, 24, 25).
Descriptive statistics
Table 21
Descriptive statistics: mammography dataset

| Variable | Type | Mean | SD | Median | Min | Max |
|----------|------|------|----|--------|-----|-----|
| SYMPT | Categorical | 2.97 | 0.95 | 3.00 | 1.00 | 4.00 |
| PB | Numeric | 7.56 | 2.10 | 7.00 | 5.00 | 17.00 |
| HIST | Categorical | 1.11 | 0.31 | 1.00 | 1.00 | 2.00 |
| BSE | Categorical | 1.87 | 0.34 | 2.00 | 1.00 | 2.00 |
| DECT | Categorical | 2.66 | 0.56 | 3.00 | 1.00 | 3.00 |
| y | Categorical | 1.61 | 0.77 | 1.00 | 1.00 | 3.00 |
Table 22
Descriptive statistics: nhanes dataset

| Variable | Type | Mean | SD | Median | Min | Max |
|----------|------|------|----|--------|-----|-----|
| sex | Categorical | 1.51 | 0.50 | 2.00 | 1.00 | 2.00 |
| race | Categorical | 2.87 | 1.00 | 3.00 | 1.00 | 5.00 |
| country_of_birth | Categorical | 1.34 | 0.79 | 1.00 | 1.00 | 4.00 |
| education | Categorical | 3.37 | 1.24 | 3.00 | 1.00 | 5.00 |
| marital_status | Categorical | 2.31 | 1.74 | 1.00 | 1.00 | 6.00 |
| waistcircum | Numeric | 100.37 | 16.37 | 99.40 | 61.60 | 176.70 |
| Cholesterol | Numeric | 196.89 | 41.59 | 193.00 | 97.00 | 432.00 |
| WBCcount | Numeric | 7.30 | 2.88 | 6.90 | 1.60 | 83.20 |
| AcuteIllness | Categorical | 1.25 | 0.43 | 1.00 | 1.00 | 2.00 |
| depression | Categorical | 1.39 | 0.76 | 1.00 | 1.00 | 4.00 |
| ToothCond | Categorical | 3.05 | 1.24 | 3.00 | 1.00 | 5.00 |
| sleepTrouble | Categorical | 2.28 | 1.28 | 2.00 | 1.00 | 5.00 |
| wakeUp | Categorical | 2.41 | 1.30 | 2.00 | 1.00 | 5.00 |
| cig | Categorical | 1.51 | 0.50 | 2.00 | 1.00 | 2.00 |
| diabetes | Categorical | 1.14 | 0.34 | 1.00 | 1.00 | 2.00 |
| asthma | Categorical | 1.15 | 0.36 | 1.00 | 1.00 | 2.00 |
| heartFailure | Categorical | 1.03 | 0.16 | 1.00 | 1.00 | 2.00 |
| stroke | Categorical | 1.03 | 0.18 | 1.00 | 1.00 | 2.00 |
| chronicBronchitis | Categorical | 1.07 | 0.26 | 1.00 | 1.00 | 2.00 |
| alcohol | Numeric | 3.93 | 20.18 | 2.00 | 0.00 | 365.00 |
| heavyDrinker | Categorical | 1.17 | 0.37 | 1.00 | 1.00 | 2.00 |
| medicalPlaceToGo | Categorical | 1.92 | 0.67 | 2.00 | 1.00 | 5.00 |
| BPsys | Numeric | 124.44 | 18.62 | 122.00 | 78.00 | 230.00 |
| BPdias | Numeric | 71.18 | 11.84 | 72.00 | 10.00 | 114.00 |
| age | Numeric | 49.96 | 16.68 | 50.00 | 20.00 | 80.00 |
| BMI | Numeric | 29.33 | 6.66 | 28.32 | 14.20 | 73.43 |
| y | Categorical | 2.77 | 1.00 | 3.00 | 1.00 | 5.00 |
Table 23
Descriptive statistics: supportstudy dataset

| Variable | Type | Mean | SD | Median | Min | Max |
|----------|------|------|----|--------|-----|-----|
| age | Numeric | 62.80 | 16.27 | 65.29 | 20.30 | 100.13 |
| sex | Categorical | 1.54 | 0.50 | 2.00 | 1.00 | 2.00 |
| dzgroup | Categorical | 3.23 | 2.48 | 2.00 | 1.00 | 8.00 |
| num.co | Numeric | 1.90 | 1.34 | 2.00 | 0.00 | 7.00 |
| scoma | Numeric | 12.45 | 25.29 | 0.00 | 0.00 | 100.00 |
| charges | Numeric | 59307.91 | 86620.70 | 28416.50 | 1635.75 | 740010.00 |
| avtisst | Numeric | 23.53 | 13.60 | 20.00 | 1.67 | 64.00 |
| race | Categorical | 1.36 | 0.88 | 1.00 | 1.00 | 5.00 |
| meanbp | Numeric | 84.52 | 27.64 | 77.00 | 0.00 | 180.00 |
| wblc | Numeric | 12.62 | 9.31 | 10.50 | 0.05 | 100.00 |
| hrt | Numeric | 98.59 | 32.93 | 102.50 | 0.00 | 300.00 |
| resp | Numeric | 23.60 | 9.54 | 24.00 | 0.00 | 64.00 |
| temp | Numeric | 37.08 | 1.25 | 36.70 | 32.50 | 41.20 |
| crea | Numeric | 1.80 | 1.74 | 1.20 | 0.30 | 11.80 |
| sod | Numeric | 137.64 | 6.34 | 137.00 | 118.00 | 175.00 |
| y | Categorical | 2.90 | 1.81 | 2.00 | 1.00 | 5.00 |
Table 24
Descriptive statistics: vlbw dataset

| Variable | Type | Mean | SD | Median | Min | Max |
|----------|------|------|----|--------|-----|-----|
| race | Categorical | 1.57 | 0.50 | 2.00 | 1.00 | 2.00 |
| bwt | Numeric | 1094.89 | 260.44 | 1140.00 | 430.00 | 1500.00 |
| inout | Categorical | 1.03 | 0.16 | 1.00 | 1.00 | 2.00 |
| twn | Categorical | 1.24 | 0.43 | 1.00 | 1.00 | 2.00 |
| lol | Numeric | 7.73 | 19.47 | 3.00 | 0.00 | 192.00 |
| magsulf | Categorical | 1.18 | 0.39 | 1.00 | 1.00 | 2.00 |
| meth | Categorical | 1.44 | 0.50 | 1.00 | 1.00 | 2.00 |
| toc | Categorical | 1.24 | 0.43 | 1.00 | 1.00 | 2.00 |
| delivery | Categorical | 1.41 | 0.49 | 1.00 | 1.00 | 2.00 |
| sex | Categorical | 1.50 | 0.50 | 1.00 | 1.00 | 2.00 |
| y | Categorical | 5.09 | 2.58 | 6.00 | 1.00 | 9.00 |
Table 25
Descriptive statistics: winequality dataset

| Variable | Type | Mean | SD | Median | Min | Max |
|----------|------|------|----|--------|-----|-----|
| fixed.acidity | Numeric | 6.85 | 0.84 | 6.80 | 3.80 | 14.20 |
| volatile.acidity | Numeric | 0.28 | 0.10 | 0.26 | 0.08 | 1.10 |
| citric.acid | Numeric | 0.33 | 0.12 | 0.32 | 0.00 | 1.66 |
| residual.sugar | Numeric | 6.39 | 5.07 | 5.20 | 0.60 | 65.80 |
| chlorides | Numeric | 0.05 | 0.02 | 0.04 | 0.01 | 0.35 |
| free.sulfur.dioxide | Numeric | 35.31 | 17.01 | 34.00 | 2.00 | 289.00 |
| total.sulfur.dioxide | Numeric | 138.38 | 42.51 | 134.00 | 9.00 | 440.00 |
| density | Numeric | 0.99 | 0.00 | 0.99 | 0.99 | 1.04 |
| pH | Numeric | 3.19 | 0.15 | 3.18 | 2.72 | 3.82 |
| sulphates | Numeric | 0.49 | 0.11 | 0.47 | 0.22 | 1.08 |
| alcohol | Numeric | 10.51 | 1.23 | 10.40 | 8.00 | 14.20 |
| y | Categorical | 3.87 | 0.88 | 4.00 | 1.00 | 6.00 |
Prediction accuracy
Tables 26 and 27 summarize in detail the results of the prediction accuracy exercise using real datasets for the ARPS and the AMSE, respectively. The first column Data specifies the dataset, the second column Class defines the number of outcome classes of the dependent variable and the third column Size indicates the number of observations. As with the simulation results, the column Statistic contains the summary statistics and the statistical test results for the equality of means between the results of the Ordered Forest and all the other methods.
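For clarity, the following minimal R sketch illustrates how such accuracy measures can be computed from a matrix of predicted class probabilities. The function names and the exact normalization constants are our own illustrative choices and may differ in detail from the implementation used for the paper.

```r
# Illustrative ranked probability score (RPS) and Brier-type MSE for ordered
# class probabilities; prob is an N x M matrix of predicted probabilities,
# y a vector of observed classes coded as integers 1,...,M.
rps <- function(prob, y) {
  M <- ncol(prob)
  ind <- outer(y, seq_len(M), "==") * 1            # one-hot matrix of outcomes
  cum_diff <- t(apply(prob - ind, 1, cumsum))      # cumulative prob. differences
  mean(rowSums(cum_diff[, -M, drop = FALSE]^2) / (M - 1))
}

mse <- function(prob, y) {
  M <- ncol(prob)
  ind <- outer(y, seq_len(M), "==") * 1
  mean(rowSums((prob - ind)^2) / M)
}
```

Averaging these observation-level scores over the cross-validation folds then yields the ARPS and AMSE reported below.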
Table 26
Empirical results: Accuracy Measure = ARPS

| Data | Class | Size | Statistic | Ologit | Naive | Ordinal | Cond. | Ordered | Ordered* | Multi | Multi* |
|------|-------|------|-----------|--------|-------|---------|-------|---------|----------|-------|--------|
| mammography | 3 | 412 | mean | 0.1776 | 0.2251 | 0.2089 | 0.1767 | 0.1823 | 0.1766 | 0.1826 | 0.1767 |
| | | | st.dev. | 0.0010 | 0.0027 | 0.0021 | 0.0013 | 0.0018 | 0.0008 | 0.0019 | 0.0007 |
| | | | t-test | 1.0000 | 0.0000 | 0.0000 | 1.0000 | – | 1.0000 | 0.3999 | 1.0000 |
| | | | wilcox-test | 1.0000 | 0.0000 | 0.0000 | 1.0000 | – | 1.0000 | 0.3153 | 1.0000 |
| nhanes | 5 | 1914 | mean | 0.1088 | 0.1089 | 0.1100 | 0.1085 | 0.1103 | 0.1137 | 0.1104 | 0.1159 |
| | | | st.dev. | 0.0004 | 0.0003 | 0.0004 | 0.0001 | 0.0002 | 0.0001 | 0.0002 | 0.0001 |
| | | | t-test | 1.0000 | 1.0000 | 0.9839 | 1.0000 | – | 0.0000 | 0.2106 | 0.0000 |
| | | | wilcox-test | 1.0000 | 1.0000 | 0.9738 | 1.0000 | – | 0.0000 | 0.2179 | 0.0000 |
| supportstudy | 5 | 798 | mean | 0.1872 | 0.1849 | 0.1834 | 0.1800 | 0.1823 | 0.1931 | 0.1857 | 0.1944 |
| | | | st.dev. | 0.0011 | 0.0010 | 0.0009 | 0.0008 | 0.0008 | 0.0003 | 0.0007 | 0.0004 |
| | | | t-test | 0.0000 | 0.0000 | 0.0052 | 1.0000 | – | 0.0000 | 0.0000 | 0.0000 |
| | | | wilcox-test | 0.0000 | 0.0000 | 0.0073 | 1.0000 | – | 0.0000 | 0.0000 | 0.0000 |
| vlbw | 9 | 218 | mean | 0.1595 | 0.1713 | 0.1724 | 0.1603 | 0.1686 | 0.1623 | 0.1685 | 0.1642 |
| | | | st.dev. | 0.0011 | 0.0026 | 0.0030 | 0.0014 | 0.0021 | 0.0005 | 0.0020 | 0.0003 |
| | | | t-test | 1.0000 | 0.0100 | 0.0023 | 1.0000 | – | 1.0000 | 0.5143 | 1.0000 |
| | | | wilcox-test | 1.0000 | 0.0116 | 0.0010 | 1.0000 | – | 1.0000 | 0.5733 | 1.0000 |
| winequality | 6 | 4893 | mean | 0.0756 | 0.0501 | 0.0503 | 0.0596 | 0.0507 | 0.0673 | 0.0504 | 0.0683 |
| | | | st.dev. | 0.0000 | 0.0003 | 0.0002 | 0.0001 | 0.0002 | 0.0001 | 0.0002 | 0.0000 |
| | | | t-test | 0.0000 | 1.0000 | 0.9992 | 0.0000 | – | 0.0000 | 0.9971 | 0.0000 |
| | | | wilcox-test | 0.0000 | 0.9999 | 0.9986 | 0.0000 | – | 0.0000 | 0.9966 | 0.0000 |
Notes: Table reports the average measures of the RPS based on 10 repetitions of 10-fold cross-validation. The fourth column Statistic shows the mean and the standard deviation of the accuracy measure for all methods. Additionally, t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and all the other methods
Table 27
Empirical results: Accuracy Measure = AMSE

| Data | Class | Size | Statistic | Ologit | Naive | Ordinal | Cond. | Ordered | Ordered* | Multi | Multi* |
|------|-------|------|-----------|--------|-------|---------|-------|---------|----------|-------|--------|
| mammography | 3 | 412 | mean | 0.1754 | 0.2593 | 0.2222 | 0.1720 | 0.1766 | 0.1726 | 0.1770 | 0.1726 |
| | | | st.dev. | 0.0007 | 0.0025 | 0.0031 | 0.0008 | 0.0012 | 0.0004 | 0.0013 | 0.0004 |
| | | | t-test | 0.9923 | 0.0000 | 0.0000 | 1.0000 | – | 1.0000 | 0.2467 | 1.0000 |
| | | | wilcox-test | 0.9943 | 0.0000 | 0.0000 | 1.0000 | – | 1.0000 | 0.2179 | 1.0000 |
| nhanes | 5 | 1914 | mean | 0.1310 | 0.1309 | 0.1332 | 0.1304 | 0.1332 | 0.1329 | 0.1319 | 0.1343 |
| | | | st.dev. | 0.0003 | 0.0003 | 0.0003 | 0.0002 | 0.0003 | 0.0001 | 0.0003 | 0.0001 |
| | | | t-test | 1.0000 | 1.0000 | 0.7067 | 1.0000 | – | 0.9936 | 1.0000 | 0.0000 |
| | | | wilcox-test | 1.0000 | 1.0000 | 0.6579 | 1.0000 | – | 0.9955 | 1.0000 | 0.0000 |
| supportstudy | 5 | 798 | mean | 0.1124 | 0.1110 | 0.1094 | 0.1078 | 0.1088 | 0.1129 | 0.1101 | 0.1135 |
| | | | st.dev. | 0.0005 | 0.0004 | 0.0004 | 0.0004 | 0.0004 | 0.0002 | 0.0003 | 0.0002 |
| | | | t-test | 0.0000 | 0.0000 | 0.0020 | 1.0000 | – | 0.0000 | 0.0000 | 0.0000 |
| | | | wilcox-test | 0.0000 | 0.0000 | 0.0008 | 0.9999 | – | 0.0000 | 0.0000 | 0.0000 |
| vlbw | 9 | 218 | mean | 0.0944 | 0.0986 | 0.0990 | 0.0956 | 0.1008 | 0.0958 | 0.1006 | 0.0956 |
| | | | st.dev. | 0.0002 | 0.0008 | 0.0009 | 0.0004 | 0.0008 | 0.0003 | 0.0009 | 0.0002 |
| | | | t-test | 1.0000 | 1.0000 | 0.9999 | 1.0000 | – | 1.0000 | 0.7224 | 1.0000 |
| | | | wilcox-test | 1.0000 | 1.0000 | 0.9999 | 1.0000 | – | 1.0000 | 0.7821 | 1.0000 |
| winequality | 6 | 4893 | mean | 0.1001 | 0.0692 | 0.0698 | 0.0831 | 0.0702 | 0.0906 | 0.0693 | 0.0913 |
| | | | st.dev. | 0.0000 | 0.0003 | 0.0003 | 0.0001 | 0.0003 | 0.0001 | 0.0003 | 0.0001 |
| | | | t-test | 0.0000 | 1.0000 | 0.9960 | 0.0000 | – | 0.0000 | 1.0000 | 0.0000 |
| | | | wilcox-test | 0.0000 | 1.0000 | 0.9974 | 0.0000 | – | 0.0000 | 1.0000 | 0.0000 |
Notes: Table reports the average measures of the MSE based on 10 repetitions of 10-fold cross-validation. The fourth column Statistic shows the mean and the standard deviation of the accuracy measure for all methods. Additionally, t-test and wilcox-test contain the p-values of the parametric t-test as well as the nonparametric Wilcoxon test for the equality of means between the results of the Ordered Forest and all the other methods
Software implementation
The Monte Carlo study has been conducted using the R statistical software (R Core Team 2021) in version 3.5.2 (Eggshell Igloo) and the respective packages implementing the estimators used. With regard to the forest-based estimators, the main tuning parameters, namely the number of trees, the number of randomly chosen covariates and the minimum leaf size, have been specified according to the values in Table 1 in the main text.
Table 28
Overview of software packages and tuning parameters

| Method | Ologit | Naive | Ordinal | Conditional | Ordered | Ordered* | Multi | Multi* |
|--------|--------|-------|---------|-------------|---------|----------|-------|--------|
| Package | rms | ordinalForest | ordinalForest | party | ranger | grf | ranger | grf |
| Function | lrm | ordfor | ordfor | cforest | ranger | regression_forest | ranger | regression_forest |
| Max. iterations | 25 | – | – | – | – | – | – | – |
| Trees | – | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
| Random subset | – | \(\sqrt{p}\) | \(\sqrt{p}\) | \(\sqrt{p}\) | \(\sqrt{p}\) | \(\sqrt{p}\) | \(\sqrt{p}\) | \(\sqrt{p}\) |
| Leaf size | – | 5 | 5 | 0 | 5 | 5 | 5 | 5 |
| \(B_{sets}\) | – | 0 | 1000 | – | – | – | – | – |
| \(B_{prior}\) | – | 0 | 100 | – | – | – | – | – |
| Performance | – | equal | equal | – | – | – | – | – |
| \(S_{best}\) | – | 0 | 10 | – | – | – | – | – |
In terms of the particular R packages used, the ordered logistic regression has been implemented using the rms package (version 5.1-3) written by Harrell (2019). The respective lrm function for fitting the Ordered Logit has been used with the default parameters, except for setting the maximum number of iterations to maxit=25, as for some of the DGPs the Ordered Logit experienced convergence issues. Next, the naive forest and the Ordinal Forest have been applied based on the ordinalForest package in version 2.3 (Hornung 2019b) with the ordfor function. As described in Appendix A.3, the Ordinal Forest introduces additional tuning parameters, for which we use the default values as suggested in the package manual. Further, the Conditional Forest has been estimated with the package party in version 1.3-1 (Hothorn et al. 2006; Strobl et al. 2007, 2008). Regarding the choice of the tuning parameters, we rely on the defaults of the cforest function. A particularity of the Conditional Forest, owing to its splitting criterion which differs conceptually from that of a standard regression forest, is the choice of the stopping rule. This is controlled by the significance level \(\alpha \) (see Appendix A.2 for details). However, in order to grow deep trees, we follow the suggestion in the package manual to set mincriterion\(=0\), which has also been used in the simulation study conducted in Janitza et al. (2016). Lastly, the Ordered Forest as well as the Multinomial Forest algorithms are implemented using the package ranger in version 0.11.1 (Wright and Ziegler 2017) with the default hyperparameters. The honest versions of these two estimators rely on the grf package in version 0.10.2 (Tibshirani et al. 2018), likewise with the default hyperparameters. A detailed overview of the packages with the corresponding tuning parameters is provided in Table 28.
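To make the configuration in Table 28 concrete, the following hedged sketch shows how the respective estimators could be invoked with these tuning parameters. The argument names follow the package manuals of the cited versions; `train`, `train_ind`, `ind`, `X` and `p` (the number of covariates) are placeholders, we take \(B_{sets}\), \(B_{prior}\) and \(S_{best}\) to correspond to the nsets, ntreeperdiv and nbest arguments of ordfor, and the exact calls used for the paper may differ in detail.

```r
# Sketch of the estimator calls corresponding to Table 28; train is a
# data.frame with the ordered outcome y, p denotes the number of covariates.
library(rms); library(ordinalForest); library(party); library(ranger); library(grf)

ologit  <- lrm(y ~ ., data = train, maxit = 25)                   # Ordered Logit
naive   <- ordfor(depvar = "y", datai = train, naive = TRUE,      # naive forest
                  ntreefinal = 1000)
ordinal <- ordfor(depvar = "y", datai = train, nsets = 1000,      # Ordinal Forest
                  ntreeperdiv = 100, ntreefinal = 1000,
                  perffunction = "equal", nbest = 10)
cond    <- cforest(y ~ ., data = train,                           # Conditional Forest
                   controls = cforest_unbiased(ntree = 1000,
                                               mtry = floor(sqrt(p)),
                                               mincriterion = 0))
# building blocks of the Ordered/Multinomial Forest: one regression forest
# per binary indicator outcome ind derived from y (placeholder data)
block        <- ranger(ind ~ ., data = train_ind, num.trees = 1000,
                       mtry = floor(sqrt(p)), min.node.size = 5)
block_honest <- regression_forest(X, train_ind$ind, num.trees = 1000,
                                  mtry = floor(sqrt(p)), min.node.size = 5,
                                  honesty = TRUE)                 # honest version
```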
Furthermore, Tables 29 and 30 compare the absolute and relative computation time of the respective methods. For comparison purposes, we measure the computation time for the four main DGPs presented in Sect. 5.4 of the main text, namely the simple DGP in the low- and high-dimensional case as well as the complex DGP in the low- and high-dimensional case, for both the small sample size (\(N=200\)) and the large sample size (\(N=800\)) for all considered numbers of outcome classes. We estimate the model based on the training set and predict the class probabilities for a test set of size \(N=10'000\) as in the main simulation. We repeat this procedure 10 times and report the average computation time. The tuning parameters and the software implementations are chosen as defined in Table 1 in the main text and Table 28 herein, respectively. All simulations are computed on a 64-Bit Windows machine with 4 cores (1.80GHz) and 16 GB of RAM.
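As an illustration of this timing protocol, a minimal sketch in R, with fit_fun and predict_fun as placeholders for any of the compared estimators:

```r
# Average wall-clock time of training plus prediction over 10 replications;
# fit_fun and predict_fun stand in for the respective estimator's interface.
time_method <- function(fit_fun, predict_fun, train, test, reps = 10) {
  mean(replicate(reps, {
    start <- Sys.time()
    model <- fit_fun(train)
    invisible(predict_fun(model, test))
    as.numeric(difftime(Sys.time(), start, units = "secs"))
  }))
}
```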
Table 29
Absolute computation time in seconds

| Class | Dim. | DGP | Size | Ologit | Naive | Ordinal | Cond. | Ordered | Ordered* | Multi | Multi* |
|-------|------|-----|------|--------|-------|---------|-------|---------|----------|-------|--------|
| 3 | Low | Simple | 200 | 0.01 | 1.22 | 10.33 | 46.61 | 0.62 | 1.24 | 0.91 | 1.86 |
| 3 | Low | Simple | 800 | 0.02 | 1.58 | 40.83 | 150.84 | 1.03 | 1.96 | 1.61 | 2.98 |
| 3 | Low | Complex | 200 | 0.02 | 1.19 | 11.93 | 47.43 | 0.63 | 1.26 | 0.98 | 1.92 |
| 3 | Low | Complex | 800 | 0.03 | 1.71 | 52.45 | 150.59 | 1.08 | 1.94 | 1.73 | 3.06 |
| 3 | High | Simple | 200 | – | 3.50 | 61.89 | 64.28 | 4.05 | 5.08 | 6.06 | 7.27 |
| 3 | High | Simple | 800 | – | 13.91 | 332.60 | 175.76 | 7.19 | 7.10 | 12.19 | 11.02 |
| 3 | High | Complex | 200 | – | 3.46 | 60.25 | 59.98 | 4.02 | 4.96 | 6.02 | 7.10 |
| 3 | High | Complex | 800 | – | 13.83 | 325.65 | 173.63 | 6.83 | 6.61 | 11.50 | 10.66 |
| 6 | Low | Simple | 200 | 0.02 | 1.88 | 12.79 | 46.80 | 1.47 | 3.00 | 1.74 | 3.52 |
| 6 | Low | Simple | 800 | 0.03 | 2.28 | 48.98 | 151.58 | 2.45 | 4.75 | 3.10 | 5.82 |
| 6 | Low | Complex | 200 | 0.03 | 1.85 | 14.75 | 46.97 | 1.56 | 3.12 | 1.85 | 3.66 |
| 6 | Low | Complex | 800 | 0.04 | 2.54 | 64.44 | 151.84 | 2.68 | 4.82 | 3.30 | 6.02 |
| 6 | High | Simple | 200 | – | 4.21 | 69.80 | 64.14 | 10.24 | 11.74 | 12.01 | 13.63 |
| 6 | High | Simple | 800 | – | 15.86 | 386.02 | 176.27 | 19.34 | 17.43 | 26.24 | 19.97 |
| 6 | High | Complex | 200 | – | 4.11 | 70.51 | 60.85 | 9.98 | 11.52 | 11.95 | 13.61 |
| 6 | High | Complex | 800 | – | 15.85 | 371.69 | 174.17 | 18.11 | 17.18 | 24.43 | 19.52 |
| 9 | Low | Simple | 200 | 0.03 | 2.32 | 20.53 | 46.70 | 2.27 | 4.71 | 2.44 | 5.03 |
| 9 | Low | Simple | 800 | 0.04 | 2.69 | 57.22 | 145.21 | 3.82 | 7.29 | 4.61 | 7.99 |
| 9 | Low | Complex | 200 | 0.03 | 2.29 | 22.86 | 47.36 | 2.40 | 4.83 | 2.65 | 5.28 |
| 9 | Low | Complex | 800 | 0.05 | 3.07 | 79.15 | 151.36 | 4.27 | 7.75 | 5.81 | 8.68 |
| 9 | High | Simple | 200 | – | 4.85 | 80.76 | 63.25 | 16.05 | 17.84 | 17.69 | 19.56 |
| 9 | High | Simple | 800 | – | 16.91 | 413.74 | 169.91 | 31.34 | 26.91 | 38.95 | 27.38 |
| 9 | High | Complex | 200 | – | 4.62 | 78.86 | 57.68 | 15.79 | 17.78 | 17.57 | 19.59 |
| 9 | High | Complex | 800 | – | 18.10 | 437.04 | 175.07 | 31.12 | 27.33 | 37.59 | 28.16 |
Notes: Table reports the average absolute computation time in seconds based on 10 simulation replications of training and prediction. The first column denotes the number of outcome classes. Columns 2 and 3 specify the dimension and the DGP, respectively. The fourth column contains the number of observations in the training set. The prediction set consists of 10 000 observations
The results reveal the expected pattern for the Ordered Forest. The more outcome classes, the longer the computation time, as by definition of the algorithm more forests have to be estimated (as the sketch below illustrates). Furthermore, we also observe a longer computation time if the number of observations and/or the number of considered splitting covariates increases, which is expected behaviour as well. However, the computation time is not sensitive to the particular DGP, which it should not be either. The latter two patterns hold for all considered methods. In comparison to the other forest-based methods, the computational advantage of the Ordered Forest becomes apparent. The Ordered Forest outperforms the Ordinal and the Conditional Forest in all cases. In some cases the Ordered Forest is more than 100 times faster, and even in the closest cases it is more than 3 times faster than these two. In absolute terms, this translates to a computation time of around 1 s for the Ordered Forest, compared to around 50 s for the Ordinal Forest and around 150 s for the Conditional Forest in the most extreme case. Conversely, in the closest case, the computation time for the Ordered Forest is around 15 s, while it is around 80 s for the Ordinal Forest and around 60 s for the Conditional Forest. This points to the additional computational burden of the Ordinal and the Conditional Forest. The only exception is the naive forest, which does not include any optimization step. Furthermore, we observe a slightly longer computation time for the Multinomial Forest in comparison to the Ordered Forest, which is due to one extra forest being estimated. The honest versions of the two forests take somewhat longer in general, but this seems to reverse once larger samples are considered (in terms of both the number of observations and the number of considered covariates).
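The dependence on the number of outcome classes follows directly from the structure of the estimator: one regression forest is fitted per cumulative outcome indicator. A simplified sketch using ranger, omitting honesty, subsampling and inference and hence not a verbatim reproduction of the actual implementation:

```r
# Simplified sketch of the Ordered Forest class probability predictions:
# M - 1 forests estimate the cumulative probabilities P(Y <= m | X), whose
# first differences yield the class probabilities; y is coded as 1,...,M.
library(ranger)

ordered_forest_probs <- function(X, y, X_test, num.trees = 1000) {
  M <- length(unique(y))
  cum_hat <- sapply(seq_len(M - 1), function(m) {
    dat <- data.frame(ind = as.numeric(y <= m), X)
    forest <- ranger(ind ~ ., data = dat, num.trees = num.trees)
    predict(forest, data = data.frame(X_test))$predictions
  })
  cum_hat <- cbind(cum_hat, 1)                       # P(Y <= M | X) = 1
  probs <- cbind(cum_hat[, 1], t(diff(t(cum_hat))))  # first differences
  probs[probs < 0] <- 0                              # truncate negative differences
  probs / rowSums(probs)                             # renormalize to sum to one
}
```

The Multinomial Forest analogously requires one forest per class indicator, i.e. M instead of M - 1 forests, which is consistent with the slightly longer computation times observed above.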
Table 30
Relative computation time

| Class | Dim. | DGP | Size | Ologit | Naive | Ordinal | Cond. | Ordered | Ordered* | Multi | Multi* |
|-------|------|-----|------|--------|-------|---------|-------|---------|----------|-------|--------|
| 3 | Low | Simple | 200 | 0.02 | 1.98 | 16.76 | 75.66 | 1 | 2.02 | 1.48 | 3.02 |
| 3 | Low | Simple | 800 | 0.02 | 1.53 | 39.68 | 146.59 | 1 | 1.91 | 1.56 | 2.90 |
| 3 | Low | Complex | 200 | 0.03 | 1.87 | 18.79 | 74.70 | 1 | 1.99 | 1.55 | 3.03 |
| 3 | Low | Complex | 800 | 0.03 | 1.59 | 48.79 | 140.09 | 1 | 1.81 | 1.61 | 2.84 |
| 3 | High | Simple | 200 | – | 0.86 | 15.27 | 15.86 | 1 | 1.25 | 1.50 | 1.79 |
| 3 | High | Simple | 800 | – | 1.94 | 46.28 | 24.46 | 1 | 0.99 | 1.70 | 1.53 |
| 3 | High | Complex | 200 | – | 0.86 | 14.99 | 14.92 | 1 | 1.23 | 1.50 | 1.77 |
| 3 | High | Complex | 800 | – | 2.02 | 47.68 | 25.42 | 1 | 0.97 | 1.68 | 1.56 |
| 6 | Low | Simple | 200 | 0.02 | 1.28 | 8.73 | 31.95 | 1 | 2.05 | 1.19 | 2.40 |
| 6 | Low | Simple | 800 | 0.01 | 0.93 | 19.95 | 61.74 | 1 | 1.94 | 1.26 | 2.37 |
| 6 | Low | Complex | 200 | 0.02 | 1.18 | 9.45 | 30.09 | 1 | 2.00 | 1.19 | 2.34 |
| 6 | Low | Complex | 800 | 0.02 | 0.94 | 24.02 | 56.59 | 1 | 1.80 | 1.23 | 2.24 |
| 6 | High | Simple | 200 | – | 0.41 | 6.81 | 6.26 | 1 | 1.15 | 1.17 | 1.33 |
| 6 | High | Simple | 800 | – | 0.82 | 19.96 | 9.11 | 1 | 0.90 | 1.36 | 1.03 |
| 6 | High | Complex | 200 | – | 0.41 | 7.07 | 6.10 | 1 | 1.16 | 1.20 | 1.36 |
| 6 | High | Complex | 800 | – | 0.88 | 20.52 | 9.62 | 1 | 0.95 | 1.35 | 1.08 |
| 9 | Low | Simple | 200 | 0.01 | 1.02 | 9.03 | 20.54 | 1 | 2.07 | 1.07 | 2.21 |
| 9 | Low | Simple | 800 | 0.01 | 0.70 | 14.98 | 38.01 | 1 | 1.91 | 1.21 | 2.09 |
| 9 | Low | Complex | 200 | 0.01 | 0.95 | 9.51 | 19.69 | 1 | 2.01 | 1.10 | 2.19 |
| 9 | Low | Complex | 800 | 0.01 | 0.72 | 18.55 | 35.48 | 1 | 1.82 | 1.36 | 2.03 |
| 9 | High | Simple | 200 | – | 0.30 | 5.03 | 3.94 | 1 | 1.11 | 1.10 | 1.22 |
| 9 | High | Simple | 800 | – | 0.54 | 13.20 | 5.42 | 1 | 0.86 | 1.24 | 0.87 |
| 9 | High | Complex | 200 | – | 0.29 | 5.00 | 3.65 | 1 | 1.13 | 1.11 | 1.24 |
| 9 | High | Complex | 800 | – | 0.58 | 14.04 | 5.63 | 1 | 0.88 | 1.21 | 0.90 |
Notes: Table reports the average relative computation time with regard to the Ordered Forest estimator based on 10 simulation replications of training and prediction. The first column denotes the number of outcome classes. Columns 2 and 3 specify the dimension and the DGP, respectively. The fourth column contains the number of observations in the training set. The prediction set consists of 10 000 observations
Generally, the sensitivity of the computation time appears to be very different across the considered methods. For the Ordered Forest as well as the Multinomial Forest, including their honest versions, the most important factor is clearly the number of outcome classes. For the naive and the Ordinal Forest the number of observations seems to be most decisive, while for the Conditional Forest, paradoxically, the size of the prediction set is most relevant. Overall, the above results support the theoretical argument that the Ordered Forest is computationally advantageous in comparison to the Ordinal and the Conditional Forest.
Empirical application
In this appendix we provide the descriptive statistics for the dataset used in the empirical application of the main text as well as supplementary results containing the estimation of marginal effects.
Descriptive statistics
Table 31
Descriptive statistics: NHIS dataset

| Variable | Type | Mean | SD | Median | Min | Max |
|----------|------|------|----|--------|-----|-----|
| Health status | Categorical | 3.93 | 0.95 | 4.00 | 1.00 | 5.00 |
| Health insurance | Categorical | 0.84 | 0.37 | 1.00 | 0.00 | 1.00 |
| Female | Categorical | 0.50 | 0.50 | 0.50 | 0.00 | 1.00 |
| Non-white | Categorical | 0.20 | 0.40 | 0.00 | 0.00 | 1.00 |
| Age | Numeric | 42.72 | 8.70 | 43.00 | 26.00 | 59.00 |
| Education | Numeric | 13.74 | 2.99 | 14.00 | 0.00 | 18.00 |
| Family size | Numeric | 3.63 | 1.37 | 4.00 | 2.00 | 18.00 |
| Employed | Categorical | 0.82 | 0.39 | 1.00 | 0.00 | 1.00 |
| Income | Categorical | 94178.04 | 56738.46 | 85985.78 | 19282.93 | 167844.53 |
Table 32
Descriptive statistics by class: NHIS dataset

| Variable | Poor | Fair | Good | Very Good | Excellent |
|----------|------|------|------|-----------|-----------|
| Health status | 1.14 | 5.66 | 25.14 | 34.92 | 33.13 |
| Health insurance | 79.07 | 71.50 | 77.88 | 87.52 | 87.76 |
| Female | 49.77 | 51.08 | 49.28 | 50.43 | 49.92 |
| Non-white | 31.63 | 23.89 | 22.84 | 18.18 | 18.21 |
| Age | 47.65 | 45.37 | 43.75 | 42.73 | 41.30 |
| Education | 12.11 | 12.20 | 12.89 | 13.97 | 14.46 |
| Family size | 3.33 | 3.68 | 3.68 | 3.59 | 3.64 |
| Employed | 28.84 | 65.57 | 80.99 | 84.35 | 84.21 |
| Income | 53409.03 | 62473.99 | 78957.11 | 99685.45 | 106743.21 |
| N | 215 | 1063 | 4724 | 6562 | 6226 |
| share in % | 1.14 | 5.66 | 25.14 | 34.92 | 33.13 |
Notes: Columns correspond to the health status classes. Means of the variables for the respective outcome class are displayed. Shares for dummy variables are indicated in %
Table 33
Differences in health status based on health insurance: NHIS dataset
Notes: Table shows the comparison of the marginal effects at mean in percentage points between the Ordered Forest and the Ordered Logit. The effects are estimated for all classes, together with the corresponding standard errors, t-values and p-values. The standard errors for the Ordered Forest are estimated using the weight-based inference and for the Ordered Logit are obtained via the delta method
Further, to describe the differences in health status based on health insurance, we inspect the ordered class probabilities of the self-reported health status for individuals with and without a private health insurance contract. The descriptive results are reported in Table 33, including statistical evidence for the differences between the two groups. The descriptive evidence suggests that individuals with health insurance have a higher probability of being in excellent or very good health and at the same time a lower probability of being in good or fair health. This evidence is both statistically precise and economically relevant. Furthermore, individuals with health insurance also seem to have a lower probability of being in poor health. However, the evidence for this is less pronounced, both in statistical and in economic terms.
Marginal effects
In what follows, the results for the marginal effects at mean are presented for the considered NHIS dataset. As in the main text, the effects are computed for each outcome class of the dependent variable, both for the Ordered Forest and for the Ordered Logit. The estimations are done in R version 3.6.1 using the orf package (Lechner and Okasa 2019) in version 0.1.3 for the Ordered Forest and the oglmx package (Carroll 2018) in version 3.0.0.0 for the Ordered Logit (Table 34).
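For reference, a minimal usage sketch of the orf package for this estimation; the argument names (eval, window, honesty, inference) follow the package documentation cited above, while X and Y as well as the chosen values are placeholders, and the exact defaults may differ across versions.

```r
# Hedged sketch of estimating the Ordered Forest marginal effects at mean
# with the orf package; X is the covariate matrix, Y the ordered outcome.
library(orf)

forest <- orf(X, Y, num.trees = 1000, honesty = TRUE, inference = TRUE)
me     <- margins(forest, eval = "atmean", window = 0.1)  # effects at mean
print(me)   # marginal effects with standard errors, t- and p-values per class
```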
Source codes of both R and Python versions of the estimator are available on GitHub. Additionally, an implementation of the estimator in GAUSS is available online and on ResearchGate.
A different strand of the literature has focused particularly on adjustments towards ordered classification rather than regression, which excludes the estimation of the conditional probabilities as is the case in the parametric ordered choice models. See, for example, Kramer et al. (2001), who propose a simple procedure for constructing a distance-sensitive classification learner, or Piccarreta (2008), who suggests the use of alternative objective functions. Both of these measures put a higher penalty on misclassification the more distant the predicted category is from the true one.
Wager and Athey (2018) point out that the leaves need to be relatively small in all dimensions of the covariate space. This implies that the high-dimensional settings are not considered and hence the theoretical asymptotic results might not hold in such settings.
For example, covariates need to be independently distributed with a density that is bounded away from 0 and infinity. Notice that this condition rules out categorical covariates if a certain category has \(p(x)=0\). For a detailed description of the conditions as well as of the proof, see Theorem 1 in Wager and Athey (2018).
The computational speed of the regression forests depends on many tuning parameters, of which the number of bootstrap replications, i.e. the number of grown trees, is the most decisive one.
We have additionally experimented with \(h=0.5\) and \(h=1\), which resulted in incrementally larger effect sizes. Generally, the lower the window size h, the more local the effect, and the higher the window size h, the more global the effect becomes. As Burden and Faires (2011) point out, the window size h should not be chosen too small due to the instability of numerical derivative approximations. In the software implementation in the R package orf, users can control this parameter by changing the argument window. See Lechner and Okasa (2019) for more details.
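For completeness, the two-sided numerical derivative underlying the marginal effect of a continuous covariate \(x_k\) on the probability of class m can be written, in our notation, as below; this is a sketch of the construction described above, with the window expressed in multiples h of the covariate's standard deviation, not a verbatim reproduction of the paper's definition:

\[
\widehat{ME}_k^m(x) = \frac{\hat{P}\big(Y = m \mid x_k + h \cdot SD(x_k),\, x_{-k}\big) - \hat{P}\big(Y = m \mid x_k - h \cdot SD(x_k),\, x_{-k}\big)}{2 \cdot h \cdot SD(x_k)}
\]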
The asymptotic normality holds as long as the predictions are constructed by averaging the binary outcomes and thus resulting in a probability estimate ensuring that the predictions remain real valued as noted by Wager and Athey (2018) as well as Mentch and Hooker (2016). This excludes the classification forest, where the predictions are constructed via majority voting.
The so-called cross-fitting to avoid the efficiency loss as suggested by Chernozhukov et al. (2018) does not appear to be applicable here as the independence of the weights and the outcomes would not be ensured.
Here, we estimate the variance with sample counterparts. An alternative approach, as in Lechner (2018), would be to first apply the law of total variance and, second, estimate the conditional moments by nonparametric methods. However, due to the presence of the covariance term the conditioning set contains two variables which causes the convergence rate to decrease and hence such variance estimation might even result in less precise estimates, depending on the sample size.
The thresholds are determined beforehand according to fixed threshold quantiles \(\alpha _m^q\) of a large sample of \(N=1'000'000\) observations of the latent \(Y_i^*\) from the very same DGP to reflect the realized outcome distribution and then used afterwards in the simulations as a part of the deterministic component.
Note that with too high multicollinearity, the Ordered Logit model breaks down. By restricting the level of correlation among the covariates, the logit model can still be reasonably compared to the other competing methods.
For the low-dimensional setting we have \(n=4\) options for the DGP settings. Since we can choose any subset of these, from none to all, and the ordering does not matter, we end up with \(\sum _{r=0}^{n} \left( {\begin{array}{c}n\\ r\end{array}}\right) = 2^4 = 16\) possible combinations, each for 3 possible numbers of outcome classes, resulting in 48 different DGPs. For the high-dimensional setting we have \(n=3\) options, as the additional noise variables are always included, which for all 3 distinct numbers of outcome classes yields \(2^3 \cdot 3 = 24\) different DGPs.
We refrain from further comparisons with alternatives such as the ordinal generalized additive models (see, for example, Hastie 2017) to highlight the differences between the workhorse parametric model and the flexible forest-based models.
Janitza et al. (2016) also perform a simulation study to test the robustness of the suggested score values by setting \(s(m)=m^2\), but do not find any significant differences to the simple choice \(s(m)=m\).
Recently, Buri and Hothorn (2020) and Tutz (2022) proposed score-free methods based on Random Forests that do not rely on the underlying continuous intervals of the observed ordered classes.
This approach could be regarded as semiparametric as it uses the nonparametric structure of the trees and assumes a particular parametric distribution (standard normal) within its optimization procedure.
The algorithm proposed here has already been applied and is in use for predicting match outcomes in football; see Goller et al. (2021) and SEW Soccer Analytics for details.