1 Introduction
2 Methods
2.1 Cumulative Model
Proportional odds models can be fitted in R with the `clm` function from the `ordinal` package (Christensen, 2022). As such, we will be referring to proportional odds models in the context of ordinal prediction as CLM (cumulative link model) in the remainder of this work.
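To illustrate, a minimal example of fitting a CLM with `clm` is shown below. The `wine` data shipped with the `ordinal` package serve as a stand-in here; they are not part of this work.

```r
# Minimal CLM example using the wine data from the ordinal package.
library(ordinal)

data(wine, package = "ordinal")
fit <- clm(rating ~ temp + contact, data = wine)  # proportional odds model
summary(fit)

# Class predictions require newdata without the response column
predict(fit, newdata = subset(wine, select = -rating), type = "class")
```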
2.1.1 Random Forest
Popular implementations of the random forest algorithm in R include the `ranger` package (Wright & Ziegler, 2017) and the `randomForest` package (Liaw & Wiener, 2002).
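As a brief illustration (not code from this work), an RF can be fitted with `ranger` by treating the ordinal response as a nominal factor:

```r
# Classification RF via ranger; the ordinal response is treated as nominal.
library(ranger)

data(wine, package = "ordinal")                      # example data only
wine$rating <- factor(wine$rating, ordered = FALSE)  # drop the ordering
rf <- ranger(rating ~ temp + contact, data = wine, num.trees = 500)
head(predict(rf, data = wine)$predictions)
```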
2.1.2 Conditional Inference Forest
Conditional inference forests are implemented in the `partykit` package (Hothorn & Zeileis, 2015).
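A corresponding sketch using `partykit` (again with the `wine` example data, which are not part of this study):

```r
# Conditional inference forest via partykit::cforest; ordered factor
# responses are handled directly.
library(partykit)

data(wine, package = "ordinal")
cif <- cforest(rating ~ temp + contact, data = wine, ntree = 100)
head(predict(cif, newdata = wine, type = "response"))
```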
2.1.3 Split-Based Ordinal Random Forest
The implementation of the split-based ordinal random forest is based on the `randomForest` package (Liaw & Wiener, 2002).

2.1.4 Ordinal Forest
Internally, OF fits the regression RFs with the `ranger` package (Wright & Ziegler, 2017). Prediction of unseen data is achieved by obtaining the numeric predictions from the RF fit and checking in which class they fall based on the respective borders of the class intervals. The different partitions are evaluated w.r.t. the out-of-bag (OOB) performance achieved when using them. While the OF implementation in the `ordinalForest` package (Hornung, 2022) offers a choice of performance measures, a balanced version of Youden's Index \(J\) (Youden, 1950) is used by default, where for a binary classification task \(J = \text{sensitivity} + \text{specificity} - 1\).
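The class assignment step can be sketched as follows; the borders and numeric predictions are made-up values for illustration and do not stem from the `ordinalForest` implementation.

```r
# Sketch: mapping numeric RF predictions to class labels via a partition
# of [0, 1] transformed with the standard normal quantile function qnorm().
borders <- c(0, 0.25, 0.60, 1)     # hypothetical partition for k = 3
z_borders <- qnorm(borders)        # -Inf, -0.67, 0.25, Inf

numeric_pred <- c(-1.2, 0.1, 0.9)  # hypothetical numeric RF predictions
cut(numeric_pred, breaks = z_borders, labels = 1:3)
```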
2.1.5 Ordinal Score Optimization Algorithm
The objective function `evaluateBorders` derives the numeric scores for the response categories from the provided class borders and fits a regression RF using the numeric scores as the target variable and the corresponding covariates as predictors. As in OF, the numeric scores are first transformed with the quantile function of the standard normal distribution \(\Phi^{-1}\). For fitting RFs, we are using the implementation from the `ranger` package (Wright & Ziegler, 2017). From the RF fit, OOB predictions are obtained and in turn converted into class labels by using the transformed class borders. Finally, the predicted class labels derived from the OOB predictions can be compared with the true class labels to compute the balanced version of Youden's \(J\) used in OF.

The `evaluateBorders` function can be optimized using any derivative-free non-linear optimization algorithm. In this work, we have used the Sbplx algorithm from the `NLopt` library (Johnson, 2007). The Sbplx algorithm is based on the Subplex algorithm by Rowan (1990), which is a variant of the Nelder-Mead algorithm (Nelder & Mead, 1965). Since our algorithm follows Hornung (2019) in determining an optimal partition of the \([0, 1]\) interval, we also restrict candidate class borders during the optimization through a lower bound of 0 and an upper bound of 1. As the class borders relate to the ordinal categories, they need to be sorted such that they match the order of the original categories; hence, only sorted borders should be considered. This can either be enforced through inequality constraints, if they are supported by the given optimizer, or by disincentivizing unsorted solutions via penalization in the evaluation step. As starting values for the optimization, we are using \(\frac{1}{k}, \dots, \frac{k-1}{k}\), i.e., a partition with equally wide class intervals.

The optimization terminates upon reaching a maximum number of function evaluations `max.eval` or upon failing to exceed a minimum performance improvement \(\varepsilon\). For our simulations, we set `max.eval` = 300 and \(\varepsilon = 1 \times 10^{-4}\). Smaller values for \(\varepsilon\) would allow for finding finer differences but, in turn, negatively impact the runtime, while larger values for \(\varepsilon\) would speed up the optimization process but lead to a potentially more imprecise result. Optimal settings for \(\varepsilon\) and `max.eval` depend on the given application context and should be chosen w.r.t. the desired precision and the computational power available. The values selected here were chosen for showcasing the method and were not optimized further.

Once the optimization algorithm determines a solution, the respective scores can be used to fit the final RF model; both this model and the final borders are returned. For unseen data, predicted (numeric) values can be obtained from the model's individual trees and converted into class labels using the (transformed) borders. The overall class prediction for a given observation can then be determined by majority voting. This prediction procedure is identical to the procedure employed in OF (cf. Hornung, 2022).
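To make the procedure concrete, the following sketch re-implements the evaluation and optimization steps in simplified form. It is an illustration under stated assumptions rather than our actual implementation: the toy data, the choice of interval midpoints as category scores, the forest size, and the `sbplx` wrapper from the `nloptr` package are all choices made for this example, and unsorted candidate borders are simply sorted and clamped instead of penalized.

```r
# Simplified sketch of the OSOA evaluation step and border optimization.
library(ranger)
library(nloptr)  # sbplx() wraps NLopt's Sbplx algorithm

set.seed(1)
k <- 3
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
y   <- cut(dat$x1 + rnorm(n), breaks = k, labels = FALSE)  # classes 1..k

evaluate_borders <- function(inner, dat, y, k) {
  # Sort and clamp the candidate borders into (0, 1); the full partition
  # always starts at 0 and ends at 1.
  inner   <- sort(pmin(pmax(inner, 1e-6), 1 - 1e-6))
  borders <- c(0, inner, 1)
  # Scores for the k categories: interval midpoints transformed with qnorm()
  scores <- qnorm((borders[-1] + borders[-(k + 1)]) / 2)
  train  <- cbind(dat, score = scores[y])
  fit    <- ranger(score ~ ., data = train, num.trees = 100)
  # OOB predictions -> class labels via the transformed borders
  oob <- cut(fit$predictions, breaks = qnorm(borders), labels = FALSE)
  # Balanced Youden's J: average of one-vs-rest J over all classes
  j <- mean(sapply(seq_len(k), function(cl) {
    sens <- mean(oob[y == cl] == cl, na.rm = TRUE)
    spec <- mean(oob[y != cl] != cl, na.rm = TRUE)
    sens + spec - 1
  }), na.rm = TRUE)
  -j  # sbplx() minimizes, so return the negated performance
}

res <- sbplx(
  x0      = (1:(k - 1)) / k,           # equally wide intervals as start
  fn      = evaluate_borders,
  lower   = rep(0, k - 1),
  upper   = rep(1, k - 1),
  control = list(maxeval = 300, ftol_abs = 1e-4),
  dat = dat, y = y, k = k
)
c(0, sort(pmin(pmax(res$par, 1e-6), 1 - 1e-6)), 1)  # optimized borders
```

Note that Sbplx supports only bound constraints, which is why the sorted-border requirement is handled inside the objective in this sketch; the penalization route described above is the alternative for such optimizers.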
3 Simulation Study
3.1 Simulation Setup
Table 1 Class probability patterns used in the simulation study

| Pattern | \(k\) | \(\pi_1\) | \(\pi_2\) | \(\pi_3\) | \(\pi_4\) | \(\pi_5\) | \(\pi_6\) | \(\pi_7\) |
|---|---|---|---|---|---|---|---|---|
| Equal | 3 | 0.33 | 0.33 | 0.33 | – | – | – | – |
|  | 5 | 0.20 | 0.20 | 0.20 | 0.20 | 0.20 | – | – |
|  | 7 | 0.14 | 0.14 | 0.14 | 0.14 | 0.14 | 0.14 | 0.14 |
| Wide middle | 3 | 0.25 | 0.50 | 0.25 | – | – | – | – |
|  | 5 | 0.11 | 0.22 | 0.33 | 0.22 | 0.11 | – | – |
|  | 7 | 0.06 | 0.13 | 0.19 | 0.25 | 0.19 | 0.13 | 0.06 |
| Wide margins | 3 | 0.40 | 0.20 | 0.40 | – | – | – | – |
|  | 5 | 0.27 | 0.18 | 0.09 | 0.18 | 0.27 | – | – |
|  | 7 | 0.21 | 0.16 | 0.11 | 0.05 | 0.11 | 0.16 | 0.21 |
All forest-based learners were fitted with 500 trees, the default number in the `ranger` and `randomForest` packages, hence overriding the default number of trees (for the final forest) in the `ordinalForest` package, which is 5000. In all other cases, the default values remained unchanged.
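For illustration, these settings correspond to calls like the following; the formula, toy data, and reduced `nsets` value are placeholders for this sketch, while the tree-number arguments are the packages' documented ones.

```r
# Schematic learner configuration: 500 trees for every forest, overriding
# ordinalForest's ntreefinal default of 5000. Toy data for illustration.
library(ranger); library(randomForest); library(ordinalForest)

set.seed(1)
dat <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
dat$y <- factor(cut(dat$x1 + rnorm(100), breaks = 3, labels = FALSE),
                ordered = TRUE)

fit_ranger <- ranger(y ~ ., data = dat, num.trees = 500)
fit_rf     <- randomForest(y ~ ., data = dat, ntree = 500)
fit_of     <- ordfor(depvar = "y", data = dat, nsets = 50, ntreefinal = 500)
```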
3.2 Simulation Results
3.2.1 Results for DGP 1
3.2.2 Results for DGP 2
3.2.3 Results for DGP 3
3.3 Robustness of Data Generation
3.4 Runtime Comparison
4 Real Data Examples
The Birthweight dataset contains information on the birth weight of newborns and was obtained from the `MASS` package (Venables & Ripley, 2002). The original numeric target variable was categorized according to Tutz (2021). For the Boston dataset, it is of interest to predict the median value of owner-occupied homes in Boston. It was obtained from the `mlbench` package (Leisch & Dimitriadou, 2021). The numeric target variable was binned according to Tutz (2021). For the Hearth dataset, the goal is to predict the severity of coronary artery disease. It was taken from the `ordinalForest` package (Hornung, 2022). The Mammography dataset contains information about mammography experiences and was taken from the `TH.data` package (Hothorn, 2023).
Table 2 Overview of the datasets used in the real data examples

| Name | Observations | Covariates | Description and categories |
|---|---|---|---|
| Birthweight | 189 | 8 | Birth weight in grams |
|  |  |  | 1: \(<2500\) \((n=59)\), 2: 2500–3000 \((n=38)\), 3: 3000–3500 \((n=45)\), 4: \(>3500\) \((n=47)\) |
| Boston | 506 | 13 | Median value of owner-occupied homes in $1000 |
|  |  |  | 1: \(<15\) \((n=97)\), 2: 15–19 \((n=78)\), 3: 19–22 \((n=109)\), 4: 22–25 \((n=98)\), 5: 25–32 \((n=57)\), 6: \(>32\) \((n=67)\) |
| Hearth | 294 | 10 | Severity of coronary artery disease |
|  |  |  | 1: no disease \((n=188)\), 2: degree 1 \((n=37)\), 3: degree 2 \((n=26)\), 4: degree 3 \((n=28)\), 5: degree 4 \((n=15)\) |
| Mammography | 412 | 5 | Last mammography visit |
|  |  |  | 1: Never \((n=234)\), 2: Within a year \((n=104)\), 3: Over a year \((n=74)\) |
| Medical Care | 1778 | 10 | Number of physician office visits |
|  |  |  | 1: 0 \((n=329)\), 2: 1 \((n=183)\), 3: 2–3 \((n=362)\), 4: 4–6 \((n=398)\), 5: 7–8 \((n=149)\), 6: 9–11 \((n=149)\), 7: \(>11\) \((n=208)\) |
| Student | 649 | 12 | Final grade in Portuguese language course |
|  |  |  | 1: 0–10 \((n=100)\), 2: 10–11 \((n=201)\), 3: 12–13 \((n=154)\), 4: 14–15 \((n=112)\), 5: 15–20 \((n=82)\) |
| Wage | 3000 | 8 | Wage of workers in the Mid-Atlantic region in $1000 |
|  |  |  | 1: \(<75\) \((n=430)\), 2: 75–100 \((n=913)\), 3: 100–125 \((n=789)\), 4: 125–150 \((n=525)\), 5: \(>150\) \((n=343)\) |
| Wine Quality | 4898 | 6 | Wine quality rating |
|  |  |  | 1: \(<5\) \((n=183)\), 2: 5 \((n=1457)\), 3: 6 \((n=2198)\), 4: 7 \((n=880)\), 5: \(>7\) \((n=180)\) |
The Medical Care dataset was obtained from the `AER` package (Kleiber & Zeileis, 2008). We have chosen the same subset of observations and covariates to predict the number of physician office visits as well as the same target variable binning as Tutz (2021). The Student dataset contains information about the final grade of students from a Portuguese language course. The data was taken from the UCI Machine Learning Repository (Cortez, 2014). We have binned the target variable, which was originally on a 20-point scale, into five categories (see Table 2). As covariates, we have selected gender, age, region (rural vs. urban), parents' cohabitation status, mother's education, father's education, weekly study time, presence of educational support from the school, presence of educational support from the family, partaking in paid extra classes, interest in taking higher education, as well as access to the internet at home. The Wage dataset was obtained from the `ISLR` package (James et al., 2021). The goal is to predict the wage of workers in the Mid-Atlantic region. The target variable was binned into five categories for our analysis (see Table 2). Lastly, the task for the Wine Quality dataset is predicting the quality score of wine. It was taken from Cortez et al. (2009). The original categories were coarsened according to Tutz (2021). None of the obtained datasets contained any missing values.

To evaluate the seven learners, we performed a five-fold cross-validation with 50 replications; the learners used the same settings as in the simulation study.
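The evaluation protocol can be sketched as follows; `fit_fun()` and `err_fun()` are placeholders for a learner and an error measure and are not taken from this work.

```r
# Schematic replicated k-fold cross-validation: five folds, 50 replications.
cv_replicated <- function(data, fit_fun, err_fun, folds = 5, reps = 50) {
  per_rep <- replicate(reps, {
    fold <- sample(rep_len(seq_len(folds), nrow(data)))  # random fold labels
    mean(sapply(seq_len(folds), function(f) {
      fit <- fit_fun(data[fold != f, ])                  # train on 4 folds
      err_fun(fit, data[fold == f, ])                    # evaluate on 1 fold
    }))
  })
  mean(per_rep)                                          # average over reps
}
```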
5 Discussion
5.1 Finding 1: CLM Remains Competitive for Small Sample Sizes and Limited Non-linear Effects
5.2 Finding 2: TE Methods Reveal Only Small Differences Among Themselves
Due to the default value of the hyperparameter `mtry`, which regulates how many covariates are randomly sampled for consideration in a given split, RFs potentially had to rely on noise variables only for many splits, which ultimately harmed the predictive performance.
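A back-of-the-envelope calculation illustrates this point; the covariate counts below are hypothetical and only meant to show the mechanism.

```r
# Probability that a given split considers noise covariates only, assuming
# p covariates of which q are pure noise and the classification default
# mtry = floor(sqrt(p)). All numbers are hypothetical.
p <- 50                            # total number of covariates
q <- 45                            # number of noise covariates
mtry <- floor(sqrt(p))             # default mtry for classification, here 7
choose(q, mtry) / choose(p, mtry)  # ~0.45, i.e., nearly half of all splits
```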