Abstract
Trustworthiness of AI systems is a core objective of Human-Centered Explainable AI, and relies, among other things, on explainability and understandability of the outcome. While automated machine learning tools automate model training, they often generate not only a single “best” model but also a set of near-equivalent alternatives, known as the Rashomon set. This set provides a unique opportunity for human-centered explainability: by exposing variability among similarly performing models, we can offer users richer and more informative explanations. In this paper, we introduce Rashomon partial dependence profiles, a model-agnostic technique that aggregates feature effect estimates across the Rashomon set. Unlike traditional explanations derived from a single model, Rashomon partial dependence profiles explicitly quantify uncertainty and visualize variability, further enabling user trust and understanding model behavior to make informed decisions. Additionally, under high-noise conditions, the Rashomon partial dependence profiles more accurately recover ground-truth feature relationships than a single-model partial dependence profile. Experiments on synthetic and real-world datasets demonstrate that Rashomon partial dependence profiles reduce average deviation from the ground truth by up to 38%, and their confidence intervals reliably capture true feature effects. These results highlight how leveraging the Rashomon set can enhance technical rigor while centering explanations on user trust and understanding aligned with Human-centered explainable AI principles.
1 Introduction
Automated machine learning (AutoML) has played a transformative role in democratizing machine learning by enabling non-expert users to generate high-performing predictive models with minimal manual intervention. However, as these systems are increasingly deployed in high-stakes domains, concerns arise over the opacity of the resulting models. To address this, explainable artificial intelligence (XAI) methods are commonly used to clarify the behavior of these models (Li & Yan, 2025). When aligned with human needs, trust, and contexts, these explanations contribute to a more human-centered AI development process—referred to as Human-centered XAI (Ehsan & Riedl, 2020; Maity & Deroy, 2024).
Fig. 1
Core components of human-centered XAI and the relation with XAI (Ehsan & Riedl, 2020; Maity & Deroy, 2024)
Unlike technically focused XAI approaches prioritizing algorithmic transparency, human-centered XAI emphasizes making AI systems comprehensible and socially meaningful to users. This context integration, as illustrated in Fig. 1, is formalized around four core components that form the basis of explanation generation: user trust, user needs, social context, and usability. This framework demonstrates that the human-centered XAI methodology diverges from the technically algorithmic-focused XAI by systematically formalizing and integrating the context of use, rather than relying solely on the underlying ML algorithms. This includes accounting for the socio-technical environments in which systems operate and minimizing the risk of misinterpretation, particularly in sensitive domains such as healthcare or criminal justice (Ferrario et al., 2024). For instance, in a clinical decision support scenario, two models with comparable accuracy might offer conflicting explanations for a patient’s risk profile (one highlighting age, another emphasizing comorbidities), potentially leading to misinformed treatment choices and eroding clinician trust. Human-centered XAI also emphasizes the importance of evaluating explanations not only by their fidelity or informativeness, but also by their impact on user trust, understanding, and confidence in model decisions (Suffian et al., 2023). Nonetheless, human-centered XAI faces a structural tension: while its goals are human-centric, it is often implemented through technically driven machine learning models that may not fully capture the ambiguity, context-dependence, or interpretability needs of end users (Nirenburg et al., 2024; Nguyen & Zhu, 2022).
One prominent challenge that illustrates this tension is explanation uncertainty, which refers to the phenomenon where different model settings yield conflicting justifications for the same model prediction (Roy et al., 2022; Barr et al., 2023). As model complexity increases, these discrepancies tend to grow more pronounced (Krishna et al., 2024), introducing epistemic uncertainty (Löfström et al., 2024) and undermining the reliability of explanations. Conflicting rationales can lead users to confusion, mistrust, or even disregard for explainability tools altogether (Goethals et al., 2023; Mitruț et al., 2024). Yet, uncertainty in explanations need not always be viewed as a flaw. For instance, dissenting explanations have been shown to reduce overreliance on model outputs and foster more cautious, reflective decision-making (Reingold et al., 2024; Vascotto et al., 2025). Some researchers thus advocate framing explanations as arguments rather than truths, seeing disagreement as a tool to encourage critical interpretation and convey model uncertainty (Schwarzschild et al., 2023). Various strategies have been proposed to manage this disagreement, including identifying high-consensus regions in the input space (Mitruț et al., 2024), aggregating outputs of multiple explanation methods (Roy et al., 2022), and designing models that promote internal consistency among explanation outputs (Schwarzschild et al., 2023).
While XAI helps uncover how models behave, AutoML systems typically return a single best model, based on predefined performance metrics. Although this practice simplifies deployment and evaluation, it neglects the reality that multiple models may achieve similar performance while exhibiting meaningful differences in interpretability, robustness, or ethical alignment. Even for notable AutoML systems that ensemble various well-performing models (such as Auto-sklearn Feurer et al., 2015 or AutoGluon Erickson et al., 2020), it remains non-trivial to utilise these additional models in robust explanations. Especially in high-dimensional or noisy settings, the best model may capture dataset-specific artifacts rather than reflect generalizable patterns (Semenova et al., 2023). As a result, relying exclusively on a single model can oversimplify the explanation space and overlook viable alternative narratives. In practice, AutoML systems are frequently executed using default search parameters, especially by non-expert users. While this simplifies usability, it can restrict the diversity of candidate models and limit interpretive robustness.
These limitations are deeply connected to the broader Rashomon effect in machine learning, where a multitude of near-optimal models offer diverse, and occasionally conflicting, explanations for the same data (Breiman, 2001). As noted in recent work (Rudin et al., 2022, 2024), such multiplicity is not an anomaly but a natural consequence of the underdetermined nature of many predictive tasks. The practical implications of selecting one model over another can be especially serious in sensitive application areas such as healthcare, finance, or criminal justice (Watson-Daniels et al., 2024). Within the scope of human-centered XAI, this highlights the importance of embracing model diversity not as a source of confusion, but as a resource for richer, more inclusive decision support (Watson-Daniels et al., 2023), and a new window for machine learning research (Cavus & Biecek, 2025; Yardimci & Cavus, 2025) referred to as the Rashomon perspective (Biecek & Samek, 2024). A compelling illustration of this approach can be found in Kobylińska et al. (2024), where Partial Dependence Profiles are aggregated across discrepant models which are selected from the Rashomon set to support more diverse explanations in a medical context.
To address these challenges, we propose the Rashomon partial dependence profiles (Rashomon PDP), an extension of the classic model-agnostic explanation technique known as the partial dependence profile (PDP) (Friedman, 2001) that integrates the Rashomon perspective. While traditional PDP is computed using only a single model, Rashomon PDP aggregates the marginal feature effects across a set of near-optimal models that lie within a small performance threshold of the top performer. Importantly, these models are already generated by AutoML but are typically discarded once the best model is selected. Whereas related work focuses on detecting and selecting discrepant models within the Rashomon set (Kobylińska et al., 2024), our approach introduces a new explanation technique, the Rashomon partial dependence profile, that aggregates feature effects across the entire set and quantifies their uncertainty. Rather than identifying diverse models, we provide a unified explanation profile and explicitly connect Rashomon-set construction to AutoML outputs, giving the human expert insight into the varied reasoning that well-performing models can exhibit. Despite this low computational overhead, Rashomon PDP offers significantly more reliable estimates of feature effects by explicitly accounting for model-based uncertainty and explanation variability. By quantifying the spread of feature effect estimates across near-optimal models, it not only reveals unstable regions in the explanatory space but also empowers users to gauge their confidence in specific interpretive insights, thereby directly addressing concerns about trust and reliability in human-centered XAI systems. In practice, not all near-optimal models contribute equally to explanation diversity. A useful way to summarize this diversity is through the Rashomon ratio, the proportion of near-optimal models relative to the total number of models.
Intuitively, a higher ratio means that many models perform similarly well, which implies greater potential uncertainty in their explanations. We return to this idea in Section 3, where we provide a formal definition and demonstrate its role in quantifying explanation variability.
This paper builds upon our earlier work (Cavus et al., 2025), which first introduced the Rashomon PDP as a means to quantify explanation uncertainty in AutoML. While that work focused on regression tasks using real-world datasets, the current paper significantly extends the scope by (i) generalizing the method to both regression and binary classification, (ii) incorporating synthetic benchmarks with known ground-truth relationships, and (iii) offering a more comprehensive evaluation of explanation fidelity and uncertainty across varying AutoML configurations. The key contributions of this paper are as follows:
1.
We introduce the Rashomon PDP, a novel model-agnostic explanation technique that aggregates feature effect estimates across the Rashomon set of near-optimal models generated during AutoML processes. By explicitly accounting for explanation uncertainty and model-based variability—unlike standard single-model PDP methods—Rashomon PDP enhances interpretive robustness at minimal computational cost and yields more accurate and trustworthy estimates of partial dependencies, particularly under high-noise or complex interaction settings.
2.
We propose the Rashomon ratio as a quantitative indicator of explanation uncertainty, showing that it is strongly associated with the reliability of feature effect estimates across diverse modeling scenarios.
3.
Our approach aligns with the goals of human-centered XAI by providing explanations that reflect underlying model uncertainty, support user trust calibration, and promote more reflective decision-making.
Empirical results in this paper support the efficacy of this approach. Using synthetic datasets with known ground-truth feature relationships, we show that Rashomon PDP approximates the true partial dependence profiles more accurately than PDP derived from the single best model alone. In scenarios involving high data noise or complex feature interactions, the improvement is particularly pronounced. We find that this benefit is closely linked to the Rashomon ratio—the proportion of near-optimal models to all models, which serves as a useful signal of explanation uncertainty, especially under high-noise conditions. In datasets where the Rashomon ratio is high, Rashomon PDP yields notably accurate estimates of feature effects. These findings suggest that Rashomon PDPs do not merely add interpretability—they improve it in a measurable and principled way, aligning closely with the goals of human-centered XAI by offering users calibrated and faithful explanations without the need for retraining or restructuring the model pipeline.
The remainder of this paper is structured as follows. Section 2 reviews relevant literature on AutoML, human-centered XAI, and explanation uncertainty. Section 3 introduces the conceptual and computational foundations of the Rashomon PDP framework. Section 4 details the experimental setup, including dataset design and evaluation metrics. Section 5 presents the empirical findings and provides a comparative analysis of explanation quality across different approaches. Finally, Section 6 concludes with a discussion of the broader implications of this work and outlines promising directions for future research.
2 Related Work
The following subsections situate our work within three complementary areas: (i) human-centered explainability, which motivates the need for trustworthy explanations; (ii) explanation uncertainty, which highlights the challenge posed by model multiplicity; and (iii) AutoML, which serves as a practical mechanism that generates this multiplicity and thereby provides the foundation for our proposed Rashomon-based explanation method.
2.1 Human-centered XAI
Human-centered XAI has emerged as a response to the limitations of traditional, technically focused explainability approaches. Rather than prioritizing algorithmic transparency alone, human-centered XAI seeks to ensure that AI systems are comprehensible, trustworthy, and acceptable from the perspective of diverse users (Ehsan & Riedl, 2020; Maity & Deroy, 2024). This shift reflects the broader sociotechnical understanding that explanations should not merely describe internal model behavior, but also align with human values, contextual needs, and decision-making goals (Ferrario et al., 2024). Designing such explanations introduces new challenges: human-centered XAI must account for variation in users’ expectations, cognitive styles, and domain expertise, which often renders static or generic explanations insufficient (Ehsan et al., 2021). As a result, the development of human-centered explanations involves balancing technical rigor with usability considerations, a tension that is particularly salient in complex or high-stakes domains such as healthcare and criminal justice (Suffian et al., 2023).
Moreover, recent critiques highlight that human-centered XAI is still largely implemented within machine learning pipelines that may be ill-suited to fully capture the ambiguity, nuance, or situated reasoning required for effective explanation (Nirenburg et al., 2024). The explanatory power of such systems may be inherently limited by the constraints of the models they aim to interpret. To overcome this, it is essential that human-centered XAI methods begin with a careful elicitation of user requirements and incorporate them meaningfully into the design process (Nguyen & Zhu, 2022).
Our work builds on this foundation by emphasizing both user-centered design and model-centered pluralism: the idea that embracing explanation variability across high-performing models—the Rashomon set—can offer users richer and more calibrated understandings. In this way, we aim to extend the human-centered XAI paradigm from explanation delivery to explanation diversity, supporting critical engagement rather than passive acceptance.
2.2 Explanation Uncertainty
Despite the growing adoption of XAI techniques, recent research highlights explanation uncertainty, or disagreement, as an important phenomenon that can emerge in practice. It reflects predictive multiplicity (Marx et al., 2020): different model configurations applied to the same data and observation can produce diverging, sometimes contradictory explanations. Even models with similar performance on a test set can yield contradictory explanations, either because the models excel on different test instances, or because they make similar predictions on the same instances while relying on different attributes to reach them. This uncertainty challenges both the validity of post-hoc explainability tools and the trustworthiness of model outputs in decision-making contexts (Roy et al., 2022; Krishna et al., 2024; Wickstrøm et al., 2024), but it can also be used to further inform the human expert about patterns in the data.
Explanation uncertainty often arises from the absence of a definitive ground truth for explanations. Since the internal logic of complex models may not be reducible to a single “correct” attribution, most evaluation frameworks approximate explanation quality via properties such as faithfulness, complexity, or robustness (Chiaburu et al., 2025; Rawal et al., 2025). However, empirical findings suggest that even widely used metrics fail to reach consensus across methods, leaving users uncertain about which explanation—or evaluation criterion—to trust (Barr et al., 2023; Miró-Nicolau et al., 2025).
Importantly, explanation uncertainty is not necessarily detrimental; in fact, when multiple models perform similarly but diverge in their behavior, the resulting diverse or dissenting explanations can serve a constructive role. Rather than undermining trust, such disagreements can reveal epistemic uncertainty and help users adopt a more calibrated stance toward AI recommendations (Reingold et al., 2024). Empirical evidence supports the idea that exposure to conflicting rationales reduces overreliance on AI outputs and promotes more reflective human-AI interaction, especially in high-stakes applications such as healthcare, law, and finance.
Several methodological strategies have been proposed to address this uncertainty. One line of work focuses on local consensus, identifying regions in the input space where different XAI methods tend to agree and are therefore more trustworthy (Mitruț et al., 2024; Laberge et al., 2024). Others advocate for consensus-driven aggregation, combining outputs from multiple explanation methods to synthesize more robust insights (Banegas-Luna et al., 2023; Chatterjee et al., 2025). Additionally, training models with explanation-consistency objectives has been explored as a way to reduce internal ambiguity in feature attributions (Schwarzschild et al., 2023).
Nonetheless, not all sources of uncertainty are technical. Some argue that the root cause lies in the lack of a true internal rationale for model decisions—especially in black-box systems—making any explanation inherently partial or speculative (Goethals et al., 2023). The notion of a single definitive explanation may thus be fundamentally incompatible with modern machine learning models, especially when applied to high-dimensional, underdetermined problems (Simson et al., 2025).
In light of these challenges, we adopt the position that explanation uncertainty should not be viewed solely as a failure of XAI but rather as a feature of the broader model landscape. Our proposed Rashomon PDP builds directly on this insight by capturing explanation variability across a set of high-performing models. This approach aligns with calls in the literature to treat explanations as arguments rather than facts, and to embrace multiplicity as a resource for enhancing interpretability, rather than obscuring it.
A major source of such uncertainty is model multiplicity—situations where many models achieve similar performance but differ in their reasoning. AutoML systems naturally produce this multiplicity by generating large sets of near-optimal models, making them a practical and relevant setting for studying explanation uncertainty.
Fig. 2
Three important components of AutoML, figure taken from Baratchi et al. (2024)
Training machine learning models requires extensive expertise: one must select a suitable model class, apply appropriate pre-processing techniques, and tune the hyperparameters of all individual components. AutoML supports the human in the data science loop by automating these tasks (Baratchi et al., 2024). AutoML generally consists of three components:
1.
A search space that consists of candidate models and hyperparameter spaces. The hyperparameter space defines all relevant hyperparameters as well as the ranges of these that can be selected.
2.
A search algorithm that determines the strategy for traversing this search space. As an extreme example, one could randomly select model and hyperparameter combinations from the search space, and continue until a reasonable one has been found (Bergstra & Bengio, 2012), but also more advanced paradigms exist (e.g., Bayesian optimisation Snoek et al., 2012; Bergstra et al., 2011; Hutter et al., 2011).
3.
A performance evaluation strategy, that determines for each candidate model and hyperparameter configuration how well it performs on the given task. Typically, some form of nested cross-validation is used (Varma & Simon, 2006), but also more efficient methods can be used, for example, using subsampling (Jamieson & Talwalkar, 2016) or learning curves (Klein et al., 2017; Mohr & van Rijn, 2023).
These components are summarised in Fig. 2. AutoML packages typically combine all these components in an easy-to-deploy tool. Various advanced AutoML tools have been proposed, including AutoGluon (Erickson et al., 2020), AutoSklearn (Feurer et al., 2022), and H2O AutoML (LeDell et al., 2020).
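The three components above can be illustrated with a deliberately minimal AutoML loop. This is a sketch only: the toy task, the model class (closed-form polynomial ridge regression), and the budget of 20 evaluations are assumptions for illustration, not the behavior of any particular AutoML tool.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: y = sin(x) + noise, with a holdout split
X = rng.uniform(-3, 3, size=200)
y = np.sin(X) + rng.normal(0.0, 0.1, size=200)
X_tr, X_te, y_tr, y_te = X[:150], X[150:], y[:150], y[150:]

def fit_poly_ridge(x, y, degree, alpha):
    """Closed-form ridge regression on polynomial features."""
    Phi = np.vander(x, degree + 1)
    w = np.linalg.solve(Phi.T @ Phi + alpha * np.eye(degree + 1), Phi.T @ y)
    return lambda xnew: np.vander(xnew, degree + 1) @ w

# (1) Search space: candidate hyperparameter values
search_space = {"degree": list(range(1, 8)),
                "alpha": [1e-3, 1e-2, 1e-1, 1.0]}

# (2) Search algorithm: plain random search (Bergstra & Bengio, 2012)
# (3) Performance evaluation: MSE on the holdout set
candidates = []
for _ in range(20):
    cfg = {"degree": int(rng.choice(search_space["degree"])),
           "alpha": float(rng.choice(search_space["alpha"]))}
    model = fit_poly_ridge(X_tr, y_tr, cfg["degree"], cfg["alpha"])
    mse = float(np.mean((model(X_te) - y_te) ** 2))
    candidates.append((cfg, model, mse))

best_cfg, best_model, best_mse = min(candidates, key=lambda c: c[2])
print(best_cfg, round(best_mse, 4))
```

Note that the loop retains every evaluated candidate in `candidates`, not only the winner; this pool of trained models is precisely what the Rashomon-based analysis in later sections builds on.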
It has already been recognised that many AutoML techniques result in multiple trained models, and rather than discarding these models, it is beneficial to put them to use, for example by building an ensemble of the best-performing models (Feurer et al., 2015).
Various studies have aimed to interpret the results of AutoML processes. After a hyperparameter optimisation process has been completed, one can interpret which are the most important hyperparameters to optimise. This can be done by techniques such as functional ANOVA (Hutter et al., 2014) or ablation analysis (Fawcett & Hoos, 2016). By gathering such results across many datasets, one can build configuration spaces that work well across datasets (Perrone et al., 2018) or interpret which are generally the important hyperparameters across datasets (van Rijn & Hutter, 2018; Moussa et al., 2024). This latter work has also been applied to multi-objective performance criteria (Theodorakopoulos et al., 2024).
Importantly, AutoML is highly relevant to our work because it naturally generates many near-optimal models during its search process. Rather than relying on a single “best” model, these alternative high-performing models form the Rashomon set that underlies our explanation approach. Thus, AutoML not only motivates the need to handle explanation variability but also provides the practical source of model multiplicity that our method builds upon.
3 Methodology
This section introduces a model-agnostic methodology for aggregating feature effect estimates across a set of near-optimal models. The goal is to capture the diversity of explanations that arise due to model multiplicity and to quantify the stability of such explanations through uncertainty estimation.
Background
Let \(\mathcal {D} = \{(\textbf{x}_i, y_i)\}_{i=1}^n\) be a labeled dataset, where \(\textbf{x}_i \in \mathbb {R}^p\) is a feature vector and \(y_i \in \{0,1\}\) is a binary response for classification or \(y_i \in \mathbb {R}\) is a continuous response for the regression task. The objective is to train a diverse set of predictive models and examine their behavior using partial dependence analysis restricted to a Rashomon set.
Model Training and the Best-performing Model
A collection of machine learning models \(\mathcal {M} = \{M_1, M_2, \ldots , M_K\}\) is obtained via an AutoML process, where each model \(M_k\) is trained on the same training dataset and evaluated on a test set. Denote by \(\phi (M_k)\) a predictive performance measure of a model \(M_k\); this can be accuracy or area under the curve (AUC), etc., calculated on the test set for classification tasks, or mean squared error (MSE), mean absolute error (MAE), or other relevant measures for regression tasks. Define the best-performing model (hereafter referred to as the best model) as:

\(M^{*} = \arg \max _{M_k \in \mathcal {M}} \phi (M_k).\)
Note that \(\phi (M_k)\) is assumed to be maximized for classification tasks (e.g., accuracy, AUC), and minimized for regression tasks (e.g., MSE, MAE).
Rashomon Set
To explore model multiplicity, we then define the Rashomon set \(\mathcal {R} \subseteq \mathcal {M}\) as:

\(\mathcal {R} = \left\{ M_k \in \mathcal {M} : \phi (M_k) \ge (1 - \varepsilon )\, \phi (M^{*}) \right\}\)

for a predefined tolerance \(\varepsilon > 0\), also known as the Rashomon parameter (for minimized measures such as MSE, the condition is reversed to \(\phi (M_k) \le (1 + \varepsilon )\, \phi (M^{*})\)).
Rashomon Ratio
While the Rashomon set \(\mathcal {R}\) provides the collection of near-optimal models, it is often useful to summarize the extent of model multiplicity with a single scalar. We define the Rashomon ratio as:

\(\text {Rashomon ratio} = \frac{|\mathcal {R}|}{|\mathcal {M}|},\)

where \(|\mathcal {R}|\) denotes the number of models in the Rashomon set and \(|\mathcal {M}|\) the total number of models.
Intuitively, a higher Rashomon ratio suggests a greater potential for explanation variability, as more models achieve near-optimal performance. Conversely, a low Rashomon ratio may indicate a more concentrated model space; however, this does not guarantee stable explanations. Even a small number of high-performing models can diverge substantially in their feature attributions. Therefore, Rashomon ratio should be interpreted only as an indirect indicator, while the actual explanatory variability must be assessed empirically. This measure provides a coarse indication of how model diversity may relate to explanation variability, and we empirically examine this relationship in Section 5.
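As a concrete sketch, the Rashomon set and ratio can be computed directly from the scored candidates an AutoML run returns. The relative threshold used below mirrors the "within 5% of the best model" setup described later in Section 4.2; the model names and AUC values are hypothetical.

```python
def rashomon_set(models, scores, eps, maximize=True):
    """Models whose performance is within a relative tolerance eps of the best.

    For maximized measures (accuracy, AUC) keep score >= (1 - eps) * best;
    for minimized measures (MSE, MAE) keep score <= (1 + eps) * best.
    """
    if maximize:
        best = max(scores)
        keep = [s >= (1 - eps) * best for s in scores]
    else:
        best = min(scores)
        keep = [s <= (1 + eps) * best for s in scores]
    return [m for m, k in zip(models, keep) if k]

# Hypothetical AUC scores for five AutoML candidates
models = ["M1", "M2", "M3", "M4", "M5"]
aucs = [0.91, 0.90, 0.88, 0.80, 0.86]

R = rashomon_set(models, aucs, eps=0.05)
ratio = len(R) / len(models)   # Rashomon ratio |R| / |M|
print(R, ratio)                # models within 5% of the best AUC
```

Here the cutoff is \(0.95 \times 0.91 = 0.8645\), so three of the five candidates enter the Rashomon set and the ratio is 0.6.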
Partial Dependence Profile
Let \(X_j\) be a feature of interest. Here, we denote by x a single fixed value that the feature \(X_j\) can take, while \(\textbf{x}_i\) represents the p-dimensional feature vector of the i-th observation, and \(\textbf{x}_{i,-j}\) denotes the same vector with the j-th feature removed. The notation \([x, \textbf{x}_{i,-j}]\) thus refers to a modified observation where the j-th feature is set to x while all other features retain their values from the i-th observation. The partial dependence profile of model \(M_k \in \mathcal {R}\) for \(X_j\) is defined as the expected prediction when \(X_j\) is fixed at a value x and all other features vary according to their empirical distribution in the data:

\(\hat{f}^{(k)}_{j}(x) = \frac{1}{n} \sum _{i=1}^{n} \hat{f}^{(k)}\left( [x, \textbf{x}_{i,-j}] \right),\)

where \(\textbf{x}_{i,-j}\) denotes the i-th observation with the j-th feature removed, and \([x, \textbf{x}_{i,-j}]\) represents the modified observation where the j-th feature is set to x. The function \(\hat{f}^{(k)}\) is the prediction function of model \(M_k\). For classification, it returns predicted probabilities (e.g., \(\hat{P}(Y=1|\textbf{x})\)); for regression, it returns the expected response.
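The computation can be sketched as follows: for each grid value, feature j is clamped to that value in every observation and the model's predictions are averaged. The toy prediction function and simulated data are assumptions for illustration, not the paper's experimental models.

```python
import numpy as np

def pdp(predict, X, j, grid):
    """Partial dependence profile of feature j for one model (Friedman, 2001)."""
    profile = []
    for x in grid:
        Xmod = X.copy()
        Xmod[:, j] = x                         # set x_{i,j} = x, keep x_{i,-j}
        profile.append(predict(Xmod).mean())   # average prediction over observations
    return np.array(profile)

# Toy setup: a known prediction function on simulated data
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
predict = lambda X: 2.0 * X[:, 0] + X[:, 1] ** 2
grid = np.linspace(-2, 2, 5)

print(pdp(predict, X, j=0, grid=grid))  # linear in x with slope 2, plus a constant offset
```

Because the toy model is additive in feature 0, the profile recovers its true effect exactly: consecutive grid points differ by twice the grid spacing.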
Rashomon Partial Dependence Profile
The Rashomon partial dependence profile across all models in the Rashomon set is defined as an average feature effect:

\(\bar{f}_{j}(x) = \frac{1}{|\mathcal {R}|} \sum _{M_k \in \mathcal {R}} \hat{f}^{(k)}_{j}(x).\)
It is computed as the element-wise average of the individual PDPs of all models in the Rashomon set. Specifically, for each feature value, we calculate the predicted response from each model in the set and then average these predictions to obtain the Rashomon PDP. This aggregation captures the consensus effect of the feature across multiple similarly performing models.
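Computationally, the aggregation is just an element-wise mean over the per-model profiles. In the sketch below, the PDP values for three Rashomon-set models on a three-point grid are made up for illustration.

```python
import numpy as np

# Rows: models in the Rashomon set; columns: grid values of the feature.
pdps = np.array([
    [0.10, 0.30, 0.55],
    [0.12, 0.28, 0.60],
    [0.08, 0.34, 0.50],
])

rashomon_pdp = pdps.mean(axis=0)  # element-wise average across models
spread = pdps.std(axis=0)         # per-grid-point variability across models

print(rashomon_pdp)  # consensus feature effect at each grid value
print(spread)        # where the near-optimal models disagree
```

The standard deviation across rows directly exposes the grid regions where near-optimal models disagree, which is the variability the bootstrap intervals below quantify more formally.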
Bootstrap Intervals for Rashomon Partial Dependence Profile
To quantify the uncertainty of \(\bar{f}_j(x)\), we employ a nonparametric bootstrap over the Rashomon set. Specifically, for B bootstrap replicates, we resample models with replacement from \(\mathcal {R}\) to form bootstrap samples \(\mathcal {R}^{(1)}, \mathcal {R}^{(2)}, \ldots , \mathcal {R}^{(B)}\), and compute:

\(\bar{f}^{(b)}_{j}(x) = \frac{1}{|\mathcal {R}^{(b)}|} \sum _{M_k \in \mathcal {R}^{(b)}} \hat{f}^{(k)}_{j}(x), \quad b = 1, \ldots , B.\)

Percentile-based confidence intervals for \(\bar{f}_j(x)\) are then obtained from the empirical distribution of these replicates.
This procedure allows visualizing both the central tendency and the variability in feature effect estimates induced by model multiplicity within the Rashomon set. Note that sampling models from the Rashomon set, rather than observations from the training data, captures uncertainty arising from model selection rather than data variability. In line with the goals of human-centered XAI, the Rashomon PDP not only provides a more comprehensive view of model behavior but also informs users about the uncertainty inherent in model explanations to enhance user trust. By presenting a range of plausible feature effects, this approach enables users to form more calibrated mental models of the system, reduces overreliance on any single model, and encourages critical engagement with model outputs. Thus, it helps foster appropriate trust and improves the interpretability of AutoML predictions in high-stakes settings.
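The bootstrap over models can be sketched as below, reusing a per-model PDP matrix (rows = models in the Rashomon set, columns = grid values). The number of replicates and the interval level are illustrative choices, and the PDP values are made up.

```python
import numpy as np

def bootstrap_rashomon_pdp(pdps, B=1000, level=0.95, seed=0):
    """Percentile bootstrap intervals for the Rashomon PDP.

    Resamples MODELS (rows) with replacement, not observations, so the
    interval reflects uncertainty from model multiplicity.
    """
    rng = np.random.default_rng(seed)
    K = pdps.shape[0]
    reps = np.stack([pdps[rng.integers(0, K, size=K)].mean(axis=0)
                     for _ in range(B)])
    lo, hi = np.quantile(reps, [(1 - level) / 2, (1 + level) / 2], axis=0)
    return pdps.mean(axis=0), lo, hi

pdps = np.array([[0.10, 0.30, 0.55],
                 [0.12, 0.28, 0.60],
                 [0.08, 0.34, 0.50]])
center, lo, hi = bootstrap_rashomon_pdp(pdps)
print(center, lo, hi)
```

Plotting `center` with the `(lo, hi)` band over the feature grid yields the Rashomon PDP visualization: wide bands flag regions where the near-optimal models disagree and the explanation is less certain.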
4 Experiments
We evaluate our framework across diverse regression and classification tasks using standardized real-world dataset benchmarks and a widely used AutoML framework. Since the ground-truth PDPs are unknown in real-world datasets, we also investigate the functionality of our proposed framework through experiments on synthetic datasets where the ground-truth PDPs can be explicitly verified. The goal is to assess the usability of our framework regarding Rashomon set characteristics and uncertainty in explanations. This section details the datasets used and the experimental configuration.
4.1 Dataset
We utilize the OpenML-CTR23 (Fischer et al., 2023), which comprises 35 diverse regression datasets, and the 29 binary classification datasets from the OpenML-CC18 (Bischl et al., 2021). These datasets vary in domain, size, and complexity, allowing for a comprehensive evaluation of our proposed framework across diverse real-world scenarios.
4.2 Setup
In the synthetic dataset and real-world dataset experiments, we set the Rashomon parameter \(\varepsilon = 0.05\), a choice motivated by its frequent use in related literature. It provides a balanced trade-off between (i) obtaining a Rashomon set with sufficient model diversity and (ii) avoiding overly loose thresholds that would include poorly performing models (Müller et al., 2023; Cavus & Biecek, 2024). In all experiments, the Rashomon set includes models whose performance (in terms of RMSE for the regression task and AUC for the classification task) is within 5% of the best model found by the AutoML system. In the following subsection, we detail the AutoML configuration used in all experiments and the setups for synthetic data generation.
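The Rashomon set membership rule described above can be sketched as a small helper; the relative-threshold form below is one reading of "within 5% of the best model" (lower-is-better for RMSE, higher-is-better for AUC):

```python
def rashomon_set(scores, epsilon=0.05, higher_is_better=False):
    """Return the ids of models within a relative epsilon of the best score.

    scores: dict mapping model id -> performance (e.g., RMSE or AUC).
    For RMSE (lower is better) a model qualifies if
    score <= best * (1 + epsilon); for AUC (higher is better) if
    score >= best * (1 - epsilon).
    """
    if higher_is_better:
        best = max(scores.values())
        return [m for m, s in scores.items() if s >= best * (1 - epsilon)]
    best = min(scores.values())
    return [m for m, s in scores.items() if s <= best * (1 + epsilon)]
```

For example, with RMSE scores `{"gbm": 1.00, "rf": 1.04, "glm": 1.20}` and `epsilon=0.05`, only `gbm` and `rf` fall below the 1.05 threshold and enter the Rashomon set.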
4.2.1 AutoML Configuration
We conduct our experiments using the H2O AutoML framework (LeDell et al., 2020), an automated machine learning tool that supports a variety of algorithms, including gradient boosting machines, random forests, generalized linear models, and stacked ensembles. In the modeling phase, we configure H2O AutoML with the search constraints max_runtime_secs for maximum runtime and max_models for maximum number of models under several designs given in Table 1.
Table 1
Configurations for H2O AutoML in terms of the maximum runtime in seconds and the maximum number of models
Configuration  Runtime  Models    Configuration  Runtime  Models
A              360      20        I              360      80
B              720      20        J              720      80
C              1440     20        K              1440     80
D              2880     20        L              2880     80
E              360      40        M              360      160
F              720      40        N              720      160
G              1440     40        O              1440     160
H              2880     40        P              2880     160
All results are reported under Configuration A; we additionally test the effect of these search constraints using the statistical hypothesis tests employed in the evaluation of the proposed framework, with the results given in Section 5.
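The sixteen configurations in Table 1 map directly onto the two H2O AutoML search constraints named above; a minimal sketch of how they might be enumerated and passed to the tool (the `configs` dictionary is our own construction, not from the paper's code):

```python
from itertools import product

# Configurations A-P from Table 1: four model caps x four runtimes.
# product() yields the caps in blocks of four runtimes, matching the
# table's ordering A (360, 20) ... P (2880, 160).
runtimes = [360, 720, 1440, 2880]
model_caps = [20, 40, 80, 160]
labels = "ABCDEFGHIJKLMNOP"
configs = {
    label: {"max_runtime_secs": r, "max_models": m}
    for label, (m, r) in zip(labels, product(model_caps, runtimes))
}

# Each configuration passes straight to H2O AutoML, e.g.:
# from h2o.automl import H2OAutoML
# aml = H2OAutoML(**configs["A"], seed=1)
# aml.train(x=features, y=target, training_frame=train)
```

`max_runtime_secs` and `max_models` are the actual H2O AutoML constraint parameters; the training call is commented out because it requires a running H2O cluster.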
4.2.2 Synthetic Dataset Setups
Synthetic datasets with \(n=10\,000\) observations were generated under three different setups to capture a range of data complexities and feature types. These setups include (1) combined linear and nonlinear polynomial terms, (2) trigonometric and interaction effects, and (3) complex nonlinear and categorical effects. The features included continuous, binary, categorical, and exponential features, with noise levels varying across experiments as described in Table 2.
\(X_1, X_3, X_7, X_9 \sim \mathcal {N}(0,1)\), \(X_2 \sim \text {Bernoulli}(0.5)\), \(X_4 \sim U(-2, 2)\), \(X_5 \sim \text {Exp}(1)\), \(X_6 \in \{A,B,C\}\), \(X_8 \in \{0, 1\}\), \(\varepsilon \sim \mathcal {N}(0, v_i^2)\) with \(v_i^2 \in \{1, 4, 9\}\). \(\mathbb {1}_{\text {condition}}\) is the indicator function that equals 1 if the condition is true, otherwise 0. Categorical feature effects: Setup (1): \(X_6=B\) adds 2, \(X_6=C\) adds -1, \(X_6=A\) is baseline 0; Setup (2): \(X_6=A\) adds -2, \(X_6=C\) adds 1, \(X_6=B\) is baseline 0; \(X_8=1\) adds 2, else 0; Setup (3): \(X_6=C\) adds 3, \(X_6 \in \{A,B\}\) baseline 0
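The feature distributions in the note above can be sampled directly; the sketch below follows those distributions, but the response combines an illustrative mix of linear, nonlinear, and categorical effects, not the exact formulas of Table 2:

```python
import numpy as np

def make_synthetic(n=10_000, noise_var=1.0, seed=0):
    """Sample features per the Table 2 note; the response y is an
    illustrative stand-in, NOT the paper's exact generating formulas."""
    rng = np.random.default_rng(seed)
    X1, X3, X7, X9 = (rng.normal(0.0, 1.0, n) for _ in range(4))
    X2 = rng.binomial(1, 0.5, n)          # Bernoulli(0.5)
    X4 = rng.uniform(-2.0, 2.0, n)        # U(-2, 2)
    X5 = rng.exponential(1.0, n)          # Exp(1)
    X6 = rng.choice(["A", "B", "C"], n)   # categorical
    X8 = rng.integers(0, 2, n)            # binary indicator
    eps = rng.normal(0.0, np.sqrt(noise_var), n)
    # Illustrative effects, including the Setup (1)-style category shifts
    cat_effect = np.where(X6 == "B", 2.0, np.where(X6 == "C", -1.0, 0.0))
    y = X1**2 + 2.0 * X2 + np.sin(X4) + cat_effect + eps
    X = {"X1": X1, "X2": X2, "X3": X3, "X4": X4, "X5": X5,
         "X6": X6, "X7": X7, "X8": X8, "X9": X9}
    return X, y
```

Varying `noise_var` over {1, 4, 9} reproduces the low/mid/high noise regimes used in the experiments.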
4.3 Metrics
We employ three evaluation metrics to quantify the variability and reliability of feature effect estimates derived from the Rashomon set. The mean distance to ground-truth PDP measures how far the best-model PDP and the Rashomon PDP deviate from the ground-truth PDP in the synthetic data experiments. The remaining two metrics are used in the real-world dataset experiments: the mean width of confidence intervals captures the overall uncertainty of the Rashomon PDP, while the coverage rate assesses how well the Rashomon PDP aligns with the best-model PDP.
These metrics are calculated at the feature level and introduced in the following subsections. In the experiments, these feature-level values can also be aggregated to the dataset level using summary statistics such as the mean and standard deviation.
4.3.1 Mean Distance to Ground-truth PDP
We define the mean distance (MD) to ground-truth PDP as a metric that quantifies how closely a given PDP approximates the ground-truth PDP. For a given feature \(X_j\), the metric is computed as the mean of the pointwise absolute differences between the estimated PDP and the corresponding ground-truth PDP over a predefined evaluation grid. Formally, let \(\hat{f}_j^{(k)}(x)\) denote the partial dependence profile of model \(M_k \in \mathcal {R}\) for feature \(X_j\) evaluated at value x. The ground-truth PDP is denoted as \(f_j^{\text {gt}}(x)\). The mean absolute error between the estimated profile \(\hat{f}_j\) (the best-model PDP or the Rashomon PDP) and the ground-truth profile for feature \(X_j\) under noise level \(\nu\) is computed as:

\(\text{MD}_j^{(\nu)} = \frac{1}{m} \sum_{\ell=1}^{m} \left| \hat{f}_j(x_\ell) - f_j^{\text{gt}}(x_\ell) \right|,\)
where \(\{x_1, \ldots , x_m\}\) is the evaluation grid for feature \(X_j\). The ground-truth PDP values \(f_j^{\text {gt}}(x)\) are generated from the data-generating process used in the synthetic setup.
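A minimal implementation of this metric over a shared evaluation grid might look like:

```python
import numpy as np

def mean_distance(pdp_est, pdp_gt):
    """Mean pointwise absolute difference between an estimated PDP and
    the ground-truth PDP, both evaluated on the same grid."""
    pdp_est = np.asarray(pdp_est, dtype=float)
    pdp_gt = np.asarray(pdp_gt, dtype=float)
    return float(np.mean(np.abs(pdp_est - pdp_gt)))
```

For instance, `mean_distance([1.0, 2.0, 3.0], [1.5, 2.0, 2.5])` averages the pointwise gaps 0.5, 0.0, and 0.5, giving 1/3.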
4.3.2 Mean Width of Confidence Intervals
To summarize the overall uncertainty in the Rashomon PDP for a feature \(X_j\), we define the mean width of confidence intervals (MWCI). Given a grid of input values \(\{x_1, x_2, \ldots , x_{n_x}\}\) for \(X_j\), it is calculated as:

\(\text{MWCI}_j = \frac{1}{n_x} \sum_{\ell=1}^{n_x} \left[ Q_{1-\alpha/2}\!\left(\{\bar{f}_j^{(b)}(x_\ell)\}_{b=1}^{B}\right) - Q_{\alpha/2}\!\left(\{\bar{f}_j^{(b)}(x_\ell)\}_{b=1}^{B}\right) \right],\)
where \(\bar{f}_j^{(b)}(x_\ell)\) denotes the partial dependence estimate at point \(x_\ell\) from the b-th bootstrap replicate over the Rashomon set, \(Q_p(\cdot )\) is the p-th quantile operator, and \(\alpha\) is the significance level for the confidence interval, e.g., \(\alpha =0.05\) for 95% confidence intervals. A larger mean width of confidence intervals indicates greater uncertainty or variability in the model explanations for feature \(X_j\), whereas a smaller mean width suggests more stable and consistent PDPs across models.
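Assuming the bootstrap replicate profiles are stacked into a matrix, the metric can be computed with empirical quantiles:

```python
import numpy as np

def mean_ci_width(boot_pdps, alpha=0.05):
    """boot_pdps: array of shape (B, n_x), one PDP per bootstrap
    replicate; returns the mean width of the pointwise (1-alpha) CIs."""
    boot_pdps = np.asarray(boot_pdps, dtype=float)
    upper = np.quantile(boot_pdps, 1 - alpha / 2, axis=0)
    lower = np.quantile(boot_pdps, alpha / 2, axis=0)
    return float(np.mean(upper - lower))
```

When all replicates agree exactly, the width is zero; wider spread across replicates widens the reported interval.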
4.3.3 Coverage Rate
The coverage rate measures the proportion of feature values for which the best model’s PDP lies entirely within the Rashomon-based confidence interval. A higher coverage rate indicates that the Rashomon set reliably captures the feature effect of the best model, whereas a lower rate suggests discrepancies between the best model and the set of near-optimal models. Let \(f_j(x)\) denote the PDP of the best model \(M^*\). Then the coverage rate is defined as:

\(\text{CR}_j = \frac{1}{n_x} \sum_{\ell=1}^{n_x} \mathbb{I}\left[ f_j(x_\ell) \in \text{CI}_j(x_\ell) \right],\)
where \(\mathbb {I}[\cdot ]\) is the indicator function, which is 1 if the value \(f_j(x_\ell )\) lies within the Rashomon-based confidence interval \(\text {CI}_j(x_\ell )\), and 0 otherwise. This provides a quantitative measure of how consistently the Rashomon PDP captures the behavior of the best model.
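Given the best model's profile and the pointwise Rashomon interval bounds, the coverage rate reduces to a single mean of indicators:

```python
import numpy as np

def coverage_rate(best_pdp, lower, upper):
    """Fraction of grid points where the best model's PDP falls inside
    the Rashomon-based confidence interval [lower, upper]."""
    best_pdp = np.asarray(best_pdp, dtype=float)
    inside = (np.asarray(lower) <= best_pdp) & (best_pdp <= np.asarray(upper))
    return float(np.mean(inside))
```

For example, a best-model profile of `[0.5, 1.5, 2.0]` with bounds `[0, 1, 1]` and `[1, 2, 1.5]` is covered at two of three grid points, giving a rate of 2/3.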
5 Results
This section presents empirical results demonstrating the utility of the proposed Rashomon PDP framework. The findings are organized into two main parts: synthetic dataset experiments and real-world benchmark datasets. The following subsections include quantitative metrics and graphical summaries. All the presented results in this section are obtained by running the AutoML tool under Configuration A, as detailed in Table 1.
5.1 Synthetic Dataset Experiments
We present the results of the synthetic dataset experiments under the setup detailed in Section 4.2.2 to demonstrate that the Rashomon PDP is a more plausible estimator of the ground-truth PDP than the best-model PDP returned by the AutoML tool. In this context, plausibility refers to the degree to which the estimated PDPs faithfully capture the true underlying data-generating process, making it a key requirement for reliable interpretability.
5.1.1 Regression Task Results
Table 3 summarizes the performance and complexity characteristics of models trained on synthetic datasets under varying noise levels across three distinct data-generating setups. The reported metrics include the root mean squared error of the best model, the model set size, the Rashomon set size, the Rashomon ratio, and the mean absolute distances of the best-model PDP and the Rashomon PDP to the corresponding ground-truth PDP.
Table 3
For the regression task, model performance, Rashomon set characteristics, and PDP distance metrics are reported for synthetic datasets under different noise levels. BMP: Best Model Performance (RMSE), MSS: Model Set Size, RSS: Rashomon Set Size, RR: Rashomon Ratio, and mean distance of best-model and Rashomon PDPs to the ground-truth PDP
MD: mean distance to ground-truth PDP (last two columns)

Setup  Noise level  BMP     MSS  RSS  RR    MD: Best model PDP  MD: Rashomon PDP
(1)    low          1.6244  18   10   0.56  0.218 ± 0.20        0.105 ± 0.06
       mid          2.3669  17   10   0.59  0.189 ± 0.16        0.130 ± 0.07
       high         3.2390  18   12   0.67  0.232 ± 0.22        0.150 ± 0.07
(2)    low          1.9751  18   11   0.61  0.262 ± 0.12        0.160 ± 0.06
       mid          2.6364  18   12   0.67  0.368 ± 0.11        0.214 ± 0.08
       high         3.4517  18   12   0.67  0.347 ± 0.21        0.219 ± 0.11
(3)    low          1.7646  18   14   0.78  0.111 ± 0.05        0.109 ± 0.07
       mid          2.4797  18   14   0.78  0.155 ± 0.07        0.132 ± 0.08
       high         3.3417  18   14   0.78  0.216 ± 0.12        0.171 ± 0.07
Noise levels correspond to Gaussian noise with variances \(v_i \in \{1, 4, 9\}\) used in the synthetic data generation formulas listed in Table 2, and labeled as low (\(v_i = 1\)), mid (\(v_i = 4\)), and high (\(v_i = 9\)) throughout the results
Across all setups, an increase in noise level consistently leads to a deterioration in predictive performance, as indicated by higher BMP values, e.g., Setup (1): 1.62 \(\rightarrow\) 3.24. This trend reflects the expected challenge of learning reliable models under greater stochasticity in the data. The Rashomon set size and Rashomon ratio also exhibit a mild increasing trend with noise, indicating that more models qualify as near-optimal under noisier conditions. This expansion of the Rashomon set may be attributed to the reduced discriminative power of the loss function, allowing a broader set of models to perform similarly.
When examining the alignment between estimated and ground-truth PDPs, we observe that the Rashomon PDPs are generally closer to the true data-generating relationships than the best model PDPs. For instance, in Setup (1) under high noise, the mean distance for the best model is 0.232 (±0.22), while the corresponding Rashomon PDP average is only 0.150 (±0.07). This pattern is consistent across setups and noise levels and indicates that the Rashomon set, by aggregating multiple valid models, can provide a more stable and faithful representation of the underlying feature-response relationship. Interestingly, Setup (3) yields the lowest PDP distances overall, indicating that the ground-truth relationships in this scenario are more readily recoverable, even in the presence of noise. The Rashomon PDP distances remain notably low, e.g., 0.109 ± 0.07 for the low level of noise, further highlighting the robustness of Rashomon-based interpretations in this setting. Overall, these findings support the hypothesis that the aggregated explanations derived from the Rashomon set can enhance reliability in model interpretation, particularly when individual models may be affected by noise.
The Wilcoxon signed-rank test (Wilcoxon, 1945) was conducted to assess whether the median distances of the Rashomon PDP and the best model PDP to the ground-truth PDP differ significantly, accounting for the paired nature of the measurements. The medians are 0.150 for the Rashomon PDP and 0.218 for the best model PDP. The test showed a statistically significant difference between the two groups (\(V = 0\), \(p = 0.0039\)), indicating that the Rashomon PDP distances are consistently smaller than those of the best model PDP. This statistically supports the observation that Rashomon PDPs generally provide explanations that are both distinct and, in many cases, more plausible than those of a single best model.
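The paired test can be reproduced from the nine setup-by-noise mean distances reported in Table 3 using `scipy.stats.wilcoxon` (assuming SciPy is available):

```python
from scipy.stats import wilcoxon

# Mean distances to the ground-truth PDP from Table 3, paired by
# setup x noise-level cell: best-model PDP vs. Rashomon PDP
best_model_md = [0.218, 0.189, 0.232, 0.262, 0.368, 0.347, 0.111, 0.155, 0.216]
rashomon_md = [0.105, 0.130, 0.150, 0.160, 0.214, 0.219, 0.109, 0.132, 0.171]

# Two-sided paired signed-rank test on the differences
stat, p_value = wilcoxon(best_model_md, rashomon_md)
# All nine differences are positive, so the signed-rank statistic is V = 0
```

With n = 9 pairs and no ties, SciPy uses the exact distribution, and the two-sided p-value is 2/2^9 = 0.0039, matching the value reported above.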
To further illustrate these quantitative findings, Fig. 3 presents the PDPs for the continuous features (\(X_1\), \(X_2\), \(X_3\), \(X_4\), \(X_5\), \(X_7\), \(X_9\)) across three noise levels for Setup (3). Here, the ground-truth PDPs are compared against both the best model PDP and the Rashomon PDP. Across all noise levels, the Rashomon PDPs generally provide a closer match to the ground-truth PDPs than the best-model PDPs, particularly for nonlinear relationships such as those involving \(X_1\) and \(X_4\). As noise increases, the deviation between model-based PDPs and ground-truth PDPs becomes more pronounced, highlighting the value of considering model multiplicity for more reliable interpretations.
Fig. 3
Comparison of Rashomon PDP and best model PDP against ground-truth for selected continuous features under different noise levels in Setup (3) from Table 2
Kruskal–Wallis rank-sum tests were conducted to examine whether the differences between the Rashomon PDPs and the best model PDPs systematically varied across regression setups. The analyses showed no significant effects for the number of models (\(p = 0.5215\)) or maximum runtime constraints (\(p = 0.9992\)), suggesting that these parameters did not influence the observed discrepancies. In contrast, noise level had a statistically significant effect (\(\chi ^2 = 7.61\), \(df = 2\), \(p = 0.0210\)), indicating that the divergence between Rashomon and best-model PDPs increases as noise grows. Similarly, the Rashomon ratio was strongly affected by noise level as well (\(\chi ^2 = 113.75\), \(df = 2\), \(p < 2.1\times 10^{-16}\)). Overall, these findings statistically confirm that, at least within the regression settings explored in this paper (see Table 1), noise is the dominant factor shaping both the alignment of Rashomon PDPs with the ground truth and the stability of Rashomon set properties.
5.1.2 Classification Task Results
The results in Table 4 illustrate how different levels of noise affect model performance and set characteristics across three synthetic data-generating processes for the classification task.
Table 4
For the classification task, model performance, Rashomon set characteristics, and PDP distance metrics are reported for synthetic datasets under different noise levels. BMP: Best Model Performance (AUC), MSS: Model Set Size, RSS: Rashomon Set Size, RR: Rashomon Ratio, and mean distance of best-model and Rashomon PDPs to the ground-truth PDP
MD: mean distance to ground-truth PDP (last two columns)

Setup  Noise level  BMP     MSS  RSS  RR    MD: Best model PDP  MD: Rashomon PDP
(1)    low          0.9027  18   18   1     0.039 ± 0.032       0.037 ± 0.045
       mid          0.8717  18   18   1     0.045 ± 0.030       0.042 ± 0.042
       high         0.8407  18   18   1     0.054 ± 0.031       0.053 ± 0.034
(2)    low          0.8225  18   15   0.83  0.037 ± 0.024       0.031 ± 0.022
       mid          0.7871  18   16   0.89  0.050 ± 0.027       0.041 ± 0.021
       high         0.7427  18   16   0.89  0.063 ± 0.028       0.049 ± 0.024
(3)    low          0.8161  18   17   0.94  0.052 ± 0.020       0.028 ± 0.024
       mid          0.7708  18   17   0.94  0.071 ± 0.013       0.056 ± 0.017
       high         0.7230  18   17   0.94  0.083 ± 0.015       0.072 ± 0.015
Noise levels correspond to Gaussian noise with variances \(v_i \in \{1, 4, 9\}\) used in the synthetic data generation formulas listed in Table 2, and labeled as low (\(v_i = 1\)), mid (\(v_i = 4\)), and high (\(v_i = 9\)) throughout the results
In all synthetic dataset setups for the classification task, increasing the noise level leads to a consistent decline in model performance, as reflected by decreasing best model performance values, for example, from 0.90 to 0.84 in Setup (1). This trend indicates the expected degradation in model accuracy under higher noise conditions. Despite this, Rashomon set size and Rashomon ratio remain relatively high and stable across noise levels, suggesting that many models continue to meet the performance threshold even under noise, possibly due to the broader performance plateau induced by reduced signal-to-noise ratio.
When examining the plausibility of the estimated PDPs to the ground-truth relationships, we observe that Rashomon PDPs are frequently closer to the true function than the PDPs from the single best model. For example, in Setup (2) under high noise, the best model’s PDP has a distance of 0.063, while the Rashomon PDP is closer at 0.049. Similarly, in Setup (3) under low noise, the Rashomon PDP achieves a minimal deviation of 0.028 (± 0.024), outperforming the best model’s 0.052. This pattern highlights the increased robustness and fidelity of Rashomon PDPs under varying levels of noise. Notably, Setup (3) demonstrates the smallest overall PDP distances across all noise levels. Even at high noise levels, the Rashomon PDP distance is only 0.072, and it drops to as low as 0.028 under low noise conditions. These findings imply that the underlying data-generating relationships in Setup (3) are more amenable to learning, and that the Rashomon approach effectively captures these relationships with greater stability across noise conditions.
To examine whether the median distances from the Rashomon PDP and the best model PDP to the ground-truth PDP differed in a statistically meaningful way, a Wilcoxon signed-rank test (Wilcoxon, 1945) was applied, taking into account the paired structure of the data. The median distance for the Rashomon PDP was 0.042, compared to 0.052 for the best model PDP. The analysis yielded a statistically significant result (\(V = 0\), \(p = 0.0039\)), indicating that the Rashomon PDP distances are consistently smaller than those of the best model. This finding statistically reinforces the view that Rashomon PDPs tend to offer explanations that are distinct from, and in some cases potentially more credible than, those provided by a single best-performing model.
Figure 4 presents the PDPs for the continuous features \(X_1\), \(X_3\), \(X_4\), \(X_5\), \(X_7\), \(X_9\) across three distinct noise levels. In these plots, the ground-truth PDPs are juxtaposed with the best model’s PDP and the Rashomon PDP. The Rashomon PDPs consistently provide a more reliable representation of the ground truth than the single best-model PDPs, especially for nonlinear relationships, such as those observed for \(X_1\) and \(X_4\). The growing deviation between the model-based PDPs and the ground truth as the noise level increases underscores the importance of considering model multiplicity for more trustworthy interpretations.
Fig. 4
Comparison of Rashomon PDP and best model PDP against ground-truth for selected continuous features under different noise levels as introduced in Setup (3) from Table 2
To further investigate whether the differences between the Rashomon PDPs and the best model PDPs varied systematically across AutoML configurations, Kruskal–Wallis rank-sum tests were applied. The analyses revealed no significant effects for the number of models (\(p = 0.4300\)) or maximum runtime constraints (\(p = 0.9805\)), indicating that these factors did not influence the observed differences. In contrast, noise level exhibited a statistically significant effect (\(\chi ^2 = 7.51\), \(df = 2\), \(p = 0.0239\)), suggesting that the divergence between Rashomon and best-model PDPs becomes more pronounced with increasing noise. Moreover, the Rashomon ratio was strongly affected by noise level as well (\(\chi ^2 = 109.05\), \(df = 2\), \(p < 2.2\times 10^{-16}\)). Taken together, these results provide statistical confirmation that, at least under the AutoML configurations considered in this paper (detailed in Table 1), noise plays a pivotal role in shaping both the fidelity of Rashomon PDPs relative to the ground truth and the stability of Rashomon set characteristics.
5.2 Real-world Dataset Experiments
In this section, we present the results of the experiments conducted on the OpenML-CTR23 (Fischer et al., 2023) for the regression task and the OpenML-CC18 (Bischl et al., 2021) for the classification task.
5.2.1 Regression Task Results
Table 5 presents key metrics to evaluate model performance and uncertainty across the datasets. Best model performance reflects the performance of the top model on each dataset in terms of root mean squared error, while the model set size represents the total number of models trained for each dataset. The Rashomon set size indicates the number of models in the Rashomon set, i.e., models with performance similar to the best model. The Rashomon ratio measures the proportion of the Rashomon set relative to the total model set: a higher ratio suggests a larger set of models performing similarly, implying greater model diversity, which is crucial for understanding model robustness. For datasets where the Rashomon set could not be formed because fewer than two models qualified, the corresponding statistics were not computed.
Table 5
Set sizes and uncertainty metrics for each dataset. BMP: Best model performance in terms of RMSE, MSS: Model set size, RSS: Rashomon set size, RR: Rashomon ratio, MWCI: Mean width of confidence intervals, CR: Coverage rate
Dataset                 BMP          MSS  RSS  RR      MWCI                      CR
abalone                 2.1467       19   13   0.6842  0.5646 ± 0.32             0.4214 ± 0.27
airfoil_self_noise      1.5478       22   3    0.1364  0.2886 ± 0.22             1.0000 ± 0.00
auction_verification    371.0463     22   3    0.1364  92.3236 ± 57.02           1.0000 ± 0.00
brazilian_houses        4.1028       22   5    0.2273  0.6134 ± 0.17             0.6938 ± 0.20
california_housing      3.5789       15   6    0.4000  1.2140 ± 0.69             0.2889 ± 0.25
cars                    9.2580       6    5    0.8333  0.6905 ± 0.68             0.4878 ± 0.32
concrete_compressive    3.8452       22   3    0.1364  0.3885 ± 0.25             0.9782 ± 0.03
cps88wages              0.7593       22   9    0.4091  0.0487 ± 0.05             0.2500 ± 0.32
cpu_activity            0.0006       17   1    -       -                         -
diamonds                0.6060       19   6    0.3158  0.0506 ± 0.02             0.4455 ± 0.19
energy_efficiency       0.6035       22   9    0.4091  0.0401 ± 0.01             0.7136 ± 0.22
fifa                    0.0081       19   2    0.1053  0.0003 ± 0.00             1.0000 ± 0.00
forest_fires            1.1825       3    2    0.6667  0.4857 ± 0.57             1.0000 ± 0.00
fps_benchmark           15641.9545   6    1    -       -                         -
geographical_origin     2.0706       6    4    0.6667  0.8788 ± 0.74             0.4259 ± 0.27
grid_stability          45532.4256   13   6    0.4615  8782.0985 ± 4371.93       0.2688 ± 0.27
health_insurance        2.1419       8    4    0.5000  2.0780 ± 5.37             0.5976 ± 0.32
kin8nm                  528.6411     16   7    0.4375  350.7176 ± 269.25         0.3417 ± 0.19
kings_county            0.1085       18   2    0.1111  0.0119 ± 0.01             1.0000 ± 0.00
miami_housing           0.0215       15   8    0.5333  0.0004 ± 0.00             0.7438 ± 0.14
Moneyball               86377.6786   18   8    0.4444  10383.3733 ± 7957.86      0.4323 ± 0.26
naval_propulsion        443.8287     14   13   0.9286  38.5566 ± 5.99            0.6408 ± 0.01
physiochemical          12.0804      19   2    0.1053  30.0358 ± 0.00            1.0000 ± 0.00
pumadyn32nh             110139.3275  16   5    0.3125  77076.2159 ± 113276.89    0.4378 ± 0.28
QSAR_fish_toxicity      307.0835     18   1    -       -                         -
red_wine                0.8370       15   2    0.1333  0.2242 ± 0.19             0.9667 ± 0.18
sarcos                  14.6829      19   17   0.8947  1.5426 ± 0.73             0.8250 ± 0.17
socmob                  8659.1076    16   9    0.5625  1093.6380 ± 2232.40       0.7093 ± 0.21
solar_flare             22.6084      22   3    0.1364  6.9054 ± 4.18             0.9812 ± 0.04
space_ga                0.7068       22   4    0.1818  0.6040 ± 0.44             0.8667 ± 0.18
student_performance     112.5122     22   21   0.9545  5.9025 ± 0.69             0.1400 ± 0.23
superconductivity       0.9307       22   2    0.0909  0.0312 ± 0.03             1.0000 ± 0.00
video_transcoding       0.8382       22   12   0.5455  0.1391 ± 0.04             0.5333 ± 0.23
wave_energy             2056.7710    22   2    0.0909  538.4280 ± 681.47         1.0000 ± 0.00
white_wine              0.1094       19   7    0.3684  0.0440 ± 0.04             0.7500 ± 0.14
The results drawn from the table indicate substantial variation in model diversity and uncertainty levels. For instance, some datasets, such as airfoil_self_noise and auction_verification, have very small Rashomon sets, meaning that few models perform comparably to the best one and model diversity is low. On the other hand, datasets with a higher Rashomon ratio, like cps88wages and health_insurance, demonstrate that many different models perform similarly, indicating a higher level of model diversity. The mean width of confidence intervals and coverage rates likewise vary considerably across datasets, such as energy_efficiency and fifa, reflecting differing degrees of variability in model explanations. These findings highlight the importance of model reliability and potential prediction differences when considering multiple models within a Rashomon set.
To provide a deeper understanding of how model diversity relates to predictive uncertainty, Fig. 5 presents a visual analysis of the interactions between the Rashomon ratio, the mean width of confidence intervals, and the coverage rate.
Fig. 5
The trade-off between log(Mean Width of CI), Coverage Rate, and Rashomon Ratio. The contour levels indicate values of the Rashomon ratio, with warmer colors representing higher values. The x-axis corresponds to the logarithm of the average width of the confidence intervals, while the y-axis shows the empirical coverage rate of the predictions. Regions with intermediate interval widths and moderate-to-high coverage tend to exhibit higher Rashomon ratios, suggesting greater model multiplicity. In contrast, extreme combinations—very narrow or very wide intervals, or very low/high coverage—are associated with lower Rashomon diversity
As the log-transformed mean width of confidence intervals increases, the Rashomon ratio changes non-linearly. Notably, regions with higher coverage rates and moderate confidence interval widths are associated with elevated Rashomon ratios, indicating a greater diversity of models that achieve similar performance. Conversely, low Rashomon ratios tend to cluster in areas where the confidence intervals are too wide or too narrow, and coverage rates are very low or very high.
To illustrate these results on dataset applications, we consider the following examples. Figure 6 presents two illustrative examples from the datasets for the regression task, selected to highlight how Rashomon PDPs reveal model agreement and divergence. In Subfigure (a), the feature surface_area from the energy_efficiency dataset exhibits a high alignment between the best model PDP and the Rashomon PDP, along with a narrow confidence interval and high coverage. This agreement corresponds with a relatively low Rashomon ratio of 0.4091, indicating that only a small portion of the model set meets the Rashomon criterion and thus the best model’s behavior is more representative of the overall set.
Conversely, Subfigure (b), which depicts the feature temp from the forest_fires dataset, shows notable divergence between the best model PDP and the Rashomon PDP, especially for higher temperature values. The wider confidence interval and lower coverage rate correspond to a higher Rashomon ratio of 0.6667, suggesting a greater diversity among well-performing models.
These examples empirically support the key claim of this paper: when the Rashomon ratio is high, there is a greater risk that the best model PDP may not faithfully represent the consensus behavior of the broader model set. Therefore, using the Rashomon PDP provides a more trustworthy summary of feature-response relationships.
Fig. 6
Selected examples for two data sets for the regression task. Panel a shows an example where the effect seen by the best model is consistent with Rashomon PDP. Panel b shows an example of PDP estimates that differ between Rashomon and the best-model
We conducted Kruskal-Wallis rank-sum tests with the number of models as the grouping factor for assessing how different AutoML configurations influence the statistical properties of Rashomon sets. The analyses revealed no significant effect for the Rashomon ratio (\(\chi ^2 = 4.26\), \(df = 3\), \(p = 0.201\)), suggesting that the number of models did not substantially influence the relative size of the Rashomon set. In contrast, both the mean width of confidence interval (\(\chi ^2 = 45.19\), \(df = 3\), \(p < 1 \times 10^{-8}\)) and the coverage rate (\(\chi ^2 = 10.21\), \(df = 3\), \(p = 0.019\)) exhibited statistically significant effects, indicating that changes in the maximum number of models considerably affected the uncertainty and reliability of model explanations.
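A sketch of this grouping-factor test with `scipy.stats.kruskal`; the values below are synthetic placeholders for a dataset-level metric, not the paper's measurements:

```python
from scipy.stats import kruskal

# Hypothetical Rashomon ratios grouped by the four max_models settings
# (20, 40, 80, 160) used as the grouping factor in the paper's tests
rr_20 = [0.41, 0.38, 0.44, 0.40, 0.36]
rr_40 = [0.42, 0.39, 0.45, 0.41, 0.37]
rr_80 = [0.40, 0.43, 0.37, 0.42, 0.39]
rr_160 = [0.39, 0.44, 0.41, 0.40, 0.38]

# Kruskal-Wallis rank-sum test across the four groups (df = 3)
stat, p_value = kruskal(rr_20, rr_40, rr_80, rr_160)
# A large p-value here would mirror the paper's finding that the
# Rashomon ratio is insensitive to the number of models
```

The same call pattern applies to the mean width of confidence intervals and the coverage rate, simply by swapping in those metric values per group.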
5.2.2 Classification Task Results
Table 6 presents key performance and uncertainty metrics across various real datasets.
Table 6
Set sizes and uncertainty metrics for each dataset. BMP: Best model performance in terms of AUC, MSS: Model set size, RSS: Rashomon set size, RR: Rashomon ratio, MWCI: Mean width of confidence intervals, CR: Coverage rate
Dataset            BMP     MSS  RSS  RR      MWCI             CR
adult              0.9264  16   16   1       0.0353 ± 0.01    0.1188 ± 0.07
bank-marketing     0.9369  14   14   1       0.0440 ± 0.03    0.1571 ± 0.22
banknote           1.0000  19   19   1       0.0544 ± 0.02    0.2250 ± 0.16
Bioresponse        0.8722  14   13   0.9286  0.0050 ± 0.00    0.2797 ± 0.33
blood-transfusion  0.7970  22   16   0.7273  0.0668 ± 0.04    0.5125 ± 0.37
breast-w           0.9934  22   22   1       0.0207 ± 0.01    0.3333 ± 0.18
churn              0.9206  17   16   0.9412  0.1409 ± 0.01    0.8018 ± 0.24
cylinder-bands     0.9336  22   11   0.5000  0.0656 ± 0.01    0.5396 ± 0.27
climate-model      0.9794  22   9    0.4091  0.0200 ± 0.00    0.0875 ± 0.10
credit-approval    0.9621  22   20   0.9091  0.0445 ± 0.01    0.4481 ± 0.26
credit-g           0.8077  22   17   0.7727  0.0291 ± 0.00    0.0357 ± 0.07
diabetes           0.8435  22   20   0.9091  0.0348 ± 0.01    0.1906 ± 0.15
electricity        0.9821  14   10   0.7143  0.0385 ± 0.02    0.0857 ± 0.13
ilpd               0.8038  22   13   0.5909  0.0546 ± 0.02    0.5056 ± 0.24
jm1                0.7714  16   12   0.7500  0.0309 ± 0.01    0.5452 ± 0.39
kc1                0.8563  19   12   0.6316  0.0349 ± 0.02    0.6754 ± 0.36
kc2                0.8059  22   22   1.0000  0.0468 ± 0.01    0.4347 ± 0.32
madelon            0.9201  22   10   0.4545  0.0066 ± 0.00    0.0111 ± 0.06
nomao              0.9953  14   14   1       0.0155 ± 0.01    0.1556 ± 0.21
numerai28.6        0.5314  13   13   1       0.0038 ± 0.00    0.4810 ± 0.17
ozone-level-8hr    0.9651  18   14   0.7778  0.0158 ± 0.00    0.0861 ± 0.15
pc3                0.9321  22   9    0.4091  0.0631 ± 0.09    0.6988 ± 0.35
pc4                0.9690  22   22   1       0.0336 ± 0.02    0.1074 ± 0.21
phoneme            0.9654  18   12   0.6667  0.0313 ± 0.01    0.3700 ± 0.20
qsar-biodeg        0.9434  22   21   0.9545  0.0638 ± 0.02    0.4256 ± 0.28
sick               0.9961  18   14   0.7778  0.0179 ± 0.01    0.0857 ± 0.09
spambase           0.9895  11   11   1       0.0705 ± 0.04    0.1596 ± 0.22
wdbc               0.9859  22   22   1       0.0166 ± 0.01    0.4667 ± 0.28
wilt               0.9977  16   16   1       0.0818 ± 0.11    0.1900 ± 0.20
The results drawn from the table indicate meaningful differences in model diversity and uncertainty across datasets. Several datasets, such as pc4, kc2, nomao, and wilt, display a Rashomon ratio of 1, meaning every trained model falls within the Rashomon threshold. This implies a high degree of model diversity, with many models performing similarly well. Conversely, pc3 and climate-model-simulation-crashes (0.4091) and madelon (0.4545) have relatively low Rashomon ratio values, indicating a smaller set of similarly performing models and, therefore, limited model diversity. In such cases, model selection becomes more sensitive, as fewer alternatives meet the Rashomon criterion. Uncertainty, measured via the mean width of confidence intervals and the coverage rate, also shows substantial variation. For instance, numerai28.6, Bioresponse, and madelon have notably small mean widths of confidence intervals (\(0.0038 \pm 0.00\), \(0.0050 \pm 0.00\), and \(0.0066 \pm 0.00\), respectively), suggesting that model explanations are comparatively stable. In contrast, churn, pc3, and blood-transfusion exhibit larger mean widths of confidence intervals and higher coverage rates, indicating increased uncertainty in model explanations. Specifically, the coverage rate values for pc3 (\(0.6988 \pm 0.35\)) and churn (\(0.8018 \pm 0.24\)) reflect wider predictive variability and possibly greater disagreement across models in the Rashomon set.
To further investigate the interplay between predictive uncertainty and model multiplicity, Fig. 7 visualizes the relationship between the log-transformed mean width of confidence intervals, the coverage rate, and the Rashomon ratio for the classification task.
Fig. 7
Contour plot depicting the relationship between log(Mean Width of CI), Coverage Rate, and Rashomon Ratio for the classification datasets. Warmer colors correspond to higher Rashomon ratios, indicating greater model multiplicity
As the figure shows, the Rashomon ratio varies non-linearly across the space defined by interval width and coverage rate. Regions characterized by moderate confidence interval widths combined with coverage rates around 0.5–0.75 tend to exhibit the highest Rashomon ratios, suggesting that a larger number of diverse models can achieve similar predictive performance in these regimes. In contrast, areas with either very narrow or very wide intervals, or coverage rates close to the extremes (near 0 or 1), are associated with lower Rashomon ratios. This pattern highlights that both overly confident and overly uncertain predictions tend to restrict model multiplicity, while a balanced trade-off between interval width and coverage supports greater diversity in predictive models.
Similar to the previous subsection, we consider the following examples. Figure 8 presents two illustrative examples from the classification datasets, selected to highlight how Rashomon PDPs reveal both model disagreement and agreement. In Subfigure (a), the feature age from the adult dataset exhibits a notable divergence between the best-model PDP and the Rashomon PDP, particularly for age values above approximately 60. This divergence, coupled with the reported coverage rate of 0.15 (indicating low coverage) and a mean confidence interval width of 0.0226, suggests a lack of consensus among the well-performing models regarding the effect of age at higher values. This behavior is consistent with a relatively high Rashomon ratio, implied by the low coverage rate and the visual divergence (though not reported explicitly for this figure), and indicates that a significant portion of the model set behaves differently from the best model.
Fig. 8
Selected examples for two data sets for the classification task. Panel a shows an example where the effect seen by the best model is different from that identified by Rashomon PDP. Panel b shows an example of consistent explanations
Conversely, Subfigure (b), which depicts the feature DESIGN_COMPLEXITY from the pc3 dataset, shows high alignment between the best-model PDP and the Rashomon PDP, along with a coverage rate of 1 and a narrow mean confidence interval width of 0.0265. This agreement is consistent with a low Rashomon ratio, implied by the high coverage rate, suggesting that the best model’s behavior is highly representative of the overall set of well-performing models.
These findings underscore the trade-off between model diversity and explanation stability. While high Rashomon ratio values provide flexibility in model choice, they also increase the need to evaluate candidate models for robustness. Similarly, datasets with high uncertainty warrant cautious interpretation of predictions, even when many models appear to perform similarly.
To assess how different AutoML configurations influence the statistical properties of Rashomon sets, we conducted Kruskal–Wallis rank-sum tests with the number of models as the grouping factor. The analyses revealed no significant effect on the Rashomon ratio (\(\chi ^2 = 4.46\), \(df = 3\), \(p = 0.216\)), suggesting that the number of models did not substantially influence the relative size of the Rashomon set. In contrast, both the mean width of confidence intervals (\(\chi ^2 = 43.53\), \(df = 3\), \(p < 1 \times 10^{-8}\)) and the coverage rate (\(\chi ^2 = 10.47\), \(df = 3\), \(p = 0.015\)) exhibited statistically significant effects, indicating that changing the maximum number of models considerably affects the uncertainty and reliability of model explanations. Taken together, these results provide statistical evidence that while the relative Rashomon set size is robust to the number of models, other stability measures, such as the mean width of confidence intervals and the coverage rate, are sensitive to this AutoML configuration choice.
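The Kruskal–Wallis test compares the rank sums of the stability measures across the configuration groups; the resulting H statistic is compared against a \(\chi^2\) distribution with \(k - 1\) degrees of freedom (here \(df = 3\) for four budgets). As a minimal sketch, the H statistic can be computed as below; this version assumes no tied observations (for ties, a correction or `scipy.stats.kruskal` is appropriate), and the function name is ours.

```python
import numpy as np

def kruskal_h(*groups):
    """Kruskal-Wallis H statistic over k groups (no tie correction;
    compare against chi-squared with k - 1 degrees of freedom)."""
    data = np.concatenate([np.asarray(g, dtype=float) for g in groups])
    n = len(data)
    # Ranks 1..n of the pooled sample (assumes no ties).
    ranks = np.empty(n)
    ranks[np.argsort(data)] = np.arange(1, n + 1)
    h, start = 0.0, 0
    for g in groups:
        r = ranks[start:start + len(g)]
        h += r.sum() ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * h - 3.0 * (n + 1)
```

For two groups `[1, 2]` and `[3, 4]`, the pooled ranks are 1–4 and the statistic evaluates to 2.4.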
6 Conclusion
This paper addresses the critical issue of explanation uncertainty arising from the Rashomon effect, particularly within AutoML processes; in human-centered XAI, effectively communicating this uncertainty is paramount for users to appropriately calibrate their trust in AI systems. While AutoML simplifies model development, its conventional reliance on a single best-performing model overlooks the diversity of near-optimal models that can offer varied, yet equally valid, insights into underlying feature relationships. This oversight introduces a significant source of uncertainty in model explanations, which is rarely communicated to users.
To mitigate this, we introduced Rashomon PDP, a novel model-agnostic framework designed to quantify the inherent variability in explanations across a set of near-optimal models. By aggregating feature effect estimates from models within the Rashomon set, which is already generated by AutoML but typically discarded, our method leverages existing computational outputs with minimal additional cost. This approach explicitly embraces variability rather than concealing it, thereby fostering a more comprehensive understanding of model behavior among users.
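The aggregation step can be summarized in a few lines: compute a partial dependence curve (Friedman, 2001) for each model in the Rashomon set, then report the pointwise mean together with a quantile band. The sketch below is a simplified illustration of this idea, not the paper's implementation; the function names, the quantile-based band, and the `predict` interface are assumptions.

```python
import numpy as np

def pdp_curve(model, X, feature, grid):
    """Partial dependence: mean prediction with one feature clamped
    to each grid value in turn (Friedman, 2001)."""
    curve = []
    for v in grid:
        Xv = X.copy()
        Xv[:, feature] = v
        curve.append(float(np.mean(model.predict(Xv))))
    return np.array(curve)

def rashomon_pdp(models, X, feature, grid, alpha=0.05):
    """Aggregate PDPs over a set of near-optimal models into a mean
    profile plus a (1 - alpha) quantile band (illustrative sketch)."""
    curves = np.stack([pdp_curve(m, X, feature, grid) for m in models])
    return (curves.mean(axis=0),
            np.quantile(curves, alpha / 2, axis=0),
            np.quantile(curves, 1 - alpha / 2, axis=0))
```

Since the per-model curves are computed from models the AutoML search already trained, the only extra cost is the prediction passes over the clamped copies of `X`.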
Our empirical evaluations on both synthetic and real-world datasets underscore the effectiveness of the Rashomon PDP framework. In synthetic datasets with known ground-truth feature relationships, Rashomon PDP demonstrated superior accuracy, reducing the average deviation from ground-truth feature relationships by up to \(38\%\) compared to explanations based on a single best model generated by the AutoML tool. This improvement was particularly pronounced under high-noise conditions, where the average deviation decreased from 0.232 to 0.150. Furthermore, evaluations on 35 real-world datasets for the regression task indicated an average confidence interval coverage of \(71.6\%\), demonstrating that our approach consistently captures model uncertainty while maintaining strong alignment with the most accurate model outputs. For classification datasets, the results consistently showed a high Rashomon ratio, indicating broad model diversity, alongside varying mean confidence interval widths and coverage rates, highlighting different degrees of explanation uncertainty across diverse real-world scenarios. Importantly, our analysis revealed that the Rashomon ratio—a measure of the proportion of near-optimal models—is a reliable signal of explanation uncertainty, particularly in noisy datasets. Higher Rashomon ratios consistently corresponded with greater variability in feature effect estimates, underscoring their utility as early indicators of instability in model explanations.
These findings strongly support the utility of Rashomon PDP in designing more trustworthy explanations within the AutoML context. Crucially, this achievement directly advances the core principles of human-centered XAI. By explicitly presenting a range of plausible feature effects, our framework enables users to form more calibrated mental models, promotes appropriate trust through transparent communication of uncertainty, and encourages critical engagement with AI-generated insights rather than passive acceptance. This is particularly crucial in high-stakes decision-making environments where misinterpretation of model behavior can have significant consequences.
Beyond improving explanatory reliability, Rashomon PDP also reveals a promising direction for evolving AutoML tools themselves. Since the Rashomon set and ratio are derived from models already generated during the AutoML search, they incur no additional computational cost, yet provide information about the uncertainty of model explanations. Integrating this perspective into AutoML workflows could help address recognized open challenges in the field as posed by Baratchi et al. (2024), specifically related to trustworthy AutoML; by achieving better model interpretability and explanation uncertainty, we aim to transform AutoML from a model selection utility into a more transparent and interactive explanatory assistant.
Future research directions include extending the Rashomon PDP framework to other model-agnostic explanation techniques, such as SHAP values or LIME. Investigating the impact of different Rashomon parameter (\(\varepsilon\)) selection strategies and exploring adaptive methods to determine optimal confidence interval widths based on user needs or domain criticality could also yield valuable insights. It is worth noting that PDP may yield misleading results in the presence of class imbalance and correlated variables, which represents a current limitation of our study. In future work, the framework could be extended using Accumulated Local Effects profiles (Apley & Zhu, 2020), improving the robustness and interpretability of feature effect estimates. Furthermore, user studies evaluating the impact of Rashomon PDP on human decision-making, trust calibration, and overall understanding in real-world scenarios will provide critical validation for the human-centered aspects of this work. Finally, exploring the integration of Rashomon PDP into interactive AutoML interfaces could enhance the practical applicability of our framework for a broader range of users.
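To make the suggested ALE extension concrete: unlike PDP, first-order ALE (Apley & Zhu, 2020) averages local prediction differences within quantile bins of the feature, which avoids extrapolating into regions ruled out by correlated features. The following is a minimal sketch under simplifying assumptions (no weighting of the centering step by bin counts, a plain `predict` interface, function name ours), not a production implementation.

```python
import numpy as np

def ale_curve(model, X, feature, n_bins=5):
    """First-order ALE sketch: accumulate average local prediction
    differences within quantile bins of one feature, then center."""
    x = X[:, feature]
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1))
    effects = np.zeros(n_bins)
    for k in range(n_bins):
        # Half-open bins, except the last one which includes the maximum.
        upper = x <= edges[k + 1] if k == n_bins - 1 else x < edges[k + 1]
        in_bin = (x >= edges[k]) & upper
        if not in_bin.any():
            continue
        X_hi, X_lo = X[in_bin].copy(), X[in_bin].copy()
        X_hi[:, feature] = edges[k + 1]
        X_lo[:, feature] = edges[k]
        effects[k] = float(np.mean(model.predict(X_hi) - model.predict(X_lo)))
    ale = np.concatenate([[0.0], np.cumsum(effects)])
    return edges, ale - ale.mean()  # simple (unweighted) centering
```

Averaging such curves over the Rashomon set, as done for PDPs above, would yield a Rashomon ALE profile with the same quantile-band uncertainty visualization.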
Declarations
Competing Interests
The authors have no competing interests to disclose.
Ethics Approval and Consent to Participate
Not applicable.
Consent for Publication
Not applicable.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Mustafa Cavus
is an Associate Professor in the Department of Statistics at Eskisehir Technical University; he received his PhD from Eskisehir Technical University and conducted postdoctoral research at Warsaw University of Technology. His research focuses on Explainable Artificial Intelligence and Responsible Artificial Intelligence, specifically on the uncertainty of explanations in machine learning models and on methodological risks that call algorithmic stability into question. Dr. Cavus studies epistemic responsibility in Explainable Artificial Intelligence systems and works on making these systems more reliable within the framework of transparency and accountability principles.
Jan N. van Rijn
is a tenured Assistant Professor at Leiden University, within the Leiden Institute of Advanced Computer Science (LIACS) and the Automated Design of Algorithms (ADA) cluster. His research focuses on Trustworthy Artificial Intelligence, Automated Machine Learning, and Metalearning. Dr. van Rijn obtained his PhD from Leiden University in 2016, during which he developed OpenML.org, an open science platform designed to facilitate the sharing of machine learning results. Following postdoctoral research positions at the University of Freiburg and Columbia University, he co-authored the book “Metalearning: Applications to Automated Machine Learning and Data Mining” (published by Springer). His research aims to democratize access to machine learning and artificial intelligence by developing tools and knowledge that empower domain experts.
Przemysław Biecek
is the Director of the Centre for Credible AI at the Warsaw University of Technology and a Full Professor in the Faculty of Mathematics, Informatics and Mechanics at the University of Warsaw. His research focuses on Responsible and Explainable Artificial Intelligence—a field he defines as “Model Science”—with an emphasis on model interpretability, controllability, and verifiability. Recognized as one of the world’s top 2% most influential scientists by Stanford University, Dr. Biecek is the creator and maintainer of widely used open-source software packages, such as DALEX, designed to enhance the reliability of machine learning models. A laureate of the Fulbright IMPACT Award, he leads efforts to develop AI systems within the framework of transparency, accountability, and social responsibility, and serves as a prominent voice in the global establishment of credible AI standards.
Li, Y., & Yan, K. (2025). Prediction of bank credit customers churn based on machine learning and interpretability analysis. Data Science in Finance and Economics, 5(1), 19–34.
Ehsan, U., Riedl, M.O.: Human-centered explainable AI: Towards a reflective sociotechnical approach. In: International Conference on Human-Computer Interaction, pp. 449–466 (2020). Springer
Maity, S., Deroy, A.: Human-centric eXplainable AI in education. arXiv preprint arXiv:2410.19822 (2024)
Ferrario, A., Termine, A., Facchini, A.: Addressing social misattributions of large language models: An HCXAI-based approach. arXiv preprint arXiv:2403.17873 (2024)
Suffian, M., Stepin, I., Alonso-Moral, J.M., Bogliolo, A., et al. Investigating human-centered perspectives in explainable artificial intelligence. In: CEUR Workshop Proceedings, vol. 3518, pp. 47–66 (2023)
Nguyen, T., Zhu, J.: Towards better user requirements: How to involve human participants in XAI research. Workshop: Human Centered AI at NeurIPS (2022)
Roy, S., Laberge, G., Roy, B., Khomh, F., Nikanjam, A., Mondal, S.: Why don’t XAI techniques agree? Characterizing the disagreements between post-hoc explanations of defect predictions. In: 2022 IEEE International Conference on Software Maintenance and Evolution (ICSME), pp. 444–448 (2022). IEEE
Barr, B., Fatsi, N., Hancox-Li, L., Richter, P., Proano, D., Mok, C.: The disagreement problem in faithfulness metrics. Workshop: XAI in Action: Past, Present, and Future Applications (2023)
Krishna, S., Han, T., Gu, A., Wu, S., Jabbari, S., Lakkaraju, H.: The disagreement problem in explainable machine learning: A practitioner’s perspective. Transactions on Machine Learning Research (2024)
Löfström, H., Löfström, T., Szabadvary, J.H.: Ensured: Explanations for decreasing the epistemic uncertainty in predictions. arXiv preprint arXiv:2410.05479 (2024)
Goethals, S., Martens, D., Evgeniou, T.: Manipulation risks in explainable AI: The implications of the disagreement problem. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 185–200 (2023). Springer
Mitruț, O., Moise, G., Moldoveanu, A., Moldoveanu, F., Leordeanu, M., & Petrescu, L. (2024). Clarity in complexity: how aggregating explanations resolves the disagreement problem. Artificial Intelligence Review, 57(12), 338.
Reingold, O., Shen, J.H., Talati, A.: Dissenting explanations: Leveraging disagreement to reduce model overreliance. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 21537–21544 (2024)
Vascotto, I., Rodriguez, A., Bonaita, A., Bortolussi, L.: When can you trust your explanations? A robustness analysis on feature importances. The 3rd World Conference on eXplainable Artificial Intelligence (2025)
Schwarzschild, A., Cembalest, M., Rao, K., Hines, K., Dickerson, J.: Reckoning with the disagreement problem: Explanation consensus as a training objective. In: Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pp. 662–678 (2023)
Feurer, M., Klein, A., Eggensperger, K., Springenberg, J.T., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, pp. 2962–2970 (2015)
Erickson, N., Mueller, J., Shirkov, A., Zhang, H., Larroy, P., Li, M., Smola, A.J.: AutoGluon-Tabular: Robust and accurate automl for structured data. In: Proceedings of the 7th International Conference on Machine Learning Workshop on Automated Machine Learning (2020)
Semenova, L., Chen, H., Parr, R., & Rudin, C. (2023). A path to simpler models starts with noise. Advances in Neural Information Processing Systems, 36, 3362–3401.
Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16(3), 199–231.
Rudin, C., Chen, C., Chen, Z., Huang, H., Semenova, L., & Zhong, C. (2022). Interpretable machine learning: Fundamental principles and 10 grand challenges. Statistics Surveys, 16, 1–85.
Rudin, C., Zhong, C., Semenova, L., Seltzer, M., Parr, R., Liu, J., Katta, S., Donnelly, J., Chen, H., Boner, Z.: Position: Amazing things come from having many good models. Proceedings of the 41st International Conference on Machine Learning (2024)
Watson-Daniels, J., Calmon, F.d.P., D’Amour, A., Long, C., Parkes, D.C., Ustun, B.: Predictive churn with the set of good models. arXiv preprint arXiv:2402.07745 (2024)
Watson-Daniels, J., Parkes, D.C., Ustun, B.: Predictive multiplicity in probabilistic classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 10306–10314 (2023)
Cavus, M., Biecek, P.: Investigating the impact of balancing, filtering, and complexity on predictive multiplicity: A data-centric perspective. Information Fusion, 103243 (2025)
Yardimci, Y., Cavus, M.: Rashomon perspective for measuring uncertainty in the survival predictive maintenance models. In: 2025 33rd Signal Processing and Communications Applications Conference (SIU), pp. 1–4 (2025)
Biecek, P., Samek, W.: Position: explain to question not to justify. In: Proceedings of the 41st International Conference on Machine Learning, pp. 3996–4006 (2024)
Kobylińska, K., Krzyziński, M., Machowicz, R., Adamek, M., & Biecek, P. (2024). Exploration of the Rashomon set assists trustworthy explanations for medical data. IEEE Journal of Biomedical and Health Informatics, 28(11), 6454–6465.
Friedman, J.H.: Greedy function approximation: A gradient boosting machine. Annals of Statistics, 1189–1232 (2001)
Cavus, M., van Rijn, J.N., Biecek, P., et al.: Beyond the Single-Best Model: Rashomon Partial Dependence Profile for Trustworthy Explanations in AutoML. 28th International Conference on Discovery Science (2025)
Ehsan, U., Liao, Q.V., Muller, M., Riedl, M.O., Weisz, J.D.: Expanding explainability: Towards social transparency in AI systems. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–19 (2021)
Marx, C., Calmon, F., Ustun, B.: Predictive multiplicity in classification. In: International Conference on Machine Learning, pp. 6765–6774 (2020). PMLR
Wickstrøm, K., Höhne, M., Hedström, A.: From flexibility to manipulation: The slippery slope of XAI evaluation. In: European Conference on Computer Vision, pp. 233–250 (2024). Springer
Chiaburu, T., Bießmann, F., Haußer, F.: Uncertainty propagation in XAI: A comparison of analytical and empirical estimators. arXiv preprint arXiv:2504.03736 (2025)
Rawal, K., Fu, Z., Delaney, E., Russell, C.: Evaluating model explanations without ground truth. In: Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency, pp. 3400–3411 (2025)
Miró-Nicolau, M., Jaume-i-Capó, A., & Moyà-Alcover, G. (2025). A comprehensive study on fidelity metrics for XAI. Information Processing & Management, 62(1), Article 103900.
Laberge, G., Pequignot, Y.B., Marchand, M., Khomh, F.: Tackling the XAI disagreement problem with regional explanations. In: International Conference on Artificial Intelligence and Statistics, pp. 2017–2025 (2024). PMLR
Chatterjee, S., Colombo, E.R., Raimundo, M.M.: Multi-criteria Rank-based Aggregation for eXplainable AI. International Joint Conference on Neural Networks (2025)
Banegas-Luna, A.J., Martínez-Cortes, C., Perez-Sanchez, H.: Fighting the disagreement in explainable machine learning with consensus. arXiv preprint arXiv:2307.01288 (2023)
Simson, J., Draxler, F., Mehr, S., Kern, C.: Preventing harmful data practices by using participatory input to navigate the machine learning multiverse. In: Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems, pp. 1–30 (2025)
Baratchi, M., Wang, C., Limmer, S., van Rijn, J.N., Hoos, H., Bäck, T., Olhofer, M.: Automated machine learning: past, present and future. Artificial Intelligence Review 57 (2024)
Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13, 281–305.
Snoek, J., Larochelle, H., Adams, R.P.: Practical bayesian optimization of machine learning algorithms. In: Proceedings of the 25th International Conference on Neural Information Processing Systems, pp. 2951–2959. Curran Associates Inc., USA (2012)
Bergstra, J., Bardenet, R., Bengio, Y., Kégl, B.: Algorithms for hyper-parameter optimization. In: Proceedings of the 24th International Conference on Neural Information Processing Systems, vol. 24, pp. 2546–2554. Curran Associates Inc., USA (2011)
Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. In: Proceedings of the 5th International Conference on Learning and Intelligent Optimization, pp. 507–523. Springer, Berlin, Heidelberg (2011)
Varma, S., & Simon, R. (2006). Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics, 7, 91.
Jamieson, K.G., Talwalkar, A.: Non-stochastic best arm identification and hyperparameter optimization. In: Proceedings of the International Conference on Artificial Intelligence and Statistics, Cadiz, Spain, May 9-11, 2016. Journal of Machine Learning Research Workshop and Conference Proceedings, vol. 51, pp. 240–248 (2016)
Mohr, F., & van Rijn, J. N. (2023). Fast and informative model selection using learning curve cross-validation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8), 9669–9680.
Klein, A., Falkner, S., Bartels, S., Hennig, P., Hutter, F.: Fast bayesian optimization of machine learning hyperparameters on large datasets. In: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics. Proceedings of Machine Learning Research, vol. 54, pp. 528–536 (2017)
Feurer, M., Eggensperger, K., Falkner, S., Lindauer, M., Hutter, F.: Auto-Sklearn 2.0: Hands-free AutoML via Meta-Learning. Journal of Machine Learning Research 23(261), 1–61 (2022)
LeDell, E., Poirier, S., et al.: H2o automl: Scalable automatic machine learning. In: Proceedings of the AutoML Workshop at ICML, vol. 2020, p. 24 (2020)
Hutter, F., Hoos, H.H., Leyton-Brown, K.: An efficient approach for assessing hyperparameter importance. In: Proceedings of the International Conference on Machine Learning, Beijing, China, 21-26 June 2014, pp. 754–762 (2014)
Fawcett, C., & Hoos, H. H. (2016). Analysing differences between algorithm configurations through ablation. Journal of Heuristics, 22(4), 431–458.
Perrone, V., Jenatton, R., Seeger, M., Archambeau, C.: Scalable hyperparameter transfer learning. In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 6846–6856 (2018)
van Rijn, J.N., Hutter, F.: Hyperparameter importance across datasets. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2367–2376 (2018)
Moussa, C., Patel, Y. J., Dunjko, V., Bäck, T., & van Rijn, J. N. (2024). Hyperparameter importance and optimization of quantum neural networks across small datasets. Machine Learning, 113, 1941–1966.
Theodorakopoulos, D., Stahl, F., Lindauer, M.: Hyperparameter Importance Analysis for Multi-Objective AutoML. In: ECAI 2024, pp. 1100–1107. IOS Press, Netherlands (2024)
Bischl, B., Casalicchio, G., Feurer, M., Gijsbers, P., Hutter, F., Lang, M., Mantovani, R.G., van Rijn, J.N., Vanschoren, J.: Openml benchmarking suites. In: Proceedings of the NeurIPS 2021 Datasets and Benchmarks Track, (2021)
Müller, S., Toborek, V., Beckh, K., Jakobs, M., Bauckhage, C., Welke, P.: An empirical evaluation of the Rashomon effect in explainable machine learning. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 462–478 (2023). Springer
Cavus, M., & Biecek, P. (2024). An experimental study on the Rashomon effect of balancing methods in imbalanced classification. DEARING: Data-cEntric ARtIficial iNtelliGence Workshop at ECML.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83.
Apley, D. W., & Zhu, J. (2020). Visualizing the effects of predictor variables in black box supervised learning models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 82(4), 1059–1086.