Abstract
Prototype models are an important method for explainable artificial intelligence (XAI) and interpretable machine learning. In this paper, we perform an in-depth analysis of a set of prominent prototype models including ProtoPNet, ProtoPool and PIPNet. For their assessment, we apply a comprehensive set of metrics. In addition to applying standard metrics from literature, we propose several new metrics to further complement the analysis of model interpretability. In our experimentation, we apply the set of prototype models on a diverse set of datasets including fine-grained classification, Non-IID settings and multi-label classification to further contrast the performance. Furthermore, we also provide our source code as an open-source library (https://github.com/uos-sis/quanproto), which facilitates simple application of the metrics itself, as well as extensibility—providing the option for easily adding new metrics and models.
1 Introduction
While there have been enormous advances in deep learning [1, 2], the challenge of developing truly interpretable models with accuracy comparable to their black-box counterparts remains unresolved [3‐5]. This is a prominent and active area of research connecting both interpretability and explainability of the respective models [6]. In particular, in the domain of image-related tasks like classification, non-interpretable models typically outperform most other models or techniques. However, in high-risk domains often encountered in industry or the medical sectors, explainability and interpretability are key requirements [7, 8], also enabling computational sensemaking [9].
Different post-hoc techniques have been developed in the past, such as SHAP and LIME [10, 11]. These explain the decision of a model by calculating the importance of individual input features and can be applied to a large family of models. In the domain of image classification, saliency maps are a particularly prominent subgroup of these model-agnostic methods; e.g., Integrated Gradients, Grad-CAM and other variants explain the attribution of individual or groups of pixels for a given prediction [12, 13]. Although such model-agnostic techniques are widely used and can provide considerable insight into trained models, they remain post-hoc methods with several drawbacks, e.g., [14], compared to a completely intrinsically interpretable model.
One possible solution to overcome some of these problems has been the development of deep part-based prototype models, first presented in [15] and [16]. A key advantage is their intrinsic interpretability. Part-based prototype models focus on learning a meaningful embedding space, using an often pre-trained black-box network as feature extractor. A key distinction from standard black-box models is the aim to extract features that concentrate on specific sections of the image. This division of the image in the embedding space allows the network to learn prototypes that represent only particular parts of the object. The classification layer then uses the presence of a prototype in the image to calculate the prediction. To explain a prediction, the prototypes with the highest evidence are projected back to the input space through a saliency method.
However, users differ in how they experience model explanations. Confirmation bias and subjective understanding make it difficult to correctly anticipate a model's prediction from its explanations [17, 18]. This shows that these models are not yet reliable enough for AI-assisted decision-making, which encouraged us to survey existing assessments of explainability using a well-founded categorization.
Our research is based on and extends the work of Nauta et al. [19]. With an implementation of a substantial portion of the evaluation framework presented in [19], we facilitate a thorough assessment of part-based prototype models. Our contributions are summarized as follows:
1.
We expand the set of metrics referenced in [19] by proposing several novel metrics designed to enhance the interpretability assessment of part-based prototype models.
2.
Specifically, we provide an implementation of 22 metrics, including 13 novel contributions, via our open-source library QuanProto.
3.
Furthermore, we conducted a comprehensive analysis of three prominent part-based prototype models, i. e., ProtoPNet [16], ProtoPool [20] and PIPNet [21]. We provide new insights and demonstrate the utility and applicability of our library.
4.
Our evaluation offers a broad perspective on potential application scenarios by utilizing established datasets such as CUB200 [22] and Cars196 [23] while also extending to a more practice-oriented Non-IID context with the NICO [24] dataset and a multi-label classification task focused on the challenge of learning class-independent features from animals using the AWA2 [25] dataset.
2 Related Work
This work addresses a broad spectrum of subdomains that are relevant to prototype-based networks. Central to this line of research is the idea of learning human-interpretable concepts, a property particularly important in high-stakes applications. While prototype-based models represent a widely adopted approach in this domain, they also exhibit notable limitations. These shortcomings highlight the need for practical and comprehensive analysis tools.
Concept-based Artificial Intelligence Central to concept-based explanation is the notion of conceptualization [26] and concepts, which have already been investigated in different fields, most notably in case-based reasoning [27, 28]. The basic idea is to focus on human-understandable higher-order concepts rather than single features alone. A recent and closely related direction is concept-based artificial intelligence [29]; examples of this approach are concept bottleneck models [30] and concept activation vectors [31]. Another approach, which shares similarities with prototype-based models, is TesNet by Wang et al. [32], where an interpretable embedding space constructed from basis concepts is learned.
Architectures The prototypical part network (ProtoPNet) from Chen et al. [16] is one of the pioneering networks that popularized the concept of part-prototype networks in image classification. In addition to the models examined in this article, other networks have been developed that combine the prototype approach with many concepts from the ML field. Nauta et al. [33] varied the classification process by replacing the fully connected layer at the end of the model with a decision tree (ProtoTree). ProtoPShare [34] uses an iterative post-processing technique to merge prototypes from a ProtoPNet model, effectively creating shared prototypes between classes and reducing the number of required prototypes. Xue et al. [35] adapted the idea of ProtoPNet to vision transformers as feature extractors. Prototype-based deep learning models have also been used for time series [36], graph data [37], and image segmentation tasks [38]. There is also work combining deep prototype models with an autoencoder architecture [39]. Other variants of the ProtoPNet architecture include [40‐43]. Li et al. [44] provide a comprehensive overview of the design choices and prototype formulations used in the literature, analysing their relation to prototype quality.
Application Domain Part-based prototype models have been used in several industrial application settings, e.g., to inspect power grids visually [45], to detect Alzheimer's disease in MRI scans [46], or to classify chest X-ray images [47]. Carlone et al. [48] propose a prototype-based model to identify breast cancer, highlighting the acceptance of visual explanations.
Limitations Despite the success of ProtoPNet and the improvement through architectural changes, there are also limitations and some critique regarding prototypical networks. Elhadri et al. [49] survey general quality issues regarding the interpretability of learned prototypes in these models. Xu-Darme et al. [50] criticize the way ProtoPNet and ProtoTree visualize the learned prototypes and propose different saliency methods to overcome this problem. Hoffmann et al. [51] point out that ProtoPNets struggle with adversarial examples or compression artifacts in the input. Nevertheless, Hoffmann et al. do not aim to discredit part-prototype networks, but rather to raise awareness of these problems. Saralajew et al. [52] contribute to this discussion by introducing a method for fast and provable adversarial robustness certification for Nearest Prototype Classifiers (NPCs).
Evaluation of Prototypical-Part Models In the literature, there have been several approaches aiming at evaluating different prototypical-part models. These include different evaluation methods as well as specific metrics. The work of Huang et al. [53], e.g., proposes two metrics called consistency score and stability score that focus on the semantic properties of prototypes, alongside a new network architecture to improve these metrics. Recently, there has also been work on frameworks that simplify the utilization of different types of part-based prototype models, cf. ProtoPNeXt [54] and the work on the library CaBRNet [55]. Ma et al. [56] try to improve the visualization of the prototypes using multiple examples from the training set to describe the concept behind a learned prototype. The framework HIVE, created by Kim et al. [47], is designed for the evaluation of explanation techniques. They highlight key issues that are crucial when humans must interpret these explanations. Nauta et al. [57] examine why the model considers a prototype and a patch of the input image as similar. A comprehensive survey focusing on the interpretability assessment of part-based prototype models is also given by Nauta et al. [19].
3 Datasets
For evaluating the respective networks in diverse domains, we selected several datasets from the areas of (multi-label) classification, in particular fine-grained prototypical classification, and general object detection in a Non-IID setting.
Caltech-UCSD Birds (CUB200) Dataset The first fine-grained dataset is Caltech-UCSD Birds-200-2011 (CUB200) [22], which features a challenging collection of bird species. This dataset is frequently used as a benchmark for part-based prototype methods. The rich annotations allow for a detailed evaluation of the importance of individual object parts in the classification process, enabling a comprehensive assessment of the prototypes. The dataset contains 11,788 images of 200 bird species with segmentation masks, bounding boxes, part locations, and attribute labels.
Stanford Cars Dataset (Cars 196) For our second fine-grained dataset, we selected the Stanford Cars dataset (Cars196), introduced by Krause et al. [23]. Unlike the natural objects in the CUB200 dataset, this dataset focuses on fine-grained classification of manufactured objects, and is similarly used to benchmark part-based prototype methods. It consists of 16,185 images of 196 car models, covering a wide range of vehicle types including sedans, SUVs, coupes, convertibles, pickups, and hatchbacks.
NICO Dataset For general object detection in a Non-IID setting, we used the NICO Dataset from He et al. [24]. This dataset includes 10 animal classes and 9 vehicle classes, though our focus is on the animal classes. Each class is further divided into 10 context categories representing various environments, attributes, and activities. The dataset contains 12,980 images of animals. Essentially, this Non-IID setting is a more realistic scenario that is also new for part-based prototype models. We realize it mainly through different background environments in the training and test sets, to assess whether the networks naturally promote object-focused prototypes. Since this cannot be achieved using the provided context categories (as only few represent distinct environments), we use a simple image analysis inspired by Cheng et al. [58], based on the HSV colour space [59], to split the dataset into training/validation and test sets. This approach effectively creates a Non-IID scenario with varying distributions across training and test data. Further details are given in the implementation.
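The exact splitting procedure is given in our implementation; the following simplified sketch only illustrates the general idea of an HSV-based Non-IID split. The border heuristic and all function names are our own simplifications for illustration, not the procedure used in the paper.

```python
import colorsys

import numpy as np

def background_hsv_stats(image):
    """Mean HSV of a 10-pixel image border, used as a crude proxy
    for the background appearance. `image` is (H, W, 3) in [0, 1]."""
    border = np.concatenate([
        image[:10].reshape(-1, 3), image[-10:].reshape(-1, 3),
        image[:, :10].reshape(-1, 3), image[:, -10:].reshape(-1, 3),
    ])
    hsv = np.array([colorsys.rgb_to_hsv(*px) for px in border])
    return hsv.mean(axis=0)

def split_by_background(images, test_fraction=0.3):
    """Sort images by mean background hue and place the upper tail in
    the test set, so training and test backgrounds differ (Non-IID)."""
    hues = [background_hsv_stats(img)[0] for img in images]
    order = np.argsort(hues)
    n_test = max(1, int(len(images) * test_fraction))
    return order[:-n_test], order[-n_test:]  # train/val indices, test indices
```

Splitting along a background statistic, rather than at random, is what yields the distribution shift between training and test data.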
Animals with Attributes (AWA2) Dataset The last dataset is Animals with Attributes 2 (AWA2), introduced by Xian et al. [25]. This dataset is a benchmark for attribute-based classification, a model domain related to part-based prototype models. It comprises 37,322 images across 50 animal classes, with binary and continuous class attributes. To evaluate the performance of part-based prototype methods in a multi-label classification setting, we focus on the given attributes. The AWA2 dataset includes 85 attributes used to describe a class. Since not all attributes describe visible features, we only use a subset, i.e., the 49 visible attributes.
4 Prototype Networks
Fig. 1
A visual representation of a prototype-based neural network. Starting from a deep neural network (A), such as ResNet, extracting a feature map (B), the model identifies the presence of prototypes (C) by comparing them with feature vectors, generating similarity maps (D). Classification (F) is performed according to the highest similarity scores (E) derived from these maps. Prototypes are visualized (G) using the Prototypical Relevance Propagation (PRP) method. The note cards list the metrics that focus on the specific component. Some metrics are used in multiple evaluation techniques, which is later indicated by a subscript; see Table 1
Because this work focuses on a comprehensive set of metrics, we kept our set of models small and selected three prominent networks based on the following criteria: class-specific vs. shared prototypes and explicit vs. indirect prototype representations. We chose ProtoPNet [16], the first popular prototype model, which uses class-specific explicit prototype vectors. We further include the ProtoPool [20] model as a representative of shared and explicit prototype learning. Finally, we chose PIPNet [21] as a recently introduced architecture for the shared and indirect case. To the best of our knowledge, there is currently no model for the class-specific indirect case. Other models with different classification approaches, such as hierarchical classification, or other backbones, such as transformers, are important directions that should be investigated in connection with the interpretability of prototypes. However, due to the lack of implementations of visual methods and the complex adjustments required for the multi-label scenario, we decided to start with models that can easily be adapted to our training and evaluation setups.
Each network uses a similar architecture, visualized in Fig. 1; the components are labelled with the letters (A)–(G) in the figure. Starting with a deep neural network (A) as backbone, this network extracts a feature map (B) from the input image. The defining characteristic of a part-based prototype model is the calculation of the presence of prototypes in the feature map. In the case of a model with explicit prototypes, each prototype (C) is compared with the feature vectors in (B), resulting in a set of similarity maps (D). In the case of indirect prototypes, the feature map is further processed and directly interpreted as a collection of similarity maps (D), without an explicit comparison. The final classification (F) is then obtained based on the highest similarity score (E) in each map. Another important characteristic of these models is the visualization of the prototypes (G). In our case, this is done with the Prototypical Relevance Propagation (PRP) method, which propagates the similarity score back to the input.
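As an illustration of steps (B)–(F), the following minimal PyTorch sketch implements an explicit-prototype head: squared L2 distances between feature vectors and prototype vectors are turned into similarity maps (here using the log-ratio activation of ProtoPNet), max-pooled into per-prototype scores, and passed to a linear classification layer. The class and parameter names are illustrative and not taken from any of the discussed implementations.

```python
import torch
import torch.nn as nn

class PartPrototypeHead(nn.Module):
    """Sketch of steps (B)-(F) for a model with explicit prototypes."""

    def __init__(self, dim, n_prototypes, n_classes):
        super().__init__()
        # (C) explicit prototype vectors, one row per prototype
        self.prototypes = nn.Parameter(torch.randn(n_prototypes, dim))
        # (F) fully connected classification layer on prototype scores
        self.classifier = nn.Linear(n_prototypes, n_classes, bias=False)

    def forward(self, z):                  # z: (B, D, H, W) feature map (B)
        B, D, H, W = z.shape
        zf = z.flatten(2).transpose(1, 2)  # (B, H*W, D) feature vectors
        d2 = ((zf.unsqueeze(2) - self.prototypes) ** 2).sum(-1)  # squared L2
        sim = torch.log((d2 + 1) / (d2 + 1e-4))  # ProtoPNet-style activation
        maps = sim.transpose(1, 2).reshape(B, -1, H, W)  # (D) similarity maps
        scores = maps.flatten(2).max(dim=2).values       # (E) max pooling
        return self.classifier(scores), maps             # (F) logits
```

For indirect prototypes (PIPNet), the distance computation is replaced by reinterpreting the feature map channels themselves as similarity maps.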
Below, we summarize each applied prototype network architecture, highlighting the unique design choices and training procedures.
ProtoPNet The first network is ProtoPNet. It follows the general architecture shown in Fig. 1, consisting of a convolutional neural network as backbone followed by two additional \(1 \times 1\) convolutional layers to adjust the output dimension D. This results in a feature map \({\textbf{z}} \in {\mathcal {R}}^{H \times W \times D}\) with individual vectors \(\mathbf {\tilde{z}} \in {\mathcal {R}}^{1 \times 1 \times D}\). The feature map is then processed by a prototype layer, which holds a set of n prototypes \({\textbf{P}}=\left\{ {\textbf{p}}_m\right\} _{m=1}^n\) with \({\textbf{p}}_m \in {\mathcal {R}}^{1 \times 1 \times D}\), such that each prototype has the same dimensions as one feature vector. A pre-determined number of prototypes \(m_k\) is allocated to each class \(k \in \{1, \ldots , K\}\). The prototype layer computes a set of similarity maps \({\textbf{A}}=\left\{ {\textbf{a}}_m\right\} _{m=1}^n\) based on the squared L2 distance between a prototype and a feature vector.
The similarity score of a prototype is then calculated using a max pooling operation \({\textbf{s}}_m=\max \left( {\textbf{a}}_m\right) \) and lastly used in the fully connected layer to make the classification.
ProtoPNet uses a multistep training process, separating the training of the last layer from the rest of the network. At the beginning, the last layer is initialized using 1 for weights connecting a prototype to its corresponding class and \(-0.5\) otherwise. This ensures the learning of class-specific prototypes. The first training step optimizes the backbone and prototype layer to learn a meaningful latent space, clustering prototypes from the same class and separating them from prototypes of other classes. This is done using a linear combination of a classification loss and two novel losses, the cluster loss and the separation loss: \(\lambda _C {\mathcal {L}}_C+\lambda _{cl} {\mathcal {L}}_{cl}+\lambda _{sep} {\mathcal {L}}_{sep}\). In the second step, each prototype is first projected onto the nearest feature vector from its assigned class. The last layer is then fine-tuned to improve classification performance, using only the classification loss and an L1 regularization term to keep the positive reasoning characteristic. Chen et al. [16] observed that some prototypes focus on the background. These prototypes contradict the goal of the model to learn class-specific object parts. To address this, the authors propose a pruning step after training, which removes such prototypes. We use both the original and the pruned version in the evaluation to analyse the effectiveness of this approach.
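The first-stage objective can be sketched as follows. This helper is a hypothetical simplification: `min_dists` is assumed to hold each prototype's minimum distance to the sample's feature patches, and the default coefficients follow the values reported by Chen et al. [16].

```python
import torch
import torch.nn.functional as F

def protopnet_stage1_loss(logits, targets, min_dists, proto_class,
                          l_cl=0.8, l_sep=0.08):
    """Sketch of the first-stage objective: classification loss plus
    cluster loss (pull the closest same-class prototype in) and
    separation loss (push other-class prototypes away).
    min_dists: (B, P) minimum distance of each prototype to the
    sample's feature patches; proto_class: (P,) class per prototype."""
    ce = F.cross_entropy(logits, targets)
    cluster = torch.stack([min_dists[b, proto_class == t].min()
                           for b, t in enumerate(targets)]).mean()
    separation = torch.stack([min_dists[b, proto_class != t].min()
                              for b, t in enumerate(targets)]).mean()
    return ce + l_cl * cluster - l_sep * separation
```

Minimizing this term shrinks the distance to a same-class prototype while growing the distance to other-class prototypes, which is exactly the clustering/separation behaviour described above.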
Our experiments include a multi-label dataset. To adapt the training objectives for this setting, we switch from a cross-entropy loss to a multi-label margin loss. Other common loss functions for multi-label classification include the binary cross-entropy loss and the multi-label soft margin loss. However, these losses tend to learn negative weights in the classification layer for labels not present in the image, which conflicts with the L1 regularization used to promote positive reasoning in the classification layer. The multi-label adaptation of the cluster and separation loss is done by calculating the mean cluster and separation loss across all classes \(C_{y_i} = \{\,k \mid y_i^k = 1\,\}\), i.e., the set of classes assigned to sample \(i\).
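The multi-label averaging of the cluster and separation terms can be sketched as below; the function and argument names are illustrative, not those of our implementation.

```python
import torch

def multilabel_cluster_sep(min_dists, proto_class, y):
    """Mean cluster/separation terms over all classes assigned to each
    sample, i.e., over C_{y_i} = {k | y_i^k = 1}.
    min_dists: (B, P) minimum prototype-to-patch distances,
    proto_class: (P,) class of each prototype, y: (B, K) binary labels."""
    cluster, sep = [], []
    for b in range(y.shape[0]):
        for k in torch.nonzero(y[b]).flatten():
            cluster.append(min_dists[b, proto_class == k].min())
            sep.append(-min_dists[b, proto_class != k].min())
    return torch.stack(cluster).mean(), torch.stack(sep).mean()
```

For a single-label sample this reduces exactly to the original per-class cluster and separation terms, which is why we consider it a fitting adaptation.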
We argue that this is a fitting adaptation that does not contradict the goal described in the original study.
ProtoPool The ProtoPool [20] model by Rymarczyk et al. addresses several limitations of ProtoPNet. It introduces a new similarity function and an automatic, fully differentiable assignment of prototypes to classes, which enables the network to learn shared prototypes. The model itself follows the architecture of ProtoPNet, consisting of a CNN backbone with additional convolutional layers to adjust the output shape, a prototype layer, and a fully connected layer. The prototype layer contains a set of n trainable prototypes \({\textbf{P}}=\left\{ {\textbf{p}}\right\} ^n\) with \({\textbf{p}} \in {\mathcal {R}}^{1 \times 1 \times D}\), similar to ProtoPNet. A novel part is the assignment matrix with L prototype slots for each class, where each slot is implemented as a distribution \(q_l \in {\mathbb {R}}^N\) over all available prototypes. The similarity maps \({\textbf{A}}=\left\{ {\textbf{a}}\right\} ^n\) are computed using the same equation as in ProtoPNet (Eq. 1). To enhance prototype locality, the authors introduce a focal similarity function, which computes the difference between the maximum similarity and the average similarity.
After calculating all similarity scores, the assignment matrix is used to compute an aggregated similarity value \(g_l\) for each slot of a class as \(g_l=\sum _{i=1}^N q_l^i g_i\), which is then used in the fully connected layer to make the final prediction.
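Both operations are small tensor computations; a minimal sketch (function names ours, shapes assumed as in the text):

```python
import torch

def focal_similarity(sim_maps):
    """Focal similarity: maximum minus average activation per map,
    rewarding prototypes that respond to one salient location.
    sim_maps: (P, H, W) similarity maps of one sample."""
    flat = sim_maps.flatten(1)
    return flat.max(dim=1).values - flat.mean(dim=1)

def slot_scores(g, q):
    """Aggregate prototype scores g (N,) into per-slot scores via the
    slot distributions q (L, N): g_l = sum_i q_l^i g_i."""
    return q @ g
```

A prototype that is uniformly active everywhere receives a focal similarity near zero, which is how the function discourages unfocused prototypes.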
ProtoPool follows the same multistep training process as ProtoPNet. In the first step, the last layer is initialized with 1 for a slot assigned to a class and 0 otherwise, and then fixed during the training of the backbone and prototype layer. The optimization uses classification loss, cluster loss, and separation loss as in ProtoPNet. To learn the distributions \(q_l\), the authors apply a Gumbel-Softmax layer, which adds noise to the distribution at the start of the training process. This noise is reduced over the training period, so that the distribution can converge to a one-hot encoded vector. To prevent a prototype from being assigned to multiple slots of the same class, an additional orthogonal loss is introduced. Thus, the overall loss term becomes \(\lambda _C {\mathcal {L}}_C+\lambda _{cl} {\mathcal {L}}_{cl}+\lambda _{sep} {\mathcal {L}}_{sep}+ \lambda _{o} {\mathcal {L}}_{o}\).
After this training step the prototypes are projected to the nearest feature vectors similar to ProtoPNet and the last layer is fine-tuned to optimize the classification performance.
The adaptation of the training objectives for multi-label classification follows the same approach as for the ProtoPNet model: switching from the cross-entropy loss to a multi-label margin loss and using the mean cluster and separation loss across all labels, as illustrated in Eqs. 2 and 3. No adaptation is necessary for the orthogonal loss \({\mathcal {L}}_{o}\), as it computes cosine similarities between the slots without utilizing class information.
PIPNet Nauta et al. introduced PIPNet [21], which incorporates two novel loss functions based on CARL (Consistent Assignment for Representation Learning) [60] to learn prototypes that more accurately align with real object parts. Unlike ProtoPNet or ProtoPool, PIPNet does not use explicit prototype vectors. This removes the need to project prototypes onto nearby latent patches. The architecture includes a CNN backbone (without additional convolutional layers), followed by a softmax layer and a fully connected layer. The feature map \({\textbf{z}} \in {\mathcal {R}}^{H \times W \times D}\) from the backbone is treated as D two-dimensional \((H \times W)\) prototype similarity maps. The softmax operation across the D dimension encourages each patch of the feature map \(\varvec{\tilde{z}}_{h, w}\) to belong to exactly one prototype similarity map. The similarity scores are again computed via a max-pooling operation and used in the fully connected layer for classification.
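A minimal sketch of this indirect prototype computation (function name ours):

```python
import torch
import torch.nn.functional as F

def pipnet_scores(z):
    """Indirect prototypes: a softmax over the channel dimension turns
    the feature map itself into D similarity maps, and max pooling
    yields one presence score per prototype. z: (B, D, H, W)."""
    maps = F.softmax(z, dim=1)  # each patch assigned to ~one prototype
    scores = maps.flatten(2).max(dim=2).values  # (B, D)
    return scores, maps
```

Because of the softmax, the channel values at every spatial location sum to one, so each patch effectively votes for a single prototype.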
Unlike ProtoPNet and ProtoPool, PIPNet trains the entire model without fixing the last layer. To ensure positive reasoning, the fully connected layer is constrained to have only positive weights. Another key distinction is the Siamese data augmentation pipeline, in which two views of the same image patch are created. These views are used in the first novel loss function, the alignment loss, which optimizes the two views to belong to the same prototype. This promotes the extraction of more general features. The second loss function, the tanh-loss, ensures that each prototype is represented at least once in a training batch to prevent trivial solutions. The last loss function is the standard classification loss. The overall loss is defined as a linear combination of these three terms \(\lambda _C{\mathcal {L}}_C + \lambda _A{\mathcal {L}}_A + \lambda _T{\mathcal {L}}_T\).
PIPNet follows a multistep training process. In the first step, only the backbone is optimized using the alignment and tanh loss to learn prototypes representing general, class-independent features. In the second step, the entire network is optimized using all loss terms, including classification loss. To promote sparsity during training, ensuring that each class uses a minimal number of prototypes for classification, the output scores o are further modified.
Here, \(s_m\) are the prototype similarity scores and \(\varvec{\omega }_c\) the weights of the fully connected layer. The natural logarithm is employed to promote sparsity, as decreasing a weight yields a higher loss gain than increasing one once the weights become large.
Since PIPNet’s loss functions are independent of class assignments, they can be used directly for multi-label classification. The only modification required is that the classification loss \({\mathcal {L}}_C\) be calculated using the multi-label margin loss rather than the cross-entropy loss.
4.1 Prototype Visualization
The post-hoc visualization method is not bound to the individual models, allowing us to select the most suitable method for the networks. The original implementation of ProtoPNet employs a simple model-agnostic upscaling method, which was also used by ProtoPool and PIPNet. This method upscales the similarity map of a prototype to the size of the input using cubic interpolation to visualize the similarity pattern of the prototype within the image. A key disadvantage is that it relies solely on the similarity map and assumes that spatial information is preserved in the feature extraction process. Other, model-specific methods can better visualize the feature extraction process of the model. Based on these arguments and the findings of Xu-Darme et al. [61], who compared various visualization methods, we selected the Prototypical Relevance Propagation (PRP) method from Gautam et al. [62] for visualizing the prototypes (see Fig. 1).
5 Evaluation Metrics
In this section, we discuss the techniques and metrics used to evaluate the part-based prototype models. Following the Co-12 properties from Nauta et al. [19] as a guideline, we choose metrics to evaluate the output-completeness, continuity, contrastivity, covariate complexity, and compactness.
Table 1
Overview of metrics by evaluation category with references to original studies. Asterisks (*) indicate metrics inspired by similar measures in the cited work. Up/down arrows denote the direction of improvement; some directions differ across categories depending on the desired outcome
Output Completeness: Evaluates whether the visualization covers all pixels relevant to a prototype by perturbing unhighlighted pixels and measuring the induced changes in the model and explanations.
\(PLC_{out} \downarrow \)
Measurement of maximum similarity location shift after perturbation. Large shifts indicate poor alignment between the visualization location and the nearest feature location in the feature map.
Quantifies the change in the maximum similarity value. A high change indicates that the relevant image region of the nearest feature is not fully captured by the visualization
\(PALC_{out} \%\downarrow \)
Measures the spatial similarity distribution shift across the feature map. Large deviations indicate poor alignment between visualization and spatial feature extraction.
\(PAC_{out} \%\downarrow \)
Quantifies changes in overall similarity magnitude irrespective of location. High values suggest that the visualization did not cover the image/object characteristics similar to the prototype.
Measures displacement of the visualization’s highlighted region between original and perturbed images. Large displacements imply a strong impact of the induced spatial feature extraction changes on the visualization method
VAC\(\%\downarrow \)
Quantifies changes in explanation activation between original and perturbed images. High values reveal a strong impact of the induced information extraction change on the visualization.
Continuity: Assesses prototype stability under image-wide, low-level perturbations, indicating whether the model has learned robust, high-level semantic features.
\(PLC_{conti} \downarrow \)
Measures the shift in the maximum-similarity location after global low-level perturbations. Large shifts imply that the feature map cannot be interpreted as spatially aligned, high-level object-part representations.
Quantifies changes in the maximum similarity value under global perturbations. High values indicate the model learned unstable prototypes representing low level image characteristics.
\(PALC_{conti} \%\downarrow \)
Measures changes in the overall spatial similarity pattern after low-level perturbations. Large changes indicate spatial instability in the feature extraction process.
\(PAC_{conti} \%\downarrow \)
Quantifies changes in overall similarity magnitudes after low-level perturbations. High values indicate large movement in the embedding space, showing that features represent low-level image characteristics.
Reports rank changes of the prototype score relative to other prototypes under perturbations. Large rank shifts show that prototypes are influenced differently, suggesting different degrees of semantic representation.
CAC% \(\downarrow \)
Measures changes in output logits induced by low-level perturbations. Large variations indicate that nuisance characteristics substantially affect the classification process.
CRC \(\downarrow \)
Reports rank changes of the predicted class under perturbations. Large rank drops indicate limited class-level robustness and reliance on such non-semantic information.
Contrastivity: Assesses whether different prototypes capture distinct object properties, promoting robust classification and informative explanations.
\(PLC_{contra}\uparrow \)
Measures the location distance between prototype similarity maxima in a sample image. Larger distances indicate that prototypes focus on distinct object parts.
\(PALC_{contra}\%\uparrow \)
Measures spatial separation of prototype similarity patterns in a sample image. Higher values denote disentangled, distinct feature vectors, resulting in focused prototype activations.
\(APD_{intra} \uparrow \)
Computes average pairwise distance between prototypes within a class. Larger distances indicate prototypes represent distinct object parts.
\(AFD_{intra} \uparrow \)
Computes the average distance between feature vectors closest to a prototype. Large deviations from the average prototype distance indicate a distorted class-cluster distribution.
Measures the average distance between feature vectors closest to a prototype across classes. Small values imply that little class-distinct information is preserved in the feature extraction process.
Entropy \(\downarrow \)
Computes the entropy of the similarity scores from a prototype across the test set. Lower entropy indicates better embedding space separation with only intended class features near the prototype.
Covariate Complexity: Assesses the semantic focus of prototypes by quantifying the overlap between prototype visualizations and annotated object regions.
Measures the difference between average activations inside versus outside the object region. Higher scores indicate object-centric explanations rather than background reliance.
Measures agreement between a prototype’s visual focus and annotated object parts across images. Higher consistency indicates stable and interpretable connection to semantic object parts.
Compactness: Assesses model size and interpretability of the classification process.
Counts the number of active prototypes used by the model. Smaller values increase interpretability of the classification process by reducing its overall size.
Measures sparsity of the classification layer. Higher sparsity indicates reliance on fewer prototypes per class, increasing interpretability.
NPR \(\downarrow \)
Computes the ratio of negative to positive classification weights. Lower ratios improve interpretability by basing decisions on the presence of part-based evidence rather than its absence.
Counts prototypes near the extracted features. Smaller values increase interpretability, meaning only a small number of prototypes are used for classification.
We use a total of 22 metrics, including 13 metrics that are novel to this domain. Figure 1 provides an overview of which metrics address which model components (A-G). Additional categories indicate whether a metric measures location, activation, or semantic properties. Some metrics are used in multiple evaluation techniques, indicated by a subscript. An overview of which metric is used in which technique, with a brief description of how the metric should be interpreted in each technique, can be found in Table 1.
We only use the top-5 most activated prototypes of a sample image to evaluate prototype quality-related metrics. This approach is aligned with the actual explanation scheme, where only a few prototypes are visualized to explain a decision to the user. To evaluate the general performance of the networks, we use accuracy, top-3 accuracy, and F1 score.
5.1 Output-Completeness Evaluation
The output-completeness evaluation focuses on the visualization method’s ability to highlight all relevant parts in the input image. To evaluate the output completeness of the PRP method, we consider the study by Sacha et al. [63], which evaluates the spatial misalignment of visualization methods. The introduced metrics measure the location change in the visualization VLC and the prototype activation change PSC when a perturbation is applied to the input image. We extend this list by including the activation change of the visualization VAC, the location change of the maximum activation in the feature map PLC, as well as the general location and activation change of the similarity pattern, PALC and PAC, respectively, thereby achieving comprehensive coverage of the model components, see Fig. 1. Some metrics from Sacha et al. [63] have been renamed for consistency in terminology. Furthermore, some metrics are also used in the continuity and contrastivity evaluations with different input variables; the corresponding formulas are therefore presented only once.
Consider an image x and the perturbed counterpart \(\overline{x}_i\), created based on the visualization of the prototype \(p_i \in P\). Let \(v_{p_i}(x)\) be the saliency map of prototype \(p_i\). Each saliency map is pre-processed and only contains pixels with a relevance value above the 95th percentile. Let \(b_{p_i}(x)\) be the bounding box around this activated region of the saliency map. For the output-completeness evaluation, we perturb an image x to \(\overline{x}_i\) by adding a Gaussian noise mask with a standard deviation of 0.05 to the pixels around the bounding box; see Fig. 5 for a process visualization. Further, let \(a_{p_i}(x)\) be the similarity map and \(s_{p_i}(x)\) the similarity score of the prototype.
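The perturbation step described above can be sketched in numpy as follows. This is a minimal illustration under our own assumptions (function name, grayscale single-channel input, and clipping to [0, 1] are ours); the paper's actual implementation may differ in detail.

```python
import numpy as np

def perturb_outside_bbox(image, saliency, sigma=0.05, percentile=95, seed=0):
    """Add Gaussian noise (std 0.05) to all pixels outside the bounding box
    around the top-5% most relevant region of the saliency map."""
    rng = np.random.default_rng(seed)
    # keep only saliency values above the 95th percentile
    mask = saliency >= np.percentile(saliency, percentile)
    ys, xs = np.nonzero(mask)
    y0, y1, x0, x1 = ys.min(), ys.max(), xs.min(), xs.max()
    # the bounding-box region stays untouched; noise is added everywhere else
    keep = np.zeros_like(mask)
    keep[y0:y1 + 1, x0:x1 + 1] = True
    noise = rng.normal(0.0, sigma, size=image.shape)
    perturbed = np.where(keep, image, np.clip(image + noise, 0.0, 1.0))
    return perturbed, (y0, y1, x0, x1)
```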
If the visualization method correctly highlights the relevant parts of the prototype, then we expect minimal changes in the model components (see Fig. 1), resulting in low scores for these metrics.
The visualization location change VLC, previously introduced as the prototypical part location change by Sacha et al. [63], assesses the change in the location of the bounding box following a perturbation by calculating the intersection over union.
To evaluate the similarity change, we use the prototype similarity change PSC, formerly introduced as prototype activation change by Sacha et al. [63]. This metric computes the difference between two similarity values and is used to measure the relative change in the similarity score after the perturbation is applied.
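A compact sketch of both metrics, under one plausible reading of the text (VLC as one minus the intersection over union of the two bounding boxes, PSC as the relative change of the similarity score); the function names and box convention are ours:

```python
import numpy as np

def vlc(box_a, box_b):
    """Visualization location change as 1 - IoU of two boxes given as
    (y0, y1, x0, x1) with inclusive pixel coordinates."""
    ay0, ay1, ax0, ax1 = box_a
    by0, by1, bx0, bx1 = box_b
    iy = max(0, min(ay1, by1) - max(ay0, by0) + 1)
    ix = max(0, min(ax1, bx1) - max(ax0, bx0) + 1)
    inter = iy * ix
    area_a = (ay1 - ay0 + 1) * (ax1 - ax0 + 1)
    area_b = (by1 - by0 + 1) * (bx1 - bx0 + 1)
    return 1.0 - inter / (area_a + area_b - inter)

def psc(s_orig, s_pert):
    """Prototype similarity change: relative change of the similarity score."""
    return abs(s_orig - s_pert) / abs(s_orig)
```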
To complete the evaluation of changes in the saliency map, we introduce the visualization activation change VAC, which measures the change in relevance scores between two saliency maps. To be invariant to location changes in the visualization, we flatten and sort the saliency maps to get a relevance curve, represented as \(\tilde{v}_{p_i}(x)\). The difference in activation is measured by calculating the intersection over union of the two activation curves in the following way. Let \(h \times w\) represent the height and width of the saliency map.
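The sorted-curve comparison can be sketched as below; reading "intersection over union of the two activation curves" as the elementwise minimum over the elementwise maximum is our interpretation, and the function name is ours:

```python
import numpy as np

def vac(sal_a, sal_b):
    """Visualization activation change: 1 - IoU of the sorted relevance
    curves. Sorting makes the score invariant to spatial relocation."""
    ca = np.sort(sal_a.ravel())[::-1]   # relevance curve of original image
    cb = np.sort(sal_b.ravel())[::-1]   # relevance curve of perturbed image
    return 1.0 - np.minimum(ca, cb).sum() / np.maximum(ca, cb).sum()
```

Note how a pure location change leaves the score at zero, while a magnitude change registers.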
We also assess location and activation changes in the feature space using the similarity map. In order to measure the location change, we introduce two metrics: First, the prototype location change PLC, measuring the distance between two maximum activation locations using the Manhattan distance. The output-completeness evaluation uses the similarity maps of a prototype from the original and perturbed image version.
The second metric is the prototype activation location change PALC. This metric measures the location difference of the whole activation pattern between two similarity maps, analogous to the VLC metric. We apply a min-max normalization and use a threshold of 0.5 to obtain a binary activation pattern, expressed by the \(\operatorname {bin}()\) operation. Again, the output-completeness evaluation uses the original and perturbed images to generate the two similarity maps (Fig. 2).
In order to evaluate the activation change in similarity maps, we introduce the prototype activation change PAC, which operates similarly to the VAC metric and measures the activation difference between two similarity maps. Let \(m \times n\) represent the height and width of the feature map.
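The three similarity-map metrics can be sketched together; the helper names are ours, and reading PAC analogously to VAC (IoU of sorted activation curves) is our interpretation of "operates similar to the VAC metric":

```python
import numpy as np

def plc(sim_a, sim_b):
    """Prototype location change: Manhattan distance between argmax cells."""
    ya, xa = np.unravel_index(np.argmax(sim_a), sim_a.shape)
    yb, xb = np.unravel_index(np.argmax(sim_b), sim_b.shape)
    return abs(int(ya) - int(yb)) + abs(int(xa) - int(xb))

def binarize(sim):
    """Min-max normalize a similarity map, then threshold at 0.5 (bin())."""
    norm = (sim - sim.min()) / (sim.max() - sim.min())
    return norm >= 0.5

def palc(sim_a, sim_b):
    """Prototype activation location change: 1 - IoU of binary patterns."""
    ba, bb = binarize(sim_a), binarize(sim_b)
    union = np.logical_or(ba, bb).sum()
    return 1.0 - np.logical_and(ba, bb).sum() / union

def pac(sim_a, sim_b):
    """Prototype activation change: 1 - IoU of sorted activation curves."""
    ca = np.sort(sim_a.ravel())[::-1]
    cb = np.sort(sim_b.ravel())[::-1]
    return 1.0 - np.minimum(ca, cb).sum() / np.maximum(ca, cb).sum()
```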
Using these additional metrics, we evaluate the changes in location and activation at each step before the final classification in the general architecture of prototype models, illustrated in Fig. 1.
5.2 Continuity Evaluation
Fig. 2
The output-completeness of the visualization method is assessed by measuring the change in different parts of the model when the image is perturbed based on a prototype’s visualization. The continuity of a model is assessed by measuring the effect of augmentations to prototypes
The continuity evaluation assesses the stability of prototypes under different augmentations. We focus on the study by Hoffmann et al. [51], which assesses prototype stability under noise and compression artifacts, and the studies by Rymarczyk et al. [34] and Nauta et al. [57], which examine the influence of photometric augmentations. We train our model using geometric and photometric augmentations to promote continuity during training, as recommended by Nauta et al. [19]. In the evaluation, we perturb the input image x to \(\overline{x}\) with photometric and noise augmentations. The individual augmentations are applied all at once with the following settings: brightness \(+12.5\%\), contrast \(+12.5\%\), saturation \(+12.5\%\), hue 0.05, noise \(5\%\), JPEG quality \(90\%\), and blur using a \(3 \times 3\) kernel. A visual presentation of the process is illustrated in Fig. 1.
We utilize the metrics from the output-completeness evaluation to evaluate the changes in location and activation within the major components of the models (Fig. 1). This is done by switching the prototype-specific perturbation \(\overline{x}_i\) with the general perturbation \(\overline{x}\) based on the specified augmentation techniques. The metrics adopted are PLC, PSC, PALC, PRC, and PAC. The VLC and VAC scores are excluded, as they focus on assessing the visualization method instead of a prototype. In addition, we use the prototype rank change PRC from Sacha et al. [63]. This metric measures the prototype’s position change in the similarity score ranking of all prototypes.
We also introduce the following new metrics. The classification activation change CAC measures the change in prediction. Let K be the number of classes and o(x) be the output vector of the model for an image x.
The other metric is the classification rank change CRC that measures the change in the rank of the predicted class. Let \(r_c(x)\) be the rank of the predicted class c for the image x.
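A sketch of both classification metrics under our reading of the descriptions (CAC as the mean absolute change of the K output logits, CRC as the rank shift of the originally predicted class); function names are ours:

```python
import numpy as np

def cac(logits_a, logits_b):
    """Classification activation change: mean absolute logit difference
    over the K classes (our interpretation of the formula)."""
    return np.abs(np.asarray(logits_a) - np.asarray(logits_b)).mean()

def crc(logits_a, logits_b):
    """Classification rank change: rank of the originally predicted class
    in the perturbed output (rank 0 = highest logit)."""
    c = int(np.argmax(logits_a))                    # prediction on original
    order = np.argsort(np.asarray(logits_b))[::-1]  # classes by perturbed logit
    return int(np.where(order == c)[0][0])          # new rank of class c
```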
5.3 Contrastivity Evaluation
The contrastivity evaluation is used to assess the differences between prototypes. To evaluate the contrastivity of prototypes, we refer to the study by Wang et al. [64], which evaluated the contrastivity of prototypes in the feature space. We follow the recommendations of Nauta et al. [19] to also assess location differences.
Wang et al. [64] measured the average cosine distance between prototypes \({APD_{inter}}\) and feature vectors \({AFD_{inter}}\) of different classes. These metrics are designed for networks with class-specific prototypes, so each class has a subset of prototypes \(P_k \subset P\), where P is the set of all prototypes and k is the class index. To ensure compatibility with models that do not assign prototypes to classes, we use the ground truth class of the image x to create the subset \(P_x\), consisting of the top-5 most activated prototypes in the image. The subset \(P_k\) is then the combination of all subsets \(P_x\) in the test set \(X_{test}\). To create the corresponding feature vector sets \(F_k\), we use the nearest feature vectors of the prototypes in the subset \(P_x\).
We use the average inter-class distance metrics from Wang et al. [64] and also employ the average intra-class distance case \({APD_{intra}}\) and \({AFD_{intra}}\), respectively. Following the recommendations from Nauta et al. [19], we use the PLC and PALC metrics to assess the contrastivity of prototype locations in the feature map and introduce a new metric to evaluate the activation discriminativeness of prototypes, achieving a comprehensive model coverage (see Fig. 1). To use the PLC and the PALC described in the output-completeness evaluation in this context, we change the variable \(a_{p_i}(\overline{x}_i)\) to \(a_{p_j \in P_{x}/p_i}(x)\) in order to compare the similarity maps between activated prototypes in image x.
The original study from Wang et al. [64] does not state the specific formula of the metrics, so the following calculations are our interpretation. The average inter-class prototype distance \({APD_{inter}}\) from Wang et al. [64] is the average cosine distance between prototypes of different classes.
We calculate the average inter-class feature distance \({AFD_{inter}}\) from Wang et al. [64] in the same way using the feature vector set F instead of the prototype set P. To evaluate the contrastivity between prototypes from the same class, we introduce the average intra-class prototype distance \({APD_{intra}}\). The metric measures the average cosine distance of a prototype to all other prototypes of the same class.
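Since the original study does not give a formula, the following is a minimal sketch of our interpretation of \({APD_{intra}}\): the mean pairwise cosine distance within one class's prototype set (the function name and matrix layout are ours).

```python
import numpy as np

def apd_intra(prototypes):
    """Average pairwise cosine distance within one class's prototypes.
    Rows of `prototypes` are prototype vectors; self-pairs are excluded."""
    p = np.asarray(prototypes, dtype=float)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    cos = p @ p.T                           # pairwise cosine similarities
    n = len(p)
    off_diag = cos.sum() - np.trace(cos)    # drop the diagonal (self-pairs)
    return 1.0 - off_diag / (n * (n - 1))   # mean cosine distance
```

\({AFD_{intra}}\) follows by passing the feature vector set instead of the prototype set.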
We also extend this metric to the feature vector case via introducing the average intra-class feature distance \({AFD_{intra}}\), by switching the prototype set P with the feature vector set F. In order to measure the class discriminativeness of a prototype, we compute the Shannon entropy of the similarity scores. Let \(S_{p_i}\) be the set of similarity scores of a prototype over all test images in \(X_{test}\). We use a max normalization to normalize the similarity scores and then calculate the histogram \(hist(S_{p_i})\) of the similarity scores with \(U=10\) bins.
A high entropy indicates that the prototype has no discriminative activation pattern.
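The entropy computation follows directly from the description (max normalization, a 10-bin histogram, Shannon entropy); this sketch uses base-2 entropy, which is our assumption, and the function name is ours:

```python
import numpy as np

def prototype_entropy(scores, bins=10):
    """Shannon entropy of a prototype's similarity scores over the test set:
    max-normalize, build a U=10 bin histogram, and compute entropy in bits."""
    s = np.asarray(scores, dtype=float)
    s = s / s.max()                                   # max normalization
    hist, _ = np.histogram(s, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]                                      # convention: 0*log(0) = 0
    return float(-(p * np.log2(p)).sum())
```

A prototype that only fires strongly on its intended class concentrates the mass in few bins and scores near zero; indiscriminate activation spreads the histogram and drives the entropy up.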
5.4 Covariate Complexity
The covariate complexity evaluation assesses the complexity of a prototype with respect to its interpretability. To evaluate the covariate complexity of prototypes, we focus on the study by Wang et al. [64], which evaluated the overlap of prototype visualizations with object masks, and the study by Huang et al. [53], which evaluated how consistently object parts are represented by a prototype.
The following metrics from Wang et al. [64] are again our interpretation, as the study does not provide specific formulas. To determine the activated region of a prototype in the input image, we again use the 95th percentile approximation on the prototype visualization for these metrics.
To assess the overlap between prototypes and objects, Wang et al. [64] introduces the content heatmap metric. We renamed this metric to \({Object\ Overlap}\) to be consistent with our naming scheme. The metric calculates the intersection of the prototype visualization and the object mask. Let m(x) be the object mask of image x.
We also adapted the idea of the outside-inside relevance ratio measure from Wang et al. [64], introducing the inside-outside relevance difference IORD as the signed difference between mean inside and outside activation. This adaptation of the original metric was made to increase the stability of the measurements. Each saliency map is normalized before the calculation to ensure that IORD scores are consistent between different saliency maps.
Here, \({\mathbb {I}}\) is the indicator function that returns 1 if the condition is true and 0 otherwise.
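Both complexity measures can be sketched as follows; since the original study gives no formulas, these are our interpretations (the 95th-percentile region for \({Object\ Overlap}\) and max normalization for IORD match the descriptions above, the function names are ours):

```python
import numpy as np

def object_overlap(saliency, obj_mask, percentile=95):
    """Fraction of the top-5% activated saliency region lying inside the
    object mask (our reading of the renamed content-heatmap metric)."""
    active = saliency >= np.percentile(saliency, percentile)
    return np.logical_and(active, obj_mask).sum() / active.sum()

def iord(saliency, obj_mask):
    """Inside-outside relevance difference: mean normalized activation
    inside the object minus mean activation outside it."""
    s = saliency / saliency.max()          # normalize the saliency map
    inside = s[obj_mask].mean()
    outside = s[~obj_mask].mean()
    return inside - outside
```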
In order to measure the alignment of prototypes with specific object parts, we adapted the consistency score from Huang et al. [53]. The coverage of an object part is calculated by computing a histogram of the object parts covered by a prototype over the test set \(X_{test}\). This is done by computing the bounding box and adding all object parts within the bounding box to the histogram of that prototype. After processing every image in the test set, we normalize the histograms. This results in a vector with object part percentages, denoted as \(l_{p_i}\). The consistency of the prototype is the average coverage measure over the vector \(l_{p_i}\).
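The histogram step above can be sketched as below. Reading "average coverage over the vector" as the mean over the parts the prototype actually covered is our interpretation; the function name and input format are ours:

```python
import numpy as np

def part_coverage(part_hits, n_parts):
    """Build the normalized object-part histogram l_{p_i} for one prototype.
    `part_hits` holds, per test image, the ids of object parts inside the
    prototype's bounding box. Returns the normalized vector and the
    consistency score (mean coverage over observed parts; our reading)."""
    hist = np.zeros(n_parts)
    for parts in part_hits:            # one entry per test image
        for part in set(parts):        # each covered part counts once per image
            hist[part] += 1
    l = hist / len(part_hits)          # fraction of images covering each part
    return l, float(l[l > 0].mean())
```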
In addition, we introduce the \({Background\ Overlap}\) metric as a counterpart to the \({Object\ Overlap}\). This metric assesses the area of the activated region in the saliency map that does not overlap with the object.
5.5 Compactness Evaluation
The compactness evaluation assesses general characteristics of the model. This evaluation will focus on the study by Nauta et al. [21], which evaluated global size, local size, and the sparsity of the classification layer. Additionally, we will include the negative-positive reasoning ratio metric in our compactness evaluation.
The global explanation size from Nauta et al. [21] counts all prototypes with at least one non-zero weight in the classification layer. This definition is not directly applicable to the ProtoPool model. Therefore, we will count non-zero weights in the prototype presence matrix from ProtoPool. The local explanation size, as described by Nauta et al. [21], can be defined as follows. Let \(\hat{s}_{p_i}(x)\) be the normalized similarity score of a prototype \(p_i\) for an image x, with s(x) being the similarity score vector of all prototypes for the image x.
The classification sparsity from Nauta et al. [21] is the ratio of zero weights in the classification layer. Let W represent the weights in the classification layer, and \(\epsilon =0.001\) be the threshold for considering a weight as non-zero to improve the numerical stability.
Here, |W| denotes the total number of weights and the sum counts the number of positive and negative weights that exceed the threshold \(\epsilon \).
In order to evaluate the positive reasoning property of the model, we introduce the negative-positive reasoning ratio NPR, that calculates the ratio between the number of negative and positive weights in the classification layer. We will again use a threshold of \(\epsilon =0.001\) for considering a weight as negative or positive.
The numerator \(\sum _{w \in W} {\mathbb {I}}(w<-\epsilon )\) counts the number of negative weights that are less than \(-\epsilon \), and the denominator \(\sum _{w \in W} {\mathbb {I}}(w>\epsilon )\) counts the number of positive weights that are greater than \(\epsilon \).
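The three weight-based compactness measures can be sketched on a (classes × prototypes) weight matrix; the function names and matrix layout are our assumptions:

```python
import numpy as np

EPS = 1e-3   # epsilon threshold for treating a weight as non-zero

def global_size(W):
    """Number of prototypes (columns) with at least one non-zero weight."""
    return int((np.abs(W) > EPS).any(axis=0).sum())

def sparsity(W):
    """Fraction of classification weights below the epsilon threshold."""
    return 1.0 - (np.abs(W) > EPS).sum() / W.size

def npr(W):
    """Negative-positive reasoning ratio: #(w < -eps) over #(w > eps)."""
    return (W < -EPS).sum() / (W > EPS).sum()
```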
6 Experimental Setup
In this section, we briefly describe the chosen model parameters and architectural settings, as well as the dataset and training setup.
In the original studies, datasets were split into 50% training and 50% test data. We will follow a different method based on the study by Raschka et al. [65], which argues that using the same data subset for model selection and final evaluation can result in overly optimistic outcomes. Following these guidelines, we first divide the datasets into 70% training and 30% test sets. We then apply 4-fold stratified cross-validation to split the training set into training and validation subsets, maintaining a consistent class distribution across all sets. This results in four different sets of training and validation pairs, with a 52.5% training and 17.5% validation split.
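The split protocol can be reproduced with scikit-learn; the toy labels below are purely illustrative stand-ins for a real dataset:

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy data: 40 samples, 4 balanced classes, mirroring the 70/30 split
# plus 4-fold stratified CV described above.
X = list(range(40))
y = [i % 4 for i in range(40)]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
folds = [(tr, va) for tr, va in skf.split(X_train, y_train)]
# Each (tr, va) pair is one 52.5% / 17.5% training-validation split
# of the full dataset, with class proportions preserved.
```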
We selected the ResNet-50 architecture with pretrained ImageNet weights as a backbone for all models. The feature maps’ spatial dimensions are \(7 \times 7\). In the original study of PIPNet, the dimension is modified to \(28 \times 28\); however, as this is not described as a core architectural design choice, we argue that a uniform dimension setting makes the comparison fairer. For the dimension of explicit prototype vectors in the ProtoPNet and ProtoPool models, we chose \(1 \times 1 \times 128\). The last layer of ProtoPNet is initialized with 1 where a prototype belongs to a class and 0 otherwise (instead of \(-0.5\)), because the sparsity objective described by Chen et al. [16] could otherwise not be achieved in our experiments. PIPNet uses 2048 prototypes for all datasets. ProtoPNet uses 2000, 1960, 50 and 490 prototypes for CUB200, Cars196, NICO and AWA2, respectively. ProtoPool uses 205, 201, 50 and 168 prototypes for these datasets, respectively.
We used an online augmentation setting with the Albumentations library [66], which supports augmenting bounding boxes, segmentation masks, and key-points along with images; this is crucial for certain metrics in our evaluation. All networks were trained with geometric and photometric augmentations. The original studies of ProtoPNet and ProtoPool used only geometric augmentations, but due to our continuity evaluation, we chose to add photometric augmentations to promote better results. In the original study, the PIPNet model used photometric augmentation for the contrastive learning approach, which is also included in our experiments. The experiments on CUB200 and Cars196 were done on cropped images using the bounding box information.
Table 2
General performance and Compactness evaluation results. The results are averaged over 4 runs with standard deviation. Training and validation subsets were created using 4-fold stratified cross-validation
CUB200

| Model | Accuracy\(\%\uparrow \) | top-3 Acc.\(\%\uparrow \) | F1 score\(\%\uparrow \) | Global Size \(\downarrow \) | Sparsity\(\%\uparrow \) | NPR \(\downarrow \) | Local Size \(\downarrow \) |
|---|---|---|---|---|---|---|---|
| ProtoPNet | 70.18 ± 1.32 | 85.29 ± 0.73 | 70.39 ± 1.23 | 2000.00 ± 0.00 | 99.30 ± 0.18 | 0.25 ± 0.19 | 1348.02 ± 146.53 |
| ProtoPNet P | 69.12 ± 1.81 | 85.00 ± 0.95 | 69.20 ± 1.60 | 1891.75 ± 16.40 | 99.10 ± 0.38 | 0.19 ± 0.13 | 1261.84 ± 131.89 |
| ProtoPool | 67.33 ± 0.24 | 80.42 ± 0.96 | 67.81 ± 0.44 | 205.00 ± 0.00 | 96.04 ± 0.60 | 0.54 ± 0.06 | 168.33 ± 14.58 |
| PIPNet | 74.00 ± 0.64 | 86.77 ± 0.32 | 73.85 ± 0.56 | 858.00 ± 76.80 | 99.31 ± 0.09 | 0.00 ± 0.00 | 67.69 ± 0.30 |

Cars196

| Model | Accuracy\(\%\uparrow \) | top-3 Acc.\(\%\uparrow \) | F1 score\(\%\uparrow \) | Global Size \(\downarrow \) | Sparsity\(\%\uparrow \) | NPR \(\downarrow \) | Local Size \(\downarrow \) |
|---|---|---|---|---|---|---|---|
| ProtoPNet | 82.60 ± 1.07 | 94.01 ± 0.56 | 82.63 ± 0.99 | 1960.00 ± 0.00 | 99.35 ± 0.07 | 0.10 ± 0.04 | 1197.61 ± 106.00 |
| ProtoPNet P | 81.77 ± 0.55 | 93.15 ± 0.66 | 81.67 ± 0.46 | 1785.50 ± 22.78 | 98.78 ± 0.51 | 0.15 ± 0.05 | 1080.18 ± 84.72 |
| ProtoPool | 83.25 ± 0.57 | 91.62 ± 0.42 | 83.29 ± 0.59 | 201.00 ± 0.00 | 99.28 ± 0.12 | 0.11 ± 0.07 | 170.47 ± 51.32 |
| PIPNet | 85.78 ± 0.33 | 94.68 ± 0.25 | 85.73 ± 0.32 | 514.00 ± 18.46 | 99.51 ± 0.04 | 0.00 ± 0.00 | 68.95 ± 0.28 |

NICO

| Model | Accuracy\(\%\uparrow \) | top-3 Acc.\(\%\uparrow \) | F1 score\(\%\uparrow \) | Global Size \(\downarrow \) | Sparsity\(\%\uparrow \) | NPR \(\downarrow \) | Local Size \(\downarrow \) |
|---|---|---|---|---|---|---|---|
| ProtoPNet | 90.04 ± 1.30 | 96.74 ± 0.61 | 90.01 ± 1.32 | 50.00 ± 0.00 | 60.25 ± 8.46 | 0.46 ± 0.12 | 15.59 ± 3.83 |
| ProtoPNet P | 90.19 ± 1.15 | 96.77 ± 0.60 | 90.15 ± 1.17 | 50.00 ± 0.00 | 55.05 ± 15.82 | 0.51 ± 0.14 | 15.69 ± 3.83 |
| ProtoPool | 89.89 ± 0.92 | 95.94 ± 0.87 | 89.87 ± 0.93 | 50.00 ± 0.00 | 89.55 ± 0.66 | 0.04 ± 0.07 | 3.38 ± 0.83 |
| PIPNet | 91.40 ± 0.28 | 97.49 ± 0.39 | 91.49 ± 0.30 | 214.00 ± 136.81 | 98.90 ± 0.93 | 0.00 ± 0.00 | 70.80 ± 0.65 |

AwA2

| Model | Accuracy\(\%\uparrow \) | top-3 Acc.\(\%\uparrow \) | F1 score\(\%\uparrow \) | Global Size \(\downarrow \) | Sparsity\(\%\uparrow \) | NPR \(\downarrow \) | Local Size \(\downarrow \) |
|---|---|---|---|---|---|---|---|
| ProtoPNet | 43.86 ± 3.01 | - | 91.74 ± 0.60 | 490.00 ± 0.00 | 18.27 ± 0.95 | 0.95 ± 0.02 | 184.20 ± 17.85 |
| ProtoPNet P | 40.61 ± 8.45 | - | 91.75 ± 0.47 | 467.75 ± 7.32 | 18.16 ± 0.67 | 1.05 ± 0.13 | 185.56 ± 25.42 |
| ProtoPool | 46.33 ± 6.45 | - | 91.98 ± 1.29 | 168.00 ± 0.00 | 96.43 ± 1.30 | 0.24 ± 0.29 | 87.76 ± 20.57 |
| PIPNet | 62.41 ± 2.72 | - | 94.47 ± 0.56 | 206.00 ± 10.21 | 98.68 ± 0.09 | 0.00 ± 0.00 | 70.38 ± 0.27 |
Due to the differences between our training setup and the original studies, we adjusted the respective learning rates, training epochs, and schedulers for each model and dataset. The new parameters were chosen based on hyperparameter tuning with 50 trials for each dataset. To reduce complexity and the computational load, we used a fixed number of 100 joint epochs to tune all models. The individual loss weights were kept as suggested in the original papers across all datasets. All networks were re-implemented based on the original code for our experiments and are also part of our open-source library. We point out that our training setup deviates from the optimal learning procedure introduced in the original studies, and we expect a reduction in performance. Therefore, our focus is to evaluate whether the interpretable properties discussed in the original studies remain under suboptimal learning conditions.
7 Results
The general performance, compactness, contrastivity, and continuity evaluations are conducted on all datasets. The output-completeness and complexity evaluations are conducted exclusively on the CUB200 dataset. The output-completeness evaluation assesses the employed visualization method, in our case the computationally intensive PRP method; thus, this method was evaluated only on the CUB200 dataset. The complexity evaluation is limited to the CUB200 dataset due to the absence of object masks and part annotations in the other datasets.
The used networks are ProtoPNet [16], ProtoPNet Pruned [16], ProtoPool [20], and PIPNet [21]. The pruned version of ProtoPNet was created using the pruning strategy stated in the original paper.
7.1 General and Compactness
Table 2 presents the outcomes of our evaluations on general performance and compactness.
The pruned version of ProtoPNet exhibits a performance drop compared to the original model. This indicates that prototypes not clearly associated with a single class, presumably focusing on background regions, still play a significant role in the classification process. In other words, classification in ProtoPNet relies on a complex and sensitive interplay between prototypes, including those that are not class-distinct. Evidence from the compactness analysis supports this interpretation. First, the Local Size shows that a relatively high number of prototypes are active per sample. Second, the NPR score improves after removing background prototypes, suggesting that even though the fine-tuning stage was initialized with zeros (instead of the original –0.5), which should bias towards positive reasoning, the model nonetheless learns intricate negative relations during fine-tuning. Interestingly, sparsity often decreases slightly after pruning, which is counterintuitive. One would expect that the targeted background prototypes would be shared across more classes; thus their removal should increase sparsity. Combined with the minor reduction in global size, these findings suggest that pruning does not simply eliminate “background prototypes” but is strongly influenced by the quality of learned prototypes.
On the CUB200 dataset, ProtoPool struggles to discover meaningful shared prototypes compared to its performance on other datasets, such as Cars196. This is reflected both in performance rankings and compactness measures. Specifically, ProtoPool on CUB200 exhibits a high proportion of negative weights, low sparsity, and a large number of prototypes active per sample, all indicators of a complex classification process. In contrast, when applied to NICO and AWA2, ProtoPool shows improvements in interpretability over ProtoPNet. The Local Size is reduced on NICO, while NPR and Sparsity scores improve on AWA2. These results suggest that shared prototypes may be more advantageous in multi-label settings than in fine-grained single-label classification tasks.
PIPNet achieves the best overall results in both predictive performance and compactness. This illustrates the potential of contrastive learning for prototype-based methods. The outcome is intuitive, as contrastive learning naturally aligns with the goal of learning a high-level semantic prototype representation. Furthermore, the hard constraint imposed on the classification layer does not negatively affect performance, indicating that PIPNet does not require a separate fine-tuning phase. In contrast, other models that rely on fine-tuning often learn additional, more complex, relationships to increase predictive performance in this phase. PIPNet is also the only model capable of reducing Global Size during training without the need for external pruning strategies. This highlights the effectiveness of the softmax-based constraint in limiting the number of active prototypes, even while the tanh loss encourages all prototypes to be active at some point during training. Moreover, the consistently small Local Size relative to Global Size indicates that this design robustly limits prototype usage per sample, thereby enhancing interpretability without sacrificing accuracy.
7.2 Contrastivity
Fig. 3
Contrastivity evaluation results. The results are averaged over 4 runs with standard deviation. Training and validation subsets were created using 4-fold stratified cross-validation
All contrastivity results in Fig. 3 were normalized using either fixed metric boundaries or the maximum values observed across models. The metric axes, where lower values indicate better performance, are inverted to consistently increase the coverage area of the better performing models.
The original and pruned versions of ProtoPNet exhibit highly similar behaviours. Both show weak location contrast in the feature maps and low feature and prototype distance scores, indicating that the learned embedding space is not disentangled. Consequently, the extracted features contain similar information rather than distinct, high-level object-part representations. This lack of disentanglement also reduces class separability in the embedding space, suggesting that the backbone network primarily encodes generic shape and colour information rather than interpretable semantic parts. As a result, the embedding space is comparatively smaller and produces ambiguous prototype activations, as reflected by higher entropy values.
ProtoPool exhibits characteristics similar to ProtoPNet, with an embedding space that appears densely clustered and dominated by generic shape and colour information. This indicates the model’s limited ability to capture separable high-level features. The attempt to learn shared prototypes seems to exacerbate this issue, compressing the embedding space even further compared to ProtoPNet. However, improvements are observed on the NICO dataset, where inter-class prototype and feature distances are slightly higher. This suggests that in domains where low-level features (e.g., basic shapes) provide stronger class distinctions, such as differentiating between animal species, ProtoPool’s shared prototype approach can be more effective.
PIPNet achieves the highest scores in the feature distance measures. Prototype distances are not taken into account due to the lack of explicit prototype vectors. Unlike the other models, PIPNet does not treat the backbone outputs as embedding vectors; instead, it interprets each feature dimension as a prototype assignment via softmax. Under this formulation, the feature distance can be understood as the degree of overlap between prototype activations. Including the PALC, PLC, and AFD scores, PIPNet shows near-perfect contrast in prototype localization, demonstrating its ability to learn a disentangled feature extraction process. This further indicates that PIPNet successfully learns high-level prototypes representing distinct semantic properties. In other words, the extracted features align with the intended goal of representing meaningful and separable object parts. The Entropy results reinforce this conclusion, showing that PIPNet’s embedding space is more evenly distributed than that of other models. This distribution facilitates discrete prototype-to-class activations, even when prototypes are shared across classes.
7.3 Continuity
Fig. 4
Continuity evaluation results. The results are averaged over 4 runs with standard deviation. Training and validation subsets were created using 4-fold stratified cross-validation
All continuity results in Fig. 4 were normalized using either fixed metric boundaries or the maximum values observed across models. Again, we inverted axes of metrics where better performance is indicated by lower scores, so higher performing models cover a larger area in the chart.
The pruned version of ProtoPNet achieves slightly improved PRC scores on the CUB200 and Cars196 datasets compared to the original model. This suggests that pruning successfully removes prototypes more sensitive to low-level perturbations. However, this improvement does not translate into greater robustness at the classification level, as reflected by unchanged CAC and CRC scores. This observation is consistent with the compactness evaluation, which revealed a complex mixture of relations rather than a straightforward classification process. Interestingly, results on the NICO dataset show higher sensitivity to low-level perturbations than on the fine-grained datasets. This indicates that ProtoPNet tends to rely on simpler, low-level features when the overall classification task is less challenging. This underscores the absence of a dedicated mechanism that encourages the learning of high-level features.
The design of ProtoPool also lacks such a mechanism. However, learning shared prototypes appears to promote the extraction of more robust features under the right circumstances. This is reflected in the superior performance on the NICO and AWA2 datasets compared to ProtoPNet. A noteworthy finding is the high PLC score relative to other similarity-map metrics, such as PALC. This suggests that while the location of maximum activation may vary, the broader activation patterns remain relatively stable overall. Furthermore, ProtoPool performs particularly well on the AWA2 dataset, indicating that the multi-label classification loss contributes to learning more robust shared prototypes. Another key observation is that better PRC scores do not necessarily translate into improved CRC scores. This again highlights the complex and sensitive nature of the classification process in the trained ProtoPool models, which agrees with the findings from the compactness evaluation.
PIPNet demonstrates the strongest robustness in both prototype location and activation under low-level image augmentations. This result is expected, as the model employs a contrastive learning strategy that explicitly reduces the influence of such perturbations. Notably, this approach also appears to facilitate the extraction of high-level object-part features, as reflected in the contrastive evaluation. Nevertheless, some fluctuations are observed in classification-focused metrics, particularly in the PRC. The larger variability in PRC compared to the PSC suggests that activations among the top-5 prototypes are highly similar. This is likely a consequence of the softmax layer constraining activations between 0 and 1. The sometimes larger CAC further reveals that the classification layer amplifies these minor differences. Despite some activation fluctuations, the predicted class remains relatively stable, as indicated by the smaller fluctuations in the CRC compared to the CAC.
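The continuity metrics discussed in this section compare model outputs before and after low-level augmentations. As a simplified stand-in for the prototype- and classification-level scores (the exact PRC/PSC/CRC definitions are more involved), the sketch below measures how many of the top-k prototypes survive an augmentation and how often the predicted class stays the same.

```python
import numpy as np

def topk_overlap(act_clean, act_aug, k=5):
    # Fraction of the top-k prototypes (by activation) that remain in
    # the top-k after a low-level augmentation -- a simplified proxy
    # for prototype-ranking robustness.
    top_c = set(np.argsort(act_clean)[-k:])
    top_a = set(np.argsort(act_aug)[-k:])
    return len(top_c & top_a) / k

def prediction_stability(logits_clean, logits_aug):
    # Fraction of samples whose predicted class is unchanged under
    # augmentation -- a simplified proxy for classification robustness.
    same = np.argmax(logits_clean, axis=1) == np.argmax(logits_aug, axis=1)
    return float(same.mean())
```

Note that high top-k overlap does not guarantee prediction stability: small activation shifts among similarly ranked prototypes can still flip the argmax, which mirrors the observation that better PRC scores need not yield better CRC scores.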
7.4 Covariate Complexity
Fig. 5
Output-Completeness (left) and Complexity (right) evaluation results on the CUB-200-2011 dataset. The results are averaged over 4 runs with standard deviation. Training and validation subsets were created using 4-fold stratified cross-validation
All covariate complexity results in Fig. 5 were normalized using either fixed metric boundaries or the maximum values observed across models. The metric axes, where lower values indicate higher performance, are inverted to consistently increase the coverage area of better models.
ProtoPNet and its pruned variant produce highly similar results, suggesting that background prototypes were not effectively removed. This finding is consistent with the compactness evaluation, which revealed a complex classification process and indicated that prototypes failed to capture well-defined concepts. Nevertheless, the lower Background Overlap compared to ProtoPool and PIPNet suggests that the class-specific prototypes of the ProtoPNet model focus more on object regions. This is intuitive, as backgrounds often serve as shared concepts across classes, and are therefore well suited to be used as shared prototypes.
The ProtoPool model exhibits the highest Background Overlap, indicating that it is particularly affected by this issue. This observation aligns with the model’s overall performance and compactness results on the CUB200 dataset, which suggest that the learned prototypes lack sufficient class-discriminative information and are of comparatively low quality.
Interestingly, the IORD scores are nearly identical across all models, indicating that, on average, model focus is balanced between object and background regions despite differing levels of Background Overlap. This suggests that prototype activations are relatively uniform, probably due to the 95th percentile crop, leading to similar average scores for both object- and background-overlapping activations. This can also be observed in the example visualization in Fig. 6 and 7.
Object-part Consistency is generally low across models, with PIPNet performing slightly better. This indicates that prototypes in all models tend to capture multiple object parts rather than maintaining a one-to-one correspondence between prototypes and individual semantic parts, which diverges from the intended interpretability objective of prototype-based learning. On average, prototypes focus on approximately one-tenth of the object, as measured by the Object Overlap. This indicates that while a prototype may represent a coherent concept within a single sample image, it typically corresponds to multiple concepts across the test set.
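The overlap statistics in this section rest on thresholding each saliency map at its 95th percentile and intersecting the surviving pixels with a segmentation mask. A minimal sketch, with simplified definitions and illustrative names:

```python
import numpy as np

def percentile_mask(saliency, q=95):
    # Keep only the most relevant pixels, mirroring the 95th-percentile
    # crop applied to the saliency maps in the figures.
    return saliency >= np.percentile(saliency, q)

def object_overlap(saliency, object_mask, q=95):
    # Fraction of the thresholded saliency region that falls inside the
    # ground-truth object segmentation; Background Overlap would use
    # the complement of the object mask analogously.
    m = percentile_mask(saliency, q)
    if m.sum() == 0:
        return 0.0
    return float((m & object_mask).sum() / m.sum())
```

Because the crop keeps only a thin upper tail of activations, the retained region is small regardless of how diffusely the prototype activates, which helps explain the near-identical IORD scores noted above.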
Fig. 6
Example prototype saliency maps using the PRP visualization method on the CUB200 (top) and Cars196 (bottom) dataset. The visualizations show the top-2 prototypes with the highest similarity score. Images in column a) illustrate a sample from the test set. The column b) visualizes the prototype on the nearest training image, respectively. Each saliency map visualizes only the 95th percentile
All output-completeness results in Fig. 5 were normalized using either fixed metric boundaries or the maximum values observed across models. As previously mentioned, metric axes are inverted when lower values indicate better performance.
The evaluation of the PRP method for visualizing relevant pixels yields mixed results across all models. The visualizations are most consistent with the feature extraction process of PIPNet, which aligns best with the intended interpretation of high-level object-part extraction, as shown in previous evaluations. This suggests a clear correlation between the accuracy of PRP visualizations and the degree of disentanglement in the extracted features. The second-best visualizations were obtained for ProtoPNet, although with greater variability in activation location metrics such as PLC and PALC. This indicates that ProtoPNet prototypes are influenced by broader image regions compared to PIPNet, which the PRP method does not fully capture. Examples of PRP visualizations are illustrated in Figs. 6 and 7, which show similar sizes of the relevance areas. A similar trend is observed for ProtoPool, where visualizations are less reliable due to pronounced prototype location instability (as seen in the continuity evaluation) and similarly weak inter-prototype contrast, resulting in a more entangled feature space. Overall, these findings suggest that the effectiveness of backpropagation-based visualization methods strongly depends on the disentanglement of the learned embedding space.
8 Discussion
Fig. 7
Example prototype saliency maps using the PRP visualization method on the AWA2 (top) and NICO (bottom) dataset. The visualizations show the top-2 prototypes with the highest similarity score. Images in column a) illustrate a sample from the test set. The column b) visualizes the prototype on the nearest training image, respectively. Each saliency map visualizes only the 95th percentile
ProtoPNet primarily learned complex prototypes that captured low-level image characteristics rather than the intended semantic object parts, leading to an equally complex classification process. Pruning removed a small number of prototypes and slightly improved interpretability, but a consistent removal of background prototypes could not be achieved.
Example visualizations on the fine-grained datasets (Fig. 6) illustrate the challenges in interpreting the concepts learned by individual prototypes. While the learned concept can often be guessed when examining a single image, comparisons between the nearest feature visualizations from the training set and the sample image reveal that a prototype often encodes multiple concepts simultaneously. This issue is particularly pronounced in the other datasets (Fig. 7). For example, in the AWA2 dataset, interpretation is especially vague: prototype concepts such as “stripes” in zebras can hardly be guessed, while in the NICO dataset, sample features often span much larger regions than their nearest training features.
On CUB200 and Cars196, our accuracies are 5–10% below the original reports, most likely due to halving prototype channels (256\(\rightarrow \)128) for more stability under our training setup. In addition, only the classification loss was class-weighted to counter imbalance; cluster/separation losses were not. Pruning produced only marginal gains in interpretability, falling short of the clearer pruning effects reported by Chen et al. [16]. Compared to Wang et al. [64], our PRP-based visualizations yield smaller, sparser activations than the original upsampling method, explaining their higher Object Overlap measures.
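The class weighting mentioned above counters label imbalance in the classification loss. The exact weighting scheme is not spelled out in the text; a common choice, shown here purely as an assumption, is inverse-frequency weighting:

```python
import numpy as np

def inverse_frequency_weights(labels, n_classes):
    # Per-class weights proportional to inverse class frequency: rare
    # classes get larger weights, and a perfectly balanced dataset
    # yields a weight of 1.0 for every class. This is one plausible
    # scheme, not necessarily the one used in the paper.
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * np.maximum(counts, 1))
```

Such a weight vector would then scale the per-class terms of the classification loss only, leaving the cluster/separation losses unweighted as stated above.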
ProtoPool exhibited prototypes and classification processes of varying complexity, with clear differences between the two fine-grained datasets. Its performance appears to be strongly influenced by the complexity of shared object parts between classes. For example, in Cars196, shared car parts such as tires can be represented with low-level features like colour and shape, whereas in CUB200, the shared relations among bird species are more complex. Overall, ProtoPool tends to learn low-level image characteristics but emphasizes simpler, more interpretable classification processes than ProtoPNet, particularly in less complex classification tasks. We saw surprisingly good performance on the challenging multi-label dataset, suggesting that the shared prototype approach aligns well with such applications.
Figure 6 illustrates that prototypes in both fine-grained datasets often focus heavily on the background. Even in the better-performing Cars196 dataset, this issue is clearly visible. In contrast, visualizations on NICO and AWA2 (Fig. 7) demonstrate improved object focus, comparable to ProtoPNet, which is consistent with ProtoPool’s stronger results on these datasets.
Similar to ProtoPNet, the ProtoPool scores we obtained on CUB200 and Cars196 are 5–10% below the originally reported values, which we attribute to the same channel reduction and class-imbalance handling.
PIPNet achieved the best results in both prototype quality and classification simplicity, making it the most interpretable model. The combination of the softmax layer with the tanh loss proved highly effective in learning a minimal yet sufficient set of prototypes to achieve high accuracy. Furthermore, design choices in the classification layer, especially the hard positive constraint and the adapted output computation, successfully reduced the Local Size, yielding a more interpretable classification process compared to the other models.
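Compactness statistics of the kind discussed here can be read directly off the prototype-to-class weight matrix of the classification layer. The sketch below uses simplified definitions (the paper's Global and Local Size metrics may differ in detail): sparsity as the fraction of zero weights, global size as the number of prototypes used by any class, and local size as the mean number of prototypes per class.

```python
import numpy as np

def compactness_stats(W, eps=1e-6):
    # W: (n_classes x n_prototypes) classification-layer weights.
    # Returns (sparsity, global_size, local_size) under simplified
    # definitions; a weight is "used" if its magnitude exceeds eps.
    used = np.abs(W) > eps
    sparsity = 1.0 - used.mean()
    global_size = int(used.any(axis=0).sum())
    local_size = float(used.sum(axis=1).mean())
    return sparsity, global_size, local_size

# Toy layer: class 0 uses prototypes {0, 3}, class 1 uses {1};
# prototype 2 is unused by either class.
W = np.array([[0.9, 0.0, 0.0, 0.4],
              [0.0, 0.7, 0.0, 0.0]])
```

A hard positive constraint plus sparsity-inducing training, as in PIPNet, pushes the sparsity up and both size measures down, which is exactly what makes the resulting classification process easier to inspect.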
However, visual explanations (Figs. 6 and 7) reveal issues similar to those observed in ProtoPNet and ProtoPool. While prototype concepts can often be guessed in single-image visualizations, the nearest feature comparisons again reveal mismatches. Interestingly, PIPNet focuses more on background regions in the NICO and AWA2 datasets, showing the opposite trend of ProtoPool, despite both models employing shared prototypes. Relative to Nauta et al. [21], our PIPNet models underperform by 8%, which we primarily attribute to the lower feature map resolution used here (\(7{\times }7\) vs. \(28{\times }28\)) to ensure a uniform comparison across methods. Despite this gap, our compactness results mirror the original findings: comparable reductions in global size and high sparsity in the classification layer. PRP visualizations align best with PIPNet, though object-part consistency drops notably compared to the purity metric from the original study that focuses only on the top-10 nearest training images. On AWA2, PIPNet attains the highest accuracy. However, performance is likely suppressed by highly imbalanced labels and a multi-label margin loss that cannot incorporate class weights.
Overall, our evaluation focused on diverse model aspects relevant to interpretability, with mixed results across models and recurring patterns for individual architectures across the different evaluation techniques. Models generally struggle to learn semantically distinct prototypes and instead focus on lower-level image characteristics and shapes. This often results in a cramped latent space and similar activation locations, giving rise to complex relations between prototypes and classes. Evidently, learning semantic prototypes is a difficult task that current architectures struggle to achieve consistently. Approaches that promote some of the evaluated properties in a more direct manner, such as contrastive learning, appear to mitigate this problem well, as seen with the PIPNet model.
9 Conclusion
We examined the evaluation of prototype-based neural networks, drawing on the foundational research by Nauta and Seifert [19] and Nauta et al. [67]. The Co-12 properties identified by Nauta et al. [67] provide a robust framework for assessing the quality of explanations in XAI methods. We complement these with several new metrics and provide a comprehensive evaluation using four datasets. Our Python library integrates the existing evaluation techniques of the Co-12 framework and offers a practical toolkit for validating, benchmarking, and comparing part-based prototype networks.
For future work, we aim to apply this library for developing and refining new prototype models, in particular for (fine-)tuning and evaluation. In addition, we aim to extend the experimentation on further datasets.
Acknowledgements
This work has been partially supported by the funded project FRED, German Federal Ministry for Economic Affairs and Climate Action (BMWK), FKZ: 01MD22003E, as well as by the German Research Foundation (DFG), grant MODUS-II (316679917, AT 88/4-2).
Declarations
Competing Interests
The authors declare that they have no competing interests related to the subject matter discussed in this manuscript.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
1.
Wang X, Zhao Y, Pourpanah F (2020) Recent advances in deep learning. Int J Mach Learn Cybern 11:747–750
2.
Pouyanfar S, Sadiq S, Yan Y, Tian H, Tao Y, Reyes MP, Shyu M-L, Chen S-C, Iyengar SS (2018) A survey on deep learning: algorithms, techniques, and applications. ACM Comput Surv 51(5):1–36
3.
Rudin C, Chen C, Chen Z, Huang H, Semenova L, Zhong C (2022) Interpretable machine learning: fundamental principles and 10 grand challenges. Statistics Surveys 16:1–85
4.
Du M, Liu N, Hu X (2019) Techniques for interpretable machine learning. Commun ACM 63(1):68–77
5.
Zhang Y, Tiňo P, Leonardis A, Tang K (2021) A survey on neural network interpretability. IEEE Trans Emerg Top Comput Intell 5(5):726–742
6.
Atzmueller M, Fürnkranz J, Kliegr T, Schmid U (2024) Explainable and interpretable machine learning and data mining. Data Min Knowl Disc 38(5):2571–2595
7.
Rudin C (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell 1(5):206–215
8.
Stiglic G, Kocbek P, Fijacko N, Zitnik M, Verbert K, Cilar L (2020) Interpretability of machine learning-based prediction models in healthcare. Wiley Interdiscip Rev Data Min Knowl Discov 10(5):e1379
9.
Atzmueller M (2017) Declarative aspects in explicative data mining for computational sensemaking. In: Proceedings of international conference on declarative programming. Springer, pp 97–114
10.
Lundberg SM, Lee SI (2017) A unified approach to interpreting model predictions. In: Guyon I, Von Luxburg U, Bengio S, Wallach H, Fergus R, Vishwanathan S, Garnett R (eds) Advances in neural information processing systems, vol 30. Curran Associates, Inc.,
11.
Ribeiro MT, Singh S, Guestrin C (2016) "Why should I trust you?" Explaining the predictions of any classifier. In: Proceedings of ACM SIGKDD, pp 1135–1144
12.
Qi Z, Khorram S, Li F (2019) Visualizing deep networks by optimizing with integrated gradients. In CVPR Workshops 2:1–4
13.
Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2020) Grad-cam: visual explanations from deep networks via gradient-based localization. Int J Comput Vision 128:336–359
14.
Adebayo J, Gilmer J, Muelly M, Goodfellow I, Hardt M, Kim B (2018) Sanity checks for saliency maps. Adv Neural Inf Process Syst 31
15.
Li O, Liu H, Chen C, Rudin C (2018) Deep learning for case-based reasoning through prototypes: a neural network that explains its predictions. In: Proceedings of AAAI conference on artificial intelligence 32(1)
16.
Chen C, Li O, Tao D, Barnett A, Rudin C, Su JK (2019) This looks like that: deep learning for interpretable image recognition. Adv Neural Inf Process Syst 32
17.
Kim SS, Meister N, Ramaswamy VV, Fong R, Russakovsky O (2022) Hive: Evaluating the human interpretability of visual explanations. In: European conference on computer vision, Springer, pp 280–298
18.
Nguyen G, Kim D, Nguyen A (2021) The effectiveness of feature attribution methods and its correlation with automatic evaluation scores. Adv Neural Inf Process Syst 34:26422–26436
19.
Nauta M, Seifert C (2023) The co-12 recipe for evaluating interpretable part-prototype image classifiers. World conference on explainable artificial intelligence. Springer, pp 397–420
20.
Rymarczyk D, Struski Ł, Górszczak M, Lewandowska K, Tabor J, Zieliński B (2022) Interpretable image classification with differentiable prototypes assignment. European conference on computer vision. Springer, pp 351–368
21.
Nauta M, Schlötterer J, Van Keulen M, Seifert C (2023a) Pip-net: Patch-based intuitive prototypes for interpretable image classification. In Proceedings of IEEE/CVF conference on computer vision and pattern recognition, pp 2744–2753
22.
Wah C, Branson S, Welinder P, Perona P, Belongie S (2011) The caltech-ucsd birds-200-2011 dataset
23.
Krause J, Stark M, Deng J, Fei-Fei L (2013) 3d object representations for fine-grained categorization. In: Proceedings of IEEE international conference on computer vision workshops, pp 554–561
24.
He Y, Shen Z, Cui P (2021) Towards non-iid image classification: a dataset and baselines. Pattern Recogn 110:107383
25.
Xian Y, Lampert CH, Schiele B, Akata Z (2018) Zero-shot learning-a comprehensive evaluation of the good, the bad and the ugly. IEEE Trans Pattern Anal Mach Intell 41(9):2251–2265
26.
Schank RC (1975) The structure of episodes in memory. Representation and understanding. Elsevier, pp 237–272
27.
Kolodner JL (1992) An introduction to case-based reasoning. Artif Intell Rev 6(1):3–34
28.
Sørmo F, Cassens J, Aamodt A (2005) Explanation in case-based reasoning-perspectives and goals. Artif Intell Rev 24:109–143
29.
Poeta E, Ciravegna G, Pastor E, Cerquitelli T, Baralis E (2023) Concept-based explainable artificial intelligence: A survey. arXiv:2312.12936
30.
Koh PW, Nguyen T, Tang YS, Mussmann S, Pierson E, Kim B, Liang P (2020) Concept bottleneck models. In: Daumé III H, Singh A (eds) Proceedings of 37th international conference on machine learning, vol 119 of Proceedings of machine learning research. PMLR, pp 5338–5348
31.
Kim B, Wattenberg M, Gilmer J, Cai C, Wexler J, Viegas F et al. (2018) Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (tcav). In International conference on machine learning, PMLR, pp 2668–2677
32.
Wang J, Liu H, Wang X, Jing L (2021) Interpretable image recognition by constructing transparent embedding space. In Proceedings of IEEE/CVF International conference on computer vision (ICCV), pp 895–904
33.
Nauta M, Van Bree R, Seifert C (2021a) Neural prototype trees for interpretable fine-grained image recognition. In Proceedings of IEEE/CVF conference on computer vision and pattern recognition, pp 14933–14943
34.
Rymarczyk D, Struski Ł, Tabor J, Zieliński B (2021) Protopshare: prototypical parts sharing for similarity discovery in interpretable image classification. In Proceedings of ACM SIGKDD, KDD ’21. ACM, New York, pp 1420–1430
35.
Xue M, Huang Q, Zhang H, Cheng L, Song J, Wu M, Song M (2022) Protopformer: Concentrating on prototypical parts in vision transformers for interpretable image recognition. arXiv:2208.10431
36.
Ming Y, Xu P, Qu H, Ren L (2019) Interpretable and steerable sequence learning via prototypes. In Proceedings of ACM SIGKDD, pp 903–913
37.
Zhang Z, Liu Q, Wang H, Lu C, Lee C (2022) Protgnn: Towards self-explaining graph neural networks. In Proceedings of AAAI conference on artificial intelligence, vol.36, pp 9127–9135
38.
Sacha M, Rymarczyk D, Struski Ł, Tabor J, Zieliński B (2023a) Protoseg: Interpretable semantic segmentation with prototypical parts. In Proceedings of IEEE/CVF winter conference on applications of computer vision (WACV), pp 1481–1492
39.
Gautam S, Boubekki A, Hansen S, Salahuddin S, Jenssen R, Höhne M, Kampffmeyer M (2022) Protovae: a trustworthy self-explainable prototypical variational model. Adv Neural Inf Process Syst 35:17940–17952
40.
Pach M, Rymarczyk D, Lewandowska K, Tabor J, Zieliński B (2024) Lucidppn: unambiguous prototypical parts network for user-centric interpretable computer vision. arXiv:2405.14331
41.
van der Klis R, Alaniz S, Mancini M, Dantas CF, Ienco D, Akata Z, Marcos D (2023) Pdisconet: Semantically consistent part discovery for fine-grained recognition. In Proceedings of IEEE/CVF international conference on computer vision (ICCV), pp 1866–1876
42.
Carmichael Z, Redgrave T, Cedre DG, Scheirer WJ (2024) This probably looks exactly like that: an invertible prototypical network. arXiv:2407.12200
43.
Donnelly J, Barnett AJ, Chen C (2022) Deformable protopnet: An interpretable image classifier using deformable prototypes. In Proceedings of IEEE/CVF conference on computer vision and pattern recognition, pp 10265–10275
44.
Li MX, Rudolf KF, Blank N, Lioutikov R (2024) An overview of prototype formulations for interpretable deep learning. arXiv:2410.08925
45.
Stefenon SF, Singh G, Yow K-C, Cimatti A (2022) Semi-protopnet deep neural network for the classification of defective power grid distribution structures. Sensors 22(13):4859
46.
Mohammadjafari S, Cevik M, Thanabalasingam M, Basar A (2021) Using protopnet for interpretable alzheimer’s disease classification. In Canadian AI
47.
Kim E, Kim S, Seo S, Yoon S (2021) Xprotonet: Diagnosis in chest radiography with global and local explanations. In Proceedings of IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 15719–15728
48.
Carloni G, Berti A, Iacconi C, Pascali MA, Colantonio S (2022) On the applicability of prototypical part learning in medical images: breast masses classification using protopnet. International conference on pattern recognition. Springer, pp 539–557
49.
Elhadri K, Michalski T, Wróbel A, Schlötterer J, Zieliński B, Seifert C (2025) This looks like what? challenges and future research directions for part-prototype models. arXiv:2502.09340
50.
Xu-Darme R, Quénot G, Chihani Z, Rousset M-C (2023a) Sanity checks for patch visualisation in prototype-based image classification. In Proceedings of IEEE/CVF conference on computer vision and pattern recognition, pp 3691–3696
51.
Hoffmann A, Fanconi C, Rade R, Kohler J (2021) This looks like that... does it? shortcomings of latent space prototype interpretability in deep networks. arXiv:2105.02968
52.
Saralajew S, Holdijk L, Villmann T (2020) Fast adversarial robustness certification of nearest prototype classifiers for arbitrary seminorms. Adv Neural Inf Process Syst 33:13635–13650
53.
Huang Q, Xue M, Huang W, Zhang H, Song J, Jing Y, Song M (2023) Evaluation and improvement of interpretability for self-explainable part-prototype networks. In Proceedings of IEEE/CVF international conference on computer vision, pp 2011–2020
54.
Willard F, Moffett L, Mokel E, Donnelly J, Guo S, Yang J, Kim G, Barnett AJ, Rudin C (2024) This looks better than that: Better interpretable models with protopnext. arxiv:2406.14675
55.
Xu-Darme R, Varasse A, Grastien A, Girard J, Chihani Z (2024) Cabrnet, an open-source library for developing and evaluating case-based reasoning models. arXiv:2409.16693
56.
Ma C, Zhao B, Chen C, Rudin C (2024) This looks like those: illuminating prototypical concepts using multiple visualizations. Adv Neural Inf Process Syst 36
57.
Nauta M, Jutte A, Provoost J, Seifert C (2021b) This looks like that, because... explaining prototypes for interpretable image recognition. Joint european conference on machine learning and knowledge discovery in databases. Springer, pp 441–456
58.
Cheng HD, Jiang XH, Sun Y, Wang J (2001) Color image segmentation: advances and prospects. Pattern Recogn 34(12):2259–2281
Silva T, Rivera AR (2022) Representation learning via consistent assignment of views to clusters. In: Proceedings of 37th ACM/SIGAPP symposium on applied computing, pp 987–994
61.
Xu-Darme R, Quénot G, Chihani Z, Rousset MC (2023b) Sanity checks for patch visualisation in prototype-based image classification. In: Proceedings of IEEE/CVF conference on computer vision and pattern recognition, pp 3690–3695
62.
Gautam S, Höhne MM, Hansen S, Jenssen R, Kampffmeyer M (2023) This looks more like that: enhancing self-explaining models by prototypical relevance propagation. Pattern Recognition 136:109172
63.
Sacha M, Jura B, Rymarczyk D, Struski Ł, Tabor J, Zieliński B (2023b) Interpretability benchmark for evaluating spatial misalignment of prototypical parts explanations. arXiv:2308.08162
64.
Wang C, Liu Y, Chen Y, Liu F, Tian Y, McCarthy D, Frazer H, Carneiro G (2023) Learning support and trivial prototypes for interpretable image classification. arXiv:2301.04011
65.
Raschka S (2018) Model evaluation, model selection, and algorithm selection in machine learning. arXiv:1811.12808
66.
Buslaev A, Iglovikov VI, Khvedchenya E, Parinov A, Druzhinin M, Kalinin AA (2020) Albumentations: Fast and flexible image augmentations. Information, 11(2)
67.
Nauta M, Trienes J, Pathak S, Nguyen E, Peters M, Schmitt Y, Schlötterer J, van Keulen M, Seifert C (2023) From anecdotal evidence to quantitative evaluation methods: a systematic review on evaluating explainable ai. ACM Comput Surv 55(13s):1–42CrossRef