Open Access 01.12.2025 | Research

A problem-agnostic approach to feature selection and analysis using SHAP

Authors: John T. Hancock, Taghi M. Khoshgoftaar, Qianxin Liang

Published in: Journal of Big Data | Issue 1/2025

Abstract

This article examines a problem-agnostic approach to feature selection and analysis using SHAP for credit card fraud detection. It addresses the challenge of differing data label availability scenarios, namely no-class, one-class, and binary-class data. The methodology involves applying SHAP in conjunction with three types of machine learning algorithms: Isolation Forest, Gaussian Mixture Model (GMM), and XGBoost. The study uses the Kaggle Credit Card Fraud Detection dataset to validate the technique and shows that SHAP feature importance is meaningful and can be applied effectively across the different scenarios. The results show that models built with the top 15 features selected by SHAP perform similarly to models using all features, which makes the technique practical and efficient. The feature analysis highlights important features shared by the different classifiers and offers insight into the nature of fraudulent transactions. This approach provides a valuable contribution to the fields of credit card fraud detection and machine learning.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Abbreviations
ANOVA
Analysis of variance
AUC
Area Under the Receiver Operating Characteristic Curve
AUPRC
Area under the precision recall curve
CAE
Convolutional Autoencoder
CNN
Convolutional Neural Network
GBDT
Gradient Boosted Decision Trees
GMM
Gaussian Mixture Model
HSD
Honestly Significant Difference
PCA
Principal Components Analysis
SHAP
SHapley Additive exPlanations
SVM
Support Vector Machine

Introduction

Credit card fraud costs consumers billions of dollars annually in the United States [1]. Therefore, research into techniques for detecting credit card fraud is a worthwhile pursuit. Detection is the first step in prevention. Credit card transaction data is initially unlabeled, since fraudulent transactions are identified after the fact. Therefore, techniques for analyzing unlabeled and one-class credit card transaction data, such as we exhibit in this study, are always relevant.
In this study, we demonstrate a technique for feature analysis that applies in scenarios where data is of two classes, and may be labeled or not. There are three scenarios of data label availability. The first scenario, in which unsupervised machine learning algorithms are applicable, is when class labels are not available. We refer to this as the “no-class” scenario. The second scenario, where one-class classifiers are suitable, is when we have labeled instances of one class available. We refer to this as the “one-class” scenario. Finally, when the dataset contains labeled instances of two classes, we have the “binary-class” scenario.
In the methodology section of our study, we define a feature analysis technique applicable in all three data label availability scenarios. We accomplish this by showing that our technique works in all three scenarios with a combination of SHapley Additive exPlanations (SHAP) [2] and algorithms that are applicable in each scenario. To illustrate the technique, we utilize the Kaggle Credit Card Fraud Detection dataset [3], hereafter referred to as the Credit Card Fraud Detection dataset. Since the focus of our study is the exposition of a technique for working with data in the three data label availability scenarios, we use a single dataset. We selected this dataset because it is well-known, which makes the application of the new technique easier to follow. Moreover, we were unable to find another publicly available credit card fraud dataset similar to the one used in this study. We simulate the no-class and one-class scenarios with the Credit Card Fraud data. The Credit Card Fraud Detection dataset is labeled, and we use the labels to measure the classifiers’ performance in order to validate SHAP feature selection for feature analysis. We use unmodified, publicly available open-source software in our study; therefore, the techniques covered here are easily transferable to other datasets. The novelty of our study rests not only on the application of SHAP to the Credit Card Fraud Detection dataset, but also on the application of SHAP in conjunction with three types of machine learning algorithms, for three different problem types.
For the purposes of anonymization, the actual attribute names of the Credit Card Fraud Detection dataset are not defined. The dataset’s documentation states that 28 of the 30 independent variables of the Credit Card Fraud dataset are the result of applying Principal Components Analysis (PCA) [4] to some other data. To be precise, these attributes come from applying PCA to some dataset, which we will call dataset D. After considerable research, we were unable to find any further information about dataset D, to which PCA was applied to generate the 28 independent variables’ values. We know only that PCA was applied to it to produce features with uninformative names V1, V2, ..., V28. For this reason, we state that, to the best of our knowledge, information about the attributes of the Credit Card Fraud dataset is incomplete. Nevertheless, it is still possible to use SHAP to draw conclusions about which features of the dataset are important in all of the label availability scenarios.
Research work on new data often begins in a situation where the data appears valuable, but information about it is sometimes incomplete. For instance, an employee at a large financial institution may have access to a large amount of financial transaction data and be tasked with discovering fraudulent transactions. One often begins in such a no-class scenario. Moreover, information about attributes of the dataset may be incomplete, as is the case with the Credit Card Fraud dataset. This study demonstrates techniques which can be used in these situations. To the best of our knowledge, this is the first study to show how to successfully apply these techniques in the credit card fraud detection application domain.
We feel it is important to state explicitly that we prefer not to compare the model performance for the three scenarios because the scenarios should never coexist. If one has labeled data, one should use a binary classifier since they tend to perform better due to having more information about both class labels [5]. If one has data of only one class, then it is not possible to use a binary classifier, but a one-class classifier will yield better results than an unsupervised classifier. As a last resort, if one has no information about class labels, then one must use an unsupervised classifier. Moreover, for an unsupervised classifier, we chose to use Isolation Forest since it is commonly used as an unsupervised classifier in related work [6].
The remainder of our study is organized to show our technique for analysis that can be applied in scenarios that are common when researching credit card fraud detection. Hence, we begin with a review of related work, which shows the novelty of the research work we exhibit here. Following that we discuss the data used in our study, then we provide sections that cover the machine learning algorithms used, and our experimental methodology. Finally, we provide conclusions based on the outcome of our results.
Our study began with a literature review to look for opportunities to make a contribution to the study of feature analysis for the three dataset label availability scenarios. During our review it became apparent that there is no study that covers the technique described in our methodology for feature analysis with differing label availability in the credit card fraud application domain. We found many studies that investigate unsupervised learning or feature selection in classifying credit card transaction data, but not both. Moreover, we found most of the studies contain results reported with threshold-dependent metrics, such as accuracy, which do not provide as clear an insight into performance as threshold-independent metrics, such as area under the precision recall curve (AUPRC) [7] and Area Under the Receiver Operating Characteristic Curve (AUC) [8]. Threshold-dependent metrics give us the performance of a model for one choice of model output probability classification threshold value, whereas threshold-independent metrics give us the performance of a model for an approximation of all possible threshold values. Here we cover similar studies and discuss how our work contains a novel contribution, not present in current related literature.
“The effect of feature extraction and data sampling on credit card fraud detection” by Salekshahrezaee et al. [9] is a related study that uses the same dataset we use in this study. The authors cover data sampling techniques, including Random Undersampling, which can drastically reduce the size of a dataset. We use SHAP for feature selection, which can also reduce the size of a dataset, but does not require a labeled dataset. Random Undersampling, by definition, requires a labeled dataset. Another aspect of Salekshahrezaee et al.’s study that differs from ours is their use of feature extraction techniques. Salekshahrezaee et al. use PCA and a Convolutional Autoencoder (CAE) [10]. We use SHAP for feature selection. PCA and CAE both involve extracting new features out of the existing dataset. Finally, our study covers three different data label availability scenarios. We provide analysis that applies in situations where one has no labels for data, in situations where data of one class is available, and in situations where the dataset is labeled. Moreover, our study is a demonstration of a technique that can be applied to validate the quality of a feature selection technique for further use.
Another related study we found is “Anomaly detection using unsupervised methods: credit card fraud case study” by Rezapour [11]. Rezapour evaluates the performance of three unsupervised techniques for classifying the Credit Card Fraud Detection dataset. Hence, Rezapour’s study lies completely in the domain of the no-class scenario. Furthermore, unlike our study, Rezapour does not investigate feature importance with any of the unsupervised techniques used. The three unsupervised techniques Rezapour covers are: the One-Class Support Vector Machine (SVM) [12], Autoencoder, and the Robust Mahalanobis technique. Autoencoder and One-Class SVM are well-known machine learning techniques with publicly available, open-source implementations. The Robust Mahalanobis technique is a method of outlier detection that leverages the Mahalanobis distance metric to detect outliers. Rezapour reports results in the form of confusion matrices for each of the unsupervised techniques. A shortcoming of this method of presenting results is that results for only a single experiment are included, whereas we present results in a manner that captures the outcomes of multiple iterations of experiments, which provides better insight into the repeatability and statistical significance of our results. Moreover, the figures in Rezapour’s reported confusion matrices indicate that results are reported for a sample of the Credit Card Fraud data which is undersampled to an approximately 1:1 class ratio. Rezapour’s confusion matrices show that One-Class SVM and Autoencoder outperform the Robust Mahalanobis technique in terms of false negative classifications, and that Autoencoder outperforms One-Class SVM in terms of false positive classifications. Our study differs from Rezapour’s because we conduct experiments covering all three label availability scenarios, and we provide a feature importance analysis for each scenario.
In “Combining unsupervised and supervised learning in credit card fraud detection” [13], by Carcillo et al., the goal of their study is to propose a hybrid technique that combines supervised and unsupervised machine learning techniques in a single model that yields better performance in the classification of credit card transactions. Our study has a completely different goal, which is to apply a technique for feature analysis in different label availability scenarios. Carcillo et al. assume label availability in their study. Their approach is to add a vector of outlier scores to the attributes of the dataset, where the outlier scores are computed with different techniques, some of which are considered unsupervised machine learning techniques. The scores used are the multivariate Z-score, two scores based on Principal Component Analysis, one score based on the output of Isolation Forest, and one score based on the output of a Gaussian Mixture Model. Isolation Forest and Gaussian Mixture Model are unsupervised techniques. Hence, unsupervised machine learning techniques are used in the study, but they are used to add features to a labeled dataset. Our study is an investigation of the approaches one might take under different conditions of dataset label availability. For example, Carcillo et al. provide a table of feature importance for the data used in their study, where Random Forest is used to calculate feature importance. Since Random Forest is a supervised machine learning technique, this approach is only viable when a dataset is labeled. In our study, we use SHAP to determine feature importance. SHAP feature importance is determined by SHAP values, which are computed by determining the features’ impact on model output values; therefore, it can be used in the absence of labels on a dataset. Unlike Carcillo et al., we provide analysis of feature importance in the three scenarios of label availability. Hence, we provide a more general contribution to credit card fraud detection research.
“Performance evaluation of machine learning algorithms for credit card fraud detection” [14], by Mittal and Tyagi, is a study on the impact of the choice of classifier on experimental outcomes in classifying credit card transaction data. Hence, their study involves multiple classifiers, some of which use supervised techniques and some of which use unsupervised techniques. In their experiments, they employ ten supervised learners and four unsupervised learners. Three hybrid learners are also employed; however, we did not find much detail on the hybrid learners. In the context of hybrid learners, Mittal and Tyagi provide a reference to a survey which discusses hybrid learners, but we are unable to determine which hybrid learners mentioned in the survey correspond to those used in their study. Mittal and Tyagi do not discuss feature importance or feature selection. This is a significant difference between our studies. Our study is an investigation into how the features of a dataset can be analyzed in different conditions of label availability. In the results presented by Mittal and Tyagi, we find only threshold-dependent metrics are reported. When reporting the performance of models, threshold-independent metrics such as AUC and AUPRC, which we use here, give a better picture of their performance. We find our study addresses a completely different set of research questions than Mittal and Tyagi’s.
A fourth related study that we found is “Credit Card Fraud Detection using Classification, Unsupervised, Neural Networks Models” [15]. In their study, Bhavya et al. investigate the performance of three classifiers applied to credit card transaction data. Two of the classifiers are supervised machine learning techniques, Logistic Regression and Convolutional Neural Networks (CNNs) [16]. Bhavya et al. also employ k-means clustering as an unsupervised technique for classifying credit card transaction data. Hence, their approach is similar to previous related work, in that it can answer questions about the performance of classifiers given availability of labels on a dataset, but it does not offer insight into how one may validate feature importance with varying degrees of availability of the labels of a dataset. Bhavya et al. report results in terms of the accuracy metric, which we consider to be unsuitable for evaluating the performance of classifiers on imbalanced data. Another shortcoming of the accuracy metric is that it is threshold-dependent. Therefore, it can be uninformative of a model’s performance due to the choice of threshold value. As stated previously, the threshold-independent metrics AUC and AUPRC that we use here allow us to evaluate the performance of models over a large number of possible threshold values in the interval [0, 1]. Another matter related to the performance analysis done by Bhavya et al. is that the experimental outcomes of the supervised and unsupervised techniques are compared directly. Generally, supervised machine learning techniques outperform unsupervised techniques, because they have additional information which can be used during model training [5]. Our study is not meant to inform researchers on which sort of technique, supervised versus unsupervised, yields better performance in credit card transaction classification. Rather, our study is meant to inform researchers on a technique for assessing the viability of feature importance in different scenarios of label data availability.
Our literature review includes feature selection techniques applied to highly imbalanced credit card fraud transaction data. Fadaei and Moattar conducted a study, “Ensemble Classification and Extended Feature Selection for Credit Card Fraud Detection” [17], that employs an “extended wrapper” technique for feature selection. The extended wrapper technique is an ensemble feature selection technique since it incorporates feature importance from the Chi-squared, gain ratio, and ReliefF techniques. An important fact to mention about these techniques is that a labeled dataset is required for their application. Our study focuses on the application of a feature selection technique that is viable in situations where the dataset is not labeled. Fadaei and Moattar propose a model that incorporates their feature selection technique and the Decision Forest learner. Decision Forest is a general term for ensembles of Decision Trees. Fadaei and Moattar evaluate their proposed model against several previously implemented algorithms such as the Naive Bayes classifier and the J48 Decision Tree. Their proposed approach for detecting fraudulent credit card transactions outperforms the existing machine learning algorithms in terms of precision, recall, and F-measure. Hence, their study suffers the same shortcoming as others, since only threshold-dependent metrics are used to evaluate classification results. Fadaei and Moattar’s study is similar to ours in that it covers a feature selection technique used in classifying credit card transaction data. However, our study covers feature selection which can be used in an unsupervised machine learning setting. Our studies also have vastly different outcomes. The outcome of Fadaei and Moattar’s study is to show that their feature selection technique and classifier outperform previously implemented classifiers. The outcome of our study is to demonstrate a feature selection technique which we prove to be viable in different scenarios of label availability.
In a similar exposition of a feature selection technique, Prabhakaran and Nedunchelian present the “oppositional cat swarm optimization-based feature selection model with a deep learning model for credit card fraud detection” (OCSODL-CCFD) technique [18]. We conclude that OCSODL-CCFD is a supervised feature selection technique since Prabhakaran and Nedunchelian mention a “classifier error” term, \(\gamma _s\), used in calculating a value that is used in their feature selection technique. This is an important differentiating aspect of our studies, since the feature selection technique we use, SHAP, is applicable whether the data is labeled or not. Moreover, our study is not a venture into the discovery of new techniques for feature selection, but rather a demonstration of how one might prove a feature selection technique is viable in unsupervised machine learning scenarios. In what appears to be a trend in related studies, we find that all the metrics Prabhakaran and Nedunchelian use to report experimental outcomes (precision, recall, accuracy, F-measure, and MCC) are threshold-dependent metrics.
A final related study is “A model-agnostic feature selection technique to improve the performance of one-class classifiers” [19]. In their study, Hancock et al. cover the use of SHAP for feature selection for two one-class classifiers: One-Class SVM and GMM. We extend the work in this study to include the no-class scenario, with Isolation Forest, and the binary-class scenario, with XGBoost. Furthermore, we jettison One-Class SVM in this study, since Hancock et al. show that GMM outperforms One-Class SVM in terms of AUPRC. The key difference between the present study and Hancock et al.’s is that here, we show that the SHAP feature selection technique is viable, since it can be applied in the binary-class scenario and yield results that are similar to, or better than, using all features. Hence, the SHAP feature importance we obtain as part of SHAP feature selection is meaningful. Hancock et al. apply their feature selection technique in the one-class scenario. One could object that the AUPRC results they report as an endorsement of the SHAP feature selection technique are not meaningful, since in a true one-class scenario one would not have a dataset with instances of both classes from which performance in terms of AUPRC could be evaluated. Here, we show how a feature selection technique, which applies in unsupervised settings, can be validated in a supervised setting and then confidently applied in unsupervised settings. To the best of our knowledge, this is the first study in the credit card fraud detection application domain that demonstrates the viability of a feature selection technique by demonstrating its applicability in three data label availability scenarios.
In our review of related work, we discovered research in the credit card fraud detection domain that involved the subjects covered in this study. We found work on feature selection techniques and on unsupervised learning. However, we were unable to find studies that incorporate both feature selection and unsupervised machine learning simultaneously. Moreover, we present a technique for evaluating the viability of SHAP feature selection for credit card fraud transaction detection that shows it can apply in situations that are truly one-class or no-class scenarios. We show that SHAP feature importance can be used to build machine learning models that perform consistently. Hence, in a situation where one has only legitimate transaction data, or is tasked with discovering which instances of transaction data are fraudulent (a typical no-class scenario), our results show that one-class or unsupervised learners can be used, and SHAP can be applied for model explainability purposes. As stated previously, to the best of our knowledge, there are no existing studies that encompass a similar technique.

Case study data

The data used for the study is the Kaggle Credit Card Fraud Detection dataset [3], hereafter referred to as the Credit Card Fraud dataset. The dataset is the result of a collaborative effort between the Machine Learning Group at the Université Libre de Bruxelles and Worldline. It is publicly available for download from the Kaggle.com website.
The Credit Card Fraud data has a total of 31 attributes. We consider 29 of the attributes to be independent variables that are useful for machine learning. Descriptive statistics of these attributes are listed in Table 1. We define two terms to help explain the attributes of the Credit Card Fraud data: concrete and abstract. Concrete attributes are attributes that are clearly defined and directly relatable to something in everyday human experience. For example, transaction amount is a concrete attribute. Transaction amount is directly relatable to something in human experience. It is some amount of money used in the exchange of goods or services. Abstract attributes are not defined in terms of things directly relatable to human experience. For example, an attribute of a dataset that is the result of applying PCA to some unknown data is abstract. Two of the attributes of the dataset are concrete. One of the concrete attributes, Time, is not used since it is equivalent to a unique identifier. The Time attribute holds the number of seconds elapsed since the first transaction in the dataset. The dataset has one other concrete attribute, Amount, which is simply the transaction amount. It ranges in value from 0.0 to 25,691.16. We do not know the units for the Amount attribute. The remaining 28 attributes are abstract. They are the result of applying principal component analysis (PCA) [4] to an unknown source of numeric data. An inspection of the descriptive statistics of features V1 through V28 shows that their mean values are all approximately zero, and they range in value from approximately -100 to 100. However, their 25th to 75th percentile ranges are much smaller, generally in the interval from -1 to 1.
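As a minimal illustration, the descriptive statistics in Table 1 can be reproduced with pandas, assuming the dataset has been downloaded from Kaggle (the file name creditcard.csv below is an assumption):

```python
import pandas as pd

# Load the Kaggle Credit Card Fraud Detection data (file name assumed).
df = pd.read_csv("creditcard.csv")

# Drop Time (equivalent to a unique identifier) and the Class label,
# leaving the 29 independent variables considered in this study.
features = df.drop(columns=["Time", "Class"])

# Descriptive statistics corresponding to Table 1.
stats = features.describe().T[["mean", "std", "min", "25%", "50%", "75%", "max"]]
print(stats.round(4))
```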
Table 1
Credit card fraud data descriptive statistics
 
          Mean      Std        Min         25%       50%       75%       Max
V1        0.0000    1.9587     -56.4075    -0.9204   0.0181    1.3156    2.4549
V2        0.0000    1.6513     -72.7157    -0.5985   0.0655    0.8037    22.0577
V3        -0.0000   1.5163     -48.3256    -0.8904   0.1798    1.0272    9.3826
V4        0.0000    1.4159     -5.6832     -0.8486   -0.0198   0.7433    16.8753
V5        0.0000    1.3802     -113.7433   -0.6916   -0.0543   0.6119    34.8017
V6        0.0000    1.3323     -26.1605    -0.7683   -0.2742   0.3986    73.3016
V7        -0.0000   1.2371     -43.5572    -0.5541   0.0401    0.5704    120.5895
V8        0.0000    1.1944     -73.2167    -0.2086   0.0224    0.3273    20.0072
V9        -0.0000   1.0986     -13.4341    -0.6431   -0.0514   0.5971    15.5950
V10       0.0000    1.0888     -24.5883    -0.5354   -0.0929   0.4539    23.7451
V11       0.0000    1.0207     -4.7975     -0.7625   -0.0328   0.7396    12.0189
V12       -0.0000   0.9992     -18.6837    -0.4056   0.1400    0.6182    7.8484
V13       0.0000    0.9953     -5.7919     -0.6485   -0.0136   0.6625    7.1269
V14       0.0000    0.9586     -19.2143    -0.4256   0.0506    0.4931    10.5268
V15       0.0000    0.9153     -4.4989     -0.5829   0.0481    0.6488    8.8777
V16       0.0000    0.8763     -14.1299    -0.4680   0.0664    0.5233    17.3151
V17       -0.0000   0.8493     -25.1628    -0.4837   -0.0657   0.3997    9.2535
V18       0.0000    0.8382     -9.4987     -0.4988   -0.0036   0.5008    5.0411
V19       0.0000    0.8140     -7.2135     -0.4563   0.0037    0.4589    5.5920
V20       0.0000    0.7709     -54.4977    -0.2117   -0.0625   0.1330    39.4209
V21       0.0000    0.7345     -34.8304    -0.2284   -0.0295   0.1864    27.2028
V22       -0.0000   0.7257     -10.9331    -0.5424   0.0068    0.5286    10.5031
V23       0.0000    0.6245     -44.8077    -0.1618   -0.0112   0.1476    22.5284
V24       0.0000    0.6056     -2.8366     -0.3546   0.0410    0.4395    4.5845
V25       0.0000    0.5213     -10.2954    -0.3171   0.0166    0.3507    7.5196
V26       0.0000    0.4822     -2.6046     -0.3270   -0.0521   0.2410    3.5173
V27       -0.0000   0.4036     -22.5657    -0.0708   0.0013    0.0910    31.6122
V28       -0.0000   0.3301     -15.4301    -0.0530   0.0112    0.0783    33.8478
Amount    88.3496   250.1201   0.0000      5.6000    22.0000   77.1650   25691.1600
Moreover, the dataset is highly imbalanced: only 492 of the 284,807 transactions are labeled as fraudulent (positive), and the remaining transactions are labeled as not fraudulent (negative). The dataset has a Class attribute, which we treat as the dependent variable in the dataset. Instances of the negative class have a Class value of 0, and instances of the positive class have a Class value of 1.
Although the Credit Card Fraud dataset has a Class attribute, we do not use it to train models in the no-class and one-class scenarios. Hence, we do not use the Class attribute to fit GMM or Isolation Forest. SHAP does not require the dataset’s label, so we do not use it as part of SHAP feature selection. Class is only used for training the XGBoost model.

Algorithms

Here we review the three learners and the feature selection technique used in this study, beginning with Isolation Forest. Isolation Forest is an unsupervised machine learning algorithm because it does not need information about the labels of the dataset in order to classify a dataset. Hence, it is a suitable learner for scenarios where a dataset is not labeled. Above, we coined the term “no-class” for this scenario. In our study, we use Isolation Forest to classify labeled data. This is only to prove that Isolation Forest is capable of separating instances of a dataset into classes with a reasonable degree of performance. Since labeling a dataset may require manual effort, one may be faced with a decision about whether to expend effort to label a dataset. Reasonable performance from Isolation Forest might compel one to make that expenditure.
Isolation Forest was introduced by Liu et al. in 2008. It is an ensemble model, where the ensemble is composed of Isolation Trees, or “iTrees”. iTrees are decision trees built with a recursively applied process: the Isolation Forest algorithm takes a random sample of a user-defined size s from the training data, then randomly selects an attribute a of the data and a split value v between the minimum and maximum values of a. Instances of the training data are split into two subsets according to whether their value of a is less than v. This process is applied repeatedly until either one of the subsets contains a single instance, or the data has been split a predefined number of times. According to Liu et al., the expected number of times the process repeats is \(\log_2(s)\). When a dataset contains anomalies, this process is likely to discover them. Anomalies have extreme values for their attributes, so they are likely to have attribute values on one side of a randomly selected split value when the majority of instances have values on the other side. The user specifies the number of iTrees Isolation Forest builds.
Once the ensemble of iTrees is built, it can be used for classification. In a scenario where data is not labeled, evaluation of the classification would not be possible. In order to use Isolation Forest for classification, each instance of the test data is evaluated by each iTree in the ensemble, and an anomaly score is assigned to each instance. The anomaly score is a function of the mean path length required to evaluate the instance across all iTrees in the ensemble. In this study, we use the scikit-learn [20] implementation of Isolation Forest. We found it necessary to negate the output of the decision_function method in order to obtain a value which we could use as a probability for calculating AUPRC.
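The following is a minimal sketch of this no-class workflow. It uses a synthetic, highly imbalanced stand-in for the Credit Card Fraud data so that it runs on its own, and the hyperparameter values are illustrative rather than the settings used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, highly imbalanced stand-in for the Credit Card Fraud data.
X, y = make_classification(n_samples=5000, n_features=10, weights=[0.99],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y,
                                                    random_state=0)

# No-class scenario: Isolation Forest is fit without any labels.
iso = IsolationForest(n_estimators=100, random_state=0).fit(X_train)

# decision_function is larger for normal-looking instances, so we negate it
# to obtain an anomaly score that can play the role of a fraud probability.
scores = -iso.decision_function(X_test)

# Labels are used only to validate performance, as in the study.
print("AUC:  ", roc_auc_score(y_test, scores))
print("AUPRC:", average_precision_score(y_test, scores))  # AUPRC approximation
```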
The second scenario in this study is one where we have some information about the class membership of the instances of our dataset. However, the dataset contains instances that belong to only one of the classes. As stated previously, we call this the one-class scenario. Such a scenario might arise when theory predicts the existence of a new class of a phenomenon, but it has not been observed yet. For the one-class scenario, we use the Gaussian Mixture Model (GMM), since previous research shows GMM yields the best performance [5]. GMM fits multiple multivariate Gaussian distributions to the instances of a dataset via the expectation-maximization technique. Expectation-maximization is an iterative technique that optimizes the parameters of a set of Gaussian distributions. In the expectation stage, the probability of the points in the dataset belonging to the sum of the distributions is computed. Then, in the maximization stage, the distributions’ parameters are updated to increase the probability that the points in the training data belong to the distributions. Iterations stop when the change in the distributions’ parameters falls below a pre-determined threshold. Once the iterations of the expectation-maximization algorithm are complete, GMM may be used for classification. In order to classify an instance of a dataset with GMM, one calculates the probability of the sum of Gaussian distributions in a small region about the point. This probability can be interpreted as a class membership probability, and threshold-agnostic metrics, such as AUC and AUPRC, can be used to evaluate GMM’s classification performance.
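A corresponding sketch for the one-class scenario, reusing the X_train, X_test, y_train, and y_test arrays from the Isolation Forest sketch above. The number of mixture components is an illustrative choice, and negating the log-likelihood to obtain a fraud score is our assumption about a reasonable scoring convention rather than a prescription from the study.

```python
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.mixture import GaussianMixture

# One-class scenario: fit the mixture on instances of the known
# (legitimate, y == 0) class only.
gmm = GaussianMixture(n_components=4, random_state=0)
gmm.fit(X_train[y_train == 0])

# score_samples returns the log-likelihood of each instance under the fitted
# mixture; a low likelihood suggests an anomalous (potentially fraudulent)
# instance, so the negated value serves as a probability-like fraud score.
scores = -gmm.score_samples(X_test)

print("AUC:  ", roc_auc_score(y_test, scores))
print("AUPRC:", average_precision_score(y_test, scores))
```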
The third scenario in our study is the scenario where we have labeled instances from both classes in our dataset. Earlier we named this the “binary-class” scenario. Supervised machine learning algorithms are appropriate in this scenario. Gradient Boosted Decision Tree (GBDT) classifiers have their roots in Friedman’s Gradient Boosting Machine algorithm [21]. The output of the Gradient Boosting Machine algorithm is an ensemble of classifiers. Friedman’s method is iterative in nature. It starts with an initial learner that makes a preliminary prediction for the dependent variable. This prediction is then compared against the labels on the data, and the residual differences between the predictions and the labels are used as a secondary dataset. A second learner is fit to the secondary dataset. A new model is formed by taking the sum of the output values of the primary and secondary learners. This iterative process continues, with each new model fine-tuning its predictions based on the residuals left by the previous ensemble, thereby enhancing overall accuracy. The number of iterations is equal to the number of members of the ensemble.
For the binary-class scenario, we use XGBoost since previous research shows it yields the best performance in classifying the Credit Card Fraud dataset [22]. XGBoost, introduced by Chen and Guestrin in 2016, is a refinement of Friedman’s original concept [23]. It uses Decision Trees as the members of the ensemble. Hence, XGBoost is known as a GBDT technique. It builds upon the traditional GBDT framework with several key enhancements. One of the notable additions is an advanced loss function that incorporates regularization to prevent overfitting. XGBoost also brings a refined method for determining splits in its Decision Tree ensemble. The algorithm includes an “approximate algorithm” designed to estimate optimal split values, particularly useful in scenarios involving large datasets or distributed computing. Moreover, XGBoost addresses the challenges of sparse data with its “sparsity aware split finding” feature, allowing for more efficient fitting to datasets with sparse characteristics.
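A corresponding sketch for the binary-class scenario, again reusing the arrays from the earlier sketches; the hyperparameter values are illustrative.

```python
from sklearn.metrics import average_precision_score, roc_auc_score
from xgboost import XGBClassifier

# Binary-class scenario: XGBoost is trained with both class labels.
xgb = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
xgb.fit(X_train, y_train)

# Probability of the positive (fraud) class, used for threshold-agnostic metrics.
proba = xgb.predict_proba(X_test)[:, 1]
print("AUC:  ", roc_auc_score(y_test, proba))
print("AUPRC:", average_precision_score(y_test, proba))
```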
In all scenarios, we use SHAP in conjunction with the machine learning algorithms listed above to determine feature importance. Lundberg and Lee introduced SHAP (SHapley Additive exPlanations) in their 2017 paper, aiming to bridge the gap between simplicity and accuracy in model predictions [2]. They sought to provide an easily interpretable yet accurate method, using a game-theoretic approach from Shapley’s 1953 work for determining individual contributions in a coalition. Applying this to machine learning, they used SHAP to assess feature importance, showing that many existing methods fall under a broader class that includes their technique.
This class of methods builds secondary models to determine feature importance for specific input values of the main model. SHAP, in particular, stands out for its three properties: local accuracy (the secondary model’s output matches the main model’s output for the same input), missingness (a feature’s absence in the original dataset means it doesn’t influence the secondary model’s approximation), and consistency (the effect of a feature on the secondary models should mirror its impact on the main models). Lundberg and Lee demonstrated that SHAP is unique in the class of methods to which it belongs; it is the only technique with these three properties.
In their paper, Lundberg and Lee also discuss several SHAP variants, with Kernel SHAP being the primary model-agnostic version, applicable to any model. Kernel SHAP is a technique to calculate how much each attribute of a particular instance contributes to the model’s output value for that instance. Thus, SHAP values may be interpreted as representing feature importance. SHAP values can be either positive or negative and are interpreted based on their absolute magnitude. Averaging these absolute SHAP values across multiple instances is a technique for calculating feature importance. SHAP does not function differently for any of the three label availability scenarios. This is because SHAP does not rely on the class label, but on model output values instead. Moreover, we apply SHAP as a stand-alone library. SHAP is passed a sample of data and a trained model, and with these, SHAP calculates feature importance.
SHAP is available as an open-source Python library, providing tools for visualizing feature importance. These tools include plots showing attribute contributions to model output, the impact of input changes on outputs, and local importance plots for individual instances. This availability and theoretical backing make SHAP a valuable, model-agnostic resource for understanding and explaining feature importance in machine learning models.
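As a small illustration of the library, a Kernel SHAP explainer can be wrapped around any fitted model's prediction function, and the resulting SHAP values can be passed to the library's plotting utilities. The sketch below reuses the xgb model and the X_train and X_test arrays from the earlier sketches; the sample sizes are illustrative, chosen only to keep Kernel SHAP's runtime small.

```python
import shap

# Kernel SHAP is expensive, so small samples of the data are used as the
# background and evaluation sets.
background = shap.sample(X_train, 100)
evaluation = shap.sample(X_test, 100)

# Model-agnostic explainer wrapped around the model's probability output
# (any of the three models could be substituted here).
explainer = shap.KernelExplainer(lambda X: xgb.predict_proba(X)[:, 1], background)
shap_values = explainer.shap_values(evaluation)

# Global feature-importance view: mean absolute SHAP value per feature.
shap.summary_plot(shap_values, evaluation, plot_type="bar")
```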

Methodology

The goal of our research is to explore methods for applying machine learning to datasets where we have varying degrees of information about the labels. Hence, we designed a suite of experiments that implements these methods. In this research, the experimental framework was established on a distributed computing system. This system consisted of nodes outfitted with 16-core Intel Xeon processors, each accompanied by 256 GB of RAM and equipped with Nvidia V100 graphics processing units.
As stated previously, our motivation for selecting Isolation Forest, GMM, and XGBoost is to use one learner for each scenario of label availability. Each classifier is appropriate depending on how much information we have about class membership. We chose SHAP as the method for determining feature importance for each model because it is applicable to each model in each scenario. Furthermore, in order to utilize SHAP, one does not need expert knowledge about a dataset’s attributes. We have very little information about the origins of the Credit Card Fraud dataset’s attributes because they are the result of PCA. Even though the Kaggle Credit Card Fraud Detection dataset is the product of PCA, further data reduction can be achieved by selecting features that are more useful for credit card fraud detection. Therefore, SHAP is an appropriate technique for determining feature importance. Once we have a ranking of feature importance, we confirm the feature importance by building models with increasing numbers of important features. The expected result is that models built with some number of the most important features should yield performance equivalent to models built with all features. Such a result would confirm that SHAP is capable of identifying the important attributes of a dataset.
Therefore, for each classifier, Isolation Forest, GMM, and XGBoost, we first train a model on all the viable attributes of the Credit Card Fraud dataset, that is, Amount and features V1–V28. Then we apply the SHAP kernel explainer to the trained model and a sample of the Credit Card Fraud data. We then call the kernel explainer’s shap_values function, passing it another sample of the Credit Card Fraud data. For each instance in the sample passed to shap_values, the output is a value for every attribute indicating how much that attribute contributes to the output value of the classifier. These values are known as the SHAP values of the attributes. SHAP values may be positive or negative, and their magnitude indicates the effect the attribute has on the model’s output value for a particular instance of the dataset. Therefore, the absolute value of a SHAP value is an indicator of the effect of an attribute on the model’s output. We then use the mean absolute SHAP value of every attribute to put an order on the attributes of the data. In the SHAP library documentation, Lundberg, who co-authored the original paper on SHAP, notes his method of calculating attribute importance is to use the average of the absolute SHAP scores for each feature.1 We find that this technique for computing the attributes’ importance places different importance on data attributes depending on the classifier we use. This aligns with the expectation that different models will leverage different features to learn how to classify a dataset. Note that in the no-class and one-class scenarios, labels are not required to train the model or to apply SHAP.
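A hedged sketch of this ranking step, assuming shap_values computed with a kernel explainer as in the earlier sketch and a list feature_names giving the corresponding attribute names (both names are our assumptions):

```python
import numpy as np

# shap_values: array of shape (n_instances, n_attributes) from the explainer.
# feature_names: list of the corresponding attribute names.
mean_abs_shap = np.abs(shap_values).mean(axis=0)

# Order attributes from most to least important.
ranking = np.argsort(mean_abs_shap)[::-1]
ranked_features = [feature_names[i] for i in ranking]

# Select the k most important attributes (k is 3, 5, 7, 10, 15, or 29 here).
k = 15
top_k_features = ranked_features[:k]
print(top_k_features)
```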
After computing the SHAP feature importance for each classifier, we select the k most important attributes, according to their mean absolute SHAP value, and evaluate models built with those attributes in ten iterations of five-fold cross-validation. k takes the values 3, 5, 7, 10, 15, and 29. In five-fold cross-validation, we shuffle the data and split it into stratified samples of 80% and 20%. We train the models with 80% of the data and test with 20% of the data. The process is repeated five times. Each fold of five-fold cross-validation yields one AUC and one AUPRC score, which we record and report in the results section. Ten iterations of five-fold cross-validation yield 50 experimental outcomes, which is a sufficient number of values for statistical analysis. The object of the statistical analysis is to determine whether there is any merit to the ranking provided by the mean absolute SHAP values of the attributes. If the statistical analysis indicates that models built with a subset of the attributes with higher mean absolute SHAP values yield performance that is similar to, or better than, models built with all attributes, then we have confirmation that the mean absolute SHAP value is an indication of the attribute’s importance.
As mentioned previously, we record one AUC and one AUPRC score during each fold of five-fold cross-validation. AUC is calculated by plotting the true positive rate and false positive rate for many model output probability threshold values. The true positive rate is used as the y-coordinate, and the false positive rate is used as the x-coordinate. The model output probability is the probability of class membership that the model assigns to an instance. The model output probability threshold is a value that is compared to the model output probability. If the output probability is larger than the threshold, the instance is classified as a member of one class; otherwise, it is classified as a member of the other class. After the true positive and false positive rates have been plotted for each threshold value, a curve forms. The area under this curve is calculated by means of numerical methods. AUPRC is calculated similarly; however, instead of plotting true positive and false positive rates, we plot precision and recall.
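A small sketch of how the two metrics can be computed with scikit-learn in the way described above, with AUPRC obtained by numerically integrating the precision-recall curve; y_test and scores are assumed to be the labels and model output scores for one cross-validation fold.

```python
from sklearn.metrics import auc, precision_recall_curve, roc_curve

# ROC curve: false positive rate (x) vs. true positive rate (y) over thresholds.
fpr, tpr, _ = roc_curve(y_test, scores)
auc_score = auc(fpr, tpr)

# Precision-recall curve, integrated numerically to give AUPRC.
precision, recall, _ = precision_recall_curve(y_test, scores)
auprc_score = auc(recall, precision)

print("AUC:  ", auc_score)
print("AUPRC:", auprc_score)
```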
Once we have confirmation that our technique for using SHAP to compute feature importance is viable, we can do feature analysis. The SHAP feature importance, combined with knowledge of the functioning of the classifier, enables us to make conjectures on the nature of the important features. We conduct feature analysis first by considering each list of the most important features individually. Then we analyze the features in the intersection of the most important features for each of the three possible pairs of classifiers. There is a similarity metric, the Kuncheva index, which can be used to quantify the similarity of two sets. The index was introduced by Kuncheva [24]. The Kuncheva index takes a value from −1 to 1, where identical sets have a Kuncheva index closer to 1, and disjoint sets have a Kuncheva index closer to −1. The formula for the Kuncheva similarity index, \(I_c\), of two sets, \(T_i\) and \(T_j\), is
$$\begin{aligned} I_c(T_i, T_j) = \frac{dp - k^2}{k (p-k)}, \end{aligned}$$
(1)
where k is the common size of the sets \(T_i\) and \(T_j\), d is the size of their intersection, and p is the total number of features in the dataset. In addition to using the Kuncheva index to compare pairs of feature sets, we consider the group of attributes which appear in all three groups of important features.
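A direct implementation of Eq. (1) is straightforward; the two feature subsets in the usage example are hypothetical and serve only to show the calling convention.

```python
def kuncheva_index(t_i, t_j, p):
    """Kuncheva similarity index of two equal-size feature sets (Eq. 1).

    t_i, t_j : collections of selected feature names, both of size k
    p        : total number of features in the dataset
    """
    k = len(set(t_i))
    d = len(set(t_i) & set(t_j))
    return (d * p - k ** 2) / (k * (p - k))

# Hypothetical example: two 3-feature subsets from a 29-feature dataset.
print(kuncheva_index({"V1", "V2", "V3"}, {"V2", "V3", "V4"}, p=29))
```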

Results

We present two forms of results. The first form is classification performance results. These motivate the second form, which is presented as a feature analysis. The feature analysis covers the features selected by the combination of machine learning models and the SHAP feature importance ranking. The results we present cover all three label availability scenarios: no-class, one-class, and binary-class. We report classification results in terms of the threshold-agnostic metrics AUC and AUPRC. However, we perform statistical analysis of the AUPRC scores because previous research on experiments in classifying highly imbalanced data shows that AUPRC is the better metric for revealing the impact of experimental factors [25].

Performance analysis

Experiments were conducted with three learners and six levels of features selected: 3, 5, 7, 10, 15, and 29. We conduct ten iterations of five-fold cross-validation, so for every learner, we have 300 experimental outcomes. We present the results for Isolation Forest first. This is the first, or no-class, label availability scenario. Table 2 contains the mean AUC and AUPRC scores for ten rounds of five-fold cross-validation. Hence, the scores in Table 2 are the mean values of 50 scores, where each score is obtained as the outcome of one fold of cross-validation. We observe that AUC and AUPRC scores usually rise as the number of features increases. There is one exception in Table 2, which is that Isolation Forest yields better performance with 15 features than with 29 features. Moreover, we note in Table 2 that the change in AUC scores is not as dramatic. This is attributed to AUC being an inappropriate metric for imbalanced data.
Table 2
Mean AUC and AUPRC scores by the number of features selected for isolation forest
Features   AUC      AUPRC
3          0.5827   0.0023
5          0.7791   0.0151
7          0.8274   0.0615
10         0.9261   0.2162
15         0.9430   0.2636
29         0.9520   0.2448
Next, we present a statistical analysis of the AUPRC scores listed in Table 2. We perform a one-factor Analysis of Variance (ANOVA) [26] test to assess the impact of the number of features on experimental outcomes. The result of the ANOVA test is in Table 3. The Pr(>F), or p-value, in Table 3 is practically zero, which implies the number of features has a significant impact on experimental outcomes.
Table 3
ANOVA for features as a factor of performance in terms of AUPRC
 
            Df    Sum Sq   Mean Sq   F value   Pr(>F)
Features    5     3.63     0.73      819.28    0.0000
Residuals   294   0.26     0.00
Since the ANOVA test result shows the number of features has a significant effect on experimental outcomes, we conduct a Tukey’s Honestly Significant Difference (HSD) [27] test to rank the number of features used in terms of their effect on experimental outcomes. The result of a Tukey HSD test is to group experimental factors such that members of the first group, designated as group ‘a’, are associated with the highest mean AUPRC scores, and there is not a statistically significant difference in the mean AUPRC scores between members of the group. Subsequent groups have identifiers that follow in alphabetical order. For example, if an HSD test were to have groups denoted ‘b’ and ‘c’, then members of group ‘b’ would have mean AUPRC scores between those of members of groups ‘a’ and ‘c’. The HSD result in Table 4 shows that Isolation Forest models built with either 15 or 29 features yield the best performance.


Table 4
HSD test groupings after ANOVA of AUPRC for the features factor
Group a consists of: 15, 29
Group b consists of: 10
Group c consists of: 7
Group d consists of: 5, 3
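The same statistical procedure is repeated for each learner. Below is a minimal sketch of it using statsmodels; the results DataFrame is a synthetic stand-in for the recorded fold-level AUPRC scores, and the column names are our assumptions. pairwise_tukeyhsd reports pairwise comparisons, from which letter groupings such as those in Tables 4, 7, and 10 can be derived.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Stand-in for the recorded outcomes: 50 AUPRC scores per feature-count level.
rng = np.random.default_rng(0)
results = pd.DataFrame({
    "features": np.repeat([3, 5, 7, 10, 15, 29], 50),
    "auprc": rng.normal(0.5, 0.05, 300),
})

# One-factor ANOVA: does the number of features significantly affect AUPRC?
model = ols("auprc ~ C(features)", data=results).fit()
print(anova_lm(model))

# Tukey HSD test: which feature-count levels yield statistically similar AUPRC?
hsd = pairwise_tukeyhsd(results["auprc"], results["features"].astype(str))
print(hsd.summary())
```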
The next performance results we present are for GMM. This is the one-class label availability scenario. In Table 5, we see that both AUC and AUPRC scores peak for models built with 15 features.
Table 5
Mean AUC and AUPRC Scores by the number of features selected for GMM
Features   AUC      AUPRC
3          0.8971   0.2515
5          0.9260   0.3669
7          0.9280   0.3411
10         0.9397   0.4071
15         0.9584   0.6420
29         0.9447   0.5825
Next, we do a statistical analysis of GMM’s AUPRC scores for varying levels of features used. First, we confirm that the number of features used to build a model has a significant impact on experimental outcomes with an ANOVA test. The ANOVA test result in Table 6 has a Pr(>F) value which is practically zero. This means that differences in the mean AUPRC scores of GMM for models built with different numbers of features are not due to random chance. Therefore, a Tukey HSD test will inform us as to which models built with which numbers of features yield the best performance. The results in Table 7 show that GMM models built with 15 features yield performance statistically similar to models built with all 29 features.


Table 6
ANOVA for features as a factor of performance in terms of AUPRC
 
            Df    Sum Sq   Mean Sq   F value   Pr(>F)
Features    5     3.63     0.73      819.28    0.0000
Residuals   294   0.26     0.00
Table 7
HSD test groupings after ANOVA of AUPRC for the features factor
Group a consists of: 15, 29
Group b consists of: 10
Group c consists of: 7
Group d consists of: 5, 3
Now we move on to consider the performance of XGBoost as we increase the number of features in the order of their SHAP ranking. This is the binary-class scenario. In Table 8, we find a pattern similar to the AUC and AUPRC scores we recorded for Isolation Forest and GMM. Scores increase with the number of features; however, the increase slows for the largest numbers of features.
Table 8
Mean AUC and AUPRC scores by the number of features selected for XGBoost
Features   AUC      AUPRC
3          0.9699   0.7247
5          0.9722   0.8165
7          0.9730   0.8302
10         0.9739   0.8446
15         0.9783   0.8535
29         0.9793   0.8570
Finally, we conduct a statistical analysis of XGBoost’s AUPRC scores. In order to determine if an HSD test is applicable, we conduct an ANOVA test. The ANOVA test results are listed in Table 9. Since the Pr(>F) value associated with the number of features is practically zero, an HSD test is applicable. The HSD test result in Table 10 shows that, as is the case with GMM and Isolation Forest, there is no statistically significant difference between the AUPRC scores of models built with 15 or 29 features.


Table 9
ANOVA for features as a factor of performance in terms of AUPRC
 
            Df    Sum Sq   Mean Sq   F value   Pr(>F)
Features    5     0.61     0.12      152.06    0.0000
Residuals   294   0.24     0.00
Table 10
HSD test groupings after ANOVA of AUPRC for the features factor
Group a consists of: 29, 15
Group ab consists of: 10
Group bc consists of: 7
Group c consists of: 5
Group d consists of: 3
The key take-away from our performance analysis concerns the number of features with which we can build models. The HSD test results for XGBoost, Isolation Forest, and GMM all indicate that, with the Credit Card Fraud data, one may build models with the top 15 features and they yield performance similar to, or better than, models built using all features. It is also important to note that we obtain these results for models that are appropriate to use in all three scenarios of label data availability. Put another way, the HSD results show that no matter how much information we have about the class membership of instances of a dataset, we can apply SHAP to rank the attributes of the dataset, select roughly the top half of the features, and build models that yield performance equivalent to using all the features. This confirms the relevance of SHAP feature importance.

Feature analysis

The second form of results we present is a feature analysis. In Tables 11 and 12, we list the 15 highest-ranked features, where the rank of a feature is determined by applying SHAP to the classifier and the Credit Card Fraud data. Arranging the features in this way enables us to observe which important features are in common for different combinations of classifiers. If a feature does not appear in a classifier’s row in Table 11 or 12, its SHAP rank is not in the top 15 for that classifier.
Table 11
Part I, features selected by applying SHAP to Isolation Forest (IF), GMM, and XGBoost (XGB)
Classifier   Features in the top 15 from Amount and V1-V13
IF           Amount, V2, V3, V4, V10, V12, V13
GMM          V2, V3, V5, V7, V8, V9, V10, V11, V12
XGB          Amount, V3, V4, V7, V8, V9, V10, V11, V12, V13
Table 12
Part II, features selected by applying SHAP to Isolation Forest (IF), GMM, and XGBoost (XGB)
Classifier   Features in the top 15 from V14-V28
IF           V16, V18, V19, V22, V24, V25, V26, V27
GMM          V14, V16, V17, V18, V21, V26
XGB          V14, V15, V17, V19, V22
There are three forms of feature analysis that we can do based on the results from ranking the features with SHAP. We can analyze each feature set individually, we can compare pairs of feature sets, and we can look at features that all three feature sets have in common. Therefore, we proceed with analysis in this order. To begin the analysis of individual feature sets, we remind the reader that the Credit Card Fraud dataset has two features with a clear meaning, Time and Amount. As stated previously, we do not use the Time feature because it is equivalent to a unique identifier and would cause memorization. Therefore, the only feature fed to our models which relates directly to something in experience is the Amount feature, which holds the value of the transaction amount. As we explained above, the remaining features, V1–V28, are combinations of the attributes of some other dataset, computed by PCA. Since it is possible to train machine learning models such as XGBoost that can classify the Credit Card Fraud data with high AUPRC scores, we know the values V1–V28 carry meaningful information that is useful in identifying fraudulent transactions. Of the three models we use, GMM and Isolation Forest do anomaly detection, so important features for these models pertain to anomaly detection. XGBoost does supervised learning, so features important for XGBoost must carry information that relates to the Class (fraudulent/not fraudulent) label of the Credit Card Fraud dataset.
We begin the analysis by inspecting the 15 most important features for Isolation Forest, in the no-class scenario. Since Isolation Forest is an unsupervised learning algorithm, it operates by detecting anomalous instances. In the context of credit card transactions, these anomalies can potentially represent fraudulent activities. When SHAP is applied to Isolation Forest, to rank the features of the Credit Card Fraud data, the top 15 features are: Amount, V2, V3, V4, V10, V12, V13, V16, V18, V19, V22, V24, V25, V26, and V27. Therefore, anomalies in the Credit Card Fraud data must be more apparent in these features. The inclusion of Amount indicates the transaction size’s significant role in Isolation Forest’s anomaly detection process.
Next, we consider the feature set which results from applying SHAP to GMM and selecting the 15 features with the highest SHAP ranking. This is the one-class scenario. GMM fits one or more multivariate Gaussian distributions to the data. This lends GMM a natural capability to identify clusters or groups within the data. The fifteen most important features identified by SHAP are V2, V3, V5, V7, V8, V9, V10, V11, V12, V14, V16, V17, V18, V21, and V26. We note that Amount is missing from GMM’s feature set. This indicates GMM detects anomalies with a focus on inherent transaction characteristics rather than transaction size.
The third feature set we cover in this analysis is the one obtained in the binary-class label availability scenario by applying SHAP to XGBoost and selecting the 15 most important features. The features are Amount, V3, V4, V7, V8, V9, V10, V11, V12, V13, V14, V15, V17, V19, and V22. The combination of XGBoost and SHAP demonstrates a balanced emphasis on both the transaction amount and PCA-derived features. Since XGBoost yields the best performance, it is tempting to conclude that the combination of XGBoost and SHAP is the best feature selection technique. However, we wish to point out that in true no-class and one-class scenarios, one will not have the luxury of a labeled dataset with which to employ XGBoost.
A second form of analysis is to compare the features selected for each pair of classifiers. This analysis is important for the following reason. We do not have information about what 28 of the 30 attributes of the Credit Card Fraud data represent; these are the features named V1–V28, and we only know that they are the outcome of applying PCA to other data about which we know nothing. However, SHAP uses the classifier and the independent features of the data to determine feature importance. If SHAP indicates that the same features are important for two classifiers, and we know something about how those two classifiers learn to classify the data, we can learn something about the features: we can deduce that a shared feature is useful in two distinct processes simultaneously. Such deductions provide information about the nature of a feature. We cannot deduce, for example, that V3 is a person's age, but we can deduce that it influences the outcomes of both GMM and XGBoost. We believe this method of analysis is novel. Even though we know nothing about the origins of the features V1–V28, we can nevertheless learn something about them by observing how important they are to pairs of classifiers.
Since we have three classifiers, there are three possible pairs: Isolation Forest and GMM, Isolation Forest and XGBoost, and GMM and XGBoost. Isolation Forest and GMM have seven features in common: V2, V3, V10, V12, V16, V18, and V26. Accordingly, in Table 13 we see a Kuncheva index of −0.0667 for this pair. Since Isolation Forest and GMM take different approaches to anomaly detection, we surmise these features are useful in both approaches. That is to say, since these features are important to Isolation Forest, we know that certain combinations of their values occur infrequently in the data. Moreover, since these features are important to GMM, we also know that they tend to cause data points to lie outside the high-probability regions of the multivariate Gaussian distributions that GMM fits to the Credit Card Fraud data. Hence, we conjecture that the instances of the Credit Card Fraud data corresponding to fraudulent transactions lie in isolated regions of the dataset's feature space.
Table 13 Kuncheva index values for each pair of feature sets: XGBoost+SHAP vs GMM+SHAP vs Isolation Forest+SHAP

Feature Set 1 \ Feature Set 2 | GMM+SHAP | Isolation Forest+SHAP | XGBoost+SHAP
GMM+SHAP                      |  1.0000  | −0.0667               |  0.2000
Isolation Forest+SHAP         | −0.0667  |  1.0000               | −0.0667
XGBoost+SHAP                  |  0.2000  | −0.0667               |  1.0000
We move on to compare the nine features common to the GMM and XGBoost feature sets: V3, V7, V8, V9, V10, V11, V12, V14, and V17. We note that XGBoost and GMM have more features in common than either has with Isolation Forest, and Table 13 shows that these two feature sets have the largest Kuncheva index value, 0.2000. XGBoost and GMM are dissimilar techniques: as stated previously, GMM approximates the majority class with a sum of multivariate Gaussian distributions, while XGBoost adds the outputs of decision trees to yield the probability that an instance is a member of a class. Hence, on one hand, we find no explanation for the feature overlap in terms of how the two classifiers function. On the other hand, XGBoost and GMM yield stronger performance, and from that perspective it makes sense that their 15 most important features overlap more than they do with Isolation Forest's. XGBoost and GMM also have access to more information about the class labels, which is another factor contributing to the larger overlap between the feature sets selected in these two scenarios.
The final pair of classifiers we compare is XGBoost and Isolation Forest. Of the 15 most important features that SHAP selects for each classifier, those in common are Amount, V3, V10, V12, V13, V19, and V22. As with Isolation Forest and GMM, there are seven features in common; hence, in Table 13 the Kuncheva index for these two feature sets is also −0.0667. Though the Kuncheva index for this pair of feature sets is the same as that for Isolation Forest and GMM, the shared features are different. This is not surprising, since we are comparing the outcomes of feature selection in different label availability scenarios. Interestingly, transaction amount is selected for both classifiers. This is noteworthy because both Isolation Forest and XGBoost are decision-tree-based learners. We therefore surmise that Amount is a useful variable for separating transactions into classes, since the nodes of a decision tree are rules that split instances according to their values of a specific attribute. We also know that Isolation Forest has no information about the class label and therefore separates the instances of the dataset by anomaly detection. We conclude that transaction amount carries information both about how unusual (anomalous) an instance of the credit card data is and about class membership. Hence, we have evidence of how one feature can be important in different label availability scenarios.
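For reference, the entries of Table 13 follow from Kuncheva's consistency index, I_C(A, B) = (rn − k^2) / (k(n − k)), where n is the number of candidate features, k is the size of each selected subset, and r is the number of features the two subsets share. The short sketch below reproduces the table from the intersection sizes reported above, assuming n = 30 candidate attributes and k = 15.

```python
def kuncheva_index(r: int, k: int, n: int) -> float:
    """Kuncheva's consistency index for two feature subsets of equal size k,
    drawn from n candidate features, that share r features."""
    return (r * n - k * k) / (k * (n - k))

n, k = 30, 15  # 30 candidate attributes, top-15 subsets
pairs = {
    "Isolation Forest+SHAP vs GMM+SHAP": 7,
    "GMM+SHAP vs XGBoost+SHAP": 9,
    "Isolation Forest+SHAP vs XGBoost+SHAP": 7,
}
for name, r in pairs.items():
    print(f"{name}: {kuncheva_index(r, k, n):+.4f}")
# Prints -0.0667, +0.2000, and -0.0667, matching Table 13.
```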
A final comparison is to examine what all three feature sets have in common. First, we look at the features listed in Tables 11 and 12 that are selected in all three scenarios: V3, V10, and V12. We conclude that these features are strong indicators of anomalies and must also be effective in determining the label. They are strong indicators of anomalies because they are important to both GMM and Isolation Forest, and they must be effective in determining the label because they are important to XGBoost. Such a conclusion about features of the Credit Card Fraud data can only be reached through this type of analysis, which considers feature importance for classifiers that may be used in the different label availability scenarios.
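This three-way overlap can be checked directly from the three top-15 lists quoted earlier; the sketch below simply transcribes those lists as Python sets.

```python
isolation_forest = {"Amount", "V2", "V3", "V4", "V10", "V12", "V13", "V16",
                    "V18", "V19", "V22", "V24", "V25", "V26", "V27"}
gmm = {"V2", "V3", "V5", "V7", "V8", "V9", "V10", "V11", "V12", "V14",
       "V16", "V17", "V18", "V21", "V26"}
xgboost = {"Amount", "V3", "V4", "V7", "V8", "V9", "V10", "V11", "V12",
           "V13", "V14", "V15", "V17", "V19", "V22"}

# Features ranked in the top 15 under every label availability scenario.
print(sorted(isolation_forest & gmm & xgboost))  # ['V10', 'V12', 'V3']
```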
The second observation we take from Tables 11 and 12 concerns the features that are not important for any of the classifiers. We find that V1, V6, V23, and V28 do not appear among the 15 most important features in any scenario. Hence, we conclude that these features neither indicate anomalies nor form patterns that XGBoost might use to predict the fraud label. This concludes the second set of results we provide in our study.

Conclusions

In this study, we demonstrate a technique for analyzing a dataset that is applicable in all label availability scenarios. SHAP is used in combination with the Isolation Forest, GMM, and XGBoost algorithms, one for each label availability scenario. This enables us to determine which features are important, regardless of whether labels are available for the data. Our results corroborate that, when labels are available, the feature importance assigned by SHAP is consistent with performance. Put another way, as we add features to models in order of their mean absolute SHAP values, the models' performance improves until we have added 10 or 15 features, depending on the classifier.
Our study presents a treatment that can be applied to data in any labeling scenario. Whether the data is unlabeled, only instances of one class are available, or the data is fully labeled, we show that researchers can use SHAP to determine feature importance. Performance analysis of experiments that use SHAP feature importance to build models confirms that SHAP feature importance is meaningful, in the sense that it identifies features that contribute to a model's performance. SHAP values also order the features by their impact on performance. Moreover, even though we are only given that most of the Credit Card Fraud data's features are derived from a principal components analysis of unknown data, our feature analysis technique is nevertheless applicable. We have demonstrated a new technique for feature analysis in all label availability scenarios. Future work includes applying our feature analysis to new application domains.

Acknowledgements

The authors would like to thank the various members of the Data Mining and Machine Learning Laboratory, Florida Atlantic University, for their assistance with the reviews.

Declarations

Not applicable
Not applicable

Competing interests

The authors declare that they have no competing interests.
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.