SCARFF: A scalable framework for streaming credit card fraud detection with Spark
Introduction
The increasing adoption of electronic payments is opening new opportunities for fraudsters and calls for innovative countermeasures to their criminal activities. While fraudsters continuously improve their techniques to emulate genuine behaviour, it is also becoming affordable for the companies managing transactional services to collect data about customers and monitor their behaviour.
The need for automatic systems able to detect frauds from historical data has led to the design of a number of machine learning algorithms for fraud detection [1], [2], [3]. Supervised methods, typically based on binary classification, as well as unsupervised and one-class classification methods [4], [5], have been proposed in the literature. Most of these works address specific issues of fraud detection, notably class imbalance [6], [7], [8] (the percentage of fraudulent transactions is usually very small), concept drift [9], [10], [11], [12], [13], [14] (the distribution of fraudulent transactions might change over time) and stream processing [15], [16].
The authors of this paper studied and analysed the existing literature in detail in previous works [17], [18], [19], [20] and proposed an original solution for accurate classification of fraudulent credit card transactions in imbalanced and non-stationary settings. In particular, we assessed the superiority of undersampling over oversampling techniques in our specific problem, we proposed a sliding window approach to effectively tackle concept drift, and we addressed in [19], [20] an issue often overlooked in the literature: the verification latency, which arises because in real settings the transaction label is obtained only after human investigators have contacted the card holders.
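The two ideas mentioned above, undersampling and a sliding window over recent labelled data, can be illustrated with a minimal sketch. It is not the authors' implementation: the record layout (`day`, `amount`, `fraud`), the `ratio` parameter and the window size are assumptions chosen only for illustration.

```python
import random
from collections import deque


def undersample(transactions, ratio=1.0, seed=0):
    """Keep every fraud and a random subset of the genuine transactions.

    `ratio` is the number of genuine samples kept per fraud
    (a hypothetical parameter, not a value taken from the paper).
    """
    rng = random.Random(seed)
    frauds = [t for t in transactions if t["fraud"]]
    genuine = [t for t in transactions if not t["fraud"]]
    keep = min(len(genuine), int(ratio * len(frauds)))
    return frauds + rng.sample(genuine, keep)


class SlidingWindow:
    """Hold only the most recent `size` labelled daily batches."""

    def __init__(self, size):
        # A bounded deque drops the oldest batch automatically.
        self.batches = deque(maxlen=size)

    def add(self, batch):
        self.batches.append(batch)

    def training_set(self, ratio=1.0):
        merged = [t for batch in self.batches for t in batch]
        return undersample(merged, ratio)


# Toy stream: 3 days, 100 transactions per day, 2 frauds per day.
days = [
    [{"day": d, "amount": 10.0 + i, "fraud": i < 2} for i in range(100)]
    for d in range(3)
]
window = SlidingWindow(size=2)
for batch in days:
    window.add(batch)
train = window.training_set(ratio=1.0)
```

With a window of two days, the oldest day is forgotten, and the training set is rebalanced so that genuine transactions no longer dominate it.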
Though a large number of learning techniques have been proposed, most solutions assume a conventional setting where the entire dataset is resident in memory. As a result, very few studies have made the implementation of these techniques scalable and studied their performance. Moreover, what exists typically relates to domains other than fraud: for instance, [21] and [22] already studied the issue of data imbalance in a Hadoop/MapReduce framework, but only for public and bioinformatics data.
In domains closer to fraud detection, most of the existing works are preliminary or in progress. Hormoz et al. [23] compared a serial implementation with a Hadoop/MapReduce batch processing solution based on Artificial Immune Systems (AIS). The same authors ran some tests on cloud services and provided accuracy measurements [24]. A web service framework for near real-time credit card fraud detection is described, together with some preliminary results, in [25]. A big data architecture based on Flume, Hadoop and HDFS is proposed in [26], but no validation results are provided. An example of application in a non-banking environment is presented in [27], where Chen et al. describe the Hadoop-based fraud detection infrastructure at Alibaba. Other works in progress can be found on several git servers [28], [29], [30], [31], [32], [33], [34].
In this paper we start from the conclusions of our published works [17], [19], [20] and we propose a realistic and scalable implementation of a fraud detection system. SCARFF (Scalable Real-time Fraud Finder) is an open source platform which processes and analyses streaming data in order to return reliable alerts in a nearly real-time setting. These are the main original contributions:
1. The design, implementation and test of an entirely open-source solution integrating state-of-the-art components from the Apache ecosystem. This architecture deals seamlessly with data ingestion, streaming, feature engineering, storage and classification;
2. A scalable learning solution able to provide accurate classification in a context characterized by nonstationarity, class imbalance and verification latency. This is obtained by implementing, in a scalable and distributed manner, an ensemble solution able to deal with concept drift and delayed feedback;
3. The design of a distributed on-line feature engineering functionality, which constantly updates the historical features relevant to better identify fraud patterns. This on-line functionality relies on a MapReduce programming model;
4. An extensive real-world assessment, in terms of scalability, computational performance and precision, carried out by testing the platform on a stream of more than 8 million transactions (corresponding to more than 1.9 million cards) provided by our industrial partner;
5. The virtualisation of the complete workflow proposed in this article as a Docker container, making the workflow fully reproducible.
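To make the MapReduce-style feature engineering of the third contribution more concrete, the following is a minimal single-machine sketch in plain Python. It is not the platform's code: the field names (`card_id`, `amount`), the helper functions and the chosen features (transaction count and average amount per card) are assumptions for illustration only.

```python
def map_tx(tx):
    """Map step: emit (card_id, (count, total_amount)) for one transaction."""
    return tx["card_id"], (1, tx["amount"])


def reduce_pair(a, b):
    """Reduce step: associative and commutative merge of two partial aggregates."""
    return a[0] + b[0], a[1] + b[1]


def update_features(state, batch):
    """Fold a new batch of transactions into the per-card aggregates."""
    new_state = dict(state)
    for key, value in map(map_tx, batch):
        if key in new_state:
            new_state[key] = reduce_pair(new_state[key], value)
        else:
            new_state[key] = value
    return new_state


def features(state, card_id):
    """Derive historical features for one card from its running aggregate."""
    count, total = state[card_id]
    return {"tx_count": count, "avg_amount": total / count}


# Two incoming micro-batches of a toy stream.
batch_1 = [{"card_id": "A", "amount": 20.0}, {"card_id": "B", "amount": 5.0}]
batch_2 = [{"card_id": "A", "amount": 40.0}]

state = update_features({}, batch_1)
state = update_features(state, batch_2)
```

Because the reduce step is associative and commutative, partial aggregates can be computed independently on different partitions and merged in any order, which is what makes this kind of update suitable for a distributed setting.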
The paper is organized as follows. Section 2 introduces the main characteristics of real-world Fraud-Detection Systems. Section 3 gives an overview of the big data tools from the Apache ecosystem that are integrated in our framework. Section 4 details the learning and the streaming functionalities of the platform. Finally, in Section 5 we assess the scalability, computational speed and precision on a real dataset, as a function of allocated resources and incoming transaction rates.
Real-world Fraud Detection Systems
Real-world Fraud-Detection Systems for credit card transactions rely on both automatic and manual operations [20], [35] (Fig. 1). Manual operations are performed offline by human investigators, while automatic components are implemented by algorithms that work in real-time and near real-time configurations. Real-time operations take place before the payment is authorized, while near real-time operations are executed after the payment has occurred.
Real-time processing consists of a set of security
The Big Data ecosystem
This paper proposes a scalable implementation of the DDM learning module which relies on standard tools from the Apache ecosystem, notably Kafka, Spark and Cassandra (Fig. 2). A major advantage of these components is that they handle fault tolerance and task distribution in a similar way.
Online learning and streaming solutions
This section details the functionalities of the proposed framework. Our pipeline implements two main functionalities: a machine learning classification engine and a streaming component. In the first subsection, Section 4.1, the selected machine learning techniques are described. The machine learning engine includes a weighted ensemble of two classifiers. The second subsection, Section 4.2, focuses on the streaming component. Here, more details will be given regarding the data preprocessing (
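As a rough illustration of what a weighted ensemble of two classifiers can look like, the sketch below combines two fraud probabilities with a convex weight and ranks transactions by the combined score. The names `p_feedback` and `p_delayed`, the weight `alpha` and the alert budget `k` are assumptions introduced here; the snippet does not state how the platform actually sets them.

```python
def ensemble_score(p_feedback, p_delayed, alpha=0.5):
    """Convex combination of two classifiers' fraud probabilities.

    `alpha` weights the first classifier and `1 - alpha` the second
    (hypothetical parameter, chosen for illustration).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_feedback + (1.0 - alpha) * p_delayed


def top_alerts(scores, k):
    """Return the ids of the k transactions with the highest scores."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [tx_id for tx_id, _ in ranked[:k]]


# Toy probabilities produced by two (hypothetical) classifiers.
p_f = {"t1": 0.9, "t2": 0.2, "t3": 0.4}
p_d = {"t1": 0.1, "t2": 0.8, "t3": 0.6}

scores = {tx: ensemble_score(p_f[tx], p_d[tx], alpha=0.75) for tx in p_f}
alerts = top_alerts(scores, k=2)
```

Ranking by the combined score and keeping only the top `k` reflects the practical constraint that human investigators can check only a limited number of alerts.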
Experiments
This section assesses the proposed scalable architecture according to different criteria:
- Scalability;
- Impact of internal parametrization on computational performance;
- Classification precision.
Experiments were carried out on a cluster of ten machines, each with 24 cores and 80GB of RAM. Spark was run on top of the cluster resource manager Yarn [59]. For all experiments, each executor was allocated 1GB of RAM, and the driver was allocated 10GB of RAM. Further discussion over memory usage will be
Conclusions and future work
The paper presented SCARFF, an original scalable platform to automatically detect frauds in a near real-time horizon. The most original contribution of this framework is the design and implementation of an open source big data solution for real-world fraud detection and its test on a massive real-world data set. We wish to emphasize that the workflow proposed in our article, while not disclosing the data, has been made fully open source and reproducible by means of a Docker
Acknowledgments
The authors FC, YLB and GB acknowledge the funding of the Brufence project (scalable machine learning for automating defense system) supported by INNOVIRIS (Brussels Institute for the encouragement of scientific research and innovation). ADP acknowledges the funding of the Doctiris (Adaptive real-time machine learning for credit card fraud detection) project supported by INNOVIRIS (Brussels Institute for the encouragement of scientific research and innovation).
References (62)
- et al.
Association rules applied to credit card fraud detection
Expert Syst. Appl.
(2009)
- et al.
A cost-sensitive decision tree approach for fraud detection
Expert Syst. Appl.
(2013)
- et al.
Incremental weighted one-class classifier for mining stationary data streams
J. Comput. Sci.
(2015)
- et al.
Countering the concept-drift problems in big data by an incrementally optimized stream mining model
J. Syst. Soft.
(2015)
- et al.
Learned lessons in credit card fraud detection from a practitioner perspective
Expert Syst. Appl.
(2014)
- et al.
On the use of mapreduce for imbalanced big data using random forest
Inf. Sci.
(2014)
- et al.
Rosefw-rf: the winner algorithm for the ecbdl14 big data competition: an extremely imbalanced big data bioinformatics problem
Knowl. Based Syst.
(2015)
- et al.
Big data based fraud risk management at Alibaba
J. Finance Data Sci.
(2015)
Decision forest: twenty years of research
Inf. Fus.
(2016)
- et al.
Data mining for credit card fraud: a comparative study
Decis. Support Syst.
(2011)
Apate: a novel approach for automated credit card transaction fraud detection using network-based extensions
Decis. Support Syst.
A survey of multiple classifier systems as hybrid systems
Inf. Fus.
Ensemble learning for data stream analysis: a survey
Inf. Fus.
Credit card fraud detection with a neural-network
Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences.
One-class classifiers with incremental learning and forgetting for data streams with concept drift
Soft Comput.
Learning from imbalanced data: open challenges and future directions
Progress Artif. Intell.
When is undersampling effective in unbalanced classification tasks?
Joint European Conference on Machine Learning and Knowledge Discovery in Databases
Calibrating probability with undersampling for unbalanced classification
Symposium Series on Computational Intelligence
A survey on concept drift adaptation
ACM Comput. Surv.
Just-in-time classifiers for recurrent concepts
IEEE Trans. Neural Netw. Learn. Syst.
Novelty detection in data streams
Artif. Intell. Rev.
Anynovel: detection of novel concepts in evolving data streams
Evolving Syst.
Countering the concept-drift problem in big data using iovfdt
2013 IEEE International Congress on Big Data
Scalable distributed stream join processing
Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
Using hddt to avoid instances propagation in unbalanced and evolving data streams
2014 International Joint Conference on Neural Networks (IJCNN)
Credit card fraud detection and concept-drift adaptation with delayed supervised information
International Joint Conference on Neural Networks (IJCNN)
Credit card fraud detection: a realistic modeling and a novel learning strategy
IEEE Trans. Neural Netw. Learn. Syst.
Credit cards fraud detection by negative selection algorithm on hadoop (to reduce the training time)
Information and Knowledge Technology (IKT)
Accuracy evaluation of a credit card fraud detection system on hadoop mapreduce
Information and Knowledge Technology (IKT)
Web service for detecting credit card fraud in near real-time
Proceedings of the 8th International Conference on Security of Information and Networks