SCARFF: A scalable framework for streaming credit card fraud detection with spark

doi:10.1016/j.inffus.2017.09.005

Information Fusion

Volume 41, May 2018, Pages 182-194

https://doi.org/10.1016/j.inffus.2017.09.005

Highlights

• The open source / Big Data nature of the framework.
• The capability to deal with nonstationarity, class imbalance and verification latency.
• The distributed on-line feature engineering functionality included in the framework.
• The scalability, efficiency and accuracy assessed over a big stream of transactions.

Abstract

The expansion of electronic commerce, together with the increasing confidence of customers in electronic payments, makes fraud detection a critical factor. Detecting fraud in a (nearly) real-time setting demands the design and implementation of scalable learning techniques able to ingest and analyse massive amounts of streaming data. Recent advances in analytics and the availability of open source solutions for Big Data storage and processing open new perspectives for the fraud detection field. In this paper we present a Scalable Real-time Fraud Finder (SCARFF), which integrates Big Data tools (Kafka, Spark and Cassandra) with a machine learning approach that deals with imbalance, nonstationarity and feedback latency. Experimental results on a massive dataset of real credit card transactions show that this framework is scalable, efficient and accurate over a big stream of transactions.

Introduction

The increasing adoption of electronic payments is opening new perspectives to fraudsters and calls for innovative countermeasures against their criminal activities. While fraudsters continuously improve their techniques to emulate genuine behaviour, it has also become affordable for the companies managing transactional services to collect data about customers and monitor their behaviour.

The need for automatic systems able to detect fraud from historical data has led to the design of a number of machine learning algorithms for fraud detection [1], [2], [3]. Supervised methods, typically based on binary classification, as well as unsupervised and one-class classification methods [4], [5], have been proposed in the literature. Most of these works address specific issues of fraud detection, notably class imbalance [6], [7], [8] (the percentage of fraudulent transactions is usually very small), concept drift [9], [10], [11], [12], [13], [14] (the distribution of fraudulent transactions might change over time) and stream processing [15], [16].

The authors of this paper analysed the existing literature in detail in previous works [17], [18], [19], [20] and proposed an original solution for accurate classification of fraudulent credit card transactions in imbalanced and non-stationary settings. In particular, we assessed the superiority of undersampling over oversampling techniques in our specific problem, we proposed a sliding window approach to effectively tackle concept drift, and we addressed in [19], [20] an issue often overlooked in the literature: the verification latency due to the fact that, in real settings, the transaction label is obtained only after human investigators have contacted the cardholders.
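As an illustration of the two ideas named above, the following minimal Python sketch pools the most recent batches of transactions in a sliding window and undersamples the genuine class before training. It is not the authors' implementation: the record layout (`fraud`, `day`), the `ratio` parameter and the class name `SlidingWindow` are hypothetical, and verification latency is not modelled here.

```python
import random
from collections import deque


def undersample(transactions, ratio=1.0, seed=0):
    """Keep every fraud and a random subset of genuine transactions.

    `ratio` (a hypothetical parameter) is the number of genuine
    samples kept per fraudulent one.
    """
    rng = random.Random(seed)
    frauds = [t for t in transactions if t["fraud"]]
    genuine = [t for t in transactions if not t["fraud"]]
    keep = min(len(genuine), int(ratio * len(frauds)))
    return frauds + rng.sample(genuine, keep)


class SlidingWindow:
    """Hold only the most recent `size` batches; older ones are dropped."""

    def __init__(self, size):
        # A deque with maxlen discards the oldest batch automatically.
        self.batches = deque(maxlen=size)

    def add(self, batch):
        self.batches.append(batch)

    def training_set(self, ratio=1.0, seed=0):
        # Pool the batches still in the window, then rebalance the classes.
        pooled = [t for batch in self.batches for t in batch]
        return undersample(pooled, ratio, seed)


# Three daily batches of 100 transactions, one fraud per day (toy data).
window = SlidingWindow(size=2)
for day in range(3):
    window.add([{"day": day, "fraud": i == 0} for i in range(100)])
train = window.training_set(ratio=1.0)
```

With a window of two batches, the training set is drawn only from the two most recent days, so a model retrained on it gradually forgets older fraud patterns; the undersampling step keeps the few frauds in the window from being swamped by genuine transactions.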

Though a large number of learning techniques have been proposed, most solutions assume a conventional setting where the entire dataset is resident in memory. As a result, very few studies have implemented these techniques in a scalable manner and studied their performance. Moreover, what exists typically relates to domains other than fraud: for instance, [21] and [22] have already studied the issue of data imbalance in a Hadoop/MapReduce framework¹, but only for public and bioinformatics data.

In domains closer to fraud detection, most of the existing works are preliminary or in progress. Hormozi et al. [23] compared a serial implementation with a Hadoop/MapReduce batch processing solution based on Artificial Immune Systems (AIS). The same authors ran some tests on cloud services and provided accuracy measurements [24]. A web service framework for near real-time credit card fraud detection is described, together with some preliminary results, in [25]. A big data architecture based on Flume, Hadoop and HDFS is proposed in [26], but no validation results are provided. An example of application in a non-banking environment is presented in [27], where Chen et al. describe the Hadoop-based fraud detection infrastructure at Alibaba. Other works in progress can be found on several git servers [28], [29], [30], [31], [32], [33], [34].

In this paper we start from the conclusions of our published works [17], [19], [20] and we propose a realistic and scalable implementation of a fraud detection system. SCARFF (Scalable Real-time Fraud Finder) is an open source platform which processes and analyses streaming data in order to return reliable alerts in a nearly real-time setting. These are the main original contributions:

1. The design, implementation and test of an entirely open-source solution integrating state-of-the-art components from the Apache ecosystem. This architecture deals seamlessly with data ingestion, streaming, feature engineering, storage and classification;
2. A scalable learning solution able to provide accurate classification in a context characterized by nonstationarity, class imbalance and verification latency. This is obtained by implementing, in a scalable and distributed manner, an ensemble solution able to deal with concept drift and delayed feedback;
3. The design of a distributed on-line feature engineering functionality, which constantly updates historical features relevant to better identify fraud patterns. This on-line functionality relies on a MapReduce programming model;
4. An extensive real-world assessment, in terms of scalability, computational performance and precision, carried out by testing the platform on a stream of more than 8 million transactions (corresponding to more than 1.9 million cards) provided by our industrial partner;
5. The virtualisation of the complete workflow proposed in this article as a Docker container, making the workflow fully reproducible.
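The on-line feature engineering of contribution 3 can be pictured with a small, single-process Python sketch of the MapReduce pattern: each transaction is mapped to a (card, partial aggregate) pair, and pairs with the same key are merged by an associative reduction, so the running state can be updated batch by batch. The field names (`card_id`, `amount`), the function names and the choice of features (transaction count and average amount per card) are assumptions made for illustration; the features actually computed by SCARFF are not listed in this text.

```python
def map_phase(tx):
    # Emit a (key, value) pair: card id -> (count, total amount).
    return tx["card_id"], (1, tx["amount"])


def reduce_phase(a, b):
    # Associative and commutative merge, so partial aggregates
    # can be combined in any order and on any worker.
    return a[0] + b[0], a[1] + b[1]


def aggregate(transactions, state=None):
    """Fold a new batch of transactions into the running per-card state."""
    state = dict(state or {})
    for key, value in map(map_phase, transactions):
        state[key] = reduce_phase(state[key], value) if key in state else value
    return state


def features(state):
    """Turn the per-card aggregates into historical features."""
    return {
        card: {"n_tx": n, "avg_amount": total / n}
        for card, (n, total) in state.items()
    }


# Two successive batches of the stream (toy data).
state = aggregate([{"card_id": "A", "amount": 10.0},
                   {"card_id": "B", "amount": 5.0}])
state = aggregate([{"card_id": "A", "amount": 30.0}], state)
card_features = features(state)
```

Because the merge is associative and commutative, the same two functions could in principle be handed to a distributed keyed reduction (for example Spark's `reduceByKey`) without changing the result.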

The paper is organized as follows. Section 2 introduces the main characteristics of real-world Fraud-Detection Systems. Section 3 gives an overview of the big data tools from the Apache ecosystem that are integrated in our framework. Section 4 details the learning and the streaming functionalities of the platform. Finally, in Section 5 we assess the scalability, computational speed and precision on a real dataset, as a function of allocated resources and incoming transaction rates.

Section snippets

Real-world Fraud Detection Systems

Real-world Fraud Detection Systems for credit card transactions rely on both automatic and manual operations [20], [35] (Fig. 1). Manual operations are performed offline by human investigators, while automatic components are implemented by algorithms that work in real-time and near real-time configurations. Real-time operations take place before the payment is authorized, while near real-time operations are executed after the payment has occurred.

Real-time processing consists of a set of security

The Big Data ecosystem

This paper proposes a scalable implementation of the DDM learning module which relies on standard tools from the Apache ecosystem, notably Kafka, Spark and Cassandra (Fig. 2). A major advantage of these components is that they handle fault tolerance and task distribution in a similar way.

Online learning and streaming solutions

This section details the functionalities of the proposed framework. Our pipeline implements two main functionalities: a machine learning classification engine and a streaming component. In the first subsection, Section 4.1, the selected machine learning techniques are described. The machine learning engine includes a weighted ensemble of two classifiers. The second subsection, Section 4.2, focuses on the streaming component. Here, more details will be given regarding the data preprocessing (

Experiments

This section assesses the proposed scalable architecture according to different criteria:

• Scalability;
• Impact of internal parametrization on computational performance;
• Classification precision.

Experiments were carried out on a cluster of ten machines, each with 24 cores and 80GB of RAM. Spark was run on top of the cluster resource manager Yarn [59]. For all experiments, each executor was allocated 1GB of RAM, and the driver was allocated 10GB of RAM. Further discussion over memory usage will be

Conclusions and future work

The paper presented SCARFF, an original scalable platform to automatically detect frauds in a near real-time horizon. The most original contribution of this framework is the design and the implementation of an open source big data solution for Real-world Fraud Detection and its test on a massive real-world data set. We wish to emphasize that the workflow proposed in our article, while not disclosing the data, has been made fully open source and reproducible by means of a Docker⁸

Acknowledgments

The authors FC, YLB and GB acknowledge the funding of the Brufence project (scalable machine learning for automating defense system) supported by INNOVIRIS (Brussels Institute for the encouragement of scientific research and innovation). ADP acknowledges the funding of the Doctiris (Adaptive real-time machine learning for credit card fraud detection) project supported by INNOVIRIS (Brussels Institute for the encouragement of scientific research and innovation).

References (62)

D. Sánchez et al. Association rules applied to credit card fraud detection. Expert Syst. Appl. (2009)
Y. Sahin et al. A cost-sensitive decision tree approach for fraud detection. Expert Syst. Appl. (2013)
B. Krawczyk et al. Incremental weighted one-class classifier for mining stationary data streams. J. Comput. Sci. (2015)
H. Yang et al. Countering the concept-drift problems in big data by an incrementally optimized stream mining model. J. Syst. Soft. (2015)
A. Dal Pozzolo et al. Learned lessons in credit card fraud detection from a practitioner perspective. Expert Syst. Appl. (2014)
S. del Río et al. On the use of MapReduce for imbalanced big data using random forest. Inf. Sci. (2014)
I. Triguero et al. ROSEFW-RF: the winner algorithm for the ECBDL14 big data competition: an extremely imbalanced big data bioinformatics problem. Knowl. Based Syst. (2015)
J. Chen et al. Big data based fraud risk management at Alibaba. J. Finance Data Sci. (2015)
L. Rokach. Decision forest: twenty years of research. Inf. Fus. (2016)
S. Bhattacharyya et al. Data mining for credit card fraud: a comparative study. Decis. Support Syst. (2011)
V. Van Vlasselaer et al. Apate: a novel approach for automated credit card transaction fraud detection using network-based extensions. Decis. Support Syst. (2015)
M. Woźniak et al. A survey of multiple classifier systems as hybrid systems. Inf. Fus. (2014)
B. Krawczyk et al. Ensemble learning for data stream analysis: a survey. Inf. Fus. (2017)
S. Ghosh et al. Credit card fraud detection with a neural-network. Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences (1994)
B. Krawczyk et al. One-class classifiers with incremental learning and forgetting for data streams with concept drift. Soft Comput. (2015)
B. Krawczyk. Learning from imbalanced data: open challenges and future directions. Progress Artif. Intell. (2016)
A. Dal Pozzolo et al. When is undersampling effective in unbalanced classification tasks? Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2015)
A. Dal Pozzolo et al. Calibrating probability with undersampling for unbalanced classification. Symposium Series on Computational Intelligence (2015)
J. Gama et al. A survey on concept drift adaptation. ACM Comput. Surv. (2014)
C. Alippi et al. Just-in-time classifiers for recurrent concepts. IEEE Trans. Neural Netw. Learn. Syst. (2013)
E.R. Faria et al. Novelty detection in data streams. Artif. Intell. Rev. (2016)
Z.S. Abdallah et al. AnyNovel: detection of novel concepts in evolving data streams. Evolving Syst. (2016)
H. Yang et al. Countering the concept-drift problem in big data using iOVFDT. 2013 IEEE International Congress on Big Data (2013)
T. Akidau, A. Balikov, K. Bekiroglu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, S. Whittle,...
Q. Lin et al. Scalable distributed stream join processing. Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (2015)
A. Dal Pozzolo et al. Using HDDT to avoid instances propagation in unbalanced and evolving data streams. 2014 International Joint Conference on Neural Networks (IJCNN) (2014)
A. Dal Pozzolo et al. Credit card fraud detection and concept-drift adaptation with delayed supervised information. International Joint Conference on Neural Networks (IJCNN) (2015)
A. Dal Pozzolo et al. Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE Trans. Neural Netw. Learn. Syst. (2017)
H. Hormozi et al. Credit cards fraud detection by negative selection algorithm on Hadoop (to reduce the training time). Information and Knowledge Technology (IKT) (2013)
E. Hormozi et al. Accuracy evaluation of a credit card fraud detection system on Hadoop MapReduce. Information and Knowledge Technology (IKT) (2013)
A. Tselykh et al. Web service for detecting credit card fraud in near real-time. Proceedings of the 8th International Conference on Security of Information and Networks (2015)

