Elsevier

Information Fusion

Volume 41, May 2018, Pages 182-194

SCARFF: A scalable framework for streaming credit card fraud detection with spark

https://doi.org/10.1016/j.inffus.2017.09.005

Highlights

  • The open source / Big Data nature of the framework.

  • The capability to deal with nonstationarity, class imbalance and verification latency.

  • The distributed on-line feature engineering functionality included in the framework.

  • The scalability, efficiency and accuracy assessed over a big stream of transactions.

Abstract

The expansion of electronic commerce, together with the increasing confidence of customers in electronic payments, makes fraud detection a critical factor. Detecting fraud in a (nearly) real-time setting demands the design and implementation of scalable learning techniques able to ingest and analyse massive amounts of streaming data. Recent advances in analytics and the availability of open source solutions for Big Data storage and processing open new perspectives for the fraud detection field. In this paper we present a Scalable Real-time Fraud Finder (SCARFF), which integrates Big Data tools (Kafka, Spark and Cassandra) with a machine learning approach that deals with imbalance, nonstationarity and feedback latency. Experimental results on a massive dataset of real credit card transactions show that this framework is scalable, efficient and accurate over a big stream of transactions.

Introduction

The increasing adoption of electronic payments is opening new perspectives to fraudsters and calls for innovative countermeasures to their criminal activities. While fraudsters continuously improve their techniques to emulate genuine behaviour, it has also become affordable for the companies managing transactional services to collect data about customers and monitor their behaviour.

The need for automatic systems able to detect fraud from historical data has led to the design of a number of machine learning algorithms for fraud detection [1], [2], [3]. Supervised methods, typically based on binary classification, as well as unsupervised and one-class classification methods [4], [5], have been proposed in the literature. Most of these works address specific issues of fraud detection, notably class imbalance [6], [7], [8] (the percentage of fraudulent transactions is usually very small), concept drift [9], [10], [11], [12], [13], [14] (the distribution of fraudulent transactions might change over time) and stream processing [15], [16].

The authors of this paper studied and analysed the existing literature in detail in previous works [17], [18], [19], [20] and proposed an original solution for accurate classification of fraudulent credit card transactions in imbalanced and non-stationary settings. In particular, we assessed the superiority of undersampling over oversampling techniques in our specific problem, we proposed a sliding window approach to tackle concept drift effectively, and we addressed in [19], [20] an issue often overlooked in the literature: verification latency, which arises because in real settings the transaction label is obtained only after human investigators have contacted the card holders.
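To make the combination of undersampling and a sliding window concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes transactions are dictionaries with a binary `label` field (1 for fraud), that data arrive in chunks (for instance one per day), and that the window size and the `ratio` parameter are illustrative choices; all names are hypothetical.

```python
import random
from collections import deque


def undersample(transactions, ratio=1.0, seed=0):
    """Keep every fraud and a random subset of genuine transactions.

    `ratio` is the desired number of genuine samples per fraud
    (an illustrative parameter, not taken from the paper).
    """
    frauds = [t for t in transactions if t["label"] == 1]
    genuine = [t for t in transactions if t["label"] == 0]
    n_keep = min(len(genuine), int(ratio * len(frauds)))
    rng = random.Random(seed)
    return frauds + rng.sample(genuine, n_keep)


class SlidingWindow:
    """Hold only the most recent `size` chunks of labelled transactions."""

    def __init__(self, size):
        # A bounded deque drops the oldest chunk automatically,
        # so outdated concepts leave the training set.
        self.chunks = deque(maxlen=size)

    def add_chunk(self, chunk):
        self.chunks.append(chunk)

    def training_set(self, ratio=1.0, seed=0):
        """Merge the chunks in the window and rebalance the classes."""
        merged = [t for chunk in self.chunks for t in chunk]
        return undersample(merged, ratio=ratio, seed=seed)
```

In this sketch a classifier would be retrained on `training_set()` each time a new chunk arrives; the verification latency discussed above would additionally delay when a chunk's labels become available.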

Though a large number of learning techniques have been proposed, most solutions assume a conventional setting where the entire dataset is resident in memory. As a result, very few studies have made the implementation of these techniques scalable and studied their performance. Moreover, what exists typically relates to domains other than fraud: for instance, [21] and [22] studied the issue of data imbalance in a Hadoop/MapReduce framework, but only for public and bioinformatics data.

In domains closer to fraud detection, most of the existing works are preliminary or in progress. Hormozi et al. [23] compared a serial implementation with a Hadoop/MapReduce batch processing solution based on Artificial Immune Systems (AIS). The same authors ran tests on cloud services and provided accuracy measurements [24]. A web service framework for near real-time credit card fraud detection is described, together with some preliminary results, in [25]. A big data architecture based on Flume, Hadoop and HDFS is proposed in [26], but no validation results are provided. An example of application in a non-banking environment is presented in [27], where Chen et al. describe the Hadoop-based fraud detection infrastructure at Alibaba. Other works in progress can be found on several git servers [28], [29], [30], [31], [32], [33], [34].

In this paper we start from the conclusions of our published works [17], [19], [20] and we propose a realistic and scalable implementation of a fraud detection system. SCARFF (Scalable Real-time Fraud Finder) is an open source platform which processes and analyses streaming data in order to return reliable alerts in a nearly real-time setting. These are the main original contributions:

  1. The design, implementation and test of an entirely open-source solution integrating state-of-the-art components from the Apache ecosystem. This architecture deals seamlessly with data ingestion, streaming, feature engineering, storage and classification;

  2. A scalable learning solution able to provide accurate classification in a context characterized by nonstationarity, class imbalance and verification latency. This is obtained by implementing in a scalable and distributed manner an ensemble solution able to deal with concept drift and delayed feedback;

  3. The design of a distributed on-line feature engineering functionality, which constantly updates historical features relevant to better identify fraud patterns. This on-line functionality relies on a MapReduce programming model;

  4. An extensive real-world assessment, in terms of scalability, computational performance and precision, carried out by testing the platform on a stream of more than 8 million transactions (corresponding to more than 1.9 million cards) provided by our industrial partner;

  5. The virtualisation of the complete workflow proposed in this article as a Docker container, making the workflow fully reproducible.
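As an illustration of how on-line feature engineering can follow a MapReduce programming model, here is a minimal single-process Python sketch. It is an assumption-laden toy, not the framework's code: the field names `card_id` and `amount`, the chosen features (transaction count and average amount per card) and all function names are hypothetical, and a real deployment would run the map and reduce steps on a distributed engine.

```python
from collections import defaultdict
from functools import reduce


def map_transaction(tx):
    """Map step: emit (card_id, (count, total_amount)) for one transaction."""
    return tx["card_id"], (1, tx["amount"])


def combine(a, b):
    """Reduce step: merge two partial aggregates for the same card."""
    return a[0] + b[0], a[1] + b[1]


def update_features(state, batch):
    """Fold a new micro-batch into the running per-card aggregates."""
    grouped = defaultdict(list)
    for key, value in map(map_transaction, batch):
        grouped[key].append(value)
    for card_id, values in grouped.items():
        partial = reduce(combine, values)
        state[card_id] = combine(state.get(card_id, (0, 0.0)), partial)
    return state


def features(state, card_id):
    """Derive historical features for one card from its aggregate."""
    count, total = state[card_id]
    return {"n_tx": count, "avg_amount": total / count}
```

Because `combine` is associative and commutative, partial aggregates can be computed on separate partitions and merged in any order, which is what makes this style of update easy to distribute.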

The paper is organized as follows. Section 2 introduces the main characteristics of real-world Fraud-Detection Systems. Section 3 gives an overview of the big data tools from the Apache ecosystem that are integrated in our framework. Section 4 details the learning and the streaming functionalities of the platform. Finally, in Section 5 we assess the scalability, computational speed and precision on a real dataset, as a function of allocated resources and incoming transaction rates.

Section snippets

Real-world Fraud Detection Systems

Real-world Fraud-Detection Systems for credit card transactions rely on both automatic and manual operations [20], [35] (Fig. 1). Manual operations are performed offline by human investigators, while automatic components are implemented by algorithms that work in real-time and near real-time configurations. Real-time operations take place before the payment is authorized, while near real-time operations are executed after the payment has occurred.

Real-time processing consists of a set of security

The Big Data ecosystem

This paper proposes a scalable implementation of the DDM learning module which relies on standard tools from the Apache ecosystem, notably Kafka, Spark and Cassandra (Fig. 2). A major advantage of these components is that they handle fault tolerance and task distribution in a similar way.

Online learning and streaming solutions

This section details the functionalities of the proposed framework. Our pipeline implements two main functionalities: a machine learning classification engine and a streaming component. In the first subsection, Section 4.1, the selected machine learning techniques are described. The machine learning engine includes a weighted ensemble of two classifiers. The second subsection, Section 4.2, focuses on the streaming component. Here, more details will be given regarding the data preprocessing (
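A weighted ensemble of two classifiers can be sketched as a convex combination of their fraud probabilities. The Python sketch below is illustrative only: it assumes each classifier outputs a probability in [0, 1], that one is trained on recent investigator feedback and the other on older, delayed labels (a split not spelled out in the text above), and that the weight `alpha` and the alert budget `k` are free parameters; all names are hypothetical.

```python
def ensemble_score(p_feedback, p_delayed, alpha=0.5):
    """Combine two fraud probabilities with a convex weight `alpha`."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * p_feedback + (1.0 - alpha) * p_delayed


def top_alerts(scored_transactions, k):
    """Return the k (transaction_id, score) pairs with the highest score."""
    ranked = sorted(scored_transactions, key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

A possible usage is to score every incoming transaction with `ensemble_score` and pass only the `top_alerts` to investigators, since the number of transactions that can be checked manually is limited.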

Experiments

This section assesses the proposed scalable architecture according to different criteria:

  • Scalability;

  • Impact of internal parametrization on computational performance;

  • Classification precision.

Experiments were carried out on a cluster of ten machines, each with 24 cores and 80GB of RAM. Spark was run on top of the cluster resource manager Yarn [59]. For all experiments, each executor was allocated 1GB of RAM, and the driver was allocated 10GB of RAM. Further discussion over memory usage will be

Conclusions and future work

The paper presented SCARFF, an original scalable platform to automatically detect fraud in a near real-time horizon. The most original contribution of this framework is the design and implementation of an open source big data solution for real-world fraud detection and its test on a massive real-world data set. We wish to emphasize that the workflow proposed in our article, while not disclosing the data, has been made fully open source and reproducible by means of a Docker

Acknowledgments

The authors FC, YLB and GB acknowledge the funding of the Brufence project (scalable machine learning for automating defense system) supported by INNOVIRIS (Brussels Institute for the encouragement of scientific research and innovation). ADP acknowledges the funding of the Doctiris (Adaptive real-time machine learning for credit card fraud detection) project supported by INNOVIRIS (Brussels Institute for the encouragement of scientific research and innovation).

References (62)

  • V. Van Vlasselaer et al. Apate: a novel approach for automated credit card transaction fraud detection using network-based extensions. Decis. Support Syst. (2015)

  • M. Woźniak et al. A survey of multiple classifier systems as hybrid systems. Inf. Fus. (2014)

  • B. Krawczyk et al. Ensemble learning for data stream analysis: a survey. Inf. Fus. (2017)

  • S. Ghosh et al. Credit card fraud detection with a neural-network. Proceedings of the Twenty-Seventh Hawaii International Conference on System Sciences (1994)

  • B. Krawczyk et al. One-class classifiers with incremental learning and forgetting for data streams with concept drift. Soft Comput. (2015)

  • B. Krawczyk. Learning from imbalanced data: open challenges and future directions. Progress Artif. Intell. (2016)

  • A. Dal Pozzolo et al. When is undersampling effective in unbalanced classification tasks? Joint European Conference on Machine Learning and Knowledge Discovery in Databases (2015)

  • A. Dal Pozzolo et al. Calibrating probability with undersampling for unbalanced classification. Symposium Series on Computational Intelligence (2015)

  • J. Gama et al. A survey on concept drift adaptation. ACM Comput. Surv. (2014)

  • C. Alippi et al. Just-in-time classifiers for recurrent concepts. IEEE Trans. Neural Netw. Learn. Syst. (2013)

  • E.R. Faria et al. Novelty detection in data streams. Artif. Intell. Rev. (2016)

  • Z.S. Abdallah et al. AnyNovel: detection of novel concepts in evolving data streams. Evolving Syst. (2016)

  • H. Yang et al. Countering the concept-drift problem in big data using iOVFDT. 2013 IEEE International Congress on Big Data (2013)

  • T. Akidau, A. Balikov, K. Bekiroglu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, S. Whittle,...

  • Q. Lin et al. Scalable distributed stream join processing. Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (2015)

  • A. Dal Pozzolo et al. Using HDDT to avoid instances propagation in unbalanced and evolving data streams. 2014 International Joint Conference on Neural Networks (IJCNN) (2014)

  • A. Dal Pozzolo et al. Credit card fraud detection and concept-drift adaptation with delayed supervised information. International Joint Conference on Neural Networks (IJCNN) (2015)

  • A. Dal Pozzolo et al. Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE Trans. Neural Netw. Learn. Syst. (2017)

  • H. Hormozi et al. Credit cards fraud detection by negative selection algorithm on Hadoop (to reduce the training time). Information and Knowledge Technology (IKT) (2013)

  • E. Hormozi et al. Accuracy evaluation of a credit card fraud detection system on Hadoop MapReduce. Information and Knowledge Technology (IKT) (2013)

  • A. Tselykh et al. Web service for detecting credit card fraud in near real-time. Proceedings of the 8th International Conference on Security of Information and Networks (2015)