A non-canonical hybrid metaheuristic approach to adaptive data stream classification

https://doi.org/10.1016/j.future.2019.07.067Get rights and content

Highlights

  • A three-layered ensemble technique for non-stationary data stream classification task is proposed.

  • The framework is based on an evolutionary algorithm (RD) and a bio-inspired optimisation technique (PSO).

  • Five different variations of the proposed framework is provided.

  • The proposed framework is compared to other state-of-the-art methods using standard criteria.

Abstract

Data stream classification techniques have been playing an important role in big data analytics recently due to their diverse applications (e.g. fraud and intrusion detection, forecasting and healthcare monitoring systems) and the growing number of real-world data stream generators (e.g. IoT devices and sensors, websites and social network feeds). Streaming data is often prone to evolution over time. In this context, the main challenge for computational models is to adapt to changes, known as concept drifts, using data mining and optimisation techniques. We present a novel ensemble technique called RED-PSO that seamlessly adapts to different concept drifts in non-stationary data stream classification tasks. RED-PSO is based on a three-layer architecture to produce classification types of different size, each created by randomly selecting a certain percentage of features from a pool of features of the target data stream. An evolutionary algorithm, namely, Replicator Dynamics (RD), is used to seamlessly adapt to different concept drifts; it allows good performing types to grow and poor performing ones to shrink in size. In addition, the selected feature combinations in all classification types are optimised using a non-canonical version of the Particle Swarm Optimisation (PSO) technique for each layer individually. PSO allows the types in each layer to go towards local (within the same type) and global (in all types) optimums with a specified velocity. A set of experiments are conducted to compare the performance of the proposed method to state-of-the-art algorithms using real-world and synthetic data streams in immediate and delayed prequential evaluation settings. The results show a favourable performance of our method in different environments.

Introduction

As the digital world advances, the number of data streams produced by various sources such as IoT devices and sensors and social media networks grows rapidly. Such streams are usually characterised by high velocity and changes in data distributions over time. Therefore, a significant number of recent research are concentrated on data stream classification challenges specially in non-stationary data streams [1]. The main challenge in this context is adaptation to different concept drifts; that is, when the data distribution evolve over time in unforeseen ways.

Various forms of concept drifts can be categorised into four generic groups of sudden (abrupt), incremental, gradual and recurrent. In sudden concept drifts, the distribution of data is suddenly replaced with a new distribution. An incremental concept drift is a drift to the distribution of data when it goes through various, unstable distributions before being stable at a specific data distribution. In the gradual concept drifts, the ratio of a new distribution of incoming data raises, and the ratio of the data from the former distribution drops over time. In a recurrent concept drift, the same old distribution of data reappears after passing some time with different distribution/s.

Ensemble learning technique is a machine learning approach, where multiple classifiers are created and combined using a voting mechanism to establish a single class as the output of a data instance. Ensemble techniques have demonstrated superiority over other classification techniques for stream classification tasks in non-stationary environments [2], [3]. This is due to their flexibility in training and updating classifiers. Furthermore, ensemble techniques offer a solution to keep the effects of old and new instances using multiple classifiers since each instance in a data stream is processed only once on the arrival in most cases.

An ideal approach to non-stationary data stream classification should satisfy the following objective with a multifold features: having the least possible misclassification rate while minimising the computational complexity and quickly adapting to possible concept drifts. However, there is a lack of comprehensive approaches offering these features in a single framework; the majority of the existing ensemble methods are focused on either one or two of them, or a specific type of data streams.

The aim of this study is to propose a novel method using a modified bio-inspired algorithm, namely, Particle Swarm Optimisation (PSO), to comply with the aforementioned characteristics of an ideal approach to cope with evolving data streams in classification challenges. The proposed technique comprises a three-layer architecture. Each layer is initially assigned some predefined classification types that randomly created from a pool of features of the target data stream. We used an evolutionary algorithm called Replicator Dynamics (RD) to seamlessly cope with smooth (i.e. gradual or incremental) concept drifts; it allows the classification types with good performance to grow and those with poor performance to shrink in size. The combination of features in all types is then optimised using a modified version of PSO for each layer individually. This helps the method to cope with more sudden (i.e. recurring or abrupt) concept drifts. PSO allows the types in each layer to go towards local (within the same type) and global (in all types) optimums with a specified velocity.

The proposed method in this paper is evaluated over 5 real-world data streams and 4 synthetic (artificial) data stream generators. Different types of concept drifts are added to synthetic datasets in order to examine how the proposed method adapts to different concept drifts compared to state-of-the-art methods. As the process of labelling instances in the real-world datasets is different depending on the application, real-world data streams can be categorised into two different labelling mechanisms of complete labelling and partial labelling. Complete labelling is done where the true labels of respective instances are completely accessible either instantly or after a delay with no/small extra overhead to the system. This is the case in most of the forecasting tasks such as weather forecasting, stock market analysis, forest monitoring, airline predictions, and bill estimation. Partial labelling is done where the real labels are accessible with an extra overhead to the system via a third party (usually human). This is the case where the real labels are retrieved after an analysis of related data in the stream (e.g. medical diagnosis in health data, anomaly/fraud detection in credit card transactions, etc.). The main goal of the proposed framework is to deal with complete labelling data streams. Hence in the experiments part of this paper, it is assumed that the true labels of instances are completely accessible either instantly or after a specified delay. In this regards, each data-set in the experiments are processed twice; once in the delayed setting (where the actual labels of instances are accessible after a specified time) as well as once in the immediate setting (where the actual labels are accessible instantly).

The rest of this paper is organised as follows: Section 2 overviews related research; Section 3 details our proposed method; Section 4 outlines the experimental setup and results, and compares the presented RED-PSO method to other state-of-the-art methods; Section 5 overviews conclusion and outlines the possible future work.

Section snippets

Related work

A majority of the existing data stream classification algorithms for evolving environments use ensemble learning techniques [1], [2], [3], [4] due to their flexibility in updating the classification model (i.e. adding, removing and retraining their constituent classifiers), and consequently the fact that such methods are more trustworthy than single-classifier methods, especially in non-stationary environments, thanks to their experimentally validated higher accuracy.

In general, many of the

RED-PSO framework

In this section, we present RED-PSO, a novel ensemble learning framework for non-stationary classification using the RD and PSO techniques. We train an ensemble of classifiers comprising three categories (layers) of different classification types generated by randomly selecting features (subspaces) from the set of attributes of the target dataset. For each layer of the proposed ensemble, the number of classification types and features in each one of them is chosen so that the higher the number

Experimental study

A collection of experiments is conducted using five real-world data streams and four synthetic data stream generators to evaluate the proposed RED-PSO framework and compare it with the existing state-of-the-art methods that have shown good performance and liable results [13], [20], including Dynamic Weighted Majority (DWM) [9], Online Accuracy Updated Ensemble (OAUE)[13], OSBoost [8], Leveraging Bag (LevBag) [19], Adaptive Random Forest (ARF) [20], Learn++.NSE [27] and Adaptable Diversity-based

Conclusion and future work

We proposed a novel ensemble learning framework called RED-PSO to seamlessly adapt to different concept drifts in non-stationary data stream classification tasks. RED-PSO framework is based on a three-layer architecture to produce classification types of different size that are created by randomly selecting features from a pool of features of the target data stream. An evolutionary algorithm, Replicator Dynamics (RD), is used to seamlessly adapt to different concept drifts. Furthermore, a

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Hossein Ghomeshi is a Ph.D. student and a Visiting Lecturer in the Cloud Computing department at Birmingham City University. His research interests are in the area of data analytics, big data and machine learning. Hossein holds a M.Sc. degree in computer engineering from the University of Tehran, Iran, where his thesis looked into developing a framework for tag-based social image search based on textual and visual features. He also holds a B.Sc. computer engineering degree from Shahid Chamran

References (38)

  • KolterJ.Z. et al.

    Dynamic weighted majority: An ensemble method for drifting concepts

    J. Mach. Learn. Res.

    (2007)
  • BrzezinskiD. et al.

    Reacting to different types of concept drift: The accuracy updated ensemble algorithm

    IEEE Trans. Neural Netw. Learn. Syst.

    (2014)
  • WangH. et al.

    Mining concept-drifting data streams using ensemble classifiers

  • DomingosP. et al.

    Mining high-speed data streams

  • JaberG.

    An approach for online learning in the presence of concept change

    (2013)
  • de Carvalho SantosS.G.T. et al.

    Speeding up recovery from concept drifts

  • BifetA. et al.

    Learning from time-changing data with adaptive windowing

  • ChuF. et al.

    Fast and light boosting for adaptive mining of data streams

  • BifetA. et al.

    New ensemble methods for evolving data streams

  • Cited by (20)

    • Semi-supervised classification on data streams with recurring concept drift and concept evolution

      2021, Knowledge-Based Systems
      Citation Excerpt :

      In the real world, many applications such as online shopping, stock market and medical health generate a huge amount of data in the form of stream from a variety of sources at an ever higher speed, called data streams. Although these streams are integral part of our daily lives and have been extensively studied in literatures [1,2], it is still challenging for us to deal with them in an effective and efficient method. This is because the streaming data pose a lot of inherent properties, including continuous orderly arrival, infinite length, concept drift and evolution, limited labeled data, and so on.

    View all citing articles on Scopus

    Hossein Ghomeshi is a Ph.D. student and a Visiting Lecturer in the Cloud Computing department at Birmingham City University. His research interests are in the area of data analytics, big data and machine learning. Hossein holds a M.Sc. degree in computer engineering from the University of Tehran, Iran, where his thesis looked into developing a framework for tag-based social image search based on textual and visual features. He also holds a B.Sc. computer engineering degree from Shahid Chamran University of Ahvaz, also in Iran. Prior to his Ph.D. study, Hossein has worked as a senior technical engineer in the telecommunication industry.

    Mohamed Gaber is a Professor in Data Analytics at the School of Computing and Digital Technology, Birmingham City University. Mohamed received his Ph.D. from Monash University, Australia. He has published over 200 papers, co-authored 3 monograph-style books, and edited/co-edited 6 books on data mining and knowledge discovery. His work has attracted well over four thousand citations, with an h-index of 32. Mohamed has served in the program committees of major conferences related to data mining, including ICDM, PAKDD, ECML/PKDD and ICML. He has also co-chaired numerous scientific events on various data mining topics. Professor Gaber is recognised as a Fellow of the British Higher Education Academy (HEA). He is also a member of the International Panel of Expert Advisers for the Australasian Data Mining Conferences. In 2007, he was awarded the CSIRO teamwork award.

    Yevgeniya Kovalchuk is a Senior Lecturer in Computer Science at the School of Computing and Digital Technology and its Cyber Security Research Group, Birmingham City University. She received an M.Sc. in Economic Cybernetics from the National Technical University of Ukraine and Ph.D. in Computer Science from the University of Essex. During her professional career, she worked in both pure industrial and academic environments, as well as on projects connecting the two. She has a strong track record of applying computer and data science to solve problems across a wide range of business areas, including banking, insurance, logistics, healthcare, sport and entertainment, among many others.

    View full text