Top

2019 | Book

Read chapter Read first chapter

Learning from Data Streams in Evolving Environments

Methods and Applications

Editor: Dr. Moamar Sayed-Mouchaweh

Publisher: Springer International Publishing

Book Series : Studies in Big Data

Part of: Springer Professional "Wirtschaft+Technik" , Springer Professional "Technik" , Springer Professional "Wirtschaft"

About this book

This edited book covers recent advances of techniques, methods and tools treating the problem of learning from data streams generated by evolving non-stationary processes. The goal is to discuss and overview the advanced techniques, methods and tools that are dedicated to manage, exploit and interpret data streams in non-stationary environments. The book includes the required notions, definitions, and background to understand the problem of learning from data streams in non-stationary environments and synthesizes the state-of-the-art in the domain, discussing advanced aspects and concepts and presenting open problems and future challenges in this field.

Provides multiple examples to facilitate the understanding data streams in non-stationary environments;

Presents several application cases to show how the methods solve different real world problems;

Discusses the links between methods to help stimulate new research and application directions.

Frontmatter

Introduction

Abstract

This introductory chapter intends to present the challenges related to the problem of learning from data streams in nonstationary environments. It focuses on the major challenges related to the learning with concept drift, learning with concept evolution, and learning with both concept drift and concept evolution. Then, it classifies the different methods and techniques of the state of the art that are used to address these challenges. This categorization is achieved according to how the learning is performed, how the data streams are processed, and how the changes are detected and integrated into the model. Finally, this chapter ends with a compact summary of the contents of this book by providing a paragraph about each of the contributions and how the learning process from data streams is performed (single learner or ensemble learners, centralized processing or distributed computing, classification, regression or clustering, window-based or sequential-based, applications targeted, etc.).

Moamar Sayed-Mouchaweh

Transfer Learning in Non-stationary Environments

Abstract

The fields of transfer learning and learning in non-stationary environments are closely related. Both look into the problem of training and test data that come from different probability distributions. However, these two fields have evolved separately. Transfer learning enables knowledge to be transferred between different domains or tasks in order to improve predictive performance in a target domain and task. It has no notion of continuing time. Learning in non-stationary environments concerns with updating learning models over time in such a way to deal with changes that the underlying probability distribution of the problem may suffer. It assumes that training examples arrive in the form of data streams. Very little work has investigated the connections between these two fields. This chapter provides a discussion of such connections and explains two existing approaches that perform online transfer learning in non-stationary environments. A brief summary of the results achieved by these approaches in the literature is presented, highlighting the benefits of integrating these two fields. As the first work to provide a detailed discussion of the relationship between transfer learning and learning in non-stationary environments, this chapter opens up the path for future research in the emerging area of transfer learning in non-stationary environments.

Leandro L. Minku

A New Combination of Diversity Techniques in Ensemble Classifiers for Handling Complex Concept Drift

Abstract

Recent advances in Computational Intelligent Systems have focused on addressing complex problems related to the dynamicity of the environments. Generally in dynamic environments, data are presented as streams that may evolve over time and this is known by concept drift. Handling concept drift through ensemble classifiers has received a great interest in last decades. The success of these ensemble methods relies on their diversity. Accordingly, various diversity techniques can be used like block-based data, weighting-data or filtering-data. Each of these diversity techniques is efficient to handle certain characteristics of drift. However, when the drift is complex, they fail to efficiently handle it. Complex drifts may present a mixture of several characteristics (speed, severity, influence zones in the feature space, etc.) which may vary over time. In this case, drift handling is more complicated and requires new detection and updating tools. For this purpose, a new ensemble approach, namely EnsembleEDIST2, is presented. It combines the three diversity techniques in order to take benefit from their advantages and outperform their limits. Additionally, it makes use of EDIST2, as drift detection mechanism, in order to monitor the ensemble’s performance and detect changes. EnsembleEDIST2 was tested through different scenarios of complex drift generated from synthetic and real datasets. This diversity combination allows EnsembleEDIST2 to outperform similar ensemble approaches in terms of accuracy rate, and present stable behaviors in handling different scenarios of complex drift.

Imen Khamassi, Moamar Sayed-Mouchaweh, Moez Hammami, Khaled Ghédira

Analyzing and Clustering Pareto-Optimal Objects in Data Streams

Abstract

Stream data analysis is a high relevant topic in various academic and business fields. Users want to analyze data streams to extract information in order to learn from this ever-growing amount of data. Although many approaches exist for effective processing of data streams, learning from streams requires new algorithms and methods to be able to learn under the evolving and unbounded data. In this chapter we focus on the task of preference-based stream processing and clustering to analyze data streams. We show that this method is a real alternative to the state-of-the-art approaches.

Markus Endres, Johannes Kastner, Lena Rudenko

Error-Bounded Approximation of Data Stream: Methods and Theories

Abstract

Since the development of sensor network and Internet of Things, the volume of data is rapidly increasing and the streaming data has attracted much attention recently. To efficiently process and explore data streams, the compact data representation is playing an important role, since the data approximations other than the original data items are usually applied in many stream mining tasks, such as clustering, classification, and correlation analysis. In this chapter, we focus on the maximum error-bounded approximation of data stream, which represents the streaming data with constrained approximation error on each data point. There are two criteria for the approximation solution: self-adaption over time for varied error bound and real-time processing. We reviewed the existing data approximation techniques and summarized some essential theories such as optimization guarantee. Two optimal linear-time algorithms are introduced to construct error-bounded piecewise linear representation for data stream. One generates the line segments by data convex analysis, and the other one is based on the transformed space, which can be extended to a general model. We theoretically analyzed and compared these two different spaces, and proved the theoretical equivalence between them, as well as the two algorithms.

Qing Xie, Chaoyi Pang, Xiaofang Zhou, Xiangliang Zhang, Ke Deng

Ensemble Dynamics in Non-stationary Data Stream Classification

Abstract

Data stream classification is the process of learning supervised models from continuous labelled examples in the form of an infinite stream that, in most cases, can be read only once by the data mining algorithm. One of the most challenging problems in this process is how to learn such models in non-stationary environments, where the data/class distribution evolves over time. This phenomenon is called concept drift. Ensemble learning techniques have been proven effective adapting to concept drifts. Ensemble learning is the process of learning a number of classifiers, and combining them to predict incoming data using a combination rule. These techniques should incrementally process and learn from existing data in a limited memory and time to predict incoming instances and also to cope with different types of concept drifts including incremental, gradual, abrupt or recurring. A sheer number of applications can benefit from data stream classification from non-stationary data, including weather forecasting, stock market analysis, spam filtering systems, credit card fraud detection, traffic monitoring, sensor data analysis in Internet of Things (IoT) networks, to mention a few. Since each application has its own characteristics and conditions, it is difficult to introduce a single approach that would be suitable for all problem domains. This chapter studies ensembles’ dynamic behaviour of existing ensemble methods (e.g. addition, removal and update of classifiers) in non-stationary data stream classification. It proposes a new, compact, yet informative formalisation of state-of-the-art methods. The chapter also presents results of our experiments comparing a diverse selection of best performing algorithms when applied to several benchmark data sets with different types of concept drifts from different problem domains.

Hossein Ghomeshi, Mohamed Medhat Gaber, Yevgeniya Kovalchuk

Processing Evolving Social Networks for Change Detection Based on Centrality Measures

Abstract

Social networks have an evolving characteristic due to the continuous interaction between users, with nodes associating and disassociating with each other as time flies. The analysis of such networks is especially challenging, because it needs to be performed with an online approach, under the one-pass constraint of data streams. Such evolving behavior leads to changes in the network topology that can be investigated under different perspectives. In this work we focus on the analysis of nodes position evolution—a node-centric perspective. Our goal is to spot change-points in an evolving network at which a node deviates from its normal behavior. Therefore, we propose a change detection model for processing evolving network streams which employs three different aggregating mechanisms for tracking the evolution of centrality metrics of a node. Our model is space and time efficient with memory less mechanisms and in other mechanisms at most we require the network of current time step T only. Additionally, we also compare the influence on different centralities’ fluctuations by the dynamics of real-world preferences. Consecutively, we apply our model in the user preference change detection task, reaching competitive levels of accuracy on Twitter network.

Fabíola S. F. Pereira, Shazia Tabassum, João Gama, Sandra de Amo, Gina M. B. Oliveira

Large-Scale Learning from Data Streams with Apache SAMOA

Abstract

Apache SAMOA (Scalable Advanced Massive Online Analysis) is an open-source platform for mining big data streams. Big data is defined as datasets whose size is beyond the ability of typical software tools to capture, store, manage, and analyze, due to the time and memory complexity. Apache SAMOA provides a collection of distributed streaming algorithms for the most common data mining and machine learning tasks such as classification, clustering, and regression, as well as programming abstractions to develop new algorithms. It features a pluggable architecture that allows it to run on several distributed stream processing engines such as Apache Flink, Apache Storm, and Apache Samza. Apache SAMOA is written in Java and is available at https://samoa.incubator.apache.org under the Apache Software License version 2.0.

Nicolas Kourtellis, Gianmarco De Francisci Morales, Albert Bifet

Process Mining for Analyzing Customer Relationship Management Systems: A Case Study

Abstract

Process Mining aims to discover and evaluate As-Is processes from sets of sequential events, by examining different instances of the same process and building models that can detect patterns and behaviors. In the meanwhile, organizational perspective is being considered in Process Mining by taking advantage of the ability to extract social networks that represent different kinds of relations between resources performing the process. The case study tries to describe how Process Mining could be applied in order to detect and improve “Customer Relationship Management” process and extract some kind of social networks that represent the relations between the employees(resources) of National Institute of Statistics of Portugal (INE) using event logs.

Ahmed Fares, João Gama, Pedro Campos

Detecting Smooth Cluster Changes in Evolving Graph Structures

Abstract

Graph mining is a set of techniques for finding useful patterns in various types of structured data. Many effective algorithms for mining static graphs have been proposed. However, graphs of human relationships and evolving genes change over time, and such evolving graphs require different algorithms for analysis. In this chapter, we explain a method called O2I for clustering in evolving graphs that can detect changes in clusters over time. O2I partitions the graph sequence into smooth clusters, even when the numbers of clusters and vertices vary. It first constructs a graph from the graph sequence, then uses spectral clustering and the RatioCut to apply k partitioning to this graph. O2I is compared in detail with the preserving clustering membership (PCM) algorithm, which is a conventional online graph-sequence clustering algorithm in which the numbers of clusters and vertices must remain constant. We further show that, in contrast to PCM, the performance of O2I is not dependent on the clustering of the initial graph in the graph sequence. Experiments on synthetic evolving graphs show that O2I is practical to calculate and addresses the main disadvantages of PCM. Further tests on real-world data show that O2I can obtain reasonable clusters. This method is hence a flexible clustering solution and will be useful on a wide range of graph-mining applications in which the connections, number of clusters, and number of vertices of the graphs evolve over time.

Sohei Okui, Kaho Osamura, Akihiro Inokuchi

Efficient Estimation of Dynamic Density Functions with Applications in Data Streams

Abstract

Recently, many applications such as network monitoring, traffic management and environmental studies generate huge amount of data that cannot fit in the computer memory. Data of such applications arrive continuously in the form of streams. The main challenges for mining data streams are the high speed and the large volume of the arriving data. A typical solution to tackle the problems of mining data streams is to learn a model that fits in the computer memory. However, the underlying distributions of the streaming data change over time in unpredicted scenarios. In this sense, the learned models should be updated continuously and rely more on the most recent data in the streams.

In this chapter, we present an online density estimator that builds a model called KDE-Track for characterizing the dynamic density of the data streams. KDE-Track summarizes the distribution of a data stream by estimating the Probability Density Function (PDF) of the stream at a set of resampling points. KDE-Track is shown to be more accurate (as reflected by smaller error values) and more computationally efficient (as reflected by shorter running time) when compared with existing density estimation techniques. We demonstrate the usefulness of KDE-Track in visualizing the dynamic density of data streams and change detection.

Abdulhakim Qahtan, Suojin Wang, Xiangliang Zhang

Incremental SVM Learning: Review

Abstract

The aim of this paper is to present a review of methods for incremental Support Vector Machines (SVM) learning and their adaptation for data stream classification in evolving environments. We formalize a taxonomy of these methods based on their characteristics and the type of solution they provide. We discuss the strength and weakness of the various learning methods and also highlight some applications involving data stream, where incremental SVM learning has been used.

Isah Abdullahi Lawal

On Social Network-Based Algorithms for Data Stream Clustering

Abstract

Extracting useful patterns from data is a challenging task that has been extensively investigated by both machine learning researchers and practitioners for many decades. This task becomes even more problematic when data is presented as a potentially unbounded sequence, the so-called data streams. Albeit most of the research on data stream mining focuses on supervised learning, the assumption that labels are available for learning is unverifiable in most streaming scenarios. Thus, several data stream clustering algorithms were proposed in the last decades to extract meaningful patterns from streams. In this study, we present three recent data stream clustering algorithms based on insights from social networks’ theory that exhibit competitive results against the state of the art. The main distinctive characteristics of these algorithms are the following: (1) they do not rely on a hyper-parameter to define the number of clusters to be found; and (2) they do not require batch processing during the offline steps. These algorithms are detailed and compared against existing works on the area, showing their efficiency in clustering quality, processing time, and memory usage.

Jean Paul Barddal, Heitor Murilo Gomes, Fabrício Enembreck

Title: Learning from Data Streams in Evolving Environments
Editor: Dr. Moamar Sayed-Mouchaweh
Publisher: Springer International Publishing
Electronic ISBN: 978-3-319-89803-2
Print ISBN: 978-3-319-89802-5
DOI: https://doi.org/10.1007/978-3-319-89803-2

Springer Professional

About this book

Table of Contents

Frontmatter

Introduction

Transfer Learning in Non-stationary Environments

A New Combination of Diversity Techniques in Ensemble Classifiers for Handling Complex Concept Drift

Analyzing and Clustering Pareto-Optimal Objects in Data Streams

Error-Bounded Approximation of Data Stream: Methods and Theories

Ensemble Dynamics in Non-stationary Data Stream Classification

Processing Evolving Social Networks for Change Detection Based on Centrality Measures

Large-Scale Learning from Data Streams with Apache SAMOA

Process Mining for Analyzing Customer Relationship Management Systems: A Case Study

Detecting Smooth Cluster Changes in Evolving Graph Structures

Efficient Estimation of Dynamic Density Functions with Applications in Data Streams

Incremental SVM Learning: Review

On Social Network-Based Algorithms for Data Stream Clustering