
Open Access 08-02-2025 | Review

Online deep learning’s role in conquering the challenges of streaming data: a survey

Authors: Muhammad Sulaiman, Mina Farmanbar, Shingo Kagami, Ahmed Nabil Belbachir, Chunming Rong

Published in: Knowledge and Information Systems


Abstract

In an era defined by the relentless influx of data from diverse sources, the ability to harness and extract valuable insights from streaming data has become paramount. The rapidly evolving realm of online learning techniques is tailored specifically for the unique challenges posed by streaming data. As the digital world continues to generate vast torrents of real-time data, understanding and effectively utilizing online learning approaches are pivotal for staying ahead in various domains. One of the primary goals of online learning is to continuously update the model with the most recent data trends while maintaining and improving the accuracy of previous trends. Based on the type of feedback available, online learning tasks can be divided into three categories: learning with full feedback, learning with limited feedback, and learning without feedback. This survey aims to identify and analyze the key challenges associated with online learning with full feedback, including concept drift, catastrophic forgetting, skewed learning, and network adaptation, whereas other existing reviews mainly focus on only one or two of these challenges. This article also discusses the applications and ethical implications of online learning. The results of this survey provide valuable insights for researchers and practitioners seeking to design effective online learning systems that incorporate full feedback while addressing the associated challenges. In the end, some conclusions, remarks, and future directions for the research community are provided based on the findings of this review.
Notes
Mina Farmanbar, Shingo Kagami, Ahmed Nabil Belbachir, and Chunming Rong have contributed equally to this work.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

Machine learning is a subfield of artificial intelligence that focuses on developing algorithms that can learn patterns from data and make predictions or decisions based on that learning. Machine learning algorithms can be broadly divided into two categories: offline learning and online learning. Offline learning, also known as batch learning, involves training a machine learning model on a fixed-size dataset and then using the trained model to make predictions on new, unseen data. This type of learning requires both the data and a learning algorithm to be available before the training process begins [1]. During the training process, the model parameters are adjusted to fit the patterns in the training data, with the goal of making accurate predictions. The model parameters are determined during the training phase and are static in the sense that they are not updated after training. Once the model is trained, it is deployed for use in the application, and the model’s parameters remain fixed during inference or prediction.
Online learning, in contrast to offline learning, involves updating the machine learning model continuously as new data become available. This makes it particularly useful in applications where data are generated in real time by mobile phones, Internet of Things devices, and various network sensors [2–5]. By updating the model with the most recent data trends, online learning can help to improve the accuracy and reliability of the model over time. Each of these categories of machine learning can be further divided into three subcategories based on the type of learning: supervised learning, semi-supervised learning, and unsupervised learning. Based on the type of feedback available, online learning tasks can be divided into three categories: learning without feedback, learning with limited feedback, and learning with full feedback. In learning without feedback, the algorithms receive little or no feedback about their performance on a task. This type of learning is similar to unsupervised learning in machine learning, where the model must identify patterns in the data without explicit feedback. In learning with limited feedback, the algorithms receive some feedback about their performance, but the feedback may not be complete or detailed. This type of learning is similar to semi-supervised learning in machine learning, where the model has access to some labeled examples but must still learn from a large amount of unlabeled data. In learning with full feedback, the algorithms receive complete and detailed feedback about their performance on a task. This type of learning is similar to supervised learning in machine learning, where the model is trained on a labeled dataset and receives feedback in the form of correct labels.
Online learning has the advantage of being able to adapt to changing data and continuously improve the model's performance. However, it also requires more resources and can be more complex to implement than offline learning. There are also challenges associated with online learning. Concept drift, catastrophic forgetting, skewed learning, and network adaptation are four challenges that can arise in online machine learning systems. Each of these challenges can affect the performance of the model and lead to inaccurate or unreliable results. This study aims to conduct a comprehensive review of the literature on online learning methodologies and the associated challenges, especially for online learning with full feedback. In this study, each of the aforementioned challenges and some of the approaches that have been proposed to address them will be discussed in detail. The main contributions of this study are as follows:
1.
An overview of offline and online learning approaches, their strengths, weaknesses, and ability to handle the identified challenges.
 
2.
A literature review of the four investigated major challenges, namely concept drift, catastrophic forgetting, skewed learning, and network adaptation, associated with online learning with full feedback.
 
3.
A conclusion covering various aspects of online learning and analyzing the limitations of existing methods.
 
4.
Identification of the open research questions and potential directions for future research in this area.
 
Figure 1 summarizes the major contributions of this study. In the introduction section, online and offline learning are compared first, then problems related to online learning are identified, and later, different challenges are described. In the literature review, different methods are described and analyzed for the four challenges of online learning. The next section discusses applications of online deep learning in different domains. Later, the ethical implications of online deep learning are discussed. In the conclusion, all four challenges are summarized. In future directions, open research questions and potential directions are identified. Best viewed in color.

1.1 Offline learning analysis

Offline learning, also referred to as batch learning, involves training a machine learning model on a static dataset that is provided all at once. After the trained model is deployed in the application scenario, it predicts data according to the patterns or trends in the dataset. In applications where the underlying data distribution changes over time, such as in streaming data or non-stationary environments, the deployed model may become outdated and need to be retrained on new data to maintain accuracy and performance. This is especially important in safety-critical applications such as self-driving vehicles, unmanned aerial vehicles, and robots, where the consequences of model failure can be severe [6]. In such schemes, machine learning practitioners manually retrain the model on new data and redeploy it. However, this approach can be time-consuming and may not be feasible in applications with large amounts of streaming data. For this purpose, some practitioners schedule training and deployment automatically at fixed time stamps. In this way, the training and deployment problem is solved, but three drawbacks remain: the cost of redeployment, the fact that the model is not trained on data that arrives after the last scheduled training, and the need to store all data arriving between consecutive schedules, which is difficult in real-world dynamic applications due to the huge volume of data involved [7]. Sometimes models are used for prediction at the edge, where machines have limited computing and memory resources, and it is therefore important to have models that are lightweight and can be executed efficiently on edge devices. Moreover, in streaming data applications, where data are constantly arriving, it is important to have a model that can adapt to the most recent trends at run time. This requires a model that is capable of incremental learning, which means it can learn from new data points without retraining the entire model from scratch.

1.2 Online learning analysis

Offline learning models are prone to performance deterioration in a non-stationary streaming environment [8], and such static models fail to scale to real-world dynamic problems. Online machine learning is a technique that updates model parameters at run time to adapt to the most recent trends. Due to the adaptive nature of online learning, neither batch retraining nor redeployment is required, and data do not need to be stored in memory. In online machine learning, models are fed with sequential data, and parameters are updated in real time (one data point at a time) [9]. On the other hand, backpropagation on one data point raises challenges such as concept drift, catastrophic forgetting, skewed learning, and network adaptation. Concept drift occurs when the underlying distribution of the data changes over time, which can cause the model to become outdated and less accurate. Catastrophic forgetting occurs when the model forgets previously learned information as it learns new information, which can be problematic for tasks that require long-term memory. Skewed learning occurs when the data are imbalanced, which can lead to the model making biased predictions. Network adaptation refers to the ability of the architecture to adapt to new data while retaining previously learned knowledge. Online learning is indeed one of the key techniques that make machine learning practical for real-time analysis of big data.
In many online learning scenarios, obtaining large amounts of labeled data can be costly and time-consuming, which makes techniques that reduce the dependence on labeled data highly desirable. Some recent methods of online learning that can reduce the amount of labeled data required for training models are zero-shot learning, one-shot learning, and transfer learning. Zero-shot learning refers to the ability of a model to recognize and classify objects that it has never seen before without requiring any examples of those objects during training. Instead, the model is trained on a set of attributes or semantic descriptions that describe the properties of different object classes [10]. One-shot learning, on the other hand, is a type of machine learning in which a model is trained to recognize new objects or classes from just one or a few examples rather than requiring a large dataset for training [11]. Transfer learning is a technique that involves using knowledge gained from one task to improve performance on another related task [12]. Overall, online learning is a powerful technique for building models that can adapt to changing data distributions in real time. While backpropagation on a single data point can raise several challenges, online learning can help mitigate these challenges by allowing the model to learn from new data incrementally.

1.3 Problem setting/problem formulation

In online learning, data arrive sequentially at the learning algorithm, and the timestamp for each data instance of data arriving at the model is called “round.” Models predict the data sample, e.g., classifying data into a predefined class. After the prediction, the model receives the actual label as ground truth for the sample, and then, the model measures the loss value, which is the difference between the predicted and actual class labels. In the end, the model updates its parameter according to the loss suffered to improve the model prediction for the next upcoming samples.
Figure 2 mimics the working cycle of batch and online learning in a production environment. In batch learning, memory is required to gather data for training, which may not be possible in high-dimensional data streaming applications, while data gathering is not required in an online fashion. In batch learning, training must be completed before deployment for production in a use case, while in online learning, training is performed continuously with each prediction. In online learning, a loop imitates continuous prediction in the application. The time between two consecutive predictions is shorter in batch learning than in online learning because the online learning loop contains two extra steps: receiving the true outcome and updating the model. To make predictions that reflect the most recent trends, batch learning has to repeat all steps from the beginning, while online learning is specifically designed to stay updated with recent trends, which makes it suitable for streaming data.
Consider an online classification task where the primary objective is to minimize the cumulative loss suffered by the model. The cumulative loss can be minimized with the help of a learning function f based on a sequence of samples \(X=\{x_1,x_2, x_3,...,x_T\}\) that arrive sequentially, where T represents the total number of time stamps. After the arrival of each sample, the function f classifies the sample as \(p_t\), the class label predicted by the model. The ground truth \(Y=\{y_1,y_2, y_3,...,y_T\}\) contains the output label for each sample. The loss is the difference between the predicted and actual output for a specific sample, computed with the loss function \(L(p_t,y_t)\), whereas the cumulative predictive error can be calculated with Eq. 1.
$$\begin{aligned} E_T=\sum _{t=1}^{T} L(p_t,y_t) \end{aligned}$$
(1)
Algorithms 1 & 2 present the baseline for batch machine learning and online learning [9] algorithms.
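To make the per-round cycle concrete, the following minimal sketch implements the loop described above in Python. The `model` object with `predict` and `update` methods, the `stream` iterable, and `loss_fn` are illustrative assumptions, not part of any specific algorithm from [9].

```python
def online_learning_loop(model, stream, loss_fn):
    """Run the online learning cycle on a sequence of (x_t, y_t) pairs and return the cumulative loss E_T."""
    cumulative_loss = 0.0
    for x_t, y_t in stream:                    # round t: one sample arrives
        p_t = model.predict(x_t)               # predict before the true label is revealed
        cumulative_loss += loss_fn(p_t, y_t)   # suffer loss L(p_t, y_t) for this round
        model.update(x_t, y_t)                 # full feedback: update parameters immediately
    return cumulative_loss
```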
Table 1
Research questions

RQ1: How to detect and adapt to concept drift?
RQ2: How to avoid catastrophic forgetting?
RQ3: How to avoid skewed/imbalanced learning?
RQ4: How to adapt the network architecture?
The paper is organized as follows. Section 2 reviews key challenges in applying deep learning to streaming data, focusing on concept drift, catastrophic forgetting, skewed learning, and network adaptation. Section 3 delves into concept drift, covering methods for detecting drift and strategies for adaptation. Section 4 addresses catastrophic forgetting, exploring techniques such as hedging methods, selective training, and progressive neural networks to mitigate its impact. Section 5 examines skewed learning, analyzing both data-level and algorithm-level approaches to handling imbalanced streaming data. Section 6 discusses network adaptation, including methods for dynamically adjusting network depth and width to better handle streaming data demands. Section 7 illustrates practical implementations of these techniques in smart city, finance, and healthcare systems. Section 8 examines the ethical considerations surrounding the use of online deep learning on streaming data, including issues of bias, fairness, transparency, and accountability. Section 9 summarizes the main insights from the survey, and Section 10 proposes avenues for continued research, highlighting open challenges and potential advancements in this field.

2 Challenges

Online learning faces certain challenges due to network parameter adaptation for each data point, including concept drift, catastrophic forgetting (also known as catastrophic interference), skewed or imbalanced learning (which can lead to underfitting or overfitting), and network adaptation. Table 2 shows surveys related to these problems and indicates that none of them target all four questions. Table 1 presents the research questions, and Sects. 3, 4, 5, and 6 elaborate on the different techniques proposed to address each question in Table 1. This survey aims to address all four questions at the same time.
Table 2
A summary of past related surveys

Refs. | Year | RQ1 | RQ2 | RQ3 | RQ4
[8]   | 2023 |     |     |  ✓  |
[13]  | 2022 |     |     |  ✓  |
[6]   | 2022 |  ✓  |     |     |  ✓
[14]  | 2021 |  ✓  |  ✓  |  ✓  |
[15]  | 2019 |     |  ✓  |     |
[16]  | 2019 |  ✓  |     |     |  ✓
[17]  | 2016 |     |  ✓  |     |
[18]  | 2014 |     |  ✓  |     |

2.1 Concept drift

Concept drift is a common problem in online learning where the underlying data distribution changes over time. Changes refer to the phenomenon where the statistical properties of the data used to train a machine learning model change over time, leading to a decrease in the model’s performance. The relationship between input and output in an online data stream can change over time due to various factors such as changes in user behavior, changes in the environment, or changes in the underlying system generating the data, and a static model with a fixed relationship may result in poor predictions [19, 20]. Therefore, a dynamic model is required to update the relationship between input and output [15]. To update the model for changing data, it is necessary to detect concept drift in the streaming data [21].
Recently, various methods have been proposed for detecting concept drift in machine learning models using the error rate of the classifier [22], sliding window and accuracy [23–25], fuzzy windowing [26], incremental least squares density difference [27], and local drift detection (LDD) measurement [28]. After detecting concept drift, it is important to adapt the machine learning model to the changing data distribution in order to maintain or improve performance. Recent methods for concept drift adaptation are Bilevel Online Deep Learning (BODL) [22], updating an ensemble by adding or removing networks [23], updating the model by retraining [25], fuzzy windowing concept drift adaptation (FW-DA) [26], a case-based editing technique [29], and self-adjusting memory [24]. Concept drift can occur in different forms, depending on the nature and rate of change in the data distribution. Figure 3 illustrates the four different types of concept drift: sudden, gradual, incremental, and recurring. Section 3 elaborates on the literature related to concept drift.

2.2 Catastrophic forgetting

In a dynamic environment, human cognitive reactions may change in response to the same stimulus due to neural variability in the sensory cortex, which fluctuates over time. Humans learn continuously in such environments, but this can lead to catastrophic forgetting, where previously learned tasks are disrupted by the brain's neural system. Neural variability helps compensate for both accuracy and plasticity in humans [30]. Online learning techniques face the same phenomenon of catastrophic forgetting when working with dynamic data [16]. This occurs when the network modifies information related to a previous task due to continuous training on a new task [30–34]. Figure 4 depicts catastrophic forgetting, where the network forgets class A due to the recursive occurrence of class B. Section 4 elaborates on the literature related to catastrophic forgetting.

2.3 Skewed learning

Streaming data refer to an unbounded sequence of real-time data points with high velocity, high volume, and skewed distribution [35]. In a normal distribution, data points are distributed evenly on both sides of the mean, whereas in a skewed distribution they are not. In supervised learning, data points are labeled with classes, and if the difference in the number of data points between classes is large, the data are imbalanced or skewed. Conventional online learning techniques aim to minimize the error rate and maximize accuracy, but raw accuracy can be misleading if the data are skewed or imbalanced [14]. Class-based accuracy needs to be improved to address this issue. Additionally, a skewed distribution can result in catastrophic forgetting for the minority class. Section 5 elaborates on the literature related to skewed learning.

2.4 Network adaptation

In an online setting, the optimal network capacity is unknown due to concept drift in streaming data. Online learning starts with a small network and expands when concept drift occurs to achieve scalability and efficiency in training [36]. However, there are challenges involved in network expansion, such as when and how to expand the network and whether to expand the network’s width or depth [37]. Network expansion increases the training cost for the new task, and if the existing network is sufficient to handle the new task, then the network may not need to expand. Additionally, network contraction is necessary to prune redundant neurons and layers from the network, reducing the prediction and training costs of the new task. Section 6 elaborates on the literature related to network adaptation.
Figure 5 depicts the online adaptive structure of a deep neural network (DNN), where the output depends on the output of each layer rather than only the last layer. A conventional DNN is composed of several hidden layers, where each layer is connected to the previous layer and the output is generated from the last layer [23]. However, in the online learning process, the depth of the hidden layers is adjustable to adapt the model capacity, and these layers are called adaptive depth units. Each adaptive depth unit works as a base classifier to avoid relying solely on the output of the final layer. The final output is a weighted combination of the base classifiers to prevent catastrophic forgetting and improve the convergence speed in case of concept drift.
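As a rough illustration of this weighted combination, the sketch below attaches a classifier to every hidden layer and blends their outputs; the callables and the weight vector `alphas` are illustrative assumptions rather than the exact architecture used in [23].

```python
def adaptive_depth_forward(x, layers, classifiers, alphas):
    """Forward pass where every hidden layer feeds its own base classifier.

    layers:      callables mapping the previous hidden state to the next one
    classifiers: one callable per layer, mapping a hidden state to class scores (e.g. NumPy arrays)
    alphas:      nonnegative per-classifier weights that sum to 1
    """
    h = x
    per_layer_scores = []
    for layer, clf in zip(layers, classifiers):
        h = layer(h)                      # hidden representation of this adaptive depth unit
        per_layer_scores.append(clf(h))   # base classifier attached at this depth
    return sum(a * s for a, s in zip(alphas, per_layer_scores))  # weighted combination of base classifiers
```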

3 Concept drift

Given a time period [0, t] and a set of samples \(S_{0,t} = \{(X_0,y_0), (X_1,y_1),..., (X_t,y_t)\}\), where X is the feature vector and y is the label, concept drift occurs at timestamp \(t+1\) if the joint distribution of features and labels changes, i.e., \(P_t(X,y) \ne P_{t+1}(X,y)\) [18, 20, 24, 29]. Depending on the extent of drift, network parameters may need to be changed for small drift, and the network architecture may need to be expanded for significant drift. Network expansion could involve adding neurons to a hidden layer or introducing a new hidden layer to the network [36]. Progressive neural networks expand the network by adding the same number of new layers as the existing network for a new task in streaming data [38, 39]. Feature evolution learning is another solution to manage concept drift by incrementally and decrementally adjusting features as the feature dimension changes dynamically, allowing the model to adapt to new patterns in the data [40]. To quickly overcome the performance degradation caused by concept drift, three steps are required: drift detection (whether drift occurs or not), drift quantification (how much drift occurs), and drift adaptation (how to react to drift) [15].

3.1 Drift detection

The sliding window method is commonly used to examine a small subset of instances for detecting drift. Drift detection can be categorized into two types: error rate-based and data distribution-based. Error rate-based detection can be conducted after the prediction phase, as it is dependent on the model’s performance. Data distribution-based detection, on the other hand, is not reliant on performance and can be performed at any stage, such as before or after classification.

3.1.1 Error rate-based

Error rate-based drift detection algorithms detect drift by monitoring the algorithm’s performance. These algorithms track the online error rate and trigger a drift alarm if it exceeds a specified level. Gama et al. [41] introduced the drift detection method (DDM), which set a benchmark for error rate-based drift detection algorithms. DDM consists of four steps: data retrieval, data modeling, test statistics calculation, and hypothesis testing. DDM focuses on the error rate of the classifier within a specified time window. If the error rate exceeds the warning level, DDM builds a new learner in parallel with the old learner’s prediction. If the error rate reaches the drift level, the old learner is replaced with the new learner for future prediction.
Ross et al. [42] proposed the exponentially weighted moving average (EWMA) chart for concept drift detection (ECDD) as an improvement over DDM. The method uses the same first three steps as DDM but modifies the fourth step by using an EWMA chart to track changes in the error rate. The EWMA chart is a statistical tool that monitors the performance of a machine learning algorithm over time and detects changes in the data distribution that may indicate concept drift. The chart gives greater weight to more recent observations while tracking the mean of the process over time. However, a limitation of this procedure is the length of the window slot for the dynamic mean, which may vary across data and applications, making it challenging to determine a standard window slot.
Guo et al. [23] proposed the use of an ensemble of neural networks, each trained on a different subset of the data, for detecting and adapting to concept drift. To detect concept drift, the accuracy of each neural network within the ensemble is monitored on the data within the sliding window and compared to a threshold. If the accuracy falls below the threshold, it is assumed that concept drift has occurred. Similarly, Han et al. [22] utilized an ensemble of classifiers to detect and adapt to concept drift. Concept drift is detected by monitoring the probability distribution of the data using the error rate of the base classifiers. Losing et al. [24] utilized a sliding window approach to capture the most recent data, and the KNN classifier is trained on the current window to classify incoming data. To detect concept drift, the classification accuracy of the classifier is monitored, and if it falls below a threshold, it is assumed that concept drift has occurred. The memory is then adjusted to adapt to the new data distribution.
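The monitoring rule behind DDM-style detectors is simple enough to sketch. In the snippet below, the warning and drift conditions follow the commonly cited form \(p + s \ge p_{\min } + k \cdot s_{\min }\); the class name and factor values are illustrative, and the sketch omits the reset logic a production detector would need.

```python
import math

class ErrorRateDriftMonitor:
    """Simplified error-rate drift monitor in the spirit of DDM [41]."""

    def __init__(self, warning_factor=2.0, drift_factor=3.0):
        self.n = 0
        self.p = 1.0                      # running error rate
        self.s = 0.0                      # standard deviation of the error rate
        self.p_min = float("inf")
        self.s_min = float("inf")
        self.warning_factor = warning_factor
        self.drift_factor = drift_factor

    def add_result(self, error):
        """error is 1 if the last prediction was wrong, 0 otherwise; returns 'stable', 'warning' or 'drift'."""
        self.n += 1
        self.p += (error - self.p) / self.n
        self.s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.p + self.s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, self.s
        if self.p + self.s >= self.p_min + self.drift_factor * self.s_min:
            return "drift"                # replace the old learner with the newly built one
        if self.p + self.s >= self.p_min + self.warning_factor * self.s_min:
            return "warning"              # start training a new learner in parallel
        return "stable"
```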

3.1.2 Data distribution-based

Data distribution-based drift detection algorithms use distance or similarity functions to detect drift by comparing historical and new data [43–45]. These algorithms can be computationally expensive due to the need to measure distances for all instances, making them slower than error rate-based drift detection algorithms [29]. Additionally, determining appropriate window sizes for both historical and new data can be challenging. To address these issues, researchers have proposed various approaches for data distribution-based drift detection. Liu et al. [26] developed the Fuzzy Windowing Drift Detection (FW-DD) method, which uses a fuzzy time window instead of a traditional time window to focus on gradual drift detection. FW-DD compares statistical measures of the current time window with those of the previous time window to detect gradual concept drift, using fuzzy logic to allow for a gradual transition between states of no drift, gradual drift, and sudden drift.
Qahtan et al. [46] proposed the PCA-based change detection (PCA-CD) method, which uses PCA to reduce the dimensionality of multi-dimensional streaming data and identify the principal components that capture the most significant variation in the data. PCA-CD constructs a subspace model from the principal components to represent the underlying data distribution and compares the subspace model of the current data with that of historical data using a statistical test. This method is computationally efficient due to its use of an efficient density estimator, and it minimizes the need for setting user-defined thresholds with the help of the Page–Hinkley test. Gu et al. [47] developed the equal density estimation (EDE) method, which applies kernel density estimation to estimate the local data distribution within a fixed-size sliding window. EDE compares the density estimate of the current sliding window with that of the previous window and uses a threshold to determine whether the difference between the two is significant. Liu et al. [28] partitioned the input space into a set of regions using a clustering algorithm and estimated the underlying data density for each region using a kernel density estimator. They proposed an adaptive bandwidth selection method that improved the accuracy of density estimation and allowed the method to handle data streams with varying density levels. Least Squares Density Difference-based Change Detection Test (LSDD-CDT) [27] is an incremental version of [28] that uses a Gaussian mixture model for density estimation and a change detection test to detect significant drift.
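The common skeleton behind these distribution-based detectors is a comparison between a reference window and a current window. The sketch below uses a two-sample Kolmogorov–Smirnov test as a stand-in for the density-difference and PCA-based statistics discussed above; the test choice, the window size, and the significance level are illustrative assumptions.

```python
from collections import deque
from scipy.stats import ks_2samp

def windows_have_drifted(reference_window, current_window, alpha=0.01):
    """Flag drift when a two-sample KS test rejects that both windows share one distribution."""
    _, p_value = ks_2samp(list(reference_window), list(current_window))
    return p_value < alpha

# Usage sketch for a univariate feature: two fixed-size sliding windows over the stream.
reference_window = deque(maxlen=500)   # historical data
current_window = deque(maxlen=500)     # most recent data
```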

3.2 Drift adaptation

Once concept drift is detected, adaptation methods can be employed to update the model to reflect the new data distribution. These methods can include retraining the model with new data, updating the model parameters, or using ensemble methods that combine multiple models trained on different data subsets. Elwell et al. [48] proposed Learn++.NSE, which detects concept drift by comparing the current and recent performance of base classifiers. Learn++.NSE adapts to concept drift by building a new classifier for each batch of input data and combining them as an ensemble using dynamically weighted majority voting. The voting weights are updated based on the time-adjusted accuracy of each classifier.
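A minimal sketch of that voting scheme is shown below: each ensemble member casts a vote scaled by a weight derived from its recent accuracy. The log-odds weighting and the use of a single recent window simplify the time-adjusted weighting of Learn++.NSE, so this is an illustration of the idea rather than the published algorithm.

```python
import numpy as np

def weighted_majority_vote(members, weights, x, n_classes):
    """Combine the predictions of ensemble members using accuracy-derived voting weights."""
    votes = np.zeros(n_classes)
    for member, w in zip(members, weights):
        votes[member.predict(x)] += w                 # each member casts a weighted vote
    return int(np.argmax(votes))

def update_voting_weights(members, recent_window, eps=1e-6):
    """Recompute weights from each member's accuracy on the most recent labeled batch."""
    accuracies = [np.mean([m.predict(x) == y for x, y in recent_window]) for m in members]
    # log-odds weighting; members no better than chance contribute nothing
    return [max(0.0, np.log((a + eps) / (1.0 - a + eps))) for a in accuracies]
```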
Zhou et al. [39] addressed concept drift by adding a constant number of features to the network for underfitting and merging features to avoid overfitting and redundancy. However, this method is sub-optimal because of the complete retraining required and the constant number of features added without measuring the capacity of the drift. Shao et al. [45] proposed a method to detect concept drift by measuring the similarity between the incoming data stream and the learned prototype. Once the drift is detected, the learned prototypes are updated according to the new data for drift adaptation. Liu et al. [49] detect concept drift by measuring the conflict between the active learner and input data. If drift is detected, a new learner is initialized and trained on the conflicted input data. Losing et al. [24] proposed Self-Adjusting Memory KNN to deal with heterogeneous drift. KNN is used as a classifier, and the SAM concept is used to transfer the current concept from short-term memory (STM) to long-term memory (LTM).
Lu et al. [29] proposed a two-step Case-based Reasoning method to solve the drift problem. In the first step, drift is detected along with the competence region, indicating where the drift is more severe. Noise-efficient Fast Context Switching is used to identify noise and novel concepts, and then, the noise is removed. The second step is the preservation of novel concepts for drift adaptation. After preservation, Stepwise Redundancy Removal (SRR) uses KNN to remove redundant concepts, and then, the competence model is updated. Liu et al. [26] proposed the Fuzzy Windowing Drift Adaptation (FW-DA) algorithm, which detects concept drift using a certain warning threshold and membership function. When concept drift is detected, a new learner is created and trained, replacing the old learner.
Xu et al. [25] proposed the Dynamic Extreme Learning Machine (DELM), which detects drift using the same technique as [41]. After drift detection, the adaptation procedure is enhanced by using an Extreme Learning Machine (ELM) [50]. When concept drift is detected, more hidden layer neurons are added to the ELM, which serves as a base classifier. When concept drift reaches the specified upper limit, or accuracy falls to the specified lower limit, the current classifier is deleted and a new classifier starts training on new data. Ashfahani et al. [51] proposed a Drift Detection Scenario (DDS) to detect concept drift. When concept drift is detected, the depth of the network increases by adding a new hidden layer. At the same time, a complexity reduction scenario removes a hidden layer if it is highly correlated with another hidden layer; in this way, an equilibrium in the number of hidden layers is maintained. Concept drift is detected in the input space by evaluating accuracy with Hoeffding's bound, which defines a theoretical bound on the number of data points required to signal drift based on accuracy.
Guo et al. [23] proposed an ensemble-based technique called the selective ensemble-based online adaptive (SEOA) neural network. When concept drift is detected, the next step is adaptation, in which model generalization and adaptability are improved by dynamically and selectively integrating base classifiers of different natures. Han et al. [22] proposed the Bilevel Online Deep Learning framework (BODL). When concept drift is detected, BODL updates the model parameters of the base classifiers using a proposed bilevel optimization scheme. In bilevel optimization, the cross-entropy loss is used as the objective function for the memory and model weights. Table 3 presents a list of articles on concept drift, highlighting the methodologies employed along with their advantages and disadvantages. Ren et al. [52] handle concept drift by dynamically selecting network components based on the current data distribution. This dynamic network selection is conditioned on a discrete variable that models the distribution shifts, allowing the model to adapt to new patterns as they emerge.
Table 3
Concept drift methods with their positive (+) and negative (-) remarks

[37] Hedging Back Propagation (HBP) method
  + Hedging induces collaboration by sharing features across classifiers

[23] Selective Ensemble-based Online Adaptive (SEOA) deep neural network
  + Dynamic selection of base classifiers ensures stability and adaptability
  - Initial confidence levels are assigned to classifiers without any prior knowledge
  - Slow convergence problem

[24] Self-Adjusting Memory (SAM) model for the KNN algorithm
  + The model can cope with heterogeneous concept drift and is easily applicable without parameterization
  + Generalization of the model permits an efficient combination of memory and time parameters

[26] Fuzzy Windowing Concept Drift Detection and Adaptation (FW-DA)
  + The fuzzy windowing method allows sharing of information related to old and new concepts
  - Noise investigation of the window is required

[27] Incremental Least Squares Density Difference (LSDD) detection method
  + Computationally light and tests in an online manner
  - The window size increases adaptively to improve performance, which may result in a bottleneck

[28] Local Drift Degree (LDD) measurement
  + Both temporal and spatial information are considered
  - Computational complexity may increase by considering both pieces of information

[29] Noise-Enhanced Fast Context Switch (NEFCS) and Stepwise Redundancy Removal (SRR)
  + Redundancy removal decreases the size of the case base, which may help to decrease response time
  - NEFCS depends on the size of the case base, which is not suitable for quick response

[46] PCA and density estimator-based detection
  + Reduces computational cost using an efficient density estimator
  - The number of components must be given as input

[35] Cost-sensitive sparse online learning via truncated gradient
  + Scales well and responds quickly
  - No novel task detection on streaming data, which makes it inefficient when new tasks arrive

[36] Dynamically Expandable Networks (DEN)
  + Window slot increases if required
  - Adaptive increase in the window slot may result in a bottleneck; a controlling parameter is required for the window slot

[41] Error rate-based drift detection
  + Simplest and computationally efficient
  - Warning level and drift level need to be specified

[42] Exponentially Weighted Moving Average (EWMA)
  + EWMA can be used with any arbitrary classifier
  - Method tested only with binary classification

[44] Drift detection in multi-dimensional data using relative entropy
  + The proposed method can be used for drift detection in multiple dimensions
  - Method tested on static data only

[45] Gradual and abrupt concept drift detection via prototype-based classification using PCA and statistical analysis
  - Model trained only on representative examples, which may result in underfitting

[47] Equal density estimation-based drift detection
  + Approach works well with single- and multi-dimensional data
  - Performance may improve with different distance algorithms

[51] Drift Detection Scenario (DDS)
  + Capable of achieving a trade-off between plasticity and stability
  - Every change in distribution creates a new hidden layer, which may result in a very complex network in the end

[53] Network expansion (vertical)
  - Bias of the network needs to be evaluated

[54] Oversampling and instance selection
  - Concept drift detected only through incorrect predictions

4 Catastrophic forgetting

Catastrophic forgetting is a common problem in online machine learning where a model forgets previously learned information when it is trained on new data. In an attempt to address this problem, researchers have proposed various techniques, such as the hedging method, selective training, and progressive neural networks. Ans et al. [32] proposed a model called SRM (self-refreshing memory) that comprises a memory module storing essential information from past tasks and a network module learning the current task. The memory module is updated regularly by a self-refreshing mechanism, which selects the most pertinent information from the network module to avoid catastrophic forgetting. As a result, the memory is optimized to retain important knowledge and forget irrelevant or outdated information.
Goodfellow et al. [30] investigated the selection of appropriate learning algorithms and activation functions for different tasks and relationships between tasks to mitigate the catastrophic forgetting effect. They examined the relationship between tasks and found that dropout is the most effective training algorithm for modern feed-forward neural networks. However, the choice of activation function varies depending on the task and the relationship between tasks. Maxout is the only activation function that consistently performs well across all tasks, but it may not always be the optimal choice. Dropout tends to increase the optimal size of the network, but this effect is not always consistent. Kirkpatrick et al. [55] overcome catastrophic forgetting through Elastic Weight Consolidation (EWC), which selectively slows down learning on the weights that matter for previously learned tasks. The plasticity of weights is selectively decreased, pulling them toward their previous values according to their importance for the previously learned task.
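For reference, the EWC objective from [55] can be written as follows, where \(L_B(\theta )\) is the loss on the new task B, \(F_i\) is the Fisher information of parameter i, \(\theta ^{*}_{A,i}\) is that parameter's value after learning task A, and \(\lambda \) balances old against new knowledge:
$$\begin{aligned} L(\theta ) = L_B(\theta ) + \sum _i \frac{\lambda }{2} F_i \left( \theta _i - \theta ^{*}_{A,i}\right) ^2 \end{aligned}$$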
Nguyen et al. [34] proposed a method for measuring catastrophic forgetting using the actual error rate and task sequence hardness. They investigated the relationship between catastrophic forgetting and task properties and found a strong correlation with total complexity and a weak correlation with sequential heterogeneity toward task sequence hardness. Ren et al. [52] uses a Bayesian framework with a discrete distribution-modeling variable to capture abrupt shifts in data. This approach helps in retaining knowledge from previous distributions while adapting to new ones, thereby mitigating catastrophic forgetting. The challenge of catastrophic forgetting is also addressed through instance incremental learning, which handles data attributes changing over time, ensuring that the model retains previously learned information while adapting to new data [40]. Park et al. [56] explore this challenge by introducing a speculative backpropagation method that leverages activation history to mitigate forgetting. Their approach allows neural networks to retain knowledge from previous tasks while adapting to new ones, thus enhancing the model’s ability to learn continuously without significant degradation in performance.

4.1 Hedging method

Littlestone et al. [57] proposed the Weighted Majority Algorithm, which assigns weights to each algorithm in a pool and combines their outputs to make the final prediction. Since each algorithm in the pool performs prediction in a different way, the Weighted Majority Algorithm allocates weights according to their performance to account for their heterogeneity. The weight assigned to each algorithm is updated based on its prior performance. Consequently, algorithms with good prior performance contribute more to the final decision than those with poor performance. The Weighted Majority Algorithm is a crucial step in maintaining prior knowledge and overcoming catastrophic forgetting.
Freund et al. [58] presented the Hedge Algorithm as a generalization of the Weighted Majority Algorithm proposed by [57] for online allocation problems. The Hedge Algorithm maintains a weight vector with time t for all algorithms, where weights are nonnegative and sum up to 1. The initial weight vector can be arbitrary or assigned high weights to those algorithms expected to perform best. If prior knowledge of strategy performance is missing, equal weights can be assigned to each strategy. A key improvement over [57] is the application of upper and lower bounds on weights and loss. Using a weight vector, strategies with better performance are given preference for the final output to overcome catastrophic forgetting. The hedging method utilizes each layer of the deep neural network as a classifier [23]. In the hedging method, the weights of the classifiers are updated according to their performance on the current task.
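The multiplicative update at the core of the Hedge Algorithm [58] is compact: with a parameter \(\beta \in (0,1)\) and \(\ell ^t_i\) denoting the loss suffered by strategy i at round t, each weight is scaled down according to its loss and the vector is renormalized,
$$\begin{aligned} w^{t+1}_i = \frac{w^t_i \, \beta ^{\ell ^t_i}}{\sum _{j} w^t_j \, \beta ^{\ell ^t_j}} \end{aligned}$$
so that strategies with consistently low loss dominate the final decision while poorly performing ones fade without being discarded.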
Han et al. [22] proposed a method to avoid catastrophic forgetting by using base classifiers as an ensemble, with the weights of these classifiers updated using exponential gradient descent in an online manner. The method involves solving a bilevel optimization problem, where the inner problem determines the optimal weights of the base classifiers for the current task, while the outer problem updates the weights of the base classifiers based on their performance on the current task. This approach allows for both retaining important knowledge and adapting to new tasks. It has also been shown to outperform other methods, such as fine-tuning and elastic weight consolidation.
Sahoo et al. [37] proposed a method for training deep neural networks in an online setting with stream data. Their network architecture includes an output classifier connected to each hidden layer, similar to the hedging method. The final output is a combination of the outputs from all layers weighted by their respective performance. This approach gives preference to the output of better-performing layers in the past to overcome catastrophic forgetting. Similarly, in [51], Ashfahani et al. proposed a method to overcome catastrophic forgetting in which each layer is directly connected to the output layer, and its contribution to the final output is determined by a dynamically assigned weight. These weights decide how much preference should be given to new or old knowledge.

4.2 Selective training

In order to overcome the problem of catastrophic forgetting in lifelong learning, Yoon et al. [36] propose a novel incremental learning algorithm called Dynamically Expandable Networks (DEN). DEN is designed to prevent semantic drift by selecting and retaining relevant knowledge from previous tasks while efficiently allocating resources to learn new information. Initially, the network is trained on the first task. When a new task is introduced, the network is duplicated, and the new task is trained on the duplicate network. During this process, the activations of each neuron in the original network are recorded. Based on the difference between a neuron’s activations on the new task and its activations on the original task, a relevance score is calculated for each neuron. The top-ranked neurons, based on their relevance score, are then selected and used in the expanded network for the new task. The expanded network is trained on both the old and new tasks. This process is repeated for each new task, allowing the expanded network to selectively retain knowledge from previous tasks while also efficiently allocating resources to learn new information. By selecting neurons based on their relevance to the new task, DEN effectively prevents semantic drift and retains important knowledge from previous tasks. Additionally, the dynamically expandable nature of the network allows it to efficiently allocate resources to new tasks as they arrive, improving its overall performance in lifelong learning scenarios.
Iman et al. [53] propose a two-step training approach to prevent catastrophic forgetting and over-biasing of the model. In the first step, the network is trained with a high learning rate and then fine-tuned to adjust the weights of the network. In the second step, to prevent catastrophic forgetting, the pre-trained layers are kept frozen, and the network capacity is expanded by adding new neurons and layers to handle the new task’s complexity. In this second step, selective training is performed only on the newly added neurons and layers, allowing the network to learn new information without overwriting previously learned knowledge. This approach improves the model’s ability to handle complex lifelong learning scenarios while maintaining a balanced representation of old and new knowledge.
Mousser et al. [59] designed the Incremental Deep Tree (IDT) framework to enable CNNs to learn new classes incrementally without forgetting previously learned information. The framework organizes the learning process in a tree-like structure, where each node represents a different class or task. This hierarchical approach helps in isolating the learning of new classes from the existing ones, thereby reducing interference and preventing catastrophic forgetting. Instead of retraining the entire network from scratch, the IDT framework updates only the relevant parts of the network. This selective updating mechanism ensures that the model retains its performance on previously learned tasks while incorporating new information.

4.3 Progressive neural network

Rusu et al. [38] proposed the progressive neural network (PNN) to overcome catastrophic interference in lifelong learning. PNN starts with a single-column network for one task, which consists of multiple layers of neurons similar to a conventional deep neural network, and adds a new column for each subsequent task or label. During the learning process of the second task, only the weights in the second column are updated, while the weights in the first column are kept frozen to maintain past knowledge and avoid catastrophic forgetting. Figure 6 illustrates a PNN with three columns, where each block in a column represents a hidden layer, and the third column is added for the final task. During the training process of the third task, only the green connections are updated, and the remaining connections are kept frozen to avoid catastrophic forgetting. The alpha box serves as a lateral connection, also known as an adapter, to ensure that previously learned features are reused, modified, or ignored, depending on their relevance to the current task. As the number of columns increases with the number of tasks, the network becomes increasingly complex. As in [39], a constant number of layers is added by [38] without measuring the difficulty of the task, so network complexity grows with every new task, which needs to be considered in future work.
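Omitting the adapter nonlinearity for brevity, the activation of layer i in column k of a PNN combines its own previous layer with lateral connections from all earlier, frozen columns, which can be sketched as
$$\begin{aligned} h^{(k)}_i = f\Big ( W^{(k)}_i \, h^{(k)}_{i-1} + \sum _{j<k} U^{(k:j)}_i \, h^{(j)}_{i-1} \Big ) \end{aligned}$$
where f is the activation function, \(W^{(k)}_i\) are the within-column weights, and \(U^{(k:j)}_i\) are the lateral (adapter) weights from column j to column k.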
In their recent paper, Ergun et al. [60] proposed a novel approach to address the issue of network complexity in progressive neural networks (PNNs) [38, 39]. Specifically, they introduced a new variant of PNN called the recursive progressive neural network (R-PNN), which incorporates sparse group LASSO regularization [61] to achieve better generalization performance by pruning unnecessary neurons from the network. The authors employed both \(l_1\) and \(l_2\) regularization techniques to promote sparsity in the network connections. By applying sparse group LASSO regularization, they were able to eliminate redundant neurons by setting all outgoing connections of a specific neuron to zero. Table 4 presents a list of articles on catastrophic forgetting, highlighting the methodologies employed, along with their advantages and disadvantages.
Table 4
Catastrophic forgetting methods with their positive (+) and negative (-) remarks

[22] Ensemble of base classifiers
  - Weights of base classifiers are updated using exponential gradient descent in an online fashion

[30] Investigation of learning algorithms and activation functions for catastrophic forgetting
  + Recommends the dropout algorithm with the maxout activation function
  + The investigation suggests that the choice of learning algorithm has a greater impact on catastrophic forgetting than the choice of activation function

[32] Self-Refreshing Memory (SRM)
  + Knowledge transfer from the memory module to the model is more efficient for sequential learning than for concurrent learning
  - Having two modules, one for storing historical patterns, requires more memory than a single network

[34] Understanding of catastrophic forgetting
  + Total complexity has a strong correlation with catastrophic forgetting
  + Sequential heterogeneity has a weak correlation with catastrophic forgetting

[36] Selective retraining
  + Selective retraining reduces the retraining cost

[38] Progressive Neural Network
  + The catastrophic forgetting problem is solved by freezing previously learned columns
  - New column weights are initialized randomly; this could be done with some prior knowledge, as tasks are more or less correlated

[51] Maximum Information Index (MICI)
  + Capable of achieving a trade-off between plasticity and stability
  - Every change in distribution creates a new hidden layer, which may result in a very complex network in the end

[53] Network expansion (vertical and horizontal)
  + Perfect samples are added to the final training data to avoid catastrophic forgetting
  - Difficult to decide which samples are perfect for avoiding forgetting related to a specific class

[55] Elastic Weight Consolidation (EWC)
  + Weights of the neural network are pulled back toward their previous values if their impact on the previous task is greater than on the new one

[58] Hedging Algorithm
  - Prior weights for the algorithms are not required; each algorithm is assigned an equal weight at the start, and the weights sum to 1

[60] Least Absolute Shrinkage and Selection Operator (LASSO) regularization
  + Important parameters are changed via the Fisher Information Matrix (FIM)
  - Calculating the FIM and storing model parameters offline is costly

[62] Neuron selection using online clustering
  + Neurons are selected to minimize activation overlap using online clustering
  + A centroid is assigned to each neuron for clustering purposes

[59] Incremental Deep Tree (IDT)
  + The hierarchical structure of the network easily allows selective training
  + The framework is compared with three other methods on three benchmark datasets
  - The hierarchical structure of the IDT framework can become complex and computationally expensive as the number of classes increases, which can make it challenging to scale the model to very large datasets with numerous classes

[56] Speculative Backpropagation (SB) and activation history
  + 4.4% improvement in knowledge preservation compared to state-of-the-art techniques
  + 31% reduction in training time, making continual learning more efficient
  - Scalability to very large datasets or complex tasks remains to be fully tested

5 Skewed learning

In the past few decades, skewed learning in data streams has been addressed using two main approaches: data-level and algorithm-level techniques. Data-level methods involve modifying the data prior to feeding it into the learning algorithm, such as through oversampling or undersampling. These approaches are algorithm-independent and can be used with any learning algorithm. Algorithm-level techniques, on the other hand, modify the learning algorithm’s training process to improve the effectiveness of classifiers on imbalanced data streams. These techniques are often more specialized for specific learning models [13].

5.1 Data-level approach

Ditzler et al. [21] proposed two algorithms to address concept drift and imbalanced data in dynamic environments. The first algorithm, Learn++ for Concept Drift with SMOTE (Learn++.CDS), combines Learn++ Non-Stationary Environment (Learn++.NSE) with SMOTE. Learn++.NSE accommodates various types of concept drift, such as slow, rapid, gradual, abrupt, and cyclical drift, while SMOTE balances the minority class ratio with the majority class. The second algorithm, Learn++ Nonstationary and Imbalanced Environment (Learn++.NIE), builds on the first algorithm by making two important updates. Firstly, it avoids using raw classification accuracy, which can be misleading as it improves overall accuracy rather than class-wise accuracy. Instead, error distribution is balanced to improve minority recall while preserving majority performance. The classifier weights are updated with class-dependent errors to avoid catastrophic inference for the minority class. Secondly, Bagging Variation is used to generate sub-ensembles of classifiers. These sub-ensemble classifiers are trained on the minority class data and an equal portion of random majority class data each time, thereby avoiding oversampling of the minority class or generating synthetic data. In this approach, a batch of samples arrives at each time stamp.
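As background for the SMOTE component of Learn++.CDS, the sketch below interpolates between a minority instance and one of its minority-class nearest neighbors to create synthetic samples. It is a simplified, NumPy-only illustration; the neighbor count and random-number handling are assumptions, not the exact procedure used in [21].

```python
import numpy as np

def smote_like_oversample(minority, n_synthetic, k=5, rng=None):
    """Generate synthetic minority samples by interpolating toward random nearest neighbors.

    minority: array of shape (n_samples, n_features) holding only minority-class instances.
    """
    rng = rng or np.random.default_rng()
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(minority))
        x = minority[i]
        distances = np.linalg.norm(minority - x, axis=1)
        neighbors = np.argsort(distances)[1:k + 1]           # k nearest neighbors, excluding x itself
        neighbor = minority[rng.choice(neighbors)]
        synthetic.append(x + rng.random() * (neighbor - x))  # random point on the segment x -> neighbor
    return np.array(synthetic)
```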
Aminian et al. [63] proposed a Chebyshev's inequality-based approach to find rare and frequent class instances in the data stream using the mean and variance of the distribution. A low value indicates rare samples, and a high value indicates frequent samples. The same information is then used to perform oversampling and undersampling to balance the class distribution in the data stream. The proposed approach is effective at both high and low levels of majority or minority presence in the data stream. However, the approach depends on the mean and variance of the data stream, which may not be effective in an evolving data stream.
Czarnowski et al. [64] proposed a hybrid framework using oversampling and instance selection that consists of three components, shown in Fig. 7: classification, summarization, and learning. The classification model consists of an ensemble with one classifier for each target class. The predicted label is the result of a weighted majority vote of the classification ensemble. After classification, incorrectly predicted instances are gathered in the form of data chunks in the summarization component; incorrect predictions are treated as the detection of concept drift. Adequate instances that can improve the learning process are selected for the learning component; the others are removed from the chunk. The chunk summarization process ensures that balance among the target class instances is maintained, which makes the summarization component a data-level approach to classifier learning.

5.2 Algorithm-level approach

Czarnowski et al. [54] proposed an algorithm-level solution for the class imbalance problem. Whenever a new data chunk arrives, a new classifier is introduced to replace the worst component in the ensemble. The worst and best components in the ensemble are identified by the weights assigned to each single-class classifier.
Li et al. [65] proposed a hybrid approach at the algorithm level for misclassification on imbalanced data streams. In this hybrid approach, the Online Sequential Extreme Learning Machine (OSELM) algorithm is combined with a cost-sensitive learning strategy. In the initialization phase, chunks are created from incoming instances, and penalty weights are calculated for both minority and majority class samples. In both cases, the penalty weight is calculated as one divided by the number of samples of that class. In this way, the penalty weight for the minority class is higher than that for the majority class, which improves misclassification caused by the imbalanced data stream.
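The class-dependent penalty described above is straightforward to express. The sketch below assigns each sample a weight equal to the reciprocal of its class count within a chunk, so minority-class errors are penalized more heavily; the function name and the toy labels are illustrative.

```python
from collections import Counter

def class_penalty_weights(labels):
    """Map each class to a penalty weight of 1 / (number of samples of that class) in the chunk."""
    counts = Counter(labels)
    return {cls: 1.0 / n for cls, n in counts.items()}

# Usage sketch: per-sample weights for a cost-sensitive loss inside one data chunk.
labels = ["majority"] * 95 + ["minority"] * 5
weights = class_penalty_weights(labels)            # {'majority': ~0.0105, 'minority': 0.2}
sample_weights = [weights[y] for y in labels]      # minority samples carry roughly 19x more weight
```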
Chen et al. [35] proposed cost-sensitive sparse online learning via truncated gradient (CSGT) to address concept drift and skewed learning on high-dimensional stream data. CSGT trades off low misclassification, achieved through a convex loss function, against a sparse linear classifier, achieved through the truncated gradient technique. Both sparsity and loss appear in the objective, and their relative influence defines the trade-off. CSGT is scalable and responds quickly owing to its low space and computational complexity. Table 5 presents a list of articles on skewed learning, highlighting the methodologies employed, along with their advantages and disadvantages.
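To make the sparsity mechanism concrete, the sketch below shows the generic truncated-gradient idea on an online logistic-regression learner; it illustrates the technique, not the CSGT algorithm itself, and the learning rate, gravity, and threshold values are assumptions:

```python
# Generic sketch of truncated-gradient sparse online learning (illustrative,
# not CSGT): after each gradient step, small weights are shrunk toward zero.
import numpy as np

def truncate(w, gravity, theta):
    """Shrink weights with |w| <= theta toward zero by `gravity`."""
    small = np.abs(w) <= theta
    w[small] = np.sign(w[small]) * np.maximum(np.abs(w[small]) - gravity, 0.0)
    return w

def online_step(w, x, y, lr=0.1, gravity=0.01, theta=0.5):
    """One online logistic-regression update followed by truncation."""
    p = 1.0 / (1.0 + np.exp(-w @ x))            # predicted probability
    w -= lr * (p - y) * x                       # gradient step on the log-loss
    return truncate(w, lr * gravity, theta)

rng = np.random.default_rng(0)
w = np.zeros(20)
for _ in range(1000):                           # assumed synthetic stream
    x = rng.normal(size=20)
    y = float(x[0] + 0.5 * x[1] > 0)            # only two informative features
    w = online_step(w, x, y)
print(np.count_nonzero(np.abs(w) > 1e-8), "non-zero weights after truncation")
```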
Table 5
Skewed learning methods; remarks are marked + (positive) and − (negative)

[21] Learn++ for Non-Stationary Environments (Learn++.NSE) and Learn++ for Nonstationary and Imbalanced Environments (Learn++.NIE)
+ Learn++.NSE is combined with SMOTE (Learn++.CDS) for learning from imbalanced data
+ Learn++.NIE uses class-wise accuracy instead of raw accuracy
+ Weights of the ensemble classifiers are updated with the help of class-wise errors
[35] Cost-sensitive sparse online learning via truncated gradient
+ Trade-off between low misclassification and high sparsity
− No novel-task detection on streaming data, which makes it inefficient when new tasks arrive
[39] Incremental feature learning (adding and merging)
+ Outperforms the denoising autoencoder in classification problems
+ Effective in learning new problems from streaming data
− Searching for similar features for merging is expensive
[54] Oversampling and instance selection
+ Sampling delay avoided by performing it after classification
− Concept drift detected only through incorrect predictions
− Classifier updates delayed by the summarization component
[63] Undersampling and oversampling via Chebyshev's inequality
+ Both high and low levels of imbalance in the data stream are targeted
− The method is based on mean and variance, which may be ineffective on an evolving data stream
[64] A hybrid approach using oversampling and instance selection
+ The chunk summarization process maintains balance among class instances
− Only incorrect instances are gathered in the summarization component, on the assumption that incorrect predictions are caused by concept drift
− Incorrect predictions may, however, also result from the model's own limitations
[65] Online Sequential Extreme Learning Machine (OSELM) combined with a cost-sensitive learning strategy
+ No artificial oversampling and no instance removal without investigation
− Concept drift is not addressed at all
− Classifier updates delayed by the summarization component

6 Network adaptation

In batch learning, the network capacity problem in neural networks is solved with the help of validation. In an online setting, the full dataset is not available in advance, so validation-based capacity selection is unrealistic and network capacity must be decided at the beginning. In a static network, the architecture and capability of the network are fixed after training, which may limit inference quality and efficiency [66–69]. A dynamic network, on the other hand, can adapt its architecture, including structure and parameters, to obtain favorable trade-offs for the application at hand. Because of their adaptive nature, these networks can trade off different performance metrics for different target hardware and dynamic environments, according to the application. Apart from their flexibility, these networks are compatible with recently advanced architectures [70, 71], optimization techniques [72, 73], and data preprocessing methods [73, 74]. Much like the human brain, dynamic networks process information dynamically [75, 76], analyzing which part of the input is most useful for prediction [77]. Changes in the network architecture can be based on input samples, spatial information, or temporal information. This article targets input-wise dynamic networks, whose behavior depends on the incoming samples of the data stream: the network adjusts its architecture and parameters according to the nature of each sample to reduce redundancy, increase efficiency via the architecture, and improve inference with a minimal increase in computational cost via parameter adjustment.
Adjusting architecture means changing the depth and width of the network according to the input sample. Network architecture may change by performing dynamic routing according to the input sample. As compared to the static model, the dynamic model achieves efficiency by using a shallow network for easy samples and a deep network for hard samples [78–80]. Bohnstingl et al. [81] introduce an online spatiotemporal learning framework that allows deep recurrent and spiking neural networks to adapt continuously. This framework contrasts with traditional backpropagation methods, which require extensive offline computations. The ability to learn in real time is further supported by the findings of Dongjin et al. [82], who explore deep joint learning for radar signal modulation recognition, demonstrating the potential for adaptive learning in signal processing applications. These advancements indicate a shift toward more flexible and responsive neural network architectures that can thrive in dynamic environments.

6.1 Dynamic depth

Sahoo et al. [37] identify the appropriate depth of the neural network from the performance of a classifier attached at each depth. Discount, learning rate, and smoothing parameters are provided as input parameters to the network. Three sets of parameters need to be learned: the classifier (hedge) weights, initialized uniformly, which determine each classifier's contribution to the global output; the classification weights, which produce each local output; and the network weights. These parameters are updated using online gradient descent with hedge backpropagation instead of conventional backpropagation. ReLU is used as the activation function in the hidden layers, and Softmax at the output layer. Shallow layers perform well early on because shallow networks converge quickly with little data, while the performance of deeper layers improves over time as more data arrive. In short, [37] shows that shallow networks converge faster than complex ones, whereas complex networks gradually improve their performance according to the amount of data.
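A much-simplified sketch of the hedging idea is shown below: each depth has its own classifier, and the hedge weights that combine their outputs decay multiplicatively with each classifier's loss, so better-performing depths gradually receive more weight. The discount factor, smoothing value, and the fixed per-depth losses are assumptions; the full HBP algorithm in [37] also backpropagates through the shared layers:

```python
# Minimal sketch of the hedging idea behind HBP (simplified, not the full
# algorithm): per-depth classifier weights decay with each classifier's loss.
import numpy as np

n_depths, beta, smoothing = 4, 0.99, 1e-3
alpha = np.ones(n_depths) / n_depths            # hedge weights, one per depth

def hedge_update(alpha, losses):
    alpha = alpha * (beta ** losses)            # multiplicative decay by loss
    alpha = np.maximum(alpha / alpha.sum(), smoothing)  # normalize + smooth
    return alpha / alpha.sum()

# Assumed per-depth losses for incoming samples (shallow depths do better early):
losses = np.array([0.2, 0.4, 0.9, 1.3])
for _ in range(100):
    alpha = hedge_update(alpha, losses)
print(alpha)                                    # weight shifts toward shallower depths
```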
The depth of a dynamic network grows quickly when hard samples arrive, which makes prediction needlessly expensive, in terms of latency, for easy samples. Early exit and layer skipping are two possible solutions to this issue. Because the complexity of input samples varies in dynamic, real-world applications, using the same network depth for easy samples is naive when prediction latency matters. When input complexity varies widely and prediction latency must be reduced, early exit is the better solution: an easy sample is predicted by the shallow layers appropriate to its complexity level, while the remaining deep layers are skipped [83–85].

6.1.1 Early exit

An initial approach to early exit for easy samples is a cascade of CNNs consisting of multiple models. Big/Little-Net [86] cascades two network models of different depths; an early exit is taken when the score difference between class labels exceeds a certain threshold. However, this solution only works for binary classification. The works in [70, 85, 87–90] cascade multiple CNNs to solve multi-class problems: a decision function evaluates whether the features obtained from the previous model are sufficient to feed a linear classifier for prediction or should be forwarded to the next model. In these procedures, the models work independently, i.e., each model starts from scratch without reusing the features learned by the previous model. Later, backbone networks were proposed that decide among multiple early exits based on a threshold [84, 91] or a learned function [92, 93], and that reuse the features learned by earlier stages in the intermediate models.
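A minimal sketch of threshold-based early exit is given below: prediction stops at the first exit head whose softmax confidence exceeds a threshold. The threshold value and the toy logits are assumptions, not taken from any particular cited method:

```python
# Sketch of threshold-based early exit over a cascade of exit heads
# (illustrative; the threshold and the toy logits are assumptions).
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def cascade_predict(logits_per_exit, threshold=0.9):
    """Return (prediction, exit_index): stop at the first exit whose top
    softmax probability exceeds the confidence threshold."""
    for i, logits in enumerate(logits_per_exit):
        probs = softmax(logits)
        if probs.max() >= threshold:
            return int(probs.argmax()), i
    return int(probs.argmax()), len(logits_per_exit) - 1   # fall back to last exit

# Assumed logits from three exit heads for one "easy" sample:
exits = [np.array([4.0, 0.1, 0.2]),
         np.array([5.0, 0.3, 0.1]),
         np.array([6.0, 0.2, 0.4])]
print(cascade_predict(exits))                   # exits at the first head
```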

6.1.2 Skipping layers

In the above procedures, all layers after a certain point are skipped for an early exit. A more flexible alternative, built on ResNet [70], skips only intermediate layers and continues executing the remaining deep layers. The halting score, a scalar value first proposed by [66], is used to decide whether the features learned by the previous layer are fed to the next layer or not. This procedure was later improved by [94] for vision tasks via residual blocks. The work in [95] reduces the number of halting-score evaluations by grouping multiple layers into a weight-sharing block and evaluating the halting score after each block instead of after each layer. Ashfahani et al. [51] adapt network depth by adding or pruning hidden layers with the Drift Detection Scenario (DDS) method: a new hidden layer is added when drift is signaled, which corresponds to a high-bias situation, and layers are pruned by analyzing the mutual information across hidden layers, so that if two layers carry the same information, one of them is pruned to reduce network depth. In DDS, the thresholds for depth adaptation are self-learned. Zhou et al. [39] proposed an incremental feature learning algorithm to determine the optimal network capacity for a complex problem; it consists of two strategies, adding new features (hidden neurons or layers) and merging similar features (hidden layers).
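The sketch below illustrates the gating idea behind layer skipping in a much-simplified form: each residual block computes a scalar score from its input and runs its body only when the score exceeds a threshold. The gate design, the per-batch (rather than per-sample) decision, and the threshold are assumptions; real methods train such hard decisions with techniques such as reinforcement learning or Gumbel-softmax relaxation:

```python
# Simplified sketch of layer skipping with a per-block gating/halting score
# (illustrative; gate, threshold, and block sizes are assumptions).
import torch
import torch.nn as nn

class SkippableBlock(nn.Module):
    def __init__(self, dim, threshold=0.5):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.gate = nn.Linear(dim, 1)           # produces the skip/halting score
        self.threshold = threshold

    def forward(self, x):
        score = torch.sigmoid(self.gate(x.mean(dim=0, keepdim=True)))
        if score.item() < self.threshold:       # "easy" input: skip this block
            return x
        return x + self.body(x)                 # "hard" input: run the residual body

blocks = nn.Sequential(*[SkippableBlock(16) for _ in range(6)])
x = torch.randn(8, 16)                          # assumed batch of 8 samples
with torch.no_grad():
    y = blocks(x)                               # blocks whose gate fires run; others are skipped
```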

6.2 Dynamic width

Network width can be controlled by skipping neurons, branches, or channels. In a fully connected network, different neurons are responsible for different feature representations depending on the nature of the sample. Initially, neuron activations were controlled by auxiliary branches [96–98] and low-rank approximation [99]. A soft mixture of experts consists of multiple network branches built in parallel, whose outputs are fused according to a weighting policy [100–102]. Later, hard gates were developed for branch selection, skipping some branches to improve inference by decreasing prediction latency [77, 103, 104].
Multi-stage architectures can be constructed along the width (channel) dimension, allowing early predictions to be made with considerable confidence [105]. One method for achieving this is the Channel Gating Network (CGNet) [106], which activates a portion of the convolutional filters and selects subsequent filters based on specific criteria. Another approach is to dynamically decide which channels to activate at each stage, as done in Runtime Neural Pruning (RNP) [107], which performs layer-wise pruning using a Markov decision process. A gate model can also be used to decide the width of a stage in ResNet [108], though this requires training and optimization. Several other solutions [109–111] prune both depth and width dynamically, allowing for more flexible networks. These methods can improve network efficiency and accuracy by removing unnecessary connections while retaining important ones.
Yoon et al. [36] addressed the scalability and efficiency of networks in an online setting with the Dynamically Expandable Network (DEN). If selective retraining fails to produce satisfactory performance, the network is expanded in a top-down manner, and unnecessary neurons are removed with group sparsity regularization. DEN estimates the amount of semantic drift from the neurons related to previous and new tasks; if the difference exceeds a limit, the neurons for the new task are duplicated. DEN was also compared with batch training: both achieve the same performance, although DEN uses less network capacity than batch training, and after fine-tuning DEN outperforms it.
Ashfahani et al. [51] proposed a self-organizing network, also known as autonomous deep learning. Network width is adapted with the network significance (NS) method, which adds and prunes hidden neurons, and adaptation is decided from the bias and variance of the network. Bias is measured as the difference between the actual output and the average predicted output; a high bias indicates underfitting, which can be overcome by adding new hidden neurons. Variance is defined as variability in prediction; a high variance indicates overfitting, where the network fits the observed data well but may underperform on unforeseen data, and can be overcome by pruning hidden neurons. In the NS method, the thresholds for bias and variance are not user-defined but self-learned.
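The decision logic can be sketched as a simple grow-and-prune controller driven by bias and variance estimates; the self-adapting thresholds below (running mean plus one standard deviation) are an illustrative assumption and differ from the exact NS formulas in [51]:

```python
# Sketch of a bias/variance-driven grow-and-prune rule in the spirit of the
# NS method (thresholds here are simple running mean + std rules).
import numpy as np

class WidthController:
    def __init__(self):
        self.bias_hist, self.var_hist = [], []

    def decide(self, bias, variance):
        self.bias_hist.append(bias)
        self.var_hist.append(variance)
        b_mu, b_sd = np.mean(self.bias_hist), np.std(self.bias_hist)
        v_mu, v_sd = np.mean(self.var_hist), np.std(self.var_hist)
        if bias > b_mu + b_sd:                  # underfitting signal
            return "add_neuron"
        if variance > v_mu + v_sd:              # overfitting signal
            return "prune_neuron"
        return "keep"

ctrl = WidthController()
for bias, var in [(0.2, 0.05), (0.25, 0.04), (0.9, 0.05), (0.2, 0.5)]:
    print(ctrl.decide(bias, var))               # keep, keep, add_neuron, prune_neuron
```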
Routing nodes are responsible for selecting different paths in dynamic routing. CapsuleNets [69, 112] perform dynamic routing between capsules to capture relations among objects or parts of objects; capsules are groups of neurons, and the capsule parameters need to be trained, fine-tuned, and optimized. Another approach is to dynamically adjust the parameters of the network, at little extra computational cost compared with dynamic architectures, while keeping the architecture static; however, this article targets only dynamic architectures. Ren et al. [52] employ a dynamic masking strategy that supports inter-distribution transfer: by overlapping a set of sparse networks, the model can adapt to different distributions without retraining from scratch, ensuring that the network can flexibly adjust to varying data regimes. Table 6 presents a list of articles on network adaptation, highlighting the methodologies employed, along with their advantages and disadvantages.
Table 6
Network adaptation methods; remarks are marked + (positive) and − (negative)

[36] Dynamically Expandable Network (DEN)
+ Additional neurons are added for the new task if required
[37] Hedge Backpropagation (HBP) method
+ Starts with a shallow network and moves toward deeper networks
− The depth of the neural network must be decided a priori without any information about the task
[38] Progressive Neural Network
+ Network adapted by adding a new column for each new task
− Random initialization of new column weights affects convergence
− Previously learned weights for the new column may contain basic feature information
[53] Network expansion (vertically and horizontally)
− Network capacity increases adaptively, which may result in a very large network
[51] Network Significance
+ Can achieve a trade-off between plasticity and stability
− Every change in distribution creates a new hidden layer, which may result in a very complex network in the end
[60] Training with recursive connections
+ Important parameters are adjusted via the Fisher Information Matrix (FIM)
− Calculating the FIM and storing model parameters offline is costly
[103] HydraNet for transforming static architectures into dynamic ones
+ Multiple branches specialized for different inputs, with a gate responsible for choosing branches
− The computational cost of the stem, gating, and combiner components itself increases the overall cost of HydraNet
[104] Dynamic Routing Networks (DRNets)
+ Input instances are routed to only the necessary branches from the candidate set in the network
+ Dynamic branch selection depends on branch-weight importance
+ Branch weights are generated by a lightweight hypernetwork and later re-calibrated using softmax
[105] Transforming various static CNN models into multi-stage models to support dynamic inference
+ The method is scalable to various CNNs
− The method depends on a threshold, which governs the trade-off between accuracy and computational cost
[106] Channel Gating Neural Networks
+ Channel gating skips computation on ineffective regions of the input
− If the computational cost of channel gating itself exceeds the cost reduction for the network, channel gating is inefficient
[108] Dynamic Slimmable Network (DS-Net)
+ Dynamically adjusts the number of filters according to the input instance
+ Filters are stored statically and contiguously in hardware to avoid extra overhead
+ A double-headed dynamic gate, consisting of an attention head and a slimming head, adjusts network width
[110] Layer-Net (L-Net) and Channel-Net (C-Net), combined as LC-Net
+ Dynamically decides per input instance which layers or channels to skip or scale
+ ReLU-1 activation function reduces computational cost and improves accuracy
[111] Dynamic channel and layer gating
+ Lightweight gating modules make the binary decision of whether to execute a particular channel or layer
+ Combining channel and layer gating reduces computational cost significantly
− The decision of the lightweight gating modules depends on the input instances
[81] Online Spatio-Temporal Learning (OSTL)
+ Real-time adaptation in spiking neural networks (SNNs) and recurrent neural networks (RNNs)
+ Achieves gradient equivalence to backpropagation through time (BPTT) for shallow networks, enabling efficient online training with performance comparable to traditional offline methods
− While OSTL is effective for shallow networks, applying it to deeper networks may involve increased computational complexity and resource requirements
Table 7
Summary of datasets used in the literature
Refs.
Artificial datasets
Real-World datasets
[18]
1) SINE1. 2) SINE2. 3) SINIRRELS2. 4) CIRCLES. 5) GAUSS. 6) STAGGER. 7) Mixed
1) ELEC2
[23]
1) Sea. 2) Hyperplane. 3) RBFblips. 4) LED. 5) Tree
1) Electricity. 2) Kdd-cup99. 3) Covertype. 4) Weather
[24]
1) SEA concepts. 2) Rotating Hyperplane. 3) Moving and Interchanging RBF. 4) Moving squares. 5) Transient chessboard. 6) Mixed Drift
1) Weather. 2) Electricity. 3) Covertype. 4) Poker Hand. 5) Outdoor. 6) Rialto
[26]
SEA stream
1) Electricity. 2) Airline. 3) Spam Filtering
[27]
1) Gaussian Distribution. 2) Multivariate Gaussian Distribution. 3) two-class rotating mixture of Gaussian Distribution. 4) slow drift within the radius. 5) moving hyperplane problem. 6) STAGGER problem
1) Electrical Energy dataset
[28]
1) 1D. 2) 2D
1) Electricity prices. 2) Nebraska weather prediction dataset
[29]
Balance-scale
1) Breast tissue. 2) Ecoli. 3) Glass identification. 4) Haberman’s Survival. 5) IRIS. 6) Transfusion. 7) Vertebral 2 and 3 classes. 8) Wine. 9) Yeast. 10) Zoo
[35]
1) IJCNN’01. 2) W8A
1) USPS
[36]
MNIST variation
1) CIFAR-100. 2) Animal with Attributes (AWA)
[37]
1) Infinite-MNIST. 2) Syn8. 3) Concept Drift (CD1 and CD2) datasets. 4) Susy. 5) Higgs
 
[38]
1) Atari game. 2) 3D Maze game
[39]
1) Rect-images dataset. 2) MNIST variation
[42]
1) GUASS. 2) SINE
1) Electricity. 2) Colonoscopic video sequencing
[44]
1) Telecommunication dataset (3D)
[45]
Moving hyperplane
1) Spam Filtering. 2) Electricity. 3) Covertype 4) Sensors (temperature, humidity, light, and sensor voltage) dataset
[46]
1) Disk with Empty Circles (DEMC). 2) Disk with dense circle (DEMC). 3) Swiss Roll (SWRL)
1) Jogging, Walking and Elnino (3D). 2) Spruce, Lodgepole pine (10D). 3) Ascending Stairs, Cycling, Descending Stairs, Ironing, and vacuum cleaning (30D)
[47]
2D normal distribution
[51]
1) Hyperplane. 2) Susy, Hepmass. 3) Sea. 4) Susy. 5) Permuted MNIST
1) Weather. 2) KDDCup. 3) RLCPS. 4) RFID Localization
[53]
1) MNIST
[54]
1) Heart. 2) Diabetes. 3) WBC. 4) ACredit. 5) GCredit. 6) Sonar. 7) Satellite. 8) Banana. 9) Image. 10) Thyroid. 11) Spambase. 12) Twonorm
[58]
1) Horse Racing dataset
[59]
1) MNIST dataset 2) BreakHis dataset 3) Pap Smear datasets
[113]
1) CIFAR-100 2) mini-ImageNet dataset
[60]
1) MNIST. 2) CIFAR-100
[61]
1) 1D. 2) 2D
1) Sensorless Drive Diagnosis (SDD). 2) MNIST Handwritten. 3) Forest Covertypes (COVER)
[63]
1) Pumadyn. 2) Computer activity. 3) Elevators. 4) MV artificial domain. 5) Query Analytics Workload
1) Bike sharing. 2) Appliance energy prediction. 3) California housing. 4) Friedman Artificial Domain. 5) Beijing PM2.5 Data. 6) US used car sales data. 7) 3D road network
[65]
1) Pima 2) Yeast1 3) Haberma 4) Vehicle0 5) Segment0 6) Yeast3 7) Ecoli3 8) Page-blocks0 9) Vowel0
[67]
1) CIFAR-10 2) CIFAR-100 3) ImageNet
[74]
1) CIFAR-10 2) CIFAR-100 3) SVHN 4) ImageNet
[77, 87]
1) ImageNet
[80]
1) CIFAR 2) SVHN 3) ImageNet 4) MNIST
[89]
1) ILSVRC 2014
[93]
1) Max-Min MNIST. 2) Multi-scale Fashion MNIST. 3) CIFAR-10
[105]
1) CIFAR-10 2) CIFAR-100
[103, 106]
1) CIFAR-10 2) ImageNet
[114]
Point of Interest (POI) recommendation 1) New York 2) Tokyo [115]

7 Applications

7.1 Smart city

In the context of smart cities, online learning techniques are increasingly being applied to optimize energy harvesting and information decoding for Internet of Things (IoT) devices. These devices are capable of simultaneous wireless information and power reception, which is crucial for maintaining efficient and sustainable urban environments. Chun et al. [116] and Luo et al. [117] developed frameworks that jointly optimize energy harvesting and information decoding for IoT devices; these involve the design of a generalized power-splitting receiver in which each antenna has an independent power splitter, thereby enhancing network performance. The primary objective of Lee et al. [118] and Al et al. [119] is to maximize the harvested energy for each IoT device while meeting data rate requirements; to achieve this, a double-deep deterministic policy gradient-based online learning algorithm has been proposed. The algorithms of Tang et al. [120] and Li et al. [121] allow each IoT device to determine its receive beamforming and power-splitting ratio vectors in real time and can be implemented in a distributed manner using only local channel state information, which eliminates the need for cooperation and information exchange among base stations and IoT devices and makes the system more efficient. Lee et al. [118] report extensive simulations that validate the effectiveness of the proposed algorithm, demonstrating significant improvements in energy harvesting and information decoding efficiency. These approaches highlight the potential of online learning to enhance the functionality and sustainability of smart cities by optimizing the performance of IoT devices in real time. Wang et al. [114] developed a novel deep interactive reinforcement learning framework to enhance the accuracy and relevance of POI recommendations within smart cities. The methodology involves modeling dynamic interactions between users and geospatial contexts through a dynamic knowledge graph stream, capturing human–human, geo–human, and geo–geo interactions. The recommendation process is treated as a series of actions taken by an agent in response to environmental changes, including user visits and POI updates. Jiang et al. [122] introduce the EduHawkes framework, which leverages the Neural Hawkes Process to model online study behaviors; it employs a hierarchical encoder–decoder architecture to simultaneously optimize two tasks, study behavior prediction (event level) and study quality prediction (course level).

7.2 Finance

Online learning algorithms are particularly effective in predicting stock prices as they can continuously update their models with new market data. This allows them to adapt to changing market conditions and improve the accuracy of their predictions over time. For instance, a model might initially predict stock prices based on historical data, but as new data come in, it can adjust its predictions to reflect recent trends and events. Padhi et al. [123] present a two-stage framework for stock market prediction. First, it uses the mean–variance approach for portfolio construction to minimize investment risk. Then, it employs an online machine learning technique combining perceptron and passive-aggressive algorithms to predict future stock price movements. The algorithm balances between being passive (not changing the model much) and aggressive (updating the model significantly) based on the prediction error. This makes it robust for online learning scenarios where quick adjustments are needed.
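As a hedged illustration of this online-update pattern (the features, labels, and hyperparameters are synthetic placeholders, not the pipeline of [123]), scikit-learn's passive-aggressive classifier can be updated one observation at a time:

```python
# Sketch: online passive-aggressive updates on a stream of (features, movement)
# pairs; the data are synthetic placeholders, not a real trading setup.
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

rng = np.random.default_rng(0)
model = PassiveAggressiveClassifier(C=0.1)
classes = np.array([0, 1])                      # 0 = price down, 1 = price up

for t in range(500):                            # assumed streaming ticks
    x = rng.normal(size=(1, 10))                # e.g. lagged returns / indicators
    y = np.array([int(x[0, 0] + 0.3 * x[0, 1] > 0)])
    if t > 0:
        pred = model.predict(x)                 # predict before updating
    model.partial_fit(x, y, classes=classes)    # aggressive update only on error
```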
In algorithmic trading, online learning models can optimize trading strategies by learning from real-time data [124]. These models can make buy and sell decisions based on the latest market information, helping traders to maximize their profits. For example, an online learning model might identify a pattern in stock price movements and adjust its trading strategy accordingly, leading to more profitable trades. Tsantekidis et al. [125] approach involves diversity-driven knowledge distillation, where multiple teacher agents are trained on different subsets of real-time streaming data to learn diverse trading policies. These policies are then distilled into a student agent, enhancing its ability to adapt to noisy financial environments and improving overall trading performance. This method leverages online and incremental learning to continuously update and refine trading strategies, demonstrating substantial improvements in both stability and effectiveness in dynamic market conditions.
Online portfolio selection is a dynamic and evolving field that leverages advanced machine learning techniques to optimize investment strategies in real time. In portfolio management, online learning can assist in dynamically adjusting investment portfolios to optimize returns and minimize risks. By continuously learning from new market data, these models can make more informed decisions about which assets to buy or sell. The foundational work by Cesa-Bianchi and Lugosi [126] in "Prediction, Learning, and Games" provides a comprehensive framework for understanding the theoretical underpinnings of sequential decision-making and prediction, which are crucial for developing robust online portfolio selection algorithms. Building on this foundation, the survey by Li and Hoi [127], "Online Portfolio Selection: A Survey," offers a detailed overview of various state-of-the-art approaches, categorizing them into benchmarks, "Follow-the-Winner," "Follow-the-Loser," "Pattern-Matching," and "Meta-Learning" algorithms. Their subsequent book [128], "Online Portfolio Selection: Principles and Algorithms," further elaborates on these principles and introduces innovative strategies that utilize machine learning for financial investment. The development of practical tools, such as the OLPS toolbox by Li et al. [129], has significantly advanced the field by providing an open-source platform for implementing and benchmarking various online portfolio selection strategies. Additionally, the work on transaction cost optimization by Li et al. [130] addresses the practical challenges of incorporating transaction costs into online portfolio selection, proposing a novel framework that enhances the performance of existing strategies under realistic trading conditions. Together, these contributions highlight the interdisciplinary nature of online portfolio selection, integrating concepts from finance, machine learning, and optimization to create sophisticated and effective investment strategies.
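For illustration only, the sketch below applies a simple follow-the-winner style exponentiated-gradient update to portfolio weights from observed price relatives; it is not a specific algorithm from the cited works and is not investment advice:

```python
# Minimal "follow-the-winner" style sketch: exponentiated-gradient update of
# portfolio weights from observed period returns (illustrative only).
import numpy as np

def eg_update(weights, price_relatives, eta=0.05):
    """Shift weight toward assets that did well in the last period."""
    growth = weights @ price_relatives
    weights = weights * np.exp(eta * price_relatives / growth)
    return weights / weights.sum()

rng = np.random.default_rng(0)
n_assets = 4
w = np.ones(n_assets) / n_assets                # start from the uniform portfolio
for _ in range(250):                            # assumed trading periods
    x = 1.0 + rng.normal(0.001, 0.02, n_assets) # price relatives p_t / p_{t-1}
    w = eg_update(w, x)
print(np.round(w, 3))
```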

7.3 Healthcare systems

Online learning has significant applications in healthcare systems, particularly in real-time health monitoring for several devices. Le Sun and Yueyuan Wang [131] introduce an energy-efficient online time series classification algorithm called OTCD. This algorithm is designed to handle challenges such as concept drift and catastrophic forgetting, making it highly suitable for continuous monitoring of health data like electrocardiograms (ECG) and photoplethysmograms (PPG). By efficiently processing and classifying time series data on edge devices, OTCD enables timely and accurate health assessments, which are crucial for early detection and intervention in various medical conditions.
Sana Ayromlou et al. [132] introduce a novel data-free class incremental learning framework called Continual Class-Specific Impression (CCSI). This framework addresses the challenge of catastrophic forgetting in deep learning models, which is crucial for continuously updating healthcare systems with new disease types. CCSI ensures privacy and complies with storage regulations by synthesizing data from previously learned classes and combining it with new class data. This approach has demonstrated significant improvements in classification accuracy on various medical datasets, making it a valuable tool for real-time health monitoring and diagnosis.
Sun et al. [133] introduce an algorithm called Prevent Concept Drift in Online Continual Learning (PCDOL). This algorithm is designed to handle challenges such as concept drift and catastrophic forgetting, which are crucial for maintaining accurate and up-to-date health monitoring systems. PCDOL is energy-efficient, requiring minimal computational power and memory, making it ideal for use in nanorobots that collect and analyze health data like electrocardiograms (ECG) and electroencephalograms (EEG). The experimental results demonstrate that PCDOL outperforms several state-of-the-art methods in handling these challenges, ensuring reliable and efficient health monitoring.
Fatemeh Amrollahi et al. [134] introduce a privacy-preserving continual learning algorithm named Weight Uncertainty Propagation and Episodic Representation Replay (WUPERR). This algorithm addresses the challenge of catastrophic forgetting and maintains high predictive performance across different healthcare institutions. Validated using data from over 104,000 patients across four distinct healthcare systems for early sepsis prediction, WUPERR demonstrated superior performance compared to baseline transfer learning approaches. This approach ensures privacy and enhances the generalizability of predictive models, making it a valuable tool for real-time health monitoring and diagnosis.
Mengya Xu et al. [135] introduce a privacy-preserving synthetic continual semantic segmentation framework designed to enhance the precision of robotic-assisted surgeries. This framework addresses the challenge of catastrophic forgetting in deep neural networks by blending open-source old instrument foregrounds with synthesized backgrounds and new instrument foregrounds with extensively augmented real backgrounds. This approach ensures that real patient data are not revealed, maintaining privacy. The framework also incorporates overlapping class-aware temperature normalization (CAT) and multi-scale shifted-feature distillation (SD) to maintain long- and short-range spatial relationships among semantic objects. The effectiveness of this framework was demonstrated on the EndoVis 2017 and 2018 instrument segmentation datasets, making it a valuable tool for real-time health monitoring and diagnosis.

8 Ethical implication

Online deep learning operates in environments where data are continuously generated and processed. This dynamic nature presents unique challenges in ensuring that the models developed are not only effective but also ethical and fair. The integration of ethical considerations into the design and deployment of online learning systems is crucial, especially as these systems increasingly influence decision-making processes across various sectors.
The ethical implications of online deep learning systems primarily revolve around the potential for bias in the data streams they utilize. Bias can manifest in various forms, including historical bias [136], societal bias [137], representation bias [138, 139], and measurement bias [140, 141], which can lead to unfair treatment of certain groups or individuals when decisions are made based on model outputs. For instance, if the streaming data reflect historical inequalities or societal biases, the models trained on such data may perpetuate or even exacerbate these biases. This concern is echoed in the literature, where researchers emphasize the importance of understanding the sources of bias and implementing strategies to mitigate them.
One approach to addressing bias in online learning systems is through the implementation of fairness-aware algorithms [142]. These algorithms are designed to identify and correct biases in the training data or the model outputs. For example, techniques such as re-weighting the training samples or modifying the decision thresholds can help ensure that the model’s predictions do not disproportionately disadvantage any particular group. Moreover, continuous monitoring of model performance across different demographic groups is essential to identify and rectify any emerging biases in real time. This aligns with the findings of Lian et al., who discuss the importance of adapting online learning algorithms to account for the dynamic nature of streaming data, which can change over time and may introduce new biases [143].
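As a toy illustration of the re-weighting strategy (the group labels, data, and weighting rule are assumptions, not a method from the cited works), the sketch below gives each (group, label) cell a weight inversely proportional to its frequency so that no group dominates the training loss:

```python
# Toy sketch of fairness-oriented re-weighting: weight each (group, label)
# cell inversely to its frequency. Groups and weighting scheme are assumed.
import numpy as np
from collections import Counter

def fairness_weights(groups, labels):
    counts = Counter(zip(groups, labels))
    n, k = len(groups), len(counts)
    return np.array([n / (k * counts[(g, y)]) for g, y in zip(groups, labels)])

groups = np.array(["A"] * 80 + ["B"] * 20)      # over- and under-represented groups
labels = np.array([1] * 60 + [0] * 20 + [1] * 5 + [0] * 15)
sample_weight = fairness_weights(groups, labels)
# sample_weight can be passed to many estimators, e.g. model.fit(X, y, sample_weight=...)
```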
Furthermore, the ethical deployment of online learning systems necessitates transparency and accountability. Stakeholders, including developers, users, and affected individuals, should be informed about how models are trained, the data sources used, and the potential limitations of the models. This transparency can foster trust and enable stakeholders to understand the implications of decisions made by these systems. Additionally, establishing accountability mechanisms, such as regular audits and assessments of model performance, can help ensure that ethical standards are upheld throughout the lifecycle of the online learning system.
Another critical aspect of addressing ethical implications in online deep learning is the need for interdisciplinary collaboration. Engaging ethicists, social scientists, and domain experts in the development process can provide valuable insights into the societal impacts of these technologies. This collaborative approach can help identify ethical dilemmas early in the design phase and facilitate the development of solutions that are not only technically sound but also socially responsible. The integration of diverse perspectives can lead to more robust and equitable online learning systems that better serve the needs of all stakeholders.
Moreover, it is essential to consider the regulatory landscape surrounding online learning systems. As these technologies become more prevalent, there is a growing demand for policies and regulations that govern their use. Policymakers must work closely with researchers and practitioners to develop guidelines that promote ethical practices in online learning. These regulations should address issues such as data privacy, informed consent, and the right to explanation, ensuring that individuals are aware of how their data are used and the implications of automated decisions made by these systems.

9 Conclusion

In this paper, we have explored various challenges and solutions associated with online learning, particularly in dynamic environments where data characteristics can shift over time. As online learning applications expand into fields with complex and evolving data streams, handling issues like concept drift, catastrophic forgetting, skewed learning, and network adaptation has become increasingly essential. Each of these challenges introduces unique considerations for model design, training efficiency, and long-term performance. However, most of the existing research on these four problems has been conducted on synthetic or artificial datasets, with only a few studies using real-world datasets of a specific nature. To further validate the effectiveness of these algorithms, more experiments on dynamic and diverse real-world datasets are necessary in the future. Table 7 divides the datasets used in the reviewed literature into artificial and real-world datasets. Real-world datasets are more dynamic than synthetic or artificial ones, so future work should focus more on real-world datasets than on artificial ones. In the following sections, we summarize the key methods and approaches in each area, highlighting their strengths, limitations, and opportunities for further development.

9.1 Concept drift

The detection of concept drift based on data distribution typically involves the use of a window slot, similar to error rate-based detection. Statistical tests or distance measures are used to compare the current window slot to historical ones and detect drift. However, the effectiveness of these methods is highly dependent on the choice of threshold values, which can vary depending on the application. For high-dimensional data, dimensionality reduction techniques such as PCA may be employed.
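A minimal sketch of this window-based comparison is shown below, using a two-sample Kolmogorov–Smirnov test between a reference window and the current window; the window size, significance level, and synthetic stream are assumptions:

```python
# Sketch of distribution-based drift detection: compare a reference window to
# the current window with a two-sample KS test (window size and significance
# level are assumptions; real detectors often use other statistics/thresholds).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
window = 200
stream = np.concatenate([rng.normal(0, 1, 1000), rng.normal(1.5, 1, 1000)])  # drift at t=1000

reference = stream[:window]
for start in range(window, len(stream) - window, window):
    current = stream[start:start + window]
    stat, p = ks_2samp(reference, current)
    if p < 0.01:                                # distributions differ significantly
        print(f"drift detected in window starting at t={start}")
        reference = current                     # adapt: reset the reference window
```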
Adapting to concept drift often involves building new classifiers, which can lead to catastrophic forgetting if prior knowledge is lost. To address this issue, some methods focus on adding new features to the existing network, such as adding neurons or layers, and later merging similar features to control network growth. It is important to carefully choose when and how to add features in order to manage network capacity under concept drift and to reduce the time complexity associated with online learning.

9.2 Catastrophic forgetting

The hedging method is a promising approach to mitigate the issue of catastrophic forgetting in machine learning algorithms by addressing noise and uncertainty in the data. However, its effectiveness in handling streaming data with concept drift is limited due to the predefined structure of the network. This method utilizes an ensemble of multiple classifiers, each trained with different hyperparameters and initialization conditions, which allows for a wider exploration of policies. While this approach can improve performance, it also has some drawbacks. For example, using multiple classifiers can increase the computation and memory cost by maintaining multiple models and their weights. Additionally, optimizing the combination of multiple models can be challenging, and selecting appropriate weights for each classifier can be time-consuming. Nevertheless, the hedging method remains a promising approach for handling concept drift in streaming data and can be further improved with future research.
Selective training prevents catastrophic forgetting by updating only a subset of the weights in the network, allowing the model to retain important knowledge from previous tasks. Compared with hedging, it saves computational resources by avoiding complete retraining, and it is easy to implement because it does not require significant changes to the model architecture. However, selective training may not adapt to new tasks completely, it may lose some information by retaining only part of the weights from previous tasks, and it is difficult to determine on what basis to select the neurons or layers to update. Other methods instead store information related to previous tasks in a separate memory component, which is costly.
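In modern frameworks, the basic mechanism of selective training is simply to freeze a subset of parameters; the sketch below freezes all but the final layer in PyTorch. Which parameters to freeze is exactly the open question discussed above, and the choice here is arbitrary:

```python
# Minimal sketch of selective training in PyTorch: freeze earlier layers and
# update only the final layer when a new task arrives (choice of layers is arbitrary).
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 2))

for param in model[:-1].parameters():           # keep previously learned features fixed
    param.requires_grad = False

optimizer = torch.optim.SGD((p for p in model.parameters() if p.requires_grad), lr=0.01)
x, y = torch.randn(32, 20), torch.randint(0, 2, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()                                 # gradients flow only into the last layer
optimizer.step()
```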
The progressive neural network (PNN) is an effective technique for avoiding catastrophic forgetting by retaining previously learned knowledge from past tasks. PNN utilizes incremental learning, which reduces computational costs by avoiding the need to train separate networks for each new task. However, as the number of tasks and layers increases, PNN can become computationally expensive due to the introduction of new tasks. Furthermore, PNN has limited generalization capabilities for entirely new tasks that are different from those previously learned.

9.3 Skewed learning

To avoid skewed learning, various methods can be used at both the data level and algorithm level. The SMOTE technique is commonly used at the data level, but it may affect the performance due to the introduction of artificial data. Another method involves updating weights through class-dependent error to avoid creating artificial data. Ensemble classifiers may also be trained on minority classes with an equal portion of the majority class, but this method can result in some portion of the majority class data being useless. To improve this, it is essential to use only important data from the majority class. At the data level, the class-wise accuracy approach can be used instead of raw accuracy to avoid skewed learning. This approach is easy to implement and algorithm-independent. Oversampling of the minority class and undersampling of the majority class need to be carefully controlled, as excessive oversampling or undersampling can introduce bias and negatively impact model performance. At the algorithm level, the cost-sensitive approach is less computationally expensive than ensemble algorithms. However, the performance of the ensemble algorithm is generally better than that of the cost-sensitive approach due to the use of multiple models.
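A small sketch makes the difference between raw and class-wise accuracy concrete; the labels and the degenerate classifier below are contrived for illustration:

```python
# Small sketch: raw accuracy hides minority-class failure, class-wise
# (balanced) accuracy does not. The predictions are a contrived example.
import numpy as np

y_true = np.array([0] * 95 + [1] * 5)           # heavily imbalanced labels
y_pred = np.zeros_like(y_true)                  # classifier that ignores the minority

raw_acc = (y_true == y_pred).mean()             # 0.95, looks excellent
per_class = [(y_pred[y_true == c] == c).mean() for c in np.unique(y_true)]
balanced_acc = float(np.mean(per_class))        # (1.0 + 0.0) / 2 = 0.5
print(raw_acc, balanced_acc)
```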

9.4 Network adaptation

Determining network capacity based on classifier performance can be challenging when dealing with imbalanced class distributions. Another approach is to dynamically add or remove neurons based on the bias and variance in performance and add or remove layers for tasks with high bias and variance. Early exit can be a more practical solution for network adaptation, as it is easier to determine when to exit early based on input performance rather than skipping layers for specific tasks. Network width can be increased by adding neurons for complex tasks and reduced through pruning to improve generalization.

10 Future directions

Although there have been numerous investigations in the literature, there remain several unresolved problems and difficulties that require further exploration and community efforts in future research. Below, we outline some significant and emerging research directions for scholars who have an interest in online learning.
First of all, concept drift detection and adaptation have been extensively studied, both in single- and multi-dimensional data. Detection is the first step and adaptation the last step in handling concept drift. Adaptation mainly depends on the amount and type of drift that occurs, which is still an open challenge: there remains a lack of approaches that quantify concept drift and then adapt to it according to the level of drift in the data.
Secondly, an important direction in online learning research is the exploration of large-scale streaming images for real-time big data analytics. While online learning has significant efficiency and scalability advantages over batch learning for static images, it becomes a challenging task when handling the extremely high volume and high velocity of streaming images.
Third, despite considerable research in large-scale batch machine learning, there is a need for further investigation into parallel online learning and distributed online learning using diverse computational resources, including high-performance computing machines, cloud computing infrastructures, and potentially low-cost IoT computing environments.
Last but not least, the interplay among these four problems, and the evaluation metrics used to assess them, deserve further study. When proposing a deep learning model in an online fashion, all four challenges need to be considered, and their impact on one another needs to be specified. Preventing catastrophic forgetting may increase network complexity, while reducing network complexity may result in catastrophic forgetting. Likewise, increasing learning accuracy may increase model complexity, and increasing learning scalability may affect computational efficiency.
Finally, online learning is often applied in domains where data privacy and security are critical, such as healthcare and finance. Therefore, developing privacy-preserving and secure online learning algorithms is becoming increasingly important. Future research should focus on developing techniques that can provide strong privacy guarantees while maintaining the performance of the online learning algorithms.

Acknowledgements

This work is supported by the project EMB3DCAM “Next Generation 3D Machine Vision with Embedded Visual Computing” and co-funded under the grant number 325748 of the Research Council of Norway.

Declarations

Conflict of interest

The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Literature
1. Silver D, Hubert T, Schrittwieser J, Antonoglou I, Lai M, Guez A, Lanctot M, Sifre L, Kumaran D, Graepel T et al (2018) A general reinforcement learning algorithm that masters chess, shogi and go through self-play. Science 362(6419):1140–1144
2. Zhou Z-H, Chawla NV, Jin Y, Williams GJ (2014) Big data opportunities and challenges: discussions from data analytics perspectives [discussion forum]. IEEE Comput Intell Mag 9(4):62–74
3. Huijse P, Estevez PA, Protopapas P, Principe JC, Zegers P (2014) Computational intelligence challenges and applications on large-scale astronomical time series databases. IEEE Comput Intell Mag 9(3):27–39
4. Zhai Y, Ong Y-S, Tsang IW (2014) The emerging "big dimensionality". IEEE Comput Intell Mag 9(3):14–26
5. Wu X, Zhu X, Wu G-Q, Ding W (2013) Data mining with big data. IEEE Trans Knowl Data Eng 26(1):97–107
6. Shaheen K, Hanif MA, Hasan O, Shafique M (2022) Continual learning for real-world autonomous systems: algorithms, challenges and frameworks. J Intell Robotic Syst 105(1):9
8. Fahy C, Yang S, Gongora M (2022) Scarcity of labels in non-stationary data streams: a survey. ACM Comput Surv (CSUR) 55(2):1–39
9. Shalev-Shwartz S et al (2012) Online learning and online convex optimization. Foundations and Trends in Machine Learning 4(2):107–194
10. Xian Y, Lampert CH, Schiele B, Akata Z (2018) Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE Trans Pattern Anal Mach Intell 41(9):2251–2265
11. O'Mahony N, Campbell S, Carvalho A, Krpalkova L, Hernandez GV, Harapanahalli S, Riordan D, Walsh J (2019) One-shot learning for custom identification tasks; a review. Procedia Manuf 38:186–193
12. Zhuang F, Qi Z, Duan K, Xi D, Zhu Y, Zhu H, Xiong H, He Q (2020) A comprehensive survey on transfer learning. Proc IEEE 109(1):43–76
13. Aguiar G, Krawczyk B, Cano A (2022) A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework. arXiv preprint arXiv:2204.03719
14. Hoi SC, Sahoo D, Lu J, Zhao P (2021) Online learning: a comprehensive survey. Neurocomputing 459:249–289
15. Lu J, Liu A, Dong F, Gu F, Gama J, Zhang G (2019) Learning under concept drift: a review. IEEE Trans Knowl Data Eng 31(12):2346–2363
16. Parisi GI, Kemker R, Part JL, Kanan C, Wermter S (2019) Continual lifelong learning with neural networks: a review. Neural Netw 113:54–71
17.
18. Gama J, Žliobaitė I, Bifet A, Pechenizkiy M, Bouchachia A (2014) A survey on concept drift adaptation. ACM Comput Surv (CSUR) 46(4):1–37
19. Wang H, Abraham Z (2015) Concept drift detection for streaming data. In: 2015 International Joint Conference on Neural Networks (IJCNN), pp 1–9. IEEE
20. Žliobaitė I, Pechenizkiy M, Gama J (2016) An overview of concept drift applications. Big data analysis: new algorithms for a new society, pp 91–114
21. Ditzler G, Polikar R (2013) Incremental learning of concept drift from streaming imbalanced data. IEEE Trans Knowl Data Eng 10(25):2283–2301
22. Han Y-n, Liu J-w, Xiao B-b, Wang X-T, Luo X-l (2021) Bilevel online deep learning in non-stationary environment. In: Artificial Neural Networks and Machine Learning–ICANN 2021: 30th International Conference on Artificial Neural Networks, Bratislava, Slovakia, September 14–17, 2021, Proceedings, Part II 30, pp 347–358. Springer
23. Guo H, Zhang S, Wang W (2021) Selective ensemble-based online adaptive deep neural networks for streaming data with concept drift. Neural Netw 142:437–456
24. Losing V, Hammer B, Wersing H (2016) Knn classifier with self adjusting memory for heterogeneous concept drift. In: 2016 IEEE 16th International Conference on Data Mining (ICDM), pp 291–300. IEEE
25. Xu S, Wang J (2017) Dynamic extreme learning machine for data stream classification. Neurocomputing 238:433–449
26. Liu A, Zhang G, Lu J (2017) Fuzzy time windowing for gradual concept drift adaptation. In: 2017 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE), pp 1–6. IEEE
27. Bu L, Zhao D, Alippi C (2017) An incremental change detection test based on density difference estimation. IEEE Trans Syst, Man, Cybern: Syst 47(10):2714–2726
28. Liu A, Song Y, Zhang G, Lu J (2017) Regional concept drift detection and density synchronized drift adaptation. In: IJCAI International Joint Conference on Artificial Intelligence
29.
30. Goodfellow IJ, Mirza M, Xiao D, Courville A, Bengio Y (2015) An empirical investigation of catastrophic forgetting in gradient-based neural networks (2013). arXiv preprint arXiv:1312.6211
31. Ans B, Rousset S (1997) Avoiding catastrophic forgetting by coupling two reverberating neural networks. Comptes Rendus de l'Académie des Sciences-Series III-Sciences de la Vie 320(12):989–997
32. Ans B, Rousset S (2000) Neural networks with a self-refreshing memory: knowledge transfer in sequential learning tasks without catastrophic forgetting. Connect Sci 12(1):1–19
33. Bui TD, Nguyen CV, Swaroop S, Turner RE (2018) Partitioned variational inference: a unified framework encompassing federated and continual learning. arXiv preprint arXiv:1811.11206
34. Nguyen CV, Achille A, Lam M, Hassner T, Mahadevan V, Soatto S (2019) Toward understanding catastrophic forgetting in continual learning. arXiv preprint arXiv:1908.01091
35. Chen Z, Fang Z, Fan W, Edwards A, Zhang K (2017) Cstg: an effective framework for cost-sensitive sparse online learning. In: Proceedings of the 2017 SIAM International Conference on Data Mining, pp 759–767. SIAM
36. Yoon J, Yang E, Lee J, Hwang SJ (2018) Lifelong learning with dynamically expandable networks. In: 6th International Conference on Learning Representations, ICLR 2018. International Conference on Learning Representations, ICLR
37.
38. Rusu AA, Rabinowitz NC, Desjardins G, Soyer H, Kirkpatrick J, Kavukcuoglu K, Pascanu R, Hadsell R (2016) Progressive neural networks. arXiv preprint arXiv:1606.04671
39. Zhou G, Sohn K, Lee H (2012) Online incremental feature learning with denoising autoencoders. In: Artificial Intelligence and Statistics, pp 1453–1461. PMLR
40. Yu H, Cong Y, Sun G, Hou D, Liu Y, Dong J (2023) Open-ended online learning for autonomous visual perception. IEEE Transactions on Neural Networks and Learning Systems
41. Gama J, Medas P, Castillo G, Rodrigues P (2004) Learning with drift detection. In: Advances in Artificial Intelligence–SBIA 2004: 17th Brazilian Symposium on Artificial Intelligence, Sao Luis, Maranhao, Brazil, September 29–October 1, 2004. Proceedings 17, pp 286–295. Springer
42. Ross GJ, Adams NM, Tasoulis DK, Hand DJ (2012) Exponentially weighted moving average charts for detecting concept drift. Pattern Recogn Lett 33(2):191–198
44. Dasu T, Krishnan S, Venkatasubramanian S, Yi K (2006) An information-theoretic approach to detecting changes in multi-dimensional data streams. In: Proc. Symposium on the Interface of Statistics, Computing Science, and Applications (Interface)
45. Shao J, Ahmadi Z, Kramer S (2014) Prototype-based learning on concept-drifting data streams. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 412–421
46. Qahtan AA, Alharbi B, Wang S, Zhang X (2015) A pca-based change detection framework for multidimensional data streams: change detection in multidimensional data streams. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp 935–944
47.
go back to reference Gu F, Zhang G, Lu J, Lin C-T (2016) Concept drift detection based on equal density estimation. In: 2016 International Joint Conference on Neural Networks (IJCNN), pp. 24–30. IEEE Gu F, Zhang G, Lu J, Lin C-T (2016) Concept drift detection based on equal density estimation. In: 2016 International Joint Conference on Neural Networks (IJCNN), pp. 24–30. IEEE
48.
go back to reference Elwell R, Polikar R (2011) Incremental learning of concept drift in nonstationary environments. IEEE Trans Neural Netw 22(10):1517–1531MATHCrossRef Elwell R, Polikar R (2011) Incremental learning of concept drift in nonstationary environments. IEEE Trans Neural Netw 22(10):1517–1531MATHCrossRef
49.
go back to reference Liu A, Zhang G, Lu J (2014) Concept drift detection based on anomaly analysis. In: Neural Information Processing: 21st International Conference, ICONIP 2014, Kuching, Malaysia, November 3-6, 2014. Proceedings, Part I 21, pp. 263–270. Springer Liu A, Zhang G, Lu J (2014) Concept drift detection based on anomaly analysis. In: Neural Information Processing: 21st International Conference, ICONIP 2014, Kuching, Malaysia, November 3-6, 2014. Proceedings, Part I 21, pp. 263–270. Springer
50.
go back to reference Huang G-B, Zhu Q-Y, Siew C-K (2006) Extreme learning machine: theory and applications. Neurocomputing 70(1–3):489–501MATHCrossRef Huang G-B, Zhu Q-Y, Siew C-K (2006) Extreme learning machine: theory and applications. Neurocomputing 70(1–3):489–501MATHCrossRef
51.
go back to reference Ashfahani A, Pratama M (2019) Autonomous deep learning: Continual learning approach for dynamic environments. In: Proceedings of the 2019 SIAM International Conference on Data Mining, pp. 666–674. SIAM Ashfahani A, Pratama M (2019) Autonomous deep learning: Continual learning approach for dynamic environments. In: Proceedings of the 2019 SIAM International Conference on Data Mining, pp. 666–674. SIAM
52.
go back to reference Ren W, Zhao T, Qin W, Liu K (2023) T-sas: Toward shift-aware dynamic adaptation for streaming data. In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 4244–4248 Ren W, Zhao T, Qin W, Liu K (2023) T-sas: Toward shift-aware dynamic adaptation for streaming data. In: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 4244–4248
53.
go back to reference Iman M, Miller JA, Rasheed K, Branchinst RM, Arabnia HR (2022) Expanse: A deep continual/progressive learning system for deep transfer learning. arXiv preprint arXiv:2205.10356 Iman M, Miller JA, Rasheed K, Branchinst RM, Arabnia HR (2022) Expanse: A deep continual/progressive learning system for deep transfer learning. arXiv preprint arXiv:​2205.​10356
54.
go back to reference Czarnowski I (2022) Weighted ensemble with one-class classification and over-sampling and instance selection (wecoi): An approach for learning from imbalanced data streams. J Comput Sci 61:101614MATHCrossRef Czarnowski I (2022) Weighted ensemble with one-class classification and over-sampling and instance selection (wecoi): An approach for learning from imbalanced data streams. J Comput Sci 61:101614MATHCrossRef
55.
go back to reference Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu AA, Milan K, Quan J, Ramalho T, Grabska-Barwinska A et al (2017) Overcoming catastrophic forgetting in neural networks. Proc Nat Acad Sci 114(13):3521–3526MathSciNetMATHCrossRef Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, Rusu AA, Milan K, Quan J, Ramalho T, Grabska-Barwinska A et al (2017) Overcoming catastrophic forgetting in neural networks. Proc Nat Acad Sci 114(13):3521–3526MathSciNetMATHCrossRef
56.
go back to reference Park S, Suh T (2022) Continual learning with speculative backpropagation and activation history. IEEE Access 10:38555–38564MATHCrossRef Park S, Suh T (2022) Continual learning with speculative backpropagation and activation history. IEEE Access 10:38555–38564MATHCrossRef
58.
go back to reference Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139MathSciNetMATHCrossRef Freund Y, Schapire RE (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci 55(1):119–139MathSciNetMATHCrossRef
59.
go back to reference Mousser W, Ouadfel S, Taleb-Ahmed A, Kitouni I (2022) Idt: an incremental deep tree framework for biological image classification. Artif Intell Med 134:102392CrossRef Mousser W, Ouadfel S, Taleb-Ahmed A, Kitouni I (2022) Idt: an incremental deep tree framework for biological image classification. Artif Intell Med 134:102392CrossRef
60.
go back to reference Ergün E, Töreyin BU (2021) Sparse progressive neural networks for continual learning. In: Advances in Computational Collective Intelligence: 13th International Conference, ICCCI 2021, Kallithea, Rhodes, Greece, September 29–October 1, 2021, Proceedings 13, pp. 715–725. Springer Ergün E, Töreyin BU (2021) Sparse progressive neural networks for continual learning. In: Advances in Computational Collective Intelligence: 13th International Conference, ICCCI 2021, Kallithea, Rhodes, Greece, September 29–October 1, 2021, Proceedings 13, pp. 715–725. Springer
61.
go back to reference Scardapane S, Comminiello D, Hussain A, Uncini A (2017) Group sparse regularization for deep neural networks. Neurocomputing 241:81–89MATHCrossRef Scardapane S, Comminiello D, Hussain A, Uncini A (2017) Group sparse regularization for deep neural networks. Neurocomputing 241:81–89MATHCrossRef
62.
go back to reference Goodrich B, Arel I (2014) Unsupervised neuron selection for mitigating catastrophic forgetting in neural networks. In: 2014 IEEE 57th International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 997–1000. IEEE Goodrich B, Arel I (2014) Unsupervised neuron selection for mitigating catastrophic forgetting in neural networks. In: 2014 IEEE 57th International Midwest Symposium on Circuits and Systems (MWSCAS), pp. 997–1000. IEEE
63.
go back to reference Aminian E, Ribeiro RP, Gama J (2021) Chebyshev approaches for imbalanced data streams regression models. Data Min Knowl Disc 35:2389–2466MathSciNetMATHCrossRef Aminian E, Ribeiro RP, Gama J (2021) Chebyshev approaches for imbalanced data streams regression models. Data Min Knowl Disc 35:2389–2466MathSciNetMATHCrossRef
64.
go back to reference Czarnowski I (2021) Learning from imbalanced data streams based on over-sampling and instance selection. In: Computational Science–ICCS 2021: 21st International Conference, Krakow, Poland, June 16–18, 2021, Proceedings, Part III, pp. 378–391. Springer Czarnowski I (2021) Learning from imbalanced data streams based on over-sampling and instance selection. In: Computational Science–ICCS 2021: 21st International Conference, Krakow, Poland, June 16–18, 2021, Proceedings, Part III, pp. 378–391. Springer
65.
go back to reference Li-wen W, Wei G, Yi-cheng Y (2021) An online weighted sequential extreme learning machine for class imbalanced data streams. In: Journal of Physics: Conference Series, vol. 1994, p. 012008. IOP Publishing Li-wen W, Wei G, Yi-cheng Y (2021) An online weighted sequential extreme learning machine for class imbalanced data streams. In: Journal of Physics: Conference Series, vol. 1994, p. 012008. IOP Publishing
67.
go back to reference Huang G, Chen D, Li T, Wu F, Van Der Maaten L, Weinberger KQ (2017) Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844 Huang G, Chen D, Li T, Wu F, Van Der Maaten L, Weinberger KQ (2017) Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:​1703.​09844
68.
go back to reference Yang B, Bender G, Le QV, Ngiam J (2019) Condconv: Conditionally parameterized convolutions for efficient inference. Advances in Neural Information Processing Systems 32 Yang B, Bender G, Le QV, Ngiam J (2019) Condconv: Conditionally parameterized convolutions for efficient inference. Advances in Neural Information Processing Systems 32
69.
go back to reference Sabour S, Frosst N, Hinton GE (2017) Dynamic routing between capsules. Advances in neural information processing systems 30 Sabour S, Frosst N, Hinton GE (2017) Dynamic routing between capsules. Advances in neural information processing systems 30
70.
go back to reference He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778
71.
go back to reference Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 Huang G, Liu Z, Van Der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708
73.
go back to reference Wang Y, Pan X, Song S, Zhang H, Huang G, Wu C (2019) Implicit semantic data augmentation for deep networks. Advances in Neural Information Processing Systems 32 Wang Y, Pan X, Song S, Zhang H, Huang G, Wu C (2019) Implicit semantic data augmentation for deep networks. Advances in Neural Information Processing Systems 32
74.
go back to reference Cubuk ED, Zoph B, Mane D, Vasudevan V, Le QV (2019) Autoaugment: Learning augmentation strategies from data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 113–123 Cubuk ED, Zoph B, Mane D, Vasudevan V, Le QV (2019) Autoaugment: Learning augmentation strategies from data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 113–123
75.
go back to reference Hubel DH, Wiesel TN (1962) Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J Physiol 160(1):106MATHCrossRef Hubel DH, Wiesel TN (1962) Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J Physiol 160(1):106MATHCrossRef
76.
go back to reference Murata A, Gallese V, Luppino G, Kaseda M, Sakata H (2000) Selectivity for the shape, size, and orientation of objects for grasping in neurons of monkey parietal area aip. J Neurophysiol 83(5):2580–2601CrossRef Murata A, Gallese V, Luppino G, Kaseda M, Sakata H (2000) Selectivity for the shape, size, and orientation of objects for grasping in neurons of monkey parietal area aip. J Neurophysiol 83(5):2580–2601CrossRef
77.
go back to reference Wang Y, Lv K, Huang R, Song S, Yang L, Huang G (2020) Glance and focus: a dynamic approach to reducing spatial redundancy in image classification. Adv Neural Inf Process Syst 33:2432–2444MATH Wang Y, Lv K, Huang R, Song S, Yang L, Huang G (2020) Glance and focus: a dynamic approach to reducing spatial redundancy in image classification. Adv Neural Inf Process Syst 33:2432–2444MATH
78.
go back to reference Huang G, Liu S, Maaten L, Weinberger KQ (2018) Condensenet: An efficient densenet using learned group convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2752–2761 Huang G, Liu S, Maaten L, Weinberger KQ (2018) Condensenet: An efficient densenet using learned group convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2752–2761
79.
go back to reference Hubara I, Courbariaux M, Soudry D, El-Yaniv R, Bengio Y (2016) Binarized neural networks. Advances in neural information processing systems 29 Hubara I, Courbariaux M, Soudry D, El-Yaniv R, Bengio Y (2016) Binarized neural networks. Advances in neural information processing systems 29
80.
go back to reference Liu Z, Li J, Shen Z, Huang G, Yan S, Zhang C (2017) Learning efficient convolutional networks through network slimming. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744 Liu Z, Li J, Shen Z, Huang G, Yan S, Zhang C (2017) Learning efficient convolutional networks through network slimming. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744
81.
go back to reference Bohnstingl T, Wozniak S, Pantazi A, Eleftheriou E (2023) Online spatio-temporal learning in deep neural networks. IEEE Trans Neural Netw Learn Syst 34(11):8894–8908MathSciNetMATHCrossRef Bohnstingl T, Wozniak S, Pantazi A, Eleftheriou E (2023) Online spatio-temporal learning in deep neural networks. IEEE Trans Neural Netw Learn Syst 34(11):8894–8908MathSciNetMATHCrossRef
82.
go back to reference Li D, Yang R, Li X, Zhu S (2020) Radar signal modulation recognition based on deep joint learning. IEEE Access 8:48515–48528MATHCrossRef Li D, Yang R, Li X, Zhu S (2020) Radar signal modulation recognition based on deep joint learning. IEEE Access 8:48515–48528MATHCrossRef
83.
go back to reference Huang G, Chen D (2018) Multi-scale dense networks for resource efficient image classification. ICLR 2018 Huang G, Chen D (2018) Multi-scale dense networks for resource efficient image classification. ICLR 2018
84.
go back to reference Teerapittayanon S, McDanel B, Kung H-T (2016) Branchynet: Fast inference via early exiting from deep neural networks. In: 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2464–2469. IEEE Teerapittayanon S, McDanel B, Kung H-T (2016) Branchynet: Fast inference via early exiting from deep neural networks. In: 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 2464–2469. IEEE
85.
go back to reference Bolukbasi T, Wang J, Dekel O, Saligrama V (2017) Adaptive neural networks for efficient inference. In: International Conference on Machine Learning, pp. 527–536. PMLR Bolukbasi T, Wang J, Dekel O, Saligrama V (2017) Adaptive neural networks for efficient inference. In: International Conference on Machine Learning, pp. 527–536. PMLR
86.
go back to reference Park E, Kim D, Kim S, Kim Y-D, Kim G, Yoon S, Yoo S (2015) Big/little deep neural network for ultra low power inference. In: 2015 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ ISSS), pp. 124–132. IEEE Park E, Kim D, Kim S, Kim Y-D, Kim G, Yoon S, Yoo S (2015) Big/little deep neural network for ultra low power inference. In: 2015 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ ISSS), pp. 124–132. IEEE
87.
go back to reference Krizhevsky A, Sutskever I, Hinton GE (2017) Imagenet classification with deep convolutional neural networks. Commun ACM 60(6):84–90MATHCrossRef Krizhevsky A, Sutskever I, Hinton GE (2017) Imagenet classification with deep convolutional neural networks. Commun ACM 60(6):84–90MATHCrossRef
88.
89.
go back to reference Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2015) Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9
90.
go back to reference Wang X, Luo Y, Crankshaw D, Tumanov A, Yu F, Gonzalez JE (2017) Idk cascades: Fast deep learning by learning not to overthink. arXiv preprint arXiv:1706.00885 Wang X, Luo Y, Crankshaw D, Tumanov A, Yu F, Gonzalez JE (2017) Idk cascades: Fast deep learning by learning not to overthink. arXiv preprint arXiv:​1706.​00885
91.
go back to reference Leroux S, Bohez S, De Coninck E, Verbelen T, Vankeirsbilck B, Simoens P, Dhoedt B (2017) The cascading neural network: building the internet of smart things. Knowl Inf Syst 52:791–814CrossRef Leroux S, Bohez S, De Coninck E, Verbelen T, Vankeirsbilck B, Simoens P, Dhoedt B (2017) The cascading neural network: building the internet of smart things. Knowl Inf Syst 52:791–814CrossRef
92.
go back to reference Guan J, Liu Y, Liu Q, Peng J (2017) Energy-efficient amortized inference with cascaded deep classifiers. arXiv preprint arXiv:1710.03368 Guan J, Liu Y, Liu Q, Peng J (2017) Energy-efficient amortized inference with cascaded deep classifiers. arXiv preprint arXiv:​1710.​03368
93.
go back to reference Dai X, Kong X, Guo T (2020) Epnet: Learning to exit with flexible multi-branch network. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 235–244 Dai X, Kong X, Guo T (2020) Epnet: Learning to exit with flexible multi-branch network. In: Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 235–244
94.
go back to reference Figurnov M, Collins MD, Zhu Y, Zhang L, Huang J, Vetrov D, Salakhutdinov R (2017) Spatially adaptive computation time for residual networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1039–1048 Figurnov M, Collins MD, Zhu Y, Zhang L, Huang J, Vetrov D, Salakhutdinov R (2017) Spatially adaptive computation time for residual networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1039–1048
95.
go back to reference Leroux S, Molchanov P, Simoens P, Dhoedt B, Breuel T, Kautz J (2018) Iamnn: Iterative and adaptive mobile neural network for efficient image classification. arXiv preprint arXiv:1804.10123 Leroux S, Molchanov P, Simoens P, Dhoedt B, Breuel T, Kautz J (2018) Iamnn: Iterative and adaptive mobile neural network for efficient image classification. arXiv preprint arXiv:​1804.​10123
96.
go back to reference Bengio Y, Léonard N, Courville A (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432 Bengio Y, Léonard N, Courville A (2013) Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:​1308.​3432
97.
go back to reference Cho K, Bengio Y (2014) Exponentially increasing the capacity-to-computation ratio for conditional computation in deep learning. arXiv preprint arXiv:1406.7362 Cho K, Bengio Y (2014) Exponentially increasing the capacity-to-computation ratio for conditional computation in deep learning. arXiv preprint arXiv:​1406.​7362
98.
go back to reference Bengio E, Bacon P-L, Pineau J, Precup D (2015) Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297 Bengio E, Bacon P-L, Pineau J, Precup D (2015) Conditional computation in neural networks for faster models. arXiv preprint arXiv:​1511.​06297
99.
go back to reference Davis A, Arel I (2013) Low-rank approximations for conditional feedforward computation in deep neural networks. arXiv preprint arXiv:1312.4461 Davis A, Arel I (2013) Low-rank approximations for conditional feedforward computation in deep neural networks. arXiv preprint arXiv:​1312.​4461
100.
go back to reference Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE (1991) Adaptive mixtures of local experts. Neural Comput 3(1):79–87MATHCrossRef Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE (1991) Adaptive mixtures of local experts. Neural Comput 3(1):79–87MATHCrossRef
101.
go back to reference Eigen D, Ranzato M, Sutskever I (2013) Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:1312.4314 Eigen D, Ranzato M, Sutskever I (2013) Learning factored representations in a deep mixture of experts. arXiv preprint arXiv:​1312.​4314
102.
go back to reference Ma J, Zhao Z, Yi X, Chen J, Hong L, Chi EH (2018) Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 1930–1939 Ma J, Zhao Z, Yi X, Chen J, Hong L, Chi EH (2018) Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 1930–1939
103.
go back to reference Mullapudi RT, Mark WR, Shazeer N, Fatahalian K (2018) Hydranets: Specialized dynamic architectures for efficient inference. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8080–8089 Mullapudi RT, Mark WR, Shazeer N, Fatahalian K (2018) Hydranets: Specialized dynamic architectures for efficient inference. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8080–8089
104.
go back to reference Cai S, Shu Y, Wang W (2021) Dynamic routing networks. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3588–3597 Cai S, Shu Y, Wang W (2021) Dynamic routing networks. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3588–3597
105.
go back to reference Yuan Z, Wu B, Sun G, Liang Z, Zhao S, Bi W (2020) S2dnas: Transforming static cnn model for dynamic inference via neural architecture search. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pp. 175–192. Springer Yuan Z, Wu B, Sun G, Liang Z, Zhao S, Bi W (2020) S2dnas: Transforming static cnn model for dynamic inference via neural architecture search. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pp. 175–192. Springer
106.
go back to reference Hua W, Zhou Y, De Sa CM, Zhang Z, Suh GE (2019) Channel gating neural networks. Advances in Neural Information Processing Systems 32 Hua W, Zhou Y, De Sa CM, Zhang Z, Suh GE (2019) Channel gating neural networks. Advances in Neural Information Processing Systems 32
107.
go back to reference Lin J, Rao Y, Lu J, Zhou J (2017) Runtime neural pruning. Advances in neural information processing systems 30 Lin J, Rao Y, Lu J, Zhou J (2017) Runtime neural pruning. Advances in neural information processing systems 30
108.
go back to reference Li C, Wang G, Wang B, Liang X, Li Z, Chang X (2021) Dynamic slimmable network. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8607–8617 Li C, Wang G, Wang B, Liang X, Li Z, Chang X (2021) Dynamic slimmable network. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8607–8617
109.
go back to reference Wang Y, Shen J, Hu T-K, Xu P, Nguyen T, Baraniuk R, Wang Z, Lin Y (2020) Dual dynamic inference: enabling more efficient, adaptive, and controllable deep inference. IEEE J Selected Top Signal Process 14(4):623–633MATHCrossRef Wang Y, Shen J, Hu T-K, Xu P, Nguyen T, Baraniuk R, Wang Z, Lin Y (2020) Dual dynamic inference: enabling more efficient, adaptive, and controllable deep inference. IEEE J Selected Top Signal Process 14(4):623–633MATHCrossRef
110.
go back to reference Xia W, Yin H, Dai X, Jha NK (2021) Fully dynamic inference with deep neural networks. IEEE Trans Emerg Top Comput 10(2):962–972MATH Xia W, Yin H, Dai X, Jha NK (2021) Fully dynamic inference with deep neural networks. IEEE Trans Emerg Top Comput 10(2):962–972MATH
111.
go back to reference Ehteshami Bejnordi A, Krestel R (2020) Dynamic channel and layer gating in convolutional neural networks. In: KI 2020: Advances in Artificial Intelligence: 43rd German Conference on AI, Bamberg, Germany, September 21–25, 2020, Proceedings 43, pp. 33–45. Springer Ehteshami Bejnordi A, Krestel R (2020) Dynamic channel and layer gating in convolutional neural networks. In: KI 2020: Advances in Artificial Intelligence: 43rd German Conference on AI, Bamberg, Germany, September 21–25, 2020, Proceedings 43, pp. 33–45. Springer
112.
go back to reference Hinton GE, Sabour S, Frosst N (2018) Matrix capsules with em routing. In: International conference on learning representations Hinton GE, Sabour S, Frosst N (2018) Matrix capsules with em routing. In: International conference on learning representations
113.
go back to reference Zhu X, Yi J, Zhang L (2024) Continual learning with unknown task boundary. IEEE transactions on neural networks and learning systems Zhu X, Yi J, Zhang L (2024) Continual learning with unknown task boundary. IEEE transactions on neural networks and learning systems
114.
go back to reference Wang D, Liu K, Xiong H, Fu Y (2022) Online poi recommendation: learning dynamic geo-human interactions in streams. IEEE Trans Big Data 9(3):832–844MATHCrossRef Wang D, Liu K, Xiong H, Fu Y (2022) Online poi recommendation: learning dynamic geo-human interactions in streams. IEEE Trans Big Data 9(3):832–844MATHCrossRef
115.
go back to reference Yang D, Zhang D, Zheng VW, Yu Z (2014) Modeling user activity preference by leveraging user spatial-temporal characteristics in lbsns. IEEE Trans Syst, Man, Cybern: Syst 45(1):129–142MATHCrossRef Yang D, Zhang D, Zheng VW, Yu Z (2014) Modeling user activity preference by leveraging user spatial-temporal characteristics in lbsns. IEEE Trans Syst, Man, Cybern: Syst 45(1):129–142MATHCrossRef
116.
go back to reference Chun C-J, Kang J-M, Kim I-M (2018) Adaptive rate and energy harvesting interval control based on reinforcement learning for swipt. IEEE Commun Lett 22(12):2571–2574MATHCrossRef Chun C-J, Kang J-M, Kim I-M (2018) Adaptive rate and energy harvesting interval control based on reinforcement learning for swipt. IEEE Commun Lett 22(12):2571–2574MATHCrossRef
117.
go back to reference Luo J, Tang J, So DK, Chen G, Cumanan K, Chambers JA (2019) A deep learning-based approach to power minimization in multi-carrier noma with swipt. IEEE Access 7:17450–17460CrossRef Luo J, Tang J, So DK, Chen G, Cumanan K, Chambers JA (2019) A deep learning-based approach to power minimization in multi-carrier noma with swipt. IEEE Access 7:17450–17460CrossRef
118.
go back to reference Lee K, Lee W (2020) Learning-based resource management for swipt. IEEE Syst J 14(4):4750–4753MATHCrossRef Lee K, Lee W (2020) Learning-based resource management for swipt. IEEE Syst J 14(4):4750–4753MATHCrossRef
119.
go back to reference Al-Eryani Y, Akrout M, Hossain E (2020) Simultaneous energy harvesting and information transmission in a mimo full-duplex system: A machine learning-based design. arXiv preprint arXiv:2002.06193 Al-Eryani Y, Akrout M, Hossain E (2020) Simultaneous energy harvesting and information transmission in a mimo full-duplex system: A machine learning-based design. arXiv preprint arXiv:​2002.​06193
120.
go back to reference Tang J, Luo J, Ou J, Zhang X, Zhao N, So DKC, Wong K-K (2020) Decoupling or learning: joint power splitting and allocation in mc-noma with swipt. IEEE Trans Commun 68(9):5834–5848CrossRef Tang J, Luo J, Ou J, Zhang X, Zhao N, So DKC, Wong K-K (2020) Decoupling or learning: joint power splitting and allocation in mc-noma with swipt. IEEE Trans Commun 68(9):5834–5848CrossRef
121.
go back to reference Li L, Ma H, Ren H, Cheng Q, Wang D, Bai T, Han Z (2021) Learning-aided resource allocation for pattern division multiple access-based swipt systems. IEEE Wireless Commun Lett 10(1):131–135MATHCrossRef Li L, Ma H, Ren H, Cheng Q, Wang D, Bai T, Han Z (2021) Learning-aided resource allocation for pattern division multiple access-based swipt systems. IEEE Wireless Commun Lett 10(1):131–135MATHCrossRef
122.
go back to reference Jiang L, Wang P, Cheng K, Liu K, Yin M, Jin B, Fu Y (2021) Eduhawkes: a neural hawkes process approach for online study behavior modeling. In: Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), pp. 567–575. SIAM Jiang L, Wang P, Cheng K, Liu K, Yin M, Jin B, Fu Y (2021) Eduhawkes: a neural hawkes process approach for online study behavior modeling. In: Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), pp. 567–575. SIAM
123.
go back to reference Padhi DK, Padhy N, Bhoi AK, Shafi J, Yesuf SH (2022) An intelligent fusion model with portfolio selection and machine learning for stock market prediction. Comput Intell Neurosci 2022(1):7588303 Padhi DK, Padhy N, Bhoi AK, Shafi J, Yesuf SH (2022) An intelligent fusion model with portfolio selection and machine learning for stock market prediction. Comput Intell Neurosci 2022(1):7588303
124.
go back to reference Deng Y, Bao F, Kong Y, Ren Z, Dai Q (2016) Deep direct reinforcement learning for financial signal representation and trading. IEEE Trans Neural Netw Learn Syst 28(3):653–664MATHCrossRef Deng Y, Bao F, Kong Y, Ren Z, Dai Q (2016) Deep direct reinforcement learning for financial signal representation and trading. IEEE Trans Neural Netw Learn Syst 28(3):653–664MATHCrossRef
125.
go back to reference Tsantekidis A, Passalis N, Tefas A (2021) Diversity-driven knowledge distillation for financial trading using deep reinforcement learning. Neural Netw 140:193–202MATHCrossRef Tsantekidis A, Passalis N, Tefas A (2021) Diversity-driven knowledge distillation for financial trading using deep reinforcement learning. Neural Netw 140:193–202MATHCrossRef
126.
go back to reference Cesa-Bianchi N, Lugosi G (2006) Prediction, Learning, and Games. Cambridge University Press, CambridgeMATHCrossRef Cesa-Bianchi N, Lugosi G (2006) Prediction, Learning, and Games. Cambridge University Press, CambridgeMATHCrossRef
127.
go back to reference Li B, Hoi SC (2014) Online portfolio selection: a survey. ACM Comput Surv (CSUR) 46(3):1–36MATH Li B, Hoi SC (2014) Online portfolio selection: a survey. ACM Comput Surv (CSUR) 46(3):1–36MATH
128.
go back to reference Li B, Hoi SCH (2018) Online Portfolio Selection: Principles and Algorithms. Crc Press Li B, Hoi SCH (2018) Online Portfolio Selection: Principles and Algorithms. Crc Press
129.
go back to reference Li B, Sahoo D, Hoi SC (2016) Olps: a toolbox for on-line portfolio selection. J Mach Learn Res 17(35):1–5MathSciNetMATH Li B, Sahoo D, Hoi SC (2016) Olps: a toolbox for on-line portfolio selection. J Mach Learn Res 17(35):1–5MathSciNetMATH
130.
131.
go back to reference Wang Y, Sun L (2024) Energy-efficient dynamic sensor time series classification for edge health devices. Computer Methods and Programs in Biomedicine, 108268 Wang Y, Sun L (2024) Energy-efficient dynamic sensor time series classification for edge health devices. Computer Methods and Programs in Biomedicine, 108268
132.
go back to reference Ayromlou S, Tsang T, Abolmaesumi P, Li X (2024) Ccsi: Continual class-specific impression for data-free class incremental learning. Medical Image Analysis, 103239 Ayromlou S, Tsang T, Abolmaesumi P, Li X (2024) Ccsi: Continual class-specific impression for data-free class incremental learning. Medical Image Analysis, 103239
133.
go back to reference Sun L, Chen Q, Zheng M, Ning X, Gupta D, Tiwari P (2023) Energy-efficient online continual learning for time series classification in nanorobot-based smart health. IEEE journal of biomedical and health informatics Sun L, Chen Q, Zheng M, Ning X, Gupta D, Tiwari P (2023) Energy-efficient online continual learning for time series classification in nanorobot-based smart health. IEEE journal of biomedical and health informatics
134.
go back to reference Amrollahi F, Shashikumar SP, Holder AL, Nemati S (2022) Leveraging clinical data across healthcare institutions for continual learning of predictive risk models. Sci Rep 12(1):8380MATHCrossRef Amrollahi F, Shashikumar SP, Holder AL, Nemati S (2022) Leveraging clinical data across healthcare institutions for continual learning of predictive risk models. Sci Rep 12(1):8380MATHCrossRef
135.
go back to reference Xu M, Islam M, Bai L, Ren H (2024) Privacy-preserving synthetic continual semantic segmentation for robotic surgery. IEEE Transactions on Medical Imaging Xu M, Islam M, Bai L, Ren H (2024) Privacy-preserving synthetic continual semantic segmentation for robotic surgery. IEEE Transactions on Medical Imaging
136.
go back to reference Larrazabal AJ, Nieto N, Peterson V, Milone DH, Ferrante E (2020) Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proc Natl Acad Sci 117(23):12592–12594CrossRef Larrazabal AJ, Nieto N, Peterson V, Milone DH, Ferrante E (2020) Gender imbalance in medical imaging datasets produces biased classifiers for computer-aided diagnosis. Proc Natl Acad Sci 117(23):12592–12594CrossRef
137.
go back to reference Kim JY (2023) Machines do not decide hate speech: Machine learning, power, and the intersectional approach. 86272 12, 355–369 Kim JY (2023) Machines do not decide hate speech: Machine learning, power, and the intersectional approach. 86272 12, 355–369
138.
go back to reference Wang T, Zhao J, Yatskar M, Chang K-W, Ordonez V (2019) Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5310–5319 Wang T, Zhao J, Yatskar M, Chang K-W, Ordonez V (2019) Balanced datasets are not enough: Estimating and mitigating gender bias in deep image representations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5310–5319
139.
go back to reference Mansoury M, Abdollahpouri H, Mobasher B, Pechenizkiy M, Burke R, Sabouri M (2021) Unbiased cascade bandits: Mitigating exposure bias in online learning to rank recommendation. arXiv preprint arXiv:2108.03440 Mansoury M, Abdollahpouri H, Mobasher B, Pechenizkiy M, Burke R, Sabouri M (2021) Unbiased cascade bandits: Mitigating exposure bias in online learning to rank recommendation. arXiv preprint arXiv:​2108.​03440
140.
go back to reference Xu D, Yuan S, Zhang L, Wu X (2018) Fairgan: Fairness-aware generative adversarial networks. In: 2018 IEEE International Conference on Big Data (big Data), pp. 570–575. IEEE Xu D, Yuan S, Zhang L, Wu X (2018) Fairgan: Fairness-aware generative adversarial networks. In: 2018 IEEE International Conference on Big Data (big Data), pp. 570–575. IEEE
141.
go back to reference Ai Q, Bi K, Luo C, Guo J, Croft WB (2018) Unbiased learning to rank with unbiased propensity estimation. In: The 41st International ACM SIGIR conference on research & development in information retrieval, pp. 385–394 Ai Q, Bi K, Luo C, Guo J, Croft WB (2018) Unbiased learning to rank with unbiased propensity estimation. In: The 41st International ACM SIGIR conference on research & development in information retrieval, pp. 385–394
142.
go back to reference Zhao C, Mi F, Wu X, Jiang K, Khan L, Chen F (2022) Adaptive fairness-aware online meta-learning for changing environments. In: Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pp. 2565–2575 Zhao C, Mi F, Wu X, Jiang K, Khan L, Chen F (2022) Adaptive fairness-aware online meta-learning for changing environments. In: Proceedings of the 28th ACM SIGKDD conference on knowledge discovery and data mining, pp. 2565–2575
143.
go back to reference Lian H, Atwood JS, Hou B-J, Wu J, He Y (2022) Online deep learning from doubly-streaming data. In: Proceedings of the 30th ACM international conference on multimedia, pp. 3185–3194 Lian H, Atwood JS, Hou B-J, Wu J, He Y (2022) Online deep learning from doubly-streaming data. In: Proceedings of the 30th ACM international conference on multimedia, pp. 3185–3194