In an era defined by the relentless influx of data from diverse sources, the ability to harness and extract valuable insights from streaming data has become paramount. The rapidly evolving realm of online learning techniques is tailored specifically for the unique challenges posed by streaming data. As the digital world continues to generate vast torrents of real-time data, understanding and effectively utilizing online learning approaches are pivotal for staying ahead in various domains. One of the primary goals of online learning is to continuously update the model with the most recent data trends while maintaining and improving the accuracy of previous trends. Based on the various types of feedback, online learning tasks can be divided into three categories: learning with full feedback, learning with limited feedback, and learning without feedback. This survey aims to identify and analyze the key challenges associated with online learning with full feedback, including concept drift, catastrophic forgetting, skewed learning, and network adaptation, while the other existing reviews mainly focus on a single challenge or two without considering other scenarios. This article also discusses the application and ethical implications of online learning. The results of this survey provide valuable insights for researchers and instructional designers seeking to create effective online learning experiences that incorporate full feedback while addressing the associated challenges. In the end, some conclusions, remarks, and future directions for the research community are provided based on the findings of this review.
Notes
Mina Farmanbar, Shingo Kagami, Ahmed Nabil Belbachir, and Chunmingl Rong have contributed equally to this work.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1 Introduction
Machine learning is a subfield of artificial intelligence that focuses on developing algorithms that can learn patterns from data and make predictions or decisions based on that learning. Machine learning algorithms can be broadly divided into two categories: offline learning and online learning. Offline learning, also known as batch learning, involves training a machine learning model on a fixed-size dataset and then using the trained model to make predictions on new, unseen data. This type of learning requires both the data and a learning algorithm to be available before the training process begins [1]. During the training process, the model parameters are adjusted to fit the patterns in the training data, with the goal of making accurate predictions. The model parameters are determined during the training phase and are static in the sense that they are not updated after training. Once the model is trained, it is deployed for use in the application, and the model’s parameters remain fixed during inference or prediction.
Online learning, in contrast to offline learning, involves updating the machine learning model continuously as new data become available. This makes it particularly useful in applications where data are generated in real time by mobile phones, Internet of Things devices, and different network sensors [2‐5]. By updating the model with the most recent data trends, online learning can help to improve the accuracy and reliability of the model over time. Each of these categories of machine learning can also be further divided into three subcategories based on the type of learning: supervised learning, semi-supervised learning, and unsupervised learning. Based on the various types of feedback, online learning tasks can be divided into three categories: learning without feedback, learning with limited feedback, and learning with full feedback. In learning without feedback, the algorithms receive little or no feedback about their performance on a task. This type of learning is similar to unsupervised learning in machine learning, where the model must identify patterns in the data without explicit feedback. In learning with limited feedback, the algorithms receive some feedback about their performance, but the feedback may not be complete or detailed. This type of learning is similar to semi-supervised learning in machine learning, where the model has access to some labeled examples but must still learn from a large amount of unlabeled data. In learning with full feedback, the algorithms receive complete and detailed feedback about their performance on a task. This type of learning is similar to supervised learning in machine learning, where the model is trained on a labeled dataset and receives feedback in the form of correct labels.
Online learning has the advantage of being able to adapt to changing data and continuously improve the model’s performance. However, it also requires more resources and can be more complex to implement than offline learning. There are also challenges associated with online learning. Concept drift, catastrophic forgetting, skewed learning, and network adaptation are four challenges that can arise in online machine learning systems. Each of these challenges can affect the performance of the model and lead to inaccurate or unreliable results. This study aims to conduct a comprehensive review of the literature on online learning methodologies and the associated challenges, especially for online learning with full feedback. In this study, each of the aforementioned challenges and some of the approaches that have been proposed to address them are discussed in detail. The main contributions of this study are as follows:
1. An overview of offline and online learning approaches, their strengths and weaknesses, and their ability to handle challenges.
2. A literature review of the four major challenges investigated, namely concept drift, catastrophic forgetting, skewed learning, and network adaptation, associated with online learning with full feedback.
3. A conclusion covering various aspects of online learning and analyzing the limitations of existing methods.
4. Identification of open research questions and potential directions for future research in this area.
Figure 1 summarizes the major contributions of this study. In the introduction section, online and offline learning are compared first, then problems related to online learning are identified, and later, different challenges are described. In the literature review, different methods are described and analyzed for the four challenges of online learning. The next section discusses applications of online deep learning in different domains. Later, the ethical implications of online deep learning are discussed. In the conclusion, all four challenges are summarized. In future directions, open research questions and potential directions are identified. Best viewed in color.
Fig. 1
Overview of the sections included in this Survey
1.1 Offline learning analysis
Offline learning, also referred to as batch learning, involves training a machine learning model on a static dataset that is provided all at once. After a trained model is deployed in the application scenario, it makes predictions according to the patterns or trends in that dataset. In applications where the underlying data distribution changes over time, such as in streaming data or non-stationary environments, the deployed model may become outdated and need to be retrained on new data to maintain accuracy and performance. This is especially important in safety-critical applications such as self-driving vehicles, unmanned aerial vehicles, and robots, where the consequences of model failure can be severe [6]. In such schemes, machine learning practitioners manually retrain the model on new data and redeploy it. However, this approach can be time-consuming and may not be feasible in applications with large amounts of streaming data. For this reason, some practitioners schedule training and deployment to run automatically at set times. This automates retraining and redeployment, but three drawbacks remain: the cost of redeploying, the fact that the model has not been trained on data that arrived after the last scheduled training, and the need to store all data between one scheduled training and the next, which is difficult in real-world dynamic applications because of the huge amount of data involved [7]. Sometimes models are used for prediction at the edge, where machines have limited computing and memory resources, and therefore, it is important to have models that are lightweight and can be executed efficiently on edge devices. Moreover, in streaming data applications, where data are constantly coming in, it is important to have a model that can adapt to the most recent trends at run time. This requires a model that is capable of incremental learning, which means it can learn from new data points without retraining the entire model from scratch.
1.2 Online learning analysis
The performance of offline learning models tends to deteriorate in a non-stationary streaming environment [8]. Such static models struggle to scale to real-world dynamic problems. Online machine learning is a technique that updates model parameters at run time to adapt to the most recent trends. Because of its adaptive nature, online learning requires neither batch retraining nor redeployment, and it does not need to store data in memory. In online machine learning, models are fed sequential data, and parameters are updated in real time (one data point at a time) [9]. However, backpropagation on one data point at a time raises challenges such as concept drift, catastrophic forgetting, skewed learning, and network adaptation. Concept drift occurs when the underlying distribution of the data changes over time, which can cause the model to become outdated and less accurate. Catastrophic forgetting occurs when the model forgets previously learned information as it learns new information, which can be problematic for tasks that require long-term memory. Skewed learning occurs when the data are imbalanced, which can lead to the model making biased predictions. Network adaptation refers to the ability of the architecture to adapt to new data while retaining previously learned knowledge. Online learning is indeed one of the key techniques that make machine learning practical for real-time analysis of big data.
In many online learning scenarios, obtaining large amounts of labeled data can be costly and time-consuming, which makes techniques that reduce the dependence on labeled data highly desirable. Some recent methods of online learning that can reduce the amount of labeled data required for training models are zero-shot learning, one-shot learning, and transfer learning. Zero-shot learning refers to the ability of a model to recognize and classify objects that it has never seen before without requiring any examples of those objects during training. Instead, the model is trained on a set of attributes or semantic descriptions that describe the properties of different object classes [10]. One-shot learning, on the other hand, is a type of machine learning in which a model is trained to recognize new objects or classes from just one or a few examples rather than requiring a large dataset for training [11]. Transfer learning is a technique that involves using knowledge gained from one task to improve performance on another related task [12]. Overall, online learning is a powerful technique for building models that can adapt to changing data distributions in real time. While backpropagation on a single data point can raise several challenges, online learning can help mitigate these challenges by allowing the model to learn from new data incrementally.
1.3 Problem setting/problem formulation
In online learning, data arrive sequentially at the learning algorithm, and the timestamp at which each data instance arrives at the model is called a “round.” In each round, the model makes a prediction for the data sample, e.g., classifying it into a predefined class. After the prediction, the model receives the actual label as ground truth for the sample and then measures the loss value, which quantifies the difference between the predicted and actual class labels. Finally, the model updates its parameters according to the loss suffered to improve its predictions for upcoming samples.
Fig. 2
To summarize the difference between (a) Batch learning and (b) Online learning models in a production environment. Batch learning requires repeating all steps, while online learning, on the contrary, does not need any intervention
Figure 2 illustrates the working cycle of batch and online learning in the production environment. Memory is required to gather data for training purposes in batch learning, which may not be possible in high-dimensional data streaming applications, while data gathering is not required in the online setting. In batch learning, training is necessary before deployment for production in a use case, while in online learning, training is performed continuously with each prediction. In online learning, a loop represents continuous prediction in the application. The time between two consecutive predictions is shorter in batch learning than in online learning because the online learning loop contains two extra steps: receiving true outcomes and updating the model. To make predictions that reflect the most recent trends, batch learning has to perform all steps from the beginning, while online learning is specially designed to be updated with recent trends, which makes online learning suitable for streaming data.
Consider an online classification task where the primary objective is to minimize the cumulative loss suffered by the model. The cumulative loss is minimized with the help of a learning function f based on a sequence of samples \(X=\{x_1,x_2, x_3,...,x_T\}\) that arrives sequentially, where T is the number of rounds (time stamps). After the arrival of each sample \(x_t\), function f classifies the sample as \(p_t\), the class label predicted by the model. Ground truth \(Y=\{y_1,y_2, y_3,...,y_T\}\) gives the output label for each sample. Loss is the difference between the predicted and actual output for a specific sample, which can be calculated with the help of a loss function \(L(p_t,y_t)\), whereas the cumulative predictive error is the sum of the per-round losses, \(\sum_{t=1}^{T} L(p_t,y_t)\) (Eq. 1).
Algorithms 1 & 2 present the baseline for batch machine learning and online learning [9] algorithms.
Algorithm 1
Batch Learning
Algorithm 2
Online Learning
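The round-by-round protocol described above (predict, receive the true label, suffer a loss, update) can be sketched as follows. This is a minimal illustration, not the survey's Algorithm 2: the choice of logistic regression, the log loss, the learning rate, and the names `online_learning` and `make_stream` are all assumptions made for the example.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def online_learning(stream, n_features, lr=0.1):
    """Run one predict -> reveal label -> loss -> update cycle per sample.

    `stream` yields (x_t, y_t) pairs with y_t in {0, 1}. Returns the final
    weights, the bias, and the cumulative log loss over all rounds.
    """
    w = [0.0] * n_features
    b = 0.0
    cumulative_loss = 0.0
    eps = 1e-12  # avoids log(0)
    for x_t, y_t in stream:  # one round per sample
        # 1) predict before the label is known
        p_t = sigmoid(sum(wi * xi for wi, xi in zip(w, x_t)) + b)
        # 2) the true label y_t is revealed; 3) measure the loss
        loss = -(y_t * math.log(p_t + eps) + (1 - y_t) * math.log(1 - p_t + eps))
        cumulative_loss += loss
        # 4) update parameters with a single gradient step on this sample
        grad = p_t - y_t  # derivative of the log loss w.r.t. the logit
        w = [wi - lr * grad * xi for wi, xi in zip(w, x_t)]
        b -= lr * grad
    return w, b, cumulative_loss

def make_stream(n, seed=0):
    # Hypothetical toy stream: label is 1 when the two features sum to > 0
    rng = random.Random(seed)
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1)]
        yield x, (1 if x[0] + x[1] > 0 else 0)
```

No sample is stored after its round ends, which is the property that distinguishes this loop from the batch procedure.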
Table 1
Research questions

| RQ | Research question |
| --- | --- |
| RQ1 | How to detect and adapt concept drift? |
| RQ2 | How to avoid catastrophic forgetting? |
| RQ3 | How to avoid skewed/imbalanced learning? |
| RQ4 | How to adapt network architecture? |
The paper is organized as follows. Section 2 reviews key challenges in applying deep learning to streaming data, focusing on concept drift, catastrophic forgetting, skewed learning, and network adaptation. Section 3 delves into concept drift, covering methods for detecting drift and strategies for adaptation. Section 4 addresses catastrophic forgetting, exploring techniques such as hedging methods, selective training, and progressive neural networks to mitigate its impact. Section 5 examines skewed learning, analyzing both data-level and algorithm-level approaches to handling imbalanced streaming data. Section 6 discusses network adaptation, including methods for dynamically adjusting network depth and width to better handle streaming data demands. Section 7 illustrates practical applications of these techniques in smart city, finance, and healthcare systems. Following this, Section 8 examines the ethical considerations surrounding the use of online deep learning in streaming data, including issues of bias, fairness, transparency, and accountability. Section 9 summarizes the main insights from the survey, and Section 10 proposes avenues for continued research, highlighting open challenges and potential advancements in this field.
2 Challenges
Online learning faces certain challenges because network parameters are adapted for each data point, including concept drift, catastrophic forgetting (also known as catastrophic interference), skewed or imbalanced learning (which can manifest as underfitting or overfitting), and network adaptation. Table 2 lists surveys related to these problems and shows that none of them targets all four questions. Table 1 presents the research questions, and Sects. 3, 4, 5, 6 elaborate on different proposed techniques to address each question mentioned in Table 1. This survey aims to address all four questions at the same time.
2.1 Concept drift
Concept drift is a common problem in online learning where the underlying data distribution changes over time. It refers to the phenomenon where the statistical properties of the data used to train a machine learning model change over time, leading to a decrease in the model’s performance. The relationship between input and output in an online data stream can change over time due to various factors such as changes in user behavior, changes in the environment, or changes in the underlying system generating the data, and a static model with a fixed relationship may result in poor predictions [19, 20]. Therefore, a dynamic model is required to update the relationship between input and output [15]. To update the model for changing data, it is necessary to detect concept drift in the streaming data [21].
Recently, various methods have been proposed for detecting concept drift in machine learning models, using the error rate of the classifier [22], sliding windows and accuracy [23‐25], fuzzy windowing [26], incremental least squares density difference [27], and local drift detection (LDD) measurement [28]. After detecting concept drift, it is important to adapt the machine learning model to the changing data distribution in order to maintain or improve performance. Recent methods for concept drift adaptation are Bilevel Online Deep Learning (BODL) [22], updating an ensemble by adding or removing networks [23], updating the model by retraining [25], Fuzzy Windowing Drift Adaptation (FW-DA) [26], a case-based editing technique [29], and self-adjusting memory [24]. Concept drift can occur in different forms, depending on the nature and rate of change in the data distribution. Figure 3 illustrates the four types of concept drift: sudden, gradual, incremental, and reoccurring. Section 3 elaborates on the literature related to concept drift.
Fig. 3
Illustration of the four types of concept drift: (a) Sudden (b) Gradual (c) Incremental (d) Reoccurring drift [15]
2.2 Catastrophic forgetting
In a dynamic environment, human cognitive reactions may change in response to the same stimulus due to neural variability in the sensory cortex, which fluctuates over time. Humans learn continuously in such environments, but this can lead to catastrophic forgetting, where previously learned tasks are disrupted by the brain’s neural system. Neural variability helps balance accuracy and plasticity in humans [30]. Online learning techniques face the same phenomenon of catastrophic forgetting when working with dynamic data [16]. This occurs when the network modifies information related to a previous task due to continuous training on a new task [30‐34]. Figure 4 depicts catastrophic forgetting, where the network forgets class A due to the repeated occurrence of class B. Section 4 elaborates on the literature related to catastrophic forgetting.
Fig. 4
Catastrophic forgetting: (a) Input data (b) Online learning model mimicking Fig. 2b (c) Output labels. Due to excess feeding of label B, the network forgets the identity of label A
2.3 Skewed learning
Streaming data refer to an unbounded sequence of real-time data points with high velocity, volume, and skewed distribution [35]. In a symmetric distribution such as the normal distribution, data points are spread equally on both sides of the mean, whereas in a skewed distribution, they are not. In supervised learning, data points are labeled with classes, and if the difference between the number of data points in each class is enormous, then the data are imbalanced or skewed. Conventional online learning techniques aim to minimize the error rate and maximize accuracy, but raw accuracy can be misleading if the data are skewed or imbalanced [14]. Class-based accuracy needs to be improved to address this issue. Additionally, skewed distribution can result in catastrophic forgetting for the minority class. Section 5 elaborates on the literature related to skewed learning.
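To make the point about raw accuracy concrete, the sketch below compares raw accuracy with a class-based measure (the mean of per-class recalls, often called balanced accuracy) on a skewed label set. The function names and the 95/5 class split are illustrative assumptions, not taken from the surveyed works.

```python
from collections import defaultdict

def raw_accuracy(y_true, y_pred):
    # Fraction of all samples predicted correctly
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_class_recall(y_true, y_pred):
    # Fraction of each class's samples predicted correctly
    total = defaultdict(int)
    hit = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return {c: hit[c] / total[c] for c in total}

def balanced_accuracy(y_true, y_pred):
    # Unweighted mean of per-class recalls, so every class counts equally
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)

# Skewed stream: 95 majority samples, 5 minority samples
y_true = [0] * 95 + [1] * 5
# A degenerate model that always predicts the majority class
y_pred = [0] * 100

print(raw_accuracy(y_true, y_pred))       # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The model never recognizes the minority class, yet its raw accuracy is 95%; the class-based measure exposes the failure.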
2.4 Network adaptation
In an online setting, the optimal network capacity is unknown due to concept drift in streaming data. Online learning starts with a small network and expands when concept drift occurs to achieve scalability and efficiency in training [36]. However, there are challenges involved in network expansion, such as when and how to expand the network and whether to expand the network’s width or depth [37]. Network expansion increases the training cost for the new task, and if the existing network is sufficient to handle the new task, then the network may not need to expand. Additionally, network contraction is necessary to prune redundant neurons and layers from the network, reducing the prediction and training costs of the new task. Section 6 elaborates on the literature related to network adaptation.
Fig. 5
Online Adaptive Structure [23]. Output depends on each layer’s outcomes
Figure 5 depicts the online adaptive structure of the deep neural network (DNN), where output depends on each layer output rather than just the last layer. The conventional DNN is composed of several hidden layers, where each layer is connected to the previous layer, and the output is generated from the last layer [23]. However, in the online learning process, the depth of the hidden layers is adjustable to adapt to the model capacity, and these layers are called adaptive depth units. Each adaptive depth unit works as a base classifier to avoid relying solely on the output of the final layer. The final output is a weighted combination of the base classifiers to prevent catastrophic forgetting and improve the convergence speed in case of concept drift.
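The weighted combination of per-layer base classifiers can be sketched as below. This only shows how the final output is formed from given weights; how the weights are learned is not specified in this passage, so the weights `alphas` and the function name are hypothetical.

```python
def combine_layer_outputs(layer_probs, alphas):
    """Combine class-probability vectors from each adaptive depth unit.

    layer_probs: one probability vector per hidden layer's base classifier.
    alphas: one non-negative weight per layer (assumed given; normalized here).
    """
    total = sum(alphas)
    n_classes = len(layer_probs[0])
    combined = [0.0] * n_classes
    for probs, alpha in zip(layer_probs, alphas):
        for k in range(n_classes):
            combined[k] += (alpha / total) * probs[k]
    return combined

# Three layers, two classes: the shallow layer favors class 0,
# the deeper (more heavily weighted) layers favor class 1.
layer_probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
alphas = [0.2, 0.3, 0.5]
output = combine_layer_outputs(layer_probs, alphas)
predicted_class = max(range(len(output)), key=output.__getitem__)
```

Because every layer contributes to the output, shifting weight toward shallow layers gives a fast-adapting model, while shifting it toward deep layers gives a higher-capacity one.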
3 Concept drift
Given a time period [0, t] and a set of samples \(S_{0,t} = \{(X_0,y_0), (X_1,y_1),..., (X_t,y_t)\}\), where X is the feature vector and y is the label, concept drift occurs at timestamp \(t+1\) if the joint distribution of the data changes, i.e., \(P_t(X,y) \ne P_{t+1}(X,y)\) [18, 20, 24, 29]. Depending on the extent of drift, updating the network parameters may suffice for small drift, whereas the network architecture may need to be expanded for significant drift. Network expansion could involve adding neurons to the hidden layer or introducing a new hidden layer to the network [36]. Progressive neural networks expand the network by adding the same number of new layers as the existing network for a new task in streaming data [38, 39]. Feature evolution learning is another solution to manage concept drift by incrementally and decrementally adjusting features as the feature dimension changes dynamically, allowing the model to adapt to new patterns in the data [40]. To quickly overcome the performance degradation due to concept drift, three steps are performed: drift detection (whether drift occurs or not), drift quantification (how much drift occurs), and drift adaptation (how to react to drift) [15].
3.1 Drift detection
The sliding window method is commonly used to examine a small subset of instances for detecting drift. Drift detection can be categorized into two types: error rate-based and data distribution-based. Error rate-based detection can be conducted after the prediction phase, as it is dependent on the model’s performance. Data distribution-based detection, on the other hand, is not reliant on performance and can be performed at any stage, such as before or after classification.
3.1.1 Error rate-based
Error rate-based drift detection algorithms detect drift by monitoring the algorithm’s performance. These algorithms track the online error rate and trigger a drift alarm if it exceeds a specified level. Gama et al. [41] introduced the drift detection method (DDM), which set a benchmark for error rate-based drift detection algorithms. DDM consists of four steps: data retrieval, data modeling, test statistics calculation, and hypothesis testing. DDM focuses on the error rate of the classifier within a specified time window. If the error rate exceeds the warning level, DDM builds a new learner in parallel with the old learner’s prediction. If the error rate reaches the drift level, the old learner is replaced with the new learner for future prediction.
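The warning and drift levels can be illustrated with a compact sketch of a DDM-style detector. It tracks the running error rate p and its standard deviation s = sqrt(p(1 - p)/n), remembers the minimum of p + s, and raises a warning or drift signal when p + s exceeds p_min + 2 s_min or p_min + 3 s_min. The multipliers 2 and 3 and the 30-sample warm-up are the commonly used defaults, stated here as assumptions; the class name and return values are illustrative.

```python
import math

class DDM:
    """Minimal DDM-style detector over a stream of 0/1 prediction errors."""

    def __init__(self, min_samples=30, warning_k=2.0, drift_k=3.0):
        self.min_samples = min_samples
        self.warning_k = warning_k
        self.drift_k = drift_k
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 0.0                # running error rate
        self.p_min = float("inf")   # error rate at the best point seen
        self.s_min = float("inf")   # std. deviation at the best point seen

    def update(self, error):
        """error: 1 if the last prediction was wrong, else 0."""
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if self.p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s > self.p_min + self.drift_k * self.s_min:
            self.reset()            # start fresh statistics for the new concept
            return "drift"
        if self.p + s > self.p_min + self.warning_k * self.s_min:
            return "warning"        # a caller could start training a new learner here
        return "stable"
```

A caller would feed one error indicator per round and, following the description above, begin training a replacement learner on "warning" and swap it in on "drift".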
Ross et al. [42] proposed the exponentially weighted moving average (EWMA) chart for concept drift detection (ECDD) as an improvement over DDM. The method uses the same first three steps as DDM but modifies the fourth step by using an EWMA chart to track changes in the error rate. The EWMA chart is a statistical tool that monitors the performance of a machine learning algorithm over time and detects changes in the data distribution that may indicate concept drift. The chart tracks the mean of a process over time while giving greater weight to more recent observations. However, a limitation of this procedure is the length of the window used for the dynamic mean, which may vary across data and applications, making a standard window length difficult to determine.
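An EWMA chart over the error stream can be sketched as follows. The statistic Z_t = (1 - λ) Z_{t-1} + λ e_t is compared against the estimated error rate plus a multiple of the EWMA standard deviation. The fixed control limit of 3, the smoothing factor λ = 0.2, and the warm-up length are simplifying assumptions for illustration (the published method chooses its control limit differently), and the class name is hypothetical.

```python
import math

class EWMADriftDetector:
    """EWMA control chart on a stream of 0/1 prediction errors."""

    def __init__(self, lam=0.2, control_limit=3.0, min_samples=30):
        self.lam = lam                      # weight of the newest observation
        self.control_limit = control_limit  # assumed constant multiplier
        self.min_samples = min_samples
        self.n = 0
        self.p0 = 0.0   # running estimate of the error rate
        self.z = 0.0    # EWMA statistic

    def update(self, error):
        """Return True when the EWMA statistic exceeds the control limit."""
        self.n += 1
        self.p0 += (error - self.p0) / self.n
        self.z = (1.0 - self.lam) * self.z + self.lam * error
        # Standard deviation of the EWMA of a Bernoulli(p0) stream after n steps
        var_x = self.p0 * (1.0 - self.p0)
        shrink = self.lam / (2.0 - self.lam) * (1.0 - (1.0 - self.lam) ** (2 * self.n))
        sigma_z = math.sqrt(var_x * shrink)
        if self.n < self.min_samples:
            return False
        return self.z > self.p0 + self.control_limit * sigma_z
```

A larger λ reacts faster to change but produces more false alarms, which is the tuning difficulty the paragraph above refers to.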
Guo et al. [23] proposed the use of an ensemble of neural networks, each trained on a different subset of the data, for detecting and adapting to concept drift. To detect concept drift, the accuracy of each neural network within the ensemble is monitored on the data within the sliding window and compared to a threshold. If the accuracy falls below the threshold, it is assumed that concept drift has occurred. Similarly, Han et al. [22] utilized an ensemble of classifiers to detect and adapt to concept drift. Concept drift is detected by monitoring the probability distribution of the data using the error rate of the base classifiers. Losing et al. [24] utilized a sliding window approach to capture the most recent data, and the KNN classifier is trained on the current window to classify incoming data. To detect concept drift, the classification accuracy of the classifier is monitored, and if it falls below a threshold, it is assumed that concept drift has occurred. The memory is then adjusted to adapt to the new data distribution.
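The accuracy-threshold idea shared by these methods can be sketched with a fixed-size sliding window. This is a generic illustration rather than the procedure of any one cited work; the window size, the threshold, and the class name are assumptions.

```python
from collections import deque

class WindowAccuracyMonitor:
    """Flags drift when accuracy over the most recent samples drops too low."""

    def __init__(self, window_size=100, threshold=0.7):
        self.window = deque(maxlen=window_size)  # old outcomes fall out automatically
        self.threshold = threshold

    def update(self, correct):
        """correct: True if the last prediction matched the true label."""
        self.window.append(1 if correct else 0)
        if len(self.window) < self.window.maxlen:
            return False  # wait until the window is full
        accuracy = sum(self.window) / len(self.window)
        return accuracy < self.threshold
```

In an ensemble setting, one such monitor could be kept per base network so that poorly performing members are identified individually.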
3.1.2 Data distribution-based
Data distribution-based drift detection algorithms use distance or similarity functions to detect drift by comparing historical and new data [43‐45]. These algorithms can be computationally expensive due to the need to measure distances for all instances, making them slower than error rate-based drift detection algorithms [29]. Additionally, determining appropriate window sizes for both historical and new data can be challenging. To address these issues, researchers have proposed various approaches for data distribution-based drift detection. Liu et al. [26] developed the Fuzzy Windowing Drift Detection (FW-DD) method, which uses a fuzzy time window instead of a traditional time window to focus on gradual drift detection. FW-DD compares statistical measures of the current time window with those of the previous time window to detect gradual concept drift, using fuzzy logic to allow for a gradual transition between states of no drift, gradual drift, and sudden drift.
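The basic pattern of comparing a historical window with a new window can be sketched with a histogram-based distance. The total variation distance, the bin count, and the threshold are assumptions chosen for simplicity; the cited methods use their own distance or similarity functions.

```python
def histogram(values, bins, lo, hi):
    # Normalized histogram of one feature over the range [lo, hi)
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(bins - 1, max(0, int((v - lo) / width)))
        counts[idx] += 1
    return [c / len(values) for c in counts]

def total_variation(p, q):
    # Half the L1 distance between two discrete distributions (0 = identical)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def distribution_drift(reference, current, bins=10, lo=0.0, hi=1.0, threshold=0.3):
    """Compare a historical window with a new window; no labels are needed."""
    distance = total_variation(
        histogram(reference, bins, lo, hi),
        histogram(current, bins, lo, hi),
    )
    return distance > threshold, distance

# Reference window spread over [0, 1); new window concentrated in [0.5, 1)
reference = [(i + 0.5) / 100 for i in range(100)]
shifted = [0.5 + (i + 0.5) / 200 for i in range(100)]
drifted, distance = distribution_drift(reference, shifted)
```

Since the comparison touches every instance in both windows, its cost grows with window size, which reflects the computational burden noted above.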
Qahtan et al. [46] proposed the PCA-based change detection (PCA-CD) method, which uses PCA to reduce the dimensionality of multi-dimensional streaming data and identify the principal components that capture the most significant variation in the data. PCA-CD constructs a subspace model from the principal components to represent the underlying data distribution and compares the subspace model of the current data with that of historical data using a statistical test. This method is computationally efficient due to its use of an efficient density estimator, and it minimizes the need for setting user-defined thresholds with the help of the Page–Hinkley test. Gu et al. [47] developed the equal density estimation (EDE) method, which applies kernel density estimation to estimate the local data distribution within a fixed-size sliding window. EDE compares the density estimate of the current sliding window with that of the previous window and uses a threshold to determine whether the difference between the two is significant. Liu et al. [28] partitioned the input space into a set of regions using a clustering algorithm and estimated the underlying data density for each region using a kernel density estimator. They proposed an adaptive bandwidth selection method that improved the accuracy of density estimation and allowed the method to handle data streams with varying density levels. Least Squares Density Difference-based Change Detection Test (LSDD-CDT) [27] is an incremental version of [28] that uses a Gaussian mixture model for density estimation and a change detection test to detect significant drift.
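The Page-Hinkley test mentioned above is a sequential change detection test on a scalar signal (for example, a divergence score between windows). A minimal sketch follows; the tolerance `delta`, the alarm `threshold`, and the class name are illustrative assumptions, and this is not the PCA-CD implementation.

```python
class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a signal."""

    def __init__(self, delta=0.005, threshold=5.0):
        self.delta = delta          # tolerated magnitude of change
        self.threshold = threshold  # alarm level for the test statistic
        self.n = 0
        self.mean = 0.0             # running mean of the signal
        self.cum = 0.0              # cumulative deviation from the mean
        self.cum_min = 0.0          # smallest cumulative deviation seen

    def update(self, x):
        """Return True when a significant increase in the mean is detected."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold
```

The test accumulates evidence over time, so small fluctuations cancel out while a sustained shift steadily raises the statistic.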
3.2 Drift adaptation
Once concept drift is detected, adaptation methods can be employed to update the model to reflect the new data distribution. These methods can include retraining the model with new data, updating the model parameters, or using ensemble methods that combine multiple models trained on different data subsets. Elwell et al. [48] proposed Learn++.NSE, which detects concept drift by comparing the current and recent performance of base classifiers. Learn++.NSE adapts to concept drift by building a new classifier unit for each batch of input data and combining them as an ensemble using dynamically weighted majority voting. The voting weights are updated based on the time-adjusted accuracy of each classifier.
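Weighted majority voting can be sketched as below. As a simplification, each classifier's weight is derived from a single error estimate as log((1 - e)/e); the time-adjusted averaging of errors described above is omitted, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def voting_weight(error, eps=1e-6):
    # Low error -> large weight; errors of 0.5 or worse are clipped to weight 0
    error = min(max(error, eps), 0.5)
    return math.log((1.0 - error) / error)

def weighted_majority_vote(predictions, errors):
    """predictions: label voted by each classifier; errors: its recent error rate."""
    scores = defaultdict(float)
    for label, err in zip(predictions, errors):
        scores[label] += voting_weight(err)
    return max(scores, key=scores.get)

# One accurate classifier outvotes two weak ones that disagree with it
decision = weighted_majority_vote(["A", "B", "B"], [0.05, 0.40, 0.45])
```

Recomputing the errors on each new batch lets a classifier trained on an old concept regain influence if that concept reappears.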
Zhou et al. [39] addressed concept drift by adding a constant number of features to the network in case of underfitting and merging features to avoid overfitting and redundancy. However, this method is sub-optimal because it requires complete retraining and adds a constant number of features without measuring the magnitude of the drift. Shao et al. [45] proposed a method to detect concept drift by measuring the similarity between the incoming data stream and the learned prototypes. Once drift is detected, the learned prototypes are updated according to the new data for drift adaptation. Liu et al. [49] detect concept drift by measuring the conflict between the active learner and input data. If drift is detected, a new learner is initialized and trained on the conflicting input data. Losing et al. [24] proposed Self-Adjusting Memory (SAM) KNN to deal with heterogeneous drift. KNN is used as a classifier, and the SAM concept is used to transfer the current concept from short-term memory (STM) to long-term memory (LTM).
Lu et al. [29] proposed a two-step Case-based Reasoning method to solve the drift problem. In the first step, drift is detected along with the competence region, indicating where the drift is more severe. Noise-efficient Fast Context Switching is used to identify noise and novel concepts, and then, the noise is removed. The second step is the preservation of novel concepts for drift adaptation. After preservation, Stepwise Redundancy Removal (SRR) uses KNN to remove redundant concepts, and then, the competence model is updated. Liu et al. [26] proposed the Fuzzy Windowing Drift Adaptation (FW-DA) algorithm, which detects concept drift using a certain warning threshold and membership function. When concept drift is detected, a new learner is created and trained, replacing the old learner.
Xu et al. [25] proposed the Dynamic Extreme Learning Machine (DELM), which detects drift using the same technique as [41]. After drift detection, the adaptation procedure is enhanced by using an Extreme Learning Machine (ELM) [50]. When concept drift is detected, more hidden layer neurons are added to the ELM, which serves as a base classifier. When the drift reaches a specified upper limit, or the accuracy falls to a specified lower limit, the current classifier is deleted, and a new classifier starts training on new data. Ashfahani et al. [51] proposed a Drift Detection Scenario (DDS) to detect concept drift. When concept drift is detected, the depth of the network increases by adding a new hidden layer to the network. At the same time, the complexity reduction scenario removes a hidden layer if it is highly correlated with another hidden layer; in this way, equilibrium in the hidden layers is maintained. Concept drift is detected in the input space by evaluating accuracy with Hoeffding’s bound method. Hoeffding’s bound defines a theoretical bound, i.e., how many data points are required to signal drift based on accuracy.
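The role of Hoeffding's bound can be illustrated with a generic sketch: for a quantity bounded in [0, 1] and averaged over n samples, the observed mean deviates from the true mean by more than ε = sqrt(ln(1/δ) / (2n)) with probability at most δ. The two-window comparison below is an assumed, simplified use of the bound and not the exact procedure of [51]; the function names are hypothetical.

```python
import math

def hoeffding_bound(n, delta=0.05, value_range=1.0):
    # Deviation that the mean of n bounded samples exceeds with probability <= delta
    return value_range * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

def accuracy_drop_is_significant(acc_ref, n_ref, acc_new, n_new, delta=0.05):
    """Signal drift only if the accuracy drop exceeds both windows' uncertainty."""
    eps = hoeffding_bound(n_ref, delta) + hoeffding_bound(n_new, delta)
    return (acc_ref - acc_new) > eps

print(round(hoeffding_bound(100), 4))  # 0.1224
```

Because ε shrinks as n grows, a small accuracy drop needs many samples before it is flagged, while a large drop is flagged after only a few.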
Guo et al. [23] proposed an ensemble-based technique called the selective ensemble-based online adaptive (SEOA) neural network. When concept drift is detected, the next step is adaptation, in which model generalization and adaptability are improved by dynamically and selectively integrating base classifiers of different natures. Han et al. [22] proposed the Bilevel Online Deep Learning Framework (BODL). When concept drift is detected, BODL updates the model parameters for base classifiers using a proposed bilevel optimization scheme. In bilevel optimization, the cross-entropy loss is used as an objective function for memory and model weights. Table 3 presents a list of articles on concept drift, highlighting the methodologies employed, along with their advantages and disadvantages. Ren et al. [52] handle concept drift by dynamically selecting network components based on the current data distribution. This dynamic network selection is conditioned on the discrete variable that models the distribution shifts, allowing the model to adapt to new patterns as they emerge.
Table 3
Concept drift methods and proposed positive and negative remarks
\(-\) Concept drift detected only with incorrect prediction
4 Catastrophic forgetting
Catastrophic forgetting is a common problem in online machine learning where a model forgets previously learned information when it is trained on new data. In an attempt to address this problem, researchers have proposed various techniques, such as the hedging method, selective training, and progressive neural networks. Ans et al. [32] proposed a model called SRM (self-refreshing memory) that comprises a memory module storing essential information from past tasks and a network module learning the current task. The memory module is updated regularly by a self-refreshing mechanism, which selects the most pertinent information from the network module to avoid catastrophic forgetting. As a result, the memory is optimized to retain important knowledge and forget irrelevant or outdated information.
Goodfellow et al. [30] investigated the selection of appropriate learning algorithms and activation functions for different tasks and relationships between tasks to mitigate the catastrophic forgetting effect. They found that dropout is the most effective training algorithm for modern feed-forward neural networks. However, the choice of activation function varies depending on the task and the relationship between tasks. Maxout is the only activation function that consistently performs well across all tasks, but it may not always be the optimal choice. Dropout tends to increase the optimal size of the network, but this effect is not always consistent. Kirkpatrick et al. [55] overcome catastrophic forgetting in neural networks through Elastic Weight Consolidation (EWC), which slows down learning on the weights that matter for previously learned tasks. The plasticity of each weight is selectively reduced, pulling it toward its previous value in proportion to its importance to the previously learned task.
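The selective reduction of plasticity in EWC can be illustrated with its quadratic penalty, (λ/2) Σ_i F_i (θ_i − θ*_i)², where θ* holds the weights after the previous task and F_i is a per-weight importance estimate (the diagonal of the Fisher information in [55]). The sketch below is a simplified illustration on plain Python lists; the function names, the example values, and the plain gradient step are assumptions.

```python
def ewc_penalty(theta, theta_old, fisher, lam):
    """Quadratic EWC penalty: 0.5 * lam * sum_i F_i * (theta_i - theta_old_i)^2."""
    return 0.5 * lam * sum(
        f * (t - t0) ** 2 for t, t0, f in zip(theta, theta_old, fisher)
    )


def ewc_step(theta, grad_new, theta_old, fisher, lam, lr):
    """One gradient step on (new-task loss + EWC penalty).

    grad_new is the gradient of the new-task loss at theta (supplied by the
    caller); the penalty adds lam * F_i * (theta_i - theta_old_i) per weight.
    """
    return [
        t - lr * (g + lam * f * (t - t0))
        for t, g, t0, f in zip(theta, grad_new, theta_old, fisher)
    ]


theta_old = [0.0, 0.0]   # weights after the previous task
fisher = [1.0, 0.0]      # first weight is important, second is not
theta = [1.0, 2.0]       # current weights while learning the new task
print(ewc_penalty(theta, theta_old, fisher, lam=2.0))
print(ewc_step(theta, [0.5, 0.5], theta_old, fisher, lam=2.0, lr=0.1))
```

In this example only the first weight is pulled back toward its old value; the second weight, with zero importance, follows the new-task gradient alone.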
Nguyen et al. [34] proposed a method for measuring catastrophic forgetting using the actual error rate and task sequence hardness. They investigated the relationship between catastrophic forgetting and task properties and found a strong correlation with total complexity and a weak correlation with sequential heterogeneity toward task sequence hardness. Ren et al. [52] use a Bayesian framework with a discrete distribution-modeling variable to capture abrupt shifts in data. This approach helps in retaining knowledge from previous distributions while adapting to new ones, thereby mitigating catastrophic forgetting. The challenge of catastrophic forgetting is also addressed through instance incremental learning, which handles data attributes changing over time, ensuring that the model retains previously learned information while adapting to new data [40]. Park et al. [56] explore this challenge by introducing a speculative backpropagation method that leverages activation history to mitigate forgetting. Their approach allows neural networks to retain knowledge from previous tasks while adapting to new ones, thus enhancing the model’s ability to learn continuously without significant degradation in performance.
4.1 Hedging method
Littlestone et al. [57] proposed the Weighted Majority Algorithm, which assigns weights to each algorithm in a pool and combines their outputs to make the final prediction. Since each algorithm in the pool performs prediction in a different way, the Weighted Majority Algorithm allocates weights according to their performance to account for their heterogeneity. The weight assigned to each algorithm is updated based on its prior performance. Consequently, algorithms with good prior performance contribute more to the final decision than those with poor performance. The Weighted Majority Algorithm is a crucial step in maintaining prior knowledge and overcoming catastrophic forgetting.
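The voting and weight update described above can be sketched as follows for binary predictions. This is a minimal illustration: the penalty factor `beta`, the tie-breaking rule, and the function names are assumptions rather than details taken from [57].

```python
def weighted_majority_predict(weights, predictions):
    """Predict 1 if the total weight voting for 1 is at least that for 0."""
    vote_one = sum(w for w, p in zip(weights, predictions) if p == 1)
    vote_zero = sum(w for w, p in zip(weights, predictions) if p == 0)
    return 1 if vote_one >= vote_zero else 0


def weighted_majority_update(weights, predictions, label, beta=0.5):
    """Multiply the weight of every algorithm that was wrong by beta."""
    return [w * beta if p != label else w for w, p in zip(weights, predictions)]


weights = [1.0, 1.0, 1.0]           # equal trust in three algorithms
predictions = [1, 0, 0]             # their votes on the current sample
print(weighted_majority_predict(weights, predictions))
weights = weighted_majority_update(weights, predictions, label=1)
print(weights)                      # the two wrong algorithms lose weight
```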
Freund et al. [58] presented the Hedge Algorithm as a generalization of the Weighted Majority Algorithm proposed by [57] for online allocation problems. The Hedge Algorithm maintains a weight vector with time t for all algorithms, where weights are nonnegative and sum up to 1. The initial weight vector can be arbitrary or assigned high weights to those algorithms expected to perform best. If prior knowledge of strategy performance is missing, equal weights can be assigned to each strategy. A key improvement over [57] is the application of upper and lower bounds on weights and loss. Using a weight vector, strategies with better performance are given preference for the final output to overcome catastrophic forgetting. The hedging method utilizes each layer of the deep neural network as a classifier [23]. In the hedging method, the weights of the classifiers are updated according to their performance on the current task.
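A minimal sketch of the Hedge weight update follows, assuming per-strategy losses in [0, 1] and a discount parameter `beta` in (0, 1). Each weight is multiplied by beta raised to the loss of its strategy, and the vector is renormalized so that the weights remain nonnegative and sum to 1. The function name and example values are illustrative.

```python
def hedge_update(weights, losses, beta=0.5):
    """Scale each weight by beta ** loss, then renormalize to sum to 1."""
    scaled = [w * beta ** loss for w, loss in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


weights = [0.5, 0.5]                 # equal weights when no prior knowledge exists
weights = hedge_update(weights, losses=[0.0, 1.0])
print(weights)                       # the strategy with zero loss gains weight
```

Unlike the mistake-driven Weighted Majority update, the loss here can take any value in [0, 1], so strategies are discounted in proportion to how badly they performed.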
Han et al. [22] proposed a method to avoid catastrophic forgetting by using base classifiers as an ensemble, with the weights of these classifiers updated using exponential gradient descent in an online manner. The method involves solving a bilevel optimization problem, where the inner problem determines the optimal weights of the base classifiers for the current task, while the outer problem updates the weights of the base classifiers based on their performance on the current task. This approach allows for both retaining important knowledge and adapting to new tasks. It has also been shown to outperform other methods, such as fine-tuning and elastic weight consolidation.
Sahoo et al. [37] proposed a method for training deep neural networks in an online setting with stream data. Their network architecture includes an output classifier connected to each hidden layer, similar to the hedging method. The final output is a combination of the outputs from all layers weighted by their respective performance. This approach gives preference to the output of better-performing layers in the past to overcome catastrophic forgetting. Similarly, in [51], Ashfahani et al. proposed a method to overcome catastrophic forgetting in which each layer is directly connected to the output layer, and its contribution to the final output is determined by a dynamically assigned weight. These weights decide how much preference should be given to new or old knowledge.
4.2 Selective training
In order to overcome the problem of catastrophic forgetting in lifelong learning, Yoon et al. [36] propose a novel incremental learning algorithm called Dynamically Expandable Networks (DEN). DEN is designed to prevent semantic drift by selecting and retaining relevant knowledge from previous tasks while efficiently allocating resources to learn new information. Initially, the network is trained on the first task. When a new task is introduced, the network is duplicated, and the new task is trained on the duplicate network. During this process, the activations of each neuron in the original network are recorded. Based on the difference between a neuron’s activations on the new task and its activations on the original task, a relevance score is calculated for each neuron. The top-ranked neurons, based on their relevance score, are then selected and used in the expanded network for the new task. The expanded network is trained on both the old and new tasks. This process is repeated for each new task, allowing the expanded network to selectively retain knowledge from previous tasks while also efficiently allocating resources to learn new information. By selecting neurons based on their relevance to the new task, DEN effectively prevents semantic drift and retains important knowledge from previous tasks. Additionally, the dynamically expandable nature of the network allows it to efficiently allocate resources to new tasks as they arrive, improving its overall performance in lifelong learning scenarios.
Iman et al. [53] propose a two-step training approach to prevent catastrophic forgetting and over-biasing of the model. In the first step, the network is trained with a high learning rate and then fine-tuned to adjust the weights of the network. In the second step, to prevent catastrophic forgetting, the pre-trained layers are kept frozen, and the network capacity is expanded by adding new neurons and layers to handle the new task’s complexity. In this second step, selective training is performed only on the newly added neurons and layers, allowing the network to learn new information without overwriting previously learned knowledge. This approach improves the model’s ability to handle complex lifelong learning scenarios while maintaining a balanced representation of old and new knowledge.
Mousser et al. [59] designed the Incremental Deep Tree (IDT) framework to enable CNNs to learn new classes incrementally without forgetting previously learned information. The framework organizes the learning process in a tree-like structure, where each node represents a different class or task. This hierarchical approach helps in isolating the learning of new classes from the existing ones, thereby reducing interference and preventing catastrophic forgetting. Instead of retraining the entire network from scratch, the IDT framework updates only the relevant parts of the network. This selective updating mechanism ensures that the model retains its performance on previously learned tasks while incorporating new information.
Fig. 6
Progressive Network [38]. Blue arrows denote the weights for the first label, red for the second, and green for the third. The network may add another column if a new class label appears in the data stream. During training, only the weights of the current class label are updated, while the others remain frozen
4.3 Progressive neural network
Rusu et al. [38] proposed the progressive neural network (PNN) to overcome catastrophic forgetting in lifelong learning. PNN starts with a single-column network for one task, which consists of multiple layers of neurons similar to conventional deep neural networks, and adds a new column for each subsequent task or label. During the learning process of the second task, only the weights in the second column are updated, while the weights in the first column are kept frozen to maintain past knowledge and avoid catastrophic forgetting. Figure 6 illustrates a PNN with three columns, where each block in a column represents a hidden layer, and the third column is added for the final task. During the training process of the third task, only the green connections are updated, and the remaining connections are kept frozen to avoid catastrophic forgetting. The alpha box serves as a lateral connection, also known as an adapter, to ensure that previously learned features are reused, modified, or ignored, depending on their relevance to the current task. As the number of columns increases with the number of tasks, the network becomes increasingly complex. As in [39], a constant number of layers is added in [38] without measuring the difficulty of the task; as a result, network complexity increases exponentially, which needs to be addressed in future work.
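The interaction between a new column and the frozen earlier columns can be sketched for a single hidden layer as follows. This is a simplified illustration: the adapter is reduced to a plain linear lateral map (the adapters in [38] are richer), and the helper names and example values are assumptions.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    return [max(0.0, v) for v in vector]


def pnn_layer(own_weights, own_input, lateral_weights, previous_inputs):
    """Hidden layer of the newest column.

    own_weights/own_input: trainable path of the new column.
    lateral_weights/previous_inputs: one lateral map and one activation vector
    per frozen earlier column (simplified linear adapters).
    """
    total = matvec(own_weights, own_input)
    for lateral, prev in zip(lateral_weights, previous_inputs):
        contribution = matvec(lateral, prev)
        total = [t + c for t, c in zip(total, contribution)]
    return relu(total)


own = [[1.0, 0.0], [0.0, 1.0]]          # weights of the new column (trainable)
lateral = [[[0.5, 0.0], [0.0, 0.5]]]    # adapter from one frozen column
print(pnn_layer(own, [1.0, -2.0], lateral, [[2.0, 2.0]]))
```

Only `own` and `lateral` would receive gradient updates for the new task; the activations in `previous_inputs` come from columns whose weights stay fixed.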
In their recent paper, Ergun et al. [60] proposed a novel approach to address the issue of network complexity in progressive neural networks (PNNs) [38, 39]. Specifically, they introduced a new variant of PNN called the recursive progressive neural network (R-PNN), which incorporates sparse group LASSO regularization [61] to achieve better generalization performance by pruning unnecessary neurons from the network. The authors employed both \(l_1\) and \(l_2\) regularization techniques to promote sparsity in the network connections. By applying sparse group LASSO regularization, they were able to eliminate redundant neurons by setting all outgoing connections of a specific neuron to zero. Table 4 presents a list of articles on catastrophic forgetting, highlighting the methodologies employed, along with their advantages and disadvantages.
Table 4
Catastrophic forgetting methods and proposed positive and negative remarks
Investigation of learning algorithm and activation function for catastrophic forgetting
+ Recommend dropout algorithm with maxout activation function
+ The investigation suggests that the choice of learning algorithm has a greater impact on catastrophic forgetting than the choice of activation function
+ The hierarchical structure of the network makes selective training easy
+ The framework is compared with three other methods on three benchmark datasets
\(-\) The hierarchical structure of the IDT framework can become complex and computationally expensive as the number of classes increases. This can make it challenging to scale the model for very large datasets with numerous classes
Speculative Backpropagation (SB) and Activation History
+ 4.4% improvement in knowledge preservation compared to state-of-the-art techniques
+ 31% training time reduced, making the continual learning more efficient
\(-\)Scalability to very large datasets or complex tasks remains to be fully tested
5 Skewed learning
In the past few decades, skewed learning in data streams has been addressed using two main approaches: data-level and algorithm-level techniques. Data-level methods involve modifying the data prior to feeding it into the learning algorithm, such as through oversampling or undersampling. These approaches are algorithm-independent and can be used with any learning algorithm. Algorithm-level techniques, on the other hand, modify the learning algorithm’s training process to improve the effectiveness of classifiers on imbalanced data streams. These techniques are often more specialized for specific learning models [13].
5.1 Data-level approach
Ditzler et al. [21] proposed two algorithms to address concept drift and imbalanced data in dynamic environments. The first algorithm, Learn++ for Concept Drift with SMOTE (Learn++.CDS), combines Learn++ Non-Stationary Environment (Learn++.NSE) with SMOTE. Learn++.NSE accommodates various types of concept drift, such as slow, rapid, gradual, abrupt, and cyclical drift, while SMOTE balances the minority class ratio with the majority class. The second algorithm, Learn++ Nonstationary and Imbalanced Environment (Learn++.NIE), builds on the first algorithm by making two important updates. Firstly, it avoids using raw classification accuracy, which can be misleading because it reflects overall accuracy rather than class-wise accuracy. Instead, the error distribution is balanced to improve minority recall while preserving majority performance. The classifier weights are updated with class-dependent errors to avoid catastrophic forgetting of the minority class. Secondly, Bagging Variation is used to generate sub-ensembles of classifiers. These sub-ensemble classifiers are trained on the minority class data and an equal portion of random majority class data each time, thereby avoiding oversampling of the minority class or generating synthetic data. In this approach, a batch of samples arrives at each time stamp.
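The data preparation behind the Bagging Variation step can be sketched as follows: each sub-ensemble member receives all minority samples of the current batch plus an equally sized random draw from the majority class. The function name, the fixed seed, and the toy batch are assumptions for illustration, and the training of the classifiers themselves is omitted.

```python
import random


def balanced_subsets(minority, majority, n_subsets, seed=0):
    """Build n_subsets training sets: all minority samples plus an equally
    sized random sample (without replacement) of the majority class."""
    rng = random.Random(seed)
    k = min(len(minority), len(majority))
    return [list(minority) + rng.sample(majority, k) for _ in range(n_subsets)]


minority = ["m1", "m2"]                  # rare-class samples in the batch
majority = [f"M{i}" for i in range(10)]  # frequent-class samples in the batch
for subset in balanced_subsets(minority, majority, n_subsets=3):
    print(subset)
```

No synthetic samples are created and no minority sample is duplicated within a subset, which matches the motivation given above.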
Aminian et al. [63] proposed an approach based on Chebyshev’s inequality to find rare and frequent class instances in the data stream by using the mean and variance of the distribution. A low value indicates a rare sample and a high value indicates a frequent one. This information is then used to perform oversampling and undersampling to balance the class distribution in the data stream. The proposed approach is effective at both high and low levels of imbalance between majority and minority cases in the data stream. However, the approach depends on the mean and variance of the data stream, which may not be effective for an evolving data stream.
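Chebyshev’s inequality states that P(|X − μ| ≥ kσ) ≤ 1/k², so a value far from the mean receives a small bound and can be treated as rare. The sketch below combines this bound with running estimates of the mean and standard deviation; the class and function names are assumptions, and the exact sampling rule of [63] is not reproduced.

```python
import math


class RunningStats:
    """Incremental mean and (population) standard deviation of a stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self.m2 / self.n) if self.n > 0 else 0.0


def chebyshev_bound(x, mean, std):
    """Upper bound on the probability of being at least as far from the mean as x."""
    if std == 0.0:
        return 1.0
    k = abs(x - mean) / std
    return 1.0 if k <= 1.0 else 1.0 / (k * k)


stats = RunningStats()
for value in [1.0, 2.0, 3.0, 4.0, 5.0]:
    stats.update(value)
print(chebyshev_bound(3.0, stats.mean, stats.std))  # close to the mean: frequent
print(chebyshev_bound(9.0, stats.mean, stats.std))  # far from the mean: rare
```

A low bound could then raise the chance of oversampling a sample, and a high bound the chance of undersampling it.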
Czarnowski et al. [64] proposed a hybrid framework using oversampling and instance selection that consists of three components: classification, summarization, and learning (Fig. 7). The classification component consists of an ensemble with classifiers for each target class. A predicted label is the result of a weighted majority vote from the classification ensemble. After classification, incorrectly predicted instances are gathered in the form of data chunks in the summarization component; an incorrect prediction is treated as a sign of concept drift. Instances that can improve the learning process are selected for the learning component; the others are removed from the chunk. The chunk summarization process ensures that balance among the target class instances is maintained. The summarization component is what makes this a data-level approach to classifier learning.
Fig. 7
Post Data-Level Approach [64]. The classification component consists of model ensembles. The summarization component gathers incorrect instances and ensures balance among class instances. The learning component selects adequate instances for model updates
5.2 Algorithm-level approach
Czarnowski et al. [54] proposed an algorithm-based solution for the class imbalance problem. Whenever a new data chunk arrives, a new classifier is introduced to replace the worst component in the ensemble. The worst and best components in an ensemble are determined by the weights assigned to each single-class classifier.
Li et al. [65] proposed a hybrid algorithm-level approach to reduce misclassification on imbalanced data streams. In this approach, the Online Sequential Extreme Learning Machine (OSELM) algorithm is combined with a cost-sensitive learning strategy. In the initialization phase, chunks are created from incoming instances, and penalty weights are calculated for both minority and majority class samples. In both cases, the penalty weight is calculated as one divided by the number of minority or majority samples, respectively. In this way, the penalty weight for the minority class is higher than that for the majority class, which reduces the misclassification caused by an imbalanced data stream.
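The penalty weights described above amount to the inverse of the per-class sample count in a chunk. A minimal sketch follows; the function name and the toy chunk are illustrative, and how the weights enter the OSELM update in [65] is not shown.

```python
from collections import Counter


def class_penalty_weights(labels):
    """Penalty weight per class: one divided by the number of samples of that class."""
    counts = Counter(labels)
    return {cls: 1.0 / n for cls, n in counts.items()}


chunk_labels = [0] * 8 + [1] * 2   # class 1 is the minority in this chunk
print(class_penalty_weights(chunk_labels))
```

With eight majority and two minority samples, an error on a minority sample is penalized four times as heavily as an error on a majority sample.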
Chen et al. [35] proposed cost-sensitive sparse online learning via truncated gradient (CSGT) to address concept drift and skewed learning problems on high-dimensional stream data. CSGT trades off low misclassification, via a convex loss function, against a sparse linear classifier, via the truncated gradient technique: both sparsity and loss are taken into account in the fitness function. CSGT is scalable and quick to respond due to its low space and computational complexity. Table 5 presents a list of articles on skewed learning, highlighting the methodologies employed, along with their advantages and disadvantages.
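The trade-off between loss and sparsity can be illustrated with the truncated gradient operator, which shrinks small weights toward zero after each gradient step. The sketch below is an illustration under assumptions, not the CSGT algorithm of [35]: it uses a hinge loss, applies truncation at every step, and the per-class costs, the `gravity` parameter, and the threshold `theta` are hypothetical choices.

```python
def truncate(value, alpha, theta):
    """Shrink a weight toward zero by alpha if its magnitude is at most theta."""
    if 0.0 <= value <= theta:
        return max(0.0, value - alpha)
    if -theta <= value < 0.0:
        return min(0.0, value + alpha)
    return value


def cost_sensitive_truncated_step(w, x, y, cost, lr, gravity, theta):
    """One online step for a linear classifier with labels y in {-1, +1}.

    A cost-weighted hinge subgradient step is followed by truncation,
    which drives small weights to exactly zero (sparsity).
    """
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    if margin < 1.0:
        w = [wi + lr * cost[y] * y * xi for wi, xi in zip(w, x)]
    return [truncate(wi, lr * gravity, theta) for wi in w]


cost = {1: 5.0, -1: 1.0}   # errors on the minority class (+1) cost more
w = cost_sensitive_truncated_step(
    [0.0, 0.0], [1.0, 0.01], 1, cost, lr=0.1, gravity=0.1, theta=1.0
)
print(w)  # the weight of the weak second feature is truncated to zero
```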
Table 5
Skewed learning methods and proposed positive and negative remarks
A hybrid approach using oversampling and instance selection
+ Chunk summarization process maintains balance among class instances
\(-\) Only incorrect instances are gathered in the summarization component because it is assumed that incorrect predictions happen due to concept drift
\(-\) However, incorrect predictions may also result from the limitations of the model itself
+ No artificial oversampling and instance removal without investigation
\(-\) Concept drift is ignored entirely
\(-\) Classifiers update delayed due to summarization component
6 Network adaptation
In batch learning, the network capacity problem in neural networks is solved with the help of validation. Because the full dataset is not available in advance, batch learning is unrealistic in an online setting, which raises the problem of deciding the network capacity at the beginning. In a static network, the architecture and capability of the network are fixed after training, which may limit inference and efficiency [66‐69]. A dynamic network, on the other hand, can adapt its architecture, including structure and parameters, to the needs of the application. Due to their adaptive nature, these networks can trade off different performance metrics for different target hardware and dynamic environments. Apart from this flexibility, these networks are compatible with recent advanced architectures [70, 71], optimization techniques [72, 73], and data preprocessing methods [73, 74]. Just like a human brain, dynamic networks process information dynamically [75, 76], analyzing which part of the input is more favorable for prediction [77]. Changes in the network architecture can be based on input samples, spatial information, and temporal information. This article targets input-wise dynamic networks, which depend on the incoming samples of data streams. These networks adjust architecture and parameters according to the nature of each sample: architecture adjustment reduces redundancy and increases efficiency, while parameter adjustment improves inference with a minimal increase in computational cost.
Adjusting architecture means changing the depth and width of the network according to sample input. Network architecture may change by performing dynamic routing according to the input sample. As compared to the static model, the dynamic model achieves efficiency by using a shallow network for easy samples and a deep network for hard samples [78‐80]. Bohnstingl et al. [81] introduce an online spatiotemporal learning framework that allows deep recurrent and spiking neural networks to adapt continuously. This framework contrasts with traditional backpropagation methods, which require extensive offline computations. The ability to learn in real time is further supported by the findings of Dongjin et al. [82], who explore deep joint learning for radar signal modulation recognition, demonstrating the potential for adaptive learning in signal processing applications. These advancements indicate a shift toward more flexible and responsive neural network architectures that can thrive in dynamic environments.
6.1 Dynamic depth
Sahoo et al. [37] identify the appropriate depth of the neural network with the help of the classifier performance at each depth. Discount, learning rate, and smoothing parameters are provided as input parameters to the network. Three sets of parameters need to be learned: the classifier weights, which are initialized uniformly and determine how much each classifier contributes to the global output; the classification weights, which produce the local output of each classifier; and the network weights. These parameters are updated using online gradient descent with hedge backpropagation instead of conventional backpropagation. In the hidden layers, ReLU is used as the activation function, and Softmax is used at the output layer. Shallow layers initially perform better than deeper layers because shallow networks converge quickly with less data, while the performance of deeper layers improves gradually as more data arrive.
The depth of a dynamic network increases exponentially due to hard samples, which makes prediction for easy samples computationally expensive and increases their latency. Early exit and skipping layers are two possible solutions to this issue. The complexity of input samples varies in dynamic, real-world applications due to uncertainty, so using the same network depth for easy samples may be naive when prediction latency matters. If the complexity of the inputs differs widely and prediction latency must be reduced, then early exit is the better solution. Early exit allows an easy sample to be predicted by the shallow layers, according to the complexity level of the sample, while skipping the remaining deep layers [83‐85].
6.1.1 Early exit
An initial approach to early exit for easy samples is the cascading CNN, which consists of multiple models. Big/Little-net [86] cascades two network models with different depths. The network exits early if the score difference between class labels exceeds a certain threshold. However, this solution only works for binary classification. Other works [70, 85, 87‐90] cascade multiple CNNs to solve multi-class classification problems. A decision function evaluates whether the features obtained from the previous model are sufficient to be fed to a linear classifier for prediction or should be forwarded to the next model. In these procedures, the models work independently, i.e., each model works from scratch without reusing the features learned by the previous model. Later, backbone networks were proposed that decide on multiple early exits based on a threshold [84, 91] or a learned function [92, 93], and that pass the features learned by the previous model on to the intermediate model.
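A threshold-based early exit on a shared backbone can be sketched as follows. The stage structure, the confidence rule (maximum class probability), and the toy stages are assumptions made for illustration and do not reproduce any specific method cited above.

```python
def early_exit_predict(stages, x, threshold=0.9):
    """Run stages in order and stop at the first confident classifier.

    stages: list of (block, classifier) pairs; block maps features to features,
    classifier maps features to a list of class probabilities.
    Returns (predicted_label, exit_depth).
    """
    features = x
    for depth, (block, classifier) in enumerate(stages, start=1):
        features = block(features)          # reuse features from earlier stages
        probs = classifier(features)
        confidence = max(probs)
        if confidence >= threshold or depth == len(stages):
            return probs.index(confidence), depth
    raise ValueError("stages must not be empty")


# Toy stages (hypothetical): each block adds 1 to a scalar feature, and the
# classifiers return fixed probabilities to make the control flow visible.
toy_stages = [
    (lambda f: f + 1, lambda f: [0.6, 0.4]),
    (lambda f: f + 1, lambda f: [0.05, 0.95]),
]
print(early_exit_predict(toy_stages, 0, threshold=0.9))
print(early_exit_predict(toy_stages, 0, threshold=0.5))
```

Lowering the threshold lets more samples leave at shallow depth, trading accuracy for latency.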
6.1.2 Skipping layers
In the above procedures, after a certain number of layers, the remaining deep layers are skipped for an early exit. A more flexible network, ResNet [70], was proposed to skip only intermediate layers and continue execution of the remaining deep layers. The halting score, a scalar value first proposed in [66], is used to decide whether the learned features of the previous layer are fed to the intermediate layer or not. This procedure was later improved in [94] for vision tasks via residual blocks. The method in [95] reduces the number of halting score evaluations by building weight-sharing blocks that consist of multiple layers and evaluating the halting score after each block instead of after each layer. Ashfahani et al. [51] adapt network depth by adding or pruning hidden layers with the help of the Drift Detection Scenario (DDS) method. A new hidden layer is added when drift is signaled, which is based on a high-bias situation. Layer pruning is achieved via the analysis of mutual information across the hidden layers: if two layers contain the same information, then one of them is pruned to reduce network depth. In DDS, the thresholds for network depth adaptation are self-learned. Zhou et al. [39] proposed an incremental feature learning algorithm to determine the optimal capacity of the network for a complex problem. The feature learning algorithm consists of two strategies: adding new features (hidden neurons or layers) and merging features (hidden layers).
6.2 Dynamic width
Network width can be controlled by skipping neurons, branches, or channels. In a fully connected network, according to sample nature, different neurons are responsible for different feature representations. Initially, neuron activations are controlled by auxiliary branches [96‐98], and low-rank approximation [99]. A soft mixture of experts consists of multiple network branches built in parallel to each other, and outputs are fused depending on weight policy [100‐102]. Later hard gates are developed for branch selection to skip some branches and improve inference by decreasing prediction latency [77, 103, 104].
Multi-stage architectures can be constructed using the width channel dimension, allowing for early predictions to be made with considerable confidence [105]. One method for achieving this is through the use of a Channel Gating Network (CGNet) [106], which activates a portion of the convolutional filters and selects subsequent filters based on specific criteria. Another approach is to dynamically decide which channels to activate at each stage, as done in Runtime Neural Pruning (RNP) [107], which performs layer-wise pruning using a Markov decision process. A gate model could also be used to decide the width of a stage for ResNet [108], though this requires training and optimization. Several other solutions [109‐111] prune both depth and width dynamically, allowing for more flexible networks. These methods can improve network efficiency and accuracy by removing unnecessary connections while retaining important ones.
Yoon et al. [36] addressed the scalability and efficiency of networks in an online setting. If selective retraining fails to produce satisfactory performance, then the network is expanded in a top-down manner, and unnecessary neurons are removed with the help of group sparsity regularization. DEN calculates the amount of drift from the number of neurons related to previous and new tasks. If the difference exceeds the limit, then neurons for the new task are duplicated. DEN is also compared with batch training; both achieve the same performance, though DEN uses less network capacity than batch training. After fine-tuning, DEN outperforms batch training.
Ashfahani et al. [51] proposed a self-organized network, also known as autonomous deep learning. Network width is adapted with the help of the network significance (NS) method, which adds and prunes hidden neurons. Network adaptation is decided on the basis of bias and variance. Bias is calculated as the difference between the actual output and the average predicted output. A high bias indicates underfitting, which can be overcome by adding new hidden neurons to the network. Variance is defined as the variability in prediction; a high variance indicates overfitting, where the network fits the data it has seen but may perform poorly on unseen data. Overfitting can be overcome by pruning hidden neurons. In the NS method, the thresholds for bias and variance are not user-defined but self-learned.
Routing nodes are responsible for selecting different paths in dynamic routes. CapsuleNets [69, 112] perform dynamic routing between capsules to draw relations among objects or parts of the objects. A capsule is a group of neurons, and capsule parameters need to be trained, fine-tuned, and optimized. Another approach is to keep the architecture static and dynamically adjust the parameters of the network, which increases the computation cost only slightly compared with a dynamic architecture. However, this paper only targets dynamic architecture. Ren et al. [52] employ a dynamic masking strategy that supports inter-distribution transfer. By overlapping a set of sparse networks, the model can adapt to different distributions without needing to retrain from scratch. This strategy ensures that the network can flexibly adjust to varying data regimes. Table 6 presents a list of articles on network adaptation, highlighting the methodologies employed, along with their advantages and disadvantages.
Table 6
Network adaptation methods and proposed positive and negative remarks
+ Real-time adaptation in spiking neural networks (SNNs) and recurrent neural networks (RNNs)
+ Achieves gradient equivalence to backpropagation through time (BPTT) for shallow networks, enabling efficient online training with comparable performance to traditional offline methods
\(-\) While OSTL is effective for shallow networks, its application to deeper networks may involve increased computational complexity and resource requirements
1) Bike sharing. 2) Appliance energy prediction. 3) California housing. 4) Friedman Artificial Domain. 5) Beijing PM2.5 Data. 6) US used car sales data. 7) 3D road network
Point of Interest (POI) recommendation 1) New York 2) Tokyo [115]
7 Applications
7.1 Smart city
In the context of smart cities, online learning techniques are increasingly being applied to optimize energy harvesting and information decoding for Internet of Things (IoT) devices. These devices are capable of simultaneous wireless information and power reception, which is crucial for maintaining efficient and sustainable urban environments. Chun et al. [116] and Luo et al. [117] developed frameworks to jointly optimize energy harvesting and information decoding for IoT devices, involving the design of a generalized power-splitting receiver where each antenna has an independent power splitter, thereby enhancing network performance. The primary objective of Lee et al. [118] and Al et al. [119] is to maximize the harvested energy for each IoT device while meeting data rate requirements. To achieve this, a double-deep deterministic policy gradient-based online learning algorithm has been proposed. The algorithms of Tang et al. [120] and Li et al. [121] allow each IoT device to determine receive beamforming and power-splitting ratio vectors in real time, and they can be implemented in a distributed manner using only local channel state information. This eliminates the need for cooperation and information exchange among base stations and IoT devices, making the system more efficient. Lee et al. [118] validated the effectiveness of the proposed algorithm through extensive simulations, demonstrating significant improvements in energy harvesting and information decoding efficiency. These approaches highlight the potential of online learning to enhance the functionality and sustainability of smart cities by optimizing the performance of IoT devices in real time. Wang et al. [114] developed a novel deep interactive reinforcement learning framework to enhance the accuracy and relevance of POI recommendations within smart cities.
The methodology involves modeling dynamic interactions between users and geospatial contexts through a dynamic knowledge graph stream, capturing human–human, geo–human, and geo–geo interactions. The recommendation process is treated as a series of actions taken by an agent in response to environmental changes, including user visits and POI updates. Jiang et al. [122] introduce the EduHawkes framework leveraging the Neural Hawkes Process to model online study behaviors. EduHawkes employs a hierarchical encode-decode architecture to simultaneously optimize two tasks: study behavior prediction (event-level) and study quality prediction (course-level).
7.2 Finance
Online learning algorithms are particularly effective in predicting stock prices as they can continuously update their models with new market data. This allows them to adapt to changing market conditions and improve the accuracy of their predictions over time. For instance, a model might initially predict stock prices based on historical data, but as new data come in, it can adjust its predictions to reflect recent trends and events. Padhi et al. [123] present a two-stage framework for stock market prediction. First, the framework uses the mean–variance approach for portfolio construction to minimize investment risk. Then, it employs an online machine learning technique combining perceptron and passive-aggressive algorithms to predict future stock price movements. The passive-aggressive algorithm leaves the model unchanged (passive) when the current example is predicted correctly with a sufficient margin, and otherwise updates the model (aggressive) by an amount that depends on the prediction error. This makes it robust for online learning scenarios where quick adjustments are needed.
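To make the passive and aggressive behavior concrete, the following is a minimal sketch of a single passive-aggressive update (the PA-I variant) for binary labels in {-1, +1}. It is not the implementation of Padhi et al. [123]; the function name pa_update, the aggressiveness parameter C, and the plain-list weight representation are assumptions made for illustration.

```python
def pa_update(w, x, y, C=1.0):
    """One PA-I step for a linear model; y must be -1 or +1.

    w and x are equal-length lists of floats. C caps the step size
    (an illustrative default, not a value from the survey).
    """
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)            # hinge loss on this example
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return list(w)                       # passive: keep the model as it is
    tau = min(C, loss / sq_norm)             # aggressive: step grows with the loss
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


# Usage: the first call changes the weights, the second leaves them unchanged
# because the example is then classified with margin 1.
w = pa_update([0.0, 0.0], [1.0, 0.0], +1)
w_again = pa_update(w, [1.0, 0.0], +1)
```

The step size is the smaller of C and the ratio of the hinge loss to the squared norm of the input, so a larger error produces a larger correction, up to the cap.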
In algorithmic trading, online learning models can optimize trading strategies by learning from real-time data [124]. These models can make buy and sell decisions based on the latest market information, helping traders to maximize their profits. For example, an online learning model might identify a pattern in stock price movements and adjust its trading strategy accordingly, leading to more profitable trades. The approach of Tsantekidis et al. [125] involves diversity-driven knowledge distillation, where multiple teacher agents are trained on different subsets of real-time streaming data to learn diverse trading policies. These policies are then distilled into a student agent, enhancing its ability to adapt to noisy financial environments and improving overall trading performance. This method leverages online and incremental learning to continuously update and refine trading strategies, demonstrating substantial improvements in both stability and effectiveness in dynamic market conditions.
Online portfolio selection is a dynamic and evolving field that leverages advanced machine learning techniques to optimize investment strategies in real time. In portfolio management, online learning can assist in dynamically adjusting investment portfolios to optimize returns and minimize risks. By continuously learning from new market data, these models can make more informed decisions about which assets to buy or sell. The foundational work by Cesa-Bianchi and Lugosi [126] in “Prediction, Learning, and Games” provides a comprehensive framework for understanding the theoretical underpinnings of sequential decision-making and prediction, which are crucial for developing robust online portfolio selection algorithms. Building on this foundation, the survey by Li and Hoi [127], “Online Portfolio Selection: A Survey,” offers a detailed overview of various state-of-the-art approaches, categorizing them into benchmarks, “Follow-the-Winner,” “Follow-the-Loser,” “Pattern-Matching,” and “Meta-Learning Algorithms.” Their subsequent book [128], “Online Portfolio Selection: Principles and Algorithms,” further elaborates on these principles and introduces innovative strategies that utilize machine learning for financial investment. The development of practical tools, such as the OLPS toolbox by Li et al. [129], has significantly advanced the field by providing an open-source platform for implementing and benchmarking various online portfolio selection strategies. Additionally, the work on transaction cost optimization by Li et al. [130] addresses the practical challenges of incorporating transaction costs into online portfolio selection, proposing a novel framework that enhances the performance of existing strategies under realistic trading conditions. Together, these contributions highlight the interdisciplinary nature of online portfolio selection, integrating concepts from finance, machine learning, and optimization to create sophisticated and effective investment strategies.
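As an illustration of how a portfolio can be adjusted after every trading period, the sketch below implements one step of the exponentiated gradient (EG) update, a classical strategy usually placed in the “Follow-the-Winner” category. It is not taken from the OLPS toolbox [129]; the function name eg_update, the learning rate eta, and the example price relatives are assumptions for illustration only.

```python
import math


def eg_update(portfolio, price_relatives, eta=0.05):
    """One exponentiated-gradient step for online portfolio selection.

    portfolio: current weights, non-negative and summing to 1.
    price_relatives: for each asset, closing price divided by the
    previous closing price over the period just observed.
    eta: learning rate (illustrative default).
    """
    period_return = sum(b * x for b, x in zip(portfolio, price_relatives))
    # Assets that outperformed the portfolio receive a larger multiplier.
    scaled = [
        b * math.exp(eta * x / period_return)
        for b, x in zip(portfolio, price_relatives)
    ]
    total = sum(scaled)
    return [s / total for s in scaled]   # renormalize onto the simplex


# Usage: weight shifts toward the first asset, which gained 10%.
new_portfolio = eg_update([0.5, 0.5], [1.1, 0.9])
```

Transaction costs are ignored here; under realistic trading conditions each rebalancing step would also incur a cost proportional to the change in weights.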
7.3 Healthcare systems
Online learning has significant applications in healthcare systems, particularly in real-time health monitoring on edge devices. Le Sun and Yueyuan Wang [131] introduce an energy-efficient online time series classification algorithm called OTCD. This algorithm is designed to handle challenges such as concept drift and catastrophic forgetting, making it highly suitable for continuous monitoring of health data like electrocardiograms (ECG) and photoplethysmograms (PPG). By efficiently processing and classifying time series data on edge devices, OTCD enables timely and accurate health assessments, which are crucial for early detection and intervention in various medical conditions.
Sana Ayromlou et al. [132] introduce a novel data-free class incremental learning framework called Continual Class-Specific Impression (CCSI). This framework addresses the challenge of catastrophic forgetting in deep learning models, which is crucial for continuously updating healthcare systems with new disease types. CCSI ensures privacy and complies with storage regulations by synthesizing data from previously learned classes and combining them with new class data. This approach has demonstrated significant improvements in classification accuracy on various medical datasets, making it a valuable tool for real-time health monitoring and diagnosis.
Sun et al. [133] introduce an algorithm called Prevent Concept Drift in Online Continual Learning (PCDOL). This algorithm is designed to handle challenges such as concept drift and catastrophic forgetting, which are crucial for maintaining accurate and up-to-date health monitoring systems. PCDOL is energy-efficient, requiring minimal computational power and memory, making it ideal for use in nanorobots that collect and analyze health data like electrocardiograms (ECG) and electroencephalograms (EEG). The experimental results demonstrate that PCDOL outperforms several state-of-the-art methods in handling these challenges, ensuring reliable and efficient health monitoring.
Fatemeh Amrollahi et al. [134] introduce a privacy-preserving continual learning algorithm named Weight Uncertainty Propagation and Episodic Representation Replay (WUPERR). This algorithm addresses the challenge of catastrophic forgetting and maintains high predictive performance across different healthcare institutions. Validated using data from over 104,000 patients across four distinct healthcare systems for early sepsis prediction, WUPERR demonstrated superior performance compared to baseline transfer learning approaches. This approach ensures privacy and enhances the generalizability of predictive models, making it a valuable tool for real-time health monitoring and diagnosis.
Mengya Xu et al. [135] introduce a privacy-preserving synthetic continual semantic segmentation framework designed to enhance the precision of robotic-assisted surgeries. This framework addresses the challenge of catastrophic forgetting in deep neural networks by blending open-source old instrument foregrounds with synthesized backgrounds and new instrument foregrounds with extensively augmented real backgrounds. This approach ensures that real patient data are not revealed, maintaining privacy. The framework also incorporates overlapping class-aware temperature normalization (CAT) and multi-scale shifted-feature distillation (SD) to maintain long- and short-range spatial relationships among semantic objects. The effectiveness of this framework was demonstrated on the EndoVis 2017 and 2018 instrument segmentation datasets, making it a valuable tool for real-time health monitoring and diagnosis.
8 Ethical implications
Online deep learning operates in environments where data are continuously generated and processed. This dynamic nature presents unique challenges in ensuring that the models developed are not only effective but also ethical and fair. The integration of ethical considerations into the design and deployment of online learning systems is crucial, especially as these systems increasingly influence decision-making processes across various sectors.
The ethical implications of online deep learning systems primarily revolve around the potential for bias in the data streams they utilize. Bias can manifest in various forms, including historical bias [136], societal bias [137], representation bias [138, 139], and measurement bias [140, 141], which can lead to unfair treatment of certain groups or individuals when decisions are made based on model outputs. For instance, if the streaming data reflect historical inequalities or societal biases, the models trained on such data may perpetuate or even exacerbate these biases. This concern is echoed in the literature, where researchers emphasize the importance of understanding the sources of bias and implementing strategies to mitigate them.
One approach to addressing bias in online learning systems is through the implementation of fairness-aware algorithms [142]. These algorithms are designed to identify and correct biases in the training data or the model outputs. For example, techniques such as re-weighting the training samples or modifying the decision thresholds can help ensure that the model’s predictions do not disproportionately disadvantage any particular group. Moreover, continuous monitoring of model performance across different demographic groups is essential to identify and rectify any emerging biases in real time. This aligns with the findings of Lian et al., who discuss the importance of adapting online learning algorithms to account for the dynamic nature of streaming data, which can change over time and may introduce new biases [143].
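The continuous monitoring described above can be sketched as a sliding-window accuracy tracker kept separately for each demographic group. This is a minimal illustration rather than a method from the cited works; the class name GroupAccuracyMonitor, the window length, and the max_gap tolerance are all assumptions.

```python
from collections import defaultdict, deque


class GroupAccuracyMonitor:
    """Tracks recent prediction accuracy per group on a data stream."""

    def __init__(self, window=200, max_gap=0.1):
        self.max_gap = max_gap   # tolerated accuracy gap between groups
        # One bounded buffer of 0/1 outcomes per group.
        self.hits = defaultdict(lambda: deque(maxlen=window))

    def update(self, group, y_true, y_pred):
        self.hits[group].append(1 if y_true == y_pred else 0)

    def accuracies(self):
        return {g: sum(h) / len(h) for g, h in self.hits.items() if h}

    def gap_exceeded(self):
        """True when the best and worst groups differ by more than max_gap."""
        acc = self.accuracies()
        if len(acc) < 2:
            return False
        return max(acc.values()) - min(acc.values()) > self.max_gap


# Usage: feed each labeled prediction as it arrives, then poll the flag.
monitor = GroupAccuracyMonitor(window=4, max_gap=0.1)
for y_true, y_pred in [(1, 1), (0, 0), (1, 1), (0, 0)]:
    monitor.update("group_a", y_true, y_pred)
for y_true, y_pred in [(1, 0), (0, 0), (1, 0), (0, 0)]:
    monitor.update("group_b", y_true, y_pred)
needs_review = monitor.gap_exceeded()
```

When the flag is raised, a mitigation such as re-weighting samples or adjusting decision thresholds could be triggered. Accuracy is only one possible criterion, and other fairness measures may be more appropriate for a given application.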
Furthermore, the ethical deployment of online learning systems necessitates transparency and accountability. Stakeholders, including developers, users, and affected individuals, should be informed about how models are trained, the data sources used, and the potential limitations of the models. This transparency can foster trust and enable stakeholders to understand the implications of decisions made by these systems. Additionally, establishing accountability mechanisms, such as regular audits and assessments of model performance, can help ensure that ethical standards are upheld throughout the lifecycle of the online learning system.
Another critical aspect of addressing ethical implications in online deep learning is the need for interdisciplinary collaboration. Engaging ethicists, social scientists, and domain experts in the development process can provide valuable insights into the societal impacts of these technologies. This collaborative approach can help identify ethical dilemmas early in the design phase and facilitate the development of solutions that are not only technically sound but also socially responsible. The integration of diverse perspectives can lead to more robust and equitable online learning systems that better serve the needs of all stakeholders.
Moreover, it is essential to consider the regulatory landscape surrounding online learning systems. As these technologies become more prevalent, there is a growing demand for policies and regulations that govern their use. Policymakers must work closely with researchers and practitioners to develop guidelines that promote ethical practices in online learning. These regulations should address issues such as data privacy, informed consent, and the right to explanation, ensuring that individuals are aware of how their data are used and the implications of automated decisions made by these systems.
9 Conclusion
In this paper, we have explored various challenges and solutions associated with online learning, particularly in dynamic environments where data characteristics can shift over time. As online learning applications expand into fields with complex and evolving data streams, handling issues like concept drift, catastrophic forgetting, skewed learning, and network adaptation has become increasingly essential. Each of these challenges introduces unique considerations for model design, training efficiency, and long-term performance. However, most of the existing research on these four problems has been conducted on synthetic or artificial datasets, with only a few studies using real-world datasets of a specific nature. To further validate the effectiveness of these algorithms, more experiments on dynamic and diverse real-world datasets are necessary in the future. Table 7 divides datasets from the literature review into artificial and real-world datasets. Real-world datasets are more dynamic than synthetic or artificial datasets, so future work should rely more on real-world datasets than on artificial ones. In the following subsections, we summarize the key methods and approaches in each area, highlighting their strengths, limitations, and opportunities for further development.
9.1 Concept drift
The detection of concept drift based on data distribution typically involves the use of a window slot, similar to error rate-based detection. Statistical tests or distance measures are used to compare the current window slot to historical ones and detect drift. However, the effectiveness of these methods is highly dependent on the choice of threshold values, which can vary depending on the application. For high-dimensional data, dimensionality reduction techniques such as PCA may be employed.
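The window comparison described above can be sketched as follows, assuming a single numeric feature and using the two-sample Kolmogorov-Smirnov statistic as the distance measure. The function names and the threshold value of 0.3 are illustrative assumptions; as noted, a suitable threshold depends on the application.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of two samples (0 to 1)."""
    ref_sorted, cur_sorted = sorted(reference), sorted(current)
    largest = 0.0
    # The gap between two step functions can only change at sample points.
    for value in ref_sorted + cur_sorted:
        cdf_ref = bisect_right(ref_sorted, value) / len(ref_sorted)
        cdf_cur = bisect_right(cur_sorted, value) / len(cur_sorted)
        largest = max(largest, abs(cdf_ref - cdf_cur))
    return largest


def drift_detected(reference, current, threshold=0.3):
    """Flags drift when the current window differs from the historical one."""
    return ks_statistic(reference, current) > threshold


# Usage: compare a historical window with the most recent window.
historical_window = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
recent_window = [1.1, 1.2, 1.3, 1.4, 1.5, 1.6]
flag = drift_detected(historical_window, recent_window)
```

For high-dimensional data, the same test could be applied to each component obtained after dimensionality reduction, at the cost of having to correct for multiple comparisons.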
Adapting to concept drift often involves building new classifiers, which can lead to catastrophic forgetting if prior knowledge is lost. To address this issue, some methods focus on adding new features to the existing network, such as adding neurons or layers and later merging similar features to control network growth. It is important to carefully choose when and how to add features to manage concept drift capacity and reduce the time complexity associated with online learning.
9.2 Catastrophic forgetting
The hedging method is a promising approach to mitigate the issue of catastrophic forgetting in machine learning algorithms by addressing noise and uncertainty in the data. However, its effectiveness in handling streaming data with concept drift is limited due to the predefined structure of the network. This method utilizes an ensemble of multiple classifiers, each trained with different hyperparameters and initialization conditions, which allows for a wider exploration of policies. While this approach can improve performance, it also has some drawbacks. For example, using multiple classifiers can increase the computation and memory cost by maintaining multiple models and their weights. Additionally, optimizing the combination of multiple models can be challenging, and selecting appropriate weights for each classifier can be time-consuming. Nevertheless, the hedging method remains a promising approach for handling concept drift in streaming data and can be further improved with future research.
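The weighting of multiple classifiers can be illustrated with the classical Hedge rule, in which each classifier's weight is multiplied by a discount raised to the power of its loss and the weights are then renormalized. This is a minimal sketch under the assumption that losses lie in [0, 1]; the function names and the discount beta are chosen for illustration and do not reproduce any specific method from the survey.

```python
def hedge_update(weights, losses, beta=0.9):
    """Discounts each classifier's weight by beta ** loss, then renormalizes.

    weights: current non-negative weights summing to 1.
    losses: per-classifier loss on the latest example, each in [0, 1].
    beta: discount in (0, 1); smaller values punish errors harder.
    """
    scaled = [w * beta ** loss for w, loss in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def hedge_predict(weights, predictions):
    """Weighted combination of the classifiers' numeric outputs."""
    return sum(w * p for w, p in zip(weights, predictions))


# Usage: the second classifier erred, so its weight shrinks.
weights = hedge_update([0.5, 0.5], losses=[0.0, 1.0], beta=0.5)
combined = hedge_predict(weights, predictions=[1.0, 0.0])
```

The update itself is cheap. The memory and computation cost mentioned above comes from maintaining and evaluating every classifier in the ensemble on each incoming example.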
Selective training prevents catastrophic forgetting by updating only a subset of the weights in the network, allowing the model to retain important knowledge from previous tasks. It saves computational resources compared to hedging because it avoids complete retraining, and it is easy to implement because it does not require significant changes to the model architecture. However, selective training may not adapt completely to new tasks, and it may lose some information because only some of the weights from the previous task are retained. It is also difficult to determine which weights, neurons, or layers should be updated and on what basis. Other methods store information related to the previous task in a separate memory component, which is costly.
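A minimal sketch of the idea follows, assuming that an importance score is already available for every weight. How to obtain such scores is exactly the open question noted above, so the importance values, the function names, and the freeze_fraction parameter are hypothetical.

```python
def select_trainable(importance, freeze_fraction=0.5):
    """Returns a boolean mask; the most important weights are frozen."""
    n_frozen = int(len(importance) * freeze_fraction)
    by_importance = sorted(
        range(len(importance)), key=lambda i: importance[i], reverse=True
    )
    frozen = set(by_importance[:n_frozen])
    return [i not in frozen for i in range(len(importance))]


def masked_sgd_step(weights, grads, trainable, lr=0.01):
    """Applies a gradient step only where the mask allows it."""
    return [
        w - lr * g if can_update else w
        for w, g, can_update in zip(weights, grads, trainable)
    ]


# Usage: weights 0 and 2 matter most for the old task and stay fixed.
mask = select_trainable([0.9, 0.1, 0.5, 0.2], freeze_fraction=0.5)
updated = masked_sgd_step([1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0], mask, lr=0.1)
```

Because the frozen weights never change, the knowledge they encode is retained, but the capacity left for the new task shrinks as freeze_fraction grows.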
The progressive neural network (PNN) is an effective technique for avoiding catastrophic forgetting by retaining previously learned knowledge from past tasks. PNN utilizes incremental learning, which reduces computational costs by avoiding the need to train separate networks for each new task. However, PNN can become computationally expensive as the number of tasks and layers increases, because the network grows with each new task. Furthermore, PNN has limited generalization capabilities for entirely new tasks that are different from those previously learned.
9.3 Skewed learning
To avoid skewed learning, various methods can be used at both the data level and algorithm level. The SMOTE technique is commonly used at the data level, but it may affect the performance due to the introduction of artificial data. Another method involves updating weights through class-dependent error to avoid creating artificial data. Ensemble classifiers may also be trained on minority classes with an equal portion of the majority class, but this method can result in some portion of the majority class data being useless. To improve this, it is essential to use only important data from the majority class. At the data level, the class-wise accuracy approach can be used instead of raw accuracy to avoid skewed learning. This approach is easy to implement and algorithm-independent. Oversampling of the minority class and undersampling of the majority class need to be carefully controlled, as excessive oversampling or undersampling can introduce bias and negatively impact model performance. At the algorithm level, the cost-sensitive approach is less computationally expensive than ensemble algorithms. However, the performance of the ensemble algorithm is generally better than that of the cost-sensitive approach due to the use of multiple models.
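The difference between raw accuracy and class-wise accuracy can be shown with a short sketch. The function name class_wise_accuracy and the toy labels are assumptions for illustration.

```python
from collections import defaultdict


def class_wise_accuracy(y_true, y_pred):
    """Returns per-class accuracy and its unweighted mean over classes."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    per_class = {label: correct[label] / total[label] for label in total}
    mean_over_classes = sum(per_class.values()) / len(per_class)
    return per_class, mean_over_classes


# Usage: a model that always predicts the majority class 0.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
raw_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_class, balanced = class_wise_accuracy(y_true, y_pred)
```

In this toy stream the raw accuracy is 0.9 although the minority class is never recognized, whereas the class-wise mean is 0.5 and exposes the skew. The measure depends only on labels and predictions, which is why it is algorithm-independent.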
9.4 Network adaptation
Determining network capacity based on classifier performance can be challenging when dealing with imbalanced class distributions. Another approach is to dynamically add or remove neurons based on the bias and variance in performance and add or remove layers for tasks with high bias and variance. Early exit can be a more practical solution for network adaptation, as it is easier to determine when to exit early based on input performance rather than skipping layers for specific tasks. Network width can be increased by adding neurons for complex tasks and reduced through pruning to improve generalization.
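One common way to decide when to exit early is to stop at the first intermediate classifier whose confidence exceeds a threshold. The sketch below assumes that each exit head is a callable returning class probabilities for the input at a given depth; the name early_exit_predict, the heads, and the threshold of 0.9 are hypothetical and do not come from a specific method in the survey.

```python
def early_exit_predict(exit_heads, x, threshold=0.9):
    """Returns (predicted label, index of the exit that was used).

    exit_heads: classifiers ordered from shallow to deep; each maps the
    input to a list of class probabilities.
    """
    if not exit_heads:
        raise ValueError("at least one exit head is required")
    for depth, head in enumerate(exit_heads):
        probs = head(x)
        if max(probs) >= threshold:
            break                        # confident enough: stop here
    # If no head was confident, the deepest head's output is used.
    label = max(range(len(probs)), key=probs.__getitem__)
    return label, depth


# Usage with stand-in heads that ignore the input.
heads = [
    lambda x: [0.6, 0.4],     # shallow exit, not confident
    lambda x: [0.05, 0.95],   # middle exit, confident
    lambda x: [1.0, 0.0],     # final exit
]
label, used_exit = early_exit_predict(heads, x=None, threshold=0.9)
```

With these stand-in heads the second exit is used. Raising the threshold pushes more inputs to deeper exits, trading computation for confidence.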
10 Future directions
Although there have been numerous investigations in the literature, there remain several unresolved problems and difficulties that require further exploration and community efforts in future research. Below, we outline some significant and emerging research directions for scholars who have an interest in online learning.
First of all, concept drift detection and adaptation have been extensively studied, both in single- and multi-dimensional data. Detection is the first step, and adaptation is the last step in solving concept drift. Adaptation mainly depends on the amount and type of drift that occurs, which is still an open challenge. There is still a lack of approaches for measuring concept drift and then adapting to it according to the level of drift in the data.
Secondly, an important direction in online learning research is the exploration of large-scale streaming images for real-time big data analytics. While online learning has significant efficiency and scalability advantages over batch learning for static images, it becomes a challenging task when handling the extremely high volume and high velocity of streaming images.
Third, despite considerable research in large-scale batch machine learning, there is a need for further investigation into parallel online learning and distributed online learning using diverse computational resources, including high-performance computing machines, cloud computing infrastructures, and potentially low-cost IoT computing environments.
Fourth, the impact of these four problems on each other and on the evaluation metrics requires further study. When proposing a deep learning model that operates in an online fashion, all four problems need to be considered, and their impact on each other needs to be specified. Preventing catastrophic forgetting may increase network complexity, while reducing network complexity may result in catastrophic forgetting. In the same way, increasing learning accuracy may increase model complexity, and increasing learning scalability may affect computational efficiency.
Finally, online learning is often applied in domains where data privacy and security are critical, such as healthcare and finance. Therefore, developing privacy-preserving and secure online learning algorithms is becoming increasingly important. Future research should focus on developing techniques that can provide strong privacy guarantees while maintaining the performance of the online learning algorithms.
Acknowledgements
This work is supported by the project EMB3DCAM “Next Generation 3D Machine Vision with Embedded Visual Computing” and co-funded under the grant number 325748 of the Research Council of Norway.
Declarations
Conflict of interest
The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.