Open Access 06-08-2024 | Original Article

Multiscale-integrated deep learning approaches for short-term load forecasting

Authors: Yang Yang, Yuchao Gao, Zijin Wang, Xi’an Li, Hu Zhou, Jinran Wu

Published in: International Journal of Machine Learning and Cybernetics | Issue 12/2024

Abstract

The article introduces a novel multi-scale deep learning approach for short-term load forecasting (STLF), which is crucial for effective power system management. Traditional methods like Fourier Transform (FT) suffer from limitations such as manual setting of decomposition modes and inefficiencies in forecasting. The proposed approach, called M-RAR-BiGRU, automates the decomposition process and integrates signal decomposition with deep learning techniques, avoiding duplicate modeling and enhancing feature extraction. The model uses a multi-scale deep neural network (MscaleDNN) to decompose load series into low- and high-frequency components, and a robust-autoregression-bi-directional gate recurrent unit (RAR-BiGRU) to model linear and nonlinear components. Additionally, the model employs the adaptive rescaled lncosh (ARlncosh) loss function to handle outliers, significantly improving the robustness of the forecasting model. The authors validate the proposed approach through experiments on load series from Portugal and Australia, demonstrating its superior performance compared to traditional and other state-of-the-art methods. The article highlights the importance of multi-scale feature extraction and robust loss functions in enhancing the accuracy and reliability of short-term load forecasting.

1 Introduction

Short-term load forecasting (STLF) is critical for effective power system management, aiming to predict electricity demand over periods ranging from an hour to a week [1, 2]. This prediction is essential due to the direct influence of human activities on power consumption and the operational challenges posed by the variability of renewable energy sources [3]. Traditional signal decomposition methods like the Fourier transform (FT) [4], integral to STLF, require manual setting of decomposition modes and often suffer from duplicative modeling, leading to inefficiencies and inaccuracies in forecasts. These limitations hinder the ability of energy providers to respond dynamically to changes in demand and to effectively integrate renewable energy. To address these challenges, this paper introduces a novel STLF algorithm employing multi-scale perspective decomposition. This approach automates the decomposition process, reduces redundancy, and aims to enhance the accuracy and reliability of load forecasts, thereby improving energy management practices and accommodating the rapid shifts in energy consumption patterns.

1.1 Literature review

STLF aims to generate future load values by learning the patterns of historical data. From a physical perspective, a load series is the accumulation of multiple exogenous factor series [5]. The FT can decompose a load series into periodic signals with different frequencies and amplitudes [6]. Generally, a load series can be divided into trend, period, and noise components [7]. Recently, many studies have introduced signal decomposition methods for feature extraction. Signal decomposition methods transform the complex load series into multiple relatively stable sub-sequences. Ghelardoni et al. [8] used empirical mode decomposition (EMD) to disaggregate load series into multiple components. Zhang et al. [9] used EMD to capture the trends of time series. Ding et al. [10] adopted ensemble empirical mode decomposition (EEMD) to reduce the influence of hidden noise in load data. Zhang and Hong [11] introduced complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) as data preprocessing to improve forecasting accuracy. Zhang et al. [12] utilized fast ensemble empirical mode decomposition (FEEMD) for data filtering in probabilistic load forecasting. Zhang and Hong [13] proposed a hybrid model that combined variational mode decomposition (VMD) with support vector regression (SVR). He et al. [14] employed VMD to decompose the load series into a discrete number of modes. These studies show that signal decomposition methods extract features from complex load series effectively.
However, the above decomposition techniques have two main shortcomings in application: (1) the number of decomposition modes has to be determined through repeated experiments; (2) models must repeat decomposition and modeling whenever the load series is updated. To address these issues, researchers have used neural networks (NNs) to propose novel decomposition methods. Wu et al. [15] designed Autoformer, a decomposition architecture with an auto-correlation mechanism; its inner decomposition block endows the deep forecasting model with an inherent progressive decomposition capacity. Oreshkin et al. [16] developed neural basis expansion analysis (N-BEATS) based on the idea of gradient integration.
STLF methods are mainly divided into statistical methods and machine learning (ML) methods. Statistical methods include autoregression (AR) [17] and the autoregressive integrated moving average (ARIMA) [18]. However, statistical methods rely on linear assumptions, which limits their ability to capture nonlinear features. In contrast, ML methods, such as the recurrent neural network (RNN), long short-term memory (LSTM), and the gate recurrent unit (GRU), perform competitively at nonlinear feature extraction. However, LSTM and GRU cannot fully alleviate the vanishing and exploding gradients of the RNN. The back-propagation neural network (BPNN) trains NNs with reverse gradient propagation but is unable to learn long- and short-term dependencies from time series. Yang et al. [19] proposed a method that combines BPNN, extreme gradient boosting (XGBoost), and LSTM. Sadaei et al. [20] proposed a combined method based on fuzzy time series (FTS) and a convolutional neural network (CNN) for STLF. Zhang et al. [21] developed a new hybrid model based on improved EMD, ARIMA, and a wavelet neural network (WNN). These works verify that integrated methods can efficiently improve the performance of STLF.
Outliers inevitably exist in load series. Because the distribution of outliers is unknown, it is difficult for prediction models to choose an appropriate loss function to fit the load data. Although the \(L_{2}\) loss has been widely used as the objective function, it is sensitive to outliers and may cause overfitting [22]. Actual load series generally contain complex noise composed of multiple distributions, and traditional loss functions based on a single distribution may generate unreliable predictions. Common robust loss functions, such as the Huber loss, perform better on load data that contain random noise. The Huber loss fits a mixture of the Laplace and Gaussian distributions and combines the advantages of the \(L_{1}\) and \(L_{2}\) losses [23]: (1) it is differentiable at the origin; (2) it is less sensitive to outliers than the \(L_{2}\) loss. The lncosh loss is even more robust to outliers than the Huber loss and can approach the \(L_{1}\), \(L_{2}\), and Huber losses by adjusting its hyper-parameter [24]. However, without prior knowledge of the outliers, it is difficult to determine the loss function, so setting the hyper-parameter of the lncosh loss is the main difficulty in model training. Yang et al. [25] proposed an adaptive rescaled lncosh (ARlncosh) loss function to handle time series modeling with outliers and random noise. The ARlncosh loss uses a 'working' likelihood approach to determine the hyper-parameter of the lncosh loss [26]. This paper utilizes the ARlncosh loss to handle outliers in load series, considering its competitive performance in practice.

1.2 Motivation

Considering the limitations of traditional decomposition methods in STLF, this paper proposes a multi-scale perspective prediction model based on deep learning (DL). The proposed model adopts an end-to-end architecture for training, which avoids duplicate decomposition and modeling in application. MscaleDNN uses NNs to approximate the FT process, which avoids manually setting the number of decomposition modes. Furthermore, this paper proposes an integrated model to extract linear and nonlinear features from load data. Considering the outliers in load data, the proposed model introduces the ARlncosh loss to train the NN based on the idea of robust regression. The 'working' likelihood function optimizes the hyper-parameter of the lncosh loss to approximate the distribution of the load data.

1.3 Contributions

This paper develops a multi-scale deep neural network-robust-autoregression-bi-directional gate recurrent unit (M-RAR-BiGRU) for STLF, which extracts long-term stable periodic features from a multi-scale perspective. Firstly, MscaleDNN decomposes load series into different frequencies. Secondly, the robust-autoregression-bi-directional gate recurrent unit (RAR-BiGRU) models the linear and nonlinear components of load data. During the model training, the ARlncosh loss function is used as the objective function to handle the outliers of load data. Meanwhile, the ARlncosh loss introduces the ‘working’ likelihood function to optimize the hyper-parameter for fitting the distribution of load data. The contributions of this paper are summarized as follows:
(a) To address multi-scale data integration in STLF, we introduce M-RAR-BiGRU, a deep learning approach that approximates the FT, enhancing the model's ability to analyze periodic features across various time scales.

(b) Tackling the challenge of extracting both linear and nonlinear dynamics in load data, our MscaleDNN utilizes robust-autoregression and bi-directional gate recurrent units (RAR-BiGRU), along with an attention mechanism, to accurately capture and leverage temporal correlations.

(c) Addressing the impact of outliers on forecast accuracy, we employ the ARlncosh loss, using a 'working' likelihood function to adaptively approximate the distribution of load data, increasing the model's robustness against anomalies.

1.4 Structure of this paper

The remaining part of this paper is organized as follows. Section 2 introduces the technologies referred to in this paper. Section 3 details the implementation of M-RAR-BiGRU. Sections 4 and 5 analyze the experiment results of Portugal and Australia data sets and verify the performance of the proposed algorithm. Section 6 concludes this paper.

2 Technology background

Signal decomposition methods can improve the performance and interpretability of STLF models. However, traditional decomposition methods, such as seasonal-trend decomposition (STL) [7] and VMD, require repeated decomposition and modeling, which wastes computational resources. Additionally, it is difficult to determine the optimal number of decomposition modes, which may affect feature extraction. Therefore, this paper employs MscaleDNN to decompose load series based on the idea of the FT. MscaleDNN avoids manually setting the number of decomposition modes. M-RAR-BiGRU integrates signal decomposition and DL techniques into an end-to-end framework. Compared with traditional decomposition approaches, M-RAR-BiGRU avoids duplicate decomposition and modeling. The proposed method decomposes the load series into low- and high-frequency components as parts of the prediction. Furthermore, RAR-BiGRU models the linear and nonlinear features of the load data. M-RAR-BiGRU improves the interpretability of the prediction results and the feature extraction ability of the forecasting model.

2.1 Multi-scale deep neural network

This paper borrows the idea of radial scaling in the frequency domain [27] to construct a multi-scale deep neural network (MscaleDNN), which decomposes the frequency-domain signal of a time series. For a given time series \(f(x) = \{x_{1}, x_{2}, x_{3}, \ldots , x_{t} \}\), its FT is denoted \({\hat{f}}(k)\), and the decomposition of \({\hat{f}}(k)\) is as follows:
$$\begin{aligned} \begin{aligned}&{\hat{f}}(k) \triangleq \sum \limits _{i=1}^{M} \hat{f_{i}}(k), \end{aligned} \end{aligned}$$
(1)
where M is the number of decomposition modes. The corresponding decomposition in physical space is as follows:
$$\begin{aligned} \begin{aligned}&f(x) = \sum \limits _{i=1}^{M}{f_{i}}(x), \end{aligned} \end{aligned}$$
(2)
where \(f_{i}(x) = F^{-1}[\hat{f_{i}}(k)](x) = f(x) * \check{\chi }_{A_{i}}(x)\). Here \(\chi _{A_{i}}(k)\) is a frequency selection kernel, and its inverse FT \(\check{\chi }_{A_{i}}(x)\) can be calculated with the Bessel function [28].
However, according to the frequency principle (F-Principle) [29], an NN learns content at different frequencies with different efficiency: it learns the low-frequency content of data efficiently but is inefficient at learning high-frequency content. Therefore, MscaleDNN introduces a simple down-scaling to transform the high-frequency regions of Eq. (1) into low-frequency regions. The scaled version of \(\hat{f_{i}}(k)\) is defined as:
$$\begin{aligned} \begin{aligned}&\hat{f_{i}}^{(scale)}(k) = \hat{f_{i}}(\alpha _{i} k), \alpha _{i}>1, \end{aligned} \end{aligned}$$
(3)
and, correspondingly in the physical space:
$$\begin{aligned} \begin{aligned}&f_{i}^{(scale)}(x) = f_{i}(\frac{1}{\alpha _{i}}x), \end{aligned} \end{aligned}$$
(4)
or
$$\begin{aligned} \begin{aligned}&f_{i}(x) = f_{i}^{(scale)}(\alpha _{i}x), \end{aligned} \end{aligned}$$
(5)
where \(\alpha _{i}\) is the scale factor. The low-frequency region of \(f_{i}(x)\) can be obtained with a sufficiently large \(\alpha _{i}\). MscaleDNN constructs a deep neural network (DNN) \(f_{\theta ^{n_{i}}}(x)\) to learn the decomposed components \(f_{i}^{(scale)}(x)\) at a common frequency scale:
$$\begin{aligned} \begin{aligned}&f_{i}^{(scale)}(x) \sim f_{\theta ^{n_{i}}}(x), \end{aligned} \end{aligned}$$
(6)
where \(\theta ^{n_{i}}\) denotes the parameters of the DNN. The approximation of \(f_{i}(x)\) is:
$$\begin{aligned} \begin{aligned}&f_{i}(x) \sim f_{\theta ^{n_{i}}}(\alpha _{i}x). \end{aligned} \end{aligned}$$
(7)
Therefore, the fitting process for the decomposition of f(x) can be written as:
$$\begin{aligned} \begin{aligned}&f(x) \sim \sum \limits _{i=1}^{M}f_{\theta ^{n_{i}}}(\alpha _{i}x), \end{aligned} \end{aligned}$$
(8)
where [27] recommends \(\alpha _{i} = i\) or \(\alpha _{i} = 2^{i-1}\). The proposed algorithm decomposes time series into low- and high-frequency components. The scale factors of the low- and high-frequency components are \(\{\alpha _{1}, \alpha _{2}, \ldots , \alpha _{d} \}\) and \(\{\alpha _{d+1}, \alpha _{d+2}, \ldots , \alpha _{D} \}\), respectively. The full vector \(\{\alpha _{1}, \alpha _{2}, \ldots , \alpha _{d}, \alpha _{d+1}, \alpha _{d+2}, \ldots , \alpha _{D} \}\) is generally in ascending order.
Considering that the activation function of MscaleDNN limits its representation of the input data, this paper introduces the soften Fourier mapping (SFM) activation function [30] in the low- and high-frequency parts of the DNN, as follows:
$$\begin{aligned} \begin{aligned}&\sigma (z) = s \times \left[ \begin{array}{cc} \cos (z) \\ \sin (z) \end{array}\right] , \end{aligned} \end{aligned}$$
(9)
where the relaxation parameter \(s \in (0, 1]\) controls the output range of the activation function; this paper sets \(s = 0.5\) experimentally. The first hidden layer of MscaleDNN simulates a Fourier expansion, while the remaining hidden layers approximate the Fourier coefficients. Through the SFM activation function, the training process of MscaleDNN can therefore be viewed as an approximate FT. Compared with the input data, the Fourier coefficients oscillate relatively little, so SFM effectively accelerates the training of MscaleDNN.
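To make the structure concrete, the following is a minimal PyTorch sketch of an MscaleDNN with the SFM activation. It is our illustrative reading of Eqs. (3)–(9), not the authors' code: the hidden widths, the number of branches, and the example scale factors are assumptions, while \(s = 0.5\) and the choice \(\alpha _{i} = 2^{i-1}\) follow the text.

```python
import torch
import torch.nn as nn


class SFM(nn.Module):
    """Soften Fourier mapping of Eq. (9): sigma(z) = s * [cos(z); sin(z)]."""
    def __init__(self, s: float = 0.5):
        super().__init__()
        self.s = s

    def forward(self, z):
        return self.s * torch.cat([torch.cos(z), torch.sin(z)], dim=-1)


class MscaleBranch(nn.Module):
    """One frequency branch: the input is radially scaled by alpha_i (Eq. (3)),
    the first hidden layer plays the role of a Fourier expansion through SFM,
    and the remaining layers approximate the Fourier coefficients."""
    def __init__(self, in_dim, hidden, alpha):
        super().__init__()
        self.alpha = alpha
        self.first = nn.Linear(in_dim, hidden)
        self.sfm = SFM(s=0.5)
        self.rest = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, x):
        return self.rest(self.sfm(self.first(self.alpha * x)))


class MscaleDNN(nn.Module):
    """Sum of branches split into low- and high-frequency groups (Eq. (8));
    alpha_i = 2^(i-1) is one of the choices recommended in [27]."""
    def __init__(self, in_dim=4, hidden=32, low_scales=(1, 2), high_scales=(4, 8, 16)):
        super().__init__()
        self.low = nn.ModuleList(MscaleBranch(in_dim, hidden, a) for a in low_scales)
        self.high = nn.ModuleList(MscaleBranch(in_dim, hidden, a) for a in high_scales)

    def forward(self, t_feats):
        x_low = sum(branch(t_feats) for branch in self.low)    # low-frequency component
        x_high = sum(branch(t_feats) for branch in self.high)  # high-frequency component
        return x_low, x_high


# t_feats: (batch, 4) time features such as year, month, day, and hour
x_low, x_high = MscaleDNN()(torch.randn(16, 4))
```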

2.2 Adaptive scale lncosh loss function

The traditional lncosh loss function is defined as follows:
$$\begin{aligned} \begin{aligned}&l_{1} = \frac{1}{\lambda }\log (\cosh (\lambda r)), \end{aligned} \end{aligned}$$
(10)
where \(\cosh (\lambda r) = (e^{\lambda r} + e^{-\lambda r})/2\) and the hyper-parameter \(\lambda \in (0, +\infty )\) controls the properties of the lncosh loss function. The residual \(r = y - {\hat{y}}\) is the difference between the actual load data y and the prediction \({\hat{y}}\). The lncosh loss function approximates the \(L_{1}\), \(L_{2}\), and Huber loss functions by adjusting the hyper-parameter \(\lambda\).
Remark 1
As shown in Fig. 1, \(\lambda\) controls the properties of the lncosh loss: it approximates the \(L_{1}\) loss as \(\lambda \rightarrow \infty\), the \(L_{2}\) loss as \(\lambda \rightarrow 0\), and the Huber loss for intermediate \(\lambda\). Therefore, the lncosh loss can approximate the \(L_{1}\), \(L_{2}\), and Huber losses by adjusting \(\lambda\).
Fig. 1
lncosh loss function with different \(\lambda\) (\(\lambda\) is the hyper-parameter that controls the properties of lncosh loss function)
A constant that is independent of the core parameters can generally be ignored because of its limited influence on optimization. Thus, this paper omits the constant factor \(1/\lambda\) in Eq. (10) and proposes the following scale lncosh loss function \(l_{2}\) to approximate the noise distribution:
$$\begin{aligned} \begin{aligned}&l_{2} = \log (\cosh (\lambda r)). \end{aligned} \end{aligned}$$
(11)
Then, the scale lncosh loss function constructs the following probability density function [26]:
$$\begin{aligned} \begin{aligned}&f(r;\lambda ) = \frac{\lambda }{\pi } \cdot \frac{1}{\cosh (\lambda r)}, \end{aligned} \end{aligned}$$
(12)
where the normalizing constant \(\lambda /\pi\) is determined by requiring the probability density function to integrate to one.
A likelihood function ordinarily describes the random generation of the data, whereas a 'working' likelihood function is used only for parameter estimation [26]. The scale lncosh loss function can be derived from a logarithmic likelihood function, and the corresponding 'working' likelihood function is as follows:
$$\begin{aligned} \begin{aligned}&f(r_{1},r_{2}, \ldots , r_{T};\lambda ) = \prod \limits _{i=1}^{T}f(r_{i};\lambda ) = \prod \limits _{i=1}^{T} \frac{\lambda }{\pi } \cdot \frac{1}{\cosh (\lambda r_{i})}. \end{aligned} \end{aligned}$$
(13)
where \(i \in [1, T]\) indexes the moments of the load series.
This paper proposes the scale lncosh loss function and introduces the corresponding 'working' likelihood function to estimate \(\lambda\). Let \(\zeta = 1/\lambda\); \(\zeta\) is essentially a scale parameter of the error, so the error can be written as \(\zeta \varepsilon _{i}\), where \(\varepsilon _{i}\) carries no distribution parameters. The extended objective function is as follows:
$$\begin{aligned} \begin{aligned}&L = \sum \limits _{i=1}^{T}\log (\cosh (\frac{r_{i}}{\zeta })) + T\log (\pi \zeta ), \end{aligned} \end{aligned}$$
(14)
where the residual \(r_{i} = y_{i} - \hat{y_{i}}\). Since \(r_{i}\) does not appear in the second term of Eq. (14), minimizing this 'working' likelihood with respect to the model parameters is equivalent to minimizing Eq. (11); for a given \(\lambda\), the optimization over \(r_{i}\) has the same solution as Eq. (11). In addition, Eq. (14) can be used to estimate \(\lambda\) (or \(\zeta = 1/\lambda\)). In this way, the scale lncosh loss function approximates the unknown distribution of the load data effectively.
The 'working' likelihood method provides data-driven tuning of tuning parameters, hyper-parameters, and variance parameters [26, 31]. \(\zeta\) can be selected automatically as \(\zeta ^{*}\) by setting the derivative of Eq. (14) with respect to \(\zeta\) to zero:
$$\begin{aligned} \begin{aligned}&\zeta ^{*} = T^{-1} \sum \limits _{i=1}^{T} r_{i} \tanh (\frac{r_{i}}{\zeta }), \end{aligned} \end{aligned}$$
(15)
where \(\tanh (r_{i}/\zeta ) = (e^{r_{i}/\zeta } - e^{-r_{i}/\zeta })/(e^{r_{i}/\zeta } + e^{-r_{i}/\zeta })\). The optimal \(\zeta ^{*}\) is obtained by minimizing Eq. (14) with respect to \(\zeta\), i.e., by solving Eq. (15).
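The following is a minimal NumPy sketch (our reading, not the authors' code) of the scale lncosh loss in Eq. (11) and the fixed-point update of \(\zeta\) in Eq. (15); the initial value of \(\zeta\), the number of iterations, and the toy residuals are illustrative assumptions.

```python
import numpy as np


def lncosh_loss(r, zeta):
    """Eq. (11)/(23): sum_i log(cosh(r_i / zeta)), computed in a numerically
    stable way via log(cosh(z)) = logaddexp(z, -z) - log(2)."""
    z = r / zeta
    return np.sum(np.logaddexp(z, -z) - np.log(2.0))


def update_zeta(r, zeta0=1.0, n_iter=10):
    """Eq. (15): zeta* = (1/T) * sum_i r_i * tanh(r_i / zeta), iterated as a
    fixed point; the paper reports convergence within a few iterations."""
    zeta = zeta0
    for _ in range(n_iter):
        zeta = np.mean(r * np.tanh(r / zeta))
    return zeta


# toy residuals: mostly small errors plus a few large outliers
rng = np.random.default_rng(0)
r = np.concatenate([rng.normal(0.0, 1.0, 1000), rng.normal(0.0, 10.0, 20)])
zeta = update_zeta(r)
print(zeta, lncosh_loss(r, zeta))
```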

3 Multiscale-integrated deep learning algorithm

3.1 Multi-scale deep neural network

STL assumes that a load series is composed of trend, period, and noise terms; the trend and period terms are the modelable parts of the load series. The feature extraction of MscaleDNN is an approximate FT process. In the frequency domain, the modelable parts of the load series consist of periodic signals with different frequencies and amplitudes. Thus, with reference to STL, MscaleDNN extracts the low- and high-frequency components of the load series. The specific structure of MscaleDNN is shown in Fig. 2.
To overcome the practical limitations of traditional decomposition methods, MscaleDNN employs time features (such as year, month, day, and hour) as input and outputs low- and high-frequency decomposition results. MscaleDNN approximates the FT with the low-frequency scale factors \(\{\alpha _{1}, \alpha _{2}, \ldots , \alpha _{d}\}\) and the high-frequency scale factors \(\{\alpha _{d+1}, \alpha _{d+2}, \ldots , \alpha _{D} \}\). Meanwhile, MscaleDNN constructs two different DNNs to output the decomposition results at different frequencies.
Fig. 2
The multi-scale deep neural network for low- and high-frequency extraction, where \(\alpha _{i} (i = 1, 2, 3, \ldots , d)\) and \(W^{1}_{i} (i = 1, 2, 3, \ldots , P)\) are the scale factors and weights of the low-frequency components, respectively. \(\alpha _{i} (i = d+1, d+2, d+3, \ldots , D)\) and \(W^{1}_{i} (i = 1, 2, 3, \ldots , Q)\) are the scale factors and weights of the high-frequency components, respectively

3.2 Robust-autoregression-bi-directional gate recurrent unit

RAR-BiGRU is a robust prediction model trained with the ARlncosh loss. AR and the bi-directional gate recurrent unit (BiGRU) model the linear and nonlinear features of the load series, respectively. Moreover, RAR-BiGRU utilizes the attention mechanism (ATTN) to further extract temporal features from the load data. Considering the outliers in the load series, the ARlncosh loss approximates the distribution of outliers by adjusting its hyper-parameter. This paper introduces AR into BiGRU to extract linear features while considering the long-term dependence of the load series. Meanwhile, the ARlncosh loss, as an adaptive loss function, enhances the robustness of AR-BiGRU. RAR-BiGRU effectively improves the stability and robustness of the BiGRU.
AR captures the linear features of the load data, generating the prediction \(x_{t}\) at moment t as a linear combination of the historical data \(\{x_{t-p},\ldots , x_{t-2}, x_{t-1}\}\):
$$\begin{aligned} \begin{aligned}&x_{t} = \phi _{0} + \phi _{1}x_{t-1} + \phi _{2}x_{t-2} + \cdots + \phi _{p}x_{t-p}. \end{aligned} \end{aligned}$$
(16)
Furthermore, BiGRU uses GRU units to extract nonlinear features from load data. The update process of GRU units mainly depends on the following reset gate \(r_{t}\) and update gate \(z_{t}\):
$$\begin{aligned} \begin{aligned}&r_{t} = \sigma (x_{t}W_{xr}+H_{t-1}W_{hr}+b_{r}), \end{aligned} \end{aligned}$$
(17)
and
$$\begin{aligned} \begin{aligned}&z_{t} = \sigma (x_{t}W_{xz}+H_{t-1}W_{hz}+b_{z}), \end{aligned} \end{aligned}$$
(18)
where \(\sigma\) is the sigmoid function. \(W_{xr}\), \(W_{hr}\), \(W_{xz}\), \(W_{hz}\) represent the weights of the NN. \(b_{r}\) and \(b_{z}\) represent the biases of the NN. \(H_{t-1}\) represents the hidden state of the previous moment.
BiGRU uses two hidden layers to extract past and future information, and the final output concatenates the outputs of the GRUs at each moment. The training process of BiGRU is as follows:
$$\begin{aligned}&\overrightarrow{h_{t}} = \text{ GRU }(x_{t},\overrightarrow{h_{t-1}}), \end{aligned}$$
(19)
$$\begin{aligned}&\overleftarrow{h_{t}} = \text{ GRU }(x_{t},\overleftarrow{h_{t-1}}), \end{aligned}$$
(20)
and
$$\begin{aligned}&h_{t} = f(W_{\overrightarrow{h_{t}}} \overrightarrow{h_{t}}+W_{\overleftarrow{h_{t}}} \overleftarrow{h_{t}}+b_{t}), \end{aligned}$$
(21)
where \(\overrightarrow{h_{t}}\) and \(\overleftarrow{h_{t}}\) represent the forward and backward hidden-layer features, respectively. The nonlinear prediction is generated by a linear mapping of \(h_{t}\).
BiGRU extracts features from both directions but cannot consider the correlations between different moments, whereas historical load data may be correlated at specific moments. ATTN therefore further extracts temporal features from the load data: it assigns weights according to the importance of the input, avoiding interference from information that is useless for modeling. ATTN encodes the historical load data as follows:
$$\begin{aligned}&\text{ Attention }(Q, K, V) = \text{ softmax }\left( \frac{QK^{T}}{\sqrt{d_k}}\right) V, \end{aligned}$$
(22)
where \(d_{k}\) is the dimension of the keys and \(\sqrt{d_k}\) is the scaling factor. The output of ATTN is obtained by a linear mapping. The specific process of ATTN is detailed in [32].
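A minimal PyTorch sketch of this structure is given below; it is illustrative rather than the authors' implementation. The hidden width, the single attention head, and the way the AR output is added to the BiGRU-attention output are our assumptions; only the overall composition (AR in Eq. (16), BiGRU in Eqs. (17)–(21), self-attention in Eq. (22)) follows the text.

```python
import torch
import torch.nn as nn


class ARBiGRU(nn.Module):
    def __init__(self, in_dim, hidden=64, p=24):
        super().__init__()
        self.ar = nn.Linear(p, 1)                       # Eq. (16): linear AR over the last p loads
        self.bigru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, num_heads=1, batch_first=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x_seq, load_hist):
        # x_seq: (batch, p, in_dim) inputs (load plus low/high-frequency components)
        # load_hist: (batch, p) raw historical loads for the AR term
        h, _ = self.bigru(x_seq)                        # Eqs. (19)-(21)
        h, _ = self.attn(h, h, h)                       # Eq. (22): self-attention over time
        nonlinear = self.head(h[:, -1, :])              # nonlinear prediction
        linear = self.ar(load_hist)                     # linear prediction
        return linear + nonlinear


y_hat = ARBiGRU(in_dim=3)(torch.randn(8, 24, 3), torch.randn(8, 24))
```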

3.3 Objective function

According to the ARlncosh loss function, the objective function of the proposed algorithm is as follows:
$$\begin{aligned} \begin{aligned}&F(r_{i}) = \sum \limits _{i=1}^{T}\log \left( \cosh \left( \frac{r_{i}}{\zeta }\right) \right) , \end{aligned} \end{aligned}$$
(23)
where \(r_{i} = y_{i} - \hat{y_{i}}\), and \(\zeta\) is estimated from the residuals.
To accelerate the convergence of \(\zeta\) in the ARlncosh loss function, this paper introduces the robust penalized extreme learning machine (RPELM) [33] to generate an initial prediction sequence and residual sequence \(r_{init}\). RPELM applies a robust penalty framework and M-estimation theory to the robust optimization of the extreme learning machine (ELM). Based on \(r_{init}\), the initial unbiased robust estimate \({\hat{\zeta }}_{init}\) is obtained as follows:
$$\begin{aligned} \begin{aligned}&{\hat{\zeta }}_{init} = \text{ argmin }\left\{ \sum \limits _{i=1}^{T}\log (\cosh (\frac{r_{i}}{\zeta })) + T\log (\pi \zeta )\right\} , \end{aligned} \end{aligned}$$
(24)
The optimized ARlncosh objective function can be constructed with \({\hat{\zeta }}_{init}\).
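As one possible realization (an assumption on our part, since the paper does not give code), \({\hat{\zeta }}_{init}\) can be obtained by numerically minimizing the 'working' negative log-likelihood of Eq. (24) over the residuals of the preliminary RPELM fit:

```python
import numpy as np
from scipy.optimize import minimize_scalar


def working_nll(zeta, r):
    """Eq. (24)/(14): sum_i log(cosh(r_i / zeta)) + T * log(pi * zeta)."""
    z = r / zeta
    logcosh = np.logaddexp(z, -z) - np.log(2.0)   # numerically stable log(cosh)
    return np.sum(logcosh) + r.size * np.log(np.pi * zeta)


# r_init would come from the preliminary RPELM fit; heavy-tailed toy residuals here
r_init = np.random.default_rng(1).standard_t(df=3, size=500)
res = minimize_scalar(working_nll, bounds=(1e-3, 1e3), args=(r_init,), method="bounded")
zeta_init = res.x
print(zeta_init)
```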

3.4 Aggregate prediction results

As shown in Fig. 3, this paper combines MscaleDNN and RAR-BiGRU into a novel STLF framework. MscaleDNN uses NNs to approximate an adaptive FT of the load series. The predicted value \({\hat{x}}_{t}\) at moment t is generated from the last p moments \(\{t-p, t-p+1, \ldots , t-1\}\) as follows:
$$\begin{aligned} \begin{aligned}&{\hat{x}}_{t} = \alpha \cdot {\hat{x}}_{t, low} + \beta \cdot {\hat{x}}_{t, high} + \kappa \cdot {\hat{x}}_{t, RAR-BiGRU}, \end{aligned} \end{aligned}$$
(25)
where \({\hat{x}}_{t, low}\) and \({\hat{x}}_{t, high}\) represent the low- and high-frequency components of MscaleDNN at moment t, and \({\hat{x}}_{t, RAR-BiGRU}\) represents the prediction of RAR-BiGRU at moment t. \(\alpha\), \(\beta\), and \(\kappa\) are adjustment coefficients that control the proportions of the different prediction modules.
Fig. 3
The forecasting framework of M-RAR-BiGRU. \(t_{i}\), \({\hat{x}}_{i, low}\), and \({\hat{x}}_{i, high}\) are the time feature and the low- and high-frequency components, respectively, where \(i \in [1, \hbox {N}]\) and N is the length of the historical load series. \({\hat{x}}_{t}\) is the final prediction of M-RAR-BiGRU, where \(t \in [1, \hbox {L}]\) and L is the length of the forecasting step. The structure of MscaleDNN is detailed in Fig. 2. The prediction of RAR-BiGRU is calculated by weighting the different features. \(\alpha\), \(\beta\), and \(\kappa\) are adjustment coefficients for the different prediction modules

3.5 Training process

The proposed algorithm extracts features of the load data from a multi-scale perspective, and the overall training process is end-to-end. Specifically, the input is processed in two stages. Stage 1: MscaleDNN decomposes the current load series using the year, month, day, and hour information corresponding to the p moments \(\{t-p, t-p+1, \ldots , t-1\}\). Stage 2: RAR-BiGRU forecasts the load at moment t by concatenating the low- and high-frequency components with the corresponding load data. The overall training process is as follows:
Step 1: MscaleDNN generates the low- and high-frequency components of t moment, with the historical time information.
Step 2: RAR-BiGRU concatenates the low- and high-frequency components with the load data at the corresponding moments as input data, and outputs the prediction at moment t.
Step 3: Aggregate the prediction results of MscaleDNN and RAR-BiGRU according to Eq. (25), and update the network parameters according to the ARlncosh loss function in Eq. (23).
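The sketch below chains the MscaleDNN and ARBiGRU sketches from Sections 2.1 and 3.2 into one end-to-end forward pass following Steps 1–3. It is illustrative only: making the adjustment coefficients \(\alpha\), \(\beta\), \(\kappa\) learnable scalars and using a fixed \(\zeta\) in the loss are our assumptions, not choices stated in the paper.

```python
# Assumes the MscaleDNN and ARBiGRU classes from the earlier sketches are in scope.
import torch
import torch.nn as nn


class MRARBiGRU(nn.Module):
    def __init__(self, mscale, rar_bigru):
        super().__init__()
        self.mscale, self.rar = mscale, rar_bigru
        self.coef = nn.Parameter(torch.ones(3))            # alpha, beta, kappa of Eq. (25)

    def forward(self, t_hist, t_target, load_hist):
        # Step 1: low-/high-frequency components from per-moment time features
        b, p, d = t_hist.shape
        low_h, high_h = self.mscale(t_hist.reshape(-1, d))
        low_h, high_h = low_h.reshape(b, p), high_h.reshape(b, p)
        low_t, high_t = self.mscale(t_target)
        # Step 2: RAR-BiGRU on the concatenated [load, low, high] sequence
        x_rar = self.rar(torch.stack([load_hist, low_h, high_h], dim=-1), load_hist)
        # Step 3: aggregate the three predictions as in Eq. (25)
        a, bb, k = self.coef
        return a * low_t + bb * high_t + k * x_rar


model = MRARBiGRU(MscaleDNN(), ARBiGRU(in_dim=3))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

t_hist, t_target = torch.randn(8, 24, 4), torch.randn(8, 4)   # time features
load_hist, y = torch.randn(8, 24), torch.randn(8, 1)          # past loads and target
y_hat = model(t_hist, t_target, load_hist)
r = y - y_hat
loss = torch.log(torch.cosh(r / 0.5)).sum()   # Eq. (23) with a fixed zeta for brevity
opt.zero_grad(); loss.backward(); opt.step()
```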

3.6 Computational complexity

In this section, we analyze the computational complexity of the M-RAR-BiGRU model. The FT has a complexity of \(O(N \cdot \log N)\), where \(N\) is the input length. The RAR-BiGRU includes GRU units with a complexity of \(O(T \cdot D^2)\) per unit, doubled for bi-directional processing, where \(T\) is the number of time steps and \(D\) is the hidden layer dimension. The linear regression component has a complexity of \(O(N \cdot D)\), where \(N\) is the sample size and \(D\) is the feature size. The attention mechanism has a complexity of \(O(N^2 \cdot D)\). The ARlncosh loss function used for loss calculation has a complexity of \(O(N)\).
Combining these components, the overall computational complexity of the model at each learning cycle is \(O(N \cdot \log N + T \cdot D^2 + N \cdot D + N^2 \cdot D + N)\). The lower-order terms are typically disregarded, resulting in a dominant term of \(O(N^2 \cdot D)\). Therefore, the overall complexity simplifies to \(O(N^2 \cdot D)\). To provide a clearer comparison, we summarize the time complexities of various models. Table 1 highlights the significant computational demands of the attention mechanism within these models.
Table 1
Comparison of time complexity
| Model | Time complexity |
| --- | --- |
| ELM | \(O(N \cdot D)\) |
| GRU | \(O(T \cdot D^2)\) |
| SVR | \(O(N^3)\) |
| BiGRU | \(O(T \cdot D^2)\) |
| LSTM-MSNet | \(O(T \cdot D^2)\) |
| Transformer | \(O(N^2 \cdot D)\) |
| LSTNet | \(O(T \cdot D^2)\) |
| Informer | \(O(N \log N \cdot D)\) |
| Reformer | \(O(N \log N \cdot D)\) |
| M-BiGRU | \(O(T \cdot D^2)\) |
| M-RAR-BiGRU | \(O(N^2 \cdot D)\) |

4 Example 1: Substation load series from Portugal

4.1 Experimental data

This experiment aims to verify the prediction performance of M-RAR-BiGRU on load series with noise. Load series with high sample entropy may contain plentiful information and complex noise. Thus, this paper experiments on six load series from Portuguese substations with high sample entropy and outlier levels. The load series are named after their substations: MT7, MT31, MT32, MT34, MT161, and MT259. Artur Trindade published the substation load series from 2011 to 2014 in the UCI machine learning repository. Considering the generalization performance of MscaleDNN, this paper uses the load series over a relatively long period, spanning from 0:00 on January 1, 2013, to 23:00 on December 31, 2014, with 17,520 sample points. The time granularity of each substation load series is one hour. The complexity of the load series increases the difficulty of feature extraction, and the outliers in the load data test the robustness of the ARlncosh loss. This experiment adopts \(70\%\) of the sample points for model training, \(10\%\) for model validation, and \(20\%\) for model testing.

4.2 Evaluation metric

To evaluate the prediction performance of the proposed algorithm and the contrast models, this experiment introduces the mean absolute error (MAE) and root mean square error (RMSE) as error indicators:
$$\begin{aligned} \text{ MAE } = \frac{1}{T}\sum \limits _{t=1}^{T}|x_{t}-{\hat{x}}_{t}|, \end{aligned}$$
(26)
and
$$\begin{aligned} \text{ RMSE } = \sqrt{\frac{1}{T}\sum \limits _{t=1}^{T}(x_{t}-{\hat{x}}_{t})^{2}}, \end{aligned}$$
(27)
where T is the number of sample points, and \(x_{t}\) and \({\hat{x}}_{t}\) represent the load data and the prediction at the t-th moment, respectively. MAE is a universal error measure, while RMSE is more sensitive to predictions with large errors.
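For reference, the two metrics in Eqs. (26)–(27) correspond directly to the following NumPy one-liners (the helper names are illustrative):

```python
import numpy as np


def mae(x, x_hat):
    """Eq. (26): mean absolute error."""
    return np.mean(np.abs(x - x_hat))


def rmse(x, x_hat):
    """Eq. (27): root mean square error."""
    return np.sqrt(np.mean((x - x_hat) ** 2))
```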

4.3 Experimental setting

To verify the prediction performance of the proposed model, this paper uses ELM, SVR, GRU, BiGRU, the long short-term memory multiseasonal net (LSTM-MSNet) [34], the long- and short-term time-series network (LSTNet) [35], Transformer [32], Informer [36], and Reformer [37] as contrast models. ELM and SVR are ML comparative models, while GRU, BiGRU, LSTM-MSNet, LSTNet, Transformer, Informer, and Reformer are DL comparative models. ELM adopts the tanh function as its activation function, and SVR uses the radial basis function (RBF) as its kernel. The parameters of LSTM-MSNet and LSTNet follow [34] and are further tuned through repeated experiments. For unbiased comparison, each experiment is repeated 15 times and the mean is reported. All experiments in this paper are completed in Python 3.8 on a GTX 1650 Ti graphics card with 4 GB of graphics memory.

4.4 Outlier test

This experiment analyzes the outliers of the load series. Figure 4 shows the quantile–quantile (Q–Q) plots and boxplots used for the outlier test. In Fig. 4, the Q–Q plots show that the data distribution of the residual sequence has a greater probability of producing extreme maximum and minimum values than the normal distribution, which indicates that the load series contain outliers with high probability. Furthermore, many points exceed the upper and lower limits in each boxplot. Therefore, the load series contain a certain number of outliers.
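The diagnostics in Fig. 4 can be reproduced with standard tools; the sketch below (the file name `mt31.csv` is a hypothetical placeholder) draws a normal Q–Q plot and a boxplot for one series and applies the usual 1.5-IQR rule implied by the boxplot whiskers.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

load = np.loadtxt("mt31.csv")                  # hypothetical file; any 1-D hourly load series works

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
stats.probplot(load, dist="norm", plot=ax1)    # Q-Q plot against the normal distribution
ax2.boxplot(load)                              # points beyond the whiskers are flagged as outliers
plt.show()

# 1.5-IQR rule corresponding to the boxplot whiskers
q1, q3 = np.percentile(load, [25, 75])
iqr = q3 - q1
outliers = load[(load < q1 - 1.5 * iqr) | (load > q3 + 1.5 * iqr)]
print(f"{outliers.size} outliers out of {load.size} points")
```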
Fig. 4
Outlier test on Portuguese data sets

4.5 Experimental results analysis

To verify the performance improvement of the proposed model, this experiment compares M-BiGRU with the contrast models. Table 2 shows the prediction results of the comparative experiment. Among the ML models, SVR has slightly lower MAE and RMSE than ELM: owing to its support vectors, SVR is robust in modeling, whereas it is difficult for ELM to set the number of hidden-layer nodes. Among the DL models, GRU, BiGRU, LSTM-MSNet, LSTNet, Transformer, Informer, and Reformer have similar performance. The experimental results indicate that DL models improve prediction performance by extracting temporal features adaptively, and ATTN has a powerful capability for feature extraction. Compared with the contrast models, M-BiGRU achieves better MAE and RMSE indicators on MT31, MT34, MT161, and MT259; on MT7, the error metrics of M-BiGRU are slightly higher than those of SVR. This comparative experiment verifies the superiority of M-BiGRU in STLF compared with existing classical models.
Table 2
Comparison model experiment results on substation load data sets from Portugal
| Model | MT7 | MT31 | MT32 | MT34 | MT161 | MT259 |
| --- | --- | --- | --- | --- | --- | --- |
| ELM | 3.29 / 5.93 | 24.36 / 31.92 | 9.91 / 13.09 | 6.30 / 8.74 | 375.33 / 500.47 | 24.57 / 32.89 |
| SVR | 2.46 / 3.97 | 25.29 / 33.91 | 8.54 / 11.78 | 5.02 / 6.87 | 278.30 / 410.93 | 26.56 / 37.24 |
| GRU | 3.14 / 5.88 | 24.73 / 34.09 | 8.88 / 11.86 | 5.33 / 7.02 | 274.95 / 379.90 | 24.31 / 32.36 |
| BiGRU | 3.49 / 5.35 | 24.70 / 34.18 | 8.84 / 11.82 | 5.29 / 7.08 | 290.74 / 398.66 | 24.25 / 32.31 |
| LSTM-MSNet | 2.94 / 5.42 | 23.98 / 33.26 | 8.36 / 11.58 | 5.18 / 7.32 | 336.34 / 467.94 | 24.08 / 32.38 |
| LSTNet | 2.74 / 4.21 | 22.25 / 30.03 | 8.36 / 11.57 | 5.49 / 7.52 | 273.15 / 409.25 | 27.87 / 37.71 |
| Transformer | 2.90 / 4.43 | 22.29 / 29.94 | 8.72 / 11.77 | 5.56 / 7.75 | 285.73 / 403.55 | 24.55 / 32.85 |
| Informer | 2.72 / 4.23 | 22.62 / 30.19 | 8.55 / 11.55 | 5.41 / 7.48 | 300.67 / 415.40 | 24.68 / 33.59 |
| Reformer | 2.80 / 4.60 | 24.96 / 32.96 | 8.81 / 12.00 | 5.29 / 7.33 | 300.89 / 416.11 | 24.55 / 33.39 |
| M-BiGRU | 2.63 / 4.10 | 20.41 / 27.50 | 9.66 / 13.11 | 4.41 / 5.95 | 258.68 / 364.78 | 21.58 / 29.25 |
| M-RAR-BiGRU | 2.49 / 3.91 | 20.14 / 27.23 | 9.42 / 12.56 | 4.26 / 5.61 | 256.32 / 363.42 | 21.24 / 28.70 |

Each cell gives MAE / RMSE; the lowest MAE and RMSE for each data set indicate the best-performing model
This section designs a comparative experiment with the \(L_{2}\) and adaptive scaled Huber (ARHuber) loss functions to study the robustness of the ARlncosh loss. Table 3 shows the experimental results of M-BiGRU with the different loss functions. Because the MAE and RMSE values are close, this section further conducts a statistical test on the prediction results in Table 4, which uses M-BiGRU to compare the ARlncosh loss with the \(L_{2}\) and ARHuber losses: 1 indicates that the model trained with the ARlncosh loss obtains a better prediction performance than the comparison model, while 0 indicates that the comparison model performs better. On the MT7, MT32, and MT161 data sets, the MAE and RMSE of the model trained with the ARlncosh loss are lower than those of the model trained with the \(L_{2}\) loss; thus, the ARlncosh loss is more robust to outliers than the \(L_{2}\) loss. Additionally, this section compares the ARlncosh and ARHuber losses through the Wilcoxon signed-rank test, as shown in Table 4. On MT161 and MT259, the ARlncosh loss achieves lower MAE and RMSE than the ARHuber loss, which shows that the ARlncosh loss has better modeling ability than the ARHuber loss.

To further analyze the competitive performance of the ARlncosh loss on the MT161 and MT259 data sets, this experiment draws the histogram of the residuals fitted by ARIMA. Figure 5 shows the distribution of the residuals fitted by ARIMA together with the normal, ARHuber, and ARlncosh distributions, and Table 5 uses the Wasserstein distance to quantify the distance between the distributions. In Fig. 5, the ARlncosh distribution is relatively concentrated and similar to that of the residuals, which is consistent with the results in Table 4. The experimental results on MT161 and MT259 indicate that the ARlncosh distribution is closer to the real noise distribution; thus, the ARlncosh loss improves the robustness of the proposed model. Furthermore, Fig. 6 shows the convergence curve of the hyper-parameter \(\zeta\) of the ARlncosh loss: after 3 iterations, \(\zeta\) tends to converge, which verifies that the 'working' likelihood function accelerates the convergence of \(\zeta\) effectively.
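Both comparison tools used here are available in SciPy; the following sketch (with synthetic placeholder data, since the per-run errors and fitted samples are not reproduced in this article) shows how the Wilcoxon signed-rank test on repeated-run errors and the Wasserstein distance between a fitted noise distribution and the ARIMA residuals can be computed.

```python
import numpy as np
from scipy.stats import wilcoxon, wasserstein_distance

rng = np.random.default_rng(0)

# paired per-run errors of two models (placeholders for the 15 repeated runs)
errors_arlncosh = rng.normal(20.4, 0.3, 15)
errors_l2 = rng.normal(20.7, 0.3, 15)
stat, p_value = wilcoxon(errors_arlncosh, errors_l2)   # significant difference if p < 0.05

# distance between the residual distribution and a fitted noise distribution
residuals = rng.standard_t(df=3, size=2000)            # stand-in for ARIMA residuals
fitted = rng.normal(0.0, residuals.std(), 2000)        # stand-in for a fitted normal sample
d = wasserstein_distance(residuals, fitted)            # smaller = closer to the real noise
print(p_value, d)
```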
Table 3
Comparative experimental results of M-BiGRU using \(L_{2}\), ARHuber, and ARlncosh loss function on Portugal data sets
| Loss function | MT7 | MT31 | MT32 | MT34 | MT161 | MT259 |
| --- | --- | --- | --- | --- | --- | --- |
| \(L_{2}\) | 2.75 / 4.27 | 20.72 / 27.77 | 10.10 / 13.48 | 4.72 / 6.27 | 268.08 / 373.82 | 21.88 / 29.39 |
| ARHuber | 2.60 / 4.06 | 20.43 / 27.40 | 9.78 / 13.21 | 4.40 / 5.95 | 262.25 / 369.62 | 21.81 / 29.37 |
| ARlncosh | 2.63 / 4.10 | 20.41 / 27.50 | 9.66 / 13.11 | 4.41 / 5.95 | 258.68 / 364.78 | 21.58 / 29.25 |

Each cell gives MAE / RMSE; the lowest MAE and RMSE for each data set indicate the best-performing loss function
Table 4
Statistical test results of M-BiGRU using \(L_{2}\), ARHuber, and ARlncosh loss function on Portugal data sets
| Comparison loss | Error indicator | MT7 | MT31 | MT32 | MT34 | MT161 | MT259 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| \(L_{2}\) | MAE | 1 | 1 | 1 | 1 | 1 | 1 |
| \(L_{2}\) | RMSE | 1 | 1 | 1 | 0 | 1 | 1 |
| ARHuber | MAE | 0 | 0 | 0 | 0 | 1 | 1 |
| ARHuber | RMSE | 0 | 0 | 0 | 0 | 1 | 1 |

A value of 1 indicates that M-BiGRU trained with the ARlncosh loss performs better than the model trained with the comparison loss on that data set and indicator; 0 indicates the opposite
Table 5
Wasserstein distance between normal distribution, ARHuber distribution, ARlncosh distribution, and residual distribution on Portugal data sets
| Distribution pair | MT7 | MT31 | MT32 | MT34 | MT161 | MT259 |
| --- | --- | --- | --- | --- | --- | --- |
| Normal vs. residual | 87 | 50 | 52 | 85 | 91 | 78 |
| ARHuber vs. residual | 58 | 38 | 38 | 48 | 65 | 38 |
| ARlncosh vs. residual | 63 | 38 | 38 | 32 | 51 | 23 |

The lowest Wasserstein distance for each data set indicates the distribution that best matches the residual distribution
Fig. 5
The residual distribution, normal distribution, ARHuber distribution, and ARlncosh distribution on MT161 and MT259 data sets
Fig. 6
Convergence of \(\zeta\) on Portuguese load data sets
To verify the effectiveness of MscaleDNN in the proposed model, this paper compares M-BiGRU with BiGRU. As shown in Table 2, the MAE and RMSE indexes of M-BiGRU are lower than those of BiGRU on all load series except MT32. Table 6 shows the statistical test between BiGRU and M-BiGRU: 1 means that M-BiGRU has better forecasting performance than BiGRU in terms of the error indicator, and 0 means that BiGRU has higher forecasting accuracy than M-BiGRU. As shown in Table 6, there is a significant difference between M-BiGRU and BiGRU for each power load series. Combined with the results of Table 2, it can be concluded that the model equipped with MscaleDNN achieves strong prediction results on the six load series.
Table 6
Statistical test results for the effectiveness of MscaleDNN on Portugal data sets
| Comparison | Error indicator | MT7 | MT31 | MT32 | MT34 | MT161 | MT259 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| M-BiGRU vs. BiGRU | MAE | 1 | 1 | 1 | 1 | 1 | 1 |
| M-BiGRU vs. BiGRU | RMSE | 1 | 1 | 1 | 0 | 1 | 1 |

A value of 1 indicates that M-BiGRU has better forecasting performance than BiGRU on that data set and indicator; 0 indicates the opposite
This experiment conducts spectrum analysis on the prediction results of MscaleDNN to further analyze its contribution to the proposed model. Firstly, the FT transforms the load series into a frequency-domain signal; then, the periodic features are calculated from the frequencies with significant amplitudes. Figure 7a–c show the overall prediction result and the low- and high-frequency prediction results for MT31, and the corresponding spectrum analysis is visualized in Fig. 7d–f. As shown in Fig. 7d, the frequencies with maximum amplitude are 730, 1460, 2190, 2920, and 4380; thus, the periods of MT31 are 1 day, 12 h, 8 h, 6 h, and 4 h. Similarly, the significant periods of the low-frequency component are 1 day and 12 h, and the spectrum analysis shows that the significant periods of the high-frequency component are 1 day, 12 h, and 8 h. The low- and high-frequency components therefore reflect multiple main periods of the load series. The prediction results explain the physical sense of the low- and high-frequency components and verify the effectiveness of MscaleDNN for feature extraction in the frequency domain. The proposed framework improves the accuracy and interpretability of the prediction results.
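The period extraction used here amounts to locating the dominant FFT bins of the hourly series; a minimal sketch follows (the file name is a hypothetical placeholder). With two years of hourly data (17,520 points), a frequency index of 730 corresponds to a period of 17520/730 = 24 h, i.e., one day.

```python
import numpy as np

x = np.loadtxt("mt31_series.csv")        # hypothetical file; any hourly series works
x = x - x.mean()                         # remove the mean before the FFT

amp = np.abs(np.fft.rfft(x))             # amplitude spectrum
amp[0] = 0.0                             # ignore the zero-frequency bin
top = np.argsort(amp)[::-1][:5]          # indices of the 5 largest amplitudes
periods_hours = len(x) / top             # convert frequency index to period in hours
for k, p in sorted(zip(top, periods_hours)):
    print(f"frequency index {k}: period of about {p:.1f} h")
```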
Fig. 7
Spectrum analysis on MT31 data set
Finally, this experiment compares M-BiGRU with M-RAR-BiGRU. As shown in Table 2, the MAE and RMSE of M-RAR-BiGRU on the six load series are consistently better than those of M-BiGRU. The experiment results demonstrate that BiGRU can extract the nonlinear features of the load series effectively, while the linear features extracted by AR help the prediction model fit the load series more accurately. Moreover, the ARlncosh loss can adaptively fit outliers with different distributions. Therefore, the combination of MscaleDNN and RAR-BiGRU extracts multi-scale features from the load series and is a strong model for STLF.

5 Example 2: Power load series from Australia

5.1 Experimental data

Example 2, as an additional experiment, uses the load series from five major states of Australia to evaluate the performance of M-RAR-BiGRU. These data sets are named after their state abbreviations: New South Wales (NSW), Queensland (QLD), South Australia (SA), Tasmania (TAS), and Victoria (VIC). The load series span from 0:00 on January 1, 2019, to 23:00 on December 31, 2020, with a time granularity of one hour. Example 2 divides each data set with the same ratio as Example 1: 70\(\%\) of the sample points are used for model training, 10\(\%\) for model validation, and 20\(\%\) for model testing.

5.2 Evaluation metric

Example 2 uses the same error index as Example 1 with Eqs. (26) and (27).

5.3 Experimental setting

This experiment uses nine comparative models (ELM, SVR, GRU, BiGRU, LSTM-MSNet, LSTNet, Transformer, Informer, and Reformer) to verify the prediction performance of M-BiGRU. ELM and SVR are ML comparative models; GRU, BiGRU, LSTM-MSNet, LSTNet, Transformer, Informer, and Reformer are DL comparative models. ELM sets the activation function as the tanh function, and SVR sets the kernel function as the RBF. The parameter settings of LSTM-MSNet and LSTNet follow [34] and are adjusted empirically. For unbiased contrast experiments, each group of experiments is repeated 15 times and the mean is reported. All experiments are completed in Python 3.8 on a GTX 1650 Ti graphics card with 4 GB of graphics memory.

5.4 Outlier test

This section analyzes the outliers of the five load series from Australia. As in Example 1, Fig. 8 shows the Q-Q plots and boxplots for the outlier test. The Q-Q plots show that the data distribution of the load series has a higher probability of producing extreme maximum and minimum values than the normal distribution. Furthermore, the boxplots show outliers exceeding the upper and lower limits of the load series. In conclusion, there are a certain number of outliers in these load series.
Fig. 8
Outlier test on Australian data sets

5.5 Experimental results analysis

Example 2 conducts the comparative experiment for M-BiGRU. Table 7 shows the prediction results. SVR, as an ML model, performs slightly better than ELM. The DL models (GRU, BiGRU, LSTM-MSNet, LSTNet, Transformer, Informer, and Reformer) achieve comparably good performance owing to their nonlinear feature extraction. On the TAS data set, Reformer and M-BiGRU perform similarly; on the other load series, M-BiGRU attains the best MAE and RMSE values. This experimental result demonstrates the superiority of M-BiGRU in STLF.
Table 7
Comparison model experiment results on load data sets from Australia
| Model | NSW | QLD | SA | TAS | VIC |
| --- | --- | --- | --- | --- | --- |
| ELM | 280 / 371 | 219 / 304 | 119 / 156 | 46 / 62 | 257 / 334 |
| SVR | 230 / 310 | 169 / 230 | 101 / 134 | 42 / 57 | 218 / 287 |
| GRU | 231 / 315 | 174 / 237 | 98 / 133 | 42 / 57 | 191 / 252 |
| BiGRU | 223 / 299 | 166 / 226 | 97 / 131 | 40 / 53 | 192 / 254 |
| LSTM-MSNet | 206 / 274 | 164 / 224 | 96 / 127 | 41 / 55 | 175 / 234 |
| LSTNet | 212 / 283 | 191 / 258 | 86 / 118 | 42 / 57 | 196 / 259 |
| Transformer | 183 / 238 | 158 / 190 | 95 / 128 | 43 / 55 | 165 / 192 |
| Informer | 198 / 262 | 154 / 196 | 93 / 115 | 44 / 57 | 193 / 243 |
| Reformer | 178 / 214 | 165 / 212 | 90 / 121 | 40 / 53 | 183 / 240 |
| M-BiGRU | 170 / 227 | 134 / 188 | 84 / 116 | 40 / 53 | 153 / 206 |
| M-RAR-BiGRU | 168 / 224 | 121 / 173 | 82 / 115 | 37 / 49 | 152 / 205 |

Each cell gives MAE / RMSE; the lowest MAE and RMSE for each data set indicate the best-performing model
This experiment compares M-BiGRU trained with the \(L_{2}\), ARHuber, and ARlncosh losses in Table 8, to verify the effectiveness of the ARlncosh loss. Because the results are similar, statistical tests are conducted on the experimental results of M-BiGRU with the \(L_{2}\), ARHuber, and ARlncosh losses, as shown in Table 9. The ARlncosh loss differs significantly from the \(L_{2}\) loss on the SA and VIC data sets, and M-BiGRU with the ARlncosh loss obtains lower MAE and RMSE than the other loss functions on SA and VIC in Table 8. Compared with the ARHuber loss, the model trained with the ARlncosh loss achieves a lower MAE on the VIC and SA data sets. Thus, the ARlncosh loss performs better than the \(L_{2}\) and ARHuber losses on these two data sets. Following the analysis in Example 1, this section further examines the performance of the ARlncosh loss under random noise. Figure 9 shows the data distributions corresponding to the different loss functions on the SA and VIC data sets, and Table 10 calculates the Wasserstein distance between the distributions. As shown in Fig. 9, the ARlncosh loss presents a concentrated data distribution similar to that of the residuals; in this experiment, M-BiGRU using the ARlncosh loss achieves a better fit than the other contrast models. These results are consistent with Table 10, which shows that the ARlncosh loss optimized by the hyper-parameter \(\zeta\) generates a data distribution closer to the real noise. Moreover, Fig. 10 shows the convergence of \(\zeta\) in the ARlncosh loss: on all load series, \(\zeta\) tends to converge after 3 iterations, similar to the conclusion in Example 1. This section verifies that the ARlncosh loss performs better than the contrast loss functions.
Table 8
Comparative experimental results of M-BiGRU using \(L_{2}\), ARHuber, and ARlncosh loss function on Australian data sets
| Loss function | NSW | QLD | SA | TAS | VIC |
| --- | --- | --- | --- | --- | --- |
| \(L_{2}\) | 170 / 227 | 128 / 181 | 86 / 119 | 38 / 51 | 158 / 213 |
| ARHuber | 170 / 227 | 126 / 179 | 85 / 119 | 37 / 49 | 157 / 210 |
| ARlncosh | 170 / 227 | 125 / 178 | 84 / 116 | 37 / 49 | 153 / 206 |

Each cell gives MAE / RMSE; the lowest MAE and RMSE for each data set indicate the best-performing loss function
Table 9
Statistical test results of M-BiGRU using \(L_{2}\), ARHuber, and ARlncosh loss function on Australian data sets
| Comparison loss | Error indicator | NSW | QLD | SA | TAS | VIC |
| --- | --- | --- | --- | --- | --- | --- |
| \(L_{2}\) | MAE | 0 | 1 | 1 | 1 | 1 |
| \(L_{2}\) | RMSE | 0 | 1 | 1 | 1 | 1 |
| ARHuber | MAE | 0 | 0 | 0 | 0 | 1 |
| ARHuber | RMSE | 0 | 0 | 1 | 0 | 1 |

A value of 1 indicates that M-BiGRU trained with the ARlncosh loss performs better than the model trained with the comparison loss on that data set and indicator; 0 indicates the opposite
Table 10
Wasserstein distance between normal distribution, ARHuber distribution, ARlncosh distribution, and residual distribution on Australian data sets
| Distribution pair | NSW | QLD | SA | TAS | VIC |
| --- | --- | --- | --- | --- | --- |
| Normal vs. residual | 18 | 36 | 50 | 40 | 39 |
| ARHuber vs. residual | 18 | 8 | 16 | 8 | 14 |
| ARlncosh vs. residual | 18 | 8 | 9 | 7 | 12 |

The lowest Wasserstein distance for each data set indicates the distribution that best matches the residual distribution
Fig. 9
The residual distribution, normal distribution, ARHuber distribution, and ARlncosh distribution on SA and VIC data sets
Fig. 10
Convergence of \(\zeta\) on Australian load data sets
This section designs an experiment similar to Example 1 to verify the performance of MscaleDNN. Table 7 shows that M-BiGRU achieves better MAE and RMSE than BiGRU on the NSW, QLD, SA, and VIC data sets, while on the TAS data set M-BiGRU and BiGRU have similar performance. This experiment further carries out the statistical test on the results of BiGRU and M-BiGRU, as shown in Table 11: 1 means that MscaleDNN improves the forecasting performance of BiGRU, and 0 means that MscaleDNN decreases the prediction accuracy of BiGRU. On the NSW, QLD, SA, and VIC data sets, there is a significant difference between M-BiGRU and BiGRU. Therefore, MscaleDNN effectively improves the performance of the proposed model on the Australian data sets.
Table 11
Statistical test results for the effectiveness of MscaleDNN on Australian data sets
| Comparison | Error indicator | NSW | QLD | SA | TAS | VIC |
| --- | --- | --- | --- | --- | --- | --- |
| M-BiGRU vs. BiGRU | MAE | 1 | 1 | 1 | 0 | 1 |
| M-BiGRU vs. BiGRU | RMSE | 1 | 1 | 1 | 0 | 1 |

A value of 1 indicates that M-BiGRU has better forecasting performance than BiGRU on that data set and indicator; 0 indicates the opposite
This experiment conducts spectrum analysis on the NSW data set to further analyze the improvement of MscaleDNN for feature extraction. Figure 11 shows that the frequencies with maximum amplitude are 4, 104, 209, 731, 1462, and 2193; thus, the periods of NSW are 6 months, 7 days, 3.5 days, 1 day, 12 h, and 8 h. Similarly, the significant periods of the low-frequency component are 1 day and 12 h, and the significant periods of the high-frequency component are 1 day, 12 h, 8 h, 6 h, and 4 h. The spectrum analysis indicates that the low- and high-frequency components capture multiple main periods of the load series, which explains the interpretability of the low- and high-frequency components in a physical sense.
Fig. 11
Spectrum analysis on NSW data set
This experiment compares M-BiGRU with M-RAR-BiGRU, similar to Example 1. In Table 7, M-RAR-BiGRU has better forecasting performance than M-BiGRU: its MAE is slightly lower than that of M-BiGRU on all load series. The experiment results show that ATTN further improves the feature extraction of BiGRU, and M-RAR-BiGRU improves the prediction accuracy through multi-scale modeling. This result further verifies that the combination of RAR-BiGRU and MscaleDNN has better prediction performance.

6 Conclusion

This paper establishes an integrated DL framework based on a multi-scale perspective (M-RAR-BiGRU) for STLF. M-RAR-BiGRU extracts features of the load data from a multi-scale perspective. This study borrows the idea of the FT to transform the load series into multiple components with different frequencies through NNs. MscaleDNN uses NNs to approximate an adaptive FT of the load series, avoiding the limitations of traditional signal decomposition methods in practical applications. Additionally, M-RAR-BiGRU introduces the ARlncosh loss function for outlier handling, and the 'working' likelihood function optimizes the ARlncosh loss to approximate the distribution of the load data. This paper conducts contrast experiments on Portuguese and Australian data sets to analyze the performance of M-RAR-BiGRU in STLF. The experiment results verify that M-RAR-BiGRU decouples stable periodic features from the load series and generates accurate forecasting results. For load series with outliers, the ARlncosh loss function improves the robustness of the proposed model.
Power load data, as time series, have complex features. The proposed model extracts features of the load series at multiple scales and combines them through a weighting method. However, the weighting coefficients are set manually; therefore, future work will study optimization algorithms that adaptively determine these coefficients based on the feature information. Furthermore, outliers inevitably exist in load data, so future work will also study loss functions for outlier handling.

Acknowledgements

This work is supported in part by the National Natural Science Foundation of China under Grant 61873130 and Grant 61833011, in part by the Natural Science Foundation of Jiangsu Province under Grant BK20191377, in part by the 1311 Talent Project of Nanjing University of Posts and Telecommunications, and “Chunhui Program” Collaborative Scientific Research Project (202202004).

Declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Literature
1. Singh P, Dwivedi P, Kant V (2019) A hybrid method based on neural network and improved environmental adaptation method using controlled Gaussian mutation with real parameter for short-term load forecasting. Energy 174:460–477
2. Eren Y, Küçükdemiral İ (2024) A comprehensive review on deep learning approaches for short-term load forecasting. Renew Sustain Energy Rev 189:114031
3. Wang Y, Chen J, Chen X, Zeng X, Kong Y, Sun S et al (2021) Short-term load forecasting for industrial customers based on TCN-LightGBM. IEEE Trans Power Syst 36(3):1984–1997
4. Lv L, Wu Z, Zhang J, Zhang L, Tan Z, Tian Z (2021) A VMD and LSTM based hybrid model of load forecasting for power grid security. IEEE Trans Ind Inform 18(9):6474–6482
5. Mohan N, Soman KP, Sachin KS (2018) A data-driven strategy for short-term electric load forecasting using dynamic mode decomposition model. Appl Energy 232:229–244
6. Yang Y, Wang Z, Zhao S, Wu J (2023) An integrated federated learning algorithm for short-term load forecasting. Electr Power Syst Res 214:108830
7. Cleveland RB, Cleveland WS (1990) STL: a seasonal-trend decomposition procedure based on Loess. J Off Stat 6(1):3–73
8. Ghelardoni L, Ghio A, Anguita D (2013) Energy load forecasting using empirical mode decomposition and support vector regression. IEEE Trans Smart Grid 4(1):549–556
9. Zhang Z, Ding S, Sun Y (2020) A support vector regression model hybridized with chaotic krill herd algorithm and empirical mode decomposition for regression task. Neurocomputing 410:185–201
10. Ding S, Zhang Z, Guo L, Sun Y (2022) An optimized twin support vector regression algorithm enhanced by ensemble empirical mode decomposition and gated recurrent unit. Inf Sci 598:101–125
11. Zhang Z, Hong WC (2019) Electric load forecasting by complete ensemble empirical mode decomposition adaptive noise and support vector regression with quantum-based dragonfly algorithm. Nonlinear Dyn 98(2):1107–1136
12. Zhang Z, Dong Y, Hong WC (2023) Long short-term memory-based twin support vector regression for probabilistic load forecasting. IEEE Trans Neural Netw Learn Syst 1–15
13. Zhang Z, Hong WC (2021) Application of variational mode decomposition and chaotic grey wolf optimizer with support vector regression for forecasting electric loads. Knowl-Based Syst 228:107297
14. He F, Zhou J, Feng Z, Liu G, Yang Y (2019) A hybrid short-term load forecasting model based on variational mode decomposition and long short-term memory networks considering relevant factors with Bayesian optimization algorithm. Appl Energy 237:103–116
15. Wu H, Xu J, Wang J, Long M (2021) Autoformer: decomposition transformers with auto-correlation for long-term series forecasting. Adv Neural Inf Process Syst 34:22419–22430
16. Oreshkin BN, Carpov D, Chapados N, Bengio Y (2020) N-BEATS: neural basis expansion analysis for interpretable time series forecasting. In: International conference on learning representations
17. Bond-Taylor S, Leach A, Long Y, Willcocks CG (2022) Deep generative modelling: a comparative review of VAEs, GANs, normalizing flows, energy-based and autoregressive models. IEEE Trans Pattern Anal Mach Intell 44(11):7327–7347
18. Yunus K, Thiringer T, Chen P (2015) ARIMA-based frequency-decomposed modeling of wind speed time series. IEEE Trans Power Syst 31(4):2546–2556
19. Yang W, Shi J, Li S, Song Z, Zhang Z, Chen Z (2022) A combined deep learning load forecasting model of single household resident user considering multi-time scale electricity consumption behavior. Appl Energy 307:118197
20. Sadaei HJ, e Silva PCdL, Guimaraes FG, Lee MH (2019) Short-term load forecasting by using a combined method of convolutional neural networks and fuzzy time series. Energy 175:365–377
21. Zhang J, Wei YM, Li D, Tan Z, Zhou J (2018) Short term electricity load forecasting using a hybrid model. Energy 158:774–781
22. Li Z, Li Y, Liu Y, Wang P, Lu R, Gooi HB (2021) Deep learning based densely connected network for load forecasting. IEEE Trans Power Syst 36(4):2829–2840
23. Esmaeili A, Marvasti F (2019) A novel approach to quantized matrix completion using Huber loss measure. IEEE Signal Process Lett 26(2):337–341
24. Karal O (2017) Maximum likelihood optimal and robust support vector regression with lncosh loss function. Neural Netw 94:1–12
25. Yang Y, Zhou H, Wu J, Ding Z, Tian YC, Yue D et al (2023) Robust adaptive rescaled lncosh neural network regression toward time-series forecasting. IEEE Trans Syst Man Cybern Syst 53(9):5658–5669
26. Fu L, Wang YG, Cai F (2020) A working likelihood approach for robust regression. Stat Methods Med Res 29(12):3641–3652
27. Liu Z, Cai W, Xu ZQJ (2020) Multi-scale deep neural network (MscaleDNN) for solving Poisson–Boltzmann equation in complex domains. Commun Comput Phys 28(5):1970–2001
28. Cai W, Li X, Liu L (2020) A phase shift deep neural network for high frequency approximation and wave problems. SIAM J Sci Comput 42(5):A3285–A3312
29. Xu ZQJ, Zhang Y, Luo T, Xiao Y, Ma Z (2020) Frequency principle: Fourier analysis sheds light on deep neural networks. Commun Comput Phys 28(5):1746–1767
30. Li XA, Xu ZQJ, Zhang L (2023) Subspace decomposition based DNN algorithm for elliptic type multi-scale PDEs. J Comput Phys 488:112242
31. Wang YG, Zhao Y (2007) A modified pseudolikelihood approach for analysis of longitudinal data. Biometrics 63(3):681–689
32. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN et al (2017) Attention is all you need. Adv Neural Inf Process Syst 30:6000–6010
33. Yang Y, Zhou H, Gao Y, Wu J, Wang YG, Fu L (2022) Robust penalized extreme learning machine regression with applications in wind speed forecasting. Neural Comput Appl 34(1):391–407
34. Bandara K, Bergmeir C, Hewamalage H (2021) LSTM-MSNet: leveraging forecasts on sets of related time series with multiple seasonal patterns. IEEE Trans Neural Netw Learn Syst 32(4):1586–1599
35. Lai G, Chang WC, Yang Y, Liu H (2017) Modeling long- and short-term temporal patterns with deep neural networks. In: The 41st international ACM SIGIR conference on research & development in information retrieval
36. Zhou H, Zhang S, Peng J, Zhang S, Li J, Xiong H et al (2020) Informer: beyond efficient transformer for long sequence time-series forecasting. In: AAAI conference on artificial intelligence
37. Kitaev N, Kaiser Ł, Levskaya A (2019) Reformer: the efficient transformer. In: International conference on learning representations