Abstract
Accurate short-term load forecasting (STLF) is crucial for the power system. Traditional methods generally use signal decomposition techniques for feature extraction. However, these methods are limited in extrapolation performance, and the number of decomposition modes needs to be preset. To this end, this paper develops a novel STLF algorithm based on multi-scale perspective decomposition. The proposed algorithm adopts the multi-scale deep neural network (MscaleDNN) to decompose load series into low- and high-frequency components. Considering outliers of load series, this paper introduces the adaptive rescaled lncosh (ARlncosh) loss to fit the distribution of load data and improve the robustness. Furthermore, the attention mechanism (ATTN) extracts the correlations between different moments. On two power load data sets from Portugal and Australia, the proposed model generates competitive forecasting results.
1 Introduction
Short-term load forecasting (STLF) is critical for effective power system management, aiming to predict electricity demand over periods ranging from an hour to a week [1, 2]. This prediction is essential due to the direct influence of human activities on power consumption and the operational challenges posed by the variability of renewable energy sources [3]. Traditional signal decomposition methods like the Fourier transform (FT) [4], integral to STLF, require manual setting of decomposition modes and often suffer from duplicative modeling, leading to inefficiencies and inaccuracies in forecasts. These limitations hinder the ability of energy providers to respond dynamically to changes in demand and to effectively integrate renewable energy. To address these challenges, this paper introduces a novel STLF algorithm employing multi-scale perspective decomposition. This approach automates the decomposition process, reduces redundancy, and aims to enhance the accuracy and reliability of load forecasts, thereby improving energy management practices and accommodating the rapid shifts in energy consumption patterns.
1.1 Literature review
STLF aims to generate future load values by learning the patterns of historical data. From the physical perspective, the load series is the accumulation of multiple exogenous factor series [5]. FT can decompose the load series into periodic signals with different frequencies and amplitudes [6]. Generally, the load series can be divided into trend, periodic, and noise components [7]. Recently, many studies have introduced signal decomposition methods for feature extraction. Signal decomposition methods transform the complex load series into multiple relatively stable sub-sequences. Ghelardoni et al. [8] used empirical mode decomposition (EMD) to disaggregate load series into multiple components. Zhang et al. [9] used EMD to capture the trends of time series. Ding et al. [10] adopted ensemble empirical mode decomposition (EEMD) to reduce the influence of hidden noise in load data. Zhang and Hong [11] introduced complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) as data preprocessing to improve forecasting accuracy. Zhang et al. [12] utilized fast ensemble empirical mode decomposition (FEEMD) for data filtering in probabilistic load forecasting. Zhang and Hong [13] proposed a hybrid model that combined variational mode decomposition (VMD) with support vector regression (SVR). He et al. [14] employed VMD to decompose the load series into a discrete number of modes. The aforementioned studies proved that signal decomposition methods extract features from complex load series effectively.
However, the above decomposition techniques have two main shortcomings in application: (1) the number of decomposition modes requires multiple experiments to be determined; (2) decomposition and modeling must be repeated whenever the load series is updated. To address these issues, researchers have used neural networks (NNs) to propose novel decomposition methods. Wu et al. [15] designed Autoformer, a novel decomposition architecture with an auto-correlation mechanism. Autoformer uses an inner decomposition block to empower the deep forecasting model with an immanent progressive decomposition capacity. Oreshkin et al. [16] developed neural basis expansion analysis (N-BEATS), which learns a decomposition through stacked basis-expansion blocks with residual connections.
STLF methods are mainly divided into statistical methods and machine learning (ML) methods. Statistical methods include autoregression (AR) [17] and the autoregressive integrated moving average (ARIMA) [18]. However, statistical methods are based on linear assumptions, which limits their ability to capture nonlinear features. In contrast, ML methods, such as the recurrent neural network (RNN), long short-term memory (LSTM), and the gate recurrent unit (GRU), perform competitively in nonlinear feature extraction. However, LSTM and GRU cannot fully alleviate the vanishing and exploding gradients of the RNN. The back propagation neural network (BPNN) trains NNs with reverse gradient propagation, but it is unable to learn long- and short-term dependencies from time series. Yang et al. [19] proposed a method combining BPNN, extreme gradient boosting (XGBoost), and LSTM. Sadaei et al. [20] proposed a combined method based on fuzzy time series (FTS) and a convolutional neural network (CNN) for STLF. Zhang et al. [21] developed a new hybrid model based on improved EMD, ARIMA, and a wavelet neural network (WNN). These studies verify that integrated methods can improve the performance of STLF efficiently.
Outliers inevitably exist in load series. Because the distribution of outliers is unknown, it is difficult for prediction models to determine an appropriate loss function to fit the load data. Although the \(L_{2}\) loss has been widely used as the objective function, it is sensitive to outliers and may cause overfitting [22]. Actual load series generally contain complex noise, which is composed of multiple distributions. Traditional loss functions based on a single distribution may generate unreliable predictions. Common robust loss functions, such as the Huber loss, perform better on load data that contain random noise. The Huber loss fits a mixture of the Laplace and Gaussian distributions, which combines the advantages of the \(L_{1}\) and \(L_{2}\) losses [23]: (1) the Huber loss is differentiable at the origin; (2) the Huber loss is less sensitive to outliers than the \(L_{2}\) loss. The lncosh loss, which can approach the \(L_{1}\), \(L_{2}\), and Huber losses by adjusting its hyper-parameter, is even more robust to outliers [24]. However, due to the lack of prior knowledge of outliers, it is difficult to determine the loss function, and the hyper-parameter setting of the lncosh loss is the main difficulty in model training. Yang et al. [25] proposed an adaptive rescaled lncosh (ARlncosh) loss function to handle time series modeling with outliers and random noise. The ARlncosh loss uses a ‘working’ likelihood approach to determine the hyper-parameter of the lncosh loss [26]. This paper utilizes the ARlncosh loss to handle outliers of load series, considering its competitive performance in practice.
1.2 Motivation
Considering the limitations of traditional decomposition methods in STLF, this paper proposes a multi-scale perspective prediction model based on deep learning (DL). The proposed model adopts an end-to-end architecture for training, which avoids duplicate decomposition and modeling in application. MscaleDNN uses NNs to approximate the FT process, which avoids setting the number of decomposition modes manually. Furthermore, this paper proposes an integrated model to extract linear and nonlinear features from load data. Considering the outliers in load data, the proposed model introduces the ARlncosh loss to train the NN based on the idea of robust regression. The ‘working’ likelihood function optimizes the hyper-parameter of the lncosh loss to approximate the distribution of the load data.
1.3 Contributions
This paper develops a multi-scale deep neural network-robust-autoregression-bi-directional gate recurrent unit (M-RAR-BiGRU) for STLF, which extracts long-term stable periodic features from a multi-scale perspective. Firstly, MscaleDNN decomposes load series into different frequencies. Secondly, the robust-autoregression-bi-directional gate recurrent unit (RAR-BiGRU) models the linear and nonlinear components of load data. During the model training, the ARlncosh loss function is used as the objective function to handle the outliers of load data. Meanwhile, the ARlncosh loss introduces the ‘working’ likelihood function to optimize the hyper-parameter for fitting the distribution of load data. The contributions of this paper are summarized as follows:
(a) To address multi-scale data integration in STLF, we introduce M-RAR-BiGRU, a deep learning approach that approximates the FT, enhancing the model’s ability to analyze periodic features across various time scales.
(b) Tackling the challenge of extracting both linear and nonlinear dynamics in load data, our framework couples MscaleDNN with robust autoregression and bi-directional gate recurrent units (RAR-BiGRU), along with an attention mechanism, to accurately capture and leverage temporal correlations.
(c) Addressing the impact of outliers on forecast accuracy, we employ the ARlncosh loss, using a ‘working’ likelihood function to adaptively approximate the distribution of load data, increasing the model’s robustness against anomalies.
1.4 Structure of this paper
The remaining part of this paper is organized as follows. Section 2 introduces the technologies referred to in this paper. Section 3 details the implementation of M-RAR-BiGRU. Sections 4 and 5 analyze the experiment results of Portugal and Australia data sets and verify the performance of the proposed algorithm. Section 6 concludes this paper.
2 Technology background
Signal decomposition methods can improve the performance and interpretability of STLF models. However, traditional decomposition methods, such as seasonal-trend decomposition (STL) [7] and VMD, require duplicate decomposition and modeling, which may waste computational resources. Additionally, it is difficult to determine the optimal number of decomposition modes, which may influence feature extraction. Therefore, this paper employs MscaleDNN to decompose the load series, based on the idea of the FT. MscaleDNN avoids manually setting the number of decomposition modes. M-RAR-BiGRU integrates signal decomposition and DL techniques into an end-to-end framework. Compared with traditional decomposition approaches, M-RAR-BiGRU avoids duplicate decomposition and modeling. The proposed method decomposes the load series into low- and high-frequency components as parts of the prediction. Furthermore, RAR-BiGRU models the linear and nonlinear features of load data. M-RAR-BiGRU improves the interpretability of the prediction results and the feature extraction ability of the forecasting model.
2.1 Multi-scale deep neural network
This paper borrows the idea of radial scaling in the frequency domain [27] to construct a multi-scale deep neural network (MscaleDNN), which decomposes the frequency-domain signal of a time series. For a given time series \(f(x) = \{x_{1}, x_{2}, x_{3}, \ldots , x_{t} \}\), its FT is defined as \({\hat{f}}(k)\). The decomposition of \({\hat{f}}(k)\) is as follows:
where \(f_{i}(x) = F^{-1}[\hat{f}_{i}(k)](x) = f(x) * \check{\chi }_{A_{i}}(x)\). The inverse FT of \(\chi _{A_{i}}(k)\) is a frequency selection kernel, which can be calculated with the Bessel function [28].
However, according to the frequency principle (F-Principle) [29], the NN learns content at different frequencies at different rates. Generally, the NN can learn the low-frequency content of data efficiently but is inefficient for high-frequency content. Therefore, MscaleDNN introduces a simple down-scaling to transform the high-frequency regions of Eq. (1) into low-frequency regions. The scaled version of \(\hat{f_{i}}(k)\) is defined as:
where \(\alpha _{i}\) is scale factor. The low-frequency region of \(f_{i}(x)\) can be obtained by a large enough \(\alpha _{i}\). MscaleDNN constructs a deep neural network (DNN) \(f_{\theta ^{n_{i}}}(x)\) to learn decomposed components \(f_{i}^{scale}(x)\) at a common frequency scale:
where [27] recommends \(\alpha _{i} = i\) or \(\alpha _{i} = 2^{i-1}\). The proposed algorithm decomposes the time series into low- and high-frequency components. The scale factors of the low- and high-frequency components are \(\{\alpha _{1}, \alpha _{2}, \ldots , \alpha _{d} \}\) and \(\{\alpha _{d+1}, \alpha _{d+2}, \ldots , \alpha _{D} \}\), respectively. The vector \(\{\alpha _{1}, \alpha _{2}, \ldots , \alpha _{d}, \alpha _{d+1}, \ldots , \alpha _{D} \}\) is generally in ascending order.
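For concreteness, the decomposition and down-scaling described above can be summarized as follows. This is a minimal sketch consistent with the cited MscaleDNN construction; the frequency-region indicators \(\chi _{A_{i}}\) and the normalization are assumptions rather than a reproduction of the paper's displayed equations:
\[
\hat{f}(k) = \sum _{i=1}^{D} \hat{f}_{i}(k), \qquad \hat{f}_{i}(k) = \chi _{A_{i}}(k)\,\hat{f}(k),
\]
\[
f_{i}^{scale}(x) = f_{i}(x/\alpha _{i}) \;\Rightarrow \; \hat{f}_{i}^{scale}(k) = \alpha _{i}\,\hat{f}_{i}(\alpha _{i} k) \quad (\text{low-frequency content for large } \alpha _{i}),
\]
\[
f(x) \approx \sum _{i=1}^{D} f_{\theta ^{n_{i}}}(\alpha _{i} x).
\]
In practice, each sub-network therefore receives its input multiplied by the scale factor \(\alpha _{i}\), and the outputs of all sub-networks are summed.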
Considering that the activation function of MscaleDNN is limited with respect to the input data, this paper introduces the soften Fourier mapping (SFM) activation function [30] in the low- and high-frequency parts of the DNN, as follows:
where the relaxation parameter \(s \in (0, 1]\) is used to control the output range of the activation function. This paper sets \(s = 0.5\) based on experiments. The first hidden layer of MscaleDNN simulates the Fourier expansion, while the remaining hidden layers approximate the Fourier coefficients. Through the SFM activation function, the training process of MscaleDNN can be viewed as an approximate FT. Compared with the input data, the Fourier coefficients oscillate relatively little. Therefore, SFM can effectively accelerate the training of MscaleDNN.
2.2 Adaptive scale lncosh loss function
The traditional lncosh loss function is defined as follows:
where \(\cosh (\lambda r) = (e^{\lambda r} + e^{-\lambda r})/2\) and the hyper-parameter \(\lambda \in (0, +\infty )\) controls the properties of lncosh loss function. Residual \(r = y - {\hat{y}}\) is the difference between actual load data y and prediction results \({\hat{y}}\). lncosh loss function approximates \(L_{1}\), \(L_{2}\), and Huber loss function, by adjusting hyper-parameter \(\lambda\).
Remark 1
As shown in Fig. 1, \(\lambda\) controls the properties of the lncosh loss: it approximates the \(L_{1}\) loss as \(\lambda \rightarrow \infty\), the \(L_{2}\) loss as \(\lambda \rightarrow 0\), and the Huber loss for intermediate \(\lambda\). Therefore, the lncosh loss can approximate the \(L_{1}\), \(L_{2}\), and Huber losses by adjusting \(\lambda\).
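To make Remark 1 concrete, the following short numerical sketch (an illustration only, not the authors' code) evaluates \(l(r) = \frac{1}{\lambda }\ln \cosh (\lambda r)\) for several values of \(\lambda\) and shows the \(L_{1}\)-like and \(L_{2}\)-like limits:

import numpy as np

def lncosh_loss(r, lam):
    """lncosh loss (1/lambda) * ln(cosh(lambda * r)), computed in a numerically stable way."""
    # ln(cosh(z)) = |z| + ln(1 + exp(-2|z|)) - ln(2), which avoids overflow for large |z|
    z = lam * np.asarray(r, dtype=float)
    return (np.abs(z) + np.log1p(np.exp(-2.0 * np.abs(z))) - np.log(2.0)) / lam

r = np.linspace(-3.0, 3.0, 7)
print(lncosh_loss(r, lam=100.0))  # close to |r|: L1-like behaviour for large lambda
print(lncosh_loss(r, lam=0.01))   # close to 0.005 * r**2: scaled L2-like behaviour for small lambda
print(lncosh_loss(r, lam=1.0))    # Huber-like transition between the two regimes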
Fig. 1
lncosh loss function with different \(\lambda\) (\(\lambda\) is the hyper-parameter that controls the properties of lncosh loss function)
A constant that is independent of the core parameters can generally be ignored, due to its limited influence on optimization. Thus, this paper omits the constant \(\lambda\) in the denominator of Eq. (10). This paper proposes a scale lncosh loss function \(l_{2}\) to approximate the noise distribution:
where the constant \((\lambda /\pi )\) is obtained from the normalization integral of the probability density function.
The likelihood function mainly describes the random generation of data. However, the ‘working’ likelihood function is only used for parameter estimation [26]. The scale lncosh loss function is derived from the logarithmic likelihood function. The corresponding ‘working’ likelihood function is as follows:
where \(i \in [1, T]\) is the ith moment of the load series.
This paper proposes the scale lncosh loss function and introduces the corresponding ‘working’ likelihood function to estimate \(\lambda\). Letting \(\zeta = 1/\lambda\), \(\zeta\) is essentially a scale parameter for the error. Thus, the error can be written as \(\zeta \varepsilon _{i}\), where \(\varepsilon _{i}\) follows a parameter-free distribution. The extended original objective function is as follows:
where residual \(r_{i} = y_{i} - \hat{y_{i}}\). Because \(r_{i}\) does not appear in the second term of Eq. (14), minimizing this ‘working’ likelihood function is equivalent to minimizing Eq. (11). Therefore, for a given \(\lambda\), the optimization over \(r_{i}\) has the same solution as Eq. (11). Eq. (14) can also be used to estimate \(\lambda\) (or \(\zeta = 1/\lambda\)). This scale lncosh loss function approximates the unknown distribution of the load data effectively.
This ‘working’ likelihood approach provides data-driven tuning of parameters, hyper-parameters, and variance parameters [26, 31]. \(\zeta\) can be selected automatically as \(\zeta ^{*}\) by setting the derivative of Eq. (14) with respect to \(\zeta\) to 0.
where \(\tanh (r_{i}/\zeta ) = (e^{r_{i}/\zeta } - e^{-r_{i}/\zeta })/(e^{r_{i}/\zeta } + e^{-r_{i}/\zeta })\). Minimizing Eq. (14) with respect to \(\zeta\) yields the optimal \(\zeta ^{*}\).
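Setting the derivative of the ‘working’ likelihood \(\sum _{i}\ln \cosh (r_{i}/\zeta ) + T\ln \zeta\) with respect to \(\zeta\) to zero gives the fixed point \(\zeta = \frac{1}{T}\sum _{i} r_{i}\tanh (r_{i}/\zeta )\); this closed form is derived here from the description above and is an assumption rather than a quotation of the paper's equation. A minimal sketch of the resulting fixed-point iteration:

import numpy as np

def estimate_zeta(residuals, zeta0=1.0, n_iter=10):
    """Fixed-point iteration for the scale parameter zeta of the (AR)lncosh loss.

    Assumed update: zeta <- mean(r * tanh(r / zeta)), obtained by setting the
    derivative of sum(ln cosh(r/zeta)) + T*ln(zeta) with respect to zeta to zero.
    """
    r = np.asarray(residuals, dtype=float)
    zeta = zeta0
    for _ in range(n_iter):
        zeta = np.mean(r * np.tanh(r / zeta))
    return zeta

# toy residuals with a few large outliers; zeta typically stabilizes within a few
# iterations, consistent with the convergence behaviour reported in Figs. 6 and 10
rng = np.random.default_rng(0)
residuals = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(0.0, 8.0, 10)])
print(estimate_zeta(residuals))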
3 Multiscale-integrated deep learning algorithm
3.1 Multi-scale deep neural network
STL assumes that the load series is composed of trend terms, period terms, and noise terms. The trend and period terms are the modelable parts of the load series. The feature extraction of MscaleDNN is an approximate FT process. In the frequency domain, the modelable parts of the load series consist of periodic signals with different frequencies and amplitudes. Thus, following this view, MscaleDNN extracts the low- and high-frequency components of the load series. The specific structure of MscaleDNN is shown in Fig. 2.
To overcome the practical limitations of traditional decomposition methods, MscaleDNN employs time features (such as year, month, day, and hour) as input features and outputs low- and high-frequency decomposition results. MscaleDNN approximates the FT with the low-frequency scale factors \(\{\alpha _{1}, \alpha _{2}, \ldots , \alpha _{d}\}\) and the high-frequency scale factors \(\{\alpha _{d+1}, \alpha _{d+2}, \ldots , \alpha _{D} \}\). Meanwhile, MscaleDNN constructs two different DNNs to output the decomposition results at different frequencies.
Fig. 2
The multi-scale deep neural network for low- and high-frequency extraction, where \(\alpha _{i} (i = 1, 2, 3, \ldots , d)\) and \(W^{1}_{i} (i = 1, 2, 3, \ldots , P)\) are the scale factors and weights of the low-frequency components, respectively. \(\alpha _{i} (i = d+1, d+2, d+3, \ldots , D)\) and \(W^{1}_{i} (i = 1, 2, 3, \ldots , Q)\) are the scale factors and weights of the high-frequency components, respectively
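As a rough illustration of the structure in Fig. 2, the following PyTorch sketch builds a low- and a high-frequency branch from scaled sub-networks. It is not the authors' implementation: the layer sizes and scale factors are arbitrary, and a scaled sine–cosine mapping stands in for the SFM activation of [30], whose exact form is not reproduced here:

import torch
import torch.nn as nn

class ScaledSubNet(nn.Module):
    """One sub-network: scaled input -> Fourier-style first layer -> small MLP."""
    def __init__(self, in_dim, hidden, alpha, s=0.5):
        super().__init__()
        self.alpha, self.s = alpha, s
        self.first = nn.Linear(in_dim, hidden)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, x):
        z = self.first(self.alpha * x)                     # radial scaling of the input
        # sine/cosine mapping scaled by s, standing in for the SFM activation [30]
        feat = torch.cat([self.s * torch.sin(z), self.s * torch.cos(z)], dim=-1)
        return self.mlp(feat)

class MscaleDNNSketch(nn.Module):
    """Low- and high-frequency branches built from sub-networks with different scale factors."""
    def __init__(self, in_dim=4, hidden=32, low_scales=(1, 2), high_scales=(4, 8, 16)):
        super().__init__()
        self.low = nn.ModuleList([ScaledSubNet(in_dim, hidden, a) for a in low_scales])
        self.high = nn.ModuleList([ScaledSubNet(in_dim, hidden, a) for a in high_scales])

    def forward(self, t_feat):                             # t_feat: (batch, 4) time features
        x_low = sum(net(t_feat) for net in self.low)       # low-frequency component
        x_high = sum(net(t_feat) for net in self.high)     # high-frequency component
        return x_low, x_high

# t_feat: normalized (year, month, day, hour) features for a batch of time steps
low, high = MscaleDNNSketch()(torch.rand(16, 4))
print(low.shape, high.shape)                               # torch.Size([16, 1]) twice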
3.2 Robust-autoregression-bi-directional gate recurrent unit
RAR-BiGRU is a robust prediction model trained with the ARlncosh loss. AR and the bi-directional gate recurrent unit (BiGRU) model the linear and nonlinear features of the load series, respectively. Moreover, RAR-BiGRU utilizes ATTN to further extract temporal features from the load data. Considering the outliers of the load series, the ARlncosh loss approximates the distribution of outliers through hyper-parameter adjustment. This paper introduces AR into BiGRU to extract linear features while considering the long-term dependence of the load series. Meanwhile, the ARlncosh loss, as an adaptive loss function, can enhance the robustness of AR-BiGRU. RAR-BiGRU effectively improves the stability and robustness of the BiGRU.
AR captures the linear features of load data. AR generates the prediction \(x_{t}\) at moment t as a linear combination of the historical data \(\{x_{t-p},\ldots , x_{t-2}, x_{t-1}\}\):
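In standard notation (a reconstruction sketch only; the coefficients \(\phi _{j}\) and the bias b are illustrative symbols rather than the paper's own), this linear component reads \(\hat{x}^{AR}_{t} = \sum _{j=1}^{p} \phi _{j}\, x_{t-j} + b\), where the \(\phi _{j}\) are learnable weights over the last p lags.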
Furthermore, BiGRU uses GRU units to extract nonlinear features from load data. The update process of GRU units mainly depends on the following reset gate \(r_{t}\) and update gate \(z_{t}\):
where \(\sigma\) is the sigmoid function. \(W_{xr}\), \(W_{hr}\), \(W_{xz}\), \(W_{hz}\) represent the weights of the NN. \(b_{r}\) and \(b_{z}\) represent the biases of the NN. \(H_{t-1}\) represents the hidden state of the previous moment.
BiGRU uses two hidden layers to extract previous and future information. The final output concatenates the outputs of the forward and backward GRUs at each moment. The training process of BiGRU is as follows:
where \(\mathop {h_{t}}\limits ^{\rightarrow }\) and \(\mathop {h_{t}}\limits ^{\leftarrow }\) represent forward and backward features of hidden layers, respectively. Nonlinear prediction is generated by the linear mapping of \(h_{t}\).
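A minimal PyTorch sketch of such a BiGRU predictor is given below (an illustration only; the input and hidden dimensions are assumptions). A bi-directional GRU produces forward and backward hidden states that are concatenated and mapped linearly to the nonlinear prediction:

import torch
import torch.nn as nn

class BiGRUSketch(nn.Module):
    def __init__(self, in_dim=3, hidden=64):
        super().__init__()
        # bidirectional=True runs a forward and a backward GRU over the sequence
        self.bigru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)      # maps [h_forward; h_backward] to a prediction

    def forward(self, x):                         # x: (batch, seq_len, in_dim)
        h, _ = self.bigru(x)                      # h: (batch, seq_len, 2 * hidden)
        return self.head(h[:, -1, :])             # nonlinear prediction at the last step

# example: a window of p = 24 past steps, each with (load, low-freq, high-freq) features
y_hat = BiGRUSketch()(torch.rand(8, 24, 3))
print(y_hat.shape)                                # torch.Size([8, 1])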
BiGRU extracts features from both directions but cannot consider the correlations between different moments. Historical load data may have correlations between specific moments. ATTN further extracts temporal features from load data. ATTN assigns weights according to the importance of the input data, avoiding interference from information that is useless for modeling. ATTN encodes the historical load data as follows:
where \(r_{i} = y_{i} - \hat{y_{i}}\), and \(\zeta\) is calculated by the residual estimation.
To accelerate the convergence of \(\zeta\) in the ARlncosh loss function, this paper introduces the robust penalized extreme learning machine (RPELM) [33] to generate the initial prediction sequence and residual sequence \(r_{init}\). RPELM uses a robust penalty framework and M-estimation theory for the robust optimization of the extreme learning machine (ELM). According to \(r_{init}\), the set of initial unbiased robust estimates \(\zeta _{init}\) is as follows:
The optimized ARlncosh objective function can be constructed with \({\hat{\zeta }}_{init}\).
3.4 Aggregate prediction results
As shown in Fig. 3, this paper combines MscaleDNN and RAR-BiGRU to propose a novel STLF framework. MscaleDNN uses NNs to approximate the adaptive FT of the load series. For example, the predicted value \({\hat{x}}_{t}\) at moment t is generated from the last p moments \(\{t-p, t-p+1, \ldots , t-1\}\) as follows:
where \({\hat{x}}_{t, low}\) and \({\hat{x}}_{t, high}\) represent the low- and high-frequency components of MscaleDNN at moment t. \({\hat{x}}_{t, RAR-BiGRU}\) represents the prediction result of RAR-BiGRU at moment t. \(\alpha\), \(\beta\), and \(\kappa\) are adjustment coefficients that control the proportions of the different prediction modules.
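Consistent with the description above (a reconstruction sketch, since the displayed equation is not reproduced here), the aggregation can be written as \({\hat{x}}_{t} = \alpha \, {\hat{x}}_{t, low} + \beta \, {\hat{x}}_{t, high} + \kappa \, {\hat{x}}_{t, RAR-BiGRU}\).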
Fig. 3
The forecasting framework of M-RAR-BiGRU. \(t_{i}\), \({\hat{x}}_{i, low}\), and \({\hat{x}}_{i, high}\) are the time feature, low-frequency component, and high-frequency component, respectively, where \(i \in [1, \hbox {N}]\). N is the length of the historical load series. \({\hat{x}}_{t}\) is the final prediction result of M-RAR-BiGRU, where \(t \in [1, \hbox {L}]\). L is the length of the forecasting step. The structure of MscaleDNN is detailed in Fig. 2. The prediction of RAR-BiGRU is calculated by weighting different features. \(\alpha\), \(\beta\), and \(\kappa\) are adjustment coefficients for the different prediction modules
The proposed algorithm extracts features of the load data from a multi-scale perspective. The overall training process is end-to-end. Specifically, the input of the proposed algorithm has two stages. Stage 1: MscaleDNN decomposes the current load series using the year, month, day, and hour information corresponding to the p moments \(\{t-p, t-p+1, \ldots , t-1\}\). Stage 2: RAR-BiGRU forecasts the load data at moment t by concatenating the low- and high-frequency components with the corresponding load data. The overall training process is as follows:
Step 1: MscaleDNN generates the low- and high-frequency components of moment t from the historical time information.
Step 2: RAR-BiGRU concatenates the low- and high-frequency components with the load data at the corresponding moments as input data, and outputs the prediction at moment t.
Step 3: Aggregate the prediction results of MscaleDNN and RAR-BiGRU according to Eq. (25), and update the network parameters according to the ARlncosh loss function in Eq. (23); a sketch of one such training step follows these steps.
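The sketch below ties Steps 1–3 together for a single gradient update. It reuses the illustrative MscaleDNNSketch and BiGRUSketch modules defined earlier, and the weighting coefficients, tensor shapes, and the lncosh-based objective are assumptions following Sects. 2.2 and 3.4 rather than the paper's exact equations:

import torch

def lncosh(z):
    # numerically stable ln(cosh(z))
    a = torch.abs(z)
    return a + torch.log1p(torch.exp(-2.0 * a)) - torch.log(torch.tensor(2.0))

def training_step(mscale, rar_bigru, optimizer, hist_t_feats, hist_loads, t_feat, target,
                  alpha=0.3, beta=0.3, kappa=0.4, zeta=1.0):
    """One end-to-end update following Steps 1-3 (illustrative shapes).

    hist_t_feats: (batch, p, 4) time features of the p historical moments
    hist_loads:   (batch, p, 1) historical load values
    t_feat:       (batch, 4)    time features of the forecast moment t
    target:       (batch, 1)    load value at moment t
    """
    optimizer.zero_grad()
    b, p, _ = hist_t_feats.shape
    # Step 1: low-/high-frequency components for the historical and forecast moments
    low_h, high_h = mscale(hist_t_feats.reshape(b * p, -1))
    low_h, high_h = low_h.reshape(b, p, 1), high_h.reshape(b, p, 1)
    low_t, high_t = mscale(t_feat)
    # Step 2: concatenate the components with the load data and forecast moment t
    pred_rnn = rar_bigru(torch.cat([hist_loads, low_h, high_h], dim=-1))
    # Step 3: aggregate the module outputs and apply the (AR)lncosh-style objective
    pred = alpha * low_t + beta * high_t + kappa * pred_rnn
    loss = lncosh((target - pred) / zeta).mean() + torch.log(torch.tensor(zeta))
    loss.backward()
    optimizer.step()
    return loss.item()

# usage sketch with the modules defined earlier
mscale, rar = MscaleDNNSketch(), BiGRUSketch(in_dim=3)
opt = torch.optim.Adam(list(mscale.parameters()) + list(rar.parameters()), lr=1e-3)
training_step(mscale, rar, opt, torch.rand(8, 24, 4), torch.rand(8, 24, 1),
              torch.rand(8, 4), torch.rand(8, 1))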
3.6 Computational complexity
In this section, we analyze the computational complexity of the M-RAR-BiGRU model. The FT has a complexity of \(O(N \cdot \log N)\), where \(N\) is the input length. The RAR-BiGRU includes GRU units with a complexity of \(O(T \cdot D^2)\) per unit, doubled for bi-directional processing, where \(T\) is the number of time steps and \(D\) is the hidden layer dimension. The linear regression component has a complexity of \(O(N \cdot D)\), where \(N\) is the sample size and \(D\) is the feature size. The attention mechanism has a complexity of \(O(N^2 \cdot D)\). The ARlncosh loss function used for loss calculation has a complexity of \(O(N)\).
Combining these components, the overall computational complexity of the model at each learning cycle is \(O(N \cdot \log N + T \cdot D^2 + N \cdot D + N^2 \cdot D + N)\). The lower-order terms are typically disregarded, resulting in a dominant term of \(O(N^2 \cdot D)\). Therefore, the overall complexity simplifies to \(O(N^2 \cdot D)\). To provide a clearer comparison, we summarize the time complexities of various models. Table 1 highlights the significant computational demands of the attention mechanism within these models.
Table 1
Comparison of time complexity
Model | Time complexity
ELM | \(O(N \cdot D)\)
SVR | \(O(N^{3})\)
GRU | \(O(T \cdot D^{2})\)
BiGRU | \(O(T \cdot D^{2})\)
LSTM-MSNet | \(O(T \cdot D^{2})\)
LSTNet | \(O(T \cdot D^{2})\)
Transformer | \(O(N^{2} \cdot D)\)
Informer | \(O(N \log N \cdot D)\)
Reformer | \(O(N \log N \cdot D)\)
M-BiGRU | \(O(T \cdot D^{2})\)
M-RAR-BiGRU | \(O(N^{2} \cdot D)\)
4 Example 1: Substation load series from Portugal
4.1 Experimental data
This experiment aims to verify the prediction performance of M-RAR-BiGRU on load series with noise. Load series with high sample entropy may contain plentiful information and complex noise. Thus, this paper experiments on 6 load series from Portugal substations with high sample entropy and outlier levels. The load series are named after their substations: MT7, MT31, MT32, MT34, MT161, and MT259. Artur Trindade published the load series of these substations from 2011 to 2014 in the UCI machine learning repository. Considering the generalization performance of MscaleDNN, this paper uses load series with a relatively long period, which span from 0:00 on January 1, 2013, to 23:00 on December 31, 2014, with 17,520 sample points. The time granularity of each substation load series is one hour. Complex load series increase the difficulty of feature extraction. Meanwhile, the outliers in the load data test the robustness of the ARlncosh loss. This experiment adopts \(70\%\) of the sample points for model training, \(10\%\) for model validation, and \(20\%\) for model testing.
4.2 Evaluation metric
To evaluate the prediction performance of the proposed algorithm and its contrast model, this experiment introduces the following mean absolute error (MAE) and root mean square error (RMSE) as error indicators:
where T is the number of sample points, and \(x_{t}\) and \({\hat{x}}_{t}\) represent the load data and the prediction result at the tth moment, respectively. MAE is a universal error measure, while RMSE is sensitive to predictions with large errors.
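In their standard form (consistent with the definitions of T, \(x_{t}\), and \({\hat{x}}_{t}\) above), these indicators are \(\mathrm{MAE} = \frac{1}{T}\sum _{t=1}^{T} |x_{t} - {\hat{x}}_{t}|\) and \(\mathrm{RMSE} = \sqrt{\frac{1}{T}\sum _{t=1}^{T} (x_{t} - {\hat{x}}_{t})^{2}}\).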
4.3 Experimental setting
To verify the prediction performance of the proposed model, this paper uses ELM, SVR, GRU, BiGRU, the long short-term memory multiseasonal net (LSTM-MSNet) [34], the long- and short-term time-series network (LSTNet) [35], Transformer [32], Informer [36], and Reformer [37] as contrast models. ELM and SVR are ML comparative models, while GRU, BiGRU, LSTM-MSNet, LSTNet, Transformer, Informer, and Reformer are DL comparative models. ELM adopts the tanh function as the activation function. SVR sets the kernel function as the radial basis function (RBF). The parameters of LSTM-MSNet and LSTNet follow [34] and are further adjusted through extensive experiments. For unbiased experiments, each experiment is repeated 15 times and the mean is reported. All experiments in this paper are completed in a Python 3.8 environment with a GTX 1650 Ti graphics card and 4 GB of graphics memory.
4.4 Outlier test
This experiment analyzes the outliers of the load series. Figure 4 shows the quantile–quantile (Q–Q) plots and boxplots for the outlier test. In Fig. 4, the Q–Q plots show that the distribution of the residual sequence has a greater probability of generating extreme maximum and minimum values than the normal distribution, which indicates that the load series contain outliers with high probability. Furthermore, many points exceed the upper and lower whiskers in each boxplot. Therefore, the load series contain a certain number of outliers.
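A typical way to reproduce such an outlier check (a generic sketch, not the authors' script; the file name is hypothetical) is:

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

load = np.loadtxt("mt31_load.csv")            # hypothetical file, one hourly load value per line

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
stats.probplot(load, dist="norm", plot=ax1)   # Q-Q plot against the normal distribution
ax1.set_title("Q-Q plot")
ax2.boxplot(load)                             # points beyond the whiskers are outlier candidates
ax2.set_title("Boxplot")
plt.tight_layout()
plt.show()

# IQR rule: count points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(load, [25, 75])
iqr = q3 - q1
n_out = np.sum((load < q1 - 1.5 * iqr) | (load > q3 + 1.5 * iqr))
print(n_out, "outlier candidates")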
To verify the performance improvement of the proposed model, this experiment compares M-BiGRU with the contrast models. Table 2 shows the prediction results of the comparative experiment. Among the ML models, SVR has slightly lower MAE and RMSE than ELM. Owing to the support vectors, SVR is robust in modeling, whereas it is difficult for ELM to set the number of hidden-layer nodes. Among the DL models, GRU, BiGRU, LSTM-MSNet, LSTNet, Transformer, Informer, and Reformer have similar performance. The experimental results indicate that DL models improve prediction performance by extracting temporal features adaptively. The ATTN has a powerful capability for feature extraction. Compared to the contrast models, M-BiGRU achieves better MAE and RMSE indicators on MT31, MT34, MT161, and MT259. On MT7, the error metrics of M-BiGRU are slightly higher than those of SVR. This comparative experiment verifies the superiority of M-BiGRU in STLF compared to existing classical models.
Table 2
Comparison model experiment results on substation load data sets from Portugal
Model | MT7 (MAE/RMSE) | MT31 (MAE/RMSE) | MT32 (MAE/RMSE) | MT34 (MAE/RMSE) | MT161 (MAE/RMSE) | MT259 (MAE/RMSE)
ELM | 3.29/5.93 | 24.36/31.92 | 9.91/13.09 | 6.30/8.74 | 375.33/500.47 | 24.57/32.89
SVR | 2.46/3.97 | 25.29/33.91 | 8.54/11.78 | 5.02/6.87 | 278.30/410.93 | 26.56/37.24
GRU | 3.14/5.88 | 24.73/34.09 | 8.88/11.86 | 5.33/7.02 | 274.95/379.90 | 24.31/32.36
BiGRU | 3.49/5.35 | 24.70/34.18 | 8.84/11.82 | 5.29/7.08 | 290.74/398.66 | 24.25/32.31
LSTM-MSNet | 2.94/5.42 | 23.98/33.26 | 8.36/11.58 | 5.18/7.32 | 336.34/467.94 | 24.08/32.38
LSTNet | 2.74/4.21 | 22.25/30.03 | 8.36/11.57 | 5.49/7.52 | 273.15/409.25 | 27.87/37.71
Transformer | 2.90/4.43 | 22.29/29.94 | 8.72/11.77 | 5.56/7.75 | 285.73/403.55 | 24.55/32.85
Informer | 2.72/4.23 | 22.62/30.19 | 8.55/11.55 | 5.41/7.48 | 300.67/415.40 | 24.68/33.59
Reformer | 2.80/4.60 | 24.96/32.96 | 8.81/12.00 | 5.29/7.33 | 300.89/416.11 | 24.55/33.39
M-BiGRU | 2.63/4.10 | 20.41/27.50 | 9.66/13.11 | 4.41/5.95 | 258.68/364.78 | 21.58/29.25
M-RAR-BiGRU | 2.49/3.91 | 20.14/27.23 | 9.42/12.56 | 4.26/5.61 | 256.32/363.42 | 21.24/28.70
Bold values highlight the lowest MAE or RMSE for each dataset, indicating the best performing model
This section designs a comparative experiment with the \(L_{2}\) and adaptive rescaled Huber (ARHuber) loss functions to study the robustness of the ARlncosh loss. Table 3 shows the experimental results of M-BiGRU adopting different loss functions. Because the MAE and RMSE values are close, this section further conducts a statistical test on the prediction results, as shown in Table 4. Table 4 uses M-BiGRU to compare the ARlncosh loss with the \(L_{2}\) and ARHuber losses. 1 indicates that the model using the ARlncosh loss obtains a better prediction performance than the comparison model, while 0 indicates that the comparison model performs better. On the MT7, MT32, and MT161 data sets, the MAE and RMSE of the model trained with the ARlncosh loss are lower than those of the model trained with the \(L_{2}\) loss. Thus, the ARlncosh loss is more robust to outliers than the \(L_{2}\) loss. Additionally, this section further compares the difference between the ARlncosh and ARHuber losses through the Wilcoxon signed-rank test, as shown in Table 4. On MT161 and MT259 in Table 4, the ARlncosh loss achieves a lower MAE and RMSE than the ARHuber loss. These statistical test results show that the ARlncosh loss has a better modeling ability than the ARHuber loss. To further analyze the competitive performance of the ARlncosh loss on the MT161 and MT259 data sets, this experiment draws the histogram of residuals fitted by ARIMA. Figure 5 shows the distribution of the residuals fitted by ARIMA, the normal distribution, the ARHuber distribution, and the ARlncosh distribution. Furthermore, Table 5 uses the Wasserstein distance to quantify the distance between the different distributions. In Fig. 5, the ARlncosh distribution has a relatively concentrated sample distribution, which is similar to that of the residuals. This is consistent with the results in Table 4. The experimental results on the MT161 and MT259 data sets indicate that the ARlncosh distribution is closer to the real noise distribution. Thus, the ARlncosh loss improves the robustness of the proposed model. Furthermore, Fig. 6 shows the convergence curve of the hyper-parameter \(\zeta\) of the ARlncosh loss. After 3 iterations, \(\zeta\) tends to converge. This experiment verifies that the ‘working’ likelihood function accelerates the convergence of \(\zeta\) effectively.
Table 3
Comparative experimental results of M-BiGRU using \(L_{2}\), ARHuber, and ARlncosh loss function on Portugal data sets
Loss function | MT7 (MAE/RMSE) | MT31 (MAE/RMSE) | MT32 (MAE/RMSE) | MT34 (MAE/RMSE) | MT161 (MAE/RMSE) | MT259 (MAE/RMSE)
\(L_{2}\) | 2.75/4.27 | 20.72/27.77 | 10.10/13.48 | 4.72/6.27 | 268.08/373.82 | 21.88/29.39
ARHuber | 2.60/4.06 | 20.43/27.40 | 9.78/13.21 | 4.40/5.95 | 262.25/369.62 | 21.81/29.37
ARlncosh | 2.63/4.10 | 20.41/27.50 | 9.66/13.11 | 4.41/5.95 | 258.68/364.78 | 21.58/29.25
Bold values highlight the lowest MAE or RMSE for each dataset, indicating the best performing model
Table 4
Statistical test results of M-BiGRU using \(L_{2}\), ARHuber, and ARlncosh loss function on Portugal data sets
Loss function compared with ARlncosh | Error indicator | MT7 | MT31 | MT32 | MT34 | MT161 | MT259
\(L_{2}\) | MAE | 1 | 1 | 1 | 1 | 1 | 1
\(L_{2}\) | RMSE | 1 | 1 | 1 | 0 | 1 | 1
ARHuber | MAE | 0 | 0 | 0 | 0 | 1 | 1
ARHuber | RMSE | 0 | 0 | 0 | 0 | 1 | 1
Bold values indicate that the model used ARlncosh loss obtains a better prediction performance than the comparison model
Table 5
Wasserstein distance between normal distribution, ARHuber distribution, ARlncosh distribution, and residual distribution on Portugal data sets
Data set | MT7 | MT31 | MT32 | MT34 | MT161 | MT259
Normal-residual distribution | 87 | 50 | 52 | 85 | 91 | 78
ARHuber-residual distribution | 58 | 38 | 38 | 48 | 65 | 38
ARlncosh-residual distribution | 63 | 38 | 38 | 32 | 51 | 23
Bold values highlight the lowest Wasserstein distance for each dataset, indicating the best matching distribution to the residual distribution
Fig. 5
The residual distribution, normal distribution, ARHuber distribution, and ARlncosh distribution on MT161 and MT259 data sets
To verify the effectiveness of MscaleDNN in the proposed model, this paper compares M-BiGRU with BiGRU. As shown in Table 2, the MAE and RMSE indexes of M-BiGRU are lower than those of BiGRU on all load series except MT32. Table 6 shows the statistical test between BiGRU and M-BiGRU. 1 means that M-BiGRU has a better forecasting performance than BiGRU in terms of the error indicator, and 0 means that BiGRU has a higher forecasting accuracy than M-BiGRU. As shown in Table 6, there is a significant difference between M-BiGRU and BiGRU for each power load series. Combined with the results of Table 2, it can be concluded that the model trained with MscaleDNN achieves the optimal prediction results on the 6 load series.
Table 6
Statistical test results for the effectiveness of MscaleDNN on Portugal data sets
Comparison | Error indicator | MT7 | MT31 | MT32 | MT34 | MT161 | MT259
M-BiGRU vs. BiGRU | MAE | 1 | 1 | 1 | 1 | 1 | 1
M-BiGRU vs. BiGRU | RMSE | 1 | 1 | 1 | 0 | 1 | 1
Bold values highlight that M-BiGRU has a better forecasting performance than BiGRU
This experiment conducts spectrum analysis on the prediction results of MscaleDNN to further analyze its contribution to the proposed model. Firstly, the FT transforms the load series into a frequency-domain signal. Then, the periodic features are calculated from the frequencies with significant amplitudes. Figure 7a–c shows the curves of the overall prediction result and the low- and high-frequency prediction results for MT31. The spectrum analysis is visualized in Fig. 7d–f. As shown in Fig. 7d, the frequencies with the largest amplitudes are 730, 1460, 2190, 2920, and 4380. Thus, the periods of MT31 are 1 day, 12 h, 8 h, 6 h, and 4 h. Similarly, the significant periods of the low-frequency component are 1 day and 12 h. The spectrum analysis shows that the significant periods of the high-frequency component are 1 day, 12 h, and 8 h. The low- and high-frequency components reflect multiple main periods of the load series. These results explain the physical meaning of the low- and high-frequency components and verify the effectiveness of MscaleDNN for feature extraction in the frequency domain. The proposed framework improves the accuracy and interpretability of the prediction results.
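The period extraction described above can be reproduced with a discrete Fourier transform of the hourly series: a peak at frequency index k of an N-point record corresponds to a period of N/k hours, so for N = 17,520 the index 730 maps to 24 h, 1460 to 12 h, and so on. A generic sketch (not the authors' script):

import numpy as np

def dominant_periods(series, top_k=5):
    """Return the periods (in samples) of the top_k amplitude peaks of the spectrum."""
    x = np.asarray(series, dtype=float)
    n = len(x)
    amp = np.abs(np.fft.rfft(x - x.mean()))   # remove the mean so the DC component does not dominate
    idx = np.argsort(amp)[::-1][:top_k]       # frequency indices with the largest amplitudes
    idx = idx[idx > 0]
    return n / idx                            # period in samples (hours for hourly data)

# toy example: two years of hourly data with daily and half-daily cycles plus noise
t = np.arange(17520)
y = 10 * np.sin(2 * np.pi * t / 24) + 3 * np.sin(2 * np.pi * t / 12) + np.random.randn(t.size)
print(dominant_periods(y, top_k=3))           # dominant periods, approximately [24., 12., ...]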
Finally, this experiment compares M-BiGRU with M-RAR-BiGRU. As shown in Table 2, the MAE and RMSE of M-RAR-BiGRU on the 6 load series are consistently better than those of M-BiGRU. The experimental results demonstrate that BiGRU can extract the nonlinear features of the load series effectively. Meanwhile, the linear features extracted by AR help the prediction model fit the load series more accurately. Moreover, the ARlncosh loss can fit outliers with different distributions adaptively. Therefore, the combination of MscaleDNN and RAR-BiGRU can extract multi-scale features from the load series, which yields the optimal model for STLF in this comparison.
5 Example 2: Power load series from Australia
5.1 Experimental data
Example 2, as an additional experiment, uses the load series from five major states of Australia to evaluate the performance of M-RAR-BiGRU. These data sets are named after their state abbreviations: New South Wales (NSW), Queensland (QLD), South Australia (SA), Tasmania (TAS), and Victoria (VIC). The load series span from 0:00 on January 1, 2019, to 23:00 on December 31, 2020, with a time granularity of one hour. Example 2 uses the same split ratio as Example 1, where 70\(\%\) of the sample points are used for model training, 10\(\%\) for model validation, and 20\(\%\) for model testing.
5.2 Evaluation metric
Example 2 uses the same error index as Example 1 with Eqs. (26) and (27).
5.3 Experimental setting
This experiment uses nine comparative models (ELM, SVR, GRU, BiGRU, LSTM-MSNet, LSTNet, Transformer, Informer, and Reformer) to verify the prediction performance of M-BiGRU. ELM and SVR are ML comparative models. GRU, BiGRU, LSTM-MSNet, LSTNet, Transformer, Informer, and Reformer are DL comparative models. ELM sets the activation function as the tanh function. SVR sets the kernel function as RBF. The parameter setting of LSTM-MSNet and LSTNet follows [34] and is adjusted empirically. Each group of experiments is repeated 15 times and the mean is taken, for unbiased contrast experiments. All experiments are completed in a Python 3.8 environment with a GTX 1650 Ti graphics card and 4 GB of graphics memory.
5.4 Outlier test
This section analyzes the outliers of the 5 load series from Australia. As in Example 1, Fig. 8 shows the Q–Q plots and boxplots for the outlier test. The Q–Q plots show that the distribution of the load series has a higher probability of generating extreme maximum and minimum values than the normal distribution. Furthermore, the boxplots show that outliers exceed the upper and lower whiskers of the load series. In conclusion, there are certain outliers in these load series.
Example 2 conducts the comparative experiment of M-BiGRU. Table 7 shows the prediction results of the comparative experiment. SVR, as an ML model, performs slightly better than ELM. The DL models (GRU, BiGRU, LSTM-MSNet, LSTNet, Transformer, Informer, and Reformer) perform similarly well, owing to their nonlinear feature extraction. On the TAS data set, Reformer and M-BiGRU have approximately the same performance. On the other load series, M-BiGRU achieves the optimal MAE and RMSE values. This experimental result demonstrates the superiority of M-BiGRU in STLF.
Table 7
Comparison model experiment results on load data sets from Australia
Model | NSW (MAE/RMSE) | QLD (MAE/RMSE) | SA (MAE/RMSE) | TAS (MAE/RMSE) | VIC (MAE/RMSE)
ELM | 280/371 | 219/304 | 119/156 | 46/62 | 257/334
SVR | 230/310 | 169/230 | 101/134 | 42/57 | 218/287
GRU | 231/315 | 174/237 | 98/133 | 42/57 | 191/252
BiGRU | 223/299 | 166/226 | 97/131 | 40/53 | 192/254
LSTM-MSNet | 206/274 | 164/224 | 96/127 | 41/55 | 175/234
LSTNet | 212/283 | 191/258 | 86/118 | 42/57 | 196/259
Transformer | 183/238 | 158/190 | 95/128 | 43/55 | 165/192
Informer | 198/262 | 154/196 | 93/115 | 44/57 | 193/243
Reformer | 178/214 | 165/212 | 90/121 | 40/53 | 183/240
M-BiGRU | 170/227 | 134/188 | 84/116 | 40/53 | 153/206
M-RAR-BiGRU | 168/224 | 121/173 | 82/115 | 37/49 | 152/205
Bold values highlight the lowest MAE or RMSE for each dataset, indicating the best performing model
This experiment conducts a comparative experiment on M-BiGRU with the \(L_{2}\), ARHuber, and ARlncosh losses in Table 8, to verify the effectiveness of the ARlncosh loss. Due to the similar experimental results, statistical tests are conducted on the results of M-BiGRU with the \(L_{2}\), ARHuber, and ARlncosh losses, as shown in Table 9. The ARlncosh loss shows a significant difference from the \(L_{2}\) loss on the SA and VIC data sets. Meanwhile, M-BiGRU with the ARlncosh loss obtains a lower MAE and RMSE than the other loss functions on the SA and VIC data sets in Table 8. Compared with the ARHuber loss, the model trained with the ARlncosh loss achieves a lower MAE on the VIC and SA data sets. Thus, the ARlncosh loss performs better than the \(L_{2}\) and ARHuber losses on these two data sets. Considering the conclusion of Example 1, this section further analyzes the performance of the ARlncosh loss on random noise. Figure 9 shows the data distributions corresponding to the different loss functions on the SA and VIC data sets. Table 10 reports the Wasserstein distance between the different distributions. As shown in Fig. 9, the ARlncosh loss presents a concentrated data distribution that is similar to the residual distribution. In this experiment, M-BiGRU using the ARlncosh loss achieves a better fit than the other contrast models. The above results are consistent with Table 10, which shows that the ARlncosh loss optimized by the hyper-parameter \(\zeta\) generates a data distribution closer to the real noise. Moreover, Fig. 10 shows the convergence of \(\zeta\) in the ARlncosh loss. On all load series, \(\zeta\) tends to converge after 3 iterations, which is similar to the conclusion in Example 1. This section verifies that the ARlncosh loss performs better than the contrast loss functions.
Table 8
Comparative experimental results of M-BiGRU using \(L_{2}\), ARHuber, and ARlncosh loss function on Australian data sets
Loss function | NSW (MAE/RMSE) | QLD (MAE/RMSE) | SA (MAE/RMSE) | TAS (MAE/RMSE) | VIC (MAE/RMSE)
\(L_{2}\) | 170/227 | 128/181 | 86/119 | 38/51 | 158/213
ARHuber | 170/227 | 126/179 | 85/119 | 37/49 | 157/210
ARlncosh | 170/227 | 125/178 | 84/116 | 37/49 | 153/206
Bold values highlight the lowest MAE or RMSE for each dataset, indicating the best performing model
Table 9
Statistical test results of M-BiGRU using \(L_{2}\), ARHuber, and ARlncosh loss function on Australian data sets
Loss function compared with ARlncosh | Error indicator | NSW | QLD | SA | TAS | VIC
\(L_{2}\) | MAE | 0 | 1 | 1 | 1 | 1
\(L_{2}\) | RMSE | 0 | 1 | 1 | 1 | 1
ARHuber | MAE | 0 | 0 | 0 | 0 | 1
ARHuber | RMSE | 0 | 0 | 1 | 0 | 1
Bold values indicate that the model used ARlncosh loss obtains a better prediction performance than the comparison model
Table 10
Wasserstein distance between normal distribution, ARHuber distribution, ARlncosh distribution, and residual distribution on Australian data sets
Data set | NSW | QLD | SA | TAS | VIC
Normal-residual distribution | 18 | 36 | 50 | 40 | 39
ARHuber-residual distribution | 18 | 8 | 16 | 8 | 14
ARlncosh-residual distribution | 18 | 8 | 9 | 7 | 12
Bold values highlight the lowest Wasserstein distance for each dataset, indicating the best matching distribution to the residual distribution
Fig. 9
The residual distribution, normal distribution, ARHuber distribution, and ARlncosh distribution on SA and VIC data sets
This section designs an experiment similar to Example 1 to verify the performance of MscaleDNN. Table 7 shows that M-BiGRU achieves better MAE and RMSE than BiGRU on the NSW, QLD, SA, and VIC data sets. On the TAS data set, M-BiGRU and BiGRU have similar performance. This experiment further carries out a statistical test on the results of BiGRU and M-BiGRU, as shown in Table 11. 1 means that MscaleDNN improves the forecasting performance of BiGRU; in contrast, 0 means that MscaleDNN decreases the prediction accuracy of BiGRU. On the NSW, QLD, SA, and VIC data sets, there is a significant difference between M-BiGRU and BiGRU. Therefore, MscaleDNN can effectively improve the performance of the proposed model on the Australian data sets.
Table 11
Statistical test results for the effectiveness of MscaleDNN on Australian data sets
Comparison | Error indicator | NSW | QLD | SA | TAS | VIC
M-BiGRU vs. BiGRU | MAE | 1 | 1 | 1 | 0 | 1
M-BiGRU vs. BiGRU | RMSE | 1 | 1 | 1 | 0 | 1
Bold values highlight that M-BiGRU has a better forecasting performance than BiGRU
This experiment conducts spectrum analysis on the NSW data set to further analyze the improvement of MscaleDNN for feature extraction. Figure 11 shows that the frequencies with the largest amplitudes are 4, 104, 209, 731, 1462, and 2193. Thus, the periodicities of NSW are 6 months, 7 days, 3.5 days, 1 day, 12 h, and 8 h. Similarly, the significant periods of the low-frequency component are 1 day and 12 h. The significant periods of the high-frequency component are 1 day, 12 h, 8 h, 6 h, and 4 h. The spectrum analysis indicates that the low- and high-frequency components capture multiple main periods of the load series. The above analysis explains the interpretability of the low- and high-frequency components in a physical sense.
This experiment compares M-BiGRU with M-RAR-BiGRU, similar to Example 1. In Table 7, M-RAR-BiGRU has a better forecasting performance than M-BiGRU. The MAE of M-RAR-BiGRU is slightly lower than that of M-BiGRU on all load series. The experimental results show that ATTN further improves the feature extraction of BiGRU. M-RAR-BiGRU improves the prediction accuracy through multi-scale modeling. This experimental result further verifies that the combination of RAR-BiGRU and MscaleDNN has a better prediction performance.
6 Conclusion
This paper establishes an integrated DL framework based on a multi-scale perspective (M-RAR-BiGRU) for STLF. M-RAR-BiGRU extracts features of the load data from a multi-scale perspective. This study borrows the idea of the FT to transform the load series into multiple components with different frequencies through NNs. MscaleDNN uses NNs to approximate the adaptive FT of the load series, avoiding the limitations of traditional signal decomposition methods in practical applications. Additionally, M-RAR-BiGRU introduces the ARlncosh loss function for outlier handling. The ‘working’ likelihood function optimizes the ARlncosh loss to approximate the distribution of the load data. This paper conducts contrast experiments on the Portuguese and Australian data sets to analyze the performance of M-RAR-BiGRU in STLF. The experimental results verify that M-RAR-BiGRU decouples stable periodic features from the load series and generates accurate forecasting results. On load series with outliers, the ARlncosh loss function improves the robustness of the proposed model.
Power load data, as time series, have complex features. The proposed model combines the features of the load series from multiple scales through a weighted method. However, the weighting coefficients are set manually. Therefore, future work will study optimization algorithms that adaptively determine these coefficients based on feature information. Furthermore, outliers inevitably exist in load data; thus, future work will also further study loss functions for outlier handling.
Acknowledgements
This work is supported in part by the National Natural Science Foundation of China under Grant 61873130 and Grant 61833011, in part by the Natural Science Foundation of Jiangsu Province under Grant BK20191377, in part by the 1311 Talent Project of Nanjing University of Posts and Telecommunications, and “Chunhui Program” Collaborative Scientific Research Project (202202004).
Declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.