Abstract

The power transformer is a key piece of power grid equipment, and its potential faults limit system availability and enterprise security. However, fault prediction for power transformers is limited by low data quality, the weak performance of binary classification, and small-sample learning. We propose a fault prediction method for power transformers based on dissolved gas chromatography data: after the defective raw data are preprocessed, fault classification is performed on the results of a predictive regression. A Mish-SN Temporal Convolutional Network (MSTCN) is introduced to improve accuracy in the regression step. Several experiments are conducted on a data set from China State Grid, and the experimental results are discussed.

1. Introduction

As key equipment, the power transformer directly affects the system availability and enterprise security of the power grid. Dissolved gas analysis (DGA) is one of the most reliable means of condition estimation and fault diagnosis for oil-immersed transformers and is recommended for condition evaluation by standards from the International Electrotechnical Commission (IEC) and the National Energy Administration. Through gas chromatography online monitoring technology, business analyses such as transformer fault detection can be performed in quasi-real time, which improves the safety and stability of the power grid [1].

However, predicting power transformer faults faces several inherent limitations in practice. First, the low quality of raw data makes direct use of the data infeasible, because transmission links may be interrupted and data packets may be lost [2]. Because of equipment or communication network problems, incomplete, missing, and outlier records exist in gas chromatography data. Availability is usually an essential requirement [3], and such defective data make fault prediction more difficult. Second, traditional methods such as the widely used binary classification are not accurate enough, because threshold-based fault detection ignores data below the threshold and makes no use of historical trends. Values oscillating around the threshold may indicate a potential fault, but such methods alone cannot detect it. Third, the model is hard to learn from small samples. Transformer faults occur only occasionally, so fault-related data make up a small proportion of the records, and traditional models perform poorly because too few features can be learned.

In this work, we propose a fault prediction method for power transformers that converts the classification problem into a regression problem. Our contributions can be summarized as follows: (1) Missing-data imputation and outlier detection in the data preprocessing step ensure the completeness and continuity of the gas chromatography data, which improves data quality. (2) The MSTCN proposed for the regression step can learn features from data below the fault threshold, which avoids the overfitting that small-sample learning would otherwise cause. (3) On real-world data, our method shows clear benefits, and it has been adopted in a practical business project.

The rest of this work is organized as follows: Section 2 discusses related work. Section 3 presents the research background, including the motivation and methodology, as well as the transformer fault diagnosis method known as the three-ratio rule. Section 4 elaborates the transformer fault prediction method based on the MSTCN model. Section 5 evaluates the method in extensive experiments. Section 6 concludes.

2. Related Work

Power transformer fault prediction is important, but it still faces challenges in efficiency and accuracy. Many works have adopted deep learning techniques in specific domains [4, 5]. We group related work into two technical categories: traditional machine learning methods and deep learning methods, the latter including recurrent neural networks (RNNs) and Temporal Convolutional Networks (TCNs).

2.1. Machine Learning Method

Machine learning methods can learn the fault occurrence pattern and then predict possible faults. Reference [6] compared and analyzed the Multilayer Perceptron (MLP), radial basis function (RBF), fuzzy logic, and support vector machine (SVM) for fault prediction of power transformers. However, their parameters are mainly selected empirically, which limits modeling efficiency. When evaluated on our data set, the F1-score of these methods does not exceed 90%.

Machine learning methods combined with DGA have achieved many results in transformer fault prediction. Dukarm [7] showed how fuzzy logic and neural networks can be used to automate standard DGA methods. Wang et al. [8] developed a combined artificial neural network and expert system tool (ANNEPS) for transformer fault diagnosis using dissolved gas-in-oil analysis. Huang et al. [9] introduced an evolutionary programming (EP) based fuzzy logic technique to identify incipient faults of power transformers. Yang et al. [10] employed bootstrap and genetic programming to improve the interpretation accuracy of DGA for power transformers. Hellmann [11] applied fuzzy logic (FL), which allows intermediate values to be defined between conventional evaluations such as true/false, yes/no, and high/low. Souahlia et al. [12] applied a support vector machine (SVM) based decision method to power transformer fault diagnosis.

However, these works share common problems: small amounts of data, few types of data, and only simple classification rules. For example, the fault categories are coarse, covering only overheating, discharging, and overheating with discharging. More advanced transformer fault prediction that makes full use of large-scale DGA data is therefore required.

2.2. Deep Learning Method

In recent years, deep learning networks combined with DGA have further improved the accuracy of transformer fault prediction. Recurrent neural networks (RNNs) have been widely adopted in research areas concerned with sequential data, such as text, audio, and video [13]. Among RNN methods, Long Short-Term Memory (LSTM) [14] and Gated Recurrent Units (GRU) [15] are particularly good at exploiting the time-varying features of time series data. Although LSTM and GRU alleviate the gradient problem of RNNs to some extent, longer sequences remain difficult to handle [13].

Bai et al. [16] proposed the Temporal Convolutional Network (TCN), a deep learning model for sequence modeling tasks. TCN combines ideas from convolutional neural networks (CNNs) and recurrent neural networks for processing time series data. Almqvist [17] compared the performance of RNN and TCN for time series forecasting. Instead of using a cell state to preserve information from previous outputs as in LSTMs, TCNs use connections between previous hidden layers that are configured with two hyperparameters: dilation factor and filter size. Zhang et al. [18] proposed a multiscale temporal convolutional network for fault prediction. They extracted multiscale time-frequency information with the discrete wavelet transform and handled the data at each scale with a separate TCN. Zhang et al. [19] presented an attention-enhanced Temporal Convolutional Network for fault prediction. They used an attention mechanism to make the TCN-based model focus on the more essential input variables and thereby improve fault prediction performance. Zai et al. [20] put forward a method for predicting dissolved gas content in transformer oil based on a Temporal Convolutional Network (TCN) and a graph convolutional network (GCN). They designed a GCN to analyze the correlations among all gases and then established a topological graph of these correlations.

However, these models did not solve the problems caused by the rectified linear unit (ReLU) and weight normalization layers. The ReLU activation function used in TCN underutilizes negative values, which leads to vanishing gradients. The weight normalization used in TCN is sensitive to initial values, which leads to overfitting. In addition, these models used binary classification for fault prediction, which might not fully learn the information below the threshold and makes no use of historical trends.

Inspired by the works in [21, 22], we propose a Mish-SN Temporal Convolutional Network (MSTCN) for dissolved gas regression to predict transformer faults. We apply the Mish activation function and switchable normalization in MSTCN to solve the problems caused by ReLU and weight normalization. Meanwhile, dissolved gas regression can explore the numerical fluctuations before the threshold value and learn historical fault feature patterns.

3. Preliminary

3.1. Motivation

Our work originates from a practical project of China State Grid.

This work uses the dissolved gas chromatography data set provided by China State Grid for experimentation. The data set comes from the gas chromatography online monitoring equipment of the power grid, which is based on an integrated, high-speed two-way communication network [23]. The data set covers roughly 600 transformers. With the explosive growth in Internet of Things (IoT) devices, applications have also expanded substantially in recent years [24]. Some of the data form relatively long time series containing more than 60 months of monitoring data, while others are short, with only three or four months of monitoring data. In addition, each data item in the dissolved gas chromatography data is a multidimensional vector rather than a single number, as in some stock market and house price data sets. The main fields in the dissolved gas chromatography data set of transformer oil are shown in Table 1. All data are collected and measured automatically through gas chromatography online monitoring technology.

Definition 1. Status code. In this work, a status code identifies the possible fault category of each sample. The status code is used as the classification label for the later transformer fault classification. The possible status codes are summarized in Table 2.

The appearance of dissolved gas in the transformer oil indicates transformer faults. The gas forms under three conditions: overheating, discharge, and moisture. The amount of gas in the transformer oil can be measured frequently by technical means to keep track of the operating health of the transformer. If any of the gases tends to exceed its notice value, the gas production rate should be observed. If all the gases are lower than their notice values, the transformer is considered to be working properly. Based on the recommendations of the data provider, the notice values for our data set are shown in Table 3.
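The notice-value check described above can be sketched as follows. This is only an illustration: the gas names and the numbers in `NOTICE_VALUES` are hypothetical placeholders, since the actual notice values are those of Table 3, and the helper names are not part of the original method.

```python
from typing import Dict, List

# Hypothetical notice values (placeholders); the real ones are listed in Table 3.
NOTICE_VALUES: Dict[str, float] = {"H2": 150.0, "C2H2": 5.0, "CH4": 100.0}


def gases_over_notice(sample: Dict[str, float],
                      notice: Dict[str, float] = NOTICE_VALUES) -> List[str]:
    """Return the gases whose measured amount has reached its notice value."""
    return sorted(g for g, limit in notice.items() if sample.get(g, 0.0) >= limit)


def is_working_properly(sample: Dict[str, float],
                        notice: Dict[str, float] = NOTICE_VALUES) -> bool:
    """A transformer is considered normal when every gas is below its notice value."""
    return not gases_over_notice(sample, notice)
```

A sample for which `gases_over_notice` returns a non-empty list is one whose gas production rate should be observed more closely.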

3.2. The Three-Ratio Rule

We apply the three-ratio rule to convert the dissolved gas regression results to status codes in our proposed method. The three-ratio rule was proposed by the National Energy Administration of China [25]. By studying the trend of the dissolved gas amounts in transformer oil, the status of the transformer can be determined from the gas chromatography data combined with the three-ratio rule. The conversion defined by the three-ratio rule is shown in Tables 2 and 4.

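Table 4 shows the ratio codes of the gas pairs. For example, if the ratio of $\mathrm{C_2H_2}$ to $\mathrm{C_2H_4}$ is 0.2, the ratio code for $\mathrm{C_2H_2}/\mathrm{C_2H_4}$ is 1. Similarly, the other ratio codes for $\mathrm{CH_4}/\mathrm{H_2}$ and $\mathrm{C_2H_4}/\mathrm{C_2H_6}$ can be calculated. Table 2 shows the three-ratio codes and their corresponding faults. For example, if the ratio codes of $\mathrm{C_2H_2}/\mathrm{C_2H_4}$, $\mathrm{CH_4}/\mathrm{H_2}$, and $\mathrm{C_2H_4}/\mathrm{C_2H_6}$ are 1, 1, and 2, that is, the three-ratio code 112, the corresponding type of fault is low energy discharge, with the status code 6.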
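A minimal sketch of the ratio-code lookup may make the conversion concrete. The three gas pairs (C2H2/C2H4, CH4/H2, C2H4/C2H6) and the range boundaries and codes below are assumptions taken from the commonly cited form of the three-ratio method; Table 4 remains the authoritative source, and the treatment of a zero denominator is likewise an assumption.

```python
from bisect import bisect_right
from typing import Dict

# Assumed range boundaries: <0.1, [0.1, 1), [1, 3), >=3 (Table 4 is authoritative).
BOUNDS = (0.1, 1.0, 3.0)
# Assumed code per range for each gas ratio.
CODES = {
    "C2H2/C2H4": (0, 1, 1, 2),
    "CH4/H2": (1, 0, 2, 2),
    "C2H4/C2H6": (0, 0, 1, 2),
}
RATIO_ORDER = ("C2H2/C2H4", "CH4/H2", "C2H4/C2H6")


def safe_ratio(num: float, den: float) -> float:
    """Ratio with an assumed convention for a zero denominator."""
    if den > 0:
        return num / den
    return float("inf") if num > 0 else 0.0


def ratio_code(name: str, value: float) -> int:
    """Map one gas ratio to its code by locating the range it falls in."""
    return CODES[name][bisect_right(BOUNDS, value)]


def three_ratio_code(gas: Dict[str, float]) -> str:
    """Concatenate the three ratio codes, e.g. '112'."""
    digits = []
    for name in RATIO_ORDER:
        num, den = name.split("/")
        digits.append(str(ratio_code(name, safe_ratio(gas[num], gas[den]))))
    return "".join(digits)
```

With these assumed boundaries, a C2H2/C2H4 ratio of 0.2 maps to code 1, which agrees with the example in the text.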

4. Power Transformer Fault Prediction Method

4.1. Overview

The fault prediction method based on dissolved gas regression proposed in this work is shown in Figure 1. Our method is divided into three steps. The first step is data preprocessing. Inspired by the work of Ding et al. [26], we convert data from different sources in the dissolved gas chromatography data set into a uniform format and resolve problems such as missing values and outliers as far as possible. The second step is to predict gas amounts with a deep learning model: we apply MSTCN to dissolved gas regression to obtain future gas amounts. The third step is fault classification: the predicted transformer status code is calculated from the regression results of the second step and the three-ratio rule described above.

Existing fault prediction methods usually use machine learning models or statistical tools to predict transformer faults. Instead of using a deep learning model to predict the fault directly, we add a gas regression step between data preprocessing and fault classification. Conventional fault prediction uses binary labels for classification, and the fault is judged against a threshold value, which ignores the fluctuation of the value before the threshold. The resulting model might not learn this prethreshold fluctuation information.

4.2. Data Preprocessing

In gas chromatography online monitoring of power transformers, problems such as network instability and server performance bottlenecks arise when processing large volumes of data. We mainly address the missing data and outliers that exist in the dissolved gas chromatography data set.

Definition 2. Missing data. The missing data types include negative numbers, not a number (NaN), and null. Let $X \in \mathbb{R}^{n \times p}$ be a feature matrix consisting of $n$ data points and $p$ features of dissolved gases. The $i$-th data point is denoted as $x_i$. The $j$-th feature value of $x_i$ is denoted as $x_{ij}$. $x_{ij}$ is defined as missing data if $x_{ij} < 0$ or $x_{ij} \in \{\mathrm{NaN}, \mathrm{null}\}$.
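Definition 2 can be expressed as a boolean mask over the feature matrix. The sketch below assumes the data are held in a NumPy-compatible nested list or array and that null is represented by Python's `None`; the function name `missing_mask` is illustrative.

```python
import numpy as np


def missing_mask(X) -> np.ndarray:
    """Return True where an entry is null (None), NaN, or negative."""
    arr = np.array(X, dtype=float)  # None is converted to NaN
    with np.errstate(invalid="ignore"):  # NaN comparisons are simply False
        return np.isnan(arr) | (arr < 0)
```

The resulting mask marks the entries that the imputation step has to fill.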
The definition of outlier points, which combines the characteristics of the data set with the $3\sigma$-rule, is as follows.

Definition 3. Outlier. Let $N_k(x_i)$ be the set of the $k$-nearest neighbors of $x_i$. Each $x_i$ of the feature matrix $X$ is recorded at a specific time point and consists of $p$ real-valued observations, denoted as $x_i = (x_{i1}, \ldots, x_{ip})$; each dimension of the $p$-dimensional vector at a given data point is denoted as $x_{ij}$. The expected value of $x_i$ is denoted as $\hat{x}_i$, the Euclidean distance between two data points is denoted as $\mathrm{dist}(\cdot, \cdot)$, and the highest distance threshold between a true data point and its expected data point is denoted as $\delta$. A data point $x_i$ is an outlier if

$$\mathrm{dist}(x_i, \hat{x}_i) > \delta. \tag{1}$$

With the definitions above, the missing data and outlier problems are explicitly defined. Properly imputed data and corrected outliers can lower the regression errors and further improve fault prediction effectiveness.

Missing Data Imputation. For the missing data mentioned earlier, considering the characteristics of the gas amount data, we adopt a modification of the EM algorithm proposed by Junger [27]. The algorithm comprises the following steps: (i) replace the missing values by estimates; (ii) estimate the parameters $\mu$ and $\Sigma$; (iii) estimate the level for each of the univariate pieces of data; (iv) reestimate the missing values using updated estimates of the parameters and the level of the data. These steps are iterated until a convergence criterion is reached. Let $x_i$ be a data point of the feature matrix $X$. After the iterations, the missing entries of $x_i$ are replaced by their revised maximum likelihood estimates.

Outlier Correction. With the notation of Definition 3, the expected value $\hat{x}_i$ and the highest distance threshold $\delta$ in the outlier equation (1) are defined by equations (2) and (3), respectively. The expected value $\hat{x}_i$ is also the corrected value of the outlier $x_i$.
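The outlier rule and its correction can be sketched as follows. Several choices here are assumptions made only for illustration, because the defining equations are not reproduced in this text: the neighbors are taken to be the `k` records closest in time, the expected value is their mean, and the threshold is the mean distance plus three standard deviations (one reading of the 3σ-rule).

```python
import numpy as np


def correct_outliers(X, k: int = 4):
    """Flag records far from their expected value and replace them by it.

    X has shape (n, p) with n > k; rows are ordered by collection time.
    Returns the corrected copy of X and a boolean outlier mask of length n.
    """
    X = np.asarray(X, dtype=float)
    n = len(X)
    expected = np.empty_like(X)
    for i in range(n):
        # Assumed neighborhood: the k records closest in time, excluding i itself.
        order = np.argsort(np.abs(np.arange(n) - i), kind="stable")
        expected[i] = X[order[1:k + 1]].mean(axis=0)
    dist = np.linalg.norm(X - expected, axis=1)  # Euclidean distance to expectation
    threshold = dist.mean() + 3.0 * dist.std()   # assumed 3-sigma threshold
    is_outlier = dist > threshold
    corrected = X.copy()
    corrected[is_outlier] = expected[is_outlier]  # expected value is the correction
    return corrected, is_outlier
```

On a flat series with a single spike, only the spike is flagged, and it is replaced by the mean of its temporal neighbors.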

4.3. Dissolved Gas Regression

To fully explore the numerical fluctuations before the threshold value and learn historical fault feature patterns, we propose a regression model called the Mish-SN Temporal Convolutional Network (MSTCN).

Moreover, once the predicted dissolved gas values are obtained from the prediction model, they can be converted to predicted status codes with very few calculations.

Figure 2 shows a complete MSTCN map formed by stacking h residual blocks. The input is denoted as $X$. In this work, we use a common technique in RNN modeling called the time step to improve predictive accuracy. The time step length is denoted as $T$, and the number of features is denoted as $p$. Let $x_t \in \mathbb{R}^{T \times p}$ be defined as a new data point. For any $x_t$, its gas regression label is denoted as $y_t$, and its regression result is denoted as $\hat{y}_t$. The convolved result of the $i$-th residual block layer is denoted as $z^{(i)}$. However, to solve the real-world problem, we are only interested in the last case, $i = h$. $L$ represents the number of regression points of input $x_t$. The output of the MSTCN regression is given by equation (4).

Residual Block. To solve the vanishing gradient problem in a deep convolutional network, a well-known technique called residual blocks is applied in MSTCN, as shown in Figure 3. Residual blocks have been proven to be an effective method for training deep networks, because they enable the network to transmit information across layers.

In Figure 3, the upper branch of the residual block applies dilated causal convolution to the input. The lower branch is the skip connection added to solve the vanishing gradient problem. In this work, we replace weight normalization with switchable normalization. Through the self-learning of switchable normalization, the MSTCN decides which normalizer to use to obtain the best prediction effect. The MSTCN also replaces ReLU with the Mish activation function to solve the dead ReLU problem, to make the activation function smooth and differentiable at 0, and to improve the generalization of the model. With Mish as the activation layer, the output of the residual block is expressed in equation (5).

Dilated Causal Convolution. Figure 4 presents the structure of the dilated causal convolution stack from a residual block with filter size $f$ and dilation factor $d$. In Figure 4, the other layers and the skip connection are omitted.

The input of the dilated causal convolution is denoted as $u$, and its output is denoted as $v$. Inspired by the ideas of dilated convolution [28] and causal convolution [29], we set a constraint, following the concept of causal convolution, that any $v_t$ depends only on $u_1, \ldots, u_t$ and not on any future $u_{t+1}, u_{t+2}, \ldots$; this constraint is shown in Figure 4. Meanwhile, to enlarge the receptive field without deepening the structure, we apply the concept of dilated convolution to the residual block.
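The two building blocks discussed above, the Mish activation and the dilated causal convolution, can be sketched in a few lines of NumPy. This is a simplified single-channel illustration and not the MSTCN implementation: the normalization layer is omitted, the filter weights are passed in rather than learned, and all function names are illustrative.

```python
import numpy as np


def mish(x: np.ndarray) -> np.ndarray:
    """Mish(x) = x * tanh(softplus(x)); logaddexp(0, x) is a stable softplus."""
    return x * np.tanh(np.logaddexp(0.0, x))


def dilated_causal_conv1d(u: np.ndarray, w: np.ndarray, dilation: int) -> np.ndarray:
    """Output v[t] combines u[t], u[t - d], u[t - 2d], ... and never a future value."""
    T, f = len(u), len(w)
    pad = (f - 1) * dilation
    padded = np.concatenate([np.zeros(pad), u])  # left padding keeps causality
    v = np.zeros(T)
    for t in range(T):
        for j in range(f):
            v[t] += w[j] * padded[pad + t - j * dilation]
    return v


def residual_block(u: np.ndarray, w: np.ndarray, dilation: int) -> np.ndarray:
    """Skip connection around a convolution branch (normalization omitted)."""
    branch = mish(dilated_causal_conv1d(u, w, dilation))
    return mish(u + branch)


def receptive_field(filter_size: int, dilations) -> int:
    """Receptive field of a stack with one dilated convolution per level."""
    return 1 + (filter_size - 1) * sum(dilations)
```

For example, a filter size of 3 with dilation factors 1, 2, and 4 covers 15 time steps using only three layers, which illustrates how dilation enlarges the receptive field without deepening the structure.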

4.4. Fault Classification

The guidelines [25] stipulate that, in oil chromatographic analysis, if the content of any gas tends to increase or exceeds a notice value, the gas production rate should be observed. Based on the three-ratio rule of gas chromatography in Table 4, it can then be preliminarily judged whether there is an overheating fault or a discharging fault.

Let $s$ be defined as the status code. Let $\hat{Y}$ be defined as the set of regression results for the time steps $t + 1$ to $t + L$. Let $p$ denote the number of features of $\hat{Y}$, and let $\hat{y}_{\tau j}$ denote the regression value of the $j$-th dissolved gas at time step $\tau$. The fault classification algorithm is defined in Algorithm 1.

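Input: Regression result $\hat{Y}$
Output: Status code $s$
(1) for $\tau = t + 1$ to $t + L$
(2)   compute the three ratios: $\mathrm{C_2H_2}/\mathrm{C_2H_4}$, $\mathrm{CH_4}/\mathrm{H_2}$, $\mathrm{C_2H_4}/\mathrm{C_2H_6}$
(3)   look up Table 4 to convert each gas ratio to a ratio code
(4)   combine the three ratio codes into a three-ratio code and look up Table 2 to get the status code
(5) end for
(6) return $s$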
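Algorithm 1 can be sketched in Python as a loop over the predicted records. The gas pairs are assumed to be C2H2/C2H4, CH4/H2, and C2H4/C2H6; `toy_ratio_code` is a hypothetical stand-in for Table 4, and `TOY_STATUS_TABLE` contains only the single entry quoted in the text (code 112, status 6), so both would be replaced by the real tables in practice.

```python
from typing import Callable, Dict, List, Sequence

# Assumed gas pairs for the three ratios (numerator, denominator).
RATIO_PAIRS = (("C2H2", "C2H4"), ("CH4", "H2"), ("C2H4", "C2H6"))


def toy_ratio_code(pair: str, ratio: float) -> int:
    """Hypothetical stand-in for Table 4: same boundaries for every pair."""
    if ratio < 0.1:
        return 0
    return 1 if ratio < 3.0 else 2


# Only the entry quoted in the text is filled in; Table 2 holds the rest.
TOY_STATUS_TABLE: Dict[str, int] = {"112": 6}


def classify_faults(predictions: Sequence[Dict[str, float]],
                    ratio_code: Callable[[str, float], int] = toy_ratio_code,
                    status_table: Dict[str, int] = TOY_STATUS_TABLE,
                    unknown: int = -1) -> List[int]:
    """Return one status code per predicted record (steps t+1 .. t+L)."""
    statuses = []
    for gas in predictions:
        digits = []
        for num, den in RATIO_PAIRS:
            # Assumed convention: a zero denominator yields an infinite ratio.
            ratio = gas[num] / gas[den] if gas[den] > 0 else float("inf")
            digits.append(str(ratio_code(f"{num}/{den}", ratio)))
        statuses.append(status_table.get("".join(digits), unknown))
    return statuses
```

Passing the lookup function and the status table as parameters keeps the loop independent of the concrete contents of Tables 2 and 4.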

5. Evaluation

5.1. Setting

The experiments in this work run on a server with the CentOS 7 operating system, an Intel Core i7-6700 CPU, 16 GB of RAM, and 1 TB of storage. The code is written in Python 3.9.6 and uses JupyterLab 3.1.11, TensorFlow 2.5.0, and Matplotlib 3.3.4.

The data set was collected from oil-immersed power transformers in different substations in China, with 200,000 records covering the period from 2012 to 2017. The data set contains 7 fault-related gases, the time of collection, other gases, substation and transformer information, and so forth. The distribution of the 7 fault-related gases is shown in Figure 5. The horizontal axis is the date, and the vertical axis is the amount of gas. The blue curve indicates no fault on the corresponding date. The orange curve indicates a fault status on the corresponding date because the amount has reached its notice value.

We selected a subset of about 170,000 records from 100 transformers, divided into 80 training sets, 10 validation sets, and 10 test sets. The subset covers the period from November 2012 to September 2017. It was selected because it has high data integrity and few missing values. In every transformer sequence of this subset, each record contains the collection date, 7 different dissolved gas values, and the status code label according to the three-ratio rule, as shown in Table 1.

5.2. Experiment

In order to accurately evaluate the performance of the proposed transformer fault prediction model based on the MSTCN, we carried out dissolved gas regression and transformer fault classification experiments on the dissolved gas chromatography data set. First, we verify the average accuracy of dissolved gas regression based on the improved MSTCN. Second, we verify our proposed method by comparing it with other fault classification models based on binary classification and analyze the effectiveness of the models.

Experiment 1. Dissolved Gas Regression. The experiment applies MSTCN, TCN, LSTM, and GRU to the test set to verify the effectiveness of the MSTCN model. The final parameters of MSTCN are as follows: the number of epochs is 100, the batch size is 32, the time step is 12, and the learning rate is 0.001. The final parameters of the residual block in MSTCN are shown in Table 5. For a rigorous comparison, the TCN, LSTM, and GRU methods use roughly the same parameters as MSTCN. This work applies the root mean square error (RMSE) as the loss function, shown in equation (6), and Adam as the optimization algorithm.

To measure the predictive performance of the models, RMSE, MAE, MAPE, and $R^2$ are used as the models' metrics. Their formulas are shown in equation (6):

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(y_i - \hat{y}_i)^2}, \quad \mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \hat{y}_i\right|, \quad \mathrm{MAPE} = \frac{100\%}{m}\sum_{i=1}^{m}\left|\frac{y_i - \hat{y}_i}{y_i}\right|, \quad R^2 = 1 - \frac{\sum_{i=1}^{m}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{m}(y_i - \bar{y})^2}. \tag{6}$$

Here, $m$ represents the total number of records, $y_i$ represents the actual gas amount of record $i$, $\bar{y}$ represents the average value of the actual gas amount, and $\hat{y}_i$ represents the predicted gas amount. The minimum of RMSE, MAE, and MAPE is 0, and the closer the metric is to 0, the better the predictive effect is. The maximum of $R^2$ is 1; the closer to 1, the better.
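The four regression metrics can be computed as in the following sketch, which uses the standard definitions of RMSE, MAE, MAPE, and R². MAPE is returned as a fraction (multiply by 100 for a percentage) and assumes that no actual value is zero; the function name is illustrative.

```python
import numpy as np


def regression_metrics(y_true, y_pred) -> dict:
    """Compute RMSE, MAE, MAPE (as a fraction), and R^2 for one gas."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    mape = float(np.mean(np.abs(err / y_true)))  # requires nonzero y_true
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}
```

A perfect prediction gives 0 for the three error metrics and 1 for R², matching the bounds stated above.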
Figure 6 shows the actual gas amount curves and the regression curves predicted by the different models, including MSTCN, TCN, LSTM, and GRU. Figure 6 shows that the fitted curve of MSTCN follows the actual curve more closely than the curves of the other models. Although LSTM performs better in part of the predictions, the MSTCN error is smaller overall than that of the other models.

We calculated the above metrics for MSTCN, TCN, LSTM, and GRU. The results are listed in Table 6, which compares the gas regression performance of the MSTCN model with that of the other deep learning models (TCN, LSTM, and GRU). RMSE, MAE, and MAPE measure the error between the true and predicted values; $R^2$ also measures the difference between the true and predicted values and normalizes it so that values closer to 1 indicate a better fit. MSTCN achieves relatively good values on these metrics. Although TCN has a slightly smaller prediction error (RMSE), MSTCN is better than the other models overall.

Experiment 2. Fault Classification. In order to further verify the superiority of the transformer fault prediction method proposed in this work, this experiment uses the regression value of the previous experiment as input. It converts the predicted gas amount to the status code according to the three-ratio rule. The control group uses TCN, LSTM, and GRU models and uses actual gas amount as input to directly classify the fault of the transformer.
To measure the accuracy of fault classification under different models, this experiment follows the confusion matrix and denotes faulty status as positive (P) and normal status as negative (N). Correct classifications are therefore true positives (TP) and true negatives (TN), and incorrect predictions are false positives (FP) and false negatives (FN) [30]. This experiment introduces three metrics to measure a model’s accuracy on the test set. The precision, recall, and F1-score are expressed in equations (7)–(9):

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \tag{7}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}, \tag{8}$$

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}. \tag{9}$$

The F1-score is the harmonic mean of precision and recall, and all three metrics take values between 0 and 1. The closer the three metrics are to 1, the better the predictive effect of the model. The evaluation metrics for fault classification under the different models are shown in Table 7.
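The three classification metrics can be computed from boolean labels as sketched below, where `True` stands for a faulty status and `False` for a normal status. How a status code maps to faulty or normal depends on Table 2 and is left to the caller; returning 0.0 when a denominator is zero is a convention chosen here for illustration.

```python
from typing import Sequence, Tuple


def precision_recall_f1(y_true: Sequence[bool],
                        y_pred: Sequence[bool]) -> Tuple[float, float, float]:
    """Precision, recall, and F1-score with faulty status as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```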
Table 7 compares the transformer fault classification results of the MSTCN model with those of the other deep learning models (TCN, LSTM, and GRU). The first column lists the transformers participating in the experiment; each transformer is an independent and complete experiment. The second column lists the evaluation metrics. The remaining columns compare the four models on these metrics. Overall, the prediction results of every model are satisfactory. This results from the three-ratio algorithm for fault gases and the gas notice value. For example, even when the predicted gas value deviates from the true value, the ratio often stays in the same ratio range, or the notice value is not reached at all, so the final predicted fault state does not change easily, which yields an excellent overall prediction effect. The difference in fault prediction effect between transformers is larger than the difference between models; that is, the prediction effect is affected more by the data than by the choice of model. The difference between models is relatively small. Overall, MSTCN scores slightly higher than the other models.
The transformer fault prediction method proposed in this work (Figure 1) can reduce or eliminate the loss of classification accuracy caused by threshold-based binary classification. Because the method is based on the dissolved gas regression value, it can use the information in the data before the threshold and make better use of historical fault data. At the same time, the classification step does not introduce additional errors, because it uses the same judgment criteria as the existing fault diagnosis methods.

6. Conclusions

In this work, we propose a power transformer fault prediction method based on dissolved gas regression, which converts the transformer fault prediction problem into a regression problem for dissolved gas amounts. First, through data preprocessing, we overcome the difficulties of using raw data directly. Second, through dissolved gas regression, we learn from the data below the threshold more efficiently than binary classification does and avoid the small-sample learning problem caused by the large amount of preventive maintenance. Compared with the traditional fault prediction model based on binary classification, the fault prediction method based on gas amount prediction achieves better results, with an F1-score above 0.9741. This method provides new insights for power transformer fault prediction.

In summary, the fault prediction method based on dissolved gas regression using MSTCN has excellent potential. In future work, we will continue to research this concept and shorten the training time with more advanced deep learning techniques. In addition, we plan to tune the procedure to simplify the method.

Data Availability

The oil chromatography data used to support the findings of this study were supplied by China State Grid under license and so cannot be made freely available. Requests for access to these data should be made to the corresponding author for an application of joint research.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the science and technology project of State Grid Corporation of China: “Research on Data Governance and Knowledge Mining Technology of Power IOT Based on Artificial Intelligence” (Grant No. 5700-202058184A-0-0-00).