Machine-learning-based predictive maintenance models, i.e. models that predict breakdowns of machines based on condition information, have a high potential to minimize maintenance costs in industrial applications by determining the best possible time to perform maintenance. Modern machines have sensors that can collect all relevant data of the operating condition, and for legacy machines, which are still widely used in industry, retrofit sensors are readily, easily and inexpensively available. With the help of these data it is possible to train such a predictive maintenance model. The main problem is that most data are obtained from normal operating conditions, whereas only limited data stem from failures. This leads to highly unbalanced data sets, which makes it very difficult, if not impossible, to train a predictive maintenance model that can detect faults reliably and in a timely manner. Another issue is the lack of available real data due to privacy concerns. To address these problems, a suitable data generation strategy is needed. In this work, a literature review is conducted to identify a solution approach for a suitable data augmentation strategy that can be applied to our specific use case of hydrogen combustion engines in the automotive field. This literature review shows that, among the different state-of-the-art proposals, the most promising for the generation of reliable synthetic data are those based on generative models. The analysis of the different metrics used in the state of the art makes it possible to identify the most suitable ones for evaluating the quality of generated signals. Finally, an open research problem in this area is identified: the need to validate the plausibility of the generated data. Results in this area will contribute decisively to the development of predictive maintenance models.
1 Introduction
In industrial environments it is critical that the machines in the production lines work continuously. An unplanned stop in production, even for a short time, can lead to significant losses. A study by Thomas and Weiss (2020) collected data from several U.S. manufacturers to determine the costs that could be avoided through a good maintenance strategy. These costs arise not only from direct maintenance costs, but also from production losses. The reputation of a company can also suffer greatly if delivery deadlines cannot be met due to a production stoppage, which can furthermore lead to the loss of follow-up orders. The total cost that could be avoided was $119.1 billion: $18.1 billion of this can be attributed to failures and downtimes, $0.8 billion to defects, and the remaining $100.2 billion is caused by contracts and deliveries that could not be fulfilled.
Companies perform maintenance to prevent or, once they occur, repair defects that would negatively impact production. According to Thomas and Weiss (2020) and Wen et al. (2022), different existing maintenance approaches can be categorized into the following three classes. Reactive maintenance, also known as corrective or failure-driven maintenance, is typically performed in response to equipment malfunctions or breakdowns. This approach is also employed when machinery fails to meet expected quality or production targets. Preventive maintenance is conducted on a regular basis, according to predefined intervals that are easily monitored. These intervals may be based on a fixed amount of time, the number of produced parts, machine cycles, or other parameters. The maintenance schedule is typically developed by experts with experience and an understanding of the historical breakdowns or failures of the machinery in question. Predictive maintenance entails measuring the reliability and condition of a given piece of machinery, a workcell, an assembly line, etc., or a manufacturing process itself. These measurements are frequently obtained through the use of sensors that capture data, which can then be combined with historical data in order to assess the current condition and inform maintenance decisions.
In his study, Thomas (2018) conducted a comparison between these three types of maintenance. The comparison demonstrates that reactive maintenance can be a cost-effective approach when the initial cost of equipment is low, it is easily replaceable, has high availability, has minimal impact on collateral failures, or has high redundancy. In contrast, preventive maintenance is cost-effective for equipment in process chains where the different parts rely on each other. However, there is a potential risk of over-maintenance, which can lead to excessive production downtimes. Predictive maintenance mitigates the risk of over-maintenance and downtimes by identifying the optimal time for maintenance, making it a more cost-effective alternative to preventive maintenance. However, it requires a higher upfront investment due to the hardware and software needed to capture and monitor the necessary data, as well as the training of personnel on monitoring techniques and data analysis.
Predictive maintenance can be divided into rule-based approaches, which require expert knowledge, and data-driven approaches. The data-driven approaches can be based on statistical or machine learning models. The research trend in data-driven methods for predictive maintenance is towards machine learning: a literature review by Wen et al. (2022) shows that 60% of the publications about data-driven predictive maintenance approaches in the years 2015–2020 use some sort of machine learning model.
Murphy (2012) identifies three principal categories of machine learning: supervised, unsupervised and reinforcement learning. Supervised learning uses a set of labeled data samples to train a model that can predict a value for unseen samples according to the labels. In the case that the labels represent a categorical variable the task is called classification, and if the labels are continuous values, the task is known as regression. Unsupervised learning trains a model using only inputs, as there are no labels. The task is to find useful patterns in the data. In reinforcement learning, an agent learns an optimal strategy, called a policy, through interactions with its environment, for which it receives positive or negative feedback signals. Reinforcement learning is not considered in this study, since supervised and unsupervised approaches are more established in predictive maintenance.
In predictive maintenance tasks, unsupervised learning can be used to detect anomalies. In this case a model can be trained, e.g. an autoencoder, using only data of the normal behavior. When the model is used for inference, it can detect data samples that differ from the training data as anomalies. In contrast, supervised learning models can be utilized not only for the detection of anomalous conditions, but also for the identification of the specific fault that occurred, provided that the model is trained with labeled data that reflects the various conditions (Chandola et al. 2009).
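To make this mechanism concrete, the following is a minimal sketch (in PyTorch) of autoencoder-based anomaly detection: the model is trained to reconstruct normal windows only, and at inference time a high reconstruction error flags a sample as anomalous. All layer sizes, the threshold rule and the stand-in data are illustrative assumptions, not taken from any of the cited works.

```python
import torch
import torch.nn as nn

# Hypothetical sensor windows of length 64; all sizes are illustrative.
window_len = 64
autoencoder = nn.Sequential(
    nn.Linear(window_len, 16), nn.ReLU(),   # encoder compresses the window
    nn.Linear(16, window_len),              # decoder reconstructs it
)
opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
mse = nn.MSELoss()

normal_data = torch.randn(256, window_len)  # stand-in for normal-operation data
for _ in range(100):                        # train on normal behavior only
    recon = autoencoder(normal_data)
    loss = mse(recon, normal_data)
    opt.zero_grad(); loss.backward(); opt.step()

# Inference: flag windows whose reconstruction error exceeds a threshold
# calibrated on normal data (here a high percentile of training errors).
with torch.no_grad():
    errors = ((autoencoder(normal_data) - normal_data) ** 2).mean(dim=1)
    threshold = torch.quantile(errors, 0.99)

def is_anomaly(window):                     # window: tensor of shape (window_len,)
    with torch.no_grad():
        err = ((autoencoder(window) - window) ** 2).mean()
    return bool(err > threshold)
```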
These supervised models require large amounts of training data that need to be captured by different sensors directly at the machines. One of the biggest problems with predictive maintenance models is that there is often not enough data available for high-quality training. This applies in particular to data on faulty behavior or defects, resulting in imbalanced datasets. The reason is that faulty behavior and defects occur rather rarely during the lifetime of machines. To counteract this imbalanced data problem, a good strategy to create additional data samples of fault cases is needed.
A sound strategy for the creation of synthetic data offers the possibility to create high quality training and evaluation datasets. These datasets would be well balanced and have sufficient data for all the possible error cases. With such datasets it is possible to train machine-learning-based predictive maintenance models that can not only detect faults in real world applications before they occur but also identify the specific fault.
In order to be able to develop such a strategy, a literature review is carried out in this article. The focus of this review is the development of a predictive maintenance model for a hydrogen combustion engine, as part of the WaVe research project, which is described in the next section. The main contributions of this paper are:
1. A survey of the current state of the art of data augmentation methods for predictive maintenance tasks.
2. A summary of the research gap found in the selected publications according to the WaVe use case and a summary of suitable approaches.
The remaining sections in this article are organized as follows. Section 2 provides a brief overview of the WaVe research project and the problem statement. Section 3 describes the methodology used for the literature review. In Sect. 4 the selected publications of the literature review are discussed according to several research questions. Section 5 analyzes the research gap found in the selected publications and discusses suitable data augmentation approaches to solve the imbalanced data problem in the WaVe project. Finally, Sect. 6 presents the conclusions of this work.
2 The WaVe research project
The joint research project WaVe is promoted by the German Federal Ministry for Economic Affairs and Climate Action. WaVe stands for Wasserstoff-Verbrennungsmotor which translates to hydrogen combustion engine. The goal of this research project is to develop a hydrogen-based drive system for commercial vehicles in the medium-duty range. The project partners are pooling their technological expertise and developing innovative individual solutions for a hydrogen-based drive system in multiple technological subprojects. The individual solutions are tested, harmonized and combined to form a functioning overall drive system (Commercial Vehicle Cluster-Nutzfahrzeug GmbH 2021). This will then be installed and tested in two different demonstrators—a Mercedes Benz Unimog U400l and a crawler vehicle.
To support the overall development of the hydrogen-based drive system and the two demonstrators, digital twins are under development by comlet Verteilte Systeme GmbH. The development of these digital twins is interconnected to the hardware development. A digital twin is an accurate representation of a physical system. Its application in the design, testing, and manufacturing phases of the physical system allows for the reduction of time and costs, as well as improvements in user safety (Grieves and Vickers 2017). In their work, Fuller et al. (2020) differentiate between three definitions based on their data flow characteristics. A digital model exhibits no inherent automatic data flow, while a digital shadow is characterized by an automatic exchange of data from the physical asset to the digital object. A digital twin, in contrast, represents a fully integrated data flow in both directions, encompassing both the physical and digital domains. Consequently, a change made to the physical object is reflected in the digital object, and vice versa. The data of a digital twin can be grouped into static and dynamic data. Static data is created during development and does not change significantly during the lifecycle. This includes manuals, technical specifications, product information, CAD models, circuit diagrams, simulations etc. Dynamic data, on the other hand, is collected by sensors in the physical asset in real time. Dynamic data can include not only data from the finished product, but also engine data recorded in various configurations on an engine test bench, data from field trials, and data captured during tests of the hydrogen tank system.
All these data provide a solid basis for the development of a predictive maintenance system for a hydrogen-based drive system and complete vehicles. The development of such a predictive maintenance system is a planned goal of comlet Verteilte Systeme GmbH. Digital twins offer several advantages for the creation of a predictive maintenance system. The digital twin serves as a data provider for the predictive maintenance system. Thus, feature extraction as well as feature engineering can directly access the data of a digital twin, or they could also be directly integrated into the digital twin. This makes it very easy to create training and test datasets from the existing data. Another advantage is the access to simulations, which provide additional data.
This leads to the biggest problem for the intended predictive maintenance system. Even if there is plenty of data available, most of it covers only the normal condition of the hydrogen-based drive system and the vehicles. There will be little data from failure cases, which leads to strongly imbalanced training data. Generating fault data by means of deliberate defects is limited due to their destructive nature: only defects that do not cause damage to the hydrogen drive or the vehicles can be used. In addition, it is very time consuming and costly to manually cause such defects in the physical systems to generate a sufficient amount of data. Simulations alone also cannot be used to generate a sufficient amount of defect data because they are usually computationally intensive. Moreover, because defects that can result in personal injury deserve the most careful consideration, numerous safety precautions exist for propulsion systems in general; these are precisely the most dangerous and most difficult defects to simulate.
To solve the lack of available data, this work discusses a strategy based on data augmentation. This strategy starts with simulations to generate a limited amount of data related to different failure cases. Then this data, as well as the real data recorded during the engine test bench and field trials, will be used to generate more data with specific characteristics using data augmentation methods for machine learning.
3 Data augmentation in predictive maintenance
Not all data augmentation methods are suitable for the WaVe use case. Since the data are predominantly, if not exclusively, time series, image-based methods are unsuitable. It is possible to use these methods partially for sensor data, but this would change the underlying structure of the time series. Sampling-based methods are of limited suitability because they are usually based on the weighted average of the available data and do not consider the underlying distribution, thus limiting the diversity of the generated data. The generative models appear most promising for the WaVe use case because they are trained to learn the underlying distribution of the available data and do not affect the structure of time series.
A state-of-the-art review is conducted to find suitable generative algorithms for the creation of high quality data for predictive maintenance models. This review follows the guidelines of PRISMA (Page et al. 2021a), which consists of four main steps: identification, screening, inclusion and discussion of papers. The whole process of the review from identification of the data sources until the inclusion of the papers is shown in Fig. 1. The review focuses on generative algorithms for data augmentation and answers the following research questions:
RQ1: Which data augmentation techniques have proven to be suitable for predictive maintenance?
RQ2: What role do generative algorithms play in predictive maintenance?
RQ3: In which application domains are those algorithms typically used?
RQ4: Which type of data and datasets are most commonly used for predictive maintenance?
RQ5: Which validation methods and metrics are used to evaluate the quality of generated data?
The literature review was conducted in January 2024 and took publications into account starting from 2018. This means that only publications between 2018 and 2023 are included to reflect the current state-of-the-art and trend in data augmentation for predictive maintenance tasks. This can lead to the exclusion of techniques developed before that time frame that are still relevant today. The following scientific databases were used in the research:
IEEExplore
ACM Digital Library
Scopus
[Fig. 1: Process of the review from identification of the data sources to the inclusion of papers]
In order to avoid obtaining a very high or very low number of results the search query was refined in an iterative process until a manageable number of publications was found. Table 1 shows the final search query used and the number of results from the different sources.
Table 1  Search query used for the literature review and the number of resulting publications

  Search query: “data augmentation” AND (“predictive maintenance” OR “anomaly detection”)

  Source              | Num papers
  IEEExplore          | 333
  ACM Digital Library | 384
  Scopus              | 272
The search query resulted in a total number of 989 papers found in the databases. Before an in-depth screening of these papers was conducted, duplicate papers were removed. In total, 95 duplicate papers were found and removed. The screening phase was split into multiple steps to discard papers that did not fulfill the following eligibility criteria:
Only publications in English.
Must be published after 2018.
Must be at least a short paper. Posters, extended abstracts, workshops, etc. are excluded.
Must be related to the area of interest.
Must have a focus on industrial applications or time series data.
Must provide detailed information about the datasets used or must use public (benchmark) datasets.
Must describe the used methodology in enough detail that it can be reproduced.
Must provide reproducible or comparable results, or must be a proof of concept.
The first five criteria were checked by reading the titles and abstracts of the papers. All papers found were written in English, so no papers were excluded due to this criterion. Even though the period 2018–2023 was used as a filter for the search queries, three papers from 2017 and earlier were still in the search results and were therefore discarded. The resulting papers included three extended abstracts, two posters and one workshop paper, which were excluded. Based on the titles and abstracts, 679 papers were excluded because they were not related to the area of interest or had a focus on other types of data. Since generative algorithms, and Generative Adversarial Networks (GANs) in particular, are often used in image generation, many of the excluded papers had a strong focus on image-generation-based data augmentation for non-industrial applications.
The remaining 206 papers were screened in depth for eligibility according to the remaining criteria. Of these papers, 128 were excluded because they were not related to the area of interest; this was not evident from the abstracts of these publications alone. Another 13 papers used confidential data from companies and did not provide enough information about the used datasets to reproduce the methods in any way, and were therefore excluded. 11 papers were excluded for not describing the used methodology in detail, which makes it impossible to reproduce their solutions and results. The last 16 papers that were discarded did not provide reproducible or comparable results. This leaves 38 papers, which are included in this state-of-the-art review.
4 Paper discussion
In this section the 38 papers included in the review are grouped and discussed according to several criteria to answer the research questions stated above.
4.1 RQ1: which data augmentation techniques are used for predictive maintenance?
In Table 2 the selected papers are grouped by their general data augmentation category: image-based, sampling-based and generative and then by their specific method. Figure 2 shows the number of publications for each of these groups. The total number of the grouped papers is larger than the remaining 38 papers, since some publications use data augmentation methods from multiple groups. With a focus on industrial applications and time series data, it can be seen that in recent years most of the publications found use generative algorithms for data augmentation.
[Fig. 2: Number of publications per data augmentation group]
Only four publications use image-based data augmentation techniques. Image-based data augmentation usually uses matrix transformations such as rotation, translation, scaling, etc., as well as changes in color or brightness, to create new images out of the already existing ones. Pasqualotto et al. (2021) use stray flux analysis images of induction motors to compare the performance of five image-based data augmentation methods: random cropping, change in brightness, time translation, frequency translation and adding Gaussian noise. Mahenge et al. (2021) use cropping, blurring and matrix transformations to augment images for their proposed road crack detection. Li et al. (2021b) propose a new data augmentation technique for defect detection in images called CutPaste. There are two variants of this method: one cuts out a relatively large rectangular part of an image and pastes it at a random location of the image, the other pastes a long, thin rectangle with a random color. In their publication, Molitor et al. (2022) compare multiple image-based data augmentation methods with generative algorithms, more specifically different GAN architectures, and combinations of both.
Table 2  Relevant papers grouped by data augmentation method

  Group: Generative
    GAN:  Molitor et al. (2022), Fathy et al. (2021), Lu et al. (2021a, b, c), Cannizzaro et al. (2022), Lin et al. (2020), Huang et al. (2021), Li et al. (2022), Bui et al. (2021), Jiang et al. (2021), Zhang et al. (2021), Kim et al. (2023)
    WGAN: Molitor et al. (2022), Fathy et al. (2021), Lu et al. (2021b), Xu et al. (2019), Li et al. (2021a)
    CGAN: Molitor et al. (2022), Fathy et al. (2021), Ranasinghe et al. (2019), Quintana et al. (2020), Faltings et al. (2022), Zhu et al. (2022), Zheng et al. (2020), Shao et al. (2019), Behera and Misra (2021), Yan et al. (2022)
Sampling-based approaches are used to balance datasets through undersampling of majority classes or oversampling of minority classes. Usually only oversampling methods are used, since undersampling often leads to a loss of valuable information. These methods typically use statistical properties like the standard deviation or mean values of features in the existing data to create new data samples. Two of the most well-known and successful methods are the Synthetic Minority Over-Sampling Technique (SMOTE) and the Adaptive Synthetic (ADASYN) sampling approach. SMOTE, developed by Chawla et al. (2002), selects a random neighbor from the K-Nearest Neighbors (KNN) of a random sample of the minority class and then generates a synthetic sample by selecting a random point in the feature space between these two samples. ADASYN was first introduced in the work of He et al. (2008) and is a variant of SMOTE that generates more samples in regions of the feature space where the density of minority samples is low and fewer samples in regions where the density is high. These two oversampling methods are used in many of the remaining publications as comparison benchmarks for their proposal. In these cases they are not listed in the sampling-based category, which leaves eight publications that use some form of sampling-based data augmentation techniques.
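As a concrete illustration of the SMOTE interpolation step described above, the following is a minimal NumPy sketch; array shapes and parameter values are illustrative assumptions, and production code would typically use a library implementation such as the one in imbalanced-learn.

```python
import numpy as np

def smote(minority, k=5, n_new=100, seed=0):
    """Minimal SMOTE sketch: for a random minority sample, pick one of its
    k nearest minority-class neighbors and interpolate between the two."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        dists = np.linalg.norm(minority - x, axis=1)   # distances to all samples
        neighbors = np.argsort(dists)[1:k + 1]         # skip the sample itself
        x_nn = minority[rng.choice(neighbors)]         # random nearest neighbor
        lam = rng.random()                             # interpolation factor in [0, 1)
        synthetic.append(x + lam * (x_nn - x))         # point between x and x_nn
    return np.asarray(synthetic)

# Usage: oversample a hypothetical minority class of 20 fault samples.
faults = np.random.default_rng(1).normal(size=(20, 8))  # 20 samples, 8 features
augmented = np.vstack([faults, smote(faults, k=5, n_new=80)])
```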
In their study, Martins et al. (2023) propose a combination of SMOTE and additive Gaussian noise for data augmentation, differentiating between two approaches. In the first approach, only the minority class is augmented by creating a subset of additional samples through the addition of Gaussian noise and creating another subset using SMOTE. In the second approach, the majority class is also augmented with new samples created using additive Gaussian noise. Liu et al. (2023) present a new data augmentation technique that combines SMOTE with deep attention networks and encoder-decoder networks to generate additional abnormal time series samples. The encoder-decoder is employed to transform the raw time series into a separable feature space, thereby reducing inter-class overlap. The attention network is utilized to identify interpolation factors for SMOTE, ensuring that the generated samples are distant from the aggregation area of normal samples. Subsequently, the newly generated samples are transformed back into the original space and combined with an undersampled set of normal samples, thus forming a balanced dataset. Fathy et al. (2021) conduct a comparative analysis of SMOTE and multiple generative data augmentation methods using different classifiers based on a real-world case study. In their works, Mo et al. (2022), Ding et al. (2022) and Hong and Suh (2021) use time-domain-specific methods like time stretching, amplitude scaling, translations, etc., and add Gaussian noise to augment time series data (a small sketch of such transformations is given below). Liu et al. (2022) create new time series data by adding and removing a small random number of time ticks, where the added time ticks are the average of the two adjacent time ticks. Sadoughi et al. (2019) propose a data augmentation approach that uses a randomized shrinkage factor to quantify the ratio of the length of the generated sample to that of the training sample. The training sample is then interpolated and mapped into the generated sample.
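The following minimal sketch shows what such time-domain transformations can look like for a single sensor channel; the parameter ranges and the toy signal are illustrative assumptions, not values from the cited works.

```python
import numpy as np

rng = np.random.default_rng(42)

def add_noise(x, rel_sigma=0.01):
    # Jitter: additive Gaussian noise, scaled to the signal's own std
    return x + rng.normal(0.0, rel_sigma * x.std(), size=x.shape)

def scale_amplitude(x, low=0.9, high=1.1):
    # Amplitude scaling by a random factor close to 1
    return x * rng.uniform(low, high)

def time_stretch(x, low=0.8, high=1.2):
    # Time stretching: linearly resample the series to a random new length
    factor = rng.uniform(low, high)
    t_old = np.linspace(0.0, 1.0, len(x))
    t_new = np.linspace(0.0, 1.0, int(len(x) * factor))
    return np.interp(t_new, t_old, x)

signal = np.sin(np.linspace(0.0, 10.0 * np.pi, 1024))   # toy sensor window
augmented = time_stretch(scale_amplitude(add_noise(signal)))
```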
One publication used simulations to solve the problem with lack of training data. Dong et al. (2022) implemented a mathematical simulation model for ball bearings which is used to create additional data of failure cases.
The largest group of data augmentation methods are the generative algorithms, with 24 publications found. In generative data augmentation, a machine learning model is trained to learn the underlying distribution of the data. This model can then be fed with random noise to generate new samples from this distribution. The group of generative algorithms mainly consists of different GAN architectures. GANs were first developed by Goodfellow et al. (2014). GANs consist of two sub-models: a generator model that is trained to create new data samples and a discriminator model that tries to classify samples as either real or generated. The two models are trained adversarially until the discriminator fails to distinguish real from generated samples, i.e. it assigns a probability of 50% to a sample being real or generated. When this happens, the generator creates samples that are indistinguishable from real samples for the discriminator. Since nearly every publication about GANs gives the proposed method a new name, there are more than 500 different architectures reported by The GAN ZOO repository [1], which collected all the different names of GANs and was last updated in September 2018. Therefore, GANs in this review are divided by their main architecture into the following superordinate groups:
The group of general GANs represents the architecture described by Goodfellow et al. (2014), where two models are trained adversarially. The discriminator and the generator model can use any neural network architecture, such as convolutional neural networks, recurrent neural networks, etc. The group of Wasserstein Generative Adversarial Networks (WGAN) differs from the general GANs in that the Wasserstein distance is used instead of the Jensen–Shannon divergence to improve convergence. Conditional Generative Adversarial Networks (CGAN) add a conditional variable as input and output to GANs, so that the generator learns a conditional distribution. This conditional variable is usually used to control the generating process by adding class labels to the input samples, but it can also be used for other types of additional information. Bidirectional Generative Adversarial Networks (BiGAN) differ from the other GAN architectures in that they include an encoder network which learns the inverse of the generator.
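To ground the adversarial training described above, the following is a minimal sketch of a GAN training step for 1-D sensor windows in PyTorch; all layer sizes and hyperparameters are illustrative assumptions. A CGAN would additionally concatenate a class-label encoding to both the generator input and the discriminator input.

```python
import torch
import torch.nn as nn

latent_dim, window_len = 32, 128   # illustrative sizes

generator = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                          nn.Linear(256, window_len), nn.Tanh())
discriminator = nn.Sequential(nn.Linear(window_len, 256), nn.LeakyReLU(0.2),
                              nn.Linear(256, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):  # real: (batch, window_len) tensor of sensor windows
    batch = real.size(0)
    fake = generator(torch.randn(batch, latent_dim))

    # Discriminator step: push real samples towards 1, generated towards 0
    loss_d = bce(discriminator(real), torch.ones(batch, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(batch, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: try to make the discriminator output 1 for fakes
    loss_g = bce(discriminator(fake), torch.ones(batch, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```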
The group of general GANs contains multiple publications. In their work, Cannizzaro et al. (2022) use a GAN to generate additional images for powder bed fusion, an additive manufacturing process. In a case study these images are not used to augment the training data; instead they are evaluated for quality and validity. Huang et al. (2021) use a GAN to generate additional time series samples of the minority class of rolling bearing data. The data generation is guided by variable association graphs of the majority class that are learned by an additional model. Lin et al. (2020) and Bui et al. (2021) use simple GANs to generate new time series data of fault cases in ball bearings and gearboxes. Lu et al. (2021a) and Lu et al. (2021c) use a combination of GAN and Long Short-Term Memory (LSTM) to predict the Remaining Useful Life (RUL) of ball bearings. They do not use the generators of the GANs for data augmentation; instead, the generators are used to predict the degradation. The authors Zhang et al. (2021) propose a combination of LSTM layers and convolutional layers to generate multivariate time series data for noncyclic and cyclic RUL prediction. The time series data is preprocessed into a 2D matrix for noncyclic and into a 3D matrix for cyclic problems so that it can be fed to the convolutional layers. Jiang et al. (2021) use a 1D Convolutional Neural Network (1D-CNN) to detect failures in rotating machinery parts and combine it with a GAN to create additional time series data samples. In their study, Kim et al. (2023) put forth a data augmentation method that initially transforms multivariate time series data into images through the use of Gramian Angular Field (GAF). Subsequently, they train a StyleGAN to learn the latent space of the time series data. New samples are then generated through interpolation between samples in the latent space. In their work, Li et al. (2022) propose Dual Multiple Generative Adversarial Networks (Dual-MGAN) for outlier detection. This approach combines Multiple Generative Adversarial Active Learning (MGAAL) and Multiple Generative Adversarial Over-Sampling (MGAOS). MGAAL is used to detect discrete anomalies: the unlabeled data is clustered into multiple classes, and for each cluster a sub-GAN learns to construct a reference distribution. MGAOS detects partially labeled group anomalies and works similarly to MGAAL; instead of clustering unlabeled data, only labeled anomalies are clustered. A sub-GAN for each anomaly cluster is then trained to generate additional similar samples of these minority classes. Dual-MGAN combines both parts and adds an additional detector neural network. MGAOS increases the size of the minority classes, and MGAAL and the detector are alternately optimized to separate anomalies from normal samples.
The next main group of generative models are WGANs. Lu et al. (2021b) and Xu et al. (2019) use WGANs to create additional anomalous time series data for sensor anomaly detection in industrial robots and pipeline leakage detection in petrochemical systems. In the case of the pipeline leakage detection, the time series data is transformed into graphs instead of using the raw sensor signals (Xu et al. 2019). Li et al. (2021a) also use WGANs for data augmentation and propose a new distance metric called Time-Regularized Hausdorff Distance (TRH) to quantify the similarity between generated and real samples. This metric is used to filter only samples with high similarity before they are passed to the discriminator, to improve the overall quality of generated samples.
The third group of generative models are BiGANs. Cui et al. (2021) propose a combination of BiGAN and Wasserstein distance for fault detection in gearboxes and ball bearings. An unsupervised pre-training followed by fine-tuning with a small sample of labeled time series data is used. The pre-trained model is also used to generate additional data. Smolyak et al. (2020) combine BiGAN with Infinite Gaussian Mixture Model (IGGM) for anomaly detection and data augmentation in GPS data.
The last group are CGANs. Faltings et al. (2022), Shao et al. (2019), Behera and Misra (2021), Zheng et al. (2020), and Quintana et al. (2020) use class label information to synthesize additional images of stamps on cast steel billets, to generate additional sensor data for induction motor, aircraft engine and bearing fault detection, as well as thermal comfort data of buildings. Ranasinghe et al. (2019) propose a CGAN that uses additional auxiliary information containing expert knowledge, physics of failure and maintenance records to control the data generation process of failure samples. In their proposal, Yan et al. (2022) introduce a combination of Wasserstein Conditional Generative Adversarial Networks (WCGAN) with a Variational Autoencoder (VAE) to generate additional data samples of chiller faults, which are then employed to augment the real-world data utilized for the training of an automated fault diagnosis model. The data is generated by the WCGAN, and the VAE is used to identify high-quality synthetic samples that are subsequently utilized for training the aforementioned fault diagnosis model. In their work, Zhu et al. (2022) use a combination of WGAN and CGAN to generate additional data for a polymerization reaction process of high density polyethylene. They first calculate sparse regions in the data using outliers detected by the KNN algorithm. Then a Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) is trained to generate new samples that fill these regions, and after that a Cycle Structure CGAN (CS-CGAN) is used to generate and filter new data samples.
Adversarial Autoencoders (AAE) use the adversarial concept of GANs. Instead of directly generating new data samples, the encoder of an AAE creates vectors in a latent space. The discriminator then predicts whether a vector was produced by the encoder or drawn from a chosen prior distribution, thereby shaping the latent space. Wu et al. (2020) use an AAE to detect anomalies in ball bearing time series data. Lim et al. (2018) propose a method to augment unlabeled data for anomaly detection in tabular data using AAEs. Instead of creating additional samples of the minority class or anomalous samples, data is augmented by creating synthetic samples of infrequent nominal samples.
The last two papers conduct a comparison of different data augmentation methods. Fathy et al. (2021) compare GAN, WGAN, CGAN and WCGAN with SMOTE to test their capability of generating additional samples. In the work of Molitor et al. (2022), Conditional Deep Convolutional Generative Adversarial Networks (C-DCGAN), WGAN-GP and Progressively Growing Generative Adversarial Networks (PGGAN) are compared with image manipulation methods to create synthetic images for tool wear classification.
4.2 RQ2: what role do generative algorithms play in predictive maintenance?
In the 38 selected publications, generative models are not only used for data augmentation by generating new data samples, but also directly to detect anomalies. Figure 3 shows that 20 papers (Molitor et al. 2022; Fathy et al. 2021; Cannizzaro et al. 2022; Lin et al. 2020; Bui et al. 2021; Jiang et al. 2021; Zhang et al. 2021; Lu et al. 2021b; Xu et al. 2019; Li et al. 2021a; Ranasinghe et al. 2019; Quintana et al. 2020; Faltings et al. 2022; Zhu et al. 2022; Zheng et al. 2020; Shao et al. 2019; Lim et al. 2018; Behera and Misra 2021; Yan et al. 2022; Kim et al. 2023) use generative models for data augmentation, four papers (Lu et al. 2021a, c; Wu et al. 2020; Liu et al. 2022) use them for anomaly detection, and four papers (Huang et al. 2021; Li et al. 2022; Smolyak et al. 2020; Cui et al. 2021) for both combined. The remaining 10 publications did not use generative algorithms.
One of the most significant obstacles to the development of effective predictive maintenance models is the scarcity of available failure data. This scarcity can be attributed to a number of factors. One such factor is the infrequency with which failures or defects occur during operation. Another is the potential absence of suitable systems for data collection in legacy equipment. The resulting imbalance between normal data and failure data makes the latter even more valuable. Fault data can provide crucial insights into issues that are not common but can have significant consequences when they do arise. The identification of scarce or previously unknown faults can be achieved through the utilization of anomaly detection techniques. Consequently, a generative model can be trained on nominal data, and the reconstruction error can be employed to ascertain whether a sample belongs to the nominal class or represents an anomaly.
[Fig. 3: Number of papers using generative models for data augmentation, anomaly detection, or both]
4.3 RQ3: in which application domains are those algorithms typically used?
In Table 3 the publications are grouped by their application area and type of data. Since only publications that have a focus on predictive maintenance tasks or use time series data are included in this review, most of the papers are from the industrial area. Seven publications in this area conduct predictive maintenance on vehicles: trucks from Scania (Fathy et al. 2021; Ranasinghe et al. 2019) and planes (Huang et al. 2021; Zhang et al. 2021; Mo et al. 2022; Behera and Misra 2021; Liu et al. 2023). The other publications in the industrial area apply predictive maintenance to manufacturing processes like casting steel billets (Faltings et al. 2022), polymerization reactions (Zhu et al. 2022), additive manufacturing (Cannizzaro et al. 2022), etc., and to detect failures or predict the RUL of components in machines. These components can be ball bearings (Lu et al. 2021c; Zheng et al. 2020; Sadoughi et al. 2019), gearboxes (Bui et al. 2021; Wu et al. 2020), induction motors (Pasqualotto et al. 2021; Shao et al. 2019), etc.
Other application domains found are medical and traffic. In the medical area, generative algorithms are used to generate additional time series samples and to detect anomalies in ECG data (Liu et al. 2022). In the traffic domain, Mahenge et al. (2021) create additional images to train a road crack detection model, and Smolyak et al. (2020) create GPS trajectories to detect anomalous routes and behavior of drivers.
There are many more domains in which generative algorithms are used that are not covered by this review due to its focus on predictive maintenance tasks. Sabuhi et al. (2021) identified additional application areas in their literature review about general use cases and applications of GANs. These areas include surveillance, intrusion detection, image recognition, fraud detection, etc.
Table 3  Relevant papers grouped by application area and type of data

  Application area: Industrial
    Time series:      Lu et al. (2021a, b, c), Lin et al. (2020), Huang et al. (2021), Bui et al. (2021), Jiang et al. (2021), Zhang et al. (2021), Xu et al. (2019), Li et al. (2021a), Zhu et al. (2022), Zheng et al. (2020), Shao et al. (2019), Cui et al. (2021), Wu et al. (2020), Sadoughi et al. (2019), Hong and Suh (2021), Ding et al. (2022), Mo et al. (2022), Dong et al. (2022), Yan et al. (2022), Kim et al. (2023), Behera and Misra (2021), Liu et al. (2023), Martins et al. (2023)
    Images:           Pasqualotto et al. (2021), Li et al. (2021b), Molitor et al. (2022), Cannizzaro et al. (2022), Faltings et al. (2022)
    Tabular features: Fathy et al. (2021), Li et al. (2022), Ranasinghe et al. (2019)
4.4 RQ4: which type of data and datasets are most commonly used for predictive maintenance?
Table 3 shows that three types of data are used for predictive maintenance tasks: time series data, images and tabular data. Most of the publications found use time series data. This is not surprising, since most predictive maintenance applications rely on some form of sensor data, which is usually recorded as time series. The most used datasets of this type are ball bearing datasets, like the CWRU bearing fault dataset [2] (Lin et al. 2020; Jiang et al. 2021; Zheng et al. 2020; Cui et al. 2021; Hong and Suh 2021; Dong et al. 2022) and the PRONOSTIA FEMTO-ST bearing dataset [3] used in the IEEE PHM 2012 Data Challenge (Lu et al. 2021c; Sadoughi et al. 2019; Ding et al. 2022). Another publicly available dataset that was used is the NASA C-MAPSS aircraft simulation data [4] (Zhang et al. 2021; Mo et al. 2022; Behera and Misra 2021; Liu et al. 2023). The remaining papers that used time series data in the areas of induction motors (Shao et al. 2019), data captured in an oil refinery (Xu et al. 2019), data of a polymerization reaction (Zhu et al. 2022), or data from rotating machinery parts (Martins et al. 2023) do not make their datasets public.
Six of the publications found use images as input data to train their predictive maintenance or data augmentation models. Li et al. (2021b) use the freely available MVTec anomaly detection dataset [5] from Bergmann et al. (2021, 2019), which is a benchmarking dataset for industrial inspection. Another public image dataset is the RDD2020 dataset used in the IEEE Global Road Damage Detection Challenge 2020 [6]. This dataset is used in the paper by Mahenge et al. (2021) to detect cracks in roads. The other publications used non-public datasets for tool wear classification (Molitor et al. 2022), anomaly detection in powder bed fusion additive manufacturing (Cannizzaro et al. 2022) and for data augmentation of images of stamps on cast steel billets (Faltings et al. 2022).
Four publications used tabular datasets. Li et al. (2022) used multiple datasets from the UCI Machine Learning Repository [7] and Lim et al. (2018) from the ODDS Repository [8]. Another public dataset is the Scania air pressure system dataset [9] used by Fathy et al. (2021) and Ranasinghe et al. (2019).
4.5 RQ5: which validation methods and metrics are used to evaluate the quality of generated data?
When data augmentation is used to create additional data of minority classes to balance a dataset, the quality of the generated data samples should be evaluated. In the case of time series data and generative algorithms, it should be assured both qualitatively and quantitatively that the generated samples are plausible and of high quality. Plausibility in this context means that the generated samples could occur in real data and are not rendered impossible in reality by physical or other restrictions.
Table 4  Relevant papers grouped by quality validation method

  Visual comparison: Molitor et al. (2022), Cannizzaro et al. (2022), Lin et al. (2020), Li et al. (2022), Jiang et al. (2021), Zhang et al. (2021), Xu et al. (2019), Quintana et al. (2020), Faltings et al. (2022), Smolyak et al. (2020), Cui et al. (2021), Lim et al. (2018), Liu et al. (2022), Dong et al. (2022), Behera and Misra (2021)

  No validation: Pasqualotto et al. (2021), Mahenge et al. (2021), Huang et al. (2021), Lu et al. (2021b), Sadoughi et al. (2019), Ding et al. (2022), Martins et al. (2023), Liu et al. (2023), Kim et al. (2023)
Table 4 provides an overview of the methods and metrics used to evaluate the quality of generated data samples. Most publications use either no validation at all (Pasqualotto et al. 2021; Mahenge et al. 2021; Huang et al. 2021; Lu et al. 2021b; Sadoughi et al. 2019; Ding et al. 2022; Martins et al. 2023; Liu et al. 2023; Kim et al. 2023) or only do a visual comparison between generated and real samples (Li et al. 2022; Zhang et al. 2021; Xu et al. 2019; Faltings et al. 2022; Smolyak et al. 2020; Lim et al. 2018; Dong et al. 2022). This indicates a clear lack of a good quality metric for the evaluation of synthetic time series data.
Three of the papers evaluated the quality of generated data by using metrics that quantify the similarity between images. Molitor et al. (2022) and Cannizzaro et al. (2022) use the Fréchet Inception Distance (FID) (Heusel et al. 2017) and the Inception Score (IS) (Salimans et al. 2016) to quantify the quality of generated samples. These metrics cannot directly be used to evaluate the quality of generated time series data samples. Therefore, Hong and Suh (2021) first transform their time series data into Mel spectrogram images and use the Structural Similarity Index Measure (SSIM) to quantify the similarity of generated and real data.
Three publications (Fathy et al. 2021; Bui et al. 2021; Ranasinghe et al. 2019) use the Kolmogorov–Smirnov (K–S) test to evaluate the quality of generated data. The K–S test measures the similarity between the distribution of generated samples and the distribution of real samples. Two other metrics for the similarity between distributions, the Maximum Mean Discrepancy (MMD) and the Kullback–Leibler divergence (K-LD), are also used (Shao et al. 2019; Cui et al. 2021). Seven of the included papers (Lin et al. 2020; Quintana et al. 2020; Zheng et al. 2020; Shao et al. 2019; Liu et al. 2022; Mo et al. 2022; Behera and Misra 2021) use distance measures such as the Euclidean Distance (ED), Dynamic Time Warping (DTW) and the Cosine Distance to test whether the synthetic time series samples are similar to the real samples (a minimal DTW sketch is given below). The Pearson Correlation Coefficient (PCC) was adopted by four papers (Lin et al. 2020; Zheng et al. 2020; Shao et al. 2019; Cui et al. 2021) to measure the linear correlation between generated and real time series data. Three papers (Li et al. 2021b; Zheng et al. 2020; Smolyak et al. 2020) use t-distributed Stochastic Neighbor Embedding (t-SNE) visualization. t-SNE is a dimensionality reduction technique that transforms high-dimensional data into a low-dimensional space of two or three dimensions, where similar samples are represented by nearby points and dissimilar samples by distant points.
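To illustrate the kind of distance computation involved, the following is a minimal NumPy sketch of the classic DTW dynamic program for two univariate series; this is a textbook formulation, not the specific implementation used in any of the cited papers.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(n*m) DTW between two 1-D sequences, using the absolute
    difference as the local cost."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]

# Usage: compare a real window with a hypothetical generated one.
real = np.sin(np.linspace(0.0, 2.0 * np.pi, 100))
generated = np.sin(np.linspace(0.0, 2.0 * np.pi, 120) + 0.1)
print(dtw_distance(real, generated))
```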
In their study, Yan et al. (2022) employ a VAE to identify high-quality generated time series samples. The VAE is trained with a randomly selected set of samples generated by a WCGAN and tested with real-world anomaly samples. This process is iteratively repeated until the reconstruction error of the VAE for all real-world test samples falls below a specified threshold. At this point, the generated samples used to train the VAE are deemed to be of high quality.
A novel method called the TRH distance was introduced by Li et al. (2021a). The TRH distance extends the Hausdorff distance by adding a time-regularized penalty that represents the temporal order difference between two points from different time series samples (a sketch of the underlying Hausdorff distance is given below).
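For reference, the following is a minimal NumPy sketch of the plain (symmetric) Hausdorff distance between two univariate series treated as point sets; the time-regularized penalty that turns this into the TRH distance is specific to Li et al. (2021a) and is not reproduced here.

```python
import numpy as np

def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between two 1-D series treated as
    point sets, with the absolute difference as the point distance."""
    d = np.abs(a[:, None] - b[None, :])   # pairwise distance matrix
    forward = d.min(axis=1).max()         # worst best-match from a to b
    backward = d.min(axis=0).max()        # worst best-match from b to a
    return max(forward, backward)
```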
5 Research findings on the WaVe use case
As previously described, a hydrogen-based drive system is being developed in the WaVe project and tested in field trials in two demonstrators from the medium-duty vehicle sector. Additionally, a predictive maintenance system is being developed for this drive system. Sensor data from engine test benches and field tests are available for this purpose. Since it can be assumed that these data mainly consist of time series from normal operation and that therefore little abnormal data will be available, a suitable data augmentation method is needed to generate additional data of possible failure cases. For these reasons, the literature search was conducted and the results were examined to determine whether they are suitable for the WaVe use case.
This section first summarizes the limitations according to the WaVe use case of the remaining papers and then summarizes the most suitable approaches for a data augmentation model to generate new data of fault cases.
5.1 Limitations according to the WaVe use case
It was found that most of the approaches are not suitable due to various limitations. Since the recorded data from the field tests and from the test bench are time series data, a method is needed that can generate such data. Image-based data augmentation methods such as rotation, scaling, or color and brightness changes are not applicable to time series sensor data. The addition of Gaussian noise could be used to augment time series data, but would not incorporate the time dependencies. GANs that generate images as training data could be applied to the WaVe use case, but require modifications due to the different characteristics of time series data. Generative models that reconstruct time series data for RUL prediction or anomaly detection using the reconstruction error are of interest. However, since they are not used to augment the training data, it is not clear whether they would have a positive impact if used in this way. Time-domain data augmentation methods may have the problem that the additional samples have a low variety and are therefore not suitable for the WaVe use case. Methods that use and generate tabular data are partially useful for time series data. However, the time dependencies are not taken into account, which means that the points in time are treated as independent. This behavior does not reflect the real-world use case. Simulations to generate additional data are not feasible in the WaVe use case, because access to simulations of the different parts and processes of the hydrogen combustion drive system is limited and creating such simulations from scratch is far too complex a task.
5.2 Suitable approaches for the WaVe use case
This section highlights the most suitable approaches for the WaVe use case and discusses how they can be used to create and evaluate additional training data.
The WGAN-GP architecture is the most appropriate approach for the WaVe use case based on the results of the literature review. The publications that have achieved the best results in the generation of time series data have based their approaches on the WGAN-GP architecture. Several methods can be used to extend this architecture. The existing fault time series data from the engine test bench and field trials can initially be clustered using approaches similar to the ones used by Zhu et al. (2022) and Li et al. (2022). A WGAN-GP can then be trained for each cluster to generate additional time series data of the specific failure cases. Depending on the type of nominal data and the performance of the predictive maintenance model, additional data can also be generated from rarely occurring nominal time series signals as suggested by Lim et al. (2018) to reduce the number of false positives. To train the WGAN-GP, the two-step training approach described by Cui et al. (2021) can be used. This means that the model will first be pre-trained with all available data, i.e. normal data and data from failure cases in an unsupervised step. After this step the model is fine-tuned with labeled data from the fault cases.
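Since the WGAN-GP architecture is singled out here, the following is a minimal PyTorch sketch of its defining ingredient, the gradient penalty, which keeps the critic approximately 1-Lipschitz; the penalty weight, shapes and the loss comment are illustrative assumptions following the standard WGAN-GP formulation, not a WaVe implementation.

```python
import torch

def gradient_penalty(critic, real, fake, lam=10.0):
    """WGAN-GP penalty: push the critic's gradient norm towards 1 on
    random interpolates between real and generated samples.
    critic maps (batch, n_features) -> (batch, 1)."""
    eps = torch.rand(real.size(0), 1)                     # per-sample mixing factors
    x_hat = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    scores = critic(x_hat)
    grads = torch.autograd.grad(outputs=scores, inputs=x_hat,
                                grad_outputs=torch.ones_like(scores),
                                create_graph=True)[0]
    return lam * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

# In the critic's training step this penalty is added to the Wasserstein
# loss:  loss_d = critic(fake).mean() - critic(real).mean()
#                 + gradient_penalty(critic, real, fake)
```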
Both training phases can use the TRH distance metric proposed by Li et al. (2021a) to filter out generated time series samples of low quality, so that the discriminator network is trained only with high-quality data. This should generally improve the quality of the generated data. The TRH distance can also be used to measure the quality of the generated data, as can other time series similarity metrics such as DTW and ED. To evaluate the plausibility of the synthetic data, however, a suitable metric is still needed. Plausibility here means, as already described, that the generated data can actually occur in reality. In the selected publications of the literature research, this aspect was only rarely considered and, if at all, only visually tested on a few samples.
6 Conclusion
The WaVe research project aims to develop a hydrogen-based drive system based on an internal combustion engine. A digital twin and a predictive maintenance solution will be implemented for this drive system. It is expected that the data will be highly imbalanced, since mostly data from the normal operating conditions will be available and only a small amount of data from failure cases. This makes it difficult to train a predictive maintenance model that can detect specific faults reliably. Therefore, a suitable data augmentation strategy is needed to generate more data of the underrepresented failure cases.
In order to identify or develop a suitable strategy, a literature review was conducted. This review highlights the current state of the art of data augmentation methods for predictive maintenance procedures and time series, and answers the previously posed research questions. It has been shown that mainly generative algorithms, especially GANs, are used for data augmentation in predictive maintenance. On the one hand, these are used to generate new images for defect detection in image-based applications, such as images of stamps on cast steel billets or images of stamped parts. On the other hand, they are also used to generate time series data recorded, for example, by vibration sensors or accelerometers on machine components. Sampling-based data augmentation methods are used rather rarely.
While the literature review indicated that GANs are the predominant approach for data augmentation of time series data, alternative generative techniques may also prove effective for data augmentation in predictive maintenance. VAEs, initially developed by Kingma and Welling (2014), can also be employed to generate novel time series data samples. Another noteworthy approach is that of diffusion models, initially proposed by Sohl-Dickstein et al. (2015) and subsequently refined through the introduction of denoising diffusion probabilistic models by Ho et al. (2020). These models have demonstrated remarkable efficacy in the generation of high-quality synthetic images and, in comparison to GANs, exhibit a more stable training process. GANs, on the other hand, are less computationally complex, which leads to shorter training and inference times.
The approaches found in the literature were discussed with regard to their limitations and suitability for the WaVe use case. Based on this discussion, the most promising methods include a two-step training approach for generative models. A method to reduce the number of false positives is the generation of rarely occurring data of the majority class instead of creating additional samples of minority classes. Also, a novel distance metric, the TRH distance, was found appropriate to evaluate the similarity between time series samples.
Regarding the question of how to evaluate and ensure the quality and plausibility of generated data, it was shown that most publications perform at most a visual inspection. To evaluate the quality of time series, the TRH distance can be used, and other metrics such as ED or DTW are also suitable. To evaluate the plausibility of the data, however, suitable metrics for a quantitative assessment are still missing. A minority of publications checked the plausibility visually, but the majority did not consider it at all.
The next steps are the analysis of the hydrogen combustion engine data captured at the engine test bench, as well as the implementation and experimental evaluation of a suitable data augmentation strategy.
Acknowledgements
This work was supported by the German Federal Ministry for Economic Affairs and Climate Action (BMWK), under grant No. (19I21028R), research project WaVe. The authors alone are responsible for the content of the paper.
Declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.