Introduction
- We present a detailed use case of an end-to-end IoT solution for predictive maintenance. Previous discussions of the topic in the literature have been limited to a single aspect: big data architecture, storage design, the analytics pipeline, or machine learning algorithms. In this paper we address all of these aspects together under one roof. In particular, for each aspect we outline important pitfalls that are common in practical applications. The solutions suggested here can serve as a reference for similar intelligent-maintenance IoT systems in diverse application fields.
- We present a scalable IoT data analytics pipeline based on a Big Data architecture that integrates heterogeneous data streams across an entire fleet of laser cutting machines. Our system enables near real-time and/or discrete interval-based machine health monitoring.
- We perform an in-depth evaluation of physical storage design choices, such as partitioning and Z-ordering, to enable efficient, multi-dimensional query processing.
- We discuss the main challenges and lessons learned from implementing the IoT data analytics pipeline in an industrial setting using state-of-the-art Big Data technology such as Azure Databricks, Spark Structured Streaming and Delta Lake.
- Finally, we demonstrate the possibility of early fault detection in the optical system of laser cutting machines. In particular, we applied both statistical methods and convolutional neural networks (CNNs). We argue that although CNNs do not require a feature engineering step, standard statistical methods are preferable in this case due to their higher explainability and their robustness to variable operating conditions.
Background
- Machine data collection
- Data persistence
- Data processing and analysis
- Data visualization and automated alerting
Related work
Speeding up query processing with multi-dimensional indexes
Machine learning for intelligent maintenance
Methods: system design
Monitoring system with stream processing
Checkpointing: monitoring the monitoring system
- This can happen due to an unexpected runtime error, for example one caused by memory issues within the streaming task.
- We can consider this a soft failure, since the task can be restarted and the data will subsequently be re-processed through Spark's own checkpointing mechanism (assuming the task is restarted within the buffering window of the stream). For more details on checkpointing, see the Spark user guide.
- For this type of error, it is sufficient to monitor the stream in a separate task by periodically inspecting its status. Whenever the status is detected as “Stopped”, the streaming task should be restarted. The last successfully processed timestamp is available through the stream's “lastProgress” parameter.
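A minimal sketch of such a watchdog, assuming the `query` object exposes `isActive`, `status` and `lastProgress` as Spark Structured Streaming queries do; `restart_fn` is a hypothetical user-supplied callback that re-submits the streaming task:

```python
import time

def should_restart(is_active: bool, status_message: str) -> bool:
    """Decide whether the streaming task needs to be restarted."""
    return (not is_active) or status_message == "Stopped"

def watch(query, restart_fn, poll_seconds=60):
    """Periodically inspect a streaming query and restart it when stopped.

    `query.lastProgress` holds information about the last successfully
    processed micro-batch, including its timestamp.
    """
    while True:
        if should_restart(query.isActive, query.status.get("message", "")):
            last = query.lastProgress or {}
            print(f"Stream stopped, last progress: {last.get('timestamp')}")
            query = restart_fn()
        time.sleep(poll_seconds)
```

The decision logic is kept in a separate pure function so that it can be tested without a running Spark cluster.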
- This is a more severe failure, since all intermediate state is lost, which means we can no longer rely on Spark's checkpointing mechanism. Instead, we need to internally store the last successfully processed timestamp for each data type in order to bridge gaps in the data. More precisely, upon restarting the cluster, a batch job processes, from the immutable JSON data lake, all data points between the last saved timestamp and the time of the restart, and stores them in the appropriate Delta tables. This ensures that all data is available in structured format despite the cluster downtime.
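The backfill selection can be sketched as follows, assuming a simplified record schema with hypothetical `type` and `ts` keys (the actual pipeline reads the records from the JSON data lake and writes them to Delta tables):

```python
from datetime import datetime

def backfill_window(last_saved: dict, restart_time: datetime) -> dict:
    """Per data type, the time range a batch job must re-process after a
    cluster restart: from the last successfully stored timestamp up to
    the time of the restart."""
    return {dtype: (ts, restart_time) for dtype, ts in last_saved.items()}

def select_backfill(records, window):
    """Pick the data-lake records that fall into each data type's gap."""
    out = []
    for r in records:
        lo, hi = window.get(r["type"], (None, None))
        if lo is not None and lo < r["ts"] <= hi:
            out.append(r)
    return out
```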
Query processing and optimization
Input dimensions
- One-dimensional point queries (1D-PQ) are of the form \(a = v\), where \(a\) refers to an attribute and \(v\) to a value.
- One-dimensional range queries (1D-RQ) are of the form \(a > v\) (greater than) or \(a < v\) (less than).
- Multi-dimensional point queries (nD-PQ, e.g. 2D-PQ) combine multiple one-dimensional point queries with the boolean operator AND. They are of the form \(a_1 = v_1\) AND \(a_2 = v_2\) AND ... AND \(a_n = v_n\). For instance: find all machines where a failure occurred on weekday = 5 AND location = Zurich.
- Multi-dimensional range queries (nD-RQ, e.g. 2D-RQ) are, analogously, a combination of one-dimensional range queries concatenated with the boolean operator AND. They are of the form \(a_1 > v_1\) AND \(a_2 > v_2\) AND ... AND \(a_n > v_n\). For instance: find all machines with temperature > 100 AND speed < 20.
Data storage optimization
- Partitioning is used to split files into more manageable sizes such that a smaller subset of the data needs to be accessed during query processing. Assume a dataset with measurements from various machines. Each measurement contains the machine name, the three coordinates of the machine's cutting head (x, y and z), the temperature of a particular machine part, and a timestamp of when the measurement was collected (see top part of Fig. 3). The middle and bottom parts of the figure show two partitioning strategies, namely partitioning by machine and partitioning by timestamp. Partitioning by machine is a good strategy for queries that filter on specific machine names; partitioning by timestamp, in turn, is better for queries that filter on timestamps. But what if a query filters on machines AND timestamps? In that case, one might need a strategy that combines the partitions for machines and timestamps. In general, given that our system can receive arbitrary queries with any possible combination of attributes and dimensionality, the complexity of choosing all combinations of partitions is O(n!), where n is the number of attributes in a table. Storing such a high number of partitions is impossible in practice, so the partitions must be chosen based on the most commonly used queries. This can only be achieved by combining system-specific domain knowledge with an iterative development process that adapts the partitions to the most frequently deployed data analytics.
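To make the combinatorial blow-up concrete, the number of ordered partition hierarchies over subsets of n attributes can be counted as a sum of factorial terms; this is an illustrative back-of-the-envelope calculation, not part of the pipeline:

```python
from math import factorial

def partition_layouts(n_attrs: int) -> int:
    """Number of ordered partition hierarchies over subsets of n
    attributes: sum over k of n!/(n-k)! (pick k attributes, in order)."""
    return sum(factorial(n_attrs) // factorial(n_attrs - k)
               for k in range(1, n_attrs + 1))
```

For the 16 attributes of our measurement table this already exceeds 10^13 layouts, which is why only the layouts matching the most common queries can be materialized.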
- Z-ordering/multi-dimensional clustering: The basic idea of Z-ordering [11] is to cluster together those attributes that are potentially also queried together. In other words, Z-ordering maps data from a multi-dimensional space down to a one-dimensional space. Moreover, Z-ordering uses a combined partitioning strategy across multiple attributes. In our example (see Fig. 4), the values of some of the machines are collocated with the values of some of the temperature measurements. The data can now be traversed following the shape of the letter “Z”. Assume we are interested in the query Machine = M2 AND temperature < 19. In this case, we only need to traverse the top-left Z. However, if we are interested in all machines with temperature < 19, we need to traverse three times as much data, i.e. three Z-shapes. This example shows that, depending on the type and dimensionality of the queries, a different amount of data needs to be traversed: the more data that needs to be traversed during query processing, the longer the query response time.
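The mapping from a multi-dimensional space to a one-dimensional one can be sketched with a classical Morton code, which interleaves the bits of two attribute values; this is a simplified two-dimensional illustration of the principle, not the internal Delta Lake implementation:

```python
def z_value(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of two attribute values to obtain their
    position on the Z-order curve (Morton code)."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x occupies even bit positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y occupies odd bit positions
    return z

# Sorting rows by z_value collocates rows that are close in BOTH
# dimensions, so a 2D predicate only touches a few contiguous runs.
rows = [(x, y) for x in range(4) for y in range(4)]
rows.sort(key=lambda r: z_value(*r))
```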
Results and discussion
Analysis of query performance in Big Data architecture
- What is the effect of partitioning and Z-ordering on one-dimensional queries?
- How are the conclusions affected by higher query dimensionalities?
- How do our findings change when we deal with high-cardinality attributes?
Software and hardware setup
Datasets
| \(N_{rows}\) | \(10^{10}\) |
|---|---|
| \(N_{cols}\) | 16 |
| Data size | 240 GB |
Experiment results
- Non-partitioned: The table is physically stored without optimization and specifically without partitioning.
- Z-ordering: The table is stored using a specific attribute for Z-ordering. The Z-ordered attribute is also accessed by the queries.
- Partitioned: The table is partitioned by a specific attribute. The partitioned attribute is accessed by the queries.
These attributes are accessed by the queries in the WHERE clause. Let us first focus on the query response times of point queries. In particular, we analyze the effects of two Z-ordering strategies: 1D-PQ 1Z vs. 1D-PQ 4Z. In the former case, one-dimensional point queries are executed using Z-ordering based on one attribute; in the latter, using Z-ordering based on four attributes. We observe that the query response times on 64 CPU cores are about 8 times higher for 1D-PQ 4Z than for 1D-PQ 1Z. The reason is that Z-ordering clusters rows across all Z-ordered attributes equally, so performance drops when Z-ordering covers more attributes than the query actually uses. In summary, one-dimensional queries perform better with Z-ordering on one attribute rather than four.
- Categorical attributes such as year, month, day or country have a low attribute cardinality. For these attributes, partitioning works well and should be used primarily. As a general rule of thumb, at a cardinality above \(10^4\) one should start to question whether partitioning is the right choice.
- Continuous attributes such as temperature values typically have high attribute cardinalities and are best suited for Z-ordering.
- In general, partitioning is faster than Z-ordering and should be preferred if the cardinality is below \(10^5\). Z-ordering can handle high attribute cardinalities where partitioning is no longer an option.
- Unlike partitioning, the performance of Z-ordering degrades when multiple attributes are Z-ordered but not part of a query.
- Our performance experiments with real-world data showed similar results and confirmed these lessons learned.
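These rules of thumb can be condensed into a small decision helper; the thresholds below come from the lessons above and are guidelines rather than hard limits:

```python
def choose_layout(cardinality: int) -> str:
    """Heuristic layout choice per attribute, based on its cardinality:
    prefer partitioning for low-cardinality attributes and Z-ordering
    for high-cardinality ones (thresholds are rules of thumb)."""
    if cardinality < 10**4:
        return "partition"
    if cardinality < 10**5:
        # Partitioning still tends to win, but should be re-evaluated.
        return "partition"
    return "z-order"
```

For instance, a `day` attribute (cardinality 365) would be partitioned, while a continuous temperature reading would be Z-ordered.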
Statistical and machine learning algorithms for intelligent maintenance
Use case description
Datasets
Method 1: Machine classification based on convolutional neural networks
| Machine | TN | FN | TP | FP | Accuracy |
|---|---|---|---|---|---|
| 1 | 10963 | 0 | 0 | 0 | 1 |
| 2 | 13133 | 0 | 0 | 0 | 1 |
| 3 | 8817 | 0 | 0 | 0 | 1 |
| 4 | 7502 | 0 | 0 | 0 | 1 |
| 5* | 0 | 2978 | 90876 | 0 | 0.968 |
| 6 | 7690 | 0 | 0 | 0 | 1 |
| 7 | 22899 | 0 | 0 | 21 | 0.999 |
| 8 | 28724 | 0 | 0 | 11 | 0.999 |
| 9 | 52533 | 0 | 0 | 0 | 1 |
| 10 | 3636 | 0 | 0 | 504 | 0.878 |
| 11* | 0 | 184 | 23556 | 0 | 0.992 |
| 12 | 14933 | 0 | 0 | 0 | 1 |
| 13 | 5491 | 0 | 0 | 31 | 0.994 |
| 14 | 9992 | 0 | 0 | 33 | 0.997 |
Method 2: Machine classification based on statistical signal analysis
Lessons learned and applying best practices
- Periodic re-training and validation of the fault classification model on the entire fleet data.
- Verification of the defined business objectives using the precision, recall and accuracy metrics of the algorithms.
- Model deployment in a close-to-real-time data processing pipeline.