2018 | Original Paper | Book Chapter

2. Data Preprocessing Techniques

Authors: Jun Zhao, Wei Wang, Chunyang Sheng

Published in: Data-Driven Prediction for Industrial Processes and Their Applications

Publisher: Springer International Publishing


Abstract

Raw industrial data accumulated on-site by commonly deployed supervisory control and data acquisition (SCADA) systems can rarely be used directly to construct a prediction model, because such data are typically contaminated by high-level noise, missing points, and outliers arising from real-time database malfunctions, data transformation, or maintenance. Data preprocessing is therefore required, and it usually comprises anomaly detection, data imputation, and de-noising. For outliers, this chapter introduces anomaly detection methods based on the fuzzy C-means (FCM), K-nearest-neighbor (KNN), and dynamic time warping (DTW) algorithms. For missing data points, it describes a series of data imputation methods: after introducing the generic regression filling and expectation-maximization (EM) methods, it adds a varied window similarity measure method, a segmented shape-representation-based method, and a non-equal-length granules correlation method for industrial data imputation. For the high-level noise in raw data, it then introduces the well-known empirical mode decomposition (EMD) method. A number of industrial case studies are provided to verify the effectiveness of these methods.
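To make the idea of KNN-based anomaly detection concrete, the following is a minimal sketch of a generic distance-based scheme, not the chapter's own algorithm. Each reading is scored by its mean distance to its k nearest neighbours, and readings whose score far exceeds the median score are flagged. The function names, the parameter `factor`, the median-based cutoff, and the sample readings are all assumptions made for illustration.

```python
from statistics import median


def knn_scores(values, k=3):
    """Mean absolute distance from each value to its k nearest neighbours.

    Assumes len(values) > k. Quadratic in the number of values.
    """
    scores = []
    for i, v in enumerate(values):
        # Distances to every other point, smallest first.
        dists = sorted(abs(v - w) for j, w in enumerate(values) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


def flag_outliers(values, k=3, factor=5.0):
    """Indices whose k-NN score exceeds `factor` times the median score.

    The median-based cutoff is an illustrative assumption, not a rule
    taken from the chapter.
    """
    scores = knn_scores(values, k)
    cutoff = factor * median(scores)
    return [i for i, s in enumerate(scores) if s > cutoff]


# Hypothetical flow readings with one obvious spike at index 4.
readings = [10.1, 10.3, 9.9, 10.0, 55.0, 10.2, 9.8, 10.1]
print(flag_outliers(readings))  # prints [4]
```

A relative cutoff is used here so the sketch does not depend on the unit of the readings. In practice the threshold and k would need tuning per signal, and a time series method would also account for temporal order, which this sketch ignores.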

Metadata
Title
Data Preprocessing Techniques
Authors
Jun Zhao
Wei Wang
Chunyang Sheng
Copyright year
2018
DOI
https://doi.org/10.1007/978-3-319-94051-9_2