2018 | OriginalPaper | Book chapter

66. Improving Data Quality Through Deep Learning and Statistical Models

Written by: Wei Dai, Kenji Yoshigoe, William Parsley

Published in: Information Technology - New Generations

Publisher: Springer International Publishing

Abstract

Traditional data quality control methods rely on users’ experience or previously established business rules. This limits performance, and the process is time-consuming and less accurate than desired. Deep learning lets us leverage computing resources and advanced techniques to overcome these challenges and provide greater value to users.
In this paper, we first review related work and discuss machine learning techniques, tools, and statistical quality models. Second, we propose a novel data quality framework that combines deep learning with a statistical model algorithm for assessing data quality. Third, we use salary data from an open dataset published by the state of Arkansas to demonstrate how to identify outlier data and how to improve data quality via deep learning. Finally, we discuss future work.
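The abstract does not specify which statistical quality model is used to identify outliers. As a rough illustration only, the sketch below assumes a simple control-limit rule (mean plus or minus three sample standard deviations, estimated from a trusted baseline) applied to salary values. The function names, the multiplier `k`, and all salary figures are hypothetical and are not taken from the Arkansas dataset or from the chapter.

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Return (lower, upper) limits as mean +/- k sample standard deviations.

    Assumption: `baseline` holds values already believed to be clean.
    """
    center = mean(baseline)
    spread = stdev(baseline)
    return center - k * spread, center + k * spread


def flag_outliers(values, limits):
    """Return the values that fall outside the given (lower, upper) limits."""
    lower, upper = limits
    return [v for v in values if v < lower or v > upper]


# Made-up baseline salaries, used only to estimate the limits.
baseline_salaries = [41000, 43500, 39800, 42200, 40500, 44100, 41700, 42900]
limits = control_limits(baseline_salaries)

# Made-up incoming records; two simulate data entry errors
# (a dropped digit and an extra digit).
incoming = [42000, 4200, 43000, 420000]
suspects = flag_outliers(incoming, limits)
print(suspects)
```

Estimating the limits from a separate baseline, rather than from the data being checked, keeps a single extreme value from inflating the standard deviation and hiding itself. A deep learning model, as the abstract proposes, could replace this fixed rule where such simple limits are too coarse.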

Metadata
Title
Improving Data Quality Through Deep Learning and Statistical Models
Written by
Wei Dai
Kenji Yoshigoe
William Parsley
Copyright year
2018
DOI
https://doi.org/10.1007/978-3-319-54978-1_66