2018 | OriginalPaper | Chapter

Iteratively Modeling Based Cleansing Interactively Samples of Big Data

Authors: Xiangwu Ding, Shengnan Qin

Published in: Cloud Computing and Security

Publisher: Springer International Publishing

Abstract

Taking advantage of big data means analyzing it and building prediction models on it. However, data obtained in practice often contains dirty records due to various factors. One way to use big data is to clean the whole dataset first and then train a predictive model on the cleaned data, but existing cleaning approaches often need a large amount of completely clean data as a guide for fixing errors, and obtaining that much clean data is impractical. Another way is to train the predictive model directly on raw data, which yields an inaccurate model. We therefore explore the process of iteratively updating a model and propose an updating algorithm that combines data cleaning with the conjugate gradient method. In this paper, we incrementally update an initial model trained on raw data towards the optimum by cleaning samples, rather than the whole dataset, at each iteration. The update direction is determined by the gradient computed from the data. After multiple iterations, we obtain an optimal model that still works well on newly arriving data without cleaning it. We also present a cluster descent sampling algorithm to accelerate model convergence. Our evaluation on real datasets shows that the approach significantly improves model accuracy compared with training the model directly on raw data.
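The iterative idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a one-dimensional linear model, a hypothetical error model (a fixed offset added to 20% of the labels), a hypothetical `clean_oracle` callback standing in for the interactive cleaning step, uniform sampling instead of cluster descent sampling, and a plain gradient step in place of the conjugate gradient direction. All function names and parameters are illustrative.

```python
import random

def make_data(n, rng):
    """Return (dirty, clean) lists of (x, y); clean data follows y = 2x + 1."""
    dirty, clean = [], []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = 2.0 * x + 1.0 + rng.gauss(0.0, 0.05)
        clean.append((x, y))
        # Assumed error model: 20% of records carry a systematic +5 offset.
        dirty.append((x, y + 5.0) if rng.random() < 0.2 else (x, y))
    return dirty, clean

def fit_least_squares(data):
    """Closed-form simple linear regression; returns (w, b)."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    sxx = sum((x - mx) ** 2 for x, _ in data)
    sxy = sum((x - mx) * (y - my) for x, y in data)
    w = sxy / sxx
    return w, my - w * mx

def gradient(model, batch):
    """Gradient of the mean squared-error loss (with factor 1/2) on a batch."""
    w, b = model
    gw = gb = 0.0
    for x, y in batch:
        residual = w * x + b - y
        gw += residual * x
        gb += residual
    return gw / len(batch), gb / len(batch)

def update_by_cleaning(dirty, clean_oracle, iters=200, batch_size=50,
                       lr=0.1, seed=0):
    """Start from a model fitted on raw data, then refine it using only
    small cleaned samples at each iteration."""
    rng = random.Random(seed)
    model = fit_least_squares(dirty)          # initial model on raw data
    for _ in range(iters):
        idx = rng.sample(range(len(dirty)), batch_size)  # uniform sample
        cleaned = [clean_oracle(i) for i in idx]         # clean the sample only
        gw, gb = gradient(model, cleaned)
        model = (model[0] - lr * gw, model[1] - lr * gb)  # plain gradient step
    return model

def param_error(model):
    """Absolute distance from the true parameters (w, b) = (2, 1)."""
    return abs(model[0] - 2.0) + abs(model[1] - 1.0)

rng = random.Random(42)
dirty, clean = make_data(2000, rng)
initial = fit_least_squares(dirty)
final = update_by_cleaning(dirty, lambda i: clean[i])
print("error before:", param_error(initial), "after:", param_error(final))
```

In this toy setting the model trained on raw data absorbs the label offset into its intercept, and the sample-based updates pull it back towards the clean-data parameters while touching only 50 records per iteration.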


Metadata
Title: Iteratively Modeling Based Cleansing Interactively Samples of Big Data
Authors: Xiangwu Ding, Shengnan Qin
Copyright Year: 2018
DOI: https://doi.org/10.1007/978-3-030-00006-6_55
