2018 | Original Paper | Book Chapter

Iteratively Modeling Based Cleansing Interactively Samples of Big Data

Authors: Xiangwu Ding, Shengnan Qin

Published in: Cloud Computing and Security

Publisher: Springer International Publishing

Abstract

Taking advantage of big data means analyzing it and building prediction models on it. However, real-world data often contains dirty records due to various factors. One way to use big data is to clean the whole dataset first and then train a predictive model on the cleaned data, but existing cleaning approaches often need a large amount of completely clean data as a guide for fixing errors, and obtaining that much clean data is impractical. Another way is to train the predictive model directly on raw data, which yields an inaccurate model. Therefore, we explore the process of iteratively updating a model and propose an updating algorithm that combines data cleaning with the conjugate gradient method. In this paper, we incrementally update an initial model trained on raw data towards the optimum by cleaning samples, rather than the whole dataset, at each iteration. The update direction is determined by the gradient on the data. After multiple iterations, we obtain an optimal model that still works well when new data comes in, without cleaning that data. We also present a cluster descent sampling algorithm to accelerate model convergence. Our evaluation on real datasets shows that the approach significantly improves model accuracy compared with training the model directly on raw data.
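The general idea in the abstract, starting from a model trained on raw data and nudging it with gradients computed on small cleaned samples, can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a linear least-squares model, a plain gradient step instead of the conjugate gradient update, uniform random sampling instead of cluster descent sampling, and a hypothetical `clean_fn` oracle that returns the correct labels for a sampled batch. All names, the synthetic data, and the parameter values are assumptions made for illustration.

```python
import numpy as np


def fit_raw(X, y):
    """Initial model: ordinary least squares on the raw (dirty) data."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w


def iterative_clean_update(X, y_dirty, clean_fn, w0,
                           batch=50, iters=200, lr=0.1, seed=1):
    """Update w0 using gradients computed on small cleaned samples.

    clean_fn is a hypothetical oracle: given row indices, it returns
    the cleaned labels for those rows (e.g. from a human or a rule set).
    """
    rng = np.random.default_rng(seed)
    w = w0.copy()
    y = y_dirty.copy()
    for _ in range(iters):
        # Sample a small batch and clean only those records.
        idx = rng.choice(len(y), size=batch, replace=False)
        y[idx] = clean_fn(idx)
        # Gradient of the mean squared error on the cleaned batch.
        resid = X[idx] @ w - y[idx]
        grad = 2.0 * X[idx].T @ resid / batch
        # Plain gradient step (assumption; the abstract describes a
        # conjugate gradient based update instead).
        w -= lr * grad
    return w


# Synthetic demo: 30% of the labels have a flipped sign ("dirty" data).
rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(2000, 3))
y_clean = X @ w_true + 0.01 * rng.normal(size=2000)
y_dirty = y_clean.copy()
dirty = rng.random(2000) < 0.3
y_dirty[dirty] = -y_clean[dirty]

w_raw = fit_raw(X, y_dirty)
w_upd = iterative_clean_update(X, y_dirty, lambda idx: y_clean[idx], w_raw)

err_raw = np.linalg.norm(w_raw - w_true)
err_upd = np.linalg.norm(w_upd - w_true)
print(f"error of raw model:     {err_raw:.3f}")
print(f"error of updated model: {err_upd:.3f}")
```

In this toy setting the model fitted on raw data is biased towards zero by the flipped labels, while the updated model moves towards the true weights even though only a small batch is cleaned per iteration. How quickly that happens in practice would depend on the sampling strategy and update rule, which is where the paper's contributions lie.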

Metadata
Title
Iteratively Modeling Based Cleansing Interactively Samples of Big Data
Authors
Xiangwu Ding
Shengnan Qin
Copyright year
2018
DOI
https://doi.org/10.1007/978-3-030-00006-6_55
