Published in: The VLDB Journal 6/2020

9 June 2020 | Regular Paper

Automatic weighted matching rectifying rule discovery for data repairing

Can we discover effective repairing rules automatically from dirty data?

Authors: Hiba Abu Ahmad, Hongzhi Wang


Abstract

Data repairing is a key problem in data cleaning that aims to uncover and rectify data errors. Traditional methods rely on data dependencies to detect errors, but they cannot rectify them. To overcome this limitation, recent methods define repairing rules that both detect and fix errors. However, all existing data repairing rules are provided by experts, which is expensive in time and effort. Moreover, rule-based data repairing methods need an external verified data source or user verification; without them, they are incomplete in that they can repair only a small number of errors. In this paper, we define weighted matching rectifying rules (WMRRs), which are based on similarity matching, to capture more errors. We propose a novel algorithm that discovers WMRRs automatically from the dirty data at hand. We also develop an automatic algorithm for resolving inconsistencies among rules. Additionally, based on WMRRs, we propose an automatic data repairing algorithm (WMRR-DR) that uncovers a large number of errors and rectifies them dependably. We experimentally evaluate our method on both real-life and synthetic data. The experimental results show that our method can discover effective WMRRs from the dirty data at hand and perform dependable, fully automatic repairing based on the discovered WMRRs, with higher accuracy than existing dependable methods.
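
To make the idea of a similarity-based repairing rule concrete, the following is a minimal toy sketch in Python. It is not the paper's formal WMRR definition, which the abstract does not give. The rule shape (`ToyRule`), its fields, the weights, the threshold, and the sample records are all hypothetical, and the string similarity is simply the standard-library `difflib` ratio. The sketch only illustrates the general notion: a rule fires when the weighted similarity between a record's evidence attributes and the rule's pattern is high enough, and it then rectifies a target attribute.

```python
from __future__ import annotations

from dataclasses import dataclass
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


@dataclass
class ToyRule:
    # Hypothetical rule shape for illustration only,
    # NOT the paper's formal WMRR definition.
    evidence: dict[str, str]    # attribute -> pattern value to match against
    weights: dict[str, float]   # attribute -> weight (assumed to sum to 1)
    threshold: float            # minimum weighted similarity for the rule to fire
    target: str                 # attribute to rectify
    correct_value: str          # value written when the rule fires

    def score(self, record: dict[str, str]) -> float:
        """Weighted similarity between the record and the rule's evidence pattern."""
        return sum(
            self.weights[attr] * similarity(record.get(attr, ""), value)
            for attr, value in self.evidence.items()
        )

    def apply(self, record: dict[str, str]) -> dict[str, str]:
        """Return a copy of the record, rectified if the rule fires."""
        repaired = dict(record)
        if self.score(record) >= self.threshold:
            repaired[self.target] = self.correct_value
        return repaired


# Hypothetical rule and records.
rule = ToyRule(
    evidence={"country": "China", "city": "Harbin"},
    weights={"country": 0.5, "city": 0.5},
    threshold=0.8,
    target="capital",
    correct_value="Beijing",
)

dirty = {"country": "Chian", "city": "Harbin", "capital": "Shanghai"}
unrelated = {"country": "France", "city": "Paris", "capital": "Paris"}

repaired = rule.apply(dirty)          # typo in "country" still matches the pattern
untouched = rule.apply(unrelated)     # similarity too low, record left as-is
```

Because matching is approximate, the misspelled `"Chian"` still triggers the rule, whereas an exact-match rule would miss it. That is the sense in which similarity matching can capture more errors.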

Footnotes
2
http://www.cs.utexas.edu/users/ml/riddle/data.html
 
Metadata
Title
Automatic weighted matching rectifying rule discovery for data repairing
Can we discover effective repairing rules automatically from dirty data?
Authors
Hiba Abu Ahmad
Hongzhi Wang
Publication date
9 June 2020
Publisher
Springer Berlin Heidelberg
Published in
The VLDB Journal / Issue 6/2020
Print ISSN: 1066-8888
Electronic ISSN: 0949-877X
DOI
https://doi.org/10.1007/s00778-020-00617-6