skip to main content
research-article

Incremental Discovery of Imprecise Functional Dependencies

Published:15 October 2020Publication History
Skip Abstract Section

Abstract

Functional dependencies (fds) are one of the metadata used to assess data quality and to perform data cleaning operations. However, to pursue robustness with respect to data errors, it has been necessary to devise imprecise versions of functional dependencies, yielding relaxed functional dependencies (rfds). Among them, there exists the class of rfds relaxing on the extent, i.e., those admitting the possibility that an fd holds on a subset of data. In the literature, several algorithms to automatically discover rfds from big data collections have been defined. They achieve good performances with respect to the inherent problem complexity. However, most of them are capable of discovering rfds only by batch processing the entire dataset. This is not suitable in the era of big data, where the size of a database instance can grow with high-velocity, and the insertion of new data can invalidate previously holding rfds. Thus, it is necessary to devise incremental discovery algorithms capable of updating the set of holding rfds upon data insertions, without processing the entire dataset. To this end, in this article we propose an incremental discovery algorithm for rfds relaxing on the extent. It manages the validation of candidate rfds and the generation of possibly new rfd candidates upon the insertion of the new tuples, while limiting the size of the overall search space. Experimental results show that the proposed algorithm achieves extremely good performances on real-world datasets.

References

  1. Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2015. Profiling relational data: A survey. VLDB J. 24, 4 (2015), 557--581.Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Ziawasch Abedjan, Patrick Schulze, and Felix Naumann. 2014. DFD: Efficient functional dependency discovery. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM’14). 949--958.Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. K. Bache and M. Lichman. 2017. UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine, CA.Google ScholarGoogle Scholar
  4. Siegfried Bell. 1995. Discovery and maintenance of functional dependencies by independencies. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining (KDD’95). 27--32.Google ScholarGoogle Scholar
  5. Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2016. On the discovery of relaxed functional dependencies. In Proceedings of 20th International Database Engineering 8 Applications Symposium (IDEAS’16). 53--61.Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2016. Relaxed functional dependencies -- A survey of approaches. IEEE Trans. Knowl. Data Eng. 28, 1 (2016), 147--165.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2017. Evolutionary mining of relaxed dependencies from big data collections. In Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics (WIMS’17). ACM, 5.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2017. Learning effective query management strategies from big data. In Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA’17). IEEE, 643--648.Google ScholarGoogle ScholarCross RefCross Ref
  9. David W. Cheung, Jiawei Han, Vincent T. Ng, and C. Y. Wong. 1996. Maintenance of discovered association rules in large databases: An incremental updating technique. In Proceedings of the Twelfth International Conference on Data Engineering. IEEE, 106--114.Google ScholarGoogle Scholar
  10. Pedro G. DeLima and Gary G. Yen. 2005. Multiple objective evolutionary algorithm for temporal linguistic rule extraction. ISA Trans. 44, 2 (2005), 315--327.Google ScholarGoogle ScholarCross RefCross Ref
  11. Seyed Mostafa Fakhrahmad, M. H. Sadreddini, and M. Zolghadri Jahromi. 2008. AD-miner: A new incremental method for discovery of minimal approximate dependencies using logical operations. Intell. Data Anal. 12, 6 (2008), 607--619.Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Peter A. Flach and Iztok Savnik. 1999. Database dependency discovery: A machine learning approach. AI Commun. 12, 3 (1999), 139--160.Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Chris Giannella and Edward Robertson. 2004. On approximation measures for functional dependencies. Inf. Syst. 29, 6 (2004), 483--507.Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. John Grant and Jack Minker. 1985. Normalization and axiomatization for numerical dependencies. Inf. Contr. 65, 1 (1985), 1--17.Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Peter J. Haas, Ihab F. Ilyas, Guy M. Lohman, and Volker Markl. 2009. Discovering and exploiting statistical properties for query optimization in relational databases: A survey. Stat. Anal. Data Min. 1, 4 (2009), 223--250.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka, and Hannu Toivonen. 1998. Efficient discovery of functional and approximate dependencies using partitions. In Proceedings of the 14th International Conference on Data Engineering (ICDE’98). 392--401.Google ScholarGoogle ScholarCross RefCross Ref
  17. Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka, and Hannu Toivonen. 1999. TANE: An efficient algorithm for discovering functional and approximate dependencies. Comput. J. 42, 2 (1999), 100--111.Google ScholarGoogle ScholarCross RefCross Ref
  18. Ihab F. Ilyas, Volker Markl, Peter Haas, Paul Brown, and Ashraf Aboulnaga. 2004. CORDS: Automatic discovery of correlations and soft functional dependencies. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD’04). 647--658.Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Ronald S. King and James J. Legendre. 2003. Discovery of functional and approximate functional dependencies in relational databases. J. Appl. Math. Decis. Sci. 7, 1 (2003), 49--59.Google ScholarGoogle ScholarCross RefCross Ref
  20. Jyrki Kivinen and Heikki Mannila. 1995. Approximate inference of functional dependencies from relations. Theor. Comput. Sci. 149, 1 (1995), 129--149.Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Sebastian Kruse and Felix Naumann. 2018. Efficient discovery of approximate dependencies. Proc. VLDB Endow. 11, 7 (2018), 759--772.Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Jixue Liu, Jiuyong Li, Chengfei Liu, and Yongfeng Chen. 2012. Discover dependencies from data—A review. IEEE Trans. Knowl. Data Eng. 24, 2 (2012), 251--264.Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Stéphane Lopes, Jean-Marc Petit, and Lotfi Lakhal. 2000. Efficient discovery of functional dependencies and armstrong relations. In Proceedings of the 7th International Conference on Extending Database Technology (EDBT’00). 350--364.Google ScholarGoogle ScholarCross RefCross Ref
  24. Heikki Mannila and Kari-Jouko Räihä. 1987. Dependency inference. In Proceedings of the 13th International Conference on Very Large Data Bases (VLDB’87). 155--158.Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. B. Nath, D. K. Bhattacharyya, and A. Ghosh. 2013. Incremental association rule mining: A survey. Data Min. Knowl. Discov. 3, 3 (2013), 157--169.Google ScholarGoogle ScholarCross RefCross Ref
  26. Felix Naumann. 2014. Data profiling revisited. ACM SIGMOD Rec. 42, 4 (2014), 40--49.Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Noël Novelli and Rosine Cicchetti. 2001. FUN: An efficient algorithm for mining functional and embedded dependencies. In Proceedings of 8th International Conference Database Theory (ICDT’01). 189--203.Google ScholarGoogle ScholarCross RefCross Ref
  28. Thorsten Papenbrock, Jens Ehrlich, Jannik Marten, Tommy Neubert, Jan-Peer Rudolph, Martin Schönberg, Jakob Zwiener, and Felix Naumann. 2015. Functional dependency discovery: An experimental evaluation of seven algorithms. Proc. VLDB Endow. 8, 10 (2015), 1082--1093.Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Thorsten Papenbrock, Jens Ehrlich, Jannik Marten, Tommy Neubert, Jan-Peer Rudolph, Martin Schönberg, Jakob Zwiener, and Felix Naumann. 2015. Functional dependency discovery: An experimental evaluation of seven algorithms. Proc. VLDB Endow. 8, 10 (2015), 1082--1093.Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Thorsten Papenbrock and Felix Naumann. 2016. A hybrid approach to functional dependency discovery. In Proceedings of the 2016 ACM SIGMOD International Conference on Management of Data (SIGMOD’16). ACM, 821--833.Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Daniel Sánchez, José María Serrano, Ignacio Blanco, Maria Jose Martín-Bautista, and María-Amparo Vila. 2008. Using association rules to mine for strong approximate dependencies. Data Min. Knowl. Discov. 16, 3 (2008), 313--348.Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Hemant Saxena, Lukasz Golab, and Ihab F. Ilyas. 2019. Distributed discovery of functional dependencies. In Proceedings of the 35th International Conference on Data Engineering (ICDE’19). IEEE, 1590--1593.Google ScholarGoogle Scholar
  33. Hemant Saxena, Lukasz Golab, and Ihab F. Ilyas. 2019. Distributed implementations of dependency discovery algorithms. Proc. VLDB Endow. 12, 11 (2019), 1624--1636.Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Philipp Schirmer, Thorsten Papenbrock, Sebastian Kruse, Dennis Hempfing, Torben Meyer, Daniel Neuschäfer-Rube, and Felix Naumann. 2019. DynFD: Functional dependency discovery in dynamic datasets. In Proceedings of the 22nd International Conference on Extending Database Technology (EDBT’19). 253--264.Google ScholarGoogle Scholar
  35. Dan A. Simovici, Dana Cristofor, and Laurentiu Cristofor. 2002. Impurity measures in databases. Acta Inf. 38, 5 (2002), 307--324.Google ScholarGoogle ScholarCross RefCross Ref
  36. Shyue-Liang Wang, Ju-Wen Shen, and Tzung-Pei Hong. 2001. Incremental discovery of functional dependencies using partitions. In Proceedings Joint 9th IFSA World Congress and 20th NAFIPS International Conference (Cat. No. 01TH8569), Vol. 3. IEEE, 1322--1326.Google ScholarGoogle ScholarCross RefCross Ref
  37. Ziheng Wei and Sebastian Link. 2019. Discovery and ranking of functional dependencies. In Proceedings of the 35th International Conference on Data Engineering (ICDE’19). IEEE, 1526--1537.Google ScholarGoogle ScholarCross RefCross Ref
  38. Catharine Wyss, Chris Giannella, and Edward Robertson. 2001. FastFDs: A heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances. In Proceedings of 3rd International Conference on Data Warehousing and Knowledge Discovery (DaWaK’01). 101--110.Google ScholarGoogle ScholarCross RefCross Ref
  39. Hong Yao, Howard J. Hamilton, and Cory J. Butz. 2002. FD_Mine: Discovering functional dependencies in a database using equivalences. In Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM’02). 729--732.Google ScholarGoogle Scholar
  40. Lin Zhu, Xu Sun, Zijing Tan, Kejia Yang, Weidong Yang, Xiangdong Zhou, and Yingjie Tian. 2019. Incremental discovery of order dependencies on tuple insertions. In Proceedings of the 24th International Conference on Database Systems for Advanced Applications (DASFAA’19). Springer, 157--174.Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Incremental Discovery of Imprecise Functional Dependencies

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image Journal of Data and Information Quality
          Journal of Data and Information Quality  Volume 12, Issue 4
          Special Issue on Metadata Discovery for Assessing Data Quality
          December 2020
          118 pages
          ISSN:1936-1955
          EISSN:1936-1963
          DOI:10.1145/3430382
          Issue’s Table of Contents

          Copyright © 2020 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 15 October 2020
          • Online AM: 7 May 2020
          • Accepted: 1 April 2020
          • Revised: 1 February 2020
          • Received: 1 November 2019
          Published in jdiq Volume 12, Issue 4

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article
          • Research
          • Refereed

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader

        HTML Format

        View this article in HTML Format .

        View HTML Format