Abstract
Functional dependencies (fds) are one of the metadata used to assess data quality and to perform data cleaning operations. However, to pursue robustness with respect to data errors, it has been necessary to devise imprecise versions of functional dependencies, yielding relaxed functional dependencies (rfds). Among them, there exists the class of rfds relaxing on the extent, i.e., those admitting the possibility that an fd holds on a subset of data. In the literature, several algorithms to automatically discover rfds from big data collections have been defined. They achieve good performances with respect to the inherent problem complexity. However, most of them are capable of discovering rfds only by batch processing the entire dataset. This is not suitable in the era of big data, where the size of a database instance can grow with high-velocity, and the insertion of new data can invalidate previously holding rfds. Thus, it is necessary to devise incremental discovery algorithms capable of updating the set of holding rfds upon data insertions, without processing the entire dataset. To this end, in this article we propose an incremental discovery algorithm for rfds relaxing on the extent. It manages the validation of candidate rfds and the generation of possibly new rfd candidates upon the insertion of the new tuples, while limiting the size of the overall search space. Experimental results show that the proposed algorithm achieves extremely good performances on real-world datasets.
- Ziawasch Abedjan, Lukasz Golab, and Felix Naumann. 2015. Profiling relational data: A survey. VLDB J. 24, 4 (2015), 557--581.Google ScholarDigital Library
- Ziawasch Abedjan, Patrick Schulze, and Felix Naumann. 2014. DFD: Efficient functional dependency discovery. In Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM’14). 949--958.Google ScholarDigital Library
- K. Bache and M. Lichman. 2017. UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine, CA.Google Scholar
- Siegfried Bell. 1995. Discovery and maintenance of functional dependencies by independencies. In Proceedings of the 1st International Conference on Knowledge Discovery and Data Mining (KDD’95). 27--32.Google Scholar
- Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2016. On the discovery of relaxed functional dependencies. In Proceedings of 20th International Database Engineering 8 Applications Symposium (IDEAS’16). 53--61.Google ScholarDigital Library
- Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2016. Relaxed functional dependencies -- A survey of approaches. IEEE Trans. Knowl. Data Eng. 28, 1 (2016), 147--165.Google ScholarDigital Library
- Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2017. Evolutionary mining of relaxed dependencies from big data collections. In Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics (WIMS’17). ACM, 5.Google ScholarDigital Library
- Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2017. Learning effective query management strategies from big data. In Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA’17). IEEE, 643--648.Google ScholarCross Ref
- David W. Cheung, Jiawei Han, Vincent T. Ng, and C. Y. Wong. 1996. Maintenance of discovered association rules in large databases: An incremental updating technique. In Proceedings of the Twelfth International Conference on Data Engineering. IEEE, 106--114.Google Scholar
- Pedro G. DeLima and Gary G. Yen. 2005. Multiple objective evolutionary algorithm for temporal linguistic rule extraction. ISA Trans. 44, 2 (2005), 315--327.Google ScholarCross Ref
- Seyed Mostafa Fakhrahmad, M. H. Sadreddini, and M. Zolghadri Jahromi. 2008. AD-miner: A new incremental method for discovery of minimal approximate dependencies using logical operations. Intell. Data Anal. 12, 6 (2008), 607--619.Google ScholarDigital Library
- Peter A. Flach and Iztok Savnik. 1999. Database dependency discovery: A machine learning approach. AI Commun. 12, 3 (1999), 139--160.Google ScholarDigital Library
- Chris Giannella and Edward Robertson. 2004. On approximation measures for functional dependencies. Inf. Syst. 29, 6 (2004), 483--507.Google ScholarDigital Library
- John Grant and Jack Minker. 1985. Normalization and axiomatization for numerical dependencies. Inf. Contr. 65, 1 (1985), 1--17.Google ScholarDigital Library
- Peter J. Haas, Ihab F. Ilyas, Guy M. Lohman, and Volker Markl. 2009. Discovering and exploiting statistical properties for query optimization in relational databases: A survey. Stat. Anal. Data Min. 1, 4 (2009), 223--250.Google ScholarDigital Library
- Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka, and Hannu Toivonen. 1998. Efficient discovery of functional and approximate dependencies using partitions. In Proceedings of the 14th International Conference on Data Engineering (ICDE’98). 392--401.Google ScholarCross Ref
- Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka, and Hannu Toivonen. 1999. TANE: An efficient algorithm for discovering functional and approximate dependencies. Comput. J. 42, 2 (1999), 100--111.Google ScholarCross Ref
- Ihab F. Ilyas, Volker Markl, Peter Haas, Paul Brown, and Ashraf Aboulnaga. 2004. CORDS: Automatic discovery of correlations and soft functional dependencies. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD’04). 647--658.Google ScholarDigital Library
- Ronald S. King and James J. Legendre. 2003. Discovery of functional and approximate functional dependencies in relational databases. J. Appl. Math. Decis. Sci. 7, 1 (2003), 49--59.Google ScholarCross Ref
- Jyrki Kivinen and Heikki Mannila. 1995. Approximate inference of functional dependencies from relations. Theor. Comput. Sci. 149, 1 (1995), 129--149.Google ScholarDigital Library
- Sebastian Kruse and Felix Naumann. 2018. Efficient discovery of approximate dependencies. Proc. VLDB Endow. 11, 7 (2018), 759--772.Google ScholarDigital Library
- Jixue Liu, Jiuyong Li, Chengfei Liu, and Yongfeng Chen. 2012. Discover dependencies from data—A review. IEEE Trans. Knowl. Data Eng. 24, 2 (2012), 251--264.Google ScholarDigital Library
- Stéphane Lopes, Jean-Marc Petit, and Lotfi Lakhal. 2000. Efficient discovery of functional dependencies and armstrong relations. In Proceedings of the 7th International Conference on Extending Database Technology (EDBT’00). 350--364.Google ScholarCross Ref
- Heikki Mannila and Kari-Jouko Räihä. 1987. Dependency inference. In Proceedings of the 13th International Conference on Very Large Data Bases (VLDB’87). 155--158.Google ScholarDigital Library
- B. Nath, D. K. Bhattacharyya, and A. Ghosh. 2013. Incremental association rule mining: A survey. Data Min. Knowl. Discov. 3, 3 (2013), 157--169.Google ScholarCross Ref
- Felix Naumann. 2014. Data profiling revisited. ACM SIGMOD Rec. 42, 4 (2014), 40--49.Google ScholarDigital Library
- Noël Novelli and Rosine Cicchetti. 2001. FUN: An efficient algorithm for mining functional and embedded dependencies. In Proceedings of 8th International Conference Database Theory (ICDT’01). 189--203.Google ScholarCross Ref
- Thorsten Papenbrock, Jens Ehrlich, Jannik Marten, Tommy Neubert, Jan-Peer Rudolph, Martin Schönberg, Jakob Zwiener, and Felix Naumann. 2015. Functional dependency discovery: An experimental evaluation of seven algorithms. Proc. VLDB Endow. 8, 10 (2015), 1082--1093.Google ScholarDigital Library
- Thorsten Papenbrock, Jens Ehrlich, Jannik Marten, Tommy Neubert, Jan-Peer Rudolph, Martin Schönberg, Jakob Zwiener, and Felix Naumann. 2015. Functional dependency discovery: An experimental evaluation of seven algorithms. Proc. VLDB Endow. 8, 10 (2015), 1082--1093.Google ScholarDigital Library
- Thorsten Papenbrock and Felix Naumann. 2016. A hybrid approach to functional dependency discovery. In Proceedings of the 2016 ACM SIGMOD International Conference on Management of Data (SIGMOD’16). ACM, 821--833.Google ScholarDigital Library
- Daniel Sánchez, José María Serrano, Ignacio Blanco, Maria Jose Martín-Bautista, and María-Amparo Vila. 2008. Using association rules to mine for strong approximate dependencies. Data Min. Knowl. Discov. 16, 3 (2008), 313--348.Google ScholarDigital Library
- Hemant Saxena, Lukasz Golab, and Ihab F. Ilyas. 2019. Distributed discovery of functional dependencies. In Proceedings of the 35th International Conference on Data Engineering (ICDE’19). IEEE, 1590--1593.Google Scholar
- Hemant Saxena, Lukasz Golab, and Ihab F. Ilyas. 2019. Distributed implementations of dependency discovery algorithms. Proc. VLDB Endow. 12, 11 (2019), 1624--1636.Google ScholarDigital Library
- Philipp Schirmer, Thorsten Papenbrock, Sebastian Kruse, Dennis Hempfing, Torben Meyer, Daniel Neuschäfer-Rube, and Felix Naumann. 2019. DynFD: Functional dependency discovery in dynamic datasets. In Proceedings of the 22nd International Conference on Extending Database Technology (EDBT’19). 253--264.Google Scholar
- Dan A. Simovici, Dana Cristofor, and Laurentiu Cristofor. 2002. Impurity measures in databases. Acta Inf. 38, 5 (2002), 307--324.Google ScholarCross Ref
- Shyue-Liang Wang, Ju-Wen Shen, and Tzung-Pei Hong. 2001. Incremental discovery of functional dependencies using partitions. In Proceedings Joint 9th IFSA World Congress and 20th NAFIPS International Conference (Cat. No. 01TH8569), Vol. 3. IEEE, 1322--1326.Google ScholarCross Ref
- Ziheng Wei and Sebastian Link. 2019. Discovery and ranking of functional dependencies. In Proceedings of the 35th International Conference on Data Engineering (ICDE’19). IEEE, 1526--1537.Google ScholarCross Ref
- Catharine Wyss, Chris Giannella, and Edward Robertson. 2001. FastFDs: A heuristic-driven, depth-first algorithm for mining functional dependencies from relation instances. In Proceedings of 3rd International Conference on Data Warehousing and Knowledge Discovery (DaWaK’01). 101--110.Google ScholarCross Ref
- Hong Yao, Howard J. Hamilton, and Cory J. Butz. 2002. FD_Mine: Discovering functional dependencies in a database using equivalences. In Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM’02). 729--732.Google Scholar
- Lin Zhu, Xu Sun, Zijing Tan, Kejia Yang, Weidong Yang, Xiangdong Zhou, and Yingjie Tian. 2019. Incremental discovery of order dependencies on tuple insertions. In Proceedings of the 24th International Conference on Database Systems for Advanced Applications (DASFAA’19). Springer, 157--174.Google ScholarDigital Library
Index Terms
- Incremental Discovery of Imprecise Functional Dependencies
Recommendations
A Hybrid Approach to Functional Dependency Discovery
SIGMOD '16: Proceedings of the 2016 International Conference on Management of DataFunctional dependencies are structural metadata that can be used for schema normalization, data integration, data cleansing, and many other data management tasks. Despite their importance, the functional dependencies of a specific dataset are usually ...
Efficient Discovery of Functional Dependencies from Incremental Databases
iiWAS2021: The 23rd International Conference on Information Integration and Web IntelligenceWith the advent of Big Data there is an increasing necessity to incrementally mine information from data originating from sensors and other dynamic sources. Thus, it is necessary to devise algorithms capable of mining useful information upon possible ...
On the Discovery of Relaxed Functional Dependencies
IDEAS '16: Proceedings of the 20th International Database Engineering & Applications SymposiumFunctional dependencies (fds) express important relationships among data, which can be used for several goals, including schema normalization and data cleansing. However, to solve several issues in emerging application domains, such as the ...
Comments