Abstract
Functional dependencies (FDs) play an important role in maintaining data quality. They can be used to enforce data consistency and to guide repairs over a database. In this work, we investigate the problem of missing values and its impact on FD discovery. When using existing FD discovery algorithms, some genuine FDs could not be detected precisely due to missing values or some non-genuine FDs can be discovered even though they are caused by missing values with a certain NULL semantics. We define a notion of genuineness and propose algorithms to compute the genuineness score of a discovered FD. This can be used to identify the genuine FDs among the set of all valid dependencies that hold on the data. We evaluate the quality of our method over various real-world and semi-synthetic datasets with extensive experiments. The results show that our method performs well for relatively large FD sets and is able to accurately capture genuine FDs.
- A. Badia and D. Lemire. Functional dependencies with null markers. Comput. J., 58(5):1160--1168, 2015.Google ScholarCross Ref
- A. Badia and D. Lemire. On desirable semantics of functional dependencies over databases with incomplete information. arXiv preprint arXiv:1703.08198, 2017.Google Scholar
- L. E. Bertossi. Database Repairing and Consistent Query Answering. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2011. Google ScholarDigital Library
- G. Beskales, I. F. Ilyas, and L. Golab. Sampling the repairs of functional dependency violations under hard constraints. Proc. of the VLDB Endowment, 3(1):197--207, 2010. Google ScholarDigital Library
- T. Bleifuß, S. Bülow, J. Frohnhofen, J. Risch, G. Wiese, S. Kruse, T. Papenbrock, and F. Naumann. Approximate discovery of functional dependencies for large datasets. In Proc. of the International Conference on Information and Knowledge Management (CIKM), 2016. Google ScholarDigital Library
- P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In Proc. of the International Conference on Data Engineering (ICDE), pages 746--755, 2007.Google ScholarCross Ref
- L. Caruccio, V. Deufemia, and G. Polese. Relaxed functional dependencies: A survey of approaches. IEEE Transactions on Knowledge and Data Engineering, 28(1):147--165, 2016. Google ScholarDigital Library
- F. Chiang and R. J. Miller. A unified model for data and constraint repair. In Proc. of the International Conference on Data Engineering (ICDE), pages 446--457, 2011. Google ScholarDigital Library
- N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. VLDB Journal, 16(4):523--544, 2007. Google ScholarDigital Library
- S. De and S. Kambhampati. Defining and mining functional dependencies in probabilistic databases. arXiv preprint arXiv:1005.4714, 2010.Google Scholar
- R. D. De Veaux and D. J. Hand. How to lie with bad data. Statist. Sci., 20(3):231--238, 08 2005.Google ScholarCross Ref
- DemandGen. Assessing the impact of dirty data on sales & marketing performance, https://www.zoominfo.com/business/mktg/ebooks/dirtydataebook.pdf, 2017.Google Scholar
- B. Efron. Missing data, imputation, and the bootstrap. Journal of the American Statistical Association, 89(426):463--475, 1994.Google ScholarCross Ref
- W. Fan and F. Geerts. Uniform dependency language for improving data quality. IEEE Data Engineering Bulletin, 34(3):34--42, 2011.Google Scholar
- W. Fan and F. Geerts. Foundations of Data Quality Management. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2012. Google ScholarDigital Library
- W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. ACM Transactions on Database Systems (TODS), 33(2):6:1--6:48, 2008. Google ScholarDigital Library
- W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. VLDB Journal, 21(2):213--238, 2012. Google ScholarDigital Library
- Gartner. Dirty data is a business problem, not an it problem, http://www.gartner.com/newsroom/id/501733, 2007.Google Scholar
- A. Haug, F. Zachariassen, and D. van Liempd. The costs of poor data quality. Journal of Industrial Engineering and Management, 4(2):168--193, 2011.Google ScholarCross Ref
- Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. Efficient discovery of functional and approximate dependencies using partitions. In Proc. of the International Conference on Data Engineering (ICDE), pages 392--401, 1998. Google ScholarDigital Library
- J. Kivinen and H. Mannila. Approximate inference of functional dependencies from relations. Theor. Comput. Sci., 149(1):129--149, 1995. Google ScholarDigital Library
- C. Koch and D. Olteanu. Conditioning probabilistic databases. Proceedings of the VLDB Endowment, 1(1):313--325, 2008. Google ScholarDigital Library
- H. Köhler, U. Leck, S. Link, and X. Zhou. Possible and certain keys for SQL. VLDB Journal, 25(4):571--596, 2016. Google ScholarDigital Library
- H. Köhler, S. Link, and X. Zhou. Discovering meaningful certain keys from incomplete and inconsistent relations. IEEE Data Engineering Bulletin, 39(2):21--37, 2016.Google Scholar
- M. Levene and G. Loizou. Axiomatisation of functional dependencies in incomplete relations. Theoretical Computer Science, 206(1--2):283--300, 1998. Google ScholarDigital Library
- M. Lichman. UCI machine learning repository http://archive.ics.uci.edu/ml, 2013.Google Scholar
- J. Liu, J. Li, C. Liu, and Y. Chen. Discover dependencies from data - a review. IEEE Transactions on Knowledge and Data Engineering (TKDE), 24(2):251--264, 2012. Google ScholarDigital Library
- M. Mazuran, E. Quintarelli, L. Tanca, and S. Ugolini. Semi-automatic support for evolving functional dependencies. In Proc. of the International Conference on Extending Database Technology (EDBT), pages 293--304, 2016.Google Scholar
- N. Novelli and R. Cicchetti. Fun: An efficient algorithm for mining functional and embedded dependencies. In Proc. of the International Conference on Database Theory (ICDT), pages 189--203, 2001. Google ScholarDigital Library
- T. Papenbrock, J. Ehrlich, J. Marten, T. Neubert, J.-P. Rudolph, M. Schnberg, J. Zwiener, and F. Naumann. Functional dependency discovery: An experimental evaluation of seven algorithms. Proc. of the VLDB Endowment, 8(10):1082--1093, 2015. Google ScholarDigital Library
- T. Papenbrock and F. Naumann. Data-driven schema normalization. In Proc. of the International Conference on Extending Database Technology (EDBT), 2017.Google Scholar
- S. Song, A. Zhang, L. Chen, and J. Wang. Enriching data imputation with extensive similarity neighbors. Proc. of the VLDB Endowment, 8(11):1286--1297, 2015. Google ScholarDigital Library
- S. Van Buuren. CRC/Chapman & Hall, 2012.Google Scholar
- D. Z. Wang, X. L. Dong, A. D. Sarma, M. J. Franklin, and A. Y. Halevy. Functional dependency generation and applications in pay-as-you-go data integration systems. In Proc. of the ACM SIGMOD Workshop on the Web and Databases (WebDB), 2009.Google Scholar
Recommendations
A Survey Study on XML Functional Dependencies
ISDPE '07: Proceedings of the The First International Symposium on Data, Privacy, and E-CommerceThere are two major kinds of XML functional dependency (FD) definitions. The first kind of XML FD includes Tree-tuple-based XML FD (tFD) and Path-based XML FD (pFD), and the second kind of XML FD includes Extended-path-based XML FD (epFD), Sub-graph-...
Functional dependencies for XML
WAIM'10: Proceedings of the 2010 international conference on Web-age information managementIn this paper, we present a new approach for defining functional dependencies for XML (XFDs) on XML Schema. While showing how to extend XML Schema, we analyze the expressive power of our XFDs. We focus on supporting complex value (e.g. list, set) in our ...
Missing Values: Proposition of a Typology and Characterization with an Association Rule-Based Model
DaWaK '09: Proceedings of the 11th International Conference on Data Warehousing and Knowledge DiscoveryHandling missing values when tackling real-world datasets is a great challenge arousing the interest of many scientific communities. Many works propose completion methods or implement new data mining techniques tolerating the presence of missing values. ...
Comments