skip to main content
research-article

Discovery of genuine functional dependencies from relational data with missing values

Published:01 April 2018Publication History
Skip Abstract Section

Abstract

Functional dependencies (FDs) play an important role in maintaining data quality. They can be used to enforce data consistency and to guide repairs over a database. In this work, we investigate the problem of missing values and its impact on FD discovery. When using existing FD discovery algorithms, some genuine FDs could not be detected precisely due to missing values or some non-genuine FDs can be discovered even though they are caused by missing values with a certain NULL semantics. We define a notion of genuineness and propose algorithms to compute the genuineness score of a discovered FD. This can be used to identify the genuine FDs among the set of all valid dependencies that hold on the data. We evaluate the quality of our method over various real-world and semi-synthetic datasets with extensive experiments. The results show that our method performs well for relatively large FD sets and is able to accurately capture genuine FDs.

References

  1. A. Badia and D. Lemire. Functional dependencies with null markers. Comput. J., 58(5):1160--1168, 2015.Google ScholarGoogle ScholarCross RefCross Ref
  2. A. Badia and D. Lemire. On desirable semantics of functional dependencies over databases with incomplete information. arXiv preprint arXiv:1703.08198, 2017.Google ScholarGoogle Scholar
  3. L. E. Bertossi. Database Repairing and Consistent Query Answering. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. G. Beskales, I. F. Ilyas, and L. Golab. Sampling the repairs of functional dependency violations under hard constraints. Proc. of the VLDB Endowment, 3(1):197--207, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. T. Bleifuß, S. Bülow, J. Frohnhofen, J. Risch, G. Wiese, S. Kruse, T. Papenbrock, and F. Naumann. Approximate discovery of functional dependencies for large datasets. In Proc. of the International Conference on Information and Knowledge Management (CIKM), 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In Proc. of the International Conference on Data Engineering (ICDE), pages 746--755, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  7. L. Caruccio, V. Deufemia, and G. Polese. Relaxed functional dependencies: A survey of approaches. IEEE Transactions on Knowledge and Data Engineering, 28(1):147--165, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. F. Chiang and R. J. Miller. A unified model for data and constraint repair. In Proc. of the International Conference on Data Engineering (ICDE), pages 446--457, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. VLDB Journal, 16(4):523--544, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. S. De and S. Kambhampati. Defining and mining functional dependencies in probabilistic databases. arXiv preprint arXiv:1005.4714, 2010.Google ScholarGoogle Scholar
  11. R. D. De Veaux and D. J. Hand. How to lie with bad data. Statist. Sci., 20(3):231--238, 08 2005.Google ScholarGoogle ScholarCross RefCross Ref
  12. DemandGen. Assessing the impact of dirty data on sales & marketing performance, https://www.zoominfo.com/business/mktg/ebooks/dirtydataebook.pdf, 2017.Google ScholarGoogle Scholar
  13. B. Efron. Missing data, imputation, and the bootstrap. Journal of the American Statistical Association, 89(426):463--475, 1994.Google ScholarGoogle ScholarCross RefCross Ref
  14. W. Fan and F. Geerts. Uniform dependency language for improving data quality. IEEE Data Engineering Bulletin, 34(3):34--42, 2011.Google ScholarGoogle Scholar
  15. W. Fan and F. Geerts. Foundations of Data Quality Management. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. ACM Transactions on Database Systems (TODS), 33(2):6:1--6:48, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. VLDB Journal, 21(2):213--238, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Gartner. Dirty data is a business problem, not an it problem, http://www.gartner.com/newsroom/id/501733, 2007.Google ScholarGoogle Scholar
  19. A. Haug, F. Zachariassen, and D. van Liempd. The costs of poor data quality. Journal of Industrial Engineering and Management, 4(2):168--193, 2011.Google ScholarGoogle ScholarCross RefCross Ref
  20. Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. Efficient discovery of functional and approximate dependencies using partitions. In Proc. of the International Conference on Data Engineering (ICDE), pages 392--401, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J. Kivinen and H. Mannila. Approximate inference of functional dependencies from relations. Theor. Comput. Sci., 149(1):129--149, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. C. Koch and D. Olteanu. Conditioning probabilistic databases. Proceedings of the VLDB Endowment, 1(1):313--325, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. H. Köhler, U. Leck, S. Link, and X. Zhou. Possible and certain keys for SQL. VLDB Journal, 25(4):571--596, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. H. Köhler, S. Link, and X. Zhou. Discovering meaningful certain keys from incomplete and inconsistent relations. IEEE Data Engineering Bulletin, 39(2):21--37, 2016.Google ScholarGoogle Scholar
  25. M. Levene and G. Loizou. Axiomatisation of functional dependencies in incomplete relations. Theoretical Computer Science, 206(1--2):283--300, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. M. Lichman. UCI machine learning repository http://archive.ics.uci.edu/ml, 2013.Google ScholarGoogle Scholar
  27. J. Liu, J. Li, C. Liu, and Y. Chen. Discover dependencies from data - a review. IEEE Transactions on Knowledge and Data Engineering (TKDE), 24(2):251--264, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. M. Mazuran, E. Quintarelli, L. Tanca, and S. Ugolini. Semi-automatic support for evolving functional dependencies. In Proc. of the International Conference on Extending Database Technology (EDBT), pages 293--304, 2016.Google ScholarGoogle Scholar
  29. N. Novelli and R. Cicchetti. Fun: An efficient algorithm for mining functional and embedded dependencies. In Proc. of the International Conference on Database Theory (ICDT), pages 189--203, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. T. Papenbrock, J. Ehrlich, J. Marten, T. Neubert, J.-P. Rudolph, M. Schnberg, J. Zwiener, and F. Naumann. Functional dependency discovery: An experimental evaluation of seven algorithms. Proc. of the VLDB Endowment, 8(10):1082--1093, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. T. Papenbrock and F. Naumann. Data-driven schema normalization. In Proc. of the International Conference on Extending Database Technology (EDBT), 2017.Google ScholarGoogle Scholar
  32. S. Song, A. Zhang, L. Chen, and J. Wang. Enriching data imputation with extensive similarity neighbors. Proc. of the VLDB Endowment, 8(11):1286--1297, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. S. Van Buuren. CRC/Chapman & Hall, 2012.Google ScholarGoogle Scholar
  34. D. Z. Wang, X. L. Dong, A. D. Sarma, M. J. Franklin, and A. Y. Halevy. Functional dependency generation and applications in pay-as-you-go data integration systems. In Proc. of the ACM SIGMOD Workshop on the Web and Databases (WebDB), 2009.Google ScholarGoogle Scholar

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 11, Issue 8
    April 2018
    94 pages
    ISSN:2150-8097
    Issue’s Table of Contents

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 April 2018
    Published in pvldb Volume 11, Issue 8

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader