research-article

Discovery of genuine functional dependencies from relational data with missing values

Authors:
Laure Berti-Équille

Aix-Marseille Univ., Marseille, France

Aix-Marseille Univ., Marseille, France
View Profile

,
Hazar Harmouch

University of Potsdam, Germany

University of Potsdam, Germany
View Profile

,
Felix Naumann

University of Potsdam, Germany

University of Potsdam, Germany
View Profile

,
Noël Novelli

Aix-Marseille Univ., Marseille, France

Aix-Marseille Univ., Marseille, France
View Profile

,
Saravanan Thirumuruganathan

HBKU, Doha, Qatar

HBKU, Doha, Qatar
View Profile

Proceedings of the VLDB Endowment Volume 11 Issue 8pp 880–892https://doi.org/10.14778/3204028.3204032

Published:01 April 2018Publication History

Proceedings of the VLDB Endowment

Abstract

Functional dependencies (FDs) play an important role in maintaining data quality. They can be used to enforce data consistency and to guide repairs over a database. In this work, we investigate the problem of missing values and its impact on FD discovery. When using existing FD discovery algorithms, some genuine FDs could not be detected precisely due to missing values or some non-genuine FDs can be discovered even though they are caused by missing values with a certain NULL semantics. We define a notion of genuineness and propose algorithms to compute the genuineness score of a discovered FD. This can be used to identify the genuine FDs among the set of all valid dependencies that hold on the data. We evaluate the quality of our method over various real-world and semi-synthetic datasets with extensive experiments. The results show that our method performs well for relatively large FD sets and is able to accurately capture genuine FDs.

References

A. Badia and D. Lemire. Functional dependencies with null markers. Comput. J., 58(5):1160--1168, 2015.Google ScholarCross Ref
A. Badia and D. Lemire. On desirable semantics of functional dependencies over databases with incomplete information. arXiv preprint arXiv:1703.08198, 2017.Google Scholar
L. E. Bertossi. Database Repairing and Consistent Query Answering. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2011. Google ScholarDigital Library
G. Beskales, I. F. Ilyas, and L. Golab. Sampling the repairs of functional dependency violations under hard constraints. Proc. of the VLDB Endowment, 3(1):197--207, 2010. Google ScholarDigital Library
T. Bleifuß, S. Bülow, J. Frohnhofen, J. Risch, G. Wiese, S. Kruse, T. Papenbrock, and F. Naumann. Approximate discovery of functional dependencies for large datasets. In Proc. of the International Conference on Information and Knowledge Management (CIKM), 2016. Google ScholarDigital Library
P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In Proc. of the International Conference on Data Engineering (ICDE), pages 746--755, 2007.Google ScholarCross Ref
L. Caruccio, V. Deufemia, and G. Polese. Relaxed functional dependencies: A survey of approaches. IEEE Transactions on Knowledge and Data Engineering, 28(1):147--165, 2016. Google ScholarDigital Library
F. Chiang and R. J. Miller. A unified model for data and constraint repair. In Proc. of the International Conference on Data Engineering (ICDE), pages 446--457, 2011. Google ScholarDigital Library
N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. VLDB Journal, 16(4):523--544, 2007. Google ScholarDigital Library
S. De and S. Kambhampati. Defining and mining functional dependencies in probabilistic databases. arXiv preprint arXiv:1005.4714, 2010.Google Scholar
R. D. De Veaux and D. J. Hand. How to lie with bad data. Statist. Sci., 20(3):231--238, 08 2005.Google ScholarCross Ref
DemandGen. Assessing the impact of dirty data on sales & marketing performance, https://www.zoominfo.com/business/mktg/ebooks/dirtydataebook.pdf, 2017.Google Scholar
B. Efron. Missing data, imputation, and the bootstrap. Journal of the American Statistical Association, 89(426):463--475, 1994.Google ScholarCross Ref
W. Fan and F. Geerts. Uniform dependency language for improving data quality. IEEE Data Engineering Bulletin, 34(3):34--42, 2011.Google Scholar
W. Fan and F. Geerts. Foundations of Data Quality Management. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2012. Google ScholarDigital Library
W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for capturing data inconsistencies. ACM Transactions on Database Systems (TODS), 33(2):6:1--6:48, 2008. Google ScholarDigital Library
W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. VLDB Journal, 21(2):213--238, 2012. Google ScholarDigital Library
Gartner. Dirty data is a business problem, not an it problem, http://www.gartner.com/newsroom/id/501733, 2007.Google Scholar
A. Haug, F. Zachariassen, and D. van Liempd. The costs of poor data quality. Journal of Industrial Engineering and Management, 4(2):168--193, 2011.Google ScholarCross Ref
Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. Efficient discovery of functional and approximate dependencies using partitions. In Proc. of the International Conference on Data Engineering (ICDE), pages 392--401, 1998. Google ScholarDigital Library
J. Kivinen and H. Mannila. Approximate inference of functional dependencies from relations. Theor. Comput. Sci., 149(1):129--149, 1995. Google ScholarDigital Library
C. Koch and D. Olteanu. Conditioning probabilistic databases. Proceedings of the VLDB Endowment, 1(1):313--325, 2008. Google ScholarDigital Library
H. Köhler, U. Leck, S. Link, and X. Zhou. Possible and certain keys for SQL. VLDB Journal, 25(4):571--596, 2016. Google ScholarDigital Library
H. Köhler, S. Link, and X. Zhou. Discovering meaningful certain keys from incomplete and inconsistent relations. IEEE Data Engineering Bulletin, 39(2):21--37, 2016.Google Scholar
M. Levene and G. Loizou. Axiomatisation of functional dependencies in incomplete relations. Theoretical Computer Science, 206(1--2):283--300, 1998. Google ScholarDigital Library
M. Lichman. UCI machine learning repository http://archive.ics.uci.edu/ml, 2013.Google Scholar
J. Liu, J. Li, C. Liu, and Y. Chen. Discover dependencies from data - a review. IEEE Transactions on Knowledge and Data Engineering (TKDE), 24(2):251--264, 2012. Google ScholarDigital Library
M. Mazuran, E. Quintarelli, L. Tanca, and S. Ugolini. Semi-automatic support for evolving functional dependencies. In Proc. of the International Conference on Extending Database Technology (EDBT), pages 293--304, 2016.Google Scholar
N. Novelli and R. Cicchetti. Fun: An efficient algorithm for mining functional and embedded dependencies. In Proc. of the International Conference on Database Theory (ICDT), pages 189--203, 2001. Google ScholarDigital Library
T. Papenbrock, J. Ehrlich, J. Marten, T. Neubert, J.-P. Rudolph, M. Schnberg, J. Zwiener, and F. Naumann. Functional dependency discovery: An experimental evaluation of seven algorithms. Proc. of the VLDB Endowment, 8(10):1082--1093, 2015. Google ScholarDigital Library
T. Papenbrock and F. Naumann. Data-driven schema normalization. In Proc. of the International Conference on Extending Database Technology (EDBT), 2017.Google Scholar
S. Song, A. Zhang, L. Chen, and J. Wang. Enriching data imputation with extensive similarity neighbors. Proc. of the VLDB Endowment, 8(11):1286--1297, 2015. Google ScholarDigital Library
S. Van Buuren. CRC/Chapman & Hall, 2012.Google Scholar
D. Z. Wang, X. L. Dong, A. D. Sarma, M. J. Franklin, and A. Y. Halevy. Functional dependency generation and applications in pay-as-you-go data integration systems. In Proc. of the ACM SIGMOD Workshop on the Web and Databases (WebDB), 2009.Google Scholar

Recommendations

A Survey Study on XML Functional Dependencies
ISDPE '07: Proceedings of the The First International Symposium on Data, Privacy, and E-Commerce

There are two major kinds of XML functional dependency (FD) definitions. The first kind of XML FD includes Tree-tuple-based XML FD (tFD) and Path-based XML FD (pFD), and the second kind of XML FD includes Extended-path-based XML FD (epFD), Sub-graph-...
Read More
Functional dependencies for XML
WAIM'10: Proceedings of the 2010 international conference on Web-age information management

In this paper, we present a new approach for defining functional dependencies for XML (XFDs) on XML Schema. While showing how to extend XML Schema, we analyze the expressive power of our XFDs. We focus on supporting complex value (e.g. list, set) in our ...
Read More
Missing Values: Proposition of a Typology and Characterization with an Association Rule-Based Model
DaWaK '09: Proceedings of the 11th International Conference on Data Warehousing and Knowledge Discovery

Handling missing values when tackling real-world datasets is a great challenge arousing the interest of many scientific communities. Many works propose completion methods or implement new data mining techniques tolerating the presence of missing values. ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 11, Issue 8
April 2018
94 pages
ISSN:2150-8097
Editors:
Jian Pei
Simon Fraser University
,
Sihem Amer-Yahia
University of Grenoble Alpes, CNRS
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 April 2018
Published in pvldb Volume 11, Issue 8
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 15
  Total Citations
  View Citations
- 153
  Total Downloads
- Downloads (Last 12 months)42
- Downloads (Last 6 weeks)3
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Discovery of genuine functional dependencies from relational data with missing values

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

A Survey Study on XML Functional Dependencies

Functional dependencies for XML

Missing Values: Proposition of a Typology and Characterization with an Association Rule-Based Model

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Discovery of genuine functional dependencies from relational data with missing values

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

A Survey Study on XML Functional Dependencies

Functional dependencies for XML

Missing Values: Proposition of a Typology and Characterization with an Association Rule-Based Model

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media