research-article

Discovering data quality rules

Authors:
Fei Chiang

University of Toronto

University of Toronto
View Profile

,
Renée J. Miller

University of Toronto

University of Toronto
View Profile

Proceedings of the VLDB Endowment Volume 1 Issue 1pp 1166–1177https://doi.org/10.14778/1453856.1453980

Published:01 August 2008Publication History

Proceedings of the VLDB Endowment

Abstract

Dirty data is a serious problem for businesses leading to incorrect decision making, inefficient daily operations, and ultimately wasting both time and money. Dirty data often arises when domain constraints and business rules, meant to preserve data consistency and accuracy, are enforced incompletely or not at all in application code.

In this work, we propose a new data-driven tool that can be used within an organization's data quality management process to suggest possible rules, and to identify conformant and non-conformant records. Data quality rules are known to be contextual, so we focus on the discovery of context-dependent rules. Specifically, we search for conditional functional dependencies (CFDs), that is, functional dependencies that hold only over a portion of the data. The output of our tool is a set of functional dependencies together with the context in which they hold (for example, a rule that states for CS graduate courses, the course number and term functionally determines the room and instructor). Since the input to our tool will likely be a dirty database, we also search for CFDs that almost hold. We return these rules together with the non-conformant records (as these are potentially dirty records).

We present effective algorithms for discovering CFDs and dirty values in a data instance. Our discovery algorithm searches for minimal CFDs among the data values and prunes redundant candidates. No universal objective measures of data quality or data quality rules are known. Hence, to avoid returning an unnecessarily large number of CFDs and only those that are most interesting, we evaluate a set of interest metrics and present comparative results using real datasets. We also present an experimental study showing the scalability of our techniques.

References

C. Aggarwal and P. Yu. Mining large itemsets for association rules. Data Eng Bull, 21(1):23--31, 1998.Google Scholar
R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In SIGMOD '93, pages 207--216, 1993. Google ScholarDigital Library
R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB '94, pages 487--499, 1994. Google ScholarDigital Library
A. Asuncion and D. Newman. UCI ML repository, 2007.Google Scholar
P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE '07, pages 746--755, 2007.Google ScholarCross Ref
L. Bravo, W. Fan, and S. Ma. Extending dependencies with conditions. In VLDB '07, pages 243--254, 2007. Google ScholarDigital Library
S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD '97, pages 255--264, 1997. Google ScholarDigital Library
E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding interesting associations without support pruning. IEEE TKDE, 13(1):64--78, 2001. Google ScholarDigital Library
G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In VLDB '07, pages 315--326, 2007. Google ScholarDigital Library
T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or how to build a data quality browser. In SIGMOD '02, pages 240--251. Google ScholarDigital Library
W. Eckerson. Achieving Business Success through a Commitment to High Quality Data. Technical report, Data Warehousing Institute, 2002.Google Scholar
H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning: Language, model, and algorithms. In VLDB '01, pages 371--380, 2001. Google ScholarDigital Library
Y. Huhtala, J. Kinen, P. Porkka, and H. Toivonen. Efficient discovery of functional and approximate dependencies using partitions. In ICDE '98, pages 392--401, 1998. Google ScholarDigital Library
H. V. Jagadish, J. Madar, and R. T. Ng. Semantic compression and pattern extraction with fascicles. In VLDB '99, pages 186--198, 1999. Google ScholarDigital Library
H. V. Jagadish, R. T. Ng, B. C. Ooi, and A. K. H. Tung. Itcompress: An iterative semantic compression algorithm. In ICDE '04, page 646, 2004. Google ScholarDigital Library
J. Kivinen and H. Mannila. Approximate inference of functional dependencies from relations. Theoretical Computer Science, 149(1):129--149, 1995. Google ScholarDigital Library
L. V. S. Lakshmanan, R. T. Ng, J. Han, and A. Pang. Optimization of constrained frequent set queries with 2-variable constraints. In SIGMOD '99, pages 157--168. Google ScholarDigital Library
S. Lopes, J.-M. Petit, and L. Lakhal. Efficient discovery of functional dependencies and armstrong relations. In EDBT '00, pages 350--364, 2000. Google ScholarDigital Library
M. J. Maher. Constrained dependencies. Theoretical Computer Science, 173(1):113--149, 1997. Google ScholarDigital Library
J. Pei and J. Han. Constrained frequent pattern mining: a pattern-growth view. SIGKDD Explor. '02, 4(1):31--39. Google ScholarDigital Library
V. Raman and J. Hellerstein. Potter's wheel: An interactive data cleaning system. In VLDB '01, pages 381--390, 2001. Google ScholarDigital Library
I. Savnik and P. A. Flach. Discovery of multivalued dependencies from relations. Intelligent Data Analysis, 4(3,4):195--211, 2000. Google ScholarDigital Library
P. Vassiliadis, Z. Vagena, S. Skiadopoulos, N. Karayannidis, and T. Sellis. Arktos: towards the modeling, design, control and execution of etl processes. Inf. Syst. '01, 26(8):537--561. Google ScholarDigital Library
C. Wyss, C. Giannella, and E. L. Robertson. Fastfds: A heuristic-driven, depth-first alg for mining fds from relations. In DaWaK '01, pages 101--110, 2001. Google ScholarDigital Library
M. Zaki. Generating non-redundant association rules. In KDD '00, pages 34--43, 2000. Google ScholarDigital Library

Index Terms

Discovering data quality rules
1. Information systems

Recommendations

Discovering dynamic integrity rules with a rules-based tool for data quality analyzing
CompSysTech '10: Proceedings of the 11th International Conference on Computer Systems and Technologies and Workshop for PhD Students in Computing on International Conference on Computer Systems and Technologies

Rules based approaches for data quality solutions often use business rules or integrity rules for data monitoring purpose. Integrity rules are constraints on data derived from business rules into a formal form in order to allow computerization. One of ...
Read More
Discovering association rules change from large databases
AICI'11: Proceedings of the Third international conference on Artificial intelligence and computational intelligence - Volume Part I

Discovering association rules and association rules change (ARC) from existing large databases is an important problem. This paper presents an approach based on multi-hash chain structures to mine association rules change from large database with ...
Read More
Mining fuzzy association rules from low-quality data
Special Issue on Knowledge Extraction from Low Quality Data: Theoretical, Methodological and Practical Issues

Data mining is most commonly used in attempts to induce association rules from databases which can help decision-makers easily analyze the data and make good decisions regarding the domains concerned. Different studies have proposed methods for mining ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 1, Issue 1
August 2008
1216 pages
ISSN:2150-8097
Editors:
Peter Buneman,
Beng Chin Ooi,
Kenneth Ross,
Gerald Weber
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 August 2008
Published in pvldb Volume 1, Issue 1
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 71
  Total Citations
  View Citations
- 1,774
  Total Downloads
- Downloads (Last 12 months)115
- Downloads (Last 6 weeks)9
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Discovering data quality rules

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Discovering dynamic integrity rules with a rules-based tool for data quality analyzing

Discovering association rules change from large databases

Mining fuzzy association rules from low-quality data

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Discovering data quality rules

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Discovering dynamic integrity rules with a rules-based tool for data quality analyzing

Discovering association rules change from large databases

Mining fuzzy association rules from low-quality data

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media