skip to main content
research-article

Discovering data quality rules

Published:01 August 2008Publication History
Skip Abstract Section

Abstract

Dirty data is a serious problem for businesses leading to incorrect decision making, inefficient daily operations, and ultimately wasting both time and money. Dirty data often arises when domain constraints and business rules, meant to preserve data consistency and accuracy, are enforced incompletely or not at all in application code.

In this work, we propose a new data-driven tool that can be used within an organization's data quality management process to suggest possible rules, and to identify conformant and non-conformant records. Data quality rules are known to be contextual, so we focus on the discovery of context-dependent rules. Specifically, we search for conditional functional dependencies (CFDs), that is, functional dependencies that hold only over a portion of the data. The output of our tool is a set of functional dependencies together with the context in which they hold (for example, a rule that states for CS graduate courses, the course number and term functionally determines the room and instructor). Since the input to our tool will likely be a dirty database, we also search for CFDs that almost hold. We return these rules together with the non-conformant records (as these are potentially dirty records).

We present effective algorithms for discovering CFDs and dirty values in a data instance. Our discovery algorithm searches for minimal CFDs among the data values and prunes redundant candidates. No universal objective measures of data quality or data quality rules are known. Hence, to avoid returning an unnecessarily large number of CFDs and only those that are most interesting, we evaluate a set of interest metrics and present comparative results using real datasets. We also present an experimental study showing the scalability of our techniques.

References

  1. C. Aggarwal and P. Yu. Mining large itemsets for association rules. Data Eng Bull, 21(1):23--31, 1998.Google ScholarGoogle Scholar
  2. R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In SIGMOD '93, pages 207--216, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB '94, pages 487--499, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. A. Asuncion and D. Newman. UCI ML repository, 2007.Google ScholarGoogle Scholar
  5. P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE '07, pages 746--755, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  6. L. Bravo, W. Fan, and S. Ma. Extending dependencies with conditions. In VLDB '07, pages 243--254, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD '97, pages 255--264, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding interesting associations without support pruning. IEEE TKDE, 13(1):64--78, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In VLDB '07, pages 315--326, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or how to build a data quality browser. In SIGMOD '02, pages 240--251. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. W. Eckerson. Achieving Business Success through a Commitment to High Quality Data. Technical report, Data Warehousing Institute, 2002.Google ScholarGoogle Scholar
  12. H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning: Language, model, and algorithms. In VLDB '01, pages 371--380, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Y. Huhtala, J. Kinen, P. Porkka, and H. Toivonen. Efficient discovery of functional and approximate dependencies using partitions. In ICDE '98, pages 392--401, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. H. V. Jagadish, J. Madar, and R. T. Ng. Semantic compression and pattern extraction with fascicles. In VLDB '99, pages 186--198, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. H. V. Jagadish, R. T. Ng, B. C. Ooi, and A. K. H. Tung. Itcompress: An iterative semantic compression algorithm. In ICDE '04, page 646, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. J. Kivinen and H. Mannila. Approximate inference of functional dependencies from relations. Theoretical Computer Science, 149(1):129--149, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. L. V. S. Lakshmanan, R. T. Ng, J. Han, and A. Pang. Optimization of constrained frequent set queries with 2-variable constraints. In SIGMOD '99, pages 157--168. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. S. Lopes, J.-M. Petit, and L. Lakhal. Efficient discovery of functional dependencies and armstrong relations. In EDBT '00, pages 350--364, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. M. J. Maher. Constrained dependencies. Theoretical Computer Science, 173(1):113--149, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. J. Pei and J. Han. Constrained frequent pattern mining: a pattern-growth view. SIGKDD Explor. '02, 4(1):31--39. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. V. Raman and J. Hellerstein. Potter's wheel: An interactive data cleaning system. In VLDB '01, pages 381--390, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. I. Savnik and P. A. Flach. Discovery of multivalued dependencies from relations. Intelligent Data Analysis, 4(3,4):195--211, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. P. Vassiliadis, Z. Vagena, S. Skiadopoulos, N. Karayannidis, and T. Sellis. Arktos: towards the modeling, design, control and execution of etl processes. Inf. Syst. '01, 26(8):537--561. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. C. Wyss, C. Giannella, and E. L. Robertson. Fastfds: A heuristic-driven, depth-first alg for mining fds from relations. In DaWaK '01, pages 101--110, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. M. Zaki. Generating non-redundant association rules. In KDD '00, pages 34--43, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Discovering data quality rules

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader