Abstract
Dirty data is a serious problem for businesses leading to incorrect decision making, inefficient daily operations, and ultimately wasting both time and money. Dirty data often arises when domain constraints and business rules, meant to preserve data consistency and accuracy, are enforced incompletely or not at all in application code.
In this work, we propose a new data-driven tool that can be used within an organization's data quality management process to suggest possible rules, and to identify conformant and non-conformant records. Data quality rules are known to be contextual, so we focus on the discovery of context-dependent rules. Specifically, we search for conditional functional dependencies (CFDs), that is, functional dependencies that hold only over a portion of the data. The output of our tool is a set of functional dependencies together with the context in which they hold (for example, a rule that states for CS graduate courses, the course number and term functionally determines the room and instructor). Since the input to our tool will likely be a dirty database, we also search for CFDs that almost hold. We return these rules together with the non-conformant records (as these are potentially dirty records).
We present effective algorithms for discovering CFDs and dirty values in a data instance. Our discovery algorithm searches for minimal CFDs among the data values and prunes redundant candidates. No universal objective measures of data quality or data quality rules are known. Hence, to avoid returning an unnecessarily large number of CFDs and only those that are most interesting, we evaluate a set of interest metrics and present comparative results using real datasets. We also present an experimental study showing the scalability of our techniques.
- C. Aggarwal and P. Yu. Mining large itemsets for association rules. Data Eng Bull, 21(1):23--31, 1998.Google Scholar
- R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In SIGMOD '93, pages 207--216, 1993. Google ScholarDigital Library
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB '94, pages 487--499, 1994. Google ScholarDigital Library
- A. Asuncion and D. Newman. UCI ML repository, 2007.Google Scholar
- P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE '07, pages 746--755, 2007.Google ScholarCross Ref
- L. Bravo, W. Fan, and S. Ma. Extending dependencies with conditions. In VLDB '07, pages 243--254, 2007. Google ScholarDigital Library
- S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD '97, pages 255--264, 1997. Google ScholarDigital Library
- E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding interesting associations without support pruning. IEEE TKDE, 13(1):64--78, 2001. Google ScholarDigital Library
- G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In VLDB '07, pages 315--326, 2007. Google ScholarDigital Library
- T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining database structure; or how to build a data quality browser. In SIGMOD '02, pages 240--251. Google ScholarDigital Library
- W. Eckerson. Achieving Business Success through a Commitment to High Quality Data. Technical report, Data Warehousing Institute, 2002.Google Scholar
- H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C.-A. Saita. Declarative data cleaning: Language, model, and algorithms. In VLDB '01, pages 371--380, 2001. Google ScholarDigital Library
- Y. Huhtala, J. Kinen, P. Porkka, and H. Toivonen. Efficient discovery of functional and approximate dependencies using partitions. In ICDE '98, pages 392--401, 1998. Google ScholarDigital Library
- H. V. Jagadish, J. Madar, and R. T. Ng. Semantic compression and pattern extraction with fascicles. In VLDB '99, pages 186--198, 1999. Google ScholarDigital Library
- H. V. Jagadish, R. T. Ng, B. C. Ooi, and A. K. H. Tung. Itcompress: An iterative semantic compression algorithm. In ICDE '04, page 646, 2004. Google ScholarDigital Library
- J. Kivinen and H. Mannila. Approximate inference of functional dependencies from relations. Theoretical Computer Science, 149(1):129--149, 1995. Google ScholarDigital Library
- L. V. S. Lakshmanan, R. T. Ng, J. Han, and A. Pang. Optimization of constrained frequent set queries with 2-variable constraints. In SIGMOD '99, pages 157--168. Google ScholarDigital Library
- S. Lopes, J.-M. Petit, and L. Lakhal. Efficient discovery of functional dependencies and armstrong relations. In EDBT '00, pages 350--364, 2000. Google ScholarDigital Library
- M. J. Maher. Constrained dependencies. Theoretical Computer Science, 173(1):113--149, 1997. Google ScholarDigital Library
- J. Pei and J. Han. Constrained frequent pattern mining: a pattern-growth view. SIGKDD Explor. '02, 4(1):31--39. Google ScholarDigital Library
- V. Raman and J. Hellerstein. Potter's wheel: An interactive data cleaning system. In VLDB '01, pages 381--390, 2001. Google ScholarDigital Library
- I. Savnik and P. A. Flach. Discovery of multivalued dependencies from relations. Intelligent Data Analysis, 4(3,4):195--211, 2000. Google ScholarDigital Library
- P. Vassiliadis, Z. Vagena, S. Skiadopoulos, N. Karayannidis, and T. Sellis. Arktos: towards the modeling, design, control and execution of etl processes. Inf. Syst. '01, 26(8):537--561. Google ScholarDigital Library
- C. Wyss, C. Giannella, and E. L. Robertson. Fastfds: A heuristic-driven, depth-first alg for mining fds from relations. In DaWaK '01, pages 101--110, 2001. Google ScholarDigital Library
- M. Zaki. Generating non-redundant association rules. In KDD '00, pages 34--43, 2000. Google ScholarDigital Library
Index Terms
- Discovering data quality rules
Recommendations
Discovering dynamic integrity rules with a rules-based tool for data quality analyzing
CompSysTech '10: Proceedings of the 11th International Conference on Computer Systems and Technologies and Workshop for PhD Students in Computing on International Conference on Computer Systems and TechnologiesRules based approaches for data quality solutions often use business rules or integrity rules for data monitoring purpose. Integrity rules are constraints on data derived from business rules into a formal form in order to allow computerization. One of ...
Discovering association rules change from large databases
AICI'11: Proceedings of the Third international conference on Artificial intelligence and computational intelligence - Volume Part IDiscovering association rules and association rules change (ARC) from existing large databases is an important problem. This paper presents an approach based on multi-hash chain structures to mine association rules change from large database with ...
Mining fuzzy association rules from low-quality data
Special Issue on Knowledge Extraction from Low Quality Data: Theoretical, Methodological and Practical IssuesData mining is most commonly used in attempts to induce association rules from databases which can help decision-makers easily analyze the data and make good decisions regarding the domains concerned. Different studies have proposed methods for mining ...
Comments