Abstract
In this paper, we present GDR, a Guided Data Repair framework that incorporates user feedback in the cleaning process to enhance and accelerate existing automatic repair techniques while minimizing user involvement. GDR consults the user on the updates that are most likely to improve data quality. GDR also uses machine learning methods to identify and apply the correct updates directly to the database, without involving the user in those specific updates. To rank potential updates for user consultation, we first group these repairs and quantify the utility of each group using the decision-theoretic concept of value of information (VOI). We then apply active learning to order updates within a group based on their ability to improve the learned model. User feedback is used to repair the database and to adaptively refine the training set for the model. We empirically evaluate GDR on a real-world dataset and show significant improvement in data quality using our user-guided repair process. We also assess the trade-off between user effort and the resulting data quality.
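The two-level ranking described above (score groups of candidate updates, then order updates inside each group) can be illustrated with a minimal sketch. This is not the GDR implementation: the abstract does not give the VOI formula, so the group score below is a simplified stand-in (expected benefit summed over the group), and the within-group order uses plain uncertainty sampling as one common active learning heuristic. The `Update` fields, the function names, and the example values are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    """A candidate repair; every field is a hypothetical name for this sketch."""
    group: str        # grouping key, e.g. the attribute/value the repair suggests
    cell: str         # identifier of the cell the repair would change
    p_correct: float  # learned model's estimate that the update is correct
    benefit: float    # assumed quality gain if the update is applied and correct


def group_utility(updates):
    # Simplified stand-in for VOI (assumption): expected benefit summed over the group.
    return sum(u.p_correct * u.benefit for u in updates)


def uncertainty(update):
    # 1.0 when the model is maximally unsure (p = 0.5), 0.0 when it is certain.
    return 1.0 - 2.0 * abs(update.p_correct - 0.5)


def rank_for_consultation(updates):
    """Return groups ordered by utility, each group ordered most-uncertain first."""
    groups = defaultdict(list)
    for u in updates:
        groups[u.group].append(u)
    ranked_groups = sorted(groups.values(), key=group_utility, reverse=True)
    return [sorted(g, key=uncertainty, reverse=True) for g in ranked_groups]


candidates = [
    Update("city=Chicago", "t1.city", 0.90, 1.0),
    Update("city=Chicago", "t2.city", 0.55, 1.0),
    Update("zip=47906", "t3.zip", 0.60, 1.0),
]
ranked = rank_for_consultation(candidates)
```

In this toy input, the `city=Chicago` group is presented first because its summed expected benefit is higher, and within it the update the model is least sure about is shown to the user first. A summed score favors larger groups; a real system would need a principled utility definition, which the abstract only names.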