research-article

Guided data repair

Published: 01 February 2011
Abstract

In this paper we present GDR, a Guided Data Repair framework that incorporates user feedback into the cleaning process to enhance and accelerate existing automatic repair techniques while minimizing user involvement. GDR consults the user on the updates that are most likely to improve data quality. GDR also uses machine learning methods to identify correct updates and apply them directly to the database, without involving the user in those specific updates. To rank potential updates for user consultation, we first group the candidate repairs and quantify the utility of each group using the decision-theoretic concept of value of information (VOI). We then apply active learning to order the updates within a group by their ability to improve the learned model. User feedback is used both to repair the database and to adaptively refine the training set for the model. We empirically evaluate GDR on a real-world dataset and show significant improvement in data quality using our user-guided repair process. We also assess the trade-off between user effort and the resulting data quality.
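
The two-level ranking described in the abstract (rank groups of updates by utility, then order updates within each group by how informative they are to the learner) can be illustrated with a minimal sketch. This is not the paper's algorithm: the grouping key, the `p_correct` and `benefit` fields, the expected-benefit score used as a stand-in for VOI, and the uncertainty measure are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class CandidateUpdate:
    """A suggested repair. All fields are hypothetical, for illustration only."""
    tuple_id: int
    attribute: str
    new_value: str
    p_correct: float  # assumed model estimate that the update is correct
    benefit: float    # assumed quality gain if the update is correct


def group_updates(updates: List[CandidateUpdate]) -> Dict[Tuple[str, str], List[CandidateUpdate]]:
    """Group updates that set the same attribute to the same value (an assumed grouping rule)."""
    groups: Dict[Tuple[str, str], List[CandidateUpdate]] = {}
    for u in updates:
        groups.setdefault((u.attribute, u.new_value), []).append(u)
    return groups


def group_utility(group: List[CandidateUpdate]) -> float:
    """Simplified stand-in for VOI: expected benefit summed over the group."""
    return sum(u.p_correct * u.benefit for u in group)


def uncertainty(u: CandidateUpdate) -> float:
    """Largest when p_correct is 0.5, zero when p_correct is 0 or 1."""
    return 1.0 - abs(2.0 * u.p_correct - 1.0)


def rank_for_consultation(updates: List[CandidateUpdate]) -> List[List[CandidateUpdate]]:
    """Order groups by utility, then order each group's updates by model uncertainty."""
    ranked_groups = sorted(group_updates(updates).values(), key=group_utility, reverse=True)
    return [sorted(g, key=uncertainty, reverse=True) for g in ranked_groups]


example = [
    CandidateUpdate(1, "city", "Chicago", p_correct=0.90, benefit=2.0),
    CandidateUpdate(2, "city", "Chicago", p_correct=0.55, benefit=2.0),
    CandidateUpdate(3, "zip", "46360", p_correct=0.60, benefit=1.0),
]
for group in rank_for_consultation(example):
    print([u.tuple_id for u in group])
```

In this sketch the user would be shown the highest-utility group first, and within it the updates the model is least sure about, so that each answer is most likely to change the model. Updates with `p_correct` near 1 are the ones a system of this kind could apply without consulting the user.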

Published in

Proceedings of the VLDB Endowment, Volume 4, Issue 5
February 2011
71 pages

Publisher

VLDB Endowment