skip to main content
10.1145/1066157.1066175acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article

A cost-based model and effective heuristic for repairing constraints by value modification

Published:14 June 2005Publication History

ABSTRACT

Data integrated from multiple sources may contain inconsistencies that violate integrity constraints. The constraint repair problem attempts to find "low cost" changes that, when applied, will cause the constraints to be satisfied. While in most previous work repair cost is stated in terms of tuple insertions and deletions, we follow recent work to define a database repair as a set of value modifications. In this context, we introduce a novel cost framework that allows for the application of techniques from record-linkage to the search for good repairs. We prove that finding minimal-cost repairs in this model is NP-complete in the size of the database, and introduce an approach to heuristic repair-construction based on equivalence classes of attribute values. Following this approach, we define two greedy algorithms. While these simple algorithms take time cubic in the size of the database, we develop optimizations inspired by algorithms for duplicate-record detection that greatly improve scalability. We evaluate our framework and algorithms on synthetic and real data, and show that our proposed optimizations greatly improve performance at little or no cost in repair quality.

References

  1. M. Arenas, L. E. Bertossi, and J. Chomicki. Consistent query answers in inconsistent databases. In PODS, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. L. Bertossi and J. Chomicki, Query Answering in Inconsistent Databases. In J. Chomicki, R. van der Meyden, and G. Saake, editors, Logics for Emerging Applications of Databases, Springer, 2003.]]Google ScholarGoogle Scholar
  3. L. Bravo and L. Bertossi. Logic programs for consistently querying data integration systems. In IJCAI, 2003.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. P. Buneman, S. Khanna, and W.-C. Tan. On propagation of deletion and annotation through views. In PODS, 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. A. Cali, D. Lembo, and R. Rosati. On the decidability and complexity of query answering over inconsistent and incomplete databases. In PODS, 2003.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. A. Cali, D. Lembo, and R. Rosati. Query rewriting and answering under constraints in data integration systems. In IJCAI, 2003.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. J. Chomicki and J. Marcinkowski. On the Computational Complexity of Minimal-Change Integrity Maintenance in Relational Databases. In L. Bertossi, A. Hunter, and T. Schaub, editors, Integrity Tolerance. Springer, 2004.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. J. Chomicki and J. Marcinkowski. "Minimal-Change Integrity Maintenance Using Tuple Deletions". Information and Computation, in press.]]Google ScholarGoogle Scholar
  9. W. Cohen, P. Ravikumar, and S. Feinberg. A comparison of string-distance metrics for name-matching tasks. In IIWeb, 2003.]]Google ScholarGoogle Scholar
  10. M. G. Elfeky. V. S. Verykios, and A. K. Elmagarmid. TAILOR: A record linkage toolbox. In ICDE, 2002.]]Google ScholarGoogle ScholarCross RefCross Ref
  11. E. Franconi, A. L. Palma, N. Leone, S. Perri, and F. Scarcello. Census data repair: a challenging application of disjunctive logic programming. In LPAR, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita. "AJAX: An Extensible Data Cleaning Tool". In SIGMOD, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita. "Declarative Data Cleaning: Language, Model and Algorithms". In VLDB, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. M. Gertz and U. Lipeck. A diagnostic approach to repairing constraint violations in databases. In DX, 1995.]]Google ScholarGoogle Scholar
  15. G. Greco, S. Greco, and E. Zumpano. A logical framework for querying and repairing inconsistent databases. IEEE Trans. Knowl. Data Eng, 15(6):1389--1408, 2003.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. M. A. Hernandez and S. Stolfo. "Real-World Data is Dirty: Data Cleansing and the Merge/Purge Problem". Data Mining and Knowledge Discovery, 2(1):9--37, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. International Standard ISO/IEC 9075-2:2003(E). Information technology: Database languages, SQL Part 2 (Foundation, 2nd edition), 2003.]]Google ScholarGoogle Scholar
  18. M. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of tampa florida. Journal of the American Statistical Association, 89:414--420, 1989.]]Google ScholarGoogle ScholarCross RefCross Ref
  19. Lavastorm. Making the Case for Automated Revenue Assurance Solutions. http://www.lavastormtech.com.]]Google ScholarGoogle Scholar
  20. Lucent Technologies. Revenue Assurance Products. http://www.lucent.com/solutions/revenue.html.]]Google ScholarGoogle Scholar
  21. A. Monge. Matching algorithm within a duplicate detection system. IEEE Data Engineering Bulletin, 23(4), 2000.]]Google ScholarGoogle Scholar
  22. A. Monge and C. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In DKMD, 1997.]]Google ScholarGoogle Scholar
  23. Network Resource Management: Inventory Takes Stage. Rhk research, July 2002. RHK Market Report.]]Google ScholarGoogle Scholar
  24. E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. VLDB Journal, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4), 2000.]]Google ScholarGoogle Scholar
  26. V. Raman and J. M. Hellerstein. "Potter's Wheel: An Interactive Data Cleaning System". In VLDB, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Transaction Processing Performance Council. TPC-H Benchmark. http://www.tpc.org.]]Google ScholarGoogle Scholar
  28. J. Wijsen. Condensed representation of database repairs for consistent query answering. In ICDT, 2003.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. W. Winkler. Advanced methods for record linkage. Technical report. Statistical Research Division, U.S. Bureau of the Census., 1994.]]Google ScholarGoogle Scholar
  1. A cost-based model and effective heuristic for repairing constraints by value modification

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGMOD '05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data
      June 2005
      990 pages
      ISBN:1595930604
      DOI:10.1145/1066157
      • Conference Chair:
      • Fatma Ozcan

      Copyright © 2005 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 14 June 2005

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • Article

      Acceptance Rates

      Overall Acceptance Rate785of4,003submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader