ABSTRACT
Data integrated from multiple sources may contain inconsistencies that violate integrity constraints. The constraint repair problem attempts to find "low cost" changes that, when applied, will cause the constraints to be satisfied. While in most previous work repair cost is stated in terms of tuple insertions and deletions, we follow recent work to define a database repair as a set of value modifications. In this context, we introduce a novel cost framework that allows for the application of techniques from record-linkage to the search for good repairs. We prove that finding minimal-cost repairs in this model is NP-complete in the size of the database, and introduce an approach to heuristic repair-construction based on equivalence classes of attribute values. Following this approach, we define two greedy algorithms. While these simple algorithms take time cubic in the size of the database, we develop optimizations inspired by algorithms for duplicate-record detection that greatly improve scalability. We evaluate our framework and algorithms on synthetic and real data, and show that our proposed optimizations greatly improve performance at little or no cost in repair quality.
- M. Arenas, L. E. Bertossi, and J. Chomicki. Consistent query answers in inconsistent databases. In PODS, 1999.]] Google ScholarDigital Library
- L. Bertossi and J. Chomicki, Query Answering in Inconsistent Databases. In J. Chomicki, R. van der Meyden, and G. Saake, editors, Logics for Emerging Applications of Databases, Springer, 2003.]]Google Scholar
- L. Bravo and L. Bertossi. Logic programs for consistently querying data integration systems. In IJCAI, 2003.]] Google ScholarDigital Library
- P. Buneman, S. Khanna, and W.-C. Tan. On propagation of deletion and annotation through views. In PODS, 2002.]] Google ScholarDigital Library
- A. Cali, D. Lembo, and R. Rosati. On the decidability and complexity of query answering over inconsistent and incomplete databases. In PODS, 2003.]] Google ScholarDigital Library
- A. Cali, D. Lembo, and R. Rosati. Query rewriting and answering under constraints in data integration systems. In IJCAI, 2003.]] Google ScholarDigital Library
- J. Chomicki and J. Marcinkowski. On the Computational Complexity of Minimal-Change Integrity Maintenance in Relational Databases. In L. Bertossi, A. Hunter, and T. Schaub, editors, Integrity Tolerance. Springer, 2004.]] Google ScholarDigital Library
- J. Chomicki and J. Marcinkowski. "Minimal-Change Integrity Maintenance Using Tuple Deletions". Information and Computation, in press.]]Google Scholar
- W. Cohen, P. Ravikumar, and S. Feinberg. A comparison of string-distance metrics for name-matching tasks. In IIWeb, 2003.]]Google Scholar
- M. G. Elfeky. V. S. Verykios, and A. K. Elmagarmid. TAILOR: A record linkage toolbox. In ICDE, 2002.]]Google ScholarCross Ref
- E. Franconi, A. L. Palma, N. Leone, S. Perri, and F. Scarcello. Census data repair: a challenging application of disjunctive logic programming. In LPAR, 2001.]] Google ScholarDigital Library
- H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita. "AJAX: An Extensible Data Cleaning Tool". In SIGMOD, 2001.]] Google ScholarDigital Library
- H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita. "Declarative Data Cleaning: Language, Model and Algorithms". In VLDB, 2001.]] Google ScholarDigital Library
- M. Gertz and U. Lipeck. A diagnostic approach to repairing constraint violations in databases. In DX, 1995.]]Google Scholar
- G. Greco, S. Greco, and E. Zumpano. A logical framework for querying and repairing inconsistent databases. IEEE Trans. Knowl. Data Eng, 15(6):1389--1408, 2003.]] Google ScholarDigital Library
- M. A. Hernandez and S. Stolfo. "Real-World Data is Dirty: Data Cleansing and the Merge/Purge Problem". Data Mining and Knowledge Discovery, 2(1):9--37, 1998.]] Google ScholarDigital Library
- International Standard ISO/IEC 9075-2:2003(E). Information technology: Database languages, SQL Part 2 (Foundation, 2nd edition), 2003.]]Google Scholar
- M. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of tampa florida. Journal of the American Statistical Association, 89:414--420, 1989.]]Google ScholarCross Ref
- Lavastorm. Making the Case for Automated Revenue Assurance Solutions. http://www.lavastormtech.com.]]Google Scholar
- Lucent Technologies. Revenue Assurance Products. http://www.lucent.com/solutions/revenue.html.]]Google Scholar
- A. Monge. Matching algorithm within a duplicate detection system. IEEE Data Engineering Bulletin, 23(4), 2000.]]Google Scholar
- A. Monge and C. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In DKMD, 1997.]]Google Scholar
- Network Resource Management: Inventory Takes Stage. Rhk research, July 2002. RHK Market Report.]]Google Scholar
- E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. VLDB Journal, 2001.]] Google ScholarDigital Library
- E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4), 2000.]]Google Scholar
- V. Raman and J. M. Hellerstein. "Potter's Wheel: An Interactive Data Cleaning System". In VLDB, 2001.]] Google ScholarDigital Library
- Transaction Processing Performance Council. TPC-H Benchmark. http://www.tpc.org.]]Google Scholar
- J. Wijsen. Condensed representation of database repairs for consistent query answering. In ICDT, 2003.]] Google ScholarDigital Library
- W. Winkler. Advanced methods for record linkage. Technical report. Statistical Research Division, U.S. Bureau of the Census., 1994.]]Google Scholar
- A cost-based model and effective heuristic for repairing constraints by value modification
Recommendations
Symmetry Breaking Constraints for Value Symmetries in Constraint Satisfaction
Constraint satisfaction problems (CSPs) sometimes contain both variable symmetries and value symmetries , causing adverse effects on CSP solvers based on tree search. As a remedy, symmetry breaking constraints are commonly used. While variable ...
A three-phase matheuristic for capacitated multi-commodity fixed-cost network design with design-balance constraints
This paper proposes a three-phase matheuristic solution strategy for the capacitated multi-commodity fixed-cost network design problem with design-balance constraints. The proposed matheuristic combines exact and neighbourhood-based methods. Tabu search ...
Comments