
Robust and efficient fuzzy match for online data cleaning

Authors:
Surajit Chaudhuri, Microsoft Research
Kris Ganjam, Microsoft Research
Venkatesh Ganti, Microsoft Research
Rajeev Motwani, Stanford University

SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, June 2003, Pages 313–324
https://doi.org/10.1145/872757.872796

Published: 09 June 2003


ABSTRACT

To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation. A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets.
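
The validate-then-fuzzy-match flow described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method: the similarity here is a stand-in (the mean of `difflib.SequenceMatcher` ratios per field), and the `REFERENCE` table, the `clean` function, and the `threshold` value are all hypothetical names and data chosen only for illustration. The linear scan over the reference table is also an assumption made for brevity; it does not reflect the efficient algorithm the paper develops.

```python
from difflib import SequenceMatcher

# Hypothetical reference relation: (product name, description) tuples.
REFERENCE = [
    ("Acme Widget", "steel widget, 10 pack"),
    ("Acme Gadget", "plastic gadget, single"),
    ("Bolt Master", "hex bolts, assorted sizes"),
]


def similarity(a, b):
    """Stand-in similarity: mean per-field SequenceMatcher ratio.

    This is NOT the similarity function proposed in the paper.
    """
    scores = [
        SequenceMatcher(None, x.lower(), y.lower()).ratio()
        for x, y in zip(a, b)
    ]
    return sum(scores) / len(scores)


def clean(tuple_in, reference=REFERENCE, threshold=0.8):
    """Return (reference tuple or None, score) for an incoming tuple."""
    # Step 1: validation by exact match against the reference relation.
    if tuple_in in reference:
        return tuple_in, 1.0
    # Step 2: fuzzy match, here a brute-force scan for the closest tuple.
    best = max(reference, key=lambda r: similarity(tuple_in, r))
    score = similarity(tuple_in, best)
    # Accept the closest tuple only if it is similar enough.
    return (best, score) if score >= threshold else (None, score)
```

Under these assumptions, a misspelled record such as `("Acme Widgit", "steel widget 10 pack")` is mapped to the clean `("Acme Widget", "steel widget, 10 pack")` tuple, while a record that resembles nothing in the table is returned as `None` so it can be routed elsewhere.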


Index Terms

Robust and efficient fuzzy match for online data cleaning
1. Information systems
  1. Data management systems
    1. Database design and models
    2. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Design and analysis of algorithms
    1. Data structures design and analysis
      1. Pattern matching
  2. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)



Published in
SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data
June 2003
702 pages
ISBN: 158113634X
DOI: 10.1145/872757
Conference Chair: Zachary Ives, University of Pennsylvania
General Chair: Yannis Papakonstantinou, University of California, San Diego
Program Chair: Alon Halevy, University of Washington
Copyright © 2003 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
SIGMOD '03 Paper Acceptance Rate: 53 of 342 submissions, 15%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%
