skip to main content
10.1145/872757.872796acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article

Robust and efficient fuzzy match for online data cleaning

Published:09 June 2003Publication History

ABSTRACT

To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation.A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets.

References

  1. R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In Proceedings of VLDB, Hong Kong, 2002.]]Google ScholarGoogle ScholarCross RefCross Ref
  2. R. Baeza-Yates and G. Navarro. A practical index for text retrieval allowing errors. In R. Monge, editor, Proceedings of the XXIII Latin American Conference on Informatics (CLEI'97), Valparaiso, Chile, 1997.]]Google ScholarGoogle Scholar
  3. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley Longman, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. A. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences (SEQUENCES '97), 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. P. Ciaccia, M. Patella, P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. VLDB 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. E. Cohen. Size estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences, 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proceedings of ACM SIGMOD, Seattle, WA, June 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. W. Cohen. Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems, 18(3):288--321, July 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. E. Cohen and D. Lewis. Approximating matrix multiplication for pattern recognition tasks. In SODA: ACM-SIAM Symposium on Discrete Algorithms, 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. W. Cohen and J. Richman. Learning to match and cluster entity names. In proceedings of SIGKDD, Edmonton, July 2002.]]Google ScholarGoogle Scholar
  11. V. Gaede and O. Gunther. Multidimensional access methods. ACM Computing Surveys, 30(2):170--231, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In Proceedings of VLDB, Roma, Italy, September 11--14 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In Proceedings of the ACM SIGMOD, San Jose, CA, May 1995.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 30th Symposium on Theory of Computing (STOC), 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. P. Jokinen and E. Ukkonen. Two algorithms for approximate string matching in static texts. In A. Tarlecki, editor, Mathematical Foundations of Computer Science, 1991.]]Google ScholarGoogle Scholar
  16. R. Motwani and P. Raghavan. Randomized Algorithms Cambridge University Press, 1995.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. G. Navarro, R. Baeza-Yates, E. Sutinen, and J. Tarhio. Indexing methods for approximate string matching. IEEE Data Engineering Bulletin, 24(4):19--27, 2001.]]Google ScholarGoogle Scholar
  18. G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. Indexing text with approximate q-grams. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching (CPM'2000), LNCS 1848, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. G. Navarro. Searching in metric spaces by spatial approximation. The VLDB Journal, 11(1):28--46, 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Proceedings of ACM SIGKDD, Edmonton, Canada, 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. B. Schneier. Applied Cryptography John Wiley, 1996.]]Google ScholarGoogle Scholar
  22. T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195--197, 1981.]]Google ScholarGoogle ScholarCross RefCross Ref
  23. Trillium Software. http://www.trilliumsoft.com]]Google ScholarGoogle Scholar
  24. W. Winkler. The state of record linkage and current research problems. http://www.census.gov/srd/papers/pdf/rr99-04.pdf]]Google ScholarGoogle Scholar

Index Terms

  1. Robust and efficient fuzzy match for online data cleaning

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            SIGMOD '03: Proceedings of the 2003 ACM SIGMOD international conference on Management of data
            June 2003
            702 pages
            ISBN:158113634X
            DOI:10.1145/872757

            Copyright © 2003 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 9 June 2003

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • Article

            Acceptance Rates

            SIGMOD '03 Paper Acceptance Rate53of342submissions,15%Overall Acceptance Rate785of4,003submissions,20%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader