ABSTRACT
To ensure high data quality, data warehouses must validate and cleanse incoming data tuples from external sources. In many situations, clean tuples must match acceptable tuples in reference tables. For example, product name and description fields in a sales record from a distributor must match the pre-recorded name and description fields in a product reference relation.A significant challenge in such a scenario is to implement an efficient and accurate fuzzy match operation that can effectively clean an incoming tuple if it fails to match exactly with any tuple in the reference relation. In this paper, we propose a new similarity function which overcomes limitations of commonly used similarity functions, and develop an efficient fuzzy match algorithm. We demonstrate the effectiveness of our techniques by evaluating them on real datasets.
- R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In Proceedings of VLDB, Hong Kong, 2002.]]Google ScholarCross Ref
- R. Baeza-Yates and G. Navarro. A practical index for text retrieval allowing errors. In R. Monge, editor, Proceedings of the XXIII Latin American Conference on Informatics (CLEI'97), Valparaiso, Chile, 1997.]]Google Scholar
- R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley Longman, 1999.]] Google ScholarDigital Library
- A. Broder. On the resemblance and containment of documents. In Compression and Complexity of Sequences (SEQUENCES '97), 1998.]] Google ScholarDigital Library
- P. Ciaccia, M. Patella, P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. VLDB 1997.]] Google ScholarDigital Library
- E. Cohen. Size estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences, 1997.]] Google ScholarDigital Library
- W. Cohen. Integration of heterogeneous databases without common domains using queries based on textual similarity. In Proceedings of ACM SIGMOD, Seattle, WA, June 1998.]] Google ScholarDigital Library
- W. Cohen. Data integration using similarity joins and a word-based information representation language. ACM Transactions on Information Systems, 18(3):288--321, July 2000.]] Google ScholarDigital Library
- E. Cohen and D. Lewis. Approximating matrix multiplication for pattern recognition tasks. In SODA: ACM-SIAM Symposium on Discrete Algorithms, 1997.]] Google ScholarDigital Library
- W. Cohen and J. Richman. Learning to match and cluster entity names. In proceedings of SIGKDD, Edmonton, July 2002.]]Google Scholar
- V. Gaede and O. Gunther. Multidimensional access methods. ACM Computing Surveys, 30(2):170--231, 1998.]] Google ScholarDigital Library
- L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In Proceedings of VLDB, Roma, Italy, September 11--14 2001.]] Google ScholarDigital Library
- M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In Proceedings of the ACM SIGMOD, San Jose, CA, May 1995.]] Google ScholarDigital Library
- P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proceedings of the 30th Symposium on Theory of Computing (STOC), 1998.]] Google ScholarDigital Library
- P. Jokinen and E. Ukkonen. Two algorithms for approximate string matching in static texts. In A. Tarlecki, editor, Mathematical Foundations of Computer Science, 1991.]]Google Scholar
- R. Motwani and P. Raghavan. Randomized Algorithms Cambridge University Press, 1995.]] Google ScholarDigital Library
- G. Navarro, R. Baeza-Yates, E. Sutinen, and J. Tarhio. Indexing methods for approximate string matching. IEEE Data Engineering Bulletin, 24(4):19--27, 2001.]]Google Scholar
- G. Navarro, E. Sutinen, J. Tanninen, and J. Tarhio. Indexing text with approximate q-grams. In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching (CPM'2000), LNCS 1848, 2000.]] Google ScholarDigital Library
- G. Navarro. Searching in metric spaces by spatial approximation. The VLDB Journal, 11(1):28--46, 2002.]] Google ScholarDigital Library
- S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Proceedings of ACM SIGKDD, Edmonton, Canada, 2002.]] Google ScholarDigital Library
- B. Schneier. Applied Cryptography John Wiley, 1996.]]Google Scholar
- T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195--197, 1981.]]Google ScholarCross Ref
- Trillium Software. http://www.trilliumsoft.com]]Google Scholar
- W. Winkler. The state of record linkage and current research problems. http://www.census.gov/srd/papers/pdf/rr99-04.pdf]]Google Scholar
Index Terms
- Robust and efficient fuzzy match for online data cleaning
Recommendations
Keyword query cleaning
Unlike traditional database queries, keyword queries do not adhere to predefined syntax and are often dirty with irrelevant words from natural languages. This makes accurate and efficient keyword query processing over databases a very challenging task.
...
A Comparative Study of Data Cleaning Tools
In the information era, data is crucial in decision making. Most data sets contain impurities that need to be weeded out before any meaningful decision can be made from the data. Hence, data cleaning is essential and often takes more than 80 percent ...
Comments