Abstract
"Robert" and "Bob" refer to the same first name but are textually far apart. Traditional string similarity functions do not allow a flexible way to account for such synonyms, abbreviations and aliases. Recently, string transformations have been proposed as a mechanism to make matching robust to such variations. However, in many domains, identifying an appropriate set of transformations is challenging as the space of possible transformations is large. In this paper, we investigate the problem of leveraging examples of matching strings to learn string transformations. We formulate an optimization problem where we are required to learn a concise set of transformations that explain most of the differences. We propose a greedy approximation algorithm for this NP-hard problem. Our experiments over real-life data illustrate the benefits of our approach.
- A. Arasu, S. Chaudhuri, and R. Kaushik. Transformation-based framework for record matching. In Proc. of the 24th Intl. Conf. on Data Engineering (ICDE), pages 40--49, Apr. 2008. Google ScholarDigital Library
- E. Brill and R. C. Moore. An improved error model for noisy channel spelling correction. In Proc. of the 38th Ann. Meeting of Assoc. for Comp. Linguistics (ACL), Oct. 2000. Google ScholarDigital Library
- S. Chaudhuri, B. C. Chen, V. Ganti, and R. Kaushik. Example-driven design of efficient record matching queries. In Proc. of the 33rd Intl. Conf. on Very Large Data Bases (VLDB), pages 327--338, Sept. 2007. Google ScholarDigital Library
- S. Chaudhuri, V. Ganti, and D. Xin. Mining document collections to facilitate accurate approximate entity match, 2009. Under submission.Google Scholar
- DBLP. http://www.informatik.uni-trier.de/~ley/db/index.html.Google Scholar
- X. Dong, A. Y. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In Proc. of the 2005 ACM SIGMOD Intl. Conf. on Management of Data, pages 85--96, June 2005. Google ScholarDigital Library
- A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. on Knowledge and Data Engg., 19(1):1--16, Jan. 2007. Google ScholarDigital Library
- M. N. Garofalakis, A. Gionis, R. Rastogi, et al. XTRACT: A system for extracting document type descriptors from XML documents. In Proc. of the 2000 ACM SIGMOD Intl. Conf. on Management of Data, pages 165--176, May 2000. Google ScholarDigital Library
- P. D. Grunwald. The Minimum Description Length Principle. MIT Press, 2007. Google ScholarDigital Library
- D. Kempe, J. M. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In Proc. of the 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pages 137--146, Aug. 2003. Google ScholarDigital Library
- N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In Proc. of the 2006 ACM SIGMOD Intl. Conf. on Management of Data, pages 802--803, June 2006. Google ScholarDigital Library
- M. Michelson and C. A. Knoblock. Mining heterogeneous transformations for record linkage. In Proc. of the 6th Intl. Workshop on Information Integration on the Web (IIWeb), pages 68--73, 2007.Google Scholar
- S. Minton, C. Nanjo, C. A. Knoblock, et al. A heterogeneous field matching method for record linkage. In Proc. of the 5th IEEE Intl. Conf. on Data Mining (ICDM), pages 314--321, Nov. 2005. Google ScholarDigital Library
- S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Proc. of the 8th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pages 269--278, July 2002. Google ScholarDigital Library
- P. Singla and P. Domingos. Collective object identification. In Proc. of the 19th Intl. Joint Conf. on Artificial Intelligence (IJCAI), pages 1636--1637, Aug. 2005. Google ScholarDigital Library
- P. D. Turney. Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. Lecture Notes in Computer Science, (2167):491--503, 2001. Google ScholarDigital Library
- United States Postal Service (USPS). http://www.usps.com.Google Scholar
Index Terms
- Learning string transformations from examples
Recommendations
Learning semantic string transformations from examples
We address the problem of performing semantic transformations on strings, which may represent a variety of data types (or their combination) such as a column in a relational table, time, date, currency, etc. Unlike syntactic transformations, which are ...
Incorporating string transformations in record matching
SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of dataToday's record matching infrastructure does not allow a flexible way to account for synonyms such as "Robert" and "Bob" which refer to the same name, and more general forms of string transformations such as abbreviations. We expand the problem of record ...
Spreadsheet table transformations from examples
PLDI '11: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and ImplementationEvery day, millions of computer end-users need to perform tasks over large, tabular data, yet lack the programming knowledge to do such tasks automatically. In this work, we present an automatic technique that takes from a user an example of how the ...
Comments