skip to main content
research-article

Learning string transformations from examples

Published:01 August 2009Publication History
Skip Abstract Section

Abstract

"Robert" and "Bob" refer to the same first name but are textually far apart. Traditional string similarity functions do not allow a flexible way to account for such synonyms, abbreviations and aliases. Recently, string transformations have been proposed as a mechanism to make matching robust to such variations. However, in many domains, identifying an appropriate set of transformations is challenging as the space of possible transformations is large. In this paper, we investigate the problem of leveraging examples of matching strings to learn string transformations. We formulate an optimization problem where we are required to learn a concise set of transformations that explain most of the differences. We propose a greedy approximation algorithm for this NP-hard problem. Our experiments over real-life data illustrate the benefits of our approach.

References

  1. A. Arasu, S. Chaudhuri, and R. Kaushik. Transformation-based framework for record matching. In Proc. of the 24th Intl. Conf. on Data Engineering (ICDE), pages 40--49, Apr. 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. E. Brill and R. C. Moore. An improved error model for noisy channel spelling correction. In Proc. of the 38th Ann. Meeting of Assoc. for Comp. Linguistics (ACL), Oct. 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. S. Chaudhuri, B. C. Chen, V. Ganti, and R. Kaushik. Example-driven design of efficient record matching queries. In Proc. of the 33rd Intl. Conf. on Very Large Data Bases (VLDB), pages 327--338, Sept. 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. S. Chaudhuri, V. Ganti, and D. Xin. Mining document collections to facilitate accurate approximate entity match, 2009. Under submission.Google ScholarGoogle Scholar
  5. DBLP. http://www.informatik.uni-trier.de/~ley/db/index.html.Google ScholarGoogle Scholar
  6. X. Dong, A. Y. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In Proc. of the 2005 ACM SIGMOD Intl. Conf. on Management of Data, pages 85--96, June 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. on Knowledge and Data Engg., 19(1):1--16, Jan. 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. M. N. Garofalakis, A. Gionis, R. Rastogi, et al. XTRACT: A system for extracting document type descriptors from XML documents. In Proc. of the 2000 ACM SIGMOD Intl. Conf. on Management of Data, pages 165--176, May 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. P. D. Grunwald. The Minimum Description Length Principle. MIT Press, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. D. Kempe, J. M. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In Proc. of the 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pages 137--146, Aug. 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In Proc. of the 2006 ACM SIGMOD Intl. Conf. on Management of Data, pages 802--803, June 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. M. Michelson and C. A. Knoblock. Mining heterogeneous transformations for record linkage. In Proc. of the 6th Intl. Workshop on Information Integration on the Web (IIWeb), pages 68--73, 2007.Google ScholarGoogle Scholar
  13. S. Minton, C. Nanjo, C. A. Knoblock, et al. A heterogeneous field matching method for record linkage. In Proc. of the 5th IEEE Intl. Conf. on Data Mining (ICDM), pages 314--321, Nov. 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Proc. of the 8th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pages 269--278, July 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. P. Singla and P. Domingos. Collective object identification. In Proc. of the 19th Intl. Joint Conf. on Artificial Intelligence (IJCAI), pages 1636--1637, Aug. 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. P. D. Turney. Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. Lecture Notes in Computer Science, (2167):491--503, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. United States Postal Service (USPS). http://www.usps.com.Google ScholarGoogle Scholar

Index Terms

  1. Learning string transformations from examples

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader