research-article

Learning string transformations from examples

Authors:
Arvind Arasu

Microsoft Research, Redmond, WA

Microsoft Research, Redmond, WA
View Profile

,
Surajit Chaudhuri

Microsoft Research, Redmond, WA

Microsoft Research, Redmond, WA
View Profile

,
Raghav Kaushik

Microsoft Research, Redmond, WA

Microsoft Research, Redmond, WA
View Profile

Proceedings of the VLDB Endowment Volume 2 Issue 1pp 514–525https://doi.org/10.14778/1687627.1687686

Published:01 August 2009Publication History

Proceedings of the VLDB Endowment

Abstract

"Robert" and "Bob" refer to the same first name but are textually far apart. Traditional string similarity functions do not allow a flexible way to account for such synonyms, abbreviations and aliases. Recently, string transformations have been proposed as a mechanism to make matching robust to such variations. However, in many domains, identifying an appropriate set of transformations is challenging as the space of possible transformations is large. In this paper, we investigate the problem of leveraging examples of matching strings to learn string transformations. We formulate an optimization problem where we are required to learn a concise set of transformations that explain most of the differences. We propose a greedy approximation algorithm for this NP-hard problem. Our experiments over real-life data illustrate the benefits of our approach.

References

A. Arasu, S. Chaudhuri, and R. Kaushik. Transformation-based framework for record matching. In Proc. of the 24th Intl. Conf. on Data Engineering (ICDE), pages 40--49, Apr. 2008. Google ScholarDigital Library
E. Brill and R. C. Moore. An improved error model for noisy channel spelling correction. In Proc. of the 38th Ann. Meeting of Assoc. for Comp. Linguistics (ACL), Oct. 2000. Google ScholarDigital Library
S. Chaudhuri, B. C. Chen, V. Ganti, and R. Kaushik. Example-driven design of efficient record matching queries. In Proc. of the 33rd Intl. Conf. on Very Large Data Bases (VLDB), pages 327--338, Sept. 2007. Google ScholarDigital Library
S. Chaudhuri, V. Ganti, and D. Xin. Mining document collections to facilitate accurate approximate entity match, 2009. Under submission.Google Scholar
DBLP. http://www.informatik.uni-trier.de/~ley/db/index.html.Google Scholar
X. Dong, A. Y. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In Proc. of the 2005 ACM SIGMOD Intl. Conf. on Management of Data, pages 85--96, June 2005. Google ScholarDigital Library
A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. on Knowledge and Data Engg., 19(1):1--16, Jan. 2007. Google ScholarDigital Library
M. N. Garofalakis, A. Gionis, R. Rastogi, et al. XTRACT: A system for extracting document type descriptors from XML documents. In Proc. of the 2000 ACM SIGMOD Intl. Conf. on Management of Data, pages 165--176, May 2000. Google ScholarDigital Library
P. D. Grunwald. The Minimum Description Length Principle. MIT Press, 2007. Google ScholarDigital Library
D. Kempe, J. M. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In Proc. of the 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pages 137--146, Aug. 2003. Google ScholarDigital Library
N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In Proc. of the 2006 ACM SIGMOD Intl. Conf. on Management of Data, pages 802--803, June 2006. Google ScholarDigital Library
M. Michelson and C. A. Knoblock. Mining heterogeneous transformations for record linkage. In Proc. of the 6th Intl. Workshop on Information Integration on the Web (IIWeb), pages 68--73, 2007.Google Scholar
S. Minton, C. Nanjo, C. A. Knoblock, et al. A heterogeneous field matching method for record linkage. In Proc. of the 5th IEEE Intl. Conf. on Data Mining (ICDM), pages 314--321, Nov. 2005. Google ScholarDigital Library
S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In Proc. of the 8th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pages 269--278, July 2002. Google ScholarDigital Library
P. Singla and P. Domingos. Collective object identification. In Proc. of the 19th Intl. Joint Conf. on Artificial Intelligence (IJCAI), pages 1636--1637, Aug. 2005. Google ScholarDigital Library
P. D. Turney. Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. Lecture Notes in Computer Science, (2167):491--503, 2001. Google ScholarDigital Library
United States Postal Service (USPS). http://www.usps.com.Google Scholar

Index Terms

Learning string transformations from examples
1. Computing methodologies
  1. Machine learning
2. Theory of computation
  1. Design and analysis of algorithms
    1. Data structures design and analysis
      1. Pattern matching

Recommendations

Learning semantic string transformations from examples

We address the problem of performing semantic transformations on strings, which may represent a variety of data types (or their combination) such as a column in a relational table, time, date, currency, etc. Unlike syntactic transformations, which are ...
Read More
Incorporating string transformations in record matching
SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data

Today's record matching infrastructure does not allow a flexible way to account for synonyms such as "Robert" and "Bob" which refer to the same name, and more general forms of string transformations such as abbreviations. We expand the problem of record ...
Read More
Spreadsheet table transformations from examples
PLDI '11: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation

Every day, millions of computer end-users need to perform tasks over large, tabular data, yet lack the programming knowledge to do such tasks automatically. In this work, we present an automatic technique that takes from a user an example of how the ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 2, Issue 1
August 2009
1293 pages
ISSN:2150-8097
Editors:
Serge Abiteboul,
Tova Milo,
Jignesh Patel,
Philippe Rigaux
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 August 2009
Published in pvldb Volume 2, Issue 1
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 27
  Total Citations
  View Citations
- 232
  Total Downloads
- Downloads (Last 12 months)9
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Learning string transformations from examples

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Learning semantic string transformations from examples

Incorporating string transformations in record matching

Spreadsheet table transformations from examples

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Learning string transformations from examples

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Learning semantic string transformations from examples

Incorporating string transformations in record matching

Spreadsheet table transformations from examples

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media