skip to main content
research-article

Learning semantic string transformations from examples

Published:01 April 2012Publication History
Skip Abstract Section

Abstract

We address the problem of performing semantic transformations on strings, which may represent a variety of data types (or their combination) such as a column in a relational table, time, date, currency, etc. Unlike syntactic transformations, which are based on regular expressions and which interpret a string as a sequence of characters, semantic transformations additionally require exploiting the semantics of the data type represented by the string, which may be encoded as a database of relational tables. Manually performing such transformations on a large collection of strings is error prone and cumbersome, while programmatic solutions are beyond the skill-set of end-users. We present a programming by example technology that allows end-users to automate such repetitive tasks.

We describe an expressive transformation language for semantic manipulation that combines table lookup operations and syntactic manipulations. We then present a synthesis algorithm that can learn all transformations in the language that are consistent with the user-provided set of input-output examples. We have implemented this technology as an add-in for the Microsoft Excel Spreadsheet system and have evaluated it successfully over several benchmarks picked from various Excel help-forums.

References

  1. A. Arasu, S. Chaudhuri, and R. Kaushik. Transformation-based framework for record matching. In ICDE, pages 40--49, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. A. Arasu, S. Chaudhuri, and R. Kaushik. Learning string transformations from examples. PVLDB, 2(1):514--525, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. A. Das Sarma, A. Parameswaran, H. Garcia-Molina, and J. Widom. Synthesizing view definitions from data. In ICDT, pages 89--103, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. R. Dhamankar, Y. Lee, A. Doan, A. Y. Halevy, and P. Domingos. iMAP: Discovering complex mappings between database schemas. In SIGMOD, pages 383--394, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1--16, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. M. Gualtieri. Deputize end-user developers to deliver business agility and reduce costs. In Forrester Report for Application Development and Program Management Professionals, April 2009.Google ScholarGoogle Scholar
  7. S. Gulwani. Dimensions in program synthesis. In PPDP, pages 13--24, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. S. Gulwani. Automating string processing in spreadsheets using input-output examples. In POPL, pages 317--330, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. S. Gulwani, W. R. Harris, and R. Singh. Spreadsheet data manipulation using examples. In Communications of the ACM, 2012. To Appear. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. W. R. Harris and S. Gulwani. Spreadsheet table transformations from examples. In PLDI, pages 317--328, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Jha, S. Gulwani, S. A. Seshia, and A. Tiwari. Oracle-guided component-based program synthesis. In ICSE, pages 215--224, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In SIGMOD, pages 802--803, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. T. Lau. Why programming-by-demonstration systems fail: Lessons learned for usable ai. AI Magazine, 30(4):65--67, 2009.Google ScholarGoogle ScholarCross RefCross Ref
  14. T. Lau, S. Wolfman, P. Domingos, and D. Weld. Programming by demonstration using version space algebra. Machine Learning, 53(1-2):111--156, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. R. C. Miller and B. A. Myers. Interactive simultaneous editing of multiple text regions. In USENIX Annual Technical Conference, General Track, pages 161--174, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. R. Singh and S. Gulwani. Learning semantic string transformations from examples. Technical Report MSR-TR-2012-5, Microsoft Research, 2012.Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. R. Singh and S. Gulwani. Synthesizing number transformations from input-output examples. In CAV, 2012. To Appear. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Q. T. Tran, C.-Y. Chan, and S. Parthasarathy. Query by output. In SIGMOD, pages 535--548, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. R. H. Warren and F. W. Tompa. Multi-column substring matching for database schema translation. In VLDB, pages 331--342, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 5, Issue 8
    April 2012
    96 pages

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 April 2012
    Published in pvldb Volume 5, Issue 8

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader