DOI: 10.1145/2213836.2213912
demonstration

Automatic web-scale information extraction

Published: 20 May 2012

ABSTRACT

In this demonstration, we showcase the technologies that we are building at Yahoo! for Web-scale Information Extraction. Given any new Website, containing semi-structured information about a pre-specified set of schemas, we show how to populate objects in the corresponding schema by automatically extracting information from the Website.

References

  1. Tobias Anton. Xpath-wrapper induction by generating tree traversal patterns. In LWA, pages 126--133, 2005.
  2. Michele Banko, Michael J. Cafarella, Stephen Soderland, Matthew Broadhead, and Oren Etzioni. Open information extraction from the web. In International Joint Conference on Artificial Intelligence, pages 2670--2676, 2007.
  3. Lorenzo Blanco, Nilesh N. Dalvi, and Ashwin Machanavajjhala. Highly efficient algorithms for structural clustering of large websites. In WWW, pages 437--446, 2011.
  4. Michael J. Cafarella, Alon Halevy, Daisy Zhe Wang, Eugene Wu, and Yang Zhang. Webtables: exploring the power of tables on the web. VLDB, 1(1):538--549, 2008.
  5. Nilesh Dalvi, Ravi Kumar, Bo Pang, Raghu Ramakrishnan, Andrew Tomkins, Philip Bohannon, Sathiya Keerthi, and Srujana Merugu. A web of concepts (keynote). In PODS. Providence, Rhode Island, USA, June 2009.
  6. Nilesh N. Dalvi, Ravi Kumar, and Mohamed A. Soliman. Automatic wrappers for large scale web extraction. PVLDB, 4(4):219--230, 2011.
  7. Pedro DeRose, Warren Shen, Fei Chen, AnHai Doan, and Raghu Ramakrishnan. Building structured web community portals: A top-down, compositional, and incremental approach. In VLDB, pages 399--410, 2007.
  8. Hazem Elmeleegy, Jayant Madhavan, and Alon Y. Halevy. Harvesting relational tables from lists on the web. PVLDB, 2(1):1078--1089, 2009.
  9. Oren Etzioni, Michael Cafarella, Doug Downey, Stanley Kok, Ana-Maria Popescu, Tal Shaked, Stephen Soderland, Daniel S. Weld, and Alexander Yates. Web-scale information extraction in Knowitall: (preliminary results). In WWW, pages 100--110, 2004.
  10. Rahul Gupta and Sunita Sarawagi. Answering table augmentation queries from unstructured lists on the web. In VLDB, 2009.
  11. Wei Han, David Buttler, and Calton Pu. Wrapping web data into XML. SIGMOD Record, 30(3):33--38, 2001.
  12. Chun-Nan Hsu and Ming-Tzung Dung. Generating finite-state transducers for semi-structured data extraction from the web. Information Systems, 23(8):521--538, 1998.
  13. Nickolas Kushmerick, Daniel S. Weld, and Robert B. Doorenbos. Wrapper induction for information extraction. In IJCAI, pages 729--737, 1997.
  14. Jussi Myllymaki and Jared Jackson. Robust web data extraction with XML path expressions. Technical Report RJ 10245, IBM, 2002.
  15. Arnaud Sahuguet and Fabien Azavant. Building light-weight wrappers for legacy web data-sources using w4f. In VLDB, pages 738--741, 1999.
  16. Pierre Senellart, Avin Mittal, Daniel Muschick, Rémi Gilleron, and Marc Tommasi. Automatic wrapper induction from hidden-web sources with domain knowledge. In WIDM, pages 9--16, 2008.

    • Published in

      SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
      May 2012
      886 pages
      ISBN:9781450312479
      DOI:10.1145/2213836

      Copyright © 2012 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States




      Acceptance Rates

      SIGMOD '12 Paper Acceptance Rate: 48 of 289 submissions, 17%
      Overall Acceptance Rate: 785 of 4,003 submissions, 20%
