demonstration

Automatic web-scale information extraction

Authors:
Philip Bohannon, Yahoo! Research, Santa Clara, USA
Nilesh Dalvi, Yahoo! Research, Santa Clara, USA
Yuval Filmus, University of Toronto, Toronto, Canada
Nori Jacoby, Yahoo!, Tel Aviv, Israel
Sathiya Keerthi, Yahoo! Research, Santa Clara, USA
Alok Kirpal, Yahoo! Research, Santa Clara, USA

SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, May 2012, Pages 609–612
https://doi.org/10.1145/2213836.2213912

Published: 20 May 2012

ABSTRACT

In this demonstration, we showcase the technologies that we are building at Yahoo! for Web-scale Information Extraction. Given any new website that contains semi-structured information about a pre-specified set of schemas, we show how to populate objects in the corresponding schema by automatically extracting information from that website.
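
The abstract does not describe the extraction mechanism, so the following is only a minimal sketch of the general idea of populating a schema object from a semi-structured page, not the authors' system. It assumes a hypothetical Restaurant schema, a hypothetical hand-written wrapper (one path expression per attribute), and well-formed markup that Python's standard xml.etree.ElementTree can parse. Every name, path, and the sample page is invented for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


# Hypothetical target schema; the field names are illustrative only.
@dataclass
class Restaurant:
    name: str
    phone: str
    address: str


# Hypothetical wrapper: one path expression per schema attribute.
# These rules are hand-written here purely for illustration.
WRAPPER = {
    "name": ".//h1[@class='title']",
    "phone": ".//span[@class='phone']",
    "address": ".//div[@class='addr']",
}


def extract(page_source: str, wrapper: dict) -> Restaurant:
    """Apply each wrapper rule to the page and populate one schema object."""
    root = ET.fromstring(page_source)  # assumes well-formed markup
    values = {}
    for field, path in wrapper.items():
        node = root.find(path)
        # A missing node yields an empty string instead of raising.
        values[field] = "".join(node.itertext()).strip() if node is not None else ""
    return Restaurant(**values)


# Invented sample page standing in for one page of a new website.
PAGE = """<html><body>
<h1 class="title">Luigi's Trattoria</h1>
<div class="addr">12 Main St, Santa Clara, CA</div>
<span class="phone">555-0100</span>
</body></html>"""

record = extract(PAGE, WRAPPER)
```

Keeping the wrapper as plain data, separate from the code that applies it, means the same extract function can be reused across sites by swapping in a different rule set. Real web pages are rarely well-formed, so a tolerant HTML parser would be needed in practice.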

Index Terms

Automatic web-scale information extraction
1. Information systems
  1. Information retrieval

Published in
SIGMOD '12: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
May 2012
886 pages
ISBN:9781450312479
DOI:10.1145/2213836
General Chairs: K. Selçuk Candan (Arizona State University), Yi Chen (Arizona State University), Richard Snodgrass (University of Arizona)
Program Chair: Luis Gravano (Columbia University)
Publications Chair: Ariel Fuxman (Microsoft Research)
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
domain-centric
information extraction
web-scale
Qualifiers
- demonstration
Acceptance Rates
SIGMOD '12 paper acceptance rate: 48 of 289 submissions (17%)
Overall acceptance rate: 785 of 4,003 submissions (20%)