research-article

An analysis of structured data on the web

Authors:
Nilesh Dalvi
Yahoo! Research, Great America Parkway, Santa Clara, CA

Ashwin Machanavajjhala
Yahoo! Research, Great America Parkway, Santa Clara, CA

Bo Pang
Yahoo! Research, Great America Parkway, Santa Clara, CA

Proceedings of the VLDB Endowment, Volume 5, Issue 7, pp. 680–691. https://doi.org/10.14778/2180912.2180920

Published: 01 March 2012

Abstract

In this paper, we analyze the nature and distribution of structured data on the Web. Web-scale information extraction, the problem of creating structured tables by extracting from the entire Web, is attracting considerable research interest. We perform a study to understand and quantify the value of Web-scale extraction, and to determine how structured information is distributed among top aggregator websites and tail sites for various domains of interest. We believe this is the first study of its kind, and that it offers new insights for information extraction over the Web.

Index Terms

An analysis of structured data on the web
1. Information systems
  1. Information retrieval

Index terms have been assigned to the content through auto-classification.

Published in

Proceedings of the VLDB Endowment Volume 5, Issue 7
March 2012
94 pages
ISSN: 2150-8097
Publisher
VLDB Endowment
Author Tags
information connectivity
information spread
structured data on the web