Abstract
In this paper, we analyze the nature and distribution of structured data on the Web. Web-scale information extraction, the problem of creating structured tables by extracting data from the entire Web, is attracting considerable research interest. We perform a study to understand and quantify the value of Web-scale extraction, and to determine how structured information is distributed between top aggregator websites and tail sites for various domains of interest. We believe this is the first study of its kind, and that it gives new insights for information extraction over the Web.