An analysis of structured data on the web

Published: 01 March 2012

Abstract

In this paper, we analyze the nature and distribution of structured data on the Web. Web-scale information extraction, the problem of creating structured tables by extracting from the entire Web, is attracting considerable research interest. We perform a study to understand and quantify the value of Web-scale extraction, and to measure how structured information is distributed among top aggregator websites and tail sites across various domains. We believe this is the first study of its kind, and that it gives new insights for information extraction over the Web.
