research-article

Web Content Extraction: a Meta-Analysis of its Past and Thoughts on its Future

Published: 25 February 2016

Abstract

In this paper, we present a meta-analysis of several Web content extraction algorithms and make recommendations for the future of content extraction on the Web. First, we find that nearly all Web content extractors do not consider a very large, and growing, portion of modern Web pages. Second, it is well understood that wrapper induction extractors tend to break as the Web changes; heuristic/feature engineering extractors were thought to be immune to a Web site's evolution, but we find that this is not the case: heuristic content extractor performance also tends to degrade over time due to the evolution of Web site forms and practices. We conclude with recommendations for future work that address these and other findings.

Published in

ACM SIGKDD Explorations Newsletter, Volume 17, Issue 2
December 2015
41 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/2897350

Copyright © 2016 Authors

Publisher

Association for Computing Machinery
New York, NY, United States