research-article

Web Content Extraction: a Meta-Analysis of its Past and Thoughts on its Future

Authors:
Tim Weninger, University of Notre Dame, Notre Dame, Indiana
Rodrigo Palacios, California State University, Fresno, California
Valter Crescenzi, Università Roma Tre Dipartimento di Ingegneria, Rome, Italy
Thomas Gottron, University of Koblenz-Landau, Germany
Paolo Merialdo, Università Roma Tre Dipartimento di Ingegneria, Rome, Italy

ACM SIGKDD Explorations Newsletter, Volume 17, Issue 2, December 2015, pp. 17–23. https://doi.org/10.1145/2897350.2897353

Published: 25 February 2016

Abstract

In this paper, we present a meta-analysis of several Web content extraction algorithms and make recommendations for the future of content extraction on the Web. First, we find that nearly all Web content extractors fail to consider a very large, and growing, portion of modern Web pages. Second, it is well understood that wrapper induction extractors tend to break as the Web changes; heuristic/feature engineering extractors were thought to be immune to a Web site's evolution, but we find that this is not the case: heuristic content extractor performance also tends to degrade over time due to the evolution of Web site forms and practices. We conclude with recommendations for future work that address these and other findings.

Index Terms

1. Information systems
  1. Information retrieval

Index terms have been assigned to the content through auto-classification.

Published in
ACM SIGKDD Explorations Newsletter, Volume 17, Issue 2
December 2015
41 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/2897350
Editors: Charu Aggarwal (IBM T.J. Watson), Haixun Wang (Google), Ankur M. Teredesai (University of Washington Tacoma), Hanghang Tong (Arizona State University)
Copyright © 2016 Authors
Publisher: Association for Computing Machinery, New York, NY, United States