Abstract
In this paper, we present a meta-analysis of several Web content extraction algorithms and make recommendations for the future of content extraction on the Web. First, we find that nearly all Web content extractors fail to consider a very large, and growing, portion of modern Web pages. Second, it is well understood that wrapper induction extractors tend to break as the Web changes; heuristic/feature engineering extractors were thought to be immune to a Web site's evolution, but we find that this is not the case: heuristic content extractor performance also tends to degrade over time due to the evolution of Web site forms and practices. We conclude with recommendations for future work that address these and other findings.
Index Terms
- Web Content Extraction: a MetaAnalysis of its Past and Thoughts on its Future