ABSTRACT
We present an empirical evaluation and comparison of two methods for extracting content from HTML documents: absolute XPath expressions and relative XPath expressions. We argue that relative XPath expressions, although not widely used, should be preferred to absolute ones when extracting content from human-created Web documents. The robustness evaluation covers four thousand queries executed on several hundred webpages. We show that, when referencing parts of real-world dynamic HTML documents, relative XPath expressions are on average significantly more robust than absolute ones.
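The distinction the abstract draws can be illustrated with a minimal sketch. It is not the authors' evaluation setup: the two page snapshots, the `headline` id, and the `extract` helper are all hypothetical. It also assumes one common reading of the terms, in which an absolute expression addresses a node by its position at every step from the root, while a relative expression anchors on a stable feature of the node. Python's standard `xml.etree.ElementTree` supports only a subset of XPath, so the absolute path is written from the root element as `./body/...`.

```python
import xml.etree.ElementTree as ET

# Two hypothetical snapshots of the same page. In the second one an
# extra banner <div> has been inserted before the existing content.
PAGE_V1 = """<html><body>
<div><p>Menu</p></div>
<div><p id="headline">Rates rise</p></div>
</body></html>"""

PAGE_V2 = """<html><body>
<div><p>Advertisement</p></div>
<div><p>Menu</p></div>
<div><p id="headline">Rates rise</p></div>
</body></html>"""

# Absolute-style path: every step from the root element, by position.
# (In full XPath this would be written /html/body/div[2]/p[1].)
ABSOLUTE = "./body/div[2]/p[1]"

# Relative-style path: anchored on an attribute, wherever the node sits.
RELATIVE = ".//p[@id='headline']"


def extract(page: str, path: str):
    """Return the text of the first node matching path, or None."""
    root = ET.fromstring(page)
    node = root.find(path)
    return None if node is None else node.text


if __name__ == "__main__":
    for label, page in (("v1", PAGE_V1), ("v2", PAGE_V2)):
        print(label, "absolute:", extract(page, ABSOLUTE))
        print(label, "relative:", extract(page, RELATIVE))
```

In this toy case the positional path returns the headline on the first snapshot but a different paragraph on the second, while the attribute-anchored path returns the headline on both. That is the kind of breakage under page change that a robustness evaluation would count.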
Index Terms
- Robust web content extraction