Robust web content extraction

Authors:
Marek Kowalkiewicz, The Poznan University of Economics, Poznan, Poland
Maria E. Orlowska, The University of Queensland, St. Lucia, Australia
Tomasz Kaczmarek, The Poznan University of Economics, Poznan, Poland
Witold Abramowicz, The Poznan University of Economics, Poznan, Poland

WWW '06: Proceedings of the 15th international conference on World Wide Web, May 2006, Pages 887–888. https://doi.org/10.1145/1135777.1135928

Published: 23 May 2006

ABSTRACT

We present an empirical evaluation and comparison of two methods for extracting content from HTML documents: absolute XPath expressions and relative XPath expressions. We argue that relative XPath expressions, although not widely used, should be preferred to absolute XPath expressions when extracting content from human-created Web documents. The robustness evaluation covers four thousand queries executed on several hundred web pages. We show that, when referencing parts of real-world dynamic HTML documents, relative XPath expressions are on average significantly more robust than absolute ones.
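
The difference between the two kinds of expression can be illustrated with a minimal sketch. It is not the authors' experimental setup: the two page versions, the helper names `extract` and `robustness`, and the choice of an `id` attribute as the anchor for the relative expression are all assumptions made for illustration, and the robustness score shown here (the fraction of page versions on which a query still returns the expected text) is a simplified stand-in for whatever measure the paper uses. The sketch relies only on the limited XPath subset supported by Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Two hypothetical versions of the same page. In PAGE_V2 a banner <div> has
# been inserted before the other blocks, shifting element positions.
PAGE_V1 = """<html><body>
<div id="nav">Menu</div>
<div id="content"><p>Headline</p><p>Story text</p></div>
</body></html>"""

PAGE_V2 = """<html><body>
<div id="banner">Ad</div>
<div id="nav">Menu</div>
<div id="content"><p>Headline</p><p>Story text</p></div>
</body></html>"""

# Absolute-style path: every step is counted from the document root.
# ElementTree evaluates paths relative to the root <html> element, so this
# corresponds to /html/body/div[2]/p[1] in full XPath.
ABSOLUTE = "./body/div[2]/p[1]"

# Relative-style path: anchored on an element located by an attribute.
# Using the id attribute as the anchor is an assumption of this sketch.
RELATIVE = ".//div[@id='content']/p[1]"


def extract(page, path):
    """Return the text of the first node matching path, or None if no match."""
    root = ET.fromstring(page)
    node = root.find(path)
    return node.text if node is not None else None


def robustness(path, pages, expected):
    """Fraction of page versions on which path still returns expected."""
    hits = sum(1 for page in pages if extract(page, path) == expected)
    return hits / len(pages)


if __name__ == "__main__":
    versions = [PAGE_V1, PAGE_V2]
    for label, path in (("absolute", ABSOLUTE), ("relative", RELATIVE)):
        print(label, robustness(path, versions, "Headline"))
```

In this toy case the inserted banner shifts the position of the content block, so the positional path stops matching on the second version while the anchored path is unaffected.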

Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document preparation
      1. Multi / mixed media creation
2. Human-centered computing
  1. Human computer interaction (HCI)
    1. Interaction paradigms
      1. Hypertext / hypermedia

Published in
WWW '06: Proceedings of the 15th international conference on World Wide Web
May 2006
1102 pages
ISBN: 1595933239
DOI: 10.1145/1135777
General Chairs: Leslie Carr (University of Southampton), David De Roure (University of Southampton), Arun Iyengar (IBM Research)
Program Chairs: Carole Goble (University of Manchester, UK), Mike Dahlin (University of Texas at Austin)
Copyright © 2006 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
content extraction
evaluation
robustness
wrappers
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%