Article

Visually guided bottom-up table detection and segmentation in web documents

Authors:
Bernhard Krüpl, Vienna University of Technology
Marcus Herzog, Vienna University of Technology

WWW '06: Proceedings of the 15th international conference on World Wide Web, May 2006, Pages 933–934
https://doi.org/10.1145/1135777.1135951

Published: 23 May 2006

ABSTRACT

In the AllRight project, we are developing an algorithm for unsupervised table detection and segmentation that uses the visual rendition of a Web page rather than the HTML code. Our algorithm works bottom-up by grouping word bounding boxes into larger groups and uses a set of heuristics. It has already been implemented and a preliminary evaluation on about 6000 Web documents has been carried out.
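The abstract does not list the heuristics, so the following is only a minimal sketch of what bottom-up grouping of word bounding boxes could look like, not the AllRight algorithm itself. It assumes word boxes are already available from a rendering engine, merges words into lines by vertical overlap, splits lines into cells at wide horizontal gaps, and reports runs of consecutive lines whose cell left edges align. The names `Box`, `group_into_lines`, `split_into_cells`, and `find_table_candidates`, and all thresholds (`min_overlap`, `max_gap`, `tol`, `min_rows`), are hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of one rendered word, in pixels (assumed input)."""
    left: float
    top: float
    right: float
    bottom: float
    text: str = ""


def vertical_overlap(a: Box, b: Box) -> float:
    """Fraction of the shorter box's height that the two boxes share."""
    shared = min(a.bottom, b.bottom) - max(a.top, b.top)
    shorter = min(a.bottom - a.top, b.bottom - b.top)
    return max(0.0, shared) / shorter if shorter > 0 else 0.0


def group_into_lines(words: List[Box], min_overlap: float = 0.5) -> List[List[Box]]:
    """Step 1: merge words that overlap vertically into text lines."""
    lines: List[List[Box]] = []
    for word in sorted(words, key=lambda w: (w.top, w.left)):
        for line in lines:
            if vertical_overlap(line[-1], word) >= min_overlap:
                line.append(word)
                break
        else:
            lines.append([word])
    return [sorted(line, key=lambda w: w.left) for line in lines]


def split_into_cells(line: List[Box], max_gap: float = 15.0) -> List[List[Box]]:
    """Step 2: start a new cell wherever the horizontal gap exceeds max_gap."""
    cells = [[line[0]]]
    for prev, word in zip(line, line[1:]):
        if word.left - prev.right > max_gap:
            cells.append([word])
        else:
            cells[-1].append(word)
    return cells


def aligned(a: List[float], b: List[float], tol: float = 5.0) -> bool:
    """True if both rows have the same number (>= 2) of left-aligned cells."""
    return len(a) == len(b) >= 2 and all(abs(x - y) <= tol for x, y in zip(a, b))


def find_table_candidates(words: List[Box], min_rows: int = 2) -> List[List[List[List[Box]]]]:
    """Step 3: collect runs of consecutive lines with matching column positions."""
    rows = [split_into_cells(line) for line in group_into_lines(words)]
    tables, current = [], []
    for row in rows:
        lefts = [cell[0].left for cell in row]
        if current and aligned([c[0].left for c in current[-1]], lefts):
            current.append(row)
        else:
            if len(current) >= min_rows:
                tables.append(current)
            current = [row]
    if len(current) >= min_rows:
        tables.append(current)
    return tables
```

This sketch only handles left-aligned columns and ignores spanning cells, right-aligned numbers, and nested layouts. A realistic heuristic set would need to cover those cases.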

Index Terms

Visually guided bottom-up table detection and segmentation in web documents
1. Applied computing
  1. Document management and text processing
    1. Document capture
2. Information systems
  1. Information retrieval
  2. Information storage systems

Published in
WWW '06: Proceedings of the 15th international conference on World Wide Web
May 2006
1102 pages
ISBN: 1595933239
DOI: 10.1145/1135777
General Chairs: Leslie Carr (University of Southampton), David De Roure (University of Southampton), Arun Iyengar (IBM Research)
Program Chairs: Carole Goble (University of Manchester, UK), Mike Dahlin (University of Texas at Austin)
Copyright © 2006 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
table detection
web information extraction
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%