ABSTRACT
In the AllRight project, we are developing an algorithm for unsupervised table detection and segmentation that uses the visual rendition of a Web page rather than its HTML code. The algorithm works bottom-up: guided by a set of heuristics, it groups word bounding boxes into larger groups. It has already been implemented, and a preliminary evaluation on about 6000 Web documents has been carried out.
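To make the bottom-up idea concrete, here is a minimal sketch of one possible grouping step. It is not the authors' implementation. The abstract does not specify the heuristics, so the merge rule used here (vertical overlap plus a small horizontal gap) and every name and threshold (`Box`, `should_merge`, `group_boxes`, `max_gap`, `min_overlap`) are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in rendered-page coordinates (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def union(self, other: "Box") -> "Box":
        """Smallest box that covers both boxes."""
        return Box(min(self.x0, other.x0), min(self.y0, other.y0),
                   max(self.x1, other.x1), max(self.y1, other.y1))


def should_merge(a: Box, b: Box, max_gap: float, min_overlap: float) -> bool:
    """Assumed heuristic: merge boxes that share a text line and are close horizontally."""
    overlap = min(a.y1, b.y1) - max(a.y0, b.y0)
    shortest = min(a.y1 - a.y0, b.y1 - b.y0)
    if shortest <= 0 or overlap / shortest < min_overlap:
        return False
    # Horizontal gap between the boxes; negative when they overlap.
    gap = max(a.x0, b.x0) - min(a.x1, b.x1)
    return gap <= max_gap


def group_boxes(boxes: List[Box], max_gap: float = 15.0,
                min_overlap: float = 0.5) -> List[Box]:
    """Repeatedly merge any mergeable pair until no pair qualifies."""
    groups = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if should_merge(groups[i], groups[j], max_gap, min_overlap):
                    groups[i] = groups[i].union(groups[j])
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return groups


# Two text rows, each with two nearby words and one distant word.
words = [
    Box(0, 0, 40, 10), Box(45, 0, 90, 10), Box(200, 0, 240, 10),
    Box(0, 20, 40, 30), Box(46, 21, 90, 31), Box(200, 20, 240, 30),
]
groups = group_boxes(words)
```

In this toy input the nearby words on each row merge into one group while the distant words stay separate, which gives two aligned groups per row. Alignment of that kind is what a later, table-specific heuristic could look for.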
Index Terms
- Visually guided bottom-up table detection and segmentation in web documents