short-paper

Configurable Table Structure Recognition in Untagged PDF documents

Authors:
Alexey Shigarov

Matrosov Institute for System Dynamics and Control Theory of SB RAS, Irkutsk, Russian Fed.

Andrey Mikhailov

Matrosov Institute for System Dynamics and Control Theory of SB RAS, Irkutsk, Russian Fed.

Andrey Altaev

Matrosov Institute for System Dynamics and Control Theory of SB RAS, Irkutsk, Russian Fed.


DocEng '16: Proceedings of the 2016 ACM Symposium on Document Engineering, September 2016, Pages 119–122, https://doi.org/10.1145/2960811.2967152

Published: 13 September 2016

ABSTRACT

Today, PDF is one of the most popular document formats on the web. Many PDF documents are not images, yet remain untagged: they have no tags identifying the logical reading order, paragraphs, figures, or tables. One challenge with these documents is how to extract tables from them. This paper discusses a new system for table structure recognition in untagged PDF documents. The system is formulated as a set of configurable parameters and ad-hoc heuristics for recovering table cells. We consider two configurations of the system and report experimental results for both on an existing competition dataset.
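
To make the idea of a configurable heuristic concrete, here is a minimal illustrative sketch in Python. It is not the authors' implementation: the abstract does not list the system's parameters, so the types `Chunk` and `Config`, the parameter `max_gap`, the function `merge_into_cells`, and all coordinates and thresholds are assumptions made for illustration only. The sketch merges the text chunks of one line into cells whenever the horizontal gap between neighbours does not exceed a configurable threshold, and shows how two configurations could yield different cell structures from the same input.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Chunk:
    """A text chunk with its horizontal extent on the page (hypothetical type)."""
    text: str
    left: float
    right: float


@dataclass(frozen=True)
class Config:
    """Hypothetical configuration; the parameter name is an assumption."""
    max_gap: float  # largest horizontal gap (in points) still merged into one cell


def merge_into_cells(chunks: List[Chunk], config: Config) -> List[str]:
    """Merge the chunks of one text line into cells using a gap threshold."""
    cells: List[List[Chunk]] = []
    for chunk in sorted(chunks, key=lambda c: c.left):
        # Join the chunk to the current cell if it starts close enough
        # to the right edge of the previous chunk; otherwise open a new cell.
        if cells and chunk.left - cells[-1][-1].right <= config.max_gap:
            cells[-1].append(chunk)
        else:
            cells.append([chunk])
    return [" ".join(c.text for c in cell) for cell in cells]


# One line of made-up chunks: a two-word label followed by two numbers.
line = [
    Chunk("Total", 10.0, 40.0),
    Chunk("revenue", 43.0, 90.0),
    Chunk("2015", 150.0, 175.0),
    Chunk("2016", 181.0, 206.0),
]

strict = merge_into_cells(line, Config(max_gap=4.0))
loose = merge_into_cells(line, Config(max_gap=8.0))
print(strict)
print(loose)
```

With the stricter threshold the two numbers stay in separate cells, while the looser one fuses them into a single cell. A sensitivity of this kind is one plausible reason for comparing configurations on a benchmark dataset.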


Index Terms

1. Applied computing
  1. Document management and text processing

Published in
DocEng '16: Proceedings of the 2016 ACM Symposium on Document Engineering
September 2016
222 pages
ISBN: 9781450344388
DOI: 10.1145/2960811
General Chair: Robert Sablatnig, TU Wien, Austria
Program Chair: Tamir Hassan, HP Labs, Austria
Copyright © 2016 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
pdf accessibility
pdf document analysis
table extraction
table structure recognition
untagged pdf documents
Acceptance Rates
DocEng '16 Paper Acceptance Rate: 11 of 35 submissions, 31%
Overall Acceptance Rate: 178 of 537 submissions, 33%