ABSTRACT
Today, PDF is one of the most popular document formats on the web. Many PDF documents are not images, yet remain untagged: they carry no tags identifying the logical reading order, paragraphs, figures, or tables. One challenge with such documents is extracting tables from them. This paper presents a new system for table structure recognition in untagged PDF documents. The system is formulated as a set of configurable parameters and ad hoc heuristics for recovering table cells. We consider two different configurations of the system and report experimental results for both on an existing competition dataset.
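To illustrate what "configurable parameters and heuristics for recovering table cells" can mean in practice, the following is a minimal sketch and not the system described in the paper. It assumes one hypothetical parameter, `max_gap`, and one hypothetical heuristic, `merge_line`, which joins text chunks on the same line into a cell when the horizontal gap between them is small enough. The names `Config`, `Chunk`, `strict`, and `loose` are likewise invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    # Hypothetical parameter: largest horizontal gap (in points) between
    # two chunks on one line that still counts as "the same cell".
    max_gap: float


@dataclass(frozen=True)
class Chunk:
    text: str
    left: float   # x-coordinate of the left edge
    right: float  # x-coordinate of the right edge


def merge_line(chunks, config):
    """Merge horizontally adjacent chunks of one text line into cells."""
    cells = []
    for chunk in sorted(chunks, key=lambda c: c.left):
        if cells and chunk.left - cells[-1].right <= config.max_gap:
            # Gap is within the threshold: extend the previous cell.
            prev = cells[-1]
            cells[-1] = Chunk(
                prev.text + " " + chunk.text,
                prev.left,
                max(prev.right, chunk.right),
            )
        else:
            # Gap is too wide (or this is the first chunk): start a new cell.
            cells.append(chunk)
    return cells


# One line of text chunks with illustrative coordinates.
line = [
    Chunk("Total", 10, 40),
    Chunk("revenue", 43, 85),
    Chunk("2015", 120, 145),
]

# Two assumed configurations that differ only in the gap threshold.
strict = Config(max_gap=2.0)
loose = Config(max_gap=5.0)

strict_cells = [c.text for c in merge_line(line, strict)]
loose_cells = [c.text for c in merge_line(line, loose)]
```

With these illustrative values, the stricter configuration keeps all three chunks as separate cells, while the looser one merges "Total" and "revenue" into a single cell. This is the kind of behavioral difference that comparing two configurations on the same dataset would expose.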
Index Terms
- Configurable Table Structure Recognition in Untagged PDF documents