DOI: 10.1145/2960811.2967152
short-paper

Configurable Table Structure Recognition in Untagged PDF documents

Published: 13 September 2016

ABSTRACT

Today, PDF is one of the most popular document formats on the web. Many PDF documents are not images, yet they remain untagged: they have no tags that identify the logical reading order, paragraphs, figures, or tables. One challenge with such documents is extracting tables from them. This paper presents a new system for table structure recognition in untagged PDF documents. The system is formulated as a set of configurable parameters and ad hoc heuristics for recovering table cells. We consider two configurations of the system and report experimental results for both on an existing competition dataset.


    Published in

      DocEng '16: Proceedings of the 2016 ACM Symposium on Document Engineering
      September 2016
      222 pages
      ISBN: 9781450344388
      DOI: 10.1145/2960811

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      DocEng '16 paper acceptance rate: 11 of 35 submissions, 31%. Overall acceptance rate: 178 of 537 submissions, 33%.
