DOI: 10.1145/3476887.3476890
research-article

GloSAT Historical Measurement Table Dataset: Enhanced Table Structure Recognition Annotation for Downstream Historical Data Rescue

Published: 31 October 2021

ABSTRACT

Understanding and extracting tables from documents is a research problem that has been studied for decades. Table structure recognition is the labelling of components within a detected table, where the table region is either detected automatically or provided manually. This paper presents the GloSAT historical measurement table dataset, designed to train table structure recognition models for use in downstream historical data rescue applications. The dataset contains 500 scanned and manually annotated images of pages from meteorological measurement logbooks. We enhance the standard full-table and individual-cell annotations with additional annotations for headings, headers, and table bodies. We also provide annotations for coarse segmentation cells: groups of multiple data cells delimited by ruling lines of ink or by whitespace in the table, which often correspond to semantically grouped data cells. Our dataset annotations are provided in the VOC2007 and ICDAR-2019 Competition on Table Detection and Recognition (cTDaR-19) XML formats, and our dataset can easily be aggregated with the cTDaR-19 dataset. We report results from running a series of benchmark algorithms on our new dataset, concluding that post-processing is very important for performance and that page style affects model performance less than table type does.
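As an illustration of the VOC2007-style annotations mentioned in the abstract, the following minimal sketch reads bounding boxes from a Pascal VOC XML string using Python's standard library. The element layout (`object`, `name`, `bndbox`, `xmin`, and so on) follows the general Pascal VOC convention. The file name, coordinates, and the class labels `table`, `header`, and `cell` are assumptions made for this example and may not match the labels used in the GloSAT dataset.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical annotation in Pascal VOC layout. The file name, labels, and
# coordinates are invented for illustration and are not taken from GloSAT.
SAMPLE_VOC = """<annotation>
  <filename>page_001.jpg</filename>
  <size><width>2000</width><height>3000</height><depth>3</depth></size>
  <object>
    <name>table</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>1900</xmax><ymax>2800</ymax></bndbox>
  </object>
  <object>
    <name>header</name>
    <bndbox><xmin>100</xmin><ymin>200</ymin><xmax>1900</xmax><ymax>400</ymax></bndbox>
  </object>
  <object>
    <name>cell</name>
    <bndbox><xmin>100</xmin><ymin>400</ymin><xmax>1000</xmax><ymax>600</ymax></bndbox>
  </object>
  <object>
    <name>cell</name>
    <bndbox><xmin>1000</xmin><ymin>400</ymin><xmax>1900</xmax><ymax>600</ymax></bndbox>
  </object>
</annotation>"""


def parse_voc(xml_text):
    """Return (filename, [(label, (xmin, ymin, xmax, ymax)), ...])."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        # Coordinates are stored as text; convert each corner to an integer.
        coords = tuple(
            int(float(bndbox.findtext(key)))
            for key in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((obj.findtext("name"), coords))
    return root.findtext("filename"), boxes


filename, boxes = parse_voc(SAMPLE_VOC)
# Count how many boxes carry each class label.
label_counts = Counter(label for label, _ in boxes)
print(filename, dict(label_counts))
```

Grouping boxes by label in this way is one simple starting point for separating table-level, header-level, and cell-level annotations before training or evaluation.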


  • Published in

    HIP '21: Proceedings of the 6th International Workshop on Historical Document Imaging and Processing
    September 2021
    72 pages
    ISBN: 9781450386906
    DOI: 10.1145/3476887

    Copyright © 2021 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Qualifiers

    • research-article
    • Research
    • Refereed limited

    Acceptance Rates

    Overall Acceptance Rate: 52 of 90 submissions, 58%
