GloSAT Historical Measurement Table Dataset: Enhanced Table Structure Recognition Annotation for Downstream Historical Data Rescue

Authors:
Juliusz Ziomek, University of Southampton, United Kingdom
Stuart E. Middleton, University of Southampton, United Kingdom

HIP '21: Proceedings of the 6th International Workshop on Historical Document Imaging and Processing, September 2021, Pages 49–54
https://doi.org/10.1145/3476887.3476890

Published: 31 October 2021

ABSTRACT

Understanding and extracting tables from documents is a research problem that has been studied for decades. Table structure recognition is the labelling of components within a table whose region has either been detected automatically or provided manually. This paper presents the GloSAT historical measurement table dataset, designed to train table structure recognition models for use in downstream historical data rescue applications. The dataset contains 500 scanned and manually annotated images of pages from meteorological measurement logbooks. We enhance the standard full-table and individual-cell annotations with annotations for headings, headers, and table bodies. We also provide annotations for coarse segmentation cells: groups of data cells delimited by ruling lines of ink or by whitespace in the table, which often correspond to data cells that are semantically grouped. Our dataset annotations are provided in the VOC2007 and ICDAR-2019 Competition on Table Detection and Recognition (cTDaR-19) XML formats, and our dataset can easily be aggregated with the cTDaR-19 dataset. We report results from running a series of benchmark algorithms on our new dataset, concluding that post-processing is very important for performance and that page style affects model performance less than table type does.
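As a rough illustration of what working with VOC2007-style XML annotations can look like, the sketch below reads bounding boxes from an annotation and groups them by label. It is a minimal sketch under stated assumptions, not the dataset's own tooling: the sample XML, the file name, the coordinates, and the label names (`table`, `header`, `cell`) are hypothetical and chosen only to mirror the annotation types named in the abstract; the actual GloSAT label vocabulary may differ.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical VOC2007-style annotation. The file name, label names, and
# coordinates are illustrative assumptions, not taken from the GloSAT dataset.
SAMPLE_XML = """
<annotation>
  <filename>page_001.jpg</filename>
  <size><width>1000</width><height>1400</height><depth>3</depth></size>
  <object>
    <name>table</name>
    <bndbox><xmin>50</xmin><ymin>100</ymin><xmax>950</xmax><ymax>1300</ymax></bndbox>
  </object>
  <object>
    <name>header</name>
    <bndbox><xmin>50</xmin><ymin>100</ymin><xmax>950</xmax><ymax>180</ymax></bndbox>
  </object>
  <object>
    <name>cell</name>
    <bndbox><xmin>50</xmin><ymin>180</ymin><xmax>200</xmax><ymax>220</ymax></bndbox>
  </object>
  <object>
    <name>cell</name>
    <bndbox><xmin>200</xmin><ymin>180</ymin><xmax>350</xmax><ymax>220</ymax></bndbox>
  </object>
</annotation>
"""


def load_voc_boxes(xml_text):
    """Return {label: [(xmin, ymin, xmax, ymax), ...]} for a VOC-style annotation."""
    root = ET.fromstring(xml_text.strip())
    boxes = defaultdict(list)
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bndbox = obj.find("bndbox")
        # Coordinates are sometimes stored as floats, so parse via float first.
        box = tuple(
            int(float(bndbox.findtext(tag)))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes[label].append(box)
    return dict(boxes)


boxes = load_voc_boxes(SAMPLE_XML)
for label, items in boxes.items():
    print(label, len(items), items)
```

Grouping boxes by label keeps the finer-grained annotations (headers, cells) separate from the full-table box, which is convenient if a downstream step treats them differently.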

Published in

HIP '21: Proceedings of the 6th International Workshop on Historical Document Imaging and Processing
September 2021
72 pages
ISBN: 9781450386906
DOI: 10.1145/3476887

Copyright © 2021 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Deep Learning
Document Layout Analysis
Historical Documents
Image Processing
Measurements
Table Structure Recognition
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 52 of 90 submissions, 58%