ABSTRACT
Understanding and extracting tables from documents is a research problem that has been studied for decades. Table structure recognition is the labelling of components within a table whose location has either been detected automatically or provided manually. This paper presents the GloSAT historical measurement table dataset, designed to train table structure recognition models for use in downstream historical data rescue applications. The dataset contains 500 scanned and manually annotated images of pages from meteorological measurement logbooks. We extend the standard full-table and individual-cell annotations with annotations for headings, headers, and table bodies. We also provide annotations for coarse segmentation cells, each consisting of multiple data cells that are logically grouped by ruling lines of ink or by whitespace in the table; these often correspond to data cells that are semantically grouped. Our dataset annotations are provided in the VOC2007 and ICDAR-2019 Competition on Table Detection and Recognition (cTDaR-19) XML formats, and our dataset can easily be aggregated with the cTDaR-19 dataset. We report results from running a series of benchmark algorithms on our new dataset, and conclude that post-processing is very important for performance and that page style has less influence on model performance than table type.
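To illustrate the VOC2007 annotation format mentioned above, the following minimal Python sketch reads bounding boxes from an XML string in the generic Pascal VOC layout. The sample XML, the file name, the class labels `table` and `cell`, and the helper `parse_voc_boxes` are hypothetical and are not taken from the GloSAT release; only the generic VOC element names (`object`, `name`, `bndbox`, `xmin`, `ymin`, `xmax`, `ymax`) are assumed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the generic Pascal VOC layout. The labels, file name
# and coordinates are invented for illustration, not taken from the dataset.
SAMPLE_VOC_XML = """
<annotation>
  <filename>example_page.jpg</filename>
  <object>
    <name>table</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>410</xmax><ymax>620</ymax></bndbox>
  </object>
  <object>
    <name>cell</name>
    <bndbox><xmin>12</xmin><ymin>22</ymin><xmax>60</xmax><ymax>48</ymax></bndbox>
  </object>
</annotation>
"""


def parse_voc_boxes(xml_text):
    """Return a list of (label, xmin, ymin, xmax, ymax) tuples."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        label = obj.findtext("name")
        bndbox = obj.find("bndbox")
        # Coordinates may be stored as floats in some files, so go via float.
        coords = tuple(
            int(float(bndbox.findtext(tag)))
            for tag in ("xmin", "ymin", "xmax", "ymax")
        )
        boxes.append((label, *coords))
    return boxes


boxes = parse_voc_boxes(SAMPLE_VOC_XML)
for label, xmin, ymin, xmax, ymax in boxes:
    print(label, xmin, ymin, xmax, ymax)
```

The actual label set and any additional fields in the released annotation files should be checked against the dataset itself before reusing this sketch.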