2018 | OriginalPaper | Chapter

TabbyPDF: Web-Based System for PDF Table Extraction

Authors: Alexey Shigarov, Andrey Altaev, Andrey Mikhailov, Viacheslav Paramonov, Evgeniy Cherkashin

Published in: Information and Software Technologies

Publisher: Springer International Publishing

Abstract

PDF is one of the most widespread formats for representing non-editable documents. Many PDF documents are machine-readable but remain untagged: they have no tags identifying layout items such as paragraphs, columns, or tables. An important challenge with these documents is how to extract tabular data from them. The paper presents a novel web-based system that extracts tables from untagged PDF documents with a complex layout, recovers their cell structures, and exports them in a tagged form (e.g. in CSV or HTML format). The system uses a heuristic-based approach to table detection and structure recognition. It relies mainly on recovering the human reading order of the text, including document paragraphs and table cells. A prototype of the system was evaluated using the methodology and dataset of the “ICDAR 2013 Table Competition”. The F-score, the standard metric of that competition, is 93.64% for the structure recognition phase and 83.18% for table extraction with automatic table detection. These results are comparable with state-of-the-art academic solutions.
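To make the reported F-scores easier to interpret, the following is a minimal sketch of how an F-score over cell adjacency relations can be computed, which is the general idea behind the ICDAR 2013 evaluation methodology. It is not the competition's evaluation tool and not part of TabbyPDF: the flat `(row, column)` table model, the names `adjacency_relations` and `f_score`, and the sample tables are all assumptions made for illustration, and spanning cells and text normalization are ignored.

```python
from typing import Dict, Set, Tuple

# Hypothetical table model: (row, column) -> cell text. Spanning cells are ignored.
Table = Dict[Tuple[int, int], str]
# (cell text, neighbour text, direction)
Relation = Tuple[str, str, str]


def adjacency_relations(table: Table) -> Set[Relation]:
    """Pair each non-empty cell with its nearest non-empty neighbour to the right and below."""
    relations: Set[Relation] = set()
    if not table:
        return relations
    max_row = max(row for row, _ in table)
    max_col = max(col for _, col in table)
    for (row, col), text in table.items():
        if not text:
            continue
        # Nearest non-empty neighbour in the same row, scanning right.
        for c in range(col + 1, max_col + 1):
            neighbour = table.get((row, c), "")
            if neighbour:
                relations.add((text, neighbour, "horizontal"))
                break
        # Nearest non-empty neighbour in the same column, scanning down.
        for r in range(row + 1, max_row + 1):
            neighbour = table.get((r, col), "")
            if neighbour:
                relations.add((text, neighbour, "vertical"))
                break
    return relations


def f_score(expected: Table, actual: Table) -> float:
    """Harmonic mean of precision and recall over adjacency relations."""
    truth = adjacency_relations(expected)
    found = adjacency_relations(actual)
    correct = len(truth & found)
    if correct == 0:
        return 0.0
    precision = correct / len(found)  # share of recognized relations that are right
    recall = correct / len(truth)     # share of ground-truth relations that were recovered
    return 2 * precision * recall / (precision + recall)


# Made-up ground truth and a recognition result that wrongly merged the second row.
ground_truth: Table = {(0, 0): "Year", (0, 1): "Sales", (1, 0): "2017", (1, 1): "42"}
recognized: Table = {(0, 0): "Year", (0, 1): "Sales", (1, 0): "2017 42"}

print(round(f_score(ground_truth, recognized), 4))  # 0.3333
```

In this toy case only one of the four ground-truth relations survives the merge and the recognizer reports two relations in total, so precision is 1/2, recall is 1/4, and the F-score is 1/3. Because relations are stored in a set, repeated identical cell pairs count once, which is a simplification of this sketch.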

Title: TabbyPDF: Web-Based System for PDF Table Extraction
Authors: Alexey Shigarov, Andrey Altaev, Andrey Mikhailov, Viacheslav Paramonov, Evgeniy Cherkashin
Publisher: Springer International Publishing
Book: Information and Software Technologies
Print ISBN: 978-3-319-99971-5
Electronic ISBN: 978-3-319-99972-2
Copyright Year: 2018
DOI: https://doi.org/10.1007/978-3-319-99972-2_20
