2018 | Original Paper | Book Chapter

TabbyPDF: Web-Based System for PDF Table Extraction

Written by: Alexey Shigarov, Andrey Altaev, Andrey Mikhailov, Viacheslav Paramonov, Evgeniy Cherkashin

Published in: Information and Software Technologies

Publisher: Springer International Publishing

Abstract

PDF is one of the most widespread ways to represent non-editable documents. Many PDF documents are machine-readable but remain untagged: they have no tags identifying layout items such as paragraphs, columns, or tables. An important challenge with these documents is how to extract tabular data from them. The paper presents a novel web-based system for extracting tables from untagged PDF documents with a complex layout, recovering their cell structures, and exporting them into a tagged form (e.g. in CSV or HTML format). The system uses a heuristic-based approach to table detection and structure recognition. It relies mainly on recovering the human reading order of the text, including document paragraphs and table cells. A prototype of the system was evaluated using the methodology and dataset of the “ICDAR 2013 Table Competition”. The F-score, the standard metric, is 93.64% for the structure recognition phase and 83.18% for table extraction with automatic table detection. The results are comparable with state-of-the-art academic solutions.
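
To make the reported metric concrete, the following is a minimal sketch of how an F-score can be computed for table structure recognition. It assumes, as a simplification, that a recognized table is scored by comparing the adjacency relations between neighbouring cells with those of the ground truth, and it ignores details such as blank cells and text normalization. The function names `adjacency_relations` and `f_score` and the sample tables are hypothetical and are not taken from TabbyPDF.

```python
from typing import List, Set, Tuple

# An adjacency relation: (cell text, neighbouring cell text, direction).
Relation = Tuple[str, str, str]


def adjacency_relations(grid: List[List[str]]) -> Set[Relation]:
    """Collect right- and down-neighbour relations from a rectangular grid."""
    relations: Set[Relation] = set()
    for r, row in enumerate(grid):
        for c, text in enumerate(row):
            if c + 1 < len(row):
                relations.add((text, row[c + 1], "right"))
            if r + 1 < len(grid):
                relations.add((text, grid[r + 1][c], "down"))
    return relations


def f_score(predicted: Set[Relation], truth: Set[Relation]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-score) for two sets of relations."""
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical ground truth, and a prediction that missed the last row.
truth_grid = [["Year", "Sales"], ["2017", "10"], ["2018", "12"]]
predicted_grid = [["Year", "Sales"], ["2017", "10"]]

precision, recall, f = f_score(
    adjacency_relations(predicted_grid), adjacency_relations(truth_grid)
)
print(f"precision={precision:.3f} recall={recall:.3f} F={f:.3f}")
```

In this toy case every predicted relation is correct, so precision is 1, but only 4 of the 7 ground-truth relations are found, so recall is 4/7 and the F-score is 8/11.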

References
4. Göbel, M., Hassan, T., Oro, E., Orsi, G.: ICDAR 2013 table competition. In: Proceedings of 12th International Conference on Document Analysis and Recognition, pp. 1449–1453 (2013)
7. Govindaraju, V., Zhang, C., Ré, C.: Understanding tables in context using standard NLP toolkits. In: Proceedings of 51st Annual Meeting of the Association for Computational Linguistics, pp. 658–664 (2013)
13. Nurminen, A.: Algorithmic extraction of data in tables in PDF documents. Master’s thesis, Tampere University of Technology, Tampere, Finland (2013)
14. Oro, E., Ruffolo, M.: PDF-TREX: an approach for recognizing and extracting tables from PDF documents. In: Proceedings of 10th International Conference on Document Analysis and Recognition, pp. 906–910 (2009)
15. Perez-Arriaga, M.O., Estrada, T., Abad-Mota, S.: TAO: system for table detection and extraction from PDF documents. In: Proceedings of 29th International Florida Artificial Intelligence Research Society Conference, pp. 591–596 (2016)
16. Ramel, J.Y., Crucianu, M., Vincent, N., Faure, C.: Detection, extraction and representation of tables. In: Proceedings of 7th International Conference on Document Analysis and Recognition, vol. 1, pp. 374–378 (2003)
21. Shigarov, A.: Table understanding using a rule engine. Expert. Syst. Appl. 42(2), 929–937 (2015)
25. e Silva, A.C.: Parts that add up to a whole: a framework for the analysis of tables. Ph.D. thesis, University of Edinburgh (2010)
26. e Silva, A.C., Jorge, A.M., Torgo, L.: Design of an end-to-end method to extract information from tables. Int. J. Doc. Anal. Recognit. (IJDAR) 8(2), 144–171 (2006)
27. Yildiz, B., Kaiser, K., Miksch, S.: pdf2table: a method to extract table information from PDF files. In: Proceedings of 2nd Indian International Conference on Artificial Intelligence, Pune, India, pp. 1773–1785 (2005)

Metadata
Title
TabbyPDF: Web-Based System for PDF Table Extraction
Written by
Alexey Shigarov
Andrey Altaev
Andrey Mikhailov
Viacheslav Paramonov
Evgeniy Cherkashin
Copyright year
2018
DOI
https://doi.org/10.1007/978-3-319-99972-2_20