2021 | OriginalPaper | Chapter

Towards Document Panoptic Segmentation with Pinpoint Accuracy: Method and Evaluation

Authors : Rongyu Cao, Hongwei Li, Ganbin Zhou, Ping Luo

Published in: Document Analysis and Recognition – ICDAR 2021

Publisher: Springer International Publishing

Abstract

In this paper, we study the task of document layout recognition for digital documents, requiring that the model detect the exact physical object regions without missing any text or including any redundant text outside the objects. This is a vital step in supporting high-quality information extraction, table understanding, and knowledge base construction over documents from various vertical domains (e.g., the financial, legal, and government fields). Here, in contrast to image documents, we consider digital documents, in which characters and graphic elements are given with their exact text and positions within the document page. Towards document layout recognition with pinpoint accuracy, we cast this problem as a document panoptic segmentation task, in which each token in the document page must be assigned a class label and an instance id. Whereas two predicted objects may intersect under traditional visual panoptic segmentation methods, such as Mask R-CNN, document objects never intersect, because most document pages follow a Manhattan layout. Therefore, we propose a novel framework, named the document panoptic segmentation (DPS) model. It first splits the document page into column regions and groups tokens into line regions, then extracts textual and visual features, and finally assigns a class label and an instance id to each line region. Additionally, we propose a novel metric based on the intersection over union (IoU) between the tokens contained in the predicted and ground-truth objects, which is more suitable than a metric based on the area IoU between the predicted and ground-truth bounding boxes. Finally, empirical experiments on the PubLayNet, ArXiv, and Financial datasets show that the proposed DPS model obtains mAP scores of 0.8833, 0.9205, and 0.8530 on the three datasets, respectively, a substantial improvement over the Faster R-CNN and Mask R-CNN models.
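To make the token-based metric concrete, here is a minimal Python sketch, illustrative only and not the authors' implementation: each object is represented by the set of token ids it contains, and IoU is computed over these sets rather than over bounding-box areas (the name token_iou and the set-of-ids representation are assumptions for the example).

    # Minimal sketch (not the authors' code): IoU over the token sets of a predicted
    # object and a ground-truth object, instead of over bounding-box areas.
    def token_iou(pred_tokens, gt_tokens):
        """IoU between the sets of token ids covered by the two objects."""
        pred, gt = set(pred_tokens), set(gt_tokens)
        union = pred | gt
        return len(pred & gt) / len(union) if union else 0.0

    # A prediction that clips the last 2 tokens of a 10-token paragraph scores 0.8,
    # even if its bounding box still overlaps the ground-truth box almost perfectly.
    print(token_iou(range(8), range(10)))  # 0.8

This illustrates why a token-level metric is stricter for the pinpoint-accuracy setting: missing or extra tokens are penalized directly, whereas a box-level IoU can stay close to 1 when an object's bounding box is only slightly clipped.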


Literature
3. Cao, R., Cao, Y., Zhou, G., Luo, P.: Extracting variable-depth logical document hierarchy from long documents: method, evaluation, and application. J. Comput. Sci. Technol. (2021)
4. Cao, Y., Li, H., Luo, P., Yao, J.: Towards automatic numerical cross-checking: extracting formulas from text. In: WWW (2018)
5. Fang, J., Tao, X., Tang, Z., Qiu, R., Liu, Y.: Dataset, ground-truth and performance metrics for table detection evaluation. In: DAS (2012)
6. Gilani, A., Qasim, S.R., Malik, I., Shafait, F.: Table detection using deep learning. In: ICDAR (2017)
8. Göbel, M., Hassan, T., Oro, E., Orsi, G.: ICDAR 2013 table competition. In: ICDAR (2013)
9. He, D., Cohen, S., Price, B., Kifer, D., Giles, C.L.: Multi-scale multi-task FCN for semantic page segmentation and table detection. In: ICDAR (2018)
10. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: ICCV (2017)
11. Katti, A.R., et al.: Chargrid: towards understanding 2D documents. In: EMNLP (2018)
12. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: ICLR (2017)
13. Kirillov, A., He, K., Girshick, R., Rother, C., Dollár, P.: Panoptic segmentation. In: CVPR (2019)
14. Koci, E., Thiele, M., Lehner, W., Romero, O.: Table recognition in spreadsheets via a graph representation. In: DAS (2018)
15. Li, H., Yang, Q., Cao, Y., Yao, J., Luo, P.: Cracking tabular presentation diversity for automatic cross-checking over numerical facts. In: KDD (2020)
16. Li, K., et al.: Cross-domain document object detection: benchmark suite and method. In: CVPR (2020)
17. Li, M., Cui, L., Huang, S., Wei, F., Zhou, M., Li, Z.: TableBank: table benchmark for image-based table detection and recognition (2019)
18. Li, M., et al.: DocBank: a benchmark dataset for document layout analysis. arXiv (2020)
19. Li, X.H., Yin, F., Liu, C.L.: Page object detection from PDF document images by deep structured prediction and supervised clustering. In: ICPR (2018)
20. Li, Y., Tarlow, D., Brockschmidt, M., Zemel, R.: Gated graph sequence neural networks. In: ICLR (2016)
21. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR (2015)
22. Luong, M.T., Nguyen, T.D., Kan, M.Y.: Logical structure recovery in scholarly articles with rich document features. Int. J. Digit. Libr. Syst. (2010)
23. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: ICLR (2013)
24. Nagy, G., Seth, S.C.: Hierarchical representation of optically scanned documents. In: Conference on Pattern Recognition (1984)
25. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: unified, real-time object detection. In: CVPR (2016)
26. Redmon, J., Farhadi, A.: YOLO9000: better, faster, stronger. In: CVPR (2017)
27. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: NeurIPS (2015)
28. Riba, P., Dutta, A., Goldmann, L., Fornés, A., Ramos, O., Lladós, J.: Table detection in invoice documents by graph neural networks. In: ICDAR (2019)
29. Riba, P., Dutta, A., Goldmann, L., Fornés, A., Ramos, O., Lladós, J.: Table detection in invoice documents by graph neural networks (2019)
30. Ronneberger, O., Fischer, P., Brox, T.: U-Net: convolutional networks for biomedical image segmentation. In: MICCAI (2015)
31. Shahab, A., Shafait, F., Kieninger, T., Dengel, A.: An open approach towards the benchmarking of table structure recognition systems. In: DAS (2010)
32. Siegel, N., Lourie, N., Power, R., Ammar, W.: Extracting scientific figures with distantly supervised neural networks. In: JCDL (2018)
33. Smith, R.: An overview of the Tesseract OCR engine. In: ICDAR (2007)
34. Tensmeyer, C., Morariu, V.I., Price, B., Cohen, S., Martinez, T.: Deep splitting and merging for table structure decomposition. In: ICDAR (2019)
35. Wu, S., et al.: Fonduer: knowledge base construction from richly formatted data. In: SIGMOD (2018)
37. Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., Zhou, M.: LayoutLM: pre-training of text and layout for document image understanding. In: KDD (2020)
38. Yang, X., Yumer, E., Asente, P., Kraley, M., Kifer, D., Giles, C.L.: Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In: CVPR (2017)
39. Zhong, X., Tang, J., Yepes, A.J.: PubLayNet: largest dataset ever for document layout analysis. In: ICDAR (2019)
Metadata
Title
Towards Document Panoptic Segmentation with Pinpoint Accuracy: Method and Evaluation
Authors
Rongyu Cao
Hongwei Li
Ganbin Zhou
Ping Luo
Copyright Year
2021
DOI
https://doi.org/10.1007/978-3-030-86331-9_1
