Open Access 29.04.2023 | Special Issue Paper

Inv3D: a high-resolution 3D invoice dataset for template-guided single-image document unwarping

Authors: Felix Hertlein, Alexander Naumann, Patrick Philipp

Published in: International Journal on Document Analysis and Recognition (IJDAR) | Issue 3/2023

Abstract

Numerous business workflows involve printed forms, such as invoices or receipts, which are often manually digitalized to persistently search or store the data. As hardware scanners are costly and inflexible, smartphones are increasingly used for digitalization. Here, processing algorithms need to deal with prevailing environmental factors, such as shadows or crumples. Current state-of-the-art approaches learn supervised image dewarping models based on pairs of raw images and rectification meshes. The available results show promising predictive accuracies for dewarping, but generated errors still lead to sub-optimal information retrieval. In this paper, we explore the potential of improving dewarping models using additional, structured information in the form of invoice templates. We provide two core contributions: (1) a novel dataset, referred to as Inv3D, comprising synthetic and real-world high-resolution invoice images with structural templates, rectification meshes, and a multiplicity of per-pixel supervision signals and (2) a novel image dewarping algorithm, which extends the state-of-the-art approach GeoTr to leverage structural templates using attention. Our extensive evaluation includes an implementation of DewarpNet and shows that exploiting structured templates can improve the performance for image dewarping. We report superior performance for the proposed algorithm on our new benchmark for all metrics, including an improved local distortion of 26.1 %. We made our new dataset and all code publicly available at https://felixhertlein.github.io/inv3d.

Supplementary Information

The online version contains supplementary material available at https://doi.org/10.1007/s10032-023-00434-x.

1 Introduction

Numerous business workflows in enterprises involve printed forms, such as invoices, bills, or receipts. The receiving party has to manually digitalize the document in order to persistently access, search, or store the provided data, causing significant personnel costs. Here, existing solutions make use of scanners to create flatbed digital copies of the paper document and apply optical character recognition (OCR) to automatically extract information. This, however, creates additional costs and reduces the flexibility of the given solution.
In order to overcome the hardware restriction, current state-of-the-art approaches attempt to analyze document images taken with smartphones. Prominent examples are DewarpNet [6] and GeoTr [9], which learn to dewarp images in a supervised fashion, using the available dewarping meshes as ground truth. While these approaches generate promising results, they are still not satisfactorily robust when dealing with environmental factors such as light incidence, shadows, occlusions, crumpled or folded paper, and perspective transformations.
One potential remedy is the use of additional structured information in the form of templates, which constrain the general structure of the documents in order to improve unwarping. While it might be tedious to define initial templates, the added value is significant due to the considerable increase in dewarping precision and robustness.
In this paper, we follow exactly this path and propose a novel labeled invoice dataset with additional structural information to assist image dewarping. More specifically, we present Inv3D, a large, high-resolution invoice dataset comprising both synthetic data generated from carefully designed templates and challenging real-world data. Inv3D consists of 25,000 samples, each composed of four flatbed invoice image layers, two ground-truth annotations, the 3D warped document, nine supervision signal maps, and the backward transformation map (see Fig. 1). We then propose a novel supervised dewarping approach, referred to as GeoTrTemplate, which exploits the novel structured information by extending the recent GeoTr [9] algorithm. We encode both the warped image and our template image using a convolutional neural network and combine the feature representations. The subsequent transformer encoder–decoder module learns the attention between all pairs of features, which enables the model to combine the warped image with our structural template information. For our extensive evaluation, we train DewarpNet [6] (without its refinement network) and GeoTr [9] in a unified framework and evaluate both approaches on the Doc3D dataset as well as Inv3D, making use of established metrics such as MS-SSIM, LD, ED, and CER. In addition, we introduce the newer perceptual metric LPIPS [50] as a benchmark metric for document dewarping. To the best of our knowledge, this is the first dataset and approach to make use of structured template information for available invoices to foster research on robust image dewarping systems.
Our contributions are threefold:
  • We present a novel high-resolution dataset with template information, 3D renderings, a multiplicity of supervision signal maps, and backward transforms to enable designated learning of structural features for image dewarping.
  • We propose a novel image dewarping algorithm, which improves the state-of-the-art by a considerable margin through leveraging additional template information.
  • We provide an extensive empirical evaluation of the novel dataset and model. In addition, we will publish our code and data in its completeness to be utilized as a benchmark system for future research.
The paper is structured as follows: first, we review the related work; we then explain the creation process of Inv3D and present our own approach before reporting our evaluation. We conclude with Sect. 6.

2 Related work

For the extraction of textual information from images, several factors strongly influence the performance of off-the-shelf solutions such as Tesseract [38]. Images from flatbed scanners have constant illumination, little noise, and no deformations. Since these factors are important for the accuracy of OCR, there has been research on reconstructing those conditions from images of documents that were captured in the wild.
Datasets. There are several datasets available that focus on document rectification; they are summarized in Table 1. An early real dataset comprising 102 images of bent documents was presented for evaluation in a page dewarping contest [35]. A large synthetic dataset focusing solely on bends, generated with the cylindrical surface model from Cao et al. [2], was presented by Garai et al. [12]. A step toward more complex deformations, and thus higher applicability to real-world scenarios, was taken by Ma et al. [29]. They created a large synthetic dataset comprising bends and folds. Deformations were generated in 2D and thus generalize poorly to 3D structures. This dataset was extended by RectiNet [1]. The current state-of-the-art dataset is Doc3D [6]. It is a large document unwarping dataset comprising captured 3D meshes and a collection of rendered document images. It adds more realism by including crumpled documents. CREASE [31] follows the pipeline of Das et al. [6] to render a high-resolution dataset with additional supervisory signals, which, however, is not publicly available. Inv3D represents the first publicly accessible dataset with high resolution and complex 3D structures. Similar to CREASE, we provide additional 3D annotations such as per-pixel angle, curvature, and text maps. In addition to that, Inv3D is the first template-based dataset that enables research on structured documents in the warped domain with the support of flatbed template information while comprising the same challenging aspects as Doc3D. The exploitation of known flatbed templates at inference time might ease the problem of image dewarping and thus improve the potential for real-world applications of this technology.
Table 1
Overview of existing datasets and their characteristics
Dataset
#Images
Public
3D
Crumples
High Res.
Real data
Templates
Flat Doc.
Inv3D
25k
\(\checkmark \)
\(\checkmark \)
\(\checkmark \)
\(\checkmark \)
\(\checkmark \)
\(\checkmark \)
\(\checkmark \)
Doc3D [6]
100k
\(\checkmark \)
\(\checkmark \)
\(\checkmark \)
\(\checkmark \)
CREASE [31]
15k
\(\checkmark \)
\(\checkmark \)
\(\checkmark \)
\(\checkmark \)
DocUnet [29]
100k
\(\checkmark \)
\(\checkmark \)
RectiNet[1]
+8k
?
?
S-CNN[12]
+100k
\(\checkmark \)
?
\(\checkmark \)
?
DIDC [35]
102
\(\checkmark \)
\(\checkmark \)
Characteristics marked with ? indicate that we could not find conclusive information, e.g., because we do not know whether that information would be made available upon public access
Approaches. The correction of illumination and noise in document images has been considered in [7, 9, 21, 33, 41]. Also, there is work on improving text extraction from noisy images [16]. The task of document rectification, i.e., generating a flatbed version of the captured document, is at the core of our work and thus reviewed in more detail.
While there has been work using 3D sensors for document unwarping [4, 23, 40, 47], we are interested in recovering a flatbed version of documents from a single image only. Features in 2D have played an important role for this task. Popular examples include the usage of baselines of horizontal text [14, 17, 18, 26, 36] and vertical stroke boundaries [27, 28]. These feature-driven approaches are frequently combined with the assumption of a cylindrical surface model [2] or a developable surface [22]. Tian and Narasimhan [39] relaxed the assumption on the surface model by allowing the reconstruction of a broader class of 3D shapes. A CNN-based approach for texts in English and Bangla was presented by Garai et al. [12].
One drawback of the aforementioned works is that they focus on bent surfaces. Folds and especially crumples, which make the geometry of the surface significantly more complex, are neglected despite their common occurrence in document images. The first work using deep learning for distortion estimation, and not only feature extraction, from a single captured image is DocUNet [29]. The usage of this deep learning pipeline significantly sped up the unwarping process compared to previous methods; however, only 2D deformations were used during training. RectiNet [1] uses a gated and bifurcated stacked U-Net module and slightly improves upon the original DocUNet model. DewarpNet [6] improves upon DocUNet by incorporating 3D information. A two-stage system with two sub-networks, one for unwarping and one for texture mapping, is presented. Xie et al. [43] introduced a deep learning-based method that uses displacement flow estimation. Here, Doc3D is used as a dataset. CREASE [31] introduces a context-aware end-to-end dewarping pipeline. The authors acknowledge the importance of predicting the orientation of the text and add a text angle prediction as previously seen in scene text recognition tasks. Feng et al. [9] introduce a transformer-based model to capture the global context of the document via self-attention. Xie et al. [44] propose a model which learns sparse control points that are used to interpolate the backward mapping. Das et al. [8] decouple local and global unwarping by dividing the image into patches prior to the rectification process and stitching them together afterward. DocScanner [10] and Marior [49] tackle the unwarping problem iteratively, refining a single rectification estimate in multiple steps. Xue et al. [46] unwarp documents in two steps using a coarse and a refinement transformer. In contrast to previous work, the authors focus on high-frequency signals for training. Jiang et al. [15] formulate the task as a constrained optimization problem. They segment background and foreground, detect text lines, and use these constraints to search for an unwarping map. The PaperEdge system by Ma et al. [30] pre-unwarps the documents based on the outline of the paper sheet and subsequently learns a texture-based deformation to obtain the final result. Lastly, Feng et al. [11] propose DocGeoNet, a geometrically constrained representation of the document. The approach learns an intermediate 3D representation of a given document before inferring the backward map. While the state of the art shows promising results, increasing both the quality and reliability of current approaches remains a very active area of research. We chose to base our approach on the geometric unwarping model DocTr [9], which shows highly competitive unwarping performance. DocTr is a particularly good choice since its integrated attention mechanism allows the model to meaningfully integrate the additional structural information in the form of templates.

3 Dataset

Since, to the best of our knowledge, there is no publicly available large-scale, high-resolution dataset for document dewarping that includes visual templates, we fill this gap by creating a novel dataset called Inv3D. The dataset creation pipeline consists of four stages: resource preparation, invoice rendering, invoice warping, and finally the auxiliary map generation. Note that our dataset Inv3D focuses on a single type of document only, namely invoices. Since we provide our generation pipeline, one can easily create their own dataset using other types of documents. We conclude this section by introducing a new real-world evaluation dataset called Inv3DReal.

3.1 Resource preparation

In order to create convincingly realistic invoices, we collected publicly available invoice templates for entrepreneurs in common text processing formats. By converting them to HTML documents, we are capable of manipulating given formats and contents through simple text modifications while preserving their overall layout. We replaced all exemplary content provided by the input document with machine-readable tags such as {{ seller.company.name }}. Thus, our system is capable of automatically inserting the correct content at the correct position within the invoice Web page as intended by the invoice template creators. In total, we collected and prepared 100 different invoices which form the basis for the subsequent stages.
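The tag syntax shown above resembles Jinja2-style placeholders. As a minimal sketch (assuming Jinja2; the field names are illustrative and not the actual Inv3D content schema), filling such a template could look as follows:

```python
from jinja2 import Template

# Illustrative HTML fragment with machine-readable tags; the actual Inv3D
# templates are complete invoice web pages prepared in the same spirit.
invoice_template = Template(
    "<div class='seller'>{{ seller.company.name }}</div>"
    "<div class='total'>{{ order.total }} {{ order.currency }}</div>"
)

# Hypothetical content record; Inv3D generates such records randomly (Sect. 3.2).
content = {
    "seller": {"company": {"name": "Acme GmbH"}},
    "order": {"total": "1,299.00", "currency": "EUR"},
}

print(invoice_template.render(**content))
```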

3.2 Pipeline: invoice

The first step in creating a dataset sample is creating a realistic invoice instance. Starting from an invoice Web template, we perform random content generation and apply random appearance changes before rendering the invoice instance files.
Random content generation. We randomly create fake sales orders and personas that resemble real invoices as closely as possible using existing libraries and the E-Commerce Kaggle dataset [3]. To achieve a high level of realism, we retain data coherency during the generation process and fit the data to the layout constraints imposed by the web page template, i.e., the number of available rows. Additionally, we generate random representations of the data to increase its variance, e.g., different date formats. Since we provide the generated content in a structured manner, its text representation, and its position in the document, our dataset can also be used for the task of information extraction.
Random appearance changes. By applying random modifications to the invoice Web templates, we increase the visual variance and thus reduce the potential for overfitting. We employ color and font substitutions, as well as random font size scaling. Furthermore, we replace the logo of the given invoice document, if present, with a random logo image from the Large Logo Dataset [34] and alter the document margin.
Rendering. To create a fake invoice sample, the random content is filled into the randomly modified Web templates and rendered as an image in A4 format. Furthermore, we create three auxiliary images using JavaScript and CSS manipulations: information delta, template, and text mask. The information delta depicts all randomly generated texts. The template image shows everything except the information delta and hence the overall structure of the given document and static text. The third and last auxiliary image contains all text within the invoice document. See Fig. 2a–d.
Additionally, we provide two types of ground-truth information: the true word list and relevant image areas. The latter describes which information is expected to be at which position within the document. Figure 2e visualizes the relevant areas. These ground-truth annotations are relevant for other tasks than image dewarping, e.g., information extraction and document understanding.

3.3 Pipeline: warping

The next step in the dataset generation process is the mapping of flat invoices to deformed sheets of paper in 3D and creating 2D renderings (see Fig. 3a). We project our invoices onto the meshes from Doc3D [6] using Blender. For the environment maps, we used the Laval Indoor HDR dataset [13]. In contrast to the Doc3D dataset, we rendered our samples at a considerably higher resolution in order to better represent the real-world scenario: we chose 1600 × 1600 instead of the 448 × 448 pixels used by Doc3D.
This procedure generates the rendering itself (Fig. 3a) and various additional ground-truth maps, namely an albedo map (Fig. 3b), depth map (Fig. 3c), normal map (Fig. 3d), reconstruction map (Fig. 3e), UV map (Fig. 3f), and world coordinate map (Fig. 3g).
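The rendering step can be scripted with Blender's Python API. The following is a heavily simplified sketch, assuming a default node-based material and world setup; object names and file paths are placeholders, and the actual Inv3D pipeline additionally exports the ground-truth maps listed above.

```python
import bpy

# Assign the flat invoice image as the document texture (placeholder names).
doc = bpy.data.objects["document_mesh"]
mat = doc.active_material
tex = mat.node_tree.nodes.new("ShaderNodeTexImage")
tex.image = bpy.data.images.load("/path/to/invoice.png")
mat.node_tree.links.new(
    tex.outputs["Color"],
    mat.node_tree.nodes["Principled BSDF"].inputs["Base Color"],
)

# Use an HDR environment map for lighting (Laval Indoor HDR dataset in Inv3D).
world_nodes = bpy.context.scene.world.node_tree
env = world_nodes.nodes.new("ShaderNodeTexEnvironment")
env.image = bpy.data.images.load("/path/to/environment.hdr")
world_nodes.links.new(env.outputs["Color"], world_nodes.nodes["Background"].inputs["Color"])

# Render at 1600 x 1600 pixels instead of Doc3D's 448 x 448.
scene = bpy.context.scene
scene.render.resolution_x = 1600
scene.render.resolution_y = 1600
scene.render.filepath = "/path/to/render.png"
bpy.ops.render.render(write_still=True)
```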

3.4 Pipeline: auxiliary

In addition to the previously mentioned ground-truth maps, we create and provide four more maps to facilitate the usage of our dataset and to reduce the need for computation-intensive calculations. The first is a high-resolution backward map (BM), that is, a discrete function that remaps relative pixel positions to their original relative positions. The backward map is the inverse mapping of the UV map, which specifies the deformation of the original mesh in the warped space. Since the UV map generated by Blender is incomplete in the border region of the texture, we used nearest-neighbor extrapolation to fill the missing pixels in the backward map. The other auxiliary maps are relevant for CREASE [31], namely per-pixel orientation angles (see Fig. 3h), curvature estimations (see Fig. 3i), and text masks (see Fig. 3j).
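To illustrate how a backward map is used, the following sketch applies a BM to a warped image with OpenCV. It assumes the BM is stored as an (H, W, 2) float array of relative (x, y) source coordinates in [0, 1], which may differ from the exact storage layout of the released files.

```python
import cv2
import numpy as np

def apply_backward_map(warped: np.ndarray, bm: np.ndarray) -> np.ndarray:
    """Unwarp an image given a backward map.

    Assumes bm[i, j] = (x, y) holds the relative position in the warped image
    from which output pixel (i, j) samples its color.
    """
    h, w = warped.shape[:2]
    map_x = (bm[..., 0] * (w - 1)).astype(np.float32)
    map_y = (bm[..., 1] * (h - 1)).astype(np.float32)
    return cv2.remap(warped, map_x, map_y, cv2.INTER_LINEAR)
```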

3.5 Design decisions

Since we are using more than one external resource (invoice documents, logos, environments, object meshes, and fonts), we need to define the train, validation, and test splits before creating the dataset. We split all resources according to our split ratios (66.6% train, 16.7% validation, 16.7% test) and assigned each resource split to the corresponding dataset split. This way, we prevent information leakage by ensuring that no resource is used in more than one split. Note that the fonts were split on the font level instead of the style level. Furthermore, the meshes were split with regard to their generation: most meshes were created by modifying a recorded parent mesh [6]. We split all meshes by their parent mesh to keep a clear separation between splits.
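A minimal sketch of such group-aware splitting (function names and ratios mirror the description above but are not the released implementation):

```python
import random
from collections import defaultdict

def split_by_group(items, group_of, ratios=(0.666, 0.167, 0.167), seed=0):
    """Assign whole resource groups (e.g., meshes sharing a parent mesh or
    fonts of the same family) to train/val/test to prevent leakage."""
    groups = defaultdict(list)
    for item in items:
        groups[group_of(item)].append(item)

    keys = sorted(groups)
    random.Random(seed).shuffle(keys)

    n_train = round(ratios[0] * len(keys))
    n_val = round(ratios[1] * len(keys))
    assignment = {
        "train": keys[:n_train],
        "val": keys[n_train:n_train + n_val],
        "test": keys[n_train + n_val:],
    }
    return {split: [i for k in ks for i in groups[k]] for split, ks in assignment.items()}

# Example: split meshes so that a parent mesh never appears in two splits.
# splits = split_by_group(meshes, group_of=lambda mesh: mesh.parent_id)
```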

3.6 Real-world dataset

To complete the Inv3D dataset, we created a real-world dataset, referred to as Inv3DReal, to measure the performance of dewarping models under realistic conditions. Inv3DReal consists of 360 pictures displaying printed and altered invoices taken by a smartphone camera under different lighting conditions and backgrounds. We randomly selected 20 samples from the synthetic test dataset as the basis and applied six different deformations (perspective, curled, fewfold, multifold, crumples easy, crumples hard) inspired by Das et al. [6], as well as three different settings (bright, colored, shadow). We provide examples in Fig. 4. The bright setting (Fig. 4h) displays the documents on a gray background with daylight incidence. The second setting (Fig. 4i) displays the document on a white background sheet with RGB lighting. Lastly, we defined the shadow setting (Fig. 4j) as a document in front of a wooden surface with multiple shadows falling onto the document.

4 Architecture

In this section, we present our novel approach for image dewarping, which leverages structural templates at training and inference time. We extend the transformer-based state-of-the-art model GeoTr introduced by Feng et al. [9] to incorporate the a-priori known structural information represented by the invoice templates. In the following, we refer to our new model as GeoTrTemplate and its extension as GeoTrTemplateLarge. See Fig. 5 for a schematic. Our model receives the warped image \({{\,\mathrm{\textbf{W}}\,}}\in \mathbb {R}^{h_0 \times w_0 \times 3}\) and the template image \({{\,\mathrm{\textbf{T}}\,}}\in \mathbb {R}^{h_1 \times w_1 \times 3}\). Both inputs are scaled to a fixed resolution of \(288 \times 288\) for GeoTrTemplate and \(600 \times 600\) for GeoTrTemplateLarge before applying a geometric head H individually to each image. The head H creates deep image representations with \(36 \times 36\) positional features in a 128-dimensional space. For GeoTrTemplate, we employed the geometric head proposed by Feng et al. [9]. We define the geometric head for our large model as a slice of the EfficientNet B7 noisy student model [45], namely the first four blocks followed by a convolutional layer. The features of the warped image \(H({{\,\mathrm{\textbf{W}}\,}})\) and the template \(H({{\,\mathrm{\textbf{T}}\,}})\) are concatenated, forming a combined input representation \(R \in \mathbb {R}^{36 \times 36 \times 256}\). We then apply the transformer encoder and decoder from Feng et al. [9] and their geometric tail module to upsample the resulting backward map. For details regarding the employed modules, see the original paper. The output is a backward map \({{\,\mathrm{\textbf{B}}\,}}\in [0, 1]^{288 \times 288 \times 2}\). Our loss function is defined as the L1-norm between the output backward map \({{\,\mathrm{\textbf{B}}\,}}\) and the true backward map \({{\,\mathrm{\varvec{\hat{B}}}\,}}\).
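The following PyTorch-style sketch summarizes the template fusion described above. Here, `head`, `transformer`, and `tail` are placeholders for the geometric head, transformer encoder-decoder, and upsampling tail of GeoTr [9]; the shared-weight head is our reading of the description and should be treated as an assumption.

```python
import torch
import torch.nn as nn

class GeoTrTemplateSketch(nn.Module):
    """Illustrative sketch of the template fusion in GeoTrTemplate."""

    def __init__(self, head: nn.Module, transformer: nn.Module, tail: nn.Module):
        super().__init__()
        self.head = head              # image -> (B, 128, 36, 36) features
        self.transformer = transformer
        self.tail = tail              # fused features -> (B, 2, 288, 288) backward map

    def forward(self, warped: torch.Tensor, template: torch.Tensor) -> torch.Tensor:
        # Both inputs are resized to 288 x 288 (600 x 600 for the large model)
        # before being encoded by the geometric head.
        f_warped = self.head(warped)
        f_template = self.head(template)
        # Concatenate along the channel dimension: (B, 256, 36, 36),
        # i.e., the combined representation R of Sect. 4.
        fused = torch.cat([f_warped, f_template], dim=1)
        return self.tail(self.transformer(fused))

def backward_map_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # L1 loss between the predicted and the ground-truth backward map.
    return nn.functional.l1_loss(pred, target)
```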

5 Evaluation

5.1 Metrics

The metrics can be divided into visual and text-based metrics and are explained in the following sections. Each metric measures the similarity between two images: the unwarped image based on the learned backward map \({{\,\mathrm{\textbf{B}}\,}}\) and the flat invoice image. Note that even with the perfect backward map \({{\,\mathrm{\varvec{\hat{B}}}\,}}\) the metrics do not yield a perfect score since the backward maps do not correct the lighting influence; thus, the perfectly unwarped image \({{\,\mathrm{\varvec{\hat{B}}}\,}}({{\,\mathrm{\textbf{W}}\,}})\) contains shadows and ambient lighting, while the reference image does not.

5.1.1 Visual metrics

We used the visual metrics MS-SSIM [42], LD [48], and LPIPS [50], which we briefly explain in the following. For all visual metrics, we resized the images to a fixed area of 598,400 pixels while retaining the ground-truth aspect ratio, as proposed by Ma et al. [29].
MS-SSIM. As an established perceptual metric, we employed the multiscale structural similarity (MS-SSIM) [42]. It measures the perceived change in structural information by calculating statistical properties on multiple image windows at different scales. The metric consists of multiple structural similarity (SSIM) calculations on different scales of the input and reference image in order to become scale-invariant. Similar to the evaluation of Das et al. [6], we convert the source and reference image to grayscale in order to create comparable numbers before applying the metric. The MS-SSIM ranges between 0 and 1, where 1 denotes the optimal score.
LD. The local distortion (LD) as defined by You et al. [48] quantifies the similarity of two images based on the SIFT flow [24]. Input and reference images are converted to dense SIFT feature matrices and subsequently are matched pixel-wise to form the SIFT flow. The local distortion is defined as the mean L2-norm of the SIFT flow. We used the implementation and parametrization of Ma et al. [29] and apply it to grayscale images. The LD ranges between 0 and infinity, where 0 is the optimal score.
LPIPS. In addition to MS-SSIM and LD, we employ the learned perceptual image patch similarity (LPIPS) metric introduced by Zhang et al. [50] to measure the perceived image similarity. The authors show that the LPIPS metric is better suited for measuring perceived image similarity than SSIM. The metric is learned using a large-scale similarity preference dataset. For our evaluation, we used the pre-trained weights provided by the authors based on the AlexNet [19] model. LPIPS ranges between 0 and infinity, where 0 denotes the optimal score.
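As a sketch of the visual evaluation (assuming the pytorch_msssim and lpips packages; the authors' exact preprocessing may differ, and LD is omitted here because it requires a SIFT flow implementation):

```python
import lpips
import torch
import torch.nn.functional as F
from pytorch_msssim import ms_ssim

def resize_to_area(img: torch.Tensor, area: int = 598_400) -> torch.Tensor:
    """Resize a (1, C, H, W) image to roughly `area` pixels, keeping the aspect ratio."""
    _, _, h, w = img.shape
    scale = (area / (h * w)) ** 0.5
    return F.interpolate(img, size=(round(h * scale), round(w * scale)),
                         mode="bilinear", align_corners=False)

def visual_metrics(unwarped: torch.Tensor, reference: torch.Tensor) -> dict:
    """unwarped and reference are (1, 3, H, W) tensors with values in [0, 1]."""
    reference = resize_to_area(reference)
    # Align the unwarped image to the reference resolution for a pixel-wise comparison.
    unwarped = F.interpolate(unwarped, size=reference.shape[-2:],
                             mode="bilinear", align_corners=False)

    # MS-SSIM on grayscale images (channel mean as a simple stand-in for RGB-to-gray).
    msssim = ms_ssim(unwarped.mean(1, keepdim=True),
                     reference.mean(1, keepdim=True), data_range=1.0).item()

    # LPIPS with the AlexNet backbone; the network expects inputs in [-1, 1].
    lpips_model = lpips.LPIPS(net="alex")
    lpips_score = lpips_model(unwarped * 2 - 1, reference * 2 - 1).item()

    return {"ms_ssim": msssim, "lpips": lpips_score}
```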

5.1.2 Text-based metrics

For many use cases such as information extraction, a text-based metric is better suited to evaluate the unwarping method. Using the learned backward map \({{\,\mathrm{\textbf{B}}\,}}\), we calculate the unwarped image \({{\,\mathrm{\textbf{B}}\,}}({{\,\mathrm{\textbf{W}}\,}})\) and perform OCR using the open-source engine Tesseract 4.0.0 [38]. In order to detect the text in images, the input image requires a sufficiently high resolution with respect to the contained text size. Therefore, we scaled the unwarped image and reference image to a size of 3,740,000 pixels while preserving the reference image aspect ratio. To evaluate the recognized text, we use two different metrics, ED and CER, described in the following.
ED. The edit distance (ED) is defined as the number of insertions, deletions, and substitutions required to transform an input text to the corresponding reference text.
CER. We calculate the character error rate (CER) for each reference text. The CER is defined as the Levenshtein distance [20] between input and reference divided by the number of characters in the reference text.
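A sketch of the text-based evaluation using pytesseract and the python-Levenshtein package (both are assumptions on our side; the paper only specifies Tesseract 4.0.0):

```python
import Levenshtein
import pytesseract
from PIL import Image

def text_metrics(unwarped_image_path: str, reference_text: str) -> dict:
    """Compute ED and CER between OCR output and the reference text.

    Assumes the unwarped image has already been rescaled to a sufficiently
    high resolution (about 3,740,000 pixels in the paper).
    """
    recognized = pytesseract.image_to_string(Image.open(unwarped_image_path))
    ed = Levenshtein.distance(recognized, reference_text)   # edit distance
    cer = ed / max(len(reference_text), 1)                   # character error rate
    return {"ed": ed, "cer": cer}
```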

5.2 Baseline selection

We compare our results to the baselines DewarpNet (without refinement network) [6] and GeoTr [9]. Document dewarping can be decomposed into two subtasks: geometric dewarping and illumination correction. The former remaps all pixel locations, whereas the latter alters the per-pixel colors to remove shading and environmental light effects. As our model performs purely geometric dewarping, we selected baselines that use geometric dewarping only to make a fair comparison. The usage of an illumination correction model is decoupled from the geometric dewarping and can thus be appended to all baselines as well as our model. Since the refinement network of DewarpNet [6] and IllTr [9] are illumination correction networks, we argue that omitting these networks is well founded.

5.3 Hyperparameters

We trained all models for up to 300 epochs with an early stopping patience of 25 epochs based on the validation mean squared error between \({{\,\mathrm{\textbf{B}}\,}}\) and \({{\,\mathrm{\varvec{\hat{B}}}\,}}\). All other hyperparameters depend on the model type. We used the parametrization published by the original authors to reproduce their results as closely as possible. For GeoTrTemplate, we used a batch size of 8, the AdamW optimizer [25] with an initial learning rate of \(10^{-3}\), and the OneCycleLR scheduler [37] with a maximum learning rate of \(10^{-3}\). Note that for the training of GeoTr, GeoTrTemplate, and GeoTrTemplateLarge, we employed gradient clipping to increase the training stability. We clipped the global gradient norm to a value of 1.
DewarpNet [6] and GeoTr [9] employ different background augmentation strategies to boost model performance. DewarpNet replaces the warped image background with randomly selected textures from the Describable Textures Dataset [5] during training. GeoTr learns a lightweight semantic segmentation network [32] in order to remove the background beforehand. To enable a fair comparison of both approaches and our models, we kept the original backgrounds. Furthermore, we augmented all images for training using color jitter with a random change of up to 20 % in brightness, contrast, saturation, and hue, respectively. Note that the input image resolution differs between the individual models due to architectural constraints. In particular, DewarpNet receives images with \(128 \times 128\) pixels, whereas GeoTr and GeoTrTemplate use images with a resolution of \(288 \times 288\) pixels and GeoTrTemplateLarge a resolution of \(600 \times 600\) pixels.
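A sketch of the optimization setup described in this section (model construction and data loading are placeholders; the released training code may differ in detail):

```python
import torch
from torchvision import transforms

def configure_training(model: torch.nn.Module, steps_per_epoch: int, epochs: int = 300):
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=1e-3, epochs=epochs, steps_per_epoch=steps_per_epoch
    )
    # Up to 20 % random change in brightness, contrast, saturation, and hue.
    augment = transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.2)
    return optimizer, scheduler, augment

def training_step(model, batch, optimizer, scheduler, augment):
    warped, template, true_bm = batch
    pred_bm = model(augment(warped), template)
    loss = torch.nn.functional.l1_loss(pred_bm, true_bm)   # L1 loss on the backward map
    optimizer.zero_grad()
    loss.backward()
    # Clip the global gradient norm to 1 for training stability.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    return loss.item()
```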

5.4 Quantitative results

Table 2 shows the quantitative results of our approach and the baseline methods evaluated on our new dataset Inv3DReal. We include the identity backward map for reference, thus creating a lower bound for the scores. As expected, all approaches outperform the identity baseline in all metrics by far. Our experiments show that GeoTr is superior to DewarpNet without the refinement network in all metrics and for all training datasets. When comparing the two models with respect to the training dataset used, we conclude that training on the Inv3D dataset slightly improves the evaluation results compared to training on Doc3D. Since Inv3D is more similar to Inv3DReal than Doc3D, this effect is expected. By far the best results in all metrics are achieved by our models GeoTrTemplate and GeoTrTemplateLarge, especially for the visual evaluation metrics. The local distortion improves by 23.4 % and 26.1 %, respectively, in comparison with the runner-up GeoTr trained on Inv3D. These results show the effectiveness of our approach.
Table 2
Quantitative evaluation of our new model GeoTrTemplate on Inv3DReal with respect to the training dataset
Model | Train Dataset | ↑MS-SSIM | ↓LD | ↓LPIPS | ↓ED | ↓CER
Identity | – | 0.44 (0.10) | 36.75 (13.80) | 0.60 (0.10) | 533 (169) | 0.83 (0.21)
DewarpNet (w/o ref) | Doc3D | 0.56 (0.11) | 26.10 (11.50) | 0.42 (0.12) | 384 (182) | 0.59 (0.25)
DewarpNet (w/o ref) | Inv3D | 0.55 (0.11) | 25.33 (10.56) | 0.43 (0.12) | 387 (176) | 0.60 (0.24)
GeoTr | Doc3D | 0.56 (0.11) | 23.27 (10.35) | 0.40 (0.12) | 357 (185) | 0.55 (0.26)
GeoTr | Inv3D | 0.56 (0.11) | 22.81 (9.98) | 0.41 (0.12) | 365 (181) | 0.57 (0.25)
GeoTrTemplate (ours) | Inv3D | 0.64 (0.12) | 17.46 (10.62) | 0.32 (0.13) | 349 (185) | 0.54 (0.26)
GeoTrTemplateLarge (ours) | Inv3D | 0.65 (0.12) | 16.86 (10.46) | 0.31 (0.13) | 327 (184) | 0.51 (0.26)
Values in brackets denote standard deviations
Table 3
Comparison of our implementation with the numbers reported by the original papers; models are trained on Doc3D and evaluated on the DocUNet benchmark [29]
Model | ↑MS-SSIM | ↓LD | ↓LPIPS | ↓ED | ↓CER
DewarpNet (w/o ref) | 0.45 (0.12) | 10.15 (7.49) | 0.32 (0.11) | 1180 (1318) | 0.30 (0.22)
DewarpNet (w/o ref, orig) | 0.47 (–) | 8.98 (–) | – | 1289 (–) | 0.31 (0.25)
GeoTr | 0.45 (0.12) | 8.65 (6.11) | 0.31 (0.10) | 892 (1254) | 0.23 (0.24)
GeoTr (orig) | – | 8.38 (–) | – | 935 (–) | 0.31 (–)
Values in brackets denote standard deviations
Table 4
Detailed evaluation of GeoTrTemplate on our new benchmark Inv3DReal split by modification category
Modification | ↑MS-SSIM | ↓LD | ↓LPIPS | ↓ED | ↓CER
perspective | 0.69 (0.12) | 20.92 (12.15) | 0.27 (0.12) | 316 (191) | 0.49 (0.26)
curled | 0.70 (0.09) | 19.74 (11.75) | 0.26 (0.11) | 301 (173) | 0.47 (0.25)
fewfold | 0.66 (0.10) | 18.70 (9.67) | 0.30 (0.11) | 335 (184) | 0.52 (0.26)
multifold | 0.61 (0.11) | 18.84 (11.00) | 0.34 (0.12) | 367 (165) | 0.57 (0.23)
crumples easy | 0.64 (0.11) | 12.75 (7.33) | 0.31 (0.11) | 357 (206) | 0.56 (0.31)
crumples hard | 0.52 (0.10) | 13.82 (8.65) | 0.46 (0.11) | 417 (175) | 0.64 (0.21)
bright | 0.69 (0.11) | 17.74 (10.39) | 0.25 (0.11) | 272 (177) | 0.42 (0.25)
color | 0.66 (0.10) | 19.41 (11.20) | 0.37 (0.14) | 329 (185) | 0.51 (0.25)
shadow | 0.56 (0.10) | 15.24 (9.88) | 0.35 (0.10) | 446 (149) | 0.69 (0.20)
Values in brackets denote standard deviations
We also trained the baseline models on Doc3D and evaluated on the established DocUNet benchmark [29]. We were able to reproduce the reported numbers approximately. The difference is likely due to differences in the image augmentation methods that were applied, e.g., we omit the random background replacement of DewarpNet in our setting. We chose to apply the exact same image augmentations for all approaches to allow for a fair comparison between them. See Table 3 for a direct comparison of our results with the reported numbers.
When comparing the absolute numbers of the DocUNet benchmark and our Inv3DReal benchmark, we observe that the DocUNet evaluations are closer to the optimum for most metrics, which indicates that our benchmark is harder to solve for the given approaches. Note that the edit distance on DocUNet is higher than on Inv3DReal, which is due to the fact that DocUNet images contain more text by a large margin. Therefore, the OCR engine yields long texts, which leads to a high absolute number of insertions, deletions, and replacements for DocUNet.
To better understand the characteristics of each approach, we provide an in-depth evaluation of our model GeoTrTemplate based on Inv3DReal. We average the evaluation data per deformation class and per lighting setting. The results are given in Table 4. According to most metrics, the curled deformation appears to be the easiest task, whereas the heavy crumples represent the hardest class to unwarp. Interestingly, the LD does not agree with the other metrics, as stronger deformations lead to better results with regard to the LD metric. When we compare the three different environment settings, it appears that bright is the easiest, whereas shadow is the hardest according to most metrics. Similar to the deformation evaluation, we observe the inverse order according to the local distortion.
For a quantitative evaluation of DewarpNet [6] and GeoTr [9] on Inv3DReal with regard to the different modifications, please refer to the supplementary material.

5.5 Qualitative results

Table 5
Ablation study of GeoTrTemplate on Inv3DReal by gradually removing template information for training and inference
Ablation | ↑MS-SSIM | ↓LD | ↓LPIPS | ↓ED | ↓CER
White Template | 0.56 (0.11) | 22.50 (10.34) | 0.40 (0.12) | 345 (183) | 0.54 (0.26)
Structure only | 0.61 (0.12) | 19.76 (10.09) | 0.34 (0.13) | 341 (186) | 0.53 (0.26)
Text only | 0.60 (0.11) | 18.71 (10.32) | 0.35 (0.13) | 347 (186) | 0.54 (0.26)
Full | 0.64 (0.12) | 17.46 (10.62) | 0.32 (0.13) | 349 (185) | 0.54 (0.26)
Random template | 0.54 (0.11) | 29.42 (11.65) | 0.44 (0.12) | 355 (178) | 0.55 (0.25)
We separately evaluated the importance of correct template choice by randomly selecting templates during inference only. Values in brackets denote standard deviations
Figure 6 displays three selected samples from Inv3DReal, as well as the unwarping results from DewarpNet [6], GeoTr [9], our model GeoTrTemplateLarge, and the original invoice document. When comparing DewarpNet and GeoTr, we observe that both models have problems correcting the line straightness. A comparison of GeoTrTemplateLarge and GeoTr indicates better global positioning and straighter lines when using the template.

5.6 Ablation study

We conducted an ablation study to measure the influence of different types of structural information on the model performance. For our study, we altered the templates in three different ways and trained our model from scratch using the altered template as input. The modifications are as follows:
White Template. The template image is completely white and thus contains no additional information over the warped input image.
Text Only. The template image for this ablation contains all texts visible on the original template image (see Fig. 2c), but no other structures such as lines or images. Note that all texts were converted to black so that the texts are still visible after removing the background colors. The final template ablation is a black and white image.
Structure Only. We removed all textual information from the original template image (see Fig. 2c), such that only the structural information remains.
The evaluation results on the Inv3DReal dataset are given in Table 5. The white template ablation performs the poorest of all ablations. The absolute metric values are comparable to those of the GeoTr model trained on Inv3D as given in Table 2. This performance is as expected since both experiments receive the same information as input and have a fairly similar network structure. According to the visual metrics, the structure-only ablation and the text-only ablation each boost the performance of our model by a few points, but the combination of both yields the best overall performance. This finding indicates a correlation between the model performance and the amount of a-priori known information about the target structure. For the text metrics, while there is no significant change in the ED and CER values, the structure-only ablation performs best by a small margin. Overall, we see a stronger improvement for the visual metrics compared to the text-based metrics, which indicates that adding structural information primarily helps to improve the global positioning rather than the fine-grained details.
To further investigate the influence of the template on performance, we conducted a second ablation test. For this, we trained the GeoTrTemplate model using the Inv3D dataset and selected random templates during inference. The results are given in Table 5. The comparison of choosing a random vs. the correct template shows that correct template selection is crucial for performance. Indeed, selecting a wrong template results in degraded performance compared with providing no additional information (white template).

6 Conclusion

In this work, we presented Inv3D, a novel high-resolution invoice dataset comprising both synthetic and real-world data with rich label information. Apart from the rectification mesh for dewarping each individual invoice, we add corresponding template information which can be utilized during training to improve generalizability. Inv3D comprises 25,000 samples based on 100 templates and heterogeneous environmental factors, such as challenging lighting conditions and a multiplicity of document deformations. In addition, we introduced GeoTrTemplate and GeoTrTemplateLarge, two novel models which leverage a-priori available structural information for the task of document image dewarping. We conducted a detailed evaluation study to compare our new models with the state-of-the-art approaches DewarpNet [6] and GeoTr [9]. Our empirical analysis showed that both models outperform the baseline methods significantly; in particular, the GeoTrTemplateLarge model improves the local distortion of GeoTr by 26.1 %. Nevertheless, the absolute values for text detection show that further research is needed in order to solve this task robustly and consistently.
In future work, we plan to investigate better approaches to exploit template information, as available in Inv3D. The iterative refinement approach proposed by Feng et al. [10] for DocScanner might be used to create a correspondence map of interest points in the warped image and the template. We believe that the availability of templates at inference time can play an important role in document dewarping through the visual cues they provide. Another interesting direction is the calculation of unwarping confidence scores based on the matching of the unwarped image and the template. The confidence scores could become important for real-world applications to avoid the acquisition of erroneous information.

Declarations

Conflict of interest

The authors declare no competing interests.
Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Appendix

Supplementary Information

Below is the link to the electronic supplementary material.
References
1. Bandyopadhyay, H., Dasgupta, T., Das, N., et al.: A gated and bifurcated stacked U-Net module for document image dewarping. In: 2020 25th International Conference on Pattern Recognition (ICPR), IEEE, pp. 10,548–10,554 (2021)
2. Cao, H., Ding, X., Liu, C.: A cylindrical surface model to rectify the bound document image. In: Proceedings Ninth IEEE International Conference on Computer Vision, IEEE, pp. 228–233 (2003)
4. Chua, K.B., Zhang, L., Zhang, Y., et al.: A fast and stable approach for restoration of warped document images. In: Eighth International Conference on Document Analysis and Recognition (ICDAR'05), IEEE, pp. 384–388 (2005)
5. Cimpoi, M., Maji, S., Kokkinos, I., et al.: Describing textures in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3606–3613 (2014)
6. Das, S., Ma, K., Shu, Z., et al.: DewarpNet: single-image document unwarping with stacked 3D and 2D regression networks. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 131–140 (2019)
7. Das, S., Sial, H.M., Baldrich, R., et al.: Intrinsic decomposition of document images in-the-wild. In: British Machine Vision Conference (BMVC) (2020)
8. Das, S., Singh, K.Y., Wu, J., et al.: End-to-end piece-wise unwarping of document images. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4268–4277 (2021)
9. Feng, H., Wang, Y., Zhou, W., et al.: DocTr: document image transformer for geometric unwarping and illumination correction. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 273–281 (2021a)
10. Feng, H., Zhou, W., Deng, J., et al.: DocScanner: robust document image rectification with progressive learning. arXiv preprint arXiv:2110.14968 (2021b)
11. Feng, H., Zhou, W., Deng, J., et al.: Geometric representation learning for document image rectification. In: European Conference on Computer Vision, Springer, pp. 475–492 (2022)
12. Garai, A., Biswas, S., Mandal, S., et al.: Dewarping of document images: a semi-CNN based approach. Multimed. Tools Appl. 80(28), 36009–36032 (2021)
13. Gardner, M.A., Sunkavalli, K., Yumer, E., et al.: Learning to predict indoor illumination from a single image. ACM Trans. Graph. (TOG) 36(6), 1–14 (2017)
14. Huang, Z., Gu, J., Meng, G., et al.: Text line extraction of curved document images using hybrid metric. In: 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR), IEEE, pp. 251–255 (2015)
15. Jiang, X., Long, R., Xue, N., et al.: Revisiting document image dewarping by grid regularization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4543–4552 (2022)
16. Jung, E.S., Son, H., Oh, K., et al.: DUET: detection utilizing enhancement for text in scanned or captured documents. In: 2020 25th International Conference on Pattern Recognition (ICPR), IEEE, pp. 5466–5473 (2021)
17. Kil, T., Seo, W., Koo, H.I., et al.: Robust document image dewarping method using text-lines and line segments. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), IEEE, pp. 865–870 (2017)
18. Kim, B.S., Koo, H.I., Cho, N.I.: Document dewarping via text-line based optimization. Patt. Recogn. 48(11), 3600–3614 (2015)
19. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Adv. Neural Inform. Process. Syst. 25, 84 (2012)
20. Levenshtein, V.I., et al.: Binary codes capable of correcting deletions, insertions, and reversals. In: Soviet Physics Doklady, Soviet Union, pp. 707–710 (1966)
21. Li, X., Zhang, B., Liao, J., et al.: Document rectification and illumination correction using a patch-based CNN. ACM Trans. Graph. (TOG) 38(6), 1–11 (2019)
22. Liang, J., DeMenthon, D., Doermann, D.: Geometric rectification of camera-captured document images. IEEE Trans. Patt. Anal. Mach. Intell. 30(4), 591–605 (2008)
23. Lilienblum, E., Michaelis, B.: Book scanner dewarping with weak 3D measurements and a simplified surface model. In: International Conference on Discrete Geometry for Computer Imagery, Springer, pp. 529–540 (2008)
24. Liu, C., Yuen, J., Torralba, A.: SIFT flow: dense correspondence across scenes and its applications. IEEE Trans. Patt. Anal. Mach. Intell. 33(5), 978–994 (2010)
26. Lu, S., Tan, C.L.: Document flattening through grid modeling and regularization. In: 18th International Conference on Pattern Recognition (ICPR'06), IEEE, pp. 971–974 (2006a)
27. Lu, S., Tan, C.L.: The restoration of camera documents through image segmentation. In: Document Analysis Systems, pp. 484–495 (2006b)
28. Lu, S., Chen, B.M., Ko, C.C.: A partition approach for the restoration of camera images of planar and curled document. Image Vis. Comput. 24(8), 837–848 (2006)
29. Ma, K., Shu, Z., Bai, X., et al.: DocUNet: document image unwarping via a stacked U-Net. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4709 (2018)
30. Ma, K., Das, S., Shu, Z., et al.: Learning from documents in the wild to improve document unwarping. In: ACM SIGGRAPH 2022 Conference Proceedings, pp. 1–9 (2022)
31. Markovitz, A., Lavi, I., Perel, O., et al.: Can you read me now? Content aware rectification using angle supervision. In: European Conference on Computer Vision, Springer, pp. 208–223 (2020)
32. Qin, X., Zhang, Z., Huang, C., et al.: U2-Net: going deeper with nested U-structure for salient object detection. Patt. Recognit. 106, 107404 (2020)
33. Ramanna, V.K.B., Bukhari, S.S., Dengel, A.: Document image dewarping using deep learning. In: ICPRAM, pp. 524–531 (2019)
35. Shafait, F., Breuel, T.M.: Document image dewarping contest. In: 2nd Int. Workshop on Camera-Based Document Analysis and Recognition, Curitiba, Brazil, pp. 181–188 (2007)
36. Simon, G., Tabbone, S.: Generic document image dewarping by probabilistic discretization of vanishing points. In: 2020 25th International Conference on Pattern Recognition (ICPR), IEEE, pp. 2344–2351 (2021)
37. Smith, L.N., Topin, N.: Super-convergence: very fast training of neural networks using large learning rates. In: Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, International Society for Optics and Photonics, p. 1100612 (2019)
38. Smith, R.: An overview of the Tesseract OCR engine. In: Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), IEEE, pp. 629–633 (2007)
39. Tian, Y., Narasimhan, S.G.: Rectification and 3D reconstruction of curved document images. In: CVPR 2011, IEEE, pp. 377–384 (2011)
40. Ulges, A., Lampert, C.H., Breuel, T.: Document capture using stereo vision. In: Proceedings of the 2004 ACM Symposium on Document Engineering, pp. 198–200 (2004)
41. Wang, Y., Zhou, W., Lu, Z., et al.: UDoc-GAN: unpaired document illumination correction with background light prior. In: Proceedings of the 30th ACM International Conference on Multimedia, pp. 5074–5082 (2022)
43. Xie, G.W., Yin, F., Zhang, X.Y., et al.: Dewarping document image by displacement flow estimation with fully convolutional network. In: International Workshop on Document Analysis Systems, Springer, pp. 131–144 (2020)
44. Xie, G.W., Yin, F., Zhang, X.Y., et al.: Document dewarping with control points. In: International Conference on Document Analysis and Recognition, Springer, pp. 466–480 (2021)
45. Xie, Q., Luong, M.T., Hovy, E., et al.: Self-training with noisy student improves ImageNet classification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10,687–10,698 (2020)
46. Xue, C., Tian, Z., Zhan, F., et al.: Fourier document restoration for robust document dewarping and recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4573–4582 (2022)
47. Yamashita, A., Kawarago, A., Kaneko, T., et al.: Shape reconstruction and image restoration for non-flat surfaces of documents with a stereo vision system. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), IEEE, pp. 482–485 (2004)
48. You, S., Matsushita, Y., Sinha, S., et al.: Multiview rectification of folded documents. IEEE Trans. Patt. Anal. Mach. Intell. 40(2), 505–511 (2017)
49. Zhang, J., Luo, C., Jin, L., et al.: Marior: margin removal and iterative content rectification for document dewarping in the wild. arXiv preprint arXiv:2207.11515 (2022)
50. Zhang, R., Isola, P., Efros, A.A., et al.: The unreasonable effectiveness of deep features as a perceptual metric. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595 (2018)