1 Introduction
- We demonstrate how a Faster R-CNN object detector can be extended with a lightweight arrow keypoint predictor for diagram structure recognition.
- We identify data augmentation methods for diagrams and show that they vastly improve object detection results. We also propose a method to augment diagrams with words from an external dataset, to further reduce the confusion between arrow and text symbols.
- We propose a postprocessing pipeline to form a final diagram from object and keypoint candidates.
- We evaluate our system on four datasets from the flowchart and finite automata domain.
2 Related work
2.1 Handwritten diagram recognition
The FC_A online handwritten flowchart dataset was introduced in [1]. The dataset is publicly available, and its size has been increased to 419 flowcharts since its publication. Following the release, several methods for online flowchart recognition were proposed [2-6, 9, 18, 36, 37, 40]. Wu et al. [38] present the first work that uses FC_A for offline recognition. They use a three-stage recognition pipeline, including a shapeness estimation algorithm that determines whether a stroke grouping has a regular appearance. The pipeline is evaluated using the ground truth strokes without temporal information and achieves 83.2% symbol recognition accuracy. Since the evaluation does not account for stroke reconstruction errors, the result is not comparable with an image-based recognition system.

Bresler et al. introduced further online datasets for finite automata (FA [2]) and flowcharts (FC_B [6]). The latest online system in [6] is a pipeline with text/non-text separation, symbol segmentation, symbol classification, and structural analysis as its core parts. An offline extension to this online system is proposed in [7] and uses a stroke reconstruction preprocessing step. For the evaluation, two offline flowchart datasets based on the existing FC_B dataset were introduced. One of these datasets contains scans of a printed thicker-stroke visualization with real noise, which we refer to as FC_B\(_{\texttt {scan}}\). The adapted recognizer was also tested on the unordered FC_A strokes and achieved 84.2% symbol recognition recall.

Julca-Aguilar et al. [14] train a Faster R-CNN object detector for symbol detection on the FC_A dataset. To this end, they apply transfer learning from models pre-trained on MS COCO, a large dataset of natural images depicting everyday scenes [20]. The detector achieves a mAP@0.5 of 97.7%, where mAP@0.5 denotes mean average precision (mAP) at a ground truth bounding box overlap of at least 50%. Due to differing evaluation metrics, this result is not comparable to [7].

The DIDI dataset was published in 2020 [11]. It is the first large-scale diagram dataset and consists of two parts: 22,287 diagrams with textual labels (DIDI\(_{\texttt {text}}\)) and 36,368 diagrams without textual labels (DIDI\(_{\texttt {no\_text}}\)). Each handwritten diagram was collected by showing a flowchart image to a user, who was then asked to draw over it. The flowchart images were rendered using GraphViz based on randomly generated dot files. Unlike other online datasets, the handwritten diagrams are not annotated at stroke level. However, the provided dot files contain information about the rendered diagram, such as the position and size of each node.

In previous work, we introduced Arrow R-CNN and evaluated it on the FC_A dataset. We also demonstrated how a Faster R-CNN object detector can be extended with a lightweight arrow keypoint predictor for flowchart structure recognition. For the present paper, we have consolidated the overall technique and provide a more broadly applicable system for recognizing arrow-connected diagrams. Specifically, we improve on the data augmentation and arrow proposal sampling methods used during network training. Moreover, we propose a postprocessing pipeline to form a final diagram from object and keypoint candidates. We evaluate our system on four datasets, adding a finite automata dataset, a flowchart dataset of scanned images, and a large-scale flowchart dataset, and provide profound insights into accurate arrow and text phrase detection, which we consider the main challenge for recognizing arrow-connected diagrams.

2.2 Keypoint detection
3 Arrow R-CNN
3.1 Network architecture
3.2 Training
3.3 Inference
4 Integrating diagram domain knowledge
4.1 Augmentation
4.2 Diagram-aware postprocessing
Table 1 Dataset statistics

| Dataset | Split | Writers (split/total) | Templates (split/total) | Diagrams | Symbols |
|---|---|---|---|---|---|
| FC_A | Train | 31/35 | 14/28 | 248 | 5540 |
| FC_A | Test | 15/35 | 14/28 | 171 | 3791 |
| FC_B | Train | 10/24 | 28/28 | 280 | 6195 |
| FC_B | Validation | 7/24 | 28/28 | 196 | 4342 |
| FC_B | Test | 7/24 | 28/28 | 196 | 4343 |
| FA | Train | 11/25 | 12/12 | 132 | 3631 |
| FA | Validation | 7/25 | 12/12 | 84 | 2307 |
| FA | Test | 7/25 | 12/12 | 84 | 2323 |
| DIDI \(_{\texttt {no\_text}}\) | Train | ?/364* | 940/940 | 27,278 | 193,939 |
| DIDI \(_{\texttt {no\_text}}\) | Validation | ?/364* | 916/940 | 4545 | 34,464 |
| DIDI \(_{\texttt {no\_text}}\) | Test | ?/364* | 919/940 | 4545 | 34,139 |
| DIDI \(_{\texttt {text}}\) | Train | ?/364* | 5300/5629 | 16,717 | 173,070 |
| DIDI \(_{\texttt {text}}\) | Validation | ?/364* | 2131/5629 | 2785 | 30,468 |
| DIDI \(_{\texttt {text}}\) | Test | ?/364* | 2090/5629 | 2785 | 34,052 |
For the FA dataset, where initial arrows have no predecessor node, we use a heuristic and only connect an arrow to a predecessor node if the spatial distance between the arrow tail keypoint and the node bounding box is below 50 pixels. The duplicate edge suppression step eliminates duplicate candidate arrows that join the same two nodes in the same direction. Duplicates are resolved by keeping the arrow with the highest classification score, as in the sketch below.
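The duplicate edge suppression step can be expressed compactly. The following is an illustrative sketch rather than our exact implementation; the candidate arrow structure (resolved source/target node ids plus a classification score) is an assumption:

```python
def suppress_duplicate_edges(arrows):
    """Keep only the highest-scoring arrow per directed node pair.

    arrows: list of dicts with keys 'source', 'target' (node ids resolved
    from the arrow keypoints) and 'score' (classification score); this
    structure is hypothetical.
    """
    best = {}
    for arrow in arrows:
        key = (arrow["source"], arrow["target"])  # same two nodes, same direction
        if key not in best or arrow["score"] > best[key]["score"]:
            best[key] = arrow
    return list(best.values())
```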
5 Experiments

5.1 Datasets
We evaluate our system on three flowchart datasets (FC_A [1], FC_B [6], and DIDI [11]), and one finite automata dataset (FA [2]). Table 1 shows basic statistics for all datasets. As mentioned in Sect. 2.1, the DIDI dataset consists of two parts: one that contains diagrams with textual labels (DIDI\(_{\texttt {text}}\)) and one without textual labels (DIDI\(_{\texttt {no\_text}}\)). Throughout the experiments, we train on the entire DIDI dataset, but report the results for both parts separately. For all datasets, the splits were created either based on writers (FC_B, FA, DIDI) or based on templates (FC_A), such that the sets of writers or templates in the respective training, validation, and test parts are disjoint. This means that the experimental results show to what extent the model generalizes to unseen writers or to unseen layouts, but not both at the same time. As another difference, FC_A has no validation set. While it would be possible to take a subset of the training set as a validation set, the majority of related works does not conduct such a further split, and we opt for the same approach. To avoid overfitting to the test set, we conduct all hyperparameter tuning on the FC_B training and validation sets and train a model on FC_A using the same configuration.
For the FC_B dataset, we use the offline FC_B\(_{\texttt {scan}}\) dataset introduced in [7], which contains scans of printed FC_B diagrams. For the FC_A and FA datasets, we render the strokes as anti-aliased polylines with a stroke width of 3. To ensure that border-touching strokes are fully visible, we pad each image by 10 pixels on each side. For the DIDI dataset, we create an image with the dimensions of the drawing area that was shown to the user and plot the strokes at their corresponding positions. During data collection, the generated flowcharts were rescaled to fill the drawing area. The size of this drawing area varies, with a maximum of \(3600\times 2232\) pixels. To avoid overly large images, we rescale each image to the initial scale of the generated flowchart. A sketch of the stroke rendering step is shown below.
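The following is a minimal sketch of how online strokes can be rasterized in this way. It is illustrative rather than our exact implementation; the stroke format (a list of (x, y) point arrays) and the use of OpenCV are assumptions:

```python
import numpy as np
import cv2

def render_strokes(strokes, stroke_width=3, padding=10):
    """Rasterize online strokes as anti-aliased polylines on a white canvas.

    strokes: list of (N, 2) float arrays with (x, y) coordinates (assumed format).
    """
    points = np.concatenate(strokes)
    offset = points.min(axis=0) - padding  # shift so strokes get a 10 px margin
    width, height = np.ceil(points.max(axis=0) - offset + padding).astype(int)
    image = np.full((height, width), 255, dtype=np.uint8)  # white background
    for stroke in strokes:
        pts = np.round(stroke - offset).astype(np.int32).reshape(-1, 1, 2)
        cv2.polylines(image, [pts], isClosed=False, color=0,
                      thickness=stroke_width, lineType=cv2.LINE_AA)  # anti-aliased
    return image
```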
For the FC_B\(_{\texttt {scan}}\) dataset, we use the provided bounding box annotations. In the following, we outline how we generate the ground truth bounding boxes for the other, online datasets. For FC_A and FA, we define the symbol bounding box as the union of the bounding boxes of all its strokes. As mentioned in Sect. 2.1, the DIDI dataset is not annotated at stroke level. Instead, we use the symbol bounding boxes of the corresponding GraphViz diagram. Since the participants do not draw precisely over the diagrams, the extracted bounding boxes do not perfectly fit the handwritten shapes. To quantify this difference, we manually annotated 100 handwritten diagrams (50 without and 50 with textual labels). Figure 9 shows some drawings from this sample and illustrates two major drawing issues we identified: diagrams where a user did not draw over the flowchart as instructed (9a) and diagrams where a user forgot to draw some or all of the shapes (9b). Since the evaluation metrics described in Sect. 5.2 are based on bounding box IoU, we exclude these erroneous diagrams in order to obtain a meaningful assessment of our method. As a heuristic, we exclude a drawing if at least one bounding box contains no stroke pixels (see the sketch below). This heuristic correctly identifies 8 diagrams with drawing errors within the 100-diagram sample, but misses one diagram where the user forgot to draw an arrow. Table 2 shows the proportion of excluded diagrams using this heuristic and reveals that drawing mistakes occur very frequently in the DIDI\(_{\texttt {text}}\) train and test sets.
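A minimal sketch of this exclusion heuristic, assuming a grayscale image with a white (255) background and corner-format GraphViz boxes:

```python
import numpy as np

def has_drawing_error(image, boxes):
    """Return True if at least one symbol bounding box contains no stroke pixels.

    image: grayscale uint8 array, 255 = white background (assumption).
    boxes: iterable of (x_min, y_min, x_max, y_max) GraphViz symbol boxes.
    """
    for x0, y0, x1, y1 in boxes:
        crop = image[int(y0):int(y1) + 1, int(x0):int(x1) + 1]
        if crop.size == 0 or np.all(crop == 255):  # box contains only white pixels
            return True
    return False
```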
Table 2 DIDI diagrams with drawing errors: a drawing is excluded if at least one diagram bounding box contains only white pixels

| Split | DIDI \(_{\texttt {no\_text}}\) | DIDI \(_{\texttt {text}}\) |
|---|---|---|
| Train | 404/27,278 (1.48%) | 1303/16,717 (7.79%) |
| Validation | 12/4545 (0.26%) | 50/2785 (1.80%) |
| Test | 12/4545 (0.26%) | 867/2785 (31.13%) |
| Total | 428/36,368 (1.18%) | 2220/22,287 (9.96%) |
Arrow head and tail keypoints are already annotated in the FA and FC_B\(_{\texttt {scan}}\) datasets. For the DIDI dataset, we extract the head and tail keypoints from the arrow spline control points in the GraphViz dot file of the generated flowchart. For the FC_A dataset, we use a heuristic to extract the keypoints from the stroke data. For each arrow, we compute the Harris corner measure response image and then identify corner peaks with a minimum distance of 5. We set the arrow head and tail keypoints to the corner points closest to the next and previous node, respectively. We quantitatively evaluate the accuracy of our heuristic on the FC_B dataset, where the head and tail points have been annotated. For the flowcharts in the training split, we compute the mean absolute error (MAE) based on the Euclidean distance between each approximated and annotated arrow keypoint. We find that the approximated arrow tail (MAE = 1.38) and arrow head (MAE = 5.82) keypoints are sufficiently close to the human annotations.
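A sketch of this keypoint heuristic using scikit-image; the input format (a rasterized per-arrow image and the centers of the adjacent nodes) is assumed for illustration:

```python
import numpy as np
from skimage.feature import corner_harris, corner_peaks

def approximate_arrow_keypoints(arrow_image, prev_node_center, next_node_center):
    """Approximate arrow tail/head keypoints via Harris corners.

    arrow_image: 2D float array containing only the rasterized arrow strokes.
    *_node_center: (row, col) centers of the previous/next node (assumption).
    """
    response = corner_harris(arrow_image)             # Harris corner measure response
    corners = corner_peaks(response, min_distance=5)  # corner peaks, min distance 5
    if len(corners) == 0:
        raise ValueError("no corner peaks found")
    # Tail: corner closest to the previous node; head: corner closest to the next node.
    tail = corners[np.linalg.norm(corners - np.asarray(prev_node_center), axis=1).argmin()]
    head = corners[np.linalg.norm(corners - np.asarray(next_node_center), axis=1).argmin()]
    return tail, head
```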
Table 3 Diagram recognition rate (%) on the test sets

| Method | FC_A | FC_B | FA | DIDI \(_{\texttt {no\_text}}\) | DIDI \(_{\texttt {text}}\) |
|---|---|---|---|---|---|
| Online methods | | | | | |
| Wang et al. [36] | 5.8 | – | – | – | – |
| Julca-Aguilar et al. [14] | 34.0 | – | – | – | – |
| Bresler et al. [6] | 59.1 | 67.9 | 79.8 | – | – |
| Offline methods | | | | | |
| Bresler et al. [7] | – | 37.7 | – | – | – |
| Arrow R-CNN | 68.4 | 78.6 | 83.3 | 83.9 | 85.1 |
Table 4 Diagram recognition rate (%) for different augmentation methods

| Augmentation | FC_A | FC_B \(_{\texttt {scan}}\) | FA | DIDI \(_{\texttt {no\_text}}\) | DIDI \(_{\texttt {text}}\) |
|---|---|---|---|---|---|
| No augmentation | 23.4 | 70.4 | 52.4 | 82.5 | 83.7 |
| ShiftScaleRotate | 33.9 | 73.0 | 60.7 | 82.6 | 83.5 |
| + RandomRotate90 & Flip | 57.3 | 77.0 | 79.8 | 80.9 | 83.7 |
| + IAMWordAugmentation | 66.7 | 76.0 | 81.0 | 80.8 | 83.5 |
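The first three augmentation names match transforms from the albumentations library. The following is an illustrative sketch of such a pipeline, not our exact configuration: all probability and limit parameters are assumptions, and the custom IAM word augmentation is omitted:

```python
import albumentations as A

# Hypothetical parameters; the table above only names the transform types.
train_transforms = A.Compose(
    [
        A.ShiftScaleRotate(shift_limit=0.0625, scale_limit=0.1,
                           rotate_limit=15, p=0.5),
        A.RandomRotate90(p=0.5),
        A.Flip(p=0.5),  # random horizontal and/or vertical flip
    ],
    # Boxes and arrow keypoints must be transformed together with the image.
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
    keypoint_params=A.KeypointParams(format="xy", remove_invisible=False),
)

# Usage: out = train_transforms(image=image, bboxes=boxes, labels=labels,
#                               keypoints=keypoints)
```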
5.2 Evaluation metrics
Table 5 Diagram recognition rate (%) for different postprocessing methods

| Postprocessing | FC_A | FC_B \(_{\texttt {scan}}\) | FA | DIDI \(_{\texttt {no\_text}}\) | DIDI \(_{\texttt {text}}\) |
|---|---|---|---|---|---|
| Standard NMS (IoU \(\le 0.5\)) | 66.7 | 76.0 | 81.0 | 82.5 | 83.7 |
| + Node NMS | 66.7 | 78.6 | 82.1 | 83.5 | 83.8 |
| + Arrow NMS & edge suppr. | 67.8 | 78.6 | 83.3 | 83.9 | 85.1 |
| + Text NMS & merge node texts | 68.4 | 78.6 | 83.3 | 83.9 | 85.1 |
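Both the NMS variants above and the symbol recognition metrics in the following tables rely on the bounding box IoU between two boxes. For reference, a minimal IoU helper, assuming corner-format boxes:

```python
def box_iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

# A detection counts as correct at IoU 80% if box_iou(pred, gt) >= 0.8
# and the predicted class matches the ground truth class.
```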
5.3 Implementation
We apply the augmentation methods described in Sect. 4.1 to the FC_A, FC_B\(_{\texttt {scan}}\), and FA datasets. For the DIDI dataset, we do not use augmentation methods since the dataset is very large.

5.4 Results
As Table 4 shows, the augmentation methods are particularly effective on FC_A, where the model has to generalize to unseen layouts. For the large DIDI dataset, the augmentation methods slightly lower the recognition rate. To demonstrate the general applicability of our method, we used the same number of training iterations for all datasets. However, the combination of large dataset size and multiple augmentation methods might require more training iterations. We leave the investigation of the interplay between augmentation methods and dataset size to future work.

As Table 5 shows, arrow NMS and duplicate edge suppression improve the recognition rate on all datasets except FC_B\(_{\texttt {scan}}\), where the rate stays the same. Also, merging texts within a node is a straightforward method to improve the results on FC_A.
Table 6 FC_A symbol recognition at IoU 80% on the test set

| Class | Arrow R-CNN Precision | Arrow R-CNN Recall | Wu [38] Recall\(_{\text {IoU50}}\) |
|---|---|---|---|
| Arrow | 94.7/97.3* | 96.0/98.5* | 80.3 |
| Connection | 99.2 | 100 | 73.4 |
| Data | 100 | 99.7 | 78.5 |
| Decision | 100 | 99.5 | 78.9 |
| Process | 99.8 | 100 | 88.3 |
| Terminator | 100 | 100 | 90.6 |
| Text | 99.3 | 99.1 | 86.0 |
| Micro avg. | 97.9/98.8* | 98.3/99.1* | 83.2 |
Table 7 FC_B\(_{\texttt {scan}}\) symbol recognition at IoU 80% on the test set

| Class | Arrow R-CNN Precision | Arrow R-CNN Recall | Bresler [7] Recall |
|---|---|---|---|
| Arrow | 98.0/98.0* | 98.0/98.0* | 84.3 |
| Connection | 100 | 100 | 86.6 |
| Data | 100 | 94.9 | 94.4 |
| Decision | 100 | 100 | 96.9 |
| Process | 95.5 | 100 | 98.8 |
| Terminator | 100 | 100 | 93.6 |
| Text | 99.2 | 99.3 | 93.7 |
| Micro avg. | 98.7/98.7* | 98.7/98.7* | 91.3 |
Table 8 FA symbol recognition at IoU 80% on the test set

| Class | Precision | Recall |
|---|---|---|
| Arrow | 98.4/99.0* | 98.4/99.0* |
| Final state | 100 | 100 |
| State | 100 | 100 |
| Text | 99.6 | 99.7 |
| Micro avg. | 99.3/99.5* | 99.3/99.5* |
Table 9 DIDI symbol recognition at IoU 50% on the test set

| Class | DIDI \(_{\texttt {no\_text}}\) Precision | DIDI \(_{\texttt {no\_text}}\) Recall | DIDI \(_{\texttt {text}}\) Precision | DIDI \(_{\texttt {text}}\) Recall |
|---|---|---|---|---|
| Arrow | 95.6/97.5* | 94.7/96.5* | 96.6/99.2* | 95.2/98.0* |
| Box | 97.1 | 96.5 | 99.9 | 99.8 |
| Diamond | 99.2 | 97.5 | 99.9 | 99.9 |
| Octagon | 96.3 | 92.2 | 100 | 99.7 |
| Oval | 92.6 | 97.2 | 99.7 | 99.4 |
| Parallelogram | 97.9 | 97.0 | 99.9 | 99.8 |
| Text | – | – | 98.5 | 97.7 |
| Micro avg. | 96.1/97.0* | 95.4/96.3* | 98.4/99.1* | 97.6/98.4* |
On the FC_A dataset (Table 6), where related works report 83.2% [38] and 84.2% [7] symbol recognition recall, Arrow R-CNN achieves a much higher recall (98.3%). The largest source of error for Arrow R-CNN is the arrow class, where arrows either have not been detected with at least 80% bounding box overlap, or have not been joined to the correct node symbols. On the FC_A training set, we noticed that our model fails to recognize diagrams of templates 5 and 7. In these layouts, nodes are sometimes connected through a sequence of two arrows. Yet, our current postprocessing logic assumes that an arrow always points to a node and thus connects the arrow to the node closest to its head keypoint. We leave the development of appropriate methods for this arrow-after-arrow scenario to future work.

On the FC_B\(_{\texttt {scan}}\) test set, all arrows that have been detected correctly are also connected to their ground truth nodes. This demonstrates the effectiveness of the Arrow R-CNN arrow keypoint mechanism and postprocessing. Arrow R-CNN is also applicable to diagrams other than flowcharts. As Table 8 illustrates, the model perfectly recognizes the state and final state shapes in the FA finite automata test set and also achieves very good results for arrows and text. For the DIDI dataset, the symbol recognition results in Table 9 are all above 90%, but slightly lower than the results on the other datasets, even though a lower IoU threshold of 50% is used. As discussed in Sect. 5.1, this is due to the discrepancy between the ground truth bounding boxes extracted from the generated diagram and the actual drawings. In Sect. 5.5, we further discuss the implications of this discrepancy and show some error cases.
Table 10 FC_B\(_{\texttt {scan}}\) runtime per image: Arrow R-CNN timings are taken using a Tesla V100 GPU, with the image already resized and loaded to memory

| Method | Min [ms] | Mean [ms] | Std. [ms] | Max [ms] |
|---|---|---|---|---|
| Arrow R-CNN | 59 | 91 | 15 | 119 |
| Bresler et al. [7] | 2623 | 10,970 | – | 37,972 |
5.5 Error analysis and future work
We first analyze the remaining errors on an example FC_B\(_{\texttt {scan}}\) flowchart. For texts and straight arrows, where one bounding box side is often very short, a prediction that is off by a few pixels can result in less than 80% IoU. This raises the question of whether an 80% IoU threshold for all symbol types is too strict. From an end-user perspective, it might only matter that the arrow has been correctly identified as an edge between two nodes. To this end, future research could investigate graph similarity measures to evaluate diagram recognition systems.

The two confusions between process and data are likely due to the fact that the writer in question draws very uneven lines. These uneven lines are typically caused by uncontrolled oscillations of the hand muscles, dampened by inertia [33]. In handwriting recognition, elastic distortion augmentation is a way to simulate these oscillations [17, 33]. We found that although elastic distortion augmentation improves classification results, it has a negative effect on localization accuracy. This is because the distortions cause the annotated bounding boxes and keypoints to become inaccurate, e.g., by distorting a line close to a bounding box such that it extends beyond the box. Future work could investigate elastic distortion methods that also adapt the ground truth annotations accordingly.
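For illustration, a minimal sketch of the classic elastic distortion technique of [17, 33]: a random displacement field is smoothed with a Gaussian filter and applied to the image. The parameter values are assumptions, and, as discussed above, the bounding box and keypoint annotations are left untouched, which is exactly the source of the localization problem:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_distort(image, alpha=34.0, sigma=4.0, rng=None):
    """Elastic distortion of a 2D grayscale image (annotations are NOT adapted)."""
    rng = np.random.default_rng() if rng is None else rng
    # Random displacement fields, smoothed with a Gaussian and scaled
    # (alpha and sigma are assumed values).
    dx = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, image.shape), sigma) * alpha
    rows, cols = np.meshgrid(np.arange(image.shape[0]),
                             np.arange(image.shape[1]), indexing="ij")
    # Bilinear resampling at the displaced coordinates.
    return map_coordinates(image, [rows + dy, cols + dx], order=1, mode="reflect")
```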
For the DIDI dataset, Fig. 12 illustrates that the model is not only required to recognize the handwritten diagram, but also to predict the GraphViz diagram it originates from. As can be seen, the model correctly predicts node shape bounding boxes that are smaller than the hand-drawn shapes, i.e., it recognizes that the shapes have been drawn excessively large. However, when the position and size of the hand-drawn and GraphViz symbols differ too much, this task becomes nearly impossible. This is illustrated by the numerous text localization errors in the example, where, e.g., the hand-drawn arrow labels “Back” have very little intersection with the corresponding GraphViz texts. The example also demonstrates why the IAM word augmentation method is not very effective on DIDI, since the augmented word bounding box is defined by the handwritten strokes instead of the corresponding diagram label. Future work could propose evaluation methods that better disentangle handwriting and GraphViz diagram recognition performance for DIDI.