1 Introduction
1.1 Motivations
- paragraph hierarchy can be restored,
- images and tables can be automatically annotated on the basis of their captions,
- extracted images can be used for similar document retrieval,
- paragraphs spanning more than one column can be merged,
- the logical flow of paragraphs can be restored (the context of words is not lost).
- there is no guarantee that the complete text is extracted;
- the order of paragraphs or even single sentences may be disrupted;
- national and Latin characters can be omitted due to character encoding;
- the structure of enumerations or itemizations can be flattened to plain text;
- tables and similar logical structures may not be extracted.
1.2 Contribution
- they refer to the recognition of the whole document structure but in a specific domain, where the considered set of layouts is known in advance.
- the method of graphic component segmentation,
- the rule-based line segmentation method,
- the text line grouping method,
- layout-independent text structure recognition rules,
- the gridded table detection method,
- the evaluation routine, based on non-white pixel analysis.
2 Related research
2.1 Document segmentation
2.2 Segment recognition
3 The problem formulation
3.1 Segmentation problem
- Positions of page headers and footers;
- Closeness of an image and its caption;
- Uniformity of spacing between lines and paragraphs.
3.2 Recognition problem
- Abstract,
- Author,
- Caption,
- Header,
- Page Footer,
- Page Header,
- Paragraph,
- Table,
- Title,
- Graphic elements: line, picture, diagram, scheme, chart.
- A photo is usually a spatially coherent collection of non-white pixels;
- A block of text has a distinctive outline along its borders;
- A drawing consists of interconnected groups of lines and curves mixed with fragments of text;
- A caption is a block of text spatially related to a graphic element and may begin with a keyword.
3.3 Formal problem definition
4 The proposed method
- A high-resolution document image is necessary, at least 300 DPI. The limitation stems from the applied OCR tool (Tesseract), whose recognition quality decreases at lower resolutions;
- The document image has a uniform background colour and a distinct contrast between background and foreground;
- The text path has to be a straight, orthogonal line;
- The text font size within a single paragraph has to be constant.
4.1 The idea of the method
4.2 Detailed description
4.2.1 Graphics segmentation
- the number of pixels in the segment skeleton,
- the skeleton height.
4.2.2 Text segmentation and recognition
- detection of blocking connected components,
- text line segmentation,
- text line grouping (paragraph segmentation),
- text region classification.
- If a group is left-aligned and \(\mathrm{avg}_l < \mathrm{avg}_r\), then the group is ignored, because the average distance from the neighbouring components on the left is assumed to be greater than the average distance from the components on the right;
- If a group is right-aligned and \(\mathrm{avg}_r < \mathrm{avg}_l\), then the group is ignored by the analogous rule: the average distance from the neighbouring components on the right should be greater than the average distance from the components on the left;
- If the average distances are similar (\(|\mathrm{avg}_r - \mathrm{avg}_l| < \epsilon\)), then the group is ignored. This case often occurs for groups of words inside paragraphs due to a coincidental alignment of spaces, called rivers.
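
The three rejection rules above can be sketched as a single predicate. This is a minimal illustration rather than the authors' implementation: the function name, the string labels for the alignment type, and the default value of `eps` are assumptions made for the example.

```python
def keep_alignment_group(alignment: str, avg_l: float, avg_r: float,
                         eps: float = 1.0) -> bool:
    """Return False when one of the three rules says the group is ignored.

    alignment    -- "left" or "right" (hypothetical labels)
    avg_l, avg_r -- average distance to neighbouring components on each side
    eps          -- similarity tolerance; the default is a placeholder,
                    not a value taken from the text
    """
    # Similar distances on both sides: typical for rivers inside paragraphs.
    if abs(avg_r - avg_l) < eps:
        return False
    # A left-aligned group is expected to have the larger gap on its left.
    if alignment == "left" and avg_l < avg_r:
        return False
    # A right-aligned group is expected to have the larger gap on its right.
    if alignment == "right" and avg_r < avg_l:
        return False
    return True
```

Because every rule only rejects a group, the order in which the checks run does not change the result.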
- Components are possibly both capital or both small letters if the top and bottom edges of their bounding boxes lie on the ascent line and the base line, respectively;
- If one component is a small letter and the other a capital letter, then their horizontal overlapping should cover at least 33 % of the height of the taller component (in the majority of documents, small letters in the main text have at least 50 % of the height of capital letters);
- If one of the components is a letter and the other is an ascent or punctuation mark, then they cover significantly different areas and are placed on the appropriate alignment lines.
- Neither component is marked as blocking:
  $$\begin{aligned} \mathrm{cc}_i \notin (\mathrm{OBS}_L \cup \mathrm{OBS}_R) \wedge \mathrm{cc}_j \notin (\mathrm{OBS}_L \cup \mathrm{OBS}_R); \end{aligned} \tag{5}$$
- The component on the left side (\(\mathrm{cc}_i\)) is marked as left-blocking and the component on the right side (\(\mathrm{cc}_j\)) is not left-blocking:
  $$\begin{aligned} \mathrm{cc}_i \in \mathrm{OBS}_L \wedge \mathrm{cc}_j \notin \mathrm{OBS}_L; \end{aligned} \tag{6}$$
- The component on the left side (\(\mathrm{cc}_i\)) is not marked as blocking and the component on the right side (\(\mathrm{cc}_j\)) is marked as right-blocking:
  $$\begin{aligned} \mathrm{cc}_i \notin (\mathrm{OBS}_L \cup \mathrm{OBS}_R) \wedge \mathrm{cc}_j \in \mathrm{OBS}_R. \end{aligned} \tag{7}$$
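
Conditions (5) to (7) translate directly into set-membership tests. The sketch below evaluates each condition separately. The function name and the representation of the blocking sets as Python sets are assumptions of the example. How the three results are combined is not stated in the list itself, so treating them as alternatives in the usage note is also an assumption.

```python
def blocking_conditions(cc_i, cc_j, obs_l: set, obs_r: set) -> tuple:
    """Evaluate conditions (5), (6) and (7) for a left/right component pair.

    cc_i, cc_j   -- identifiers of the left and right component
    obs_l, obs_r -- sets of left-blocking and right-blocking components
    """
    blocking = obs_l | obs_r
    # (5): neither component is marked as blocking.
    cond5 = cc_i not in blocking and cc_j not in blocking
    # (6): the left component is left-blocking, the right one is not.
    cond6 = cc_i in obs_l and cc_j not in obs_l
    # (7): the left component is not blocking, the right one is right-blocking.
    cond7 = cc_i not in blocking and cc_j in obs_r
    return cond5, cond6, cond7
```

If the conditions are read as alternatives, a pair would be accepted with `any(blocking_conditions(cc_i, cc_j, obs_l, obs_r))`.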
- The variance of font thickness is smaller than the threshold \(\Gamma = 0.6\);
- The dominant colour inside the bounding boxes of the text lines is similar;
- Both text lines have a similar height (equal to within \(\epsilon\)) or the shorter text line has at least 66 % of the height of the taller one.
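
A compatibility test for two text lines along these three criteria might look as follows. Several details are assumptions of this sketch and are not given in the list: the variance is taken as the population variance of the two thickness values, colours are compared by Euclidean distance between CIELAB triples, the colour threshold borrows the value \(\mathrm{jnd} = 2.3\) that the text quotes for the paragraph-level rule, and the default `eps` is a placeholder.

```python
import math
import statistics

GAMMA = 0.6          # font-thickness variance threshold (from the text)
JND = 2.3            # colour threshold, borrowed from the paragraph-level rule
HEIGHT_RATIO = 0.66  # minimum shorter/taller height ratio (from the text)


def lines_compatible(thick_a, thick_b, colour_a, colour_b,
                     height_a, height_b, eps=1.0):
    """Check thickness, dominant colour and height of two text lines."""
    # Assumption: population variance over the two thickness values.
    thickness_ok = statistics.pvariance([thick_a, thick_b]) < GAMMA
    # Assumption: colours are (L, a, b) triples compared by Euclidean distance.
    colour_ok = math.dist(colour_a, colour_b) < JND
    # Heights are either equal within eps or satisfy the 66 % ratio.
    shorter, taller = sorted((height_a, height_b))
    height_ok = (taller - shorter) <= eps or shorter / taller >= HEIGHT_RATIO
    return thickness_ok and colour_ok and height_ok
```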
- Both text lines have similar graphic features: font thickness and colours with thresholds \(\Gamma = 0.6\) and \(\mathrm{jnd} = 2.3\);
- All text lines are aligned according to the paragraph type: left-, right- or centre-aligned;
- Both text lines are no further from each other than \(\phi\), the maximum value from the set of \(y\) distance values blocking object between them.
- dominant colour,
- font thickness,
- location,
- location in relation to other objects,
- text recognized by OCR tools (Tesseract),
- relation with blocking objects.
- the object is located in the neighbourhood of a recognized image or table region;
- the text begins with one of the following keywords: “image”, “img.”, “figure”, “fig.”, “photo”, “ph.”, “table”, “tab.”, “diagram”. National keywords are also included.
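
The keyword part of the caption rule is a simple prefix test on the OCR output. The sketch below covers only that part. The neighbourhood test is omitted, the national keywords are not listed in the text and are therefore left out, and case-insensitive matching is an assumption of the example.

```python
# Keywords quoted in the rule above; national variants would be appended here.
CAPTION_KEYWORDS = (
    "image", "img.", "figure", "fig.", "photo", "ph.",
    "table", "tab.", "diagram",
)


def starts_with_caption_keyword(text: str) -> bool:
    """Return True if the recognized text begins with a caption keyword."""
    # Assumption: leading whitespace and letter case are ignored.
    return text.strip().lower().startswith(CAPTION_KEYWORDS)
```

For example, `starts_with_caption_keyword("Fig. 3 Results")` is true, while a sentence such as "The figure shows" is rejected because the keyword is not at the start.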
- the object is located just over the paragraph or a single text line;
- there are no blocking objects between the object and the associated paragraph;
- horizontal projections of the object and the paragraph are overlapping;
- the object has a font thickness greater than the average thickness of the paragraph or is written with capital letters.
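
The four header conditions can be sketched over axis-aligned bounding boxes. The box layout `(x0, y0, x1, y1)` with `y` growing downwards, the `max_gap` limit used for "just over", and the requirement that all four conditions hold together are assumptions of this example.

```python
def projections_overlap(box_a, box_b) -> bool:
    """True if the x-intervals of two (x0, y0, x1, y1) boxes overlap."""
    return min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]) > 0


def is_header_candidate(obj_box, para_box, obj_thickness, para_thickness,
                        obj_text, blocking_between, max_gap=20.0) -> bool:
    """Apply the four header conditions to an object above a paragraph."""
    # "Just over": the object ends above the paragraph, within max_gap
    # (max_gap is a hypothetical threshold, not a value from the text).
    gap = para_box[1] - obj_box[3]
    just_above = 0 <= gap <= max_gap
    # Thicker font than the paragraph average, or written in capitals.
    emphasised = obj_thickness > para_thickness or obj_text.isupper()
    return (just_above
            and not blocking_between
            and projections_overlap(obj_box, para_box)
            and emphasised)
```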
- apart from images, lines and other objects within the same text line (alignment condition), the object is located at the top of the document;
- the recognized text starts with one of the keywords “p.”, “page”, “no.” or with a digit.
- apart from images, lines and other objects within the same text line (alignment condition), the object is located at the bottom of the document;
- the recognized text starts with one of the keywords “p.”, “page”, “no.” or with a digit.
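
The page header and page footer rules share the same text test and differ only in position. A minimal sketch, assuming that both conditions must hold and that "top" and "bottom" mean a band of the page height: the `band` value, the function names, and the case-insensitive matching are hypothetical, and the alignment condition is not modelled.

```python
import re

# Text starts with "p.", "page", "no." or a digit (case-insensitive by assumption).
_PAGE_MARKER = re.compile(r"\s*(?:p\.|page|no\.|\d)", re.IGNORECASE)


def starts_like_page_marker(text: str) -> bool:
    return _PAGE_MARKER.match(text) is not None


def page_header_or_footer(text, y_top, y_bottom, page_height, band=0.1):
    """Return "page header", "page footer" or None for a text object.

    y_top, y_bottom -- vertical extent of the object, y growing downwards
    band            -- hypothetical fraction of the page treated as top/bottom
    """
    if not starts_like_page_marker(text):
        return None
    if y_bottom <= band * page_height:
        return "page header"
    if y_top >= (1 - band) * page_height:
        return "page footer"
    return None
```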
- it is initially recognized as an image;
- there is a caption in the neighbourhood that implies that the image is a table.
4.2.3 Recognition of tables with grid lines
- If fewer than 85 % of the vertices of the largest graph have orthogonal lines then the image is not a table;
- If the resulting grid cannot be used to reconstruct at least one cell then the image is not a table;
- If less than 50 % of the cells do not have content within them then the image is not a table, otherwise the image is a table.
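
The decision cascade above can be sketched as three successive checks. The inputs are assumed to be computed beforehand by the grid analysis, and all names are hypothetical. The wording of the last rule is ambiguous, so this sketch reads it as rejecting grids in which fewer than 50 % of the cells contain content, which is an interpretation and not a statement of the original method.

```python
def is_gridded_table(orthogonal_vertex_fraction: float,
                     cell_count: int,
                     filled_cell_count: int) -> bool:
    """Apply the three table rules to precomputed grid statistics.

    orthogonal_vertex_fraction -- share of vertices of the largest graph
                                  that have orthogonal lines (0.0 to 1.0)
    cell_count                 -- number of cells reconstructed from the grid
    filled_cell_count          -- number of those cells that contain content
    """
    # Rule 1: at least 85 % of the vertices must have orthogonal lines.
    if orthogonal_vertex_fraction < 0.85:
        return False
    # Rule 2: the grid must yield at least one cell.
    if cell_count < 1:
        return False
    # Rule 3 (assumed reading): at least half of the cells contain content.
    if filled_cell_count / cell_count < 0.5:
        return False
    return True
```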
5 Experimental study
5.1 Evaluation methodology
5.2 Document image corpus
Class | Number of elements |
---|---|
Paragraph | 5,839 |
Header | 1,377 |
Page footer | 1,159 |
Page header | 1,262 |
Caption | 1,168 |
Graphic element | 728 |
Table | 415 |
Title | 232 |
Author | 218 |
5.3 Evaluation of text structure recognition
Method | Prec (%) | Rec (%) | F-Score (%) |
---|---|---|---|
Docstrum | 92.94 | 93.14 | 93.03 |
ARLSA | 93.61 | 93.23 | 93.41 |
RXYC | 74.04 | 93.45 | 82.62 |
DSR | 92.60 | 93.13 | 92.86 |
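
The F-Score values in this table are consistent with the balanced F-measure, the harmonic mean of precision and recall. That reading is inferred from the numbers and is not stated in the table, so the formula below should be taken as an assumption. Small differences in the last digit are to be expected, because the published precision and recall are themselves rounded.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, in the same units as the inputs."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Recompute the F-Score column from the rounded table values.
for method, prec, rec in [("Docstrum", 92.94, 93.14), ("ARLSA", 93.61, 93.23),
                          ("RXYC", 74.04, 93.45), ("DSR", 92.60, 93.13)]:
    print(f"{method}: {f_score(prec, rec):.2f}")
```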
5.4 Evaluation of paragraph recognition
Document type | Prec (%) | Rec (%) | F-Score (%) |
---|---|---|---|
One-column documents | 87.14 | 86.77 | 86.95 |
Multi-column documents | 85.87 | 84.34 | 85.09 |
Documents with many fonts | 84.12 | 88.09 | 86.05 |
5.5 Quality of the whole document structure recognition method for all recognized document structures
Class | Prec (%) | Rec (%) | F-Score (%) |
---|---|---|---|
Paragraph | 87.40 | 85.12 | 86.24 |
Header | 88.28 | 80.15 | 84.01 |
Page footer | 84.18 | 79.16 | 81.59 |
Page header | 88.16 | 86.41 | 87.27 |
Table | 96.31 | 65.23 | 77.78 |
Caption | 88.04 | 88.12 | 88.07 |
Graphic element | 92.01 | 80.22 | 85.71 |
Title | 69.15 | 44.98 | 54.50 |