1 Introduction
- a rich set of the document's metadata,
- a list of bibliographic references along with their metadata,
- structured full text with sections and subsections (currently in an experimental phase).

- detailed descriptions of all the extraction algorithm's components,
- the details of feature selection for the zone classifiers,
- new evaluation results for algorithms trained on the GROTOAP2 dataset [4],
- the evaluation of the bibliography extraction workflow,
- a comparison with other similar systems.
2 State of the art
pstotext, while basic document metadata is extracted by a set of rules and features computed for extracted text chunks. Another example of a rule-based system is PDFX, described by Constantin et al. [6]. PDFX converts scholarly articles in PDF format to an XML representation by annotating fragments of the input documents; it extracts basic metadata, structured full text and unparsed reference strings. Pdf-extract [7] is an open-source tool for identifying and extracting semantically significant regions of scholarly articles in PDF format. It uses a combination of visual cues and content traits to perform structural analysis: it determines columns, headers, footers and sections, detects the references section and finally extracts individual references.
pdftohtml, a third-party open-source tool. The system based on the TeamBeam algorithm proposed by Kern et al. [13] is able to extract a basic set of metadata from PDF documents using an enhanced Maximum Entropy classifier. Lopez [14] proposes the GROBID system for analysing scientific texts in PDF format. GROBID uses CRF to extract the document's metadata, full text and a list of parsed bibliographic references. ParsCit, described by Luong et al. [15], also uses CRF for extracting the logical structure of scientific articles, including the document's metadata, structured full text and parsed bibliography. ParsCit analyses documents in text format, and therefore does not use the geometric hints present in PDF files.

| | CERMINE | PDFX | GROBID | ParsCit | Pdf-extract |
|---|---|---|---|---|---|
| Title | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| Author | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\times\) |
| Affiliation | \(\checkmark\) | \(\times\) | \(\checkmark\) | \(\checkmark\) | \(\times\) |
| Affiliation's metadata | \(\checkmark\) | \(\times\) | \(\checkmark\) | \(\times\) | \(\times\) |
| Author–affiliation | \(\checkmark\) | \(\times\) | \(\checkmark\) | \(\times\) | \(\times\) |
| Email address | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\times\) |
| Author–email | \(\checkmark\) | \(\times\) | \(\checkmark\) | \(\times\) | \(\times\) |
| Abstract | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\times\) |
| Keywords | \(\checkmark\) | \(\times\) | \(\checkmark\) | \(\checkmark\) | \(\times\) |
| Journal | \(\checkmark\) | \(\times\) | \(\times\) | \(\times\) | \(\times\) |
| Volume | \(\checkmark\) | \(\times\) | \(\times\) | \(\times\) | \(\times\) |
| Issue | \(\checkmark\) | \(\times\) | \(\times\) | \(\times\) | \(\times\) |
| Pages range | \(\checkmark\) | \(\times\) | \(\times\) | \(\times\) | \(\times\) |
| Year | \(\checkmark\) | \(\times\) | \(\checkmark\) | \(\times\) | \(\times\) |
| DOI | \(\checkmark\) | \(\times\) | \(\checkmark\) | \(\times\) | \(\times\) |
| Reference | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) | \(\checkmark\) |
| Reference's metadata | \(\checkmark\) | \(\times\) | \(\checkmark\) | \(\checkmark\) | \(\times\) |
- CERMINE is able to extract bibliographic information related to the document, such as the journal name, volume, issue or pages range.
- The algorithms use not only the text content of the document, but also geometric features related to the way the text is displayed in the source PDF file.
- Our solution is based mostly on machine learning, which increases its ability to adapt to different article layouts.
- The modular architecture of the system keeps the implementation flexible.
- For most metadata types, the solution is very effective.
- The source code is open and the web service is available online [2].
3 System architecture
| Path | Step | Goal | Implementation |
|---|---|---|---|
| A. Basic structure extraction | A1. Character extraction | Extracting individual characters along with their page coordinates and dimensions from the input PDF file | iText library |
| | A2. Page segmentation | Constructing the document's geometric hierarchical structure containing (from the top level) pages, zones, lines, words and characters, along with their page coordinates and dimensions | Enhanced Docstrum |
| | A3. Reading order resolving | Determining the reading order for all structure elements | Bottom-up heuristic-based |
| | A4. Initial zone classification | Classifying the document's zones into four main categories: metadata, body, references and other | SVM |
| B. Metadata extraction | B1. Metadata zone classification | Classifying the document's zones into specific metadata classes | SVM |
| | B2. Metadata extraction | Extracting atomic metadata information from labelled zones | Simple rule-based |
| C. Bibliography extraction | C1. Reference strings extraction | Dividing the content of references zones into individual reference strings | K-means clustering |
| | C2. Reference parsing | Extracting metadata information from reference strings | CRF |
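The table above can be read as a processing chain in which each step consumes the previous step's output. The sketch below mirrors that chain in Python with heavily simplified stand-ins: all names and data structures are hypothetical, and trivial keyword rules replace the SVM classifiers (CERMINE itself is implemented in Java).

```python
from dataclasses import dataclass, field

@dataclass
class Zone:
    text: str
    label: str = "other"

@dataclass
class Document:
    zones: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

def basic_structure(raw_zones):
    # Path A (A1-A3), heavily stubbed: a real implementation extracts
    # characters from the PDF and builds the geometric hierarchy.
    return Document(zones=[Zone(t) for t in raw_zones])

def classify_zones(doc):
    # A4/B1 stand-in: keyword rules in place of the SVM classifiers.
    for z in doc.zones:
        text = z.text.lower()
        if text.startswith("abstract"):
            z.label = "abstract"
        elif text.startswith("references"):
            z.label = "references"
    return doc

def extract_metadata(doc):
    # B2: simple rule-based extraction from labelled zones.
    doc.metadata["abstract"] = " ".join(
        z.text for z in doc.zones if z.label == "abstract")
    return doc

doc = extract_metadata(classify_zones(basic_structure([
    "Abstract We present a metadata extraction system.",
    "1 Introduction",
    "References [1] J. Smith ...",
])))
```

The point of the chain is that path B never touches the PDF directly; it only sees the labelled zones produced by path A, which is what makes the paths independently replaceable.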
3.1 Models and formats
4 Extraction workflow implementation
4.1 Layout analysis
4.1.1 Character extraction
4.1.2 Page segmentation
- the distance between connected components, used for grouping components into lines, has been split into horizontal and vertical distances (based on the estimated text orientation angle),
- the fixed maximum distance between lines belonging to the same zone has been replaced with a value scaled relative to the line height,
- merging of lines belonging to the same zone has been added,
- the rectangular smoothing window has been replaced with a Gaussian smoothing window,
- merging of highly overlapping zones has been added,
- word determination based on within-line spacing has been added.
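Two of these modifications can be illustrated directly. The sketch below (hypothetical function names and a hypothetical scale factor, not CERMINE's actual code) splits the component-to-component distance along the estimated orientation angle, and tests zone membership with a cutoff scaled by line height instead of a fixed value:

```python
import math

def component_distances(c1, c2, angle):
    """Split the centre-to-centre distance of two connected components
    into parts parallel (horizontal) and perpendicular (vertical) to the
    estimated text orientation angle (in radians)."""
    dx, dy = c2[0] - c1[0], c2[1] - c1[1]
    horizontal = abs(dx * math.cos(angle) + dy * math.sin(angle))
    vertical = abs(-dx * math.sin(angle) + dy * math.cos(angle))
    return horizontal, vertical

def lines_in_same_zone(line_gap, line_height, scale=1.5):
    """Zone membership test with the cutoff scaled relative to the line
    height; the scale factor here is an illustrative assumption."""
    return line_gap <= scale * line_height
```

Scaling the cutoff by line height makes the grouping robust to font size: a 12 pt gap separates zones in fine print but not in a large-type title.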
4.1.3 Reading order resolving
4.2 Content classification
4.2.1 Feature selection
- geometric: based on geometric attributes; examples include the zone's height and width, height-to-width ratio, horizontal and vertical position, distance to the nearest zone, empty space below and above the zone, mean line height, and whether the zone is placed at the top, bottom, left or right side of the page;
- lexical: based on keywords characteristic of different parts of the narration, such as affiliations, acknowledgments, abstract, keywords, dates, references, or article type; these features typically check whether the text of the zone contains any of the characteristic keywords;
- sequential: based on sequence-related information; examples include the label of the previous zone (according to the reading order), the presence of the same text blocks on the surrounding pages, and whether the zone is placed on the first/last page of the document;
- formatting: related to text formatting in the zone; examples include the font size in the current and adjacent zones, the amount of blank space inside the zone, and the mean indentation of text lines in the zone;
- heuristic: based on heuristics of various kinds, such as the count and percentage of lines, words, uppercase words, characters, letters, upper-/lowercase letters, digits, whitespace characters, punctuation marks, brackets, commas and dots; also whether each line starts with an enumeration-like token, and whether the zone contains only digits.
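A few features of the kinds listed above, written as simple functions. This is an illustrative sketch: the names are hypothetical and CERMINE's actual feature set is much larger.

```python
import re

KEYWORDS = ("abstract", "keywords", "references", "acknowledgments")

def geometric_features(zone):
    """zone: dict with a page-coordinate bounding box (x0, y0, x1, y1)."""
    width = zone["x1"] - zone["x0"]
    height = zone["y1"] - zone["y0"]
    return {"width": width, "height": height,
            "height_to_width": height / width}

def lexical_features(text):
    # One boolean feature per characteristic keyword.
    lower = text.lower()
    return {f"contains_{k}": k in lower for k in KEYWORDS}

def heuristic_features(text):
    words = text.split()
    return {
        "digit_ratio": sum(c.isdigit() for c in text) / max(len(text), 1),
        "uppercase_word_ratio":
            sum(w.isupper() for w in words) / max(len(words), 1),
        "starts_with_enumeration":
            bool(re.match(r"\s*(\[\d+\]|\d+\.)", text)),
    }
```

All feature values are numeric or boolean, so the per-category dictionaries can simply be concatenated into one vector before being fed to the SVM.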
4.2.2 SVM parameters adjustment
Initial classification

| Kernel | Linear | 4th poly. | RBF | Sigmoid |
|---|---|---|---|---|
| \(\log_2(C)\), \(\log_2(\gamma)\) | 7, 1 | 9, \(-5\) | 5, \(-3\) | 15, \(-13\) |
| Mean F1 (%) | 90.7 | 93.5 | 93.9 | 90.1 |

Metadata classification

| Kernel | Linear | 4th poly. | RBF | Sigmoid |
|---|---|---|---|---|
| \(\log_2(C)\), \(\log_2(\gamma)\) | 4, \(-9\) | 7, \(-4\) | 9, \(-3\) | 11, \(-7\) |
| Mean F1 (%) | 85.0 | 87.5 | 88.6 | 81.0 |
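The winning parameters in the tables above come from an exhaustive grid search over powers of two for \(C\) and \(\gamma\). A minimal sketch of such a search follows; `toy_objective` is an artificial stand-in whose peak is placed at the RBF optimum of the first table, whereas a real run would cross-validate an SVM for every parameter pair.

```python
import math
from itertools import product

def grid_search(evaluate, log2_c_range, log2_gamma_range):
    """Exhaustive search over powers of two for C and gamma; returns the
    best (mean F1, log2(C), log2(gamma)) triple."""
    best = None
    for lc, lg in product(log2_c_range, log2_gamma_range):
        score = evaluate(C=2.0 ** lc, gamma=2.0 ** lg)
        if best is None or score > best[0]:
            best = (score, lc, lg)
    return best

def toy_objective(C, gamma):
    # Artificial stand-in for cross-validated training and evaluation,
    # peaking at log2(C) = 5, log2(gamma) = -3.
    return 93.9 - (math.log2(C) - 5) ** 2 - (math.log2(gamma) + 3) ** 2

best = grid_search(toy_objective, range(-5, 16, 2), range(-15, 4, 2))
```

The grid is coarse (step 2 in log space) because each evaluation is expensive; a finer search around the coarse optimum can then refine the pair.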
4.3 Metadata extraction
- zones labelled as abstract are concatenated,
- since the article type is often specified just above the title, it is removed from the title zone if needed (based on a dictionary of types),
- author, affiliation and keyword lists are split using a list of separators,
- affiliations are associated with authors based on indexes and distances,
- email addresses are extracted from correspondence and affiliation zones using regular expressions,
- email addresses are associated with authors based on author names,
- pages ranges placed directly in bib_info zones are parsed using regular expressions,
- if no pages range is given explicitly in the document, we also try to retrieve it from the page numbers on each page,
- dates are parsed using regular expressions,
- journal, volume, issue and DOI are extracted from bib_info zones using regular expressions.
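The regular-expression rules for pages ranges and dates might look like the following simplified sketch. The patterns are illustrative assumptions, not CERMINE's actual rules, which handle many more formats:

```python
import re

# A pages range such as "123-130" or "123–130" (hyphen, en or em dash).
PAGES_RE = re.compile(r"(\d+)\s*[-\u2013\u2014]\s*(\d+)")

def parse_pages_range(text):
    m = PAGES_RE.search(text)
    return (int(m.group(1)), int(m.group(2))) if m else None

# A date such as "5 June 2014".
DATE_RE = re.compile(
    r"(\d{1,2})\s+(January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+(\d{4})")

def parse_date(text):
    m = DATE_RE.search(text)
    return (int(m.group(1)), m.group(2), int(m.group(3))) if m else None
```

Returning `None` on failure lets the workflow fall back to other sources, e.g. recovering the pages range from per-page numbers when no explicit range matches.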
4.4 Bibliography extraction
4.4.1 Extracting reference strings
4.4.2 Reference strings parsing
- Some of them are based on the presence of a particular character class, e.g. digits or lowercase/uppercase letters.
- Others check whether the token is a particular character (e.g. a dot, a square bracket, a comma or a dash) or a particular word.
- Finally, we use features checking whether the token appears in a dictionary built from the dataset, e.g. a dictionary of cities or of words commonly appearing in journal titles.
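One feature of each kind listed above, sketched as a token-to-feature mapping. The names and the toy dictionary are hypothetical; a CRF-based parser would compute such a map for every token of the reference string.

```python
CITY_DICT = {"berlin", "boston", "london", "warsaw"}  # toy stand-in

def token_features(token):
    """Boolean features of a single reference-string token."""
    return {
        # character-class features
        "is_digit": token.isdigit(),
        "is_lower": token.islower(),
        "is_upper": token.isupper(),
        # particular-character features
        "is_dot": token == ".",
        "is_comma": token == ",",
        "is_dash": token == "-",
        "is_square_bracket": token in ("[", "]"),
        # dictionary and shape features
        "is_year_like": token.isdigit() and len(token) == 4,
        "in_city_dict": token.lower() in CITY_DICT,
    }
```

The CRF then learns which label sequences (author, title, journal, year, …) are consistent with these per-token observations and with the labels of neighbouring tokens.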
5 Evaluation
5.1 Datasets preparation
| Name | Source | Format | Content | Purpose |
|---|---|---|---|---|
| Segmentation test set | GROTOAP | TrueViz | 113 documents | The evaluation of page segmentation (Sect. 5.2) |
| Zone validation set | GROTOAP2 | TrueViz | 100 documents containing 14,000 labelled zones, 2743 of which are metadata zones | The adjustment of the SVM parameters (Sect. 4.2.2) |
| Zone test set | GROTOAP2 | TrueViz | 2551 documents containing 355,779 zones, 68,557 of which are metadata zones | Zone classifiers evaluation (Sect. 5.3) and final classifiers training |
| Citation test set | CiteSeer, Cora-ref and PMC | NLM JATS | 4000 parsed citations (2000 from CiteSeer and Cora-ref, 2000 from 1991 different PMC documents) | The evaluation of the references parser (Sect. 5.4) |
| Metadata test set | PubMed Central | PDF + NLM JATS | 47,983 PDF documents with corresponding metadata records | The evaluation of the entire metadata and bibliography extraction workflow (Sect. 5.5) |
| Comparison test set | PubMed Central | PDF + NLM JATS | 1943 PDF documents with corresponding metadata records | The comparison of CERMINE's performance with the performance of other similar tools (Sect. 5.6) |
5.2 Page segmentation
5.3 Zone classification
| | Metadata | Body | References | Other | Precision (%) | Recall (%) |
|---|---|---|---|---|---|---|
| Metadata | **66,042** | 2181 | 75 | 259 | 96.6 | 96.3 |
| Body | 1551 | **232,464** | 177 | 934 | 97.9 | 98.9 |
| References | 47 | 806 | **17,489** | 67 | 98.2 | 95.0 |
| Other | 733 | 2118 | 65 | **30,771** | 96.1 | 91.3 |
| | Abstract | Affiliation | Author | Bib_info | Correspondence | Dates | Editor | Keywords | Title | Type | Copyright | Precision (%) | Recall (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Abstract | **6866** | 8 | 7 | 62 | 8 | 1 | 1 | 23 | 7 | 5 | 10 | 97.7 | 98.1 |
| Affiliation | 11 | **3532** | 16 | 31 | 62 | 5 | 8 | 3 | 1 | 6 | 6 | 95.5 | 96.0 |
| Author | 4 | 14 | **2684** | 42 | 18 | 0 | 3 | 1 | 6 | 6 | 4 | 96.9 | 96.5 |
| Bib_info | 75 | 22 | 14 | **40,982** | 25 | 119 | 1 | 41 | 16 | 115 | 100 | 98.7 | 98.9 |
| Corresp. | 9 | 107 | 15 | 32 | **1616** | 2 | 0 | 3 | 1 | 1 | 3 | 92.8 | 90.3 |
| Dates | 5 | 1 | 4 | 136 | 3 | **2835** | 0 | 1 | 0 | 2 | 13 | 94.7 | 94.5 |
| Editor | 0 | 2 | 1 | 0 | 0 | 0 | **473** | 0 | 0 | 0 | 0 | 96.9 | 99.4 |
| Keywords | 28 | 8 | 5 | 86 | 1 | 6 | 1 | **896** | 5 | 7 | 1 | 91.5 | 85.8 |
| Title | 9 | 0 | 13 | 26 | 0 | 0 | 0 | 3 | **2574** | 6 | 2 | 98.3 | 97.8 |
| Type | 4 | 0 | 4 | 88 | 0 | 2 | 1 | 6 | 6 | **1497** | 2 | 91.0 | 93.0 |
| Copyright | 14 | 5 | 7 | 45 | 8 | 23 | 0 | 2 | 3 | 0 | **2927** | 95.4 | 96.5 |
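The precision and recall columns in the two tables above follow directly from the confusion matrices (rows hold true labels, columns hold predicted labels). A short check on the metadata row of the first table:

```python
LABELS = ["metadata", "body", "references", "other"]
CONFUSION = [  # rows: true labels, columns: predicted labels
    [66042,   2181,    75,   259],
    [ 1551, 232464,   177,   934],
    [   47,    806, 17489,    67],
    [  733,   2118,    65, 30771],
]

def precision_recall(matrix, i):
    """Precision and recall (in %) of class i from the confusion matrix."""
    tp = matrix[i][i]
    predicted = sum(row[i] for row in matrix)  # column sum
    actual = sum(matrix[i])                    # row sum
    return 100.0 * tp / predicted, 100.0 * tp / actual

precision, recall = precision_recall(CONFUSION, LABELS.index("metadata"))
```

For the metadata class this reproduces the 96.6 % precision and 96.3 % recall reported in the table.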
5.4 Reference parsing
5.5 Metadata extraction evaluation
5.6 Comparison evaluation
pdftotext tool. Moreover, the output of ParsCit can contain multiple titles or abstracts; thus, for this system, all metadata classes were treated as list types.

| Metadata class | Measure (%) | CERMINE | PDFX | GROBID | ParsCit | Pdf-extract |
|---|---|---|---|---|---|---|
| Title | Precision | **95.5** | 85.7 | 82.5 | 34.1 | 49.4 |
| | Recall | **93.4** | 84.7 | 77.4 | 39.6 | 49.4 |
| | F1 | **94.5** | 85.2 | 79.8 | 36.6 | 49.4 |
| Authors | Precision | **90.2** | 71.2 | 85.9 | 57.9 | – |
| | Recall | 89.0 | 71.5 | **90.5** | 48.6 | – |
| | F1 | **89.6** | 71.3 | 88.1 | 52.8 | – |
| Affiliations | Precision | 88.2 | – | **90.8** | 72.2 | – |
| | Recall | **83.1** | – | 51.8 | 44.3 | – |
| | F1 | **85.6** | – | 66.0 | 54.9 | – |
| Email addresses | Precision | 51.7 | **53.0** | 26.9 | 28.8 | – |
| | Recall | 42.6 | **73.6** | 7.8 | 36.2 | – |
| | F1 | 46.7 | **61.6** | 12.1 | 32.1 | – |
| Abstract | Precision | **82.8** | 71.1 | 70.4 | 47.7 | – |
| | Recall | **79.9** | 66.7 | 67.7 | 61.3 | – |
| | F1 | **81.3** | 68.8 | 69.0 | 53.7 | – |
| Keywords | Precision | 89.9 | – | **94.2** | 15.6 | – |
| | Recall | **63.5** | – | 44.2 | 3.0 | – |
| | F1 | **74.4** | – | 60.2 | 5.1 | – |
| Journal | Precision | **80.3** | – | – | – | – |
| | Recall | **73.2** | – | – | – | – |
| | F1 | **76.6** | – | – | – | – |
| Volume | Precision | **93.3** | – | – | – | – |
| | Recall | **83.0** | – | – | – | – |
| | F1 | **87.8** | – | – | – | – |
| Issue | Precision | **53.7** | – | – | – | – |
| | Recall | **28.4** | – | – | – | – |
| | F1 | **37.1** | – | – | – | – |
| Pages | Precision | **87.0** | – | – | – | – |
| | Recall | **80.4** | – | – | – | – |
| | F1 | **83.5** | – | – | – | – |
| Year | Precision | **96.3** | – | 95.7 | – | – |
| | Recall | **95.0** | – | 40.4 | – | – |
| | F1 | **95.6** | – | 56.8 | – | – |
| DOI | Precision | 98.2 | – | **99.1** | – | – |
| | Recall | **75.0** | – | 65.4 | – | – |
| | F1 | **85.1** | – | 78.8 | – | – |
| References | Precision | **96.1** | 91.3 | 79.7 | 81.2 | 80.4 |
| | Recall | **89.8** | 88.9 | 66.7 | 71.8 | 57.5 |
| | F1 | **92.8** | 90.1 | 72.6 | 76.2 | 67.0 |

Best scores are given in bold; a dash means the system does not extract the given metadata class.
5.7 Error analysis
- When two (or more) zones with different roles in the document are placed close to each other, they are often merged by the segmenter. In this case, classification is more difficult, and by design only one label is assigned to such a hybrid zone. A potential solution would be to introduce additional labels for pairs of classes that often appear close to each other, for example title_author or author_affiliation, and split the content of such zones later in the workflow.
- The segmenter introduces other errors as well, such as incorrectly attaching an upper index to the line above the current line, or merging text written in two columns. These errors can be corrected by further improving the page segmenter.
- Zone classification errors are also responsible for many extraction errors. They can be reduced by adding training instances to the training set and improving the labelling accuracy in GROTOAP2.
- Sometimes the metadata, usually keywords, volume, issue or pages, is not given explicitly in the input PDF file. Since CERMINE analyses the PDF file only, such information cannot be extracted; this is in fact not an extraction error. Unfortunately, since the ground truth NLM data in PMC usually contains such information whether or not it is present in the PDF, these situations also contribute to the overall error rates (equally for all evaluated systems).
- The title merged with other parts of the document, when the title zone is placed close to another region.
- The title not recognized, for example when it appears on the second page of the PDF file.
- The title zone split by the segmenter into several zones, with only a subset of them correctly classified.
- The authors zone not labelled; in that case, the authors are missing.
- The authors zone merged with other fragments, such as affiliations or a research group name; in such cases, additional fragments appear in the authors list.
- An affiliation zone not properly recognized by the classifier, for example when it is not visually separated from other zones, or is placed at the end of the document; affiliations are missing in that case.
- The entire abstract or a part of it recognized as body by the classifier; as a result, the abstract or a part of it is missing.
- The first body paragraph recognized incorrectly as abstract; as a result, the extracted abstract contains a fragment of the document's proper text.
- Bibliographic information missing from the PDF file or not recognized by the classifiers; as a result, the journal name, volume, issue and/or pages range are not extracted.
- Keywords missing, because the zone was not recognized or the keywords were not included in the PDF file.
- A few of the references zones classified as body; in such cases, some or all of the references are missing.
5.8 Processing time
6 Conclusions and future work
- extending the workflow so that the system can also process documents in the form of scanned pages,
- expanding the workflow architecture by adding a processing path for extracting structured full text containing sections, subsections, headers and paragraphs,
- adding an affiliation parsing step, the goal of which is to extract affiliation metadata: institution name, address and country,
- making the citation dataset used for parser evaluation publicly available.