1 Introduction
Variable | Placebo \(N = 80\) | Mannitol \(N=76\) | P value |
---|---|---|---|
Female | 42 (52.5%) | 34 (44.7%) | 0.33 |
Fever | 79 (98.8%) | 76 (100%) | 0.33 |
Convulsions | 79 (98.8%) | 75 (98.7%) | 0.97 |
Duration of coma | 7.0 (IQR 3.5–12.0) | 6.0 (5.0–12.0) | 0.79 |
Blantyre coma score 1/5 | 13 (16.2%) | 10 (13.2%) | 0.59 |
- Variety of structural layouts and visual relationships: The structure of a table is determined by the arrangement and relationships of its cells. A cell can span several cells vertically or horizontally, and combinations of spanning cells create a vast number of structural variations. Emphasis features of text and table lines can also affect how a table's structure is understood; for example, horizontal lines or bold text may mark multiple header rows. The structure of the table visually defines the relationships between cells. These relationships are visual and multi-dimensional, whereas relationships between words in running text are linear. The visual nature of these relationships makes it computationally difficult to find related cells and extract information from them.
- Representation for visualisation: Most representation formats for tables, such as the markup languages in which tables are described, are designed for visualisation rather than processing. This makes automatic processing of tables challenging.
- Variety of value presentation patterns: Values in cells can be presented using different syntactic patterns. For example, a mean and standard deviation can be written with a ± sign (e.g. 16 ± 2), or the standard deviation can be given in brackets (e.g. 16 (2)). Extracting numerical values therefore requires knowledge of the possible presentation patterns.
- Dense content: The content of a cell can be numerical or textual. Textual content is usually dense, consisting of short, ambiguous chunks of text that make heavy use of acronyms and abbreviations; this is especially true in biomedical publications. To understand tables, the text needs to be disambiguated and the abbreviations and acronyms expanded.
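As an illustration of the value-presentation problem above, here is a minimal sketch that recognises the two mean/SD patterns mentioned, 16 ± 2 and 16 (2). The helper name `parse_mean_sd` and the exact regular expression are illustrative assumptions, not the framework's actual code:

```python
import re

# Hypothetical helper: recognise the two common mean/SD presentation
# patterns discussed above, "16 ± 2" and "16 (2)".
MEAN_SD = re.compile(
    r"^\s*(?P<mean>\d+(?:\.\d+)?)\s*"
    r"(?:±|\+/-)\s*(?P<sd>\d+(?:\.\d+)?)\s*$"   # 16 ± 2
    r"|^\s*(?P<mean2>\d+(?:\.\d+)?)\s*"
    r"\(\s*(?P<sd2>\d+(?:\.\d+)?)\s*\)\s*$"     # 16 (2)
)

def parse_mean_sd(cell: str):
    """Return (mean, sd) if the cell matches either pattern, else None."""
    m = MEAN_SD.match(cell)
    if not m:
        return None
    mean = m.group("mean") or m.group("mean2")
    sd = m.group("sd") or m.group("sd2")
    return float(mean), float(sd)
```

Both syntactic variants map to the same pair of value components, which is the point of maintaining a catalogue of presentation patterns.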
2 Background
- The abstract level encapsulates the communicative intent of the author (i.e. the relationships between the data) [45].
Descriptor name | Description | Example |
---|---|---|
Semantic identifier | Describes how the information is mapped to a certain knowledge source | age as UMLS:C0001779 |
Table’s pragmatic type | Pragmatic type of a table in which the information is likely to appear | Pragmatic types can be, for example, tables with baseline characteristics, adverse events, inclusion/exclusion criteria, etc. |
Cues | ||
Lexical cues | Set of lexical cues and patterns that determine whether the value is present in certain cells | A lexical cue for the number of patients can be “\(n=\%d\)” or “number of patients” in the stub, or a number in the data cell |
Functional cue | Description of functional regions in the table where information may appear | Number of patients may be in caption, header or data cell |
Semantic cues | Set of semantic cues, such as semantic types and higher-level concept names | A list of semantic types indicates the presence of the value (e.g. the Sign or Symptom UMLS semantic type may indicate an adverse event in a table) |
Value type/pattern (syntactic cue) | Description of the value type and its pattern with the way to extract it | Whether the value is single number, range, percentage, etc. |
Unit of measure | Description of the unit of measure and recognition cues. Definition of the default unit for the information class | Default is gram (g), but kilogram (kg) and milligram (mg) may appear |
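The descriptors summarised in the table above can be pictured as one record per information class. The following sketch is illustrative only; the class and field names are assumptions, not the authors' actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical grouping of the descriptors from the table above.
@dataclass
class InformationDescriptor:
    semantic_identifier: str                              # e.g. a UMLS mapping
    pragmatic_types: list = field(default_factory=list)   # tables where it appears
    lexical_cues: list = field(default_factory=list)      # whitelist patterns
    functional_cues: list = field(default_factory=list)   # caption/header/stub/data
    semantic_cues: list = field(default_factory=list)     # e.g. UMLS semantic types
    value_pattern: str = r"\d+"                           # syntactic cue
    default_unit: str = ""                                # default unit of measure

# Example descriptor for patient age (values are illustrative).
age = InformationDescriptor(
    semantic_identifier="UMLS:C0001779",
    pragmatic_types=["Baseline characteristics"],
    lexical_cues=["age", "mean age"],
    functional_cues=["stub"],
    value_pattern=r"\d+(\.\d+)?",
    default_unit="years",
)
```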
3 Methodology
3.1 Extraction template
- VariableName is the name of the variable that should be extracted. It can be linked to a certain ontology (e.g. the Ontology of Clinical Research (OCRe) [40] or UMLS).
- VariableSubCategory is used only for variables that have multiple subcategories with values (e.g. ethnicity, with the number of participants presented separately for White, Asian, Hispanic and Black people).
- ValueComponent names the component of the extracted variable’s value, obtained by analysing its presentation pattern. For example, it may be Value if the cell presents a single value, Range:Min if the extracted value is the minimum of a range, Range:Max for the maximum of a range, Percentage for percentage values, Mean for mean values, and SD for standard deviations. When a cell presents a range, two template rows should be extracted, one for the minimum and one for the maximum.
- Context describes the value’s context. It can be, for example, a clinical trial arm for tables presenting cumulative baseline characteristics of patients, or a patient identifier for tables presenting baseline characteristics for each patient separately.
- Value is the value extracted from the table for the given variable.
- Unit applies only to numeric variables, where it specifies the unit of measure in which the value is expressed. For example, body mass can be presented using a singular unit (gram), multiples (kilogram) or sub-multiples (milligram) [44]. Each variable should have a default unit defined (if one exists, usually the singular unit), and that unit is used when the table does not specify otherwise.
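To show how the template parameters fit together, the following illustrative sketch (the `make_rows` helper and the dict layout are assumptions, not the paper's code) turns a range cell into two template rows, one for Range:Min and one for Range:Max, as described above:

```python
import re

# Matches a numeric range such as "12–18" (en dash or hyphen).
RANGE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*[–-]\s*(\d+(?:\.\d+)?)\s*$")

def make_rows(variable, cell, context, unit):
    """Produce extraction-template rows for one cell (illustrative layout)."""
    m = RANGE.match(cell)
    if m:
        lo, hi = m.groups()
        # A range produces two rows: minimum and maximum.
        return [
            {"VariableName": variable, "ValueComponent": "Range:Min",
             "Context": context, "Value": float(lo), "Unit": unit},
            {"VariableName": variable, "ValueComponent": "Range:Max",
             "Context": context, "Value": float(hi), "Unit": unit},
        ]
    # Otherwise treat the cell as a single value.
    return [{"VariableName": variable, "ValueComponent": "Value",
             "Context": context, "Value": float(cell), "Unit": unit}]

rows = make_rows("Weight", "12–18", "Placebo arm", "kg")
```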
3.2 Information extraction task specification
3.3 Information groups
3.3.1 Numeric information groups
3.3.2 Textual variables
3.4 Information extraction methodology
3.4.1 Table detection
3.4.2 Functional processing
3.4.3 Structural processing
3.4.4 Semantic tagging
3.4.5 Pragmatic processing
3.4.6 Cell selection
- Heuristic-based approach: The heuristic-based approach starts from the previous step, which selects only cells that contain a certain lexical cue in their context. We then analyse the content of the cell and its related navigational cells, looking for lexical cues that indicate the presence of the information in the selected cell. Cues indicating that the information is present are defined in a lexical whitelist. On the other hand, some words modify the semantics of a cell even when it contains the searched cue, and in these cases the selected cell must be discarded. For example, if we are looking for BMI, a cell whose content is “BMI change” is not of interest: such cues change the meaning of the cell, so its information should not be extracted. These cues are defined in a blacklist. Using regular expressions, the method can also check whether the value presentation pattern matches the usual pattern for presenting that kind of information. The heuristics are crafted manually from the previously created information descriptions and refined using insights from the data: a number of random tables is selected as a training set, the heuristics are run on them, and they are improved iteratively until the results are satisfactory.
- Machine learning-based approach: Machine learning cell analysis classifies cells into those that contain values of the variables to be extracted and those that do not. This approach requires selecting a number of random tables and annotating the cells containing the variable. In our case, the data about each cell comprise the cell content, the content of its header, stub and super-row, the cell’s functional role and the position of the cell in the table grid. The content of the cell and its navigational areas is tokenized, stemmed using the Porter stemmer, and represented with a bag-of-words model. We modelled the problem as a classification task: if the cell contains the variable, the classifier returns the positive class, and the negative class otherwise.
3.4.7 Pattern analysis and value extraction
3.5 Defining rules for information extraction
3.5.1 Cell selection using lexical and semantic rules
3.5.2 Syntactic rules and syntactic processing
Pattern | Presentation examples | Variables |
---|---|---|
Single value | 65 | Number of patients, number of people with certain adverse event, etc. |
Floating point value | 0.05 | P value |
Aggregate statistical value | \(18 \pm 2\) | Age, FEV1, PEF, BMI |
 | 12–18 | Weight, height |
 | 12.1 (2.4) | Number of patients in cohort |
 | \(18 \pm 2\) (15–20) | |
Alternatives | 12/17 | Gender distribution, blood pressure |
Percentage | 18 (55%) | Gender distribution |
 | 55% | Percentage of people with certain effect |
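The presentation patterns in the table above lend themselves to regular expressions. The following sketch classifies a cell string into one of the patterns; the pattern names, expressions and ordering are illustrative assumptions, not the framework's actual rule set:

```python
import re

# Ordered from most specific to most general, so e.g. "18 (55%)" is not
# consumed by the plain percentage or single-value patterns.
PATTERNS = [
    ("count_percent",  re.compile(r"^\d+\s*\(\d+(\.\d+)?%\)$")),        # 18 (55%)
    ("percentage",     re.compile(r"^\d+(\.\d+)?\s*%$")),               # 55%
    ("aggregate",      re.compile(r"^\d+(\.\d+)?\s*±\s*\d+(\.\d+)?$")), # 18 ± 2
    ("range",          re.compile(r"^\d+(\.\d+)?–\d+(\.\d+)?$")),       # 12–18
    ("alternatives",   re.compile(r"^\d+/\d+$")),                       # 12/17
    ("floating_point", re.compile(r"^\d+\.\d+$")),                      # 0.05
    ("single_value",   re.compile(r"^\d+$")),                           # 65
]

def classify(cell: str) -> str:
    """Return the name of the first matching presentation pattern."""
    cell = cell.strip()
    for name, pattern in PATTERNS:
        if pattern.match(cell):
            return name
    return "unknown"
```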
4 Applications and results
4.1 Dataset
4.2 Functional and structural table analysis
Algorithm | Precision | Recall | F-Score |
---|---|---|---|
Numeric features |||
Naive Bayes | 0.569 | 0.601 | 0.553 |
Bayesian networks | 0.491 | 0.552 | 0.499 |
SVM | 0.475 | 0.559 | 0.493 |
C4.5 decision trees | 0.498 | 0.503 | 0.500 |
Random forests | 0.558 | 0.580 | 0.562 |
Caption text |||
Naive Bayes | 0.901 | 0.902 | 0.901 |
Bayesian networks | 0.907 | 0.905 | 0.906 |
SVM | 0.930 | 0.930 | 0.930 |
C4.5 decision trees | 0.926 | 0.925 | 0.926 |
Random forests | 0.889 | 0.889 | 0.888 |
Header text |||
Naive Bayes | 0.687 | 0.654 | 0.660 |
Bayesian networks | 0.682 | 0.634 | 0.642 |
SVM | 0.648 | 0.631 | 0.635 |
C4.5 decision trees | 0.659 | 0.612 | 0.620 |
Random forests | 0.646 | 0.628 | 0.618 |
Stub text |||
Naive Bayes | 0.821 | 0.796 | 0.801 |
Bayesian networks | 0.841 | 0.802 | 0.807 |
SVM | 0.808 | 0.772 | 0.776 |
C4.5 decision trees | 0.821 | 0.779 | 0.783 |
Random forests | 0.803 | 0.776 | 0.780 |
Super-row text |||
Naive Bayes | 0.568 | 0.477 | 0.461 |
Bayesian networks | 0.696 | 0.440 | 0.490 |
SVM | 0.526 | 0.448 | 0.373 |
C4.5 decision trees | 0.691 | 0.508 | 0.476 |
Random forests | 0.694 | 0.537 | 0.514 |
Data cell content |||
Naive Bayes | 0.573 | 0.556 | 0.551 |
Bayesian networks | 0.572 | 0.568 | 0.567 |
SVM | 0.604 | 0.586 | 0.587 |
C4.5 decision trees | 0.560 | 0.551 | 0.551 |
Random forests | 0.603 | 0.592 | 0.587 |
Referring sentence |||
Naive Bayes | 0.726 | 0.590 | 0.618 |
Bayesian networks | 0.698 | 0.618 | 0.625 |
SVM | 0.682 | 0.625 | 0.626 |
C4.5 decision trees | 0.630 | 0.575 | 0.573 |
Random forests | 0.675 | 0.622 | 0.617 |
Combined content features |||
Naive Bayes | 0.873 | 0.871 | 0.872 |
Bayesian networks | 0.865 | 0.864 | 0.864 |
SVM | 0.915 | 0.914 | 0.914 |
C4.5 decision trees | 0.883 | 0.880 | 0.881 |
Random forests | 0.917 | 0.915 | 0.916 |
4.3 Pragmatic table analysis
Algorithm | Precision | Recall | F-Score |
---|---|---|---|
Naive Bayes | 0.943 | 0.943 | 0.943 |
Bayesian networks | 0.938 | 0.939 | 0.938 |
C4.5 decision trees | 0.944 | 0.945 | 0.944 |
Random tree | 0.905 | 0.903 | 0.904 |
Random forests | 0.948 | 0.948 | 0.948 |
SVM | 0.967 | 0.966 | 0.966 |
Table type | Number |
---|---|
Baseline characteristics | 2803 (21.92%) |
Adverse events | 633 (4.95%) |
Inclusion/exclusion | 82 (0.47%) |
Other | 9291 (72.66%) |
Parameter | Bravelle\(^{\circledR }\) (\(n = 120\)) | Follistim\(^{\circledR }\) (\(n = 118\)) | P value |
---|---|---|---|
Age (years) | 32.0 ± 3.9 | 32.5 ± 3.7 | 0.330 |
Weight (lbs.) | 137.1 ± 21.4 | 145.8 ± 27.8 | 0.008 |
Body mass index (kg/m\(^2\)) | 23.3 ± 3.5 | 24.5 ± 4.0 | 0.021 |
Serum FSH (mIU/mL) | 6.3 ± 2.0 | 6.8 ± 2.1 | 0.077 |
Serum LH (mIU/mL) | 5.0 ± 2.4 | 4.6 ± 1.9 | 0.145 |
Serum E2 (pg/mL) | 43.1 ± 21.4 | 40.9 ± 20.9 | 0.420 |
4.4 Rule-based information extraction
 | Precision | Recall | F-Score |
---|---|---|---|
Training | 0.900 | 0.839 | 0.868 |
Testing | 0.894 | 0.791 | 0.839 |
- The document contained no tables.
- There was no baseline characteristics table, and the number of patients was not presented in any table (it may have been present in the text).
- There was no baseline characteristics table; however, the number of patients was presented in some other table (e.g. results, referral question).
- The table reported results per person. In this case, the number of patients can be calculated as the number of data rows in the table; our method, however, was looking for a cumulative number of patients.
- The baseline characteristics table did not report the number of patients; it was mentioned in the text or could be calculated from the gender distribution.
- The pragmatic classifier made an error, or the rule did not contain the right cue (e.g. infants, smokers).
 | Precision | Recall | F-Score |
---|---|---|---|
Training | 0.806 | 0.895 | 0.848 |
Testing | 0.788 | 0.872 | 0.828 |
 | Precision | Recall | F-Score |
---|---|---|---|
Training | 0.945 | 0.906 | 0.925 |
Testing | 0.883 | 0.962 | 0.921 |
4.5 Machine learning-based information extraction
Algorithm | Under-sampled (147 instances of each class) | Whole unbalanced dataset | Cost-sensitive classification | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Precision | Recall | F-Score | Accuracy | Precision | Recall | F-Score | Accuracy | Precision | Recall | F-Score | Accuracy | |
Naive Bayes | 0.054 | 0.952 | 0.103 | 0.821 | 0.173 | 0.701 | 0.277 | 0.960 | 0.266 | 0.483 | 0.343 | 0.980 |
Bayesian Nets | 0.101 | 0.912 | 0.182 | 0.911 | 0.292 | 0.517 | 0.373 | 0.981 | 0.512 | 0.422 | 0.463 | 0.989 |
C4.5 dec. trees | 0.070 | 0.905 | 0.130 | 0.869 | 0.893 | 0.510 | 0.649 | 0.994 | 0.714 | 0.782 | 0.747 | 0.994 |
Random tree | 0.066 | 0.585 | 0.119 | 0.906 | 0.580 | 0.544 | 0.561 | 0.991 | 0.573 | 0.585 | 0.579 | 0.991 |
Random forests | 0.214 | 0.932 | 0.348 | 0.962 | 0.935 | 0.490 | 0.643 | 0.994 | 0.797 | 0.667 | 0.726 | 0.995 |
SVM | 0.085 | 0.918 | 0.155 | 0.892 | 0.850 | 0.463 | 0.599 | 0.993 | 0.754 | 0.626 | 0.684 | 0.994 |
Algorithm | Under-sampled (272 instances of each class) | Whole unbalanced dataset | Cost-sensitive classification | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Precision | Recall | F-Score | Accuracy | Precision | Recall | F-Score | Accuracy | Precision | Recall | F-Score | Accuracy | |
Naive Bayes | 0.089 | 0.930 | 0.162 | 0.879 | 0.205 | 0.819 | 0.327 | 0.957 | 0.254 | 0.754 | 0.381 | 0.969 |
Bayesian Nets | 0.128 | 0.918 | 0.224 | 0.920 | 0.419 | 0.743 | 0.536 | 0.984 | 0.504 | 0.684 | 0.581 | 0.987 |
C4.5 dec. trees | 0.092 | 0.795 | 0.165 | 0.899 | 0.886 | 0.591 | 0.709 | 0.994 | 0.783 | 0.801 | 0.792 | 0.995 |
Random tree | 0.074 | 0.871 | 0.136 | 0.900 | 0.628 | 0.573 | 0.573 | 0.990 | 0.628 | 0.573 | 0.599 | 0.990 |
Random forests | 0.213 | 0.947 | 0.348 | 0.963 | 0.945 | 0.503 | 0.656 | 0.993 | 0.883 | 0.661 | 0.756 | 0.995 |
SVM with SMO | 0.180 | 0.614 | 0.278 | 0.963 | 0.955 | 0.743 | 0.836 | 0.996 | 0.895 | 0.801 | 0.846 | 0.996 |
Algorithm | Under-sampled (204 instances of each class) | Whole unbalanced dataset | Cost-sensitive classification | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Precision | Recall | F-Score | Accuracy | Precision | Recall | F-Score | Accuracy | Precision | Recall | F-Score | Accuracy | |
Naive Bayes | 0.075 | 0.929 | 0.139 | 0.834 | 0.155 | 0.675 | 0.252 | 0.942 | 0.167 | 0.584 | 0.260 | 0.952 |
Bayesian Nets | 0.099 | 0.934 | 0.179 | 0.876 | 0.475 | 0.584 | 0.524 | 0.985 | 0.813 | 0.528 | 0.640 | 0.991 |
C4.5 dec. trees | 0.119 | 0.929 | 0.210 | 0.899 | 0.912 | 0.685 | 0.783 | 0.994 | 0.839 | 0.766 | 0.801 | 0.994 |
Random tree | 0.081 | 0.959 | 0.150 | 0.843 | 0.739 | 0.746 | 0.742 | 0.992 | 0.739 | 0.746 | 0.742 | 0.992 |
Random forests | 0.155 | 0.990 | 0.218 | 0.922 | 0.953 | 0.624 | 0.755 | 0.994 | 0.893 | 0.807 | 0.848 | 0.996 |
SVM with SMO | 0.122 | 0.909 | 0.215 | 0.897 | 0.903 | 0.756 | 0.823 | 0.995 | 0.833 | 0.812 | 0.823 | 0.995 |
 | TP | FP | FN | Precision | Recall | F-Score |
---|---|---|---|---|---|---|
Cell role—header | 61 | 39 | 32 | 0.6100 | 0.6559 | 0.6321 |
Cell role—stub | 309 | 0 | 0 | 1.0000 | 1.0000 | 1.0000 |
Cell role—super-row | 49 | 6 | 45 | 0.8909 | 0.5213 | 0.6578 |
Cell role—data | 675 | 18 | 104 | 0.9740 | 0.8664 | 0.9171 |
Overall (micro average) | 1094 | 63 | 181 | 0.9455 | 0.8580 | 0.9014 |
5 Generalizability case study
5.1 Document reading and table detection
5.2 Functional and structural processing
Algorithm | Precision | Recall | F-score |
---|---|---|---|
Naive Bayes | 0.588 | 0.936 | 0.722 |
Bayesian Networks | 0.559 | 0.964 | 0.708 |
SVM with SMO | 0.985 | 0.821 | 0.896 |
C4.5 decision tree | 0.944 | 0.307 | 0.463 |
Random forests | 0.973 | 0.875 | 0.922 |
Dataset | TP | FP | FN | Precision | Recall | F-score |
---|---|---|---|---|---|---|
Training data | 288 | 8 | 41 | 0.973 | 0.875 | 0.922 |
Testing data | 176 | 59 | 26 | 0.749 | 0.871 | 0.805 |
5.3 Pragmatic analysis
5.4 Table annotation
5.5 Cell selection and syntactic analysis
Dataset | TP | FP | FN | Precision | Recall | F-score |
---|---|---|---|---|---|---|
Training data | 514 | 16 | 128 | 0.970 | 0.819 | 0.888 |
Testing data | 428 | 45 | 122 | 0.904 | 0.778 | 0.836 |
5.6 Remarks about the generalizability of framework
- Document and table reading requires modifications for a given data format: defined data structures have to be populated from the original document. Once the data structures are populated, the rest of the methodology requires little modification.
- Table detection remains the same across the majority of XML documents (finding table tags). However, other formats, such as PDF or ASCII text documents, may require a more complex table detection methodology.
- Functional analysis remains the same for the majority of documents whose tables contain emphasis features that clearly distinguish headers, stubs, super-rows and data cells. For documents with tables that do not contain enough emphasis cues (different font styles, breaking lines, etc.), it may be necessary to introduce a lexical classification of cells, as we did in the DailyMed case study.
- Structural analysis does not need any modification; it depends only on the output of functional analysis.
- Pragmatic analysis can be performed either by rules or by machine learning classification. It is task-dependent; therefore, new rules or new classification models may be necessary for each task.
- Table annotation: in the biomedical domain it is standard to use UMLS annotation, which would be helpful for most tasks. In other domains, however, other taxonomies, vocabularies or ontologies may be used.
- Cell selection: the framework for creating rules remains the same, involving a whitelist and a blacklist; however, the rules will differ between tasks.
- Syntactic processing: many of the data presentation patterns, especially for numeric values, can be reused across tasks and domains. However, for certain tasks a new set of syntactic rules has to be crafted. The framework allows easy creation of these rules using regular expressions and the assignment of semantics to the extracted value groups.