1 Introduction
Connectives | Sentences | Labels
---|---|---
As | There was no debate as the Senate passed the bill on to the House [10] | Causal
As | It has a fixed time, as collectors well know [10] | Non-causal
After | Bischoff in a round table discussion claimed he fired Austin after he refused to do a taping in Atlanta [61] | Causal
After | In stark contrast to his predecessor, five days after his election he spoke of his determination to do what he could to bring peace [61] | Non-causal
– | He derives great joy and happiness from cycling [5] | Causal
Sentences | Causality forms | Causality pairs
---|---|---
Financial stress is one of the main causes of divorce | Explicit intra-sentential | <Financial stress, divorce>
Financial stress can speed divorce up | Implicit | <Financial stress, divorce>
You may hear that unfaithfulness can lead to divorce. On the other hand, financial stress is another significant factor | Inter-sentential | <Financial stress, divorce>
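The three causality forms in the table can be represented uniformly as labeled cause–effect pairs. A minimal sketch of such a representation follows; the class and field names are illustrative assumptions, not taken from any surveyed system.

```python
from dataclasses import dataclass
from enum import Enum

# Illustrative encoding of the taxonomy in the table above.
class CausalityForm(Enum):
    EXPLICIT_INTRA = "explicit intra-sentential"
    IMPLICIT = "implicit"
    INTER = "inter-sentential"

@dataclass
class CausalityPair:
    cause: str
    effect: str
    form: CausalityForm

pair = CausalityPair("Financial stress", "divorce", CausalityForm.IMPLICIT)
print(f"<{pair.cause}, {pair.effect}> ({pair.form.value})")
```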
2 Previous surveys
3 Benchmark datasets
Dataset | Year published | Causal examples | Source | Availability | Balanced
---|---|---|---|---|---
SemEval-2007 Task 4 | 2007 | 220 | Wikipedia | Publicly available\(^\mathrm{a}\) | X
SemEval-2010 Task 8 | 2010 | 1331 | Wikipedia | Publicly available\(^\mathrm{b}\) | –
PDTB 2.0 | 2008 | 9190 | WSJ | License required\(^\mathrm{c}\) | –
TACRED | 2018 | 269 | Newswire, Web | License required\(^\mathrm{d}\) | –
BioInfer | 2007 | 1461 | PubMed | Publicly available\(^\mathrm{e}\) | X
ADE | 2012 | 6821 | PubMed | Publicly available\(^\mathrm{f}\) | –
-
SemEval-2007 task 4 This task is part of SemEval (Semantic Evaluation), the fourth edition of the semantic evaluation series [27]. It provides a dataset for classifying semantic relations between two nominals. Within the set of seven relations, the organizers split the Cause–Effect examples into 140 training examples (52.0% positive) and 80 test examples (51.0% positive). The dataset has the following advantages: (a) Strong reputation. SemEval is one of the most influential and largest-scale natural language semantic evaluation competitions. As of 2020, SemEval has been successfully held fourteen times and has a high impact in both industry and academia. (b) Easy accessibility. Each relation's examples and annotations are collected in a separate TXT file, which also reduces the workload of data preprocessing. Conversely, its main limitation is the small amount of data: 140 training and 80 test examples are far from sufficient for developing a CE system.
-
SemEval-2010 task 8 Unlike its predecessor, SemEval-2007 Task 4, which has an independent binary-labeled dataset for each kind of relation, this is a multi-class task in which the relation label of each sample is one of nine relation types [34]. Within the 10,717 annotated examples, the Cause–Effect relation accounts for 1003 training examples (13.0% of the training data) and 328 test examples (12.0% of the test data). The small sample size and class imbalance are the major limitations of this dataset (see the parsing sketch below).
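A minimal sketch of reading the dataset and binarizing its labels into causal vs. non-causal, assuming the standard TRAIN_FILE.TXT layout (an id, a tab, and the quoted sentence on one line; the relation label on the next; then a Comment line and a blank line per example):

```python
import re

def load_semeval2010(path):
    """Parse SemEval-2010 Task 8 examples into (sentence, is_causal) pairs."""
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    examples = []
    for i in range(0, len(lines), 4):          # 3 content lines + 1 blank per example
        sentence = lines[i].split("\t", 1)[1].strip('"')
        label = lines[i + 1]
        # Binarize: Cause-Effect(e1,e2) / Cause-Effect(e2,e1) vs. everything else
        is_causal = label.startswith("Cause-Effect")
        # Strip the <e1>/<e2> entity markers for plain-text use
        plain = re.sub(r"</?e[12]>", "", sentence)
        examples.append((plain, is_causal))
    return examples

data = load_semeval2010("TRAIN_FILE.TXT")
pos = sum(causal for _, causal in data)
print(f"{pos}/{len(data)} causal examples ({100 * pos / len(data):.1f}%)")
```

Counting the positives this way makes the class imbalance noted above directly visible.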
-
PDTB 2.0 The second release of the Penn Discourse Treebank (PDTB) from Prasad et al. [74] is the largest annotated corpus of discourse relations. It includes 72,135 non-causal and 9190 causal examples drawn from 2312 Wall Street Journal (WSJ) articles. In addition, the dataset contains a type of implicit relation known as AltLex (alternative lexicalization), in which causal meaning is not expressed by an explicit causal lexical marker. However, PDTB is stored in a complex format, so researchers need conversion tools to turn it into files that are easy to work with (a filtering sketch follows below).
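Once converted to a tabular form, the causal subset can be filtered by sense label, since PDTB's sense hierarchy places causal senses under Contingency.Cause. A sketch under that assumption follows; the column name "SemClass" is a hypothetical placeholder for whatever schema your conversion tool produces.

```python
import csv

def split_causal(csv_path):
    """Split a CSV conversion of PDTB 2.0 into causal and non-causal rows."""
    causal, non_causal = [], []
    with open(csv_path, encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # "SemClass" is a placeholder column name; PDTB's causal senses
            # sit under Contingency.Cause in its sense hierarchy.
            if "Contingency.Cause" in (row.get("SemClass") or ""):
                causal.append(row)
            else:
                non_causal.append(row)
    return causal, non_causal
```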
-
TACRED Similar to SemEval, the Text Analysis Conference (TAC) is a series of evaluation workshops on NLP research. The TAC Relation Extraction Dataset (TACRED) contains 106,264 examples of newswire and web text collected from the TAC KBP challenges held from 2009 to 2014 [97]. The sentences are annotated with person- and organization-oriented relation types (e.g., per:title, org:founded). The main limitation of TACRED for the CE task is the small number of causal examples: only 269 cause_of_death instances are available (see the filtering sketch below).
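A short sketch of pulling that causal subset out of TACRED's JSON release. The field names ("relation", "token") follow the dataset's published format, but treat them as assumptions if your copy differs.

```python
import json

def cause_of_death_examples(path):
    """Return (sentence, relation) pairs for TACRED's causal relation type."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    subset = [ex for ex in data if ex["relation"] == "per:cause_of_death"]
    # Each example stores the sentence as a token list; rejoin for display.
    return [(" ".join(ex["token"]), ex["relation"]) for ex in subset]

examples = cause_of_death_examples("train.json")
print(f"{len(examples)} cause_of_death instances")
```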
-
BioInfer Pyysalo et al. [75] introduce an annotated corpus, BioInfer (Bio Information Extraction Resource), which contains 1100 sentences from biomedical publications annotated with relations among genes, proteins, and RNA. The 1100 sentences carry 2662 relations, of which 1461 (54.9%) are cause–effect relations. The original data is distributed in XML form, with each sentence carrying entity markup (see the sketch below).
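A minimal sketch of walking a BioInfer-style XML file with the standard library. The tag and attribute names ("sentence", "entity", "text") are hypothetical placeholders for the real schema; consult the corpus documentation for the exact names.

```python
import xml.etree.ElementTree as ET

def iter_sentences(xml_path):
    """Yield (sentence text, entity mentions) from an entity-markup XML file.

    Tag/attribute names here are placeholders, not BioInfer's actual schema.
    """
    root = ET.parse(xml_path).getroot()
    for sent in root.iter("sentence"):
        entities = [e.get("text") for e in sent.iter("entity")]
        yield sent.get("text"), entities
```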
-
ADE The ADE task aims to extract two entity types (drugs and diseases) and the relations between drugs and their adverse drug effects (ADEs) [33, 55]. The dataset is collected from 1644 PubMed abstracts, in which 6821 sentences contain at least one ADE relation and 16,695 sentences are annotated as non-ADE sentences. Annotators only label drugs and diseases in the ADE sentences, so some studies, such as [55], use only the 6821 ADE sentences in their experiments.
4 Evaluation metrics
5 Causal relation extraction methods
5.1 Knowledge-based approaches
5.1.1 Explicit intra-sentential causality
5.1.2 Implicit causality
Causal pattern | Linguistic realization
---|---
Destroy | “short-circuit in brake wiring destroyed the power supply”
Prevent | “message box prevented viewer from starting”
Exceed | “breaker voltage exceeded allowable limit”
Reduce | “resistor reduced voltage output”
Cause | “gray lines caused by magnetic influence”
Induce | “bad cable extension might have induced the motion problem”
Due to | “replacement of geometry connection cable due to wear and tear”
Mar | “cluttered options mars console menu”
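Verb patterns like these lend themselves to simple rule-based matching. Below is a minimal sketch of that idea, assuming a hand-written lexicon built from a few of the patterns in the table; the naive regex spans stand in for the syntactic analysis a real knowledge-based system would use.

```python
import re

# Causal verb lexicon drawn from the pattern table above. Each regex
# captures a naive <cause, effect> span pair around the causal verb.
CAUSAL_PATTERNS = [
    (r"(?P<cause>.+?) destroyed (?P<effect>.+)", "Destroy"),
    (r"(?P<cause>.+?) prevented (?P<effect>.+)", "Prevent"),
    (r"(?P<effect>.+?) caused by (?P<cause>.+)", "Cause"),
    (r"(?P<cause>.+?) (?:might have )?induced (?P<effect>.+)", "Induce"),
    (r"(?P<effect>.+?) due to (?P<cause>.+)", "Due to"),
]

def match_causal(sentence):
    """Return (pattern name, cause span, effect span) for the first match."""
    for regex, name in CAUSAL_PATTERNS:
        m = re.search(regex, sentence, flags=re.IGNORECASE)
        if m:
            return name, m.group("cause").strip(), m.group("effect").strip()
    return None

print(match_causal("short-circuit in brake wiring destroyed the power supply"))
# -> ('Destroy', 'short-circuit in brake wiring', 'the power supply')
```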
5.1.3 Inter-sentential causality
5.2 Statistical machine learning-based approaches
5.2.1 Explicit Intra-sentential causality
5.2.2 Implicit causality
5.2.3 Inter-sentential causality
5.3 Deep learning-based approaches
5.3.1 Explicit intra-sentential causality
5.3.2 Implicit causality
5.3.3 Inter-sentential causality
6 Systems summary
6.1 Knowledge-based approaches
6.2 Statistical machine learning-based approaches
6.3 Deep learning-based approaches
Dataset | System | Year | Technique | F-score (%)
---|---|---|---|---
SemEval-2007 Task 4 | Beamer et al. [5] | 2008 | Knowledge-based | 65.8
 | Girju et al. [28] | 2009 | Knowledge-based | 70.6
SemEval-2010 Task 8 | Sorgente et al. [83] | 2013 | Statistical ML-based | 73.9
 | Xu et al. [91] | 2015 | Deep learning-based | 83.7
 | Li et al. [56] | 2021 | Deep learning-based | 84.6
 | Zhang et al. [98] | 2018 | Deep learning-based | 84.8
 | Zhao et al. [99] | 2016 | Statistical ML-based | 85.6
 | Pakray and Gelbukh [69] | 2014 | Statistical ML-based | 85.8
 | Wang et al. [89] | 2016 | Deep learning-based | 88.0
 | Kyriakakis et al. [51] | 2019 | Deep learning-based | 90.6
PDTB 2.0 | Lin et al. [57] | 2009 | Statistical ML-based | 51.0
 | Rutherford and Xue [81] | 2014 | Statistical ML-based | 54.4
 | Ponti and Korhonen [73] | 2017 | Deep learning-based | 54.5
 | Chen et al. [15] | 2016 | Deep learning-based | 54.8
 | Lan et al. [52] | 2017 | Deep learning-based | 58.9
PDTB 2.0 AltLex | Hidey and McKeown [35] | 2016 | Statistical ML-based | 75.3
 | Martínez-Cámara et al. [61] | 2017 | Deep learning-based | 81.9
TACRED | Zhang et al. [97] | 2017 | Deep learning-based | 65.4
 | Zhang et al. [98] | 2018 | Deep learning-based | 68.2
BioInfer | Chen et al. [14] | 2020 | Deep learning-based | 49.8
 | Airola et al. [1] | 2008 | Statistical ML-based | 61.3
ADE | Kang et al. [42] | 2014 | Knowledge-based | 54.3
 | Gurulingappa et al. [33] | 2012 | Statistical ML-based | 70.0
 | Li et al. [55] | 2017 | Deep learning-based | 71.4
 | Bekoulis et al. [7] | 2018 | Deep learning-based | 74.6
 | Bekoulis et al. [6] | 2018 | Deep learning-based | 75.5
 | Wang and Lu [88] | 2020 | Deep learning-based | 80.1
 | Zhao et al. [100] | 2020 | Deep learning-based | 81.1
Technique | System | Year | Causality form | Dataset | Performance
---|---|---|---|---|---
Knowledge-based | Khoo et al. [46] | 1998 | Inter-sentential | 1082 WSJ sentences | Accuracy (68.0%)
 | Khoo et al. [47] | 2000 | Explicit intra-sentential | 130 medical abstracts | Precision (68.0%)
 | Garcia et al. [25] | 2006 | Explicit intra-sentential | Technical texts in French | Precision (85.2%)
 | Bui et al. [12] | 2010 | Inter-sentential | 630 medical sentences | F-score (84.0%)
 | Radinsky et al. [79] | 2012 | Explicit intra-sentential | 150 years of news articles | Precision (77.8%)
 | Ittoo and Bouma [38] | 2013 | Implicit | 32,545 documents in PD-CS | F-score (85.0%)
Statistical ML-based | Marcu and Echihabi [60] | 2002 | Inter-sentential | BLIPP | Accuracy (87.3%)
 | Girju [26] | 2003 | Explicit intra-sentential | TREC | Precision (73.9%)
 | Pechsiri et al. [70] | 2006 | Implicit | 3000 medical sentences in Thai | Precision (86.0%)
 | Blanco et al. [10] | 2008 | Explicit intra-sentential | TREC | F-score (91.3%)
 | Kim et al. [48] | 2013 | Explicit intra-sentential | Six months of news articles | t-value (3.87)
 | Oh et al. [66] | 2013 | Inter-sentential | 850 Japanese QA examples | F-score (77.0%)
 | Keskes et al. [44] | 2014 | Implicit | 90 documents in Arabic | F-score (80.6%)
 | Qian and Zhou [77] | 2016 | Inter-sentential | 1500 medical abstracts | Precision (58.3%)
Deep learning-based | Kruengkrai et al. [49] | 2017 | Inter-sentential | 159,350 web sentences | Precision (55.1%)
 | Dasgupta et al. [21] | 2018 | Inter-sentential | Extended SemEval-2010 Task 8 | F-score (66.0%)
 | Jin et al. [39] | 2020 | Inter-sentential | 1986 sentences in Chinese | F-score (82.3%)
7 Open problems and future directions
-
Multiple causalities Most previous CE work focused on extracting one causal pair per instance, but causality in real-world text is more complex. Causal Patterns in Science from the Harvard Graduate School of Education introduces three common causal patterns, listed below:
-
(9a) Domino Causality: one cause produces multiple effects.
-
(9b) Relational Causality: two causes work together to produce an effect.
-
(9c) Mutual Causality: cause and effect impact each other simultaneously or sequentially.
As in the study of [21], the traditional way to handle these kinds of multiple causalities is to divide a sentence into several sub-sentences from which causal pairs are extracted separately. This method is computationally expensive and cannot take the dependencies among causality pairs into consideration. The Tag2Triplet algorithm from [56] can extract multiple causal triplets simultaneously. It counts the number and distribution of each causal tag to judge whether the tag expresses simple or complex causality, and then applies a Cartesian product over the causal entities to generate candidate causal triplets (see the sketch below). In addition, [17, 87] utilize deep learning with relational reasoning to identify multiple relations in one instance simultaneously.
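A minimal sketch of the Cartesian-product step just described: given the tagged cause and effect entities of one sentence, enumerate candidate <cause, trigger, effect> triplets. This is an illustration of the idea, not the authors' Tag2Triplet implementation.

```python
from itertools import product

def candidate_triplets(causes, effects, trigger):
    """Enumerate <cause, trigger, effect> candidates for one sentence."""
    return [(c, trigger, e) for c, e in product(causes, effects)]

# Relational causality: two causes, one effect -> two candidate triplets.
print(candidate_triplets(
    ["financial stress", "unfaithfulness"], ["divorce"], "lead to"))
```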
-
Data deficiency Typically, for many classification tasks, more than 10 million samples are required to train a deep learning model that matches or exceeds human performance [29]. However, the benchmark datasets introduced in Sect. 3 fall far short of this size, and annotated data in the real world is domain-specific and scarce. Based on the assumption that any sentence containing a pair of entities that participate in a known Freebase relation is likely to express that relation, Mintz et al. [65] introduce the first distant supervision (DS) system for relation extraction, which creates and labels training instances using Freebase as a source of relation labels (a sketch of the labeling heuristic follows below). However, this method suffers from a large amount of noisily labeled data. The survey of [82] covers methods for addressing the incomplete and wrong labels produced by DS, such as at-least-one models, topic-based models, and pattern-correlation models. Recent research from Huang and Wong [37] proposes a novel way to extract relations from insufficient labeled data. They first use a BiLSTM with an attention mechanism to encode sentences in an unsupervised manner, with the word sequences between entity pairs acting as relation embeddings. A random forest classifier is then used to learn the relation types from these embeddings. This combination of unsupervised and supervised learning offers another way of addressing the data deficiency problem in the CE task.
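A minimal sketch of the distant supervision heuristic described above: any sentence mentioning both entities of a known causal pair is labeled as a (noisy) positive training instance. The knowledge-base pairs here are illustrative placeholders, not Freebase data.

```python
# Placeholder knowledge base of known causal pairs (stand-in for Freebase).
KNOWN_CAUSAL_PAIRS = {
    ("smoking", "lung cancer"),
    ("financial stress", "divorce"),
}

def distant_label(sentences):
    """Label sentences as noisy positives when both entities co-occur."""
    labeled = []
    for sent in sentences:
        text = sent.lower()
        for cause, effect in KNOWN_CAUSAL_PAIRS:
            if cause in text and effect in text:
                labeled.append((sent, (cause, effect), "causal"))
                break
        else:
            labeled.append((sent, None, "unlabeled"))
    return labeled

print(distant_label(["Years of smoking gave him lung cancer.",
                     "He enjoys cycling."]))
```

The first sentence is labeled causal even though it would also match spurious co-occurrences, which is exactly the noise problem the follow-up methods above try to repair.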
-
Document-level causality Both intra- and inter-sentential causality operate at the sentence level; in real-world scenarios, however, many causal relations span multiple sentences or even different paragraphs. Unlike causality that can be extracted directly through linguistic cues or features, satisfactory document-level CE requires a model with strong pattern recognition, logical reasoning, and common-sense reasoning [18]. All of these aspects need long-term research and exploration. Zeng et al. [95] introduce a system that combines a GCN with relational reasoning to extract relations within a document. They first construct a mention-level GCN to model the complex interactions among entities, and then utilize a path reasoning mechanism to infer the relation between two entities (a sketch of the graph-propagation idea follows below). This method outperforms the state of the art on the public DocRED dataset from Yao et al. [94]. Similar approaches can be found in [64, 86].
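A toy sketch of the mention-level graph idea: nodes are entity mentions, edges connect mentions that co-occur in a sentence or co-refer, and one GCN layer propagates features over the normalized adjacency. This illustrates the general technique under those assumptions, not the system of Zeng et al.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One GCN layer: ReLU(D^-1 (A + I) X W) over a mention graph."""
    a_hat = adj + np.eye(adj.shape[0])            # add self-loops
    deg_inv = np.diag(1.0 / a_hat.sum(axis=1))    # D^-1 normalization
    return np.maximum(deg_inv @ a_hat @ feats @ weight, 0.0)

rng = np.random.default_rng(0)
adj = np.array([[0, 1, 0],     # mention graph: 0-1 share a sentence,
                [1, 0, 1],     # 1-2 co-refer; 0 reaches 2 only through
                [0, 1, 0]],    # node 1, giving a reasoning "path"
               dtype=float)
feats = rng.standard_normal((3, 4))               # 3 mentions, 4-dim features
weight = rng.standard_normal((4, 4))
print(gcn_layer(adj, feats, weight).shape)        # -> (3, 4)
```

After propagation, a path reasoning mechanism in the style of Zeng et al. would score entity pairs by composing representations along such mention paths rather than from single sentences.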