1 Introduction to Coreference Resolution
- To evaluate pre-existing English and German coreference resolution systems
- To investigate the effectiveness of performing coreference resolution on a variety of out-of-domain texts in both English and German (outlined in Sect. 4) from digital curation scenarios.
2 Summary of Approaches to Coreference Resolution
3 Three Implementations
- Rule-based (Multi-Sieve Approach): English, German
- Statistical (Mention Ranking Model): English, German
- Projection-based (Crosslingual): coreference for German using English models.
3.1 Rule-Based Approach
Text (de): Barack Obama besuchte Berlin. Am Abend traf Barack Obama die Kanzlerin.
Coref: [Barack Obama, Barack Obama]
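The chain in the example above is what a simple exact-string-match sieve would produce. The following is a minimal sketch of such a sieve, not the implemented system: the function name `exact_match_sieve` and the input format (a list of already detected mention strings) are assumptions made for illustration.

```python
from collections import defaultdict

def exact_match_sieve(mentions):
    """Group mentions whose surface strings are identical (case-insensitive).

    `mentions` is a list of mention strings in document order; mention
    detection is assumed to have happened upstream.
    """
    clusters = defaultdict(list)
    for text in mentions:
        clusters[text.lower()].append(text)
    # Keep only groups with at least two mentions, i.e. actual coreference chains.
    return [chain for chain in clusters.values() if len(chain) > 1]

# Hypothetical mention list for the German example sentence pair.
mentions = ["Barack Obama", "Berlin", "Barack Obama", "die Kanzlerin"]
chains = exact_match_sieve(mentions)
print(chains)  # [['Barack Obama', 'Barack Obama']]
```

A full multi-sieve system would apply several such passes in sequence, each one refining the clusters left by the previous pass.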
3.2 Statistical Approach
- Distance features: the distance between the two mentions in a sentence, number of mentions
- Syntactic features: number of embedded NPs under a mention, Part-Of-Speech tags of the first, last, and head word (based on the German parsing models included in the Stanford CoreNLP, Rafferty and Manning 2008)
- Semantic features: named entity type, speaker identification
- Lexical features: the first, last, and head word of the current mention.
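To make the feature groups above concrete, here is a minimal sketch of a feature extractor for one candidate mention pair. The `Mention` class, its fields, and the feature names are hypothetical; the distance features reflect one possible reading of the list (distance in sentences and in intervening mentions), and the syntactic features are left out because they require a parser.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    tokens: list       # surface tokens of the mention
    head_index: int    # index of the head word within `tokens`
    sentence_id: int   # index of the sentence containing the mention
    mention_id: int    # running index of the mention in the document
    ner_type: str      # named entity type, e.g. "PERSON", or "O" for none

def pair_features(antecedent, anaphor):
    """Build a feature dictionary for one (antecedent, anaphor) candidate pair."""
    return {
        # Distance features (assumed interpretation)
        "sentence_distance": anaphor.sentence_id - antecedent.sentence_id,
        "mention_distance": anaphor.mention_id - antecedent.mention_id,
        # Semantic feature: do the named entity types agree?
        "same_ner_type": antecedent.ner_type == anaphor.ner_type,
        # Lexical features of the current mention
        "first_word": anaphor.tokens[0].lower(),
        "last_word": anaphor.tokens[-1].lower(),
        "head_word": anaphor.tokens[anaphor.head_index].lower(),
    }

first = Mention(["Barack", "Obama"], 1, 0, 0, "PERSON")
second = Mention(["Barack", "Obama"], 1, 1, 2, "PERSON")
features = pair_features(first, second)
print(features["sentence_distance"], features["head_word"])  # 1 obama
```

In a mention ranking model, feature dictionaries like this one would be computed for every candidate antecedent of a mention and scored against each other.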
3.3 Projection-Based Approach
- Transferring models: Computing coreference on text in English, and projecting these annotations onto parallel German text via word alignments in order to obtain a German coreference model
- Transferring data: Translating German text to English, computing coreference on the translated English text using an English coreference model, and then projecting the annotations back onto the original text via word alignment.
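Both strategies rely on the same projection step: mention spans found on one side of a sentence pair are mapped to the other side through word alignments. The sketch below illustrates that step only. The function name `project_mentions`, the data layout, the English translation, and the alignment pairs are all made up for illustration; a real pipeline would obtain alignments from a word aligner.

```python
def project_mentions(source_mentions, alignment):
    """Project mention spans from source tokens to target tokens.

    source_mentions: dict mapping an entity id to a list of (start, end)
        token spans in the source sentence, end exclusive.
    alignment: set of (source_index, target_index) word alignment pairs.
    Returns the same structure with spans over target token indices.
    """
    projected = {}
    for entity_id, spans in source_mentions.items():
        target_spans = []
        for start, end in spans:
            # Collect every target token aligned to a token inside the span.
            targets = [t for s, t in alignment if start <= s < end]
            if targets:
                # Use the smallest contiguous target span covering them.
                target_spans.append((min(targets), max(targets) + 1))
        if target_spans:
            projected[entity_id] = target_spans
    return projected

# Hypothetical sentence pair and hand-written alignment:
# en: In(0) the(1) evening(2) Barack(3) Obama(4) met(5) the(6) chancellor(7)
# de: Am(0) Abend(1) traf(2) Barack(3) Obama(4) die(5) Kanzlerin(6)
alignment = {(0, 0), (1, 0), (2, 1), (3, 3), (4, 4), (5, 2), (6, 5), (7, 6)}
english_mentions = {0: [(3, 5)], 1: [(6, 8)]}
german_mentions = project_mentions(english_mentions, alignment)
print(german_mentions)  # {0: [(3, 5)], 1: [(5, 7)]}
```

Mentions with no aligned target tokens are dropped in this sketch, which is one of several possible ways to handle alignment gaps.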
4 Evaluation and Case Studies
- Mendelsohn Letters Dataset (German and English): The collection (Bienert and de Wit 2014) contains 2,796 letters, written between 1910 and 1953, with a total of 1,002,742 words on more than 11,000 sheets of paper; 1,410 of the letters were written by Erich and 1,328 by Luise Mendelsohn. Most are in German (2,481); the rest are written in English (312) and French (3).
- Research excerpts for a museum exhibition (English): This is a document collection retrieved from online archives: Wikipedia, archive.org, and Project Gutenberg. It contains documents related to Vikings; the content of this collection has been used to plan and to conceptualise a museum in Denmark.
- Regional news stories (German): This is a general-domain regional news collection in German. It contains 1,037 news articles, written between 2013 and 2015.
Corpus | Language | Documents | Words | Domain |
---|---|---|---|---|
Mendelsohn | DE | 2,501 | 699,213 | Personal letters |
Mendelsohn | EN | 295 | 21,226 | Personal letters |
Vikings | EN | 12 | 298,577 | Wikipedia and E-books |
News | DE | 1,037 | 716,885 | News articles and summaries |
System | MUC | B-cube |
---|---|---|
BART | 45.3 | 64.5 |
Sieve | 49.2 | 45.3 |
Statistical | 56.3 | 50.4 |
Neural | 60.0 | 56.8 |
- Setting 1: Whole system with all 6 sieves in place
- Setting 2: Contains all mentions but no coreference links
- Setting 3: Setting 1 without the module that deletes any cluster containing no mention recognized as an entity
- Setting 4: Setting 1 with that module (deleting any cluster containing no mention recognized as an entity) executed after the sieves have been applied.
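The difference between Settings 3 and 4 comes down to whether the entity-based cluster filter runs once the sieves have finished. The following is a minimal sketch under simplifying assumptions: mentions are plain strings, the single sieve is a toy stand-in, and all function names are hypothetical. Where the filter runs in Setting 1 is not modelled here.

```python
def drop_clusters_without_entity(clusters, entity_mentions):
    """Keep only clusters in which at least one mention was recognized as an entity."""
    return [c for c in clusters if any(m in entity_mentions for m in c)]

def merge_identical_strings(clusters):
    """Toy sieve: merge clusters that share an identical mention string."""
    merged = []
    for cluster in clusters:
        for existing in merged:
            if set(existing) & set(cluster):
                existing.extend(cluster)
                break
        else:
            # No overlap with any earlier cluster: start a new one (copied).
            merged.append(list(cluster))
    return merged

def run_pipeline(clusters, sieves, entity_mentions, filter_after_sieves):
    """Apply all sieves in order, then optionally apply the entity filter."""
    for sieve in sieves:
        clusters = sieve(clusters)
    if filter_after_sieves:
        clusters = drop_clusters_without_entity(clusters, entity_mentions)
    return clusters

clusters = [["Ray Brock"], ["dinner"], ["Ray Brock", "his"]]
entities = {"Ray Brock"}
sieves = [merge_identical_strings]

setting_3 = run_pipeline(clusters, sieves, entities, filter_after_sieves=False)
setting_4 = run_pipeline(clusters, sieves, entities, filter_after_sieves=True)
print(setting_3)  # [['Ray Brock', 'Ray Brock', 'his'], ['dinner']]
print(setting_4)  # [['Ray Brock', 'Ray Brock', 'his']]
```

Running the filter only at the end gives the sieves a chance to attach pronouns and common nouns to an entity cluster before clusters without entities are discarded.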
System | MUC | B-cube |
---|---|---|
Setting 1 | 54.4 | 11.2 |
Setting 2 | 70.5 | 23.1 |
Setting 3 | 58.9 | 15.0 |
Setting 4 | 56.1 | 12.0 |
System | MUC | B-cube |
---|---|---|
CorZu | 60.1 | 58.9 |
CoRefGer-rule | 50.2 | 63.3 |
CoRefGer-stat | 40.1 | 45.3 |
CoRefGer-proj | 35.9 | 40.3 |
Dataset | Sents. | Words | Mentions |
---|---|---|---|
Mendelsohn EN | 21K | 109K | 48% |
Mendelsohn DE | 34K | 681K | 26% |
Vikings EN | 39K | 310K | 49% |
News Stories DE | 53K | 369K | 25% |
4.1 Add-On Value of Coreference Resolution to Digital Curation Scenarios
“Then came Ray Brock for dinner. On him I will elaborate after my return or as soon as a solution pops up on my “Klappenschrank”. Naturally, he sends his love to Esther and his respects to you.”
In this excerpt, a model or dictionary can only spot “Ray Brock”, but “him”, “he” and “his” also refer to this entity. With the aid of coreference resolution, we can increase the recall of named entity recognition and potentially expand the range of event detection.
- Input a text document, and run coreference resolution on it
- With the aid of the above, replace all occurrences of pronouns with the actual noun in full form, such that “he” and “his” are replaced with “Ray Brock” and “Ray Brock’s” respectively
- Run an NLP process such as Named Entity Recognition on the new document and compare with a run without the coreference annotations.
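The replacement step in the procedure above can be sketched as follows. This is an illustration only: the coreference output format (a representative mention plus the token positions of its pronouns) and the function name `replace_pronouns` are assumptions, and the possessive handling covers only a few unambiguous English pronouns.

```python
# "her" is omitted because it can be either an object or a possessive form.
POSSESSIVE_PRONOUNS = {"his", "its", "their"}

def replace_pronouns(tokens, chains):
    """Replace pronoun mentions with the representative mention of their chain.

    tokens: list of token strings.
    chains: list of (representative, pronoun_indices) pairs, where
        pronoun_indices are token positions of pronouns in that chain.
    """
    resolved = list(tokens)
    for representative, pronoun_indices in chains:
        for i in pronoun_indices:
            if tokens[i].lower() in POSSESSIVE_PRONOUNS:
                resolved[i] = representative + "'s"
            else:
                resolved[i] = representative
    return resolved

tokens = "Naturally , he sends his love to Esther".split()
# Hypothetical coreference output: "he" (index 2) and "his" (index 4)
# both refer to "Ray Brock".
chains = [("Ray Brock", [2, 4])]
resolved_text = " ".join(replace_pronouns(tokens, chains))
print(resolved_text)  # Naturally , Ray Brock sends Ray Brock's love to Esther
```

The resolved text can then be passed to a named entity recognizer, and the result compared against a run on the original text.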