
Open Access 12.11.2024 | Original Research

Deciphering disagreement in the annotation of EU legislation

Authors: Gijs van Dijck, Carlos Aguilera, Shashank M. Chakravarthy

Published in: Artificial Intelligence and Law



Abstract

The topic of annotating legal data has received surprisingly little attention. A key challenge of the annotation process is reaching sufficient agreement among annotators and separating mistakes from genuine disagreement. This study presents an approach that provides insights into, and helps resolve, potential disagreement amongst annotators. It (1) introduces different strategies for calculating agreement levels, (2) compares agreement levels between annotators (inter-annotator agreement) before and after a revision round, and (3) compares agreement levels for annotators who annotate the same texts twice (intra-annotator agreement). The inter-annotator agreement levels are also compared to a revision round in which an arbiter corrected the annotators' labels. The analysis is based on the annotation of EU legislative provisions at two stages (initial annotations, after annotator revisions) and for various tasks (Definitions, References, Quantities, IF-THEN statements, Exceptions, Scope, Hierarchy, Deontic Clauses, Active and Passive Role) by multiple annotators. The results reveal that agreement levels vary based on the stage of measurement (before/after revisions), the nature of the task, the method of assessment, and the annotator combination. The agreement scores, along with some of the initial measurements, align with those reported in previous research but increase after each revision round. This suggests that annotator revisions can substantially reduce disagreement. Additionally, disagreements were found not only between annotators but also within individual annotators. This inconsistency does not appear to stem from a lack of understanding of the guidelines or a lack of seriousness in task execution, as evidenced by moderate to substantial inter-annotator agreement scores. These findings suggest that annotators identified multiple valid interpretations, which highlights the complexity of annotating legislative provisions. The results underscore the significance of embracing, addressing, and reporting (dis)agreement in different ways and at the various stages of an annotation task.

1 Introduction

Training machine learning models in the legal domain is challenging, particularly when obtaining high-quality, structured data. Despite the availability of legal markup standards like Akoma Ntoso (AKN) and LegalRuleML (LRML) (Athan et al. 2013; Palmirani et al. 2011), legal texts are rarely formatted according to these schemas. Instead, legal information must often be extracted from unstructured, unformatted text, an undertaking complicated by the documents being lengthy, complex, and filled with specialized terminology.
These challenges are further compounded by interpretation issues, where the subtleties of legal language frequently lead to discrepancies in classification and annotation. Yet, despite the importance of ensuring annotation consistency in legal data, a thorough investigation into how annotator disagreement arises, how it should be addressed, and the impact such disagreement has on the reliability of legal machine-learning models has received little attention. Only a handful of studies have explored the complexities of annotating legal texts and the methodological considerations surrounding inter-annotator (dis)agreement.
Recent research on annotator agreement reveals that agreement levels among annotators of legal datasets intended for machine learning are commonly relatively low. Braun (2023), after examining 29 publications that relied on the annotation of legal machine learning datasets, found average agreement scores of .76 (Cohen's kappa), .675 (Fleiss' kappa), and .677 (Krippendorff's alpha), with the highest Krippendorff's alpha value not exceeding .78. Moreover, Braun (2023) reports that explanations of the annotation process are missing or, at best, limited to a single statement that the study relied on annotations by a legal expert.
This study addresses these gaps by focusing on analyzing annotator agreement throughout the annotation process and offering a comparative evaluation of different agreement calculation methods on a curated sample of EU legislative texts. It aims to deepen the understanding of inter-rater reliability in the legal domain and introduces approaches to ensure consistency in legal dataset annotation. The key contributions include an examination of how agreement levels change across different categories of legal annotations before and after a revision process and a comparison of different approaches for calculating inter-rater reliability regarding spans with different start and end indices.
This paper continues with a discussion of related work (Sect. 2), the methods used (Sect. 3), the results (Sect. 4), a discussion of the limitations and the results (Sects. 5 and 6), and the conclusion (Sect. 7).

2 Related work

Similar to Braun (2023), our overview reveals that explanations of the annotation process are mostly missing or, at best, limited to a single statement that the study relied on annotations by a legal expert. Very little information is available about (dis)agreement levels or how disagreement was handled, and neither the data nor the annotation guidelines are commonly shared or published. Our study does include this information.
Several studies have been conducted that train models based on annotated datasets. Table 8 in Appendix A provides an overview of research on the (semi-)automatic classification of norms. The overview indicates the categories and data used, whether the analysis relied on rule-based or machine-based methods, the number and type of annotators against which the output was compared, and the model performance. The output of the (semi-) automatic classification of norms can be used for frameworks aimed at the formalization of legislative documents like FLINT (van Doesburg and van Engers 2019), DataLex (Mowbray et al. 2021, 2023), and Lexsearch (Tiscornia and Turchi 1997).
Our study complements research on the (semi-)automatic classification of norms into specific norm categories, as this type of research benefits from high-quality annotated datasets that follow thorough annotation procedures. The present study provides such a procedure.

3 Method

3.1 Task selection

The answer to what information should be extracted from legal documents lies at the intersection of several influential frameworks. One such framework is the aforementioned LegalRuleML, an OASIS standard developed to facilitate the exchange and sharing of legal knowledge between legal documents, business rules, and software applications (Palmirani et al. 2011). This standard provides a variety of legal elements, including constitutive rules (definitions), obligations, prohibitions, permissions, agents (entities that act or have the capability to act) and authorities (persons or organizations with the power to create, endorse, or enforce legal norms), legal sources (References, LegalSources), and time and events (norm validity).¹ Eunomos, a legal document management system with terminology management (Boella et al. 2012), includes deontic clauses (obligation, prohibition, permission, exception), active and passive roles, violations (crime or tort resulting from violation), and sanctions (sanction resulting from violation), along with a norm identifier (link to the relevant provision in the source document), descriptions, and notes (other information of interest). Legal-URN (Ghanavati 2013), in turn, distinguishes atomic statements that contain an actor, modal verb, clause, cross-reference, precondition, and/or exception. Humphreys et al. (2021) integrated these initiatives, resulting in three (sub)categories:
  • Definitions: regular, include/exclude, by example, by reference,
  • Norm types: obligation, right, permission, power,
  • Meta norms: legal effect, scope, exception, hierarchy, rationale.
The categories mostly overlap with the ones identified in the publications mentioned in Table 8 as well as with other works (e.g., van Doesburg and van Engers 2019; Mowbray et al. 2021; Ghanavati 2013; Amantea et al. 2019; Ghanavati et al. 2014), which distinguish objective, constitutive, deontic, scope, and meta-norms (procedural and contextual).
In our study, we adopted the most widely recognized and utilized categories, structuring them into nine tasks for automated extraction: recognizing definitions, references, quantities, IF-THEN statements, exceptions, scope, hierarchies, deontic clauses, and roles (active and passive). Table 1 outlines these tasks, with further detail available in our annotation guidelines in the supplementary materials.
Table 1
Annotation categories, descriptions, and examples

| Category | Sub-category | Description | Example |
| Definition | Definition | Definitions can take the shape of an 'is-a', 'part-of', or 'is-not'. The terms and text of the definitions were annotated with separate tags | "Res judicata is a legal doctrine that bars the re-litigation (...)." (is-a); "Additional regulatory measures may include economic oversight measures." (part-of); "Practice of the profession of lawyer within the meaning of this Directive shall not include the provision of services, which is covered by Directive 77/249/EEC." (is-not) |
| Reference | Reference | References to documents that can be considered 'law'. A reference to a legal document mentions or points to an identifiable document (e.g., Regulation, Directive, standard, case law) or parts of a document (e.g., Article, Annex, Appendix) | "Article 2"; "Regulation (EU) No 1236/2010"; "Articles 5 to 15 of this Regulation" |
| Quantities | Quantities | Measurements, amounts, percentages, ratios, numbers of items, and any other information that can be expressed numerically | "12 months"; "one Member State" |
| IF-THEN statement | IF-THEN | Logical relationship between a condition and a desired action or outcome | IF-THEN: "It shall be required to obtain a license where one of the requirements is not met." |
| | IF | The IF is the condition or requirement that needs to be met in order to get the consequence THEN | IF: "one of the requirements is not met" |
| | THEN | The THEN is the consequence that results from the condition or requirement to be met | THEN: "It shall be required to obtain a license" |
| Exception | Exception text | A negative condition which, if satisfied, results in the norm not coming into force | "Practice of the profession of lawyer within the meaning of this Directive shall not include the provision of services, which is covered by Directive 77/249/EEC." |
| | Norm deviated from | The norm (text) that is deviated from/that the exception applies to | "The Commission shall not adopt a decision pursuant to Article 17(1) of the Treaty unless it has given the parties concerned the opportunity to be heard." |
| Scope | | Legal statements that determine whether, under which circumstances, or the extent to which a rule is applicable (material scope), when it is applicable (temporal scope), the geographical boundaries or locations to which the norm applies (territorial scope), or to whom the norm is applicable (personal scope) | "This Directive shall apply to unfair business-to-consumer commercial practices, as laid down in Article 5, before, during, and after a commercial transaction in relation to a product." |
| Hierarchy | | The ranking or order of legal norms or provisions based on their authority or importance | "This Article shall be without prejudice to Article 17(3)." |
| Deontic clause | Power | Ability or authority to impose obligations, permissions, or prohibitions on others | "A member state must establish a framework of deterrence and sanctions in order to ensure that EU rules on data protection are enforced effectively." |
| | Right | Grants a person an inherent entitlement to perform or demand certain actions, inactions, or behaviors from others without interference. A right suggests that the entitlement is inherent to an individual's nature or position and does not depend on external factors or conditions. Inherent entitlements are often considered universal and inalienable, meaning they cannot be taken away or revoked | "Any lawyer shall become entitled to pursue on a permanent basis, in any other Member State under his home-country professional title, the activities specified in Article 5." |
| | Obligation | Action or inaction that is permissible and non-optional | "Food businesses are required to ensure that their products are safe for human consumption and comply with relevant EU legislation." |
| | Permission | Action or inaction that is permissible and optional. A permission grants a freedom that can be revoked or restricted at will (as opposed to rights) | "Vendors are allowed to not post labels on their products." |
| | Prohibition | Action or inaction that is not permissible. A prohibition is not optional | "The Commission shall not adopt a decision pursuant to Article 17(1) of the Treaty unless it has given the parties concerned the opportunity to be heard." |
| Role | Active Role | A person (natural or legal), an organization, or a body who/that bears the primary responsibility for fulfilling or violating a deontic statement | "The data subject shall have the right not to be subject to a decision based solely on automated processing, (...)" (data subject) |
| | Passive Role | A person (natural or legal), an organization, or a body who/that is affected by the deontic statement but does not bear the primary responsibility for fulfilling or violating it. The passive role is the correlative role to the active role for a particular deontic clause | "Providers of intermediary services shall provide their legal representative with necessary powers (...)" (legal representatives) |

Note that the term 'is-not' is the opposite of 'part-of' and may therefore also have been labeled 'not-part-of'. We did not encounter situations of 'is-not' definitions where the definition is the opposite of an 'is-a' definition. The annotation guidelines, with extensive descriptions and examples for each category, are available at https://doi.org/10.34894/ZJJIOB

3.2 Data

We sampled EU legislative provisions from the Cellar database, the common data repository of the Publications Office of the European Union. We randomly selected 10,000 EU regulations and, because EU legislation consists of many amendments and repeals, filtered this initial dataset by keeping only regulations with at least ten provisions, which reduces the likelihood of including amendments and repeals. We then randomly extracted provisions from the remaining regulations, with a maximum of one per regulation, until we reached 200 provisions. These provisions served as the data to be annotated in each task. Of these 200 provisions, 20 were randomly sampled to analyze the intra-rater reliability (see below).
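The sampling procedure can be summarized in a short sketch. This is a simplified illustration under stated assumptions: the dictionary mapping regulation identifiers to provisions, the function name, and the seed are hypothetical and not part of the original pipeline, which worked directly on data extracted from Cellar.

```python
import random

def sample_provisions(regulations: dict[str, list[str]],
                      min_provisions: int = 10,
                      target: int = 200,
                      intra_sample: int = 20,
                      seed: int = 42):
    """Sketch of the sampling steps: `regulations` maps a regulation
    identifier to its list of provisions (assumed, pre-extracted input)."""
    rng = random.Random(seed)

    # Keep only regulations with at least `min_provisions` provisions to
    # reduce the likelihood of including amendments and repeals.
    eligible = {reg_id: provs for reg_id, provs in regulations.items()
                if len(provs) >= min_provisions}

    # Draw at most one provision per regulation until `target` is reached.
    reg_ids = rng.sample(list(eligible), k=min(target, len(eligible)))
    sampled = [(reg_id, rng.choice(eligible[reg_id])) for reg_id in reg_ids]

    # A subset of the sampled provisions is annotated twice for the
    # intra-rater reliability analysis.
    duplicates = rng.sample(sampled, k=min(intra_sample, len(sampled)))
    return sampled, duplicates

if __name__ == "__main__":
    # Toy input standing in for the filtered Cellar extract.
    toy = {f"reg-{i}": [f"Article {j}" for j in range(1, 15)] for i in range(500)}
    provisions, repeat_set = sample_provisions(toy)
    print(len(provisions), len(repeat_set))  # 200 20
```

Fixing the random seed would make such a sample reproducible, although the paper does not state whether this was done.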

3.3 Procedure

Nine individuals were recruited, one of whom was involved only in the drafting and refining of the annotation guidelines. The eight annotators, who received remuneration for the hours worked, were third-year law students with elementary training in computational methods and diverse backgrounds in terms of nationality (Australia, Belgium, China, Estonia, Finland, Italy, Norway, Romania). They were split into three groups, with each group assigned to the following tasks:
  • Annotators 1+2: Definition terms, Definition texts, References, Quantities.
  • Annotators 3+4+5: IF-THEN statements, IF statements, THEN statements, Exception texts, Text that exception applies to.
  • Annotators 6+7+8: Deontic clauses, Scope, Hierarchy.
Annotators were assigned specific tasks, also in the training rounds, to reduce the cognitive load of having to pay attention to a large variety of annotation tasks. When the annotators were asked to perform all tasks simultaneously during the first training round, it became clear that the cognitive burden of paying attention to different instructions significantly increased the number of mistakes.
First, we trained the annotators in more than ten iterations. In some training rounds, the annotators were asked to select two or three provisions that, in their opinion, could lead to disagreement. The annotators would subsequently label the selected provisions and comment on the annotations made by the other annotators, particularly in case of disagreement. In the other training rounds, the annotators were commonly presented with three or four statutory provisions (articles) of European regulations at each iteration and were requested to annotate the provisions. The results were discussed, commonly in joint sessions and sometimes in writing. The annotation guidelines were updated until it was concluded that disagreements were no longer the result of differences in interpretation of the guidelines.
After the training rounds, for calculating the inter-rater (and intra-rater) agreement, two or three annotators independently annotated each of the 200 provisions. After providing their annotations, the annotators reevaluated conflicting annotations. In this stage of the annotation process, the annotators learned where they disagreed but were not provided with the annotations of the other annotators and, in the case of the three-annotator groups, were not informed who disagreed with whom. Moreover, it was also not disclosed to the annotators whether, and if so, how, the other annotator(s) would change their annotation.
After completing the 200 provisions, the annotators labeled the 20 randomly selected provisions to analyze the intra-rater reliability. The impression was that the annotators did not recall annotating the same provisions twice.
The final corpus encompassed the labels derived from the independent annotation processes after correction by the annotators. It also includes the agreement scores after an arbiter revision, where one of the researchers corrected the revised labels of the annotators based on misinterpretations or a wrong application of the annotation guidelines. This stage of the study is discussed below.
The labeling and the calculation of the inter-rater reliability were carried out on the Lawnotation platform (https://www.lawnotation.org/).

3.4 Inter-rater reliability (interRR)

We calculated Krippendorff's alpha (Krippendorff 2019) and Fleiss' kappa (Fleiss 1971) for the texts annotated by three annotators, and Cohen's kappa (Cohen 1960, 1968) for pairs of annotators, including for the provisions annotated by only two annotators. As a rule of thumb, values of .81-1.00 (almost perfect) and .61-.80 (substantial) for Cohen's kappa, values above .75 for Fleiss' kappa (Landis and Koch 1977), and values of .80 (reliable) and .67 (tentative) for Krippendorff's alpha (Krippendorff 2019) can be considered substantial agreement. We compared the results of the pairs of annotators and of the three annotators together for each (sub-)task.
Calculating interRR was not straightforward. Annotators could (and frequently did) select texts with different start and end indices. A strict interRR calculation would treat even the inclusion or exclusion of punctuation or a space as a disagreement. In such instances, the annotators would have carried out the task successfully (recognizing and highlighting the obligation or IF-THEN statement), yet the lack of complete agreement would count negatively toward the interRR calculation. We therefore relied on the contained overlap approach, in which an annotation of one annotator that is fully included (contained) in an annotation of another annotator is considered agreement. To test the robustness of this approach, we compared different types of measurement to compute interRR scores; the comparison is described in more detail below. Finally, we decided to exclude non-text annotations (e.g., ". 2")² from the calculations, as these were not substantively meaningful, yet they would inflate the agreement levels.
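To make the contained overlap criterion and a pairwise agreement computation concrete, the following minimal sketch shows one possible implementation. It assumes spans are encoded as (start, end) character offsets and that agreement units carry simple categorical decisions; it is an illustration, not the actual computation performed on the Lawnotation platform.

```python
# Minimal sketch of the agreement machinery described above.
Span = tuple[int, int]  # assumed encoding: (start, end) character offsets

def contained_overlap(a: Span, b: Span) -> bool:
    """Agreement if one annotated span is fully contained in the other."""
    return (b[0] <= a[0] and a[1] <= b[1]) or (a[0] <= b[0] and b[1] <= a[1])

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two aligned sequences of categorical decisions
    (e.g., 'annotated' / 'not annotated' per agreement unit)."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum((labels_a.count(c) / n) * (labels_b.count(c) / n)
                   for c in categories)
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

# Two annotators marking the same obligation with slightly different
# boundaries still count as agreeing under contained overlap.
print(contained_overlap((10, 54), (12, 54)))   # True
# Toy per-unit decisions for one category and two annotators.
a = ["ann", "ann", "none", "ann", "none"]
b = ["ann", "none", "none", "ann", "none"]
print(round(cohens_kappa(a, b), 2))            # 0.62
```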

3.5 Intra-rater reliability (intraRR)

The annotation task was set up so that the annotators labeled 20 of the 200 provisions twice. This made it possible, at least to an extent, to assess the consistency of each annotator. Consequently, the intraRR was calculated for these 20 provisions. We again chose the contained overlap approach for the evaluation, meaning that an annotation from one round that is fully contained in an annotation from the other round counted as agreement. Non-text annotations were again excluded from the calculations. We report Cohen's kappa scores for the intraRR analysis.

4 Results

4.1 Frequencies

Table 2
Number of non-annotations and annotations per annotator for the 200 annotated provisions (values given as initial/after annotator revisions) and the 20 annotated provisions (only initial annotations are available for the latter)

| Category | Subcategory | Non-ann. (200) | A1 (200) | A2 (200) | A3 (200) | Non-ann. (20) | A1 (20) | A2 (20) | A3 (20) |
| Definition | Term | 298/297 | 88/92 | 68/71 | n/a | 26 | 7 | 3 | n/a |
| | Text | 275/278 | 85/91 | 71/75 | n/a | 25 | 8 | 3 | n/a |
| Reference | | 757/747 | 603/594 | 615/618 | n/a | 68 | 59 | 59 | n/a |
| Quantity | | 294/295 | 98/98 | 85/92 | n/a | 28 | 14 | 14 | n/a |
| Exception | Text | 215/214 | 38/39 | 26/37 | 23/37 | 20 | 2 | 2 | 3 |
| | Applies_to | 231/233 | 39/36 | 26/33 | 23/35 | 22 | 2 | 2 | 3 |
| IF-THEN | IF-THEN | 213/207 | 621/689 | 636/637 | 639/650 | 20 | 40 | 42 | 46 |
| | IF | 981/963 | 889/856 | 900/926 | 849/915 | 93 | 90 | 63 | 59 |
| | THEN | 501/452 | 921/872 | 972/1009 | 918/1018 | 29 | 87 | 63 | 56 |
| Scope | | 190/193 | 37/37 | 35/41 | 47/47 | 17 | 4 | 3 | 3 |
| Hierarchy | | 211/200 | 15/18 | 10/10 | 13/11 | 19 | 3 | 1 | 1 |
| Deontic clause | Obligation | 285/286 | 452/453 | 430/433 | 397/400 | 27 | 27 | 31 | 23 |
| | Permission | 245/244 | 77/74 | 62/62 | 74/72 | 21 | 3 | 2 | 2 |
| | Prohibition | 198/199 | 14/15 | 9/13 | 8/12 | 0 | 0 | 0 | 0 |
| | Power | 230/228 | 18/18 | 22/24 | 28/29 | 18 | 1 | 0 | 0 |
| | Right | 209/209 | 13/15 | 16/21 | 18/14 | 21 | 3 | 1 | 3 |
| Role | Active role | 638/629 | 378/392 | 403/416 | 410/410 | 48 | 23 | 24 | 26 |
| | Passive role | 414/431 | 111/150 | 171/181 | 174/196 | 24 | 4 | 8 | 5 |

A1, A2, and A3 denote different annotators for distinct task sets: (Definition+Reference+Quantity), (Exception+IF-THEN), and (Scope+Hierarchy+Deontic clause+Role). Each set has unique annotators. Non-annotations are spans that none of the annotators highlighted
Table 2 shows the number of annotations and non-annotations per task per annotator at each stage. Except for IF-THEN statements (IF-THEN, IF, THEN) and Obligation, the number of non-annotations exceeds the number of annotations. Fewer than 50 annotations were observed for Exception, Scope, Hierarchy, Prohibition, Power, and Right. In contrast, more than 350 annotations were found for the categories Reference, IF-THEN statements (IF-THEN, IF, THEN), Obligation, and Active Role. Fluctuations can also be observed in the number of annotations. For instance, annotator 1 (A1) had (921 - 872 =) 49 fewer annotations after revising their initial annotations for the THEN category, whereas A3 had (1018 - 918 =) 100 more annotations for that category. Finally, low numbers can be observed for several categories when annotating the 20 provisions, including the Exception, Scope, Hierarchy, Permission, Prohibition, Power, and Right categories.

4.2 Inter-rater reliability (interRR)

We first inspected how the agreement levels changed throughout the annotation process. As explained above, the agreements were captured after the initial annotations and after the annotators changed annotations based on the feedback on where they disagreed. Table 3 compares the agreement scores (contained overlap) at the different stages.
Table 3
Inter-rater reliability per task for pairs of annotators at different stages (Init = initial annotations, Rev = annotations after annotator revisions)

| Category | Subcategory | Init Po | Init α | Init κF | Init κ A1-A2 | Init κ A1-A3 | Init κ A2-A3 | Rev Po | Rev α | Rev κF | Rev κ A1-A2 | Rev κ A1-A3 | Rev κ A2-A3 |
| Definition | Term | n/a | n/a | n/a | .62 | n/a | n/a | n/a | n/a | n/a | .70 | n/a | n/a |
| | Text | n/a | n/a | n/a | .67 | n/a | n/a | n/a | n/a | n/a | .72 | n/a | n/a |
| Reference | | n/a | n/a | n/a | .91 | n/a | n/a | n/a | n/a | n/a | .94 | n/a | n/a |
| Quantity | | n/a | n/a | n/a | .85 | n/a | n/a | n/a | n/a | n/a | .89 | n/a | n/a |
| Exception | Text | .92 | .61 | .61 | .64 | .58 | .61 | .95 | .79 | .79 | .78 | .81 | .78 |
| | Applies_to | .93 | .62 | .62 | .57 | .60 | .71 | .94 | .72 | .72 | .71 | .66 | .75 |
| IF-THEN | IF-THEN | .91 | .78 | .78 | .81 | .71 | .79 | .89 | .74 | .74 | .69 | .71 | .80 |
| | IF | .86 | .71 | .71 | .71 | .68 | .69 | .86 | .71 | .71 | .72 | .67 | .72 |
| | THEN | .80 | .59 | .59 | .57 | .56 | .54 | .79 | .57 | .57 | .54 | .47 | .53 |
| Scope | | .94 | .79 | .79 | .80 | .77 | .79 | .95 | .83 | .83 | .85 | .80 | .83 |
| Hierarchy | | .96 | .61 | .61 | .53 | .47 | .86 | .97 | .73 | .73 | .70 | .67 | .85 |
| Deontic clause | Obligation | .85 | .69 | .69 | .69 | .65 | .70 | .87 | .73 | .73 | .73 | .69 | .75 |
| | Permission | .90 | .71 | .71 | .72 | .70 | .70 | .92 | .76 | .76 | .79 | .74 | .74 |
| | Prohibition | .95 | .49 | .49 | .68 | .43 | .33 | .97 | .71 | .71 | .77 | .64 | .70 |
| | Power | .89 | .25 | .25 | .24 | .15 | .34 | .90 | .35 | .35 | .38 | .28 | .39 |
| | Right | .94 | .50 | .50 | .37 | .55 | .56 | .93 | .46 | .46 | .34 | .45 | .60 |
| Role | Active role | .89 | .75 | .75 | .72 | .76 | .76 | .91 | .80 | .80 | .77 | .78 | .82 |
| | Passive role | .84 | .54 | .54 | .46 | .48 | .64 | .86 | .63 | .63 | .60 | .59 | .67 |

Po = observed agreement, α = Krippendorff's alpha, κF = Fleiss' kappa, κ = Cohen's kappa. The table presents the results for annotated chunks with contained overlap and tolerance=0. A1, A2, and A3 denote different annotators for distinct task sets: (Definition+Reference+Quantity), (Exception+IF-THEN), and (Scope+Hierarchy+Deontic clause+Role). Each set has unique annotators. Observed agreement (Po) is reported only in cases with three annotators
Several trends can be observed. First, acceptable agreement levels were reached after the initial annotations for most categories. Second, substantial increases can be observed across all categories (except for IF-THEN statements) at each iteration. Increases between .03 (Reference) and .22 (Prohibition) can be observed from the initial annotations to the 'after annotator revisions' stage. A third trend is that the variation of the agreement scores decreased after the annotators learned where they disagreed and had the possibility to change their annotations (without knowing the annotations of the other annotators). Where Cohen's kappa scores between pairs of annotators initially differed by as much as .35 (.68 and .33 for Prohibition), the largest difference after the annotator revisions was .26 (Right), whereas the difference for Prohibition was reduced to .13. Similarly, the agreement for Power between the annotator pairs went from .24, .15, and .34 to .38, .28, and .39, respectively. Reduced differences in Cohen's kappa scores between pairs of annotators can be observed across all categories except for the IF and THEN categories.

4.3 Arbiter revisions

The agreements were captured at three points: after the initial annotations, after the annotators changed annotations based on the feedback on where they disagreed, and after the arbiter revisions. At the latter stage, an arbiter (one of the authors) reexamined all conflicting annotations and corrected disagreements following the annotation guidelines. The hope was that this process would filter out mistakes and leave only genuine disagreements.
Table 4 shows the number of annotations and non-annotations per task per annotator at the after-arbiter revision stage. When comparing the frequencies to those in Table 2, it becomes apparent that there seems to be a decreasing trend for non-annotations at each iteration (initial/after annotator/after arbiter), although not in every category, for every annotator, or at every stage.
Table 4
Number of non-annotations and annotations (A1, A2, A3) for the 200 annotated provisions. Non-annotations are spans that none of the annotators highlighted

| Category | Subcategory | Non-annotations (n) | A1 (n) | A2 (n) | A3 (n) |
| Definition | Term | 298 | 88 | 92 | 95 |
| | Text | 275 | 85 | 91 | 101 |
| Reference | | 757 | 603 | 594 | 597 |
| Quantity | | 294 | 98 | 98 | 100 |
| Exception | Text | 215 | 38 | 39 | 39 |
| | Applies_to | 231 | 39 | 36 | 38 |
| IF-THEN | IF-THEN | 213 | 621 | 689 | 663 |
| | IF | 981 | 889 | 856 | 908 |
| | THEN | 501 | 921 | 872 | 972 |
| Scope | | 190 | 37 | 37 | 41 |
| Hierarchy | | 211 | 15 | 18 | 18 |
| Deontic clause | Obligation | 289 | 444 | 436 | 435 |
| | Permission | 245 | 83 | 77 | 79 |
| | Prohibition | 197 | 17 | 16 | 16 |
| | Power | 211 | 9 | 16 | 20 |
| | Right | 206 | 13 | 16 | 14 |
| Role | Active role | 610 | 402 | 415 | 409 |
| | Passive role | 409 | 186 | 198 | 204 |
The agreement levels reached near-perfect to perfect values after the arbiter reviewed the revised annotations of the annotators and made changes where deemed necessary. Table 5 presents these results, incorporating the arbiter's revisions alongside the initial and annotator-revised annotations shown in Table 3.
The extremely high agreement scores cast doubt on the task the arbiter was performing. The idea was that the arbiter would correct annotations not made in accordance with the annotation guidelines, with the remaining differences reflecting genuine disagreement rather than error. In practice, however, it seems more likely that the arbiter applied the guidelines according to their own interpretation. The implications of this finding are explored in the Discussion section.
Table 5
Inter-rater reliability per task for pairs of annotators at different stages (Init = initial annotations, Rev = annotations after annotator revisions, Arb = annotations after arbiter revisions)

| Category | Subcategory | Init Po | Init α | Init κF | Init κ A1-A2 | Init κ A1-A3 | Init κ A2-A3 | Rev Po | Rev α | Rev κF | Rev κ A1-A2 | Rev κ A1-A3 | Rev κ A2-A3 | Arb Po | Arb α | Arb κF | Arb κ A1-A2 | Arb κ A1-A3 | Arb κ A2-A3 |
| Definition | Term | n/a | n/a | n/a | .62 | n/a | n/a | n/a | n/a | n/a | .70 | n/a | n/a | n/a | n/a | n/a | .92 | n/a | n/a |
| | Text | n/a | n/a | n/a | .67 | n/a | n/a | n/a | n/a | n/a | .72 | n/a | n/a | n/a | n/a | n/a | .96 | n/a | n/a |
| Reference | | n/a | n/a | n/a | .91 | n/a | n/a | n/a | n/a | n/a | .94 | n/a | n/a | n/a | n/a | n/a | .99 | n/a | n/a |
| Quantity | | n/a | n/a | n/a | .85 | n/a | n/a | n/a | n/a | n/a | .89 | n/a | n/a | n/a | n/a | n/a | .99 | n/a | n/a |
| Exception | Text | .92 | .61 | .61 | .64 | .58 | .61 | .95 | .79 | .79 | .78 | .81 | .78 | .97 | .89 | .89 | .87 | .94 | .90 |
| | Applies_to | .93 | .62 | .62 | .57 | .60 | .71 | .94 | .72 | .72 | .71 | .66 | .75 | .96 | .83 | .83 | .80 | .82 | .86 |
| IF-THEN | IF-THEN | .91 | .78 | .78 | .81 | .71 | .79 | .89 | .74 | .74 | .69 | .71 | .80 | .92 | .80 | .80 | .75 | .79 | .84 |
| | IF | .86 | .71 | .71 | .71 | .68 | .69 | .86 | .71 | .71 | .72 | .67 | .72 | .92 | .84 | .84 | .86 | .84 | .84 |
| | THEN | .80 | .59 | .59 | .57 | .56 | .54 | .79 | .57 | .57 | .54 | .47 | .53 | .87 | .73 | .73 | .72 | .70 | .68 |
| Scope | | .94 | .79 | .79 | .80 | .77 | .79 | .95 | .83 | .83 | .85 | .80 | .83 | .99 | .96 | .96 | .96 | .94 | .99 |
| Hierarchy | | .96 | .61 | .61 | .53 | .47 | .86 | .97 | .73 | .73 | .70 | .67 | .85 | .99 | .93 | .93 | .90 | .90 | 1.00 |
| Deontic clause | Obligation | .85 | .69 | .69 | .69 | .65 | .70 | .87 | .73 | .73 | .73 | .69 | .75 | .94 | .89 | .89 | .88 | .88 | .93 |
| | Permission | .90 | .71 | .71 | .72 | .70 | .70 | .92 | .76 | .76 | .79 | .74 | .74 | .97 | .91 | .91 | .92 | .90 | .93 |
| | Prohibition | .95 | .49 | .49 | .68 | .43 | .33 | .97 | .71 | .71 | .77 | .64 | .70 | 1.00 | .98 | .98 | .97 | .97 | 1.00 |
| | Power | .89 | .25 | .25 | .24 | .15 | .34 | .90 | .35 | .35 | .38 | .28 | .39 | .96 | .64 | .64 | .70 | .53 | .70 |
| | Right | .94 | .50 | .50 | .37 | .55 | .56 | .93 | .46 | .46 | .34 | .45 | .60 | .95 | .59 | .59 | .45 | .47 | .69 |
| Role | Active role | .89 | .75 | .75 | .72 | .76 | .76 | .91 | .80 | .80 | .77 | .78 | .82 | .94 | .88 | .88 | .86 | .89 | .93 |
| | Passive role | .84 | .54 | .54 | .46 | .48 | .64 | .86 | .63 | .63 | .60 | .59 | .67 | .93 | .82 | .82 | .81 | .83 | .85 |

Po = observed agreement, α = Krippendorff's alpha, κF = Fleiss' kappa, κ = Cohen's kappa. The table presents the results for annotated spans with contained overlap and tolerance=0. Observed agreement (Po) is reported only in cases with three annotators

4.4 Approaches for calculating InterRR

So far, we have presented the results of the contained overlap approach, where one annotation is fully contained within another. An advantage of the contained overlap approach is that it does not penalize situations where annotators select different text spans that are essentially the same. However, the differences that this approach treats as agreement may also include substantively relevant differences. It can, therefore, be useful to compare the results of different approaches for calculating interRR.
We propose different ways of calculating the interRR to obtain more insight into the agreement levels. First, we propose to contrast (dis)agreement on the span level (i.e., a span that commonly includes multiple words) with (dis)agreement on the word level. ‘Member States shall...’ is an example of an annotation (or non-annotation) on the span level. In contrast, annotations on the word level would treat ‘Member’, ‘States’, ‘shall’, and ‘...’ as separate annotations. Inter-rater agreement scores are sensitive to disagreements in provisions with large annotations or non-annotations, considering that large spans only count as one annotation. With relatively few annotations consisting of large spans, each disagreement significantly affects the agreement scores, even when the ‘true’ agreement is higher. Annotations on the word level do not face this problem, yet have the downside that each (non-)annotation is treated the same, regardless of whether the (non-)annotations are deemed relevant or trivial. Second, we propose to inspect two ways of overlap: strict match, where annotations must match exactly, and contained overlap, where the annotation of one annotator is fully included (contained).
To illustrate the proposed ways of calculating interRR, Table 6 presents the results after the arbiter revisions. For the strict match, we decided on a tolerance of 10 (i.e., a difference of ten characters or less is still considered agreement) for span annotations, to prevent rather small differences in the number of annotated characters from being counted as disagreement. Consequently, we report the results for (1) span-based and word-based (dis)agreement separately, (2) contained and equal overlap (tolerance=10 with equal overlap), and (3) all annotators together and each pair of annotators.
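The additional counting schemes can be sketched in a few lines. The tokenization, the interpretation of the ten-character tolerance, and the example spans below are assumptions made for illustration; they may differ from the implementation used on the Lawnotation platform.

```python
import re

Span = tuple[int, int]  # assumed encoding: (start, end) character offsets

def equal_overlap(a: Span, b: Span, tolerance: int = 10) -> bool:
    """Strict match: boundaries must coincide, allowing a small character slack
    (one possible reading of the ten-character tolerance)."""
    return abs(a[0] - b[0]) <= tolerance and abs(a[1] - b[1]) <= tolerance

def word_level_labels(text: str, spans: list[Span]) -> list[int]:
    """Mark every word of the provision as annotated (1) or not (0), so that
    each word counts as a separate (non-)annotation."""
    labels = []
    for match in re.finditer(r"\S+", text):
        start, end = match.span()
        labels.append(int(any(start >= s and end <= e for s, e in spans)))
    return labels

provision = "Member States shall establish a framework of deterrence and sanctions."
annotator_1 = [(0, 13)]   # "Member States"
annotator_2 = [(0, 31)]   # "Member States shall establish a"

# Under contained overlap (Sect. 3.4) these spans would agree, since
# (0, 13) lies inside (0, 31); under equal overlap with tolerance=10 they
# disagree, because the end boundaries differ by 18 characters.
print(equal_overlap(annotator_1[0], annotator_2[0]))   # False
print(word_level_labels(provision, annotator_1))        # [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
print(word_level_labels(provision, annotator_2))        # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
```

Counting every word as its own (non-)annotation reduces the weight of any single long span, which is exactly the sensitivity issue described above.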
The word-based approach produced results similar to the span-based approach, except for the IF-THEN (sub)category (lower agreement for the word-based than for the span-based approach). Furthermore, slight differences can be observed between the contained and the equal overlap (tolerance=10) approaches, with the latter producing slightly lower agreement. Only the results for the Exception category and the IF-THEN (sub)category are somewhat worse under the equal overlap approach than under the contained overlap approach.
Table 6
Inter-rater reliability per task for pairs of annotators after the arbiter revisions (CC = chunks, contained overlap, tolerance=0; CE = chunks, equal overlap, tolerance=10; WC = words, contained overlap, tolerance=10)

| Category | Subcategory | CC Po | CC α | CC κF | CC κ A1-A2 | CC κ A1-A3 | CC κ A2-A3 | CE Po | CE α | CE κF | CE κ A1-A2 | CE κ A1-A3 | CE κ A2-A3 | WC Po | WC α | WC κF | WC κ A1-A2 | WC κ A1-A3 | WC κ A2-A3 |
| Definition | Term | n/a | n/a | n/a | .92 | n/a | n/a | n/a | n/a | n/a | .83 | n/a | n/a | n/a | n/a | n/a | .85 | n/a | n/a |
| | Text | n/a | n/a | n/a | .96 | n/a | n/a | n/a | n/a | n/a | .86 | n/a | n/a | n/a | n/a | n/a | .92 | n/a | n/a |
| Reference | | n/a | n/a | n/a | .99 | n/a | n/a | n/a | n/a | n/a | .97 | n/a | n/a | n/a | n/a | n/a | .98 | n/a | n/a |
| Quantity | | n/a | n/a | n/a | .99 | n/a | n/a | n/a | n/a | n/a | .95 | n/a | n/a | n/a | n/a | n/a | .98 | n/a | n/a |
| Exception | Text | .97 | .89 | .89 | .87 | .94 | .90 | .93 | .72 | .72 | .66 | .73 | .75 | .99 | .89 | .89 | .88 | .93 | .88 |
| | Applies_to | .96 | .83 | .83 | .80 | .82 | .86 | .94 | .74 | .74 | .68 | .76 | .77 | .99 | .85 | .85 | .83 | .85 | .87 |
| IF-THEN | IF-THEN | .92 | .80 | .80 | .75 | .79 | .84 | .84 | .64 | .64 | .57 | .63 | .67 | .88 | .57 | .57 | .48 | .48 | .59 |
| | IF | .92 | .84 | .84 | .86 | .84 | .84 | .89 | .78 | .78 | .78 | .77 | .79 | .92 | .82 | .82 | .82 | .79 | .83 |
| | THEN | .87 | .73 | .73 | .72 | .70 | .68 | .81 | .61 | .61 | .59 | .59 | .59 | .85 | .62 | .62 | .54 | .49 | .55 |
| Scope | | .99 | .96 | .96 | .96 | .94 | .99 | .97 | .90 | .90 | .93 | .86 | .93 | .99 | .92 | .92 | .90 | .93 | .93 |
| Hierarchy | | .99 | .93 | .93 | .90 | .90 | 1.00 | .98 | .89 | .89 | .83 | .83 | 1.00 | 1.00 | .89 | .89 | .84 | .84 | 1.00 |
| Deontic clause | Obligation | .94 | .89 | .89 | .88 | .88 | .93 | .92 | .83 | .83 | .85 | .81 | .84 | .92 | .83 | .83 | .85 | .82 | .80 |
| | Permission | .97 | .91 | .91 | .92 | .90 | .93 | .94 | .84 | .84 | .80 | .86 | .85 | .98 | .91 | .91 | .91 | .92 | .92 |
| | Prohibition | 1.00 | .98 | .98 | .97 | .97 | 1.00 | 1.00 | .98 | .98 | .97 | .97 | 1.00 | 1.00 | .97 | .97 | .95 | .95 | 1.00 |
| | Power | .96 | .64 | .64 | .70 | .53 | .70 | .95 | .55 | .55 | .62 | .53 | .52 | .99 | .74 | .74 | .75 | .70 | .76 |
| | Right | .95 | .59 | .59 | .45 | .47 | .69 | .94 | .49 | .49 | .52 | .39 | .55 | .99 | .61 | .61 | .60 | .49 | .74 |
| Role | Active role | .94 | .88 | .88 | .86 | .89 | .93 | .95 | .89 | .89 | .89 | .89 | .90 | .99 | .94 | .94 | .94 | .94 | .93 |
| | Passive role | .93 | .82 | .82 | .81 | .83 | .85 | .91 | .78 | .78 | .76 | .76 | .82 | .99 | .87 | .87 | .85 | .85 | .89 |

Po = observed agreement, α = Krippendorff's alpha, κF = Fleiss' kappa, κ = Cohen's kappa. Observed agreement (Po) is reported only in cases with three annotators

4.5 Disagreement analysis

Disagreements regarding the annotations after the arbiter revisions were inspected to identify provisions that remain ambiguous. We report our observations for each category below.
Definitions. Disagreement for definitions mostly occurred with ‘part-of’ definitions. The word ‘including’ can signal the presence of such a definition, but it is not necessarily a part-of definition. For instance, in the sentence:
This Regulation lays down rules for the implementation of Regulation (EC) No 882/2004 and Decision 2009/470/EC as regards the arrangements for the granting of Union financial aid provided for in Article 32(7) of Regulation (EC) No 882/2004 and Article 31(1) of Decision 2009/470/EC for the activities of EU reference laboratories (‘laboratories’) other than the Joint Research Centre, including for the organisation of workshops, and the conditions according to which that aid is granted.
it may be argued that ‘organization of workshops’ is a part-of definition of the definition term ‘the activities of EU reference laboratories’. Similarly, in the statement:
The monitoring programme may include serological methods as an additional tool once a suitable test is validated by the EU reference laboratory.
the question arises whether 'serological methods' is a part-of definition of 'monitoring programme'.
‘Is-a’ definitions rarely resulted in disagreement. In:
standard import value equal to the weighted average of the representative prices referred to in Article 134
it is debatable whether ‘the weighted average of the representative prices referred to in Article 134’ is a definition of ‘standard import value’ or a means of calculation. Sometimes, disagreement consisted of annotators annotating different parts of the text:
(k) any other cost related parameters where the infrastructure manager can demonstrate to the regulatory body that values for each such parameter, including variation to each such parameter where relevant, are objectively measured and recorded.
In this example, one annotator highlighted ‘any other cost related parameters’ as a definition term, whereas the other labeled ‘values for each such parameter’.
References. In the annotation guidelines, references were defined as 'References to documents that can be considered "law"'. Standards (e.g., 'ISO/IEC 15415:2011', 'UN/CEFACT P1000-3') were considered law by one annotator, who consequently annotated them, but not by the other. Other differences in annotations for references consisted of annotators labeling different strings. For instance, one annotator highlighted 'Regulations (EU) 2016/248 and (EC) No 657/2008' as one string, whereas the other annotator highlighted '(EU) 2016/248' and '(EC) No 657/2008' separately; the question here is whether it is necessary to include 'Regulation' in the annotation.
Quantity. The quantity for which disagreement remained concerned the text 'EUR 5/100 kg'. One annotator labeled the complete span, whereas the other labeled 'EUR 5' and '100 kg' separately.
IF-THEN, IF, THEN. Annotators frequently identified different IFs and THENs in an IF-THEN statement. For instance, in:
When an invoice relates to a group of transactions, the Agency may attribute any under-payment to any of the relevant transactions
one IF-THEN combination can be IF ‘When an invoice relates to a group of transactions’ THEN ‘the Agency may attribute any under-payment to any of the relevant transactions’. Alternatively, or additionally, it is possible to consider IF ‘the Agency’ THEN ‘may attribute any under-payment to any of the relevant transactions When an invoice relates to a group of transactions’.
Some disagreement consisted of annotators separating (or not) potentially multiple IF-THEN statements.
The Commission shall assess the application and, within 9 months from receipt of a complete application, it shall approve the innovative technology as an eco-innovation together with the testing methodology.
Here, one can argue that ‘The Commission shall assess the application’ and ‘it shall approve the innovative technology as an eco-innovation together with the testing methodology’ are one IF-THEN statement (IF the Commission THEN shall assess... and... shall approve...) or separate IF-THEN statements (IF the Commission THEN shall assess + IF it [the Commission] THEN shall approve...).
It also occurred that some annotators labeled an exclusion as a separate IF-THEN statement, with separate IFs and THENs, whereas other annotators did not.
A request for the change of the name or address of the proprietor of a registered EU trade mark pursuant to Article 55(1) of Regulation (EU) 2017/1001 shall contain: (...) (b) the name and the address of the proprietor of the EU trade mark as recorded in the Register, unless an identification number has already been given by the Office to the proprietor, in which case it shall be sufficient for the applicant to indicate that number and the proprietor’s name;
The exception (signaled by ‘unless’) could be construed as an IF-THEN statement: IF an identification number has already been given by the Office to the proprietor THEN it shall be sufficient for the applicant to indicate that number and the proprietor’s name. It is also possible to include the exception as part of the THEN statement ‘shall contain the name and the address (...)’ that connects to the IF ‘A request for the change of the name or address (...)’.
The annotators were instructed, when in doubt, to consider whether an IF-THEN statement could be relevant for compliance checking. During the training stage, this instruction resulted in more agreement, yet it also led to some disagreement.
The draft shall be accessible and editable by the complainant party prior to submission of the final fully completed electronic complaint form.
One annotator labeled ‘prior to submission of the final fully completed electronic complaint form’ as an IF and ‘The draft shall be accessible and editable by the complainant party’ as a THEN. It may be argued that the IF statement is relevant for compliance-checking purposes and requires the labeling of a separate IF statement, but the opposite may also hold true.
What text belongs to the IF and THEN also led to disagreement.
Without prejudice to Regulation (EC) No 2111/2005, following receipt of the supplementary report referred to in paragraph 3(b), the Commission may take any of the following steps: (...)
Here, one may choose to treat ‘Without prejudice to Regulation (EC) No 2111/2005’ as part of the IF statement (IF ‘without prejudice (...)’ THEN ‘the Commission may (...)’) or as part of the THEN statement (IF ‘the Commission’ THEN ‘without prejudice (...), may take any of the following steps: (...)’). Or somewhat similarly:
The Agency shall levy the annual fees provided for in Annex III for every biocidal product or biocidal product family authorised by the Union
In this provision, ‘for every biocidal product or biocidal product family authorised by the Union’ may be treated as separate IF next to ‘The Agency’ or as part of the THEN statement (’shall levy the annual fees provided for in Annex III’).
Exceptions. Disagreement was observed on whether a statement could be considered an exception.
existing stocks which have been produced and labelled before 12 November 2013 in accordance with the rules applicable before 12 November 2013 may continue to be placed on the market and used until they are exhausted
This provision reads as if a default rule about placing stocks on the market might exist. Because the annotators could not access other provisions than the one they were asked to annotate, they could not check whether such a default rule existed. As a result, some annotators labeled the statement as an exception, whereas others did not. Even when a potential default rule could be found, there was disagreement regarding whether there was a default rule - exception combination. In the example:
Where, in order to execute the request, the requested authority is required to disclose the fact that the requesting authority has made a request, the requested authority shall disclose it after having discussed the nature and extent of the disclosure required with the requesting authority and after having obtained its consent to such disclosure. Where the requesting authority does not provide its consent to the disclosure, the requested authority shall not act upon the request, and the requesting authority may withdraw or suspend its request until it is able to provide such consent to disclosure.
some annotators considered ‘the requested authority shall not act upon the request (...)’ as an exception to ‘the requested authority shall disclose (...)’, whereas others did not label this as an exception. The same could be observed in the following example:
This Regulation shall enter into force on the twentieth day following that of its publication in the Official Journal of the European Union.
However, Article 3(1) shall apply as of 1 July 2017 to natural persons undertaking one or more of the activities provided for in Article 2(1) with regards to refrigeration units of refrigerated trucks and trailers
The ‘however’ suggests a deviation from the main rule. On the other hand, legislation ‘entering into force’ and ‘applying’ can be different events.
In some instances, disagreement resulted from the fact that annotators did or did not include certain parts of the provision.
Means of communication and publicity measures referred to in Article 12 of Delegated Regulation (EU) 2017/40, as well as educational materials and tools to be used within the accompanying educational measures, shall exhibit the European flag and mention the ‘School scheme’ and unless the size of the materials and tools exclude this, the financial contribution of the Union
Here, ‘unless the size of the materials and tools exclude this’ is the exception that applies to ‘the financial contribution of the Union’. Some annotators, however, also labeled ‘shall exhibit’ as part of the text that the exception applied to. Also, some annotators decided to mark the referenced text, whereas others did not.
Member States may provide for a shorter notification period than laid down in paragraph 1
This provision provides for an exception to the rule stipulated in paragraph 1 of the same provision. Some annotators highlighted ‘paragraph 1’ as the rule the exception applied to, whereas others decided to label the actual text in paragraph 1 instead of the reference to paragraph 1.
Scope, Hierarchy. A disagreement for hierarchy was observed in a situation where the statement can be considered either a mere exception or a ranking or order of legal norms or provisions based on their authority or importance (hierarchy):
Participating insurance and reinsurance undertakings, insurance holding companies and mixed financial holding companies shall submit quarterly, unless the scope or the frequency of the reporting is limited in accordance with the second subparagraph of Article 254(2) of Directive 2009/138/EC (...).
It also occurred that a statement was labeled as scope by one annotator and as a hierarchy by another.
Where one and the same service is intended for both private use, including use by the customer’s staff, and business use, the supply of that service shall be covered exclusively by Article 44 of Directive 2006/112/EC, provided there is no abusive practice.
In this example, the disagreement might be explained by the ‘provided there is no abusive practice’, which may or may not suggest a hierarchy.
Deontic Clauses. A common disagreement concerned distinguishing between obligations and powers. This frequently involved situations where the ‘Member States’ are the active role and are urged to do something (e.g., establish), where the ‘doing’ can be interpreted as an obligation or a power. For instance:
Member States shall establish a legal and administrative framework to ensure that (...).
Similarly, disagreements occurred between permission and right.
Inspectors shall be empowered to contact the trial subjects directly, in particular in case of reasonable suspicion that they were not informed adequately of their participation in the clinical trial.
The statement was considered an inherent entitlement (right) or a freedom that can be revoked at will (permission).
It also happened that one annotator labeled a statement as permission, whereas the other labeled it as a power,
Member States may decide not to apply Articles 1 to 11 to framework agreements concluded before 15 March 2003.
or
Administrative measures developed at national level for the management of the existing schemes may continue to apply as long as they comply with this Regulation
or where there was disagreement amongst the annotators as to whether to label the statement as a prohibition or a right:
Where a competent authority has granted exclusive rights to the railway undertaking performing a public service contract in accordance with Article 3 of Regulation (EC) No 1370/2007, the existence of such rights shall not preclude access being granted to an applicant
Disagreement could also concern situations where it is debatable whether a statement included an action or simply consisted of an IF-THEN statement without a permissible and non-optional action (obligation):
A deadline for payment shall be considered to have been observed only if the full amount of the fee has been paid in due time.
or
Only activities linked to the closure of the programme may be carried out between 1 January 2023 and 30 September 2024.
Furthermore, nesting was another cause of disagreement:
The closure of the programme shall not prejudice the Commission’s right to undertake, at a later stage, financial corrections vis-à-vis the Managing Authority or the beneficiaries if the final amount of the programme or the projects has to be readjusted as a result of controls or audits carried out after the closure date.
In this context, the right to undertake financial corrections is embedded within the prohibition against (’shall not’) prejudicing that right. Annotators disagreed on whether to label the right to undertake financial corrections as an independent right or as a component of the overarching prohibition.
Active Role, Passive Role. More disagreement was observed for passive roles than for active roles. The disagreement mostly concerned whether the deontic statement would affect a person or an entity that does not bear the primary responsibility for fulfilling an obligation or violating a prohibition. For instance, in the example:
The notification of a selected major project by the managing authority to the Commission in accordance with (...).
the question arises whether the Commission is ‘affected’ by the notification of the managing authority. Another example concerns:
When the number of aid applicants in a Member State is less than 100, on-the-spot checks shall be carried out on the premises of not less than five applicants
Here, ‘five applicants’ may be considered affected by the on-the-spot checks, yet the applicants are unidentified persons or entities.
In some instances, the disagreement was caused by the annotators disagreeing on whether there was a deontic clause or the type of deontic clause. The presence and type of clause can impact the active and passive role’s presence (or absence). Finally, annotators labeling the same entity in different spans resulted in disagreement.
In the event that the Member States and Commission have not submitted objections in respect of the draft decision of the initiating Member State, that State shall adopt (...).
In this example, labeling ‘Member State’ or ‘that State’ substantively entails the same yet results in disagreement. Similarly, it was sometimes debatable whether strings should be labeled as having single or multiple roles.
(...) the master of a fishing vessel or his representative (...)

4.6 Intra-rater reliability (intraRR)

Annotators labeling the same 20 provisions twice allowed for calculating the consistency within each annotator for the various tasks and labels. It becomes apparent from Table 7 that the intraRR ranged from exceptionally low (e.g., Hierarchy for A2 and A3) to perfect agreement (e.g., Scope for A2 and A3). Reference and Quantity are the only categories where the kappa values did not fall below .69, followed by Right (>= .62), Active Role (>= .58), and Scope (>= .56).
The results indicate that consistency varied depending on the annotator and the task. Only Definition, Reference, Quantity, Scope, and Right were associated with moderate to high agreement levels for all annotators. When focusing on individual annotators, the results reveal that no annotator was inconsistent across all of their assigned tasks. For instance, A3, who showed low and extremely low reliability for Passive Role and Hierarchy, respectively, had moderate to almost perfect agreement scores for most other categories. Also, the A3 who displayed very low agreement scores for IF-THEN, IF, and THEN had rather high agreement levels for Exception Text (note that A3 for this task is not the same annotator as A3 for Scope, Hierarchy, etc.).
Table 7
Intra-rater reliability per annotator per task (contained overlap, tolerance=0)

| Category | Subcategory | A1 Po | A1 κ | A2 Po | A2 κ | A3 Po | A3 κ |
| Definition | Term | .94 | .82 | .90 | .62 | n/a | n/a |
| | Text | .97 | .91 | .90 | .62 | n/a | n/a |
| Reference | | .89 | .77 | .95 | .91 | n/a | n/a |
| Quantity | | .87 | .69 | .91 | .79 | n/a | n/a |
| Exception | Text | .91 | .45 | .91 | .45 | .96 | .78 |
| | Applies_to | 1.00 | 1.00 | .88 | .34 | .88 | .34 |
| IF-THEN | IF-THEN | .70 | .39 | .76 | .51 | .46 | -.08 |
| | IF | .72 | .43 | .73 | .45 | .50 | -.07 |
| | THEN | .67 | .34 | .74 | .47 | .62 | .23 |
| Scope | | .82 | .56 | 1.00 | 1.00 | 1.00 | 1.00 |
| Hierarchy | | .88 | .33 | .90 | -.05 | .90 | -.05 |
| Deontic clause | Obligation | .70 | .39 | .77 | .54 | .79 | .56 |
| | Permission | .96 | .78 | .91 | .00 | .91 | .45 |
| | Prohibition | n/a | n/a | n/a | n/a | n/a | n/a |
| | Power | .90 | -.05 | .94 | .00 | 1.00 | n/a |
| | Right | .92 | .62 | 1.00 | 1.00 | 1.00 | 1.00 |
| Role | Active role | .82 | .58 | .88 | .73 | .97 | .94 |
| | Passive role | .87 | .43 | .83 | .52 | .80 | .28 |

Notes: Po = observed agreement, κ = Cohen's kappa. A1, A2, and A3 denote different annotators for distinct task sets: (Definition+Reference+Quantity), (Exception+IF-THEN), and (Scope+Hierarchy+Deontic clause+Role). Each set has unique annotators

5 Limitations

Our study is limited to the annotation of EU legislative provisions. Even though we carefully selected the provisions to include a wide array of EU regulations, it cannot be excluded that provisions outside the research sample pose challenges and difficulties not encountered here. Additionally, it is uncertain whether the results can be generalized beyond EU legislation.
Moreover, the recruitment of student annotators may have yielded sub-optimal results. Even though the annotators were trained extensively, annotators with more legal expertise might have made different annotations, which could have resulted in either higher agreement or more genuine disagreement. The challenge would, however, be whom to recruit. Annotators with the required expertise to annotate statutory provisions must meet at least the following four conditions. First, the annotators should have legal expertise, meaning they should be familiar with the formulation and interpretation of the law, including the nuances that come with it. Second, they need to be familiar with the tested concepts, such as deontic modalities, the different types of definitions, and logical conditions, both theoretically and empirically. Third, the annotators need to agree on and become familiar with the annotation guidelines. Fourth, they need to be available. Potential annotators who meet all these requirements are rare. Each annotator in this project spent between 4 and 8 h per week for three months on training and then an additional 30-50 h on annotating the provisions.
Finally, the low agreement scores for the 'power' category and, to a lesser extent, the 'right' category suggest that the guidelines could be further improved specifically for these categories.

6 Discussion

Previous research reported Krippendorff's alpha and Cohen's kappa values not exceeding .74 and .82, respectively. Our study finds different agreement levels depending on the stage at which agreement is measured, the task at hand, the method of measurement, and the combination of annotators. The after-annotator-revision agreement scores, and even most of the initial annotation agreement scores, are comparable to those reported in previous research.
The findings underscore the importance and impact of annotation processes. First, this research introduces different ways to calculate agreement in text labeling tasks, which allows these measures to be compared and the robustness of the inter-rater and intra-rater agreement scores to be tested. Furthermore, this study finds that although acceptable agreement levels were reached for most categories after the initial annotations, agreement levels increased substantially across all categories after the revision stage.
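To make the span-based measurement concrete, the sketch below shows one plausible way a "contained overlap" check with a character tolerance could be implemented for two annotators' spans. The span representation, function names, and matching semantics are illustrative assumptions, not the exact procedure used in this study.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets of an annotated span

def contained(a: Span, b: Span, tolerance: int = 0) -> bool:
    """True if span `a` lies within span `b`, allowing its boundaries to
    extend beyond `b` by at most `tolerance` characters (assumption)."""
    return a[0] >= b[0] - tolerance and a[1] <= b[1] + tolerance

def contained_overlap_matches(spans_a: List[Span], spans_b: List[Span],
                              tolerance: int = 0) -> int:
    """Count spans of annotator A that contain, or are contained in,
    some span of annotator B, within the given tolerance."""
    matched = 0
    for a in spans_a:
        if any(contained(a, b, tolerance) or contained(b, a, tolerance)
               for b in spans_b):
            matched += 1
    return matched

# Hypothetical example: two annotators marking Definition spans
ann1 = [(10, 42), (100, 130)]
ann2 = [(12, 42), (210, 250)]
print(contained_overlap_matches(ann1, ann2, tolerance=0))  # -> 1
```

A stricter or looser notion of overlap (e.g., exact boundaries, token-level overlap) can be swapped in at the `contained` function, which is what makes comparing several matching strategies straightforward.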
Additionally, the agreement scores between pairs of annotators became more consistent when annotators were allowed to change their annotations after learning where they disagreed. In this study, the annotators only learned that at least two annotators disagreed on an instance, not which labels the other annotators had assigned or (in the case of the 3-annotator group) which annotator (dis)agreed with whom. Moreover, the annotators were instructed not to consult each other when reconsidering their annotations, to avoid anchoring or order effects and dominant group member effects. The increases suggest that exposing annotators to the locations of disagreement and having them reconsider their annotations leads to more consistency and improved agreement levels while preserving genuine disagreement.
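As an illustration of this kind of feedback, the minimal sketch below flags only the positions where at least two annotators diverge, without revealing which labels were assigned or by whom. The function name and data layout are hypothetical.

```python
from typing import Dict, List, Sequence

def items_to_revisit(labels: Dict[str, Sequence[str]]) -> List[int]:
    """Given each annotator's label per item (same item order for everyone),
    return the indices where at least two annotators disagree.
    Only the positions are returned, not who labelled what."""
    annotations = list(labels.values())
    n_items = len(annotations[0])
    return [i for i in range(n_items)
            if len({ann[i] for ann in annotations}) > 1]

# Hypothetical labels for five provisions
labels = {
    "A1": ["Obligation", "Permission", "Obligation", "Right", "Obligation"],
    "A2": ["Obligation", "Obligation", "Obligation", "Right", "Permission"],
    "A3": ["Obligation", "Permission", "Obligation", "Power", "Obligation"],
}
print(items_to_revisit(labels))  # -> [1, 3, 4]
```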
Pointing out disagreement and including revision rounds also benefits an arbiter. Seeing in which instances annotators still disagree after reconsidering their annotations provides an arbiter with valuable information about whether to change annotations. Comparing the already revised annotations may also assist the arbiter in deciding which solution is preferred. Nevertheless, as the results suggest, an arbiter revision can override genuine disagreement as well as mistakes, since an arbiter who does not discuss the instances with the annotators may apply their own interpretation of the guidelines rather than, or in addition to, correcting mistakes.
The differences found at each stage of annotation (initial / after annotator revisions / after arbiter revisions), and the fact that disagreements seem inevitable, at least for certain categories, raise questions about the validity of the results produced by previous research that developed and tested models to automatically extract information from legislative provisions, EU provisions in particular. Although these models may have been reliable in the sense that repeated training is likely to yield similar results, their measurement validity might have improved if multiple annotators had been recruited and revision rounds had been implemented. In particular, studies that rely on a single expert to train a model might have created a model that reliably replicates what that particular expert did, but not necessarily what other experts or non-experts would do.
Disagreement was found not only between annotators but also within annotators, and this disagreement was quite substantial for some categories. It is unlikely that this resulted from annotators performing badly because they did not understand the annotation guidelines or did not take the task seriously; if that were the case, low inter-annotator agreement scores would also be expected. However, annotators who showed low intra-annotator agreement for some categories (e.g., Power, Hierarchy, IF-THEN, IF, THEN, Passive Role) had, except for Power, moderate to substantial interRR levels. For instance, A3, for whom intraRR kappa scores of -.08, -.07, and .23 were found for IF-THEN, IF, and THEN statements, had initial interRR kappa values of .71, .68, and .56 (A3-A1) and .79, .69, and .54 (A3-A2), respectively, with the other annotators. The conclusion, therefore, seems to be that annotators found multiple solutions justifiable. The combination of intraRR and interRR scores suggests that genuine disagreement is frequently observed when annotating legislative provisions, EU provisions in particular.
A reverse pattern was sometimes also observed. The Right category, compared to other categories, produced fair to moderate interRR but moderate to perfect intraRR. The finding that some categories, the Right category in particular, can be measured reliably within individual annotators but less reliably between annotators may point to different interpretations of what constitutes a right.
Since disagreement is common and sometimes desirable, the quality of models trained on annotated data may benefit from incorporating disagreement into model development. As mentioned by Braun (2023), solutions that capture disagreement may be considered, such as ensemble models, which combine multiple individual models (Akhtar et al. 2020). Another proposed solution is to leverage the confidence scores of annotators to weight the annotations (Ramponi and Leonardelli 2022), which will be explored in future research.
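As a rough illustration of what such confidence weighting could look like (the approach of Ramponi and Leonardelli (2022) may differ), the sketch below turns several (label, confidence) votes for one provision into normalised soft labels instead of a single gold label; all names and values are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def weighted_label(votes: List[Tuple[str, float]]) -> Dict[str, float]:
    """Aggregate (label, confidence) pairs from several annotators into
    normalised soft label weights rather than one hard label."""
    totals: Dict[str, float] = defaultdict(float)
    for label, confidence in votes:
        totals[label] += confidence
    norm = sum(totals.values())
    return {label: weight / norm for label, weight in totals.items()}

# Hypothetical annotations of one provision with self-reported confidence
votes = [("Obligation", 0.9), ("Obligation", 0.6), ("Permission", 0.8)]
print(weighted_label(votes))
# -> {'Obligation': 0.652..., 'Permission': 0.347...}
```

A model trained on such soft labels retains the information that annotators genuinely disagreed, instead of discarding it during adjudication.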
In conclusion, discussion-based revisions, whether by the annotators, an arbiter, or a combination of both, seem to be the most promising approach to reducing disagreement caused by mistakes while preserving genuine disagreement. The extensive effort that goes into discussions, the training of annotators, and the carrying out of the annotation task suggests that it may not be realistic to expect data sets of around 10,000 annotations for training a model. Instead, consideration should be given to developing and testing low-resource models that require relatively few training examples (Minkova et al. 2023).
Finally, low agreement scores were observed for Power and Right compared to the other categories, even after extensive training and discussions. The error analysis revealed that annotators found it difficult to distinguish these categories from other categories, particularly obligations and permissions. Future research may need to subject these categories to additional testing.

7 Conclusion

This study examined (dis)agreement among annotators in the annotation of EU legislative provisions. The agreement levels for the different stages of the annotation process were reported and compared. Although agreement was high at the final stage across almost all legal categories analyzed, the agreement levels improved substantially after a round of revisions by the annotators. The results and observations underscore the importance of annotation procedures and of reporting about them. By doing so, this study contributes to enhancing the reliability and validity of automated information extraction models in the legal domain.

8 Supplementary information

The data, guidelines, and analyses are available at https://doi.org/10.34894/ZJJIOB.

Declarations

Conflict of interest

The authors declare no conflict of interest. The funding organization had no role in the design of the study, in the collection, analysis, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendices

Appendix A Overview of related work

Table 8
Overview of related work

Sun et al. (2023)
  Categories: Obligation, permission, prohibition, ordinary sentence
  Data: 913 provisions on German tenancy law (Waltl et al. 2017) + 1,000 sentences from Chinese social security policies
  Automated annotation: Rule-based + machine-based
  Annotators: Waltl et al. (2017) / [not retrieved]
  Performance: F1=.89 (DeonticBERT), F1=.94 (BiLSTM)

Liga and Palmirani (2023)
  Categories: Rule / non-rule, deontic / non-deontic
  Data: 707 provisions from the GDPR, with 966 formulae: 271 obligations, 76 permissions, and 619 constitutive rules
  Automated annotation: Machine-based
  Annotators: [DAPRECO - no details available]
  Performance: .87 (rule / non-rule), .87 (deontic / non-deontic)

Liga and Palmirani (2022)
  Categories: Rule / non-rule, deontic / non-deontic, obligation / permission, obligation / permission / constitutive
  Data: 707 provisions from the GDPR, with 966 formulae: 271 obligations, 76 permissions, and 619 constitutive rules
  Automated annotation: Machine-based
  Annotators: [DAPRECO - no details available]
  Performance: .82 (obligations), .55 (permissions), .89 (constitutive rules)

Bakker et al. (2022)
  Categories: Action, actor, object, recipient
  Data: 854 sentences (497 contained an action) in Dutch legislation
  Automated annotation: Rule-based + machine-based
  Annotators: 'The first three authors' (κ = .785)
  Performance: Rule-based: mean accuracy = .42 (annotated dataset), .52 (test set), .52 (Aliens Act, positives), .74 (Aliens Act, all); machine-based: mean accuracy = .81 (test set), .80 (Aliens Act, positives), .69 (Aliens Act, all)

Humphreys (2016); Humphreys et al. (2021)
  Categories: Definitions (regular, include/exclude, by example, by reference); norm types (obligation, right, permission, power); meta norms (legal effect, scope, exception, hierarchy, rationale)
  Data: 224 sentences from Directive 95/46/EC (excluding preamble and annexes): 58 definitions and 166 norms
  Automated annotation: Rule-based
  Annotators: Manual annotation by a legal expert
  Performance: Obligation (.76), power (.85), legal effect (.35), definition (.93), permission (.67), scope (.92), rationale (1.00), right (1.00), exception (1.00), hierarchy (1.00)

Joshi et al. (2021)*
  Categories: Obligations, permissions, prohibitions
  Data: Contracts
  Automated annotation: Machine-based
  Annotators: [not retrieved]
  Performance: Precision=90%, recall=89.66%

Shaghaghian et al. (2020)
  Categories: Obligations
  Data: 3,000 randomly selected text snippets from contracts
  Automated annotation: Machine-based
  Annotators: 'human legal experts'
  Performance: F1=.92

Waltl et al. (2019)
  Categories: Duties, indemnities, permissions, prohibitions, objections, continuations, consequences, definitions, and references
  Data: 601 sentences extracted from German tenancy law
  Automated annotation: Machine-based
  Annotators: Manually classified by a single domain expert
  Performance: Average F1 scores of .78 (rule-based) and .83 (machine-based)

Sleimi et al. (2018)
  Categories: Action, actor, artifact, condition, exception, location, modality, reason, sanction, situation, time, violation
  Data: 1,339 phrases in 200 randomly selected statements from the traffic laws
  Automated annotation: Machine-based
  Annotators: 'The first author' + 'the second author (...) independently annotated 10% of these statements' (κ = 0.815)
  Performance: Precision (average) = 87.4%; recall (average) = 85.5%

Glaser et al. (2018)
  Categories: Duties, indemnities, permissions, prohibitions, objections, continuations, consequences, definitions, references
  Data: 913 sentences (German tenancy law + rental agreements)
  Automated annotation: Machine-based
  Annotators: 'two human legal experts'; 'a third legal expert acted as the editor in order to decide on an annotation in the case of disagreement'
  Performance: F1=.83

Chalkidis et al. (2018)
  Categories: Obligations, prohibitions
  Data: 6,385 contractual provisions of 100 randomly selected English service agreements
  Automated annotation: Machine-based
  Annotators: 'Each section was annotated by one law student (five students in total). A paralegal expert checked and corrected all annotations'
  Performance: F1 scores of .90 (obligations) and .84 (prohibitions)

Dragoni et al. (2018)
  Categories: Obligation, permission, prohibition
  Data: Sections 8.2.1(a)-8.2.1(c) of the Australian Telecommunications Consumer Protections Code, TC628-2012 (TCPC)
  Automated annotation: Rule-based + machine-based
  Annotators: 'an analyst'
  Performance: Precision = 95.92% to 100.00%

Aires et al. (2017)
  Categories: Norm (norm/non-norm), party
  Data: 256 unannotated Australian contracts; 92 manually annotated contracts; norm sentence set (9,864 norms, 10,544 non-norms)
  Automated annotation: Rule-based
  Annotators: [not retrieved]
  Performance: F1=87%, precision=79%, recall=98% (norm identification); accuracy=60% (party)

O'Neill et al. (2017)
  Categories: Obligations, prohibitions, permissions
  Data: Training set: 1,297 sentences (596 obligations, 94 prohibitions, 607 permissions); test set: 312 obligations, 248 permissions, 62 prohibitions
  Automated annotation: Machine-based
  Annotators: 'both subject matter experts' (κ = .74)
  Performance: F1=.79

Waltl et al. (2017)
  Categories: Statutory rights, statutory duties, objection, legal consequence, procedure, reference, continuation, definition
  Data: 532 sentences from German tenancy law (§535 - §595 German civil code)
  Automated annotation: Machine-based
  Annotators: 'two legal experts'
  Performance: Accuracy=.90 for some categories

Waltl et al. (2016)
  Categories: References (fully-explicit, semi-explicit, implicit, tacit)
  Data: Ten (out of more than 6,000) German laws containing the most words
  Automated annotation: Rule-based
  Annotators: [not retrieved]
  Performance: StGB: fully-explicit references (precision 98%, recall 97%), semi-explicit references (precision 80%, recall 80%), implicit references (precision 93%, recall 93%); KWG: fully-explicit references (precision 89%, recall 88%), semi-explicit references (precision 82%, recall 60%), implicit references (precision 96%, recall 92%)

Peters and Wyner (2016)
  Categories: Duty (ActionVerb, DutyBearer, DutyCounterPart, DutyObject)
  Data: Two EU Directives
  Automated annotation: Machine-based + rule-based
  Annotators: A legal expert
  Performance: F1=75%

Gao and Singh (2014)
  Categories: Practical commitment, dialectical commitment, authorization, power, prohibition, sanction, non-norm
  Data: 868 sentences from manufacturing contracts (training set); 99 manually annotated norm candidate sentences from the manufacturing domain (test set)
  Automated annotation: Rule-based + machine-based
  Annotators: 'The authors (assumed sufficiently competent in the study of norms and text mining) annotated the norm type for each sentence, resolving conflicting annotations via discussion'
  Performance: Weighted average F-score of 84%

de Araujo et al. (2013)*
  Categories: [not retrieved]
  Data: [not retrieved]
  Automated annotation: [not retrieved]
  Annotators: [not retrieved]
  Performance: [not retrieved]

Wyner and Peters (2011)
  Categories: Obligation, permission, list, antecedent, consequent, exception phrases, subject agent, direct object theme, passivised verb, subject theme, by-phrase agent
  Data: 1,777-word passage from the US Code of Federal Regulations (US Food and Drug Administration, Department of Health and Human Services regulation for blood banks on testing requirements for communicable disease agents in human blood), Title 21 part 610 section 40.6
  Automated annotation: Rule-based
  Annotators: [not retrieved]
  Performance: Precision=100%, recall=100% (obligation, permission, list, antecedent, consequent); precision=100%, recall=36% (exception phrases); precision=88%, recall=100% (subject agent); precision=100%, recall=30% (direct object theme); precision=100%, recall=57% (passivised verb); precision=100%, recall=70% (subject theme); precision=100%, recall=14% (by-phrase agent)

Grabmair et al. (2011)*

Francesconi (2010)
  Categories: Definition, liability, prohibition, duty, permission, penalty
  Data: 209 provisions
  Automated annotation: Machine-based
  Annotators: [not retrieved]
  Performance: Precision = 85.71% (prohibition), 69.23% (duty), 78.95% (permission), 85.83% (penalty); recall = 92.30% (prohibition), 30.50% (duty), 100.00% (permission), 89.34% (penalty)

de Maat et al. (2010)
  Categories: Definition, permission, obligation, delegation, publication provision, application provision, enactment date, citation title, change (scope, insertion, replacement, repeal, renumbering)
  Data: 584 sentences in 18 different Dutch regulations; 2 bills (test set)
  Automated annotation: Rule-based vs machine-based
  Annotators: [not retrieved]
  Performance: Rule-based: accuracy=94%-96% (sentences), 94%-100% (lists); machine-based: accuracy=89%-96% (sentences), 83%-100% (lists)

de Maat et al. (2010)
  Categories: Definition, norm (right/permission), delegation, publication provision, application provision, enactment date, citation title, value assignment, penalisation, change, mixed type, norm (statement of fact), scope, insertion, replacement, repeal, renumbering
  Data: 592 sentences and 62 lists in 18 different Dutch regulations
  Automated annotation: Rule-based
  Annotators: [not retrieved]
  Performance: Accuracy=91% (sentences), 81% (lists)

Kiyavitskaya et al. (2008)
  Categories: Right, anti-right, obligation, anti-obligation, exception, some types of constraints
  Data: 2,262 words of Section 164.520 of the Privacy Rule + 6,185 words of the Stanca Act (Italy)
  Automated annotation: Rule-based
  Annotators: 'four junior researchers from the software engineering area, two of whom were not from the group working on the tool' + 'an expert analyst'
  Performance: Experiment 1: '(a) the total number of entities identified was about 10 percent larger than when starting from the original document; however, t-test results do not allow us to claim that this improvement is statistically significant; (b) annotators were faster by about 12.3 per cent'. Experiment 2: 'The tool outperformed the human annotator in identifying instances of the concepts actor, policy, action, and resource'

Francesconi and Passerini (2007)
  Categories: Repeal, definition, delegation, delegification, duty, exception, inserting, prohibition, permission, penalty, substitution
  Data: 582 provisions distributed among the 11 classes
  Automated annotation: Machine-based
  Annotators: [not retrieved]
  Performance: Accuracy=92%

Biagioli et al. (2005)
  Categories: Repeal, definition, delegation, delegification, prohibition, reservation, insertion, obligation, permission, penalty, substitution
  Data: 582 provisions distributed among the 11 classes
  Automated annotation: Machine-based
  Annotators: 'selected and labelled by legal experts'
  Performance: Classification: .46-1.00 (precision), .33-1.00 (recall); provision argument extraction: success=82%

* = full-text version could not be retrieved
References
Biagioli C, Francesconi E, Passerini A, Montemagni S, Soria C (2005) Automatic semantics extraction in law documents. Association for Computing Machinery. https://doi.org/10.1145/1165485.1165506
Boella G, Humphreys L, Martin M, Rossi P, van der Torre L (2012) Eunomos, a legal document and knowledge management system to build legal services. In: Palmirani M, Pagallo U, Casanovas P, Sartor G (eds) AI approaches to the complexity of legal systems. Models and ethical challenges for legal systems, legal language and legal ontologies, argumentation and software agents. Springer, Berlin, Heidelberg, pp 131–146. https://doi.org/10.1007/978-3-642-35731-2_9
de Araujo DA, Rigo SJ, Muller C, Chishman R (2013) Automatic information extraction from texts with inference and linguistic knowledge acquisition rules. In: 2013 IEEE/WIC/ACM international joint conferences on web intelligence (WI) and intelligent agent technologies (IAT), vol 3, pp 151–154
de Maat E, Krabben K, Winkels R (2010) Machine learning versus knowledge based classification of legal texts. In: Proceedings of the 2010 conference on legal knowledge and information systems: JURIX 2010: the twenty-third annual conference. IOS Press, pp 87–96
Dragoni M, Villata S, Rizzi W, Governatori G (2018) Combining natural language processing approaches for rule extraction from legal documents. In: Pagallo U, Palmirani M, Casanovas P, Sartor G, Villata S (eds) AI approaches to the complexity of legal systems. Springer International Publishing, pp 287–300
Gao X, Singh MP (2014) Extracting normative relationships from business contracts. International Foundation for Autonomous Agents and Multiagent Systems
Grabmair M, Ashley K, Hwa R, Sweeney PM (2011) Toward extracting information from public health statutes using text classification machine learning. In: Legal knowledge and information systems: JURIX 2011: the twenty-fourth annual conference, University of Vienna, Austria, 14th-16th Dec 2011, vol 235, pp 73–82
Kiyavitskaya N, Zeni N, Breaux TD, Antón AI, Cordy JR, Mich L, Mylopoulos J (2008) Automating the extraction of rights and obligations for regulatory compliance. In: Li Q, Spaccapietra S, Yu E, Olivé A (eds) Conceptual modeling: ER 2008. Springer, Berlin, Heidelberg, pp 154–168
Liga D, Palmirani M (2022) Transfer learning for deontic rule classification: the case study of the GDPR. In: Francesconi E, Borges G, Sorge C (eds) Legal knowledge and information systems: JURIX 2022: the thirty-fifth annual conference, Saarbrücken, Germany, 14-16 December 2022, vol 362. IOS Press, pp 200–205. https://doi.org/10.3233/FAIA220467
Liga D, Palmirani M (2023) Deontic sentence classification using tree kernel classifiers. In: Arai K (ed) Intelligent systems and applications. Springer International Publishing, pp 54–73
Minkova K, Chakravarthy S, van Dijck G (2023) Low-resource deontic modality classification in EU legislation. In: Preoţiuc-Pietro D, Goanta C, Chalkidis I, Barrett L, Spanakis GJ, Aletras N (eds) Proceedings of the natural legal language processing workshop 2023. Association for Computational Linguistics, Singapore, pp 149–158. https://aclanthology.org/2023.nllp-1.15
Palmirani M, Governatori G, Rotolo A, Tabet S, Boley H, Paschke A (2011) LegalRuleML: XML-based rules and norms. In: Olken F, Palmirani M, Sottara D (eds) Rule-based modeling and computing on the semantic web. Springer, Berlin, Heidelberg, pp 298–312. https://doi.org/10.1007/978-3-642-24908-2_30
Shaghaghian S, Feng L, Jafarpour B, Pogrebnyakov N (2020) Customizing contextualized language models for legal document reviews. In: 2020 IEEE international conference on big data (Big Data), pp 2139–2148
Sleimi A, Sannier N, Sabetzadeh M, Briand L, Dann J (2018) Automated extraction of semantic legal metadata using natural language processing. In: 2018 IEEE 26th international requirements engineering conference (RE), pp 124–135
Tiscornia D, Turchi F (1997) Formalization of legislative documents based on a functional model. In: Proceedings of the 6th international conference on artificial intelligence and law. Association for Computing Machinery, pp 63–71. https://doi.org/10.1145/261618.261633
van Doesburg R, van Engers TM (2019) Explicit interpretation of the Dutch Aliens Act. In: Proceedings of the workshop on artificial intelligence and the administrative state co-located with the 17th international conference on AI and law (AIAS@ICAIL). https://ceur-ws.org/Vol-2471/paper6.pdf
Waltl B, Landthaler J, Matthes F (2016) Differentiation and empirical analysis of reference types in legal documents. In: International conference on legal knowledge and information systems
Waltl B, Muhr J, Glaser I, Bonczek G, Scepankova E, Matthes F (2017) Classifying legal norms with active machine learning
Wyner A, Peters W (2011) On rule extraction from regulations. In: Frontiers in artificial intelligence and applications, vol 235: Legal knowledge and information systems, pp 113–122