While an algorithm may perform very well on the categorisation task, the obtained categories are not useful by themselves. As the documents used by the system only become available once the judgements have been made public, the outcome categorisation does not contribute any new information (one can simply extract the verdict from the published judgement). This view is also supported by Bex and Prakken (
2021), who insist that categorising decisions without explaining why the categorisation was made does not provide any useful information and may even be misleading. The performance of a machine learning model for judgement categorisation, however, may provide useful information about how informative the characteristic features are. To enable such feature extraction, it is important that the system is not a ‘black box’ (such as many of the more recent neural classification models). Therefore, rather than ‘predicting court decisions’, the main objective of the outcome-based judgement categorisation task should be to identify
predictors underlying the categorisations.
As we only discuss publications that categorise judgements on the basis of the outcome of the case, we will refer to outcome-based judgement categorisation simply as judgement categorisation.
3.2.1 Research in outcome-based judgement categorisation
Most of the papers in the field categorise judgements. The papers surveyed that involve judgement categorisation can be found in Table
2. For each of the fifteen papers, we indicate the paper itself, the court, whether or not the authors provide a method for analysing feature importance (FI) and thus identifying specific predictors of the outcome within the text, and the maximum reported performance.
Within these studies, two broad categories can be distinguished depending on the type of data they use. On the one hand, most studies use the raw text, explicitly selecting the parts of the judgement that do not include (references to) the verdict. On the other hand, a smaller number of studies manually annotate the data and use those annotations as the basis for the categorisation.
Kowsrihawat et al. (
2018) used the raw text of documents of the Thai Supreme Court in cases such as murder, assault, theft, fraud and defamation, and categorised them (with an accuracy of 67%) on the basis of the facts of the case and the text of the relevant legal provisions, using a range of statistical and neural methods. Medvedeva et al. (2018, 2020a) categorised decisions of the ECtHR (with an accuracy of at most 75%) using only the facts of the case (i.e. a separate section in each ECtHR judgement). Notably, Medvedeva et al. (
2020a) identified the top predictors (i.e. sequences of one or more words) for each category, which was possible due to the (support vector machine) approach they used. Strickson and De La Iglesia (
2020) categorised judgements of the UK Supreme Court, comparing several systems trained on the raw text of the judgements (excluding the verdict); they reported an accuracy of 69% and also presented the top predictors for each class. Sert et al. (
2021) categorised cases of the Turkish Constitutional Court related to public morality and freedom of expression using a traditional neural multi-layer perceptron approach with an average accuracy of 90%. Similarly to Medvedeva et al. (
2020a), Chalkidis et al. (
2019) also investigated the ECtHR using the facts of the case, and proposed several neural methods that improved categorisation performance (up to 82%). They additionally proposed an approach (a hierarchical attention network) to identify which words and facts were most important for their systems’ classifications. In their subsequent study, Chalkidis et al. (
2020) used a more sophisticated neural categorisation algorithm specifically tailored to legal data (LEGAL-BERT). Unfortunately, while their approach did show improved performance (with an F1-score of 83%), it was not possible to determine the best predictors of the outcome due to the system’s complexity. Medvedeva et al. (
2021) reproduced the algorithms in Chalkidis et al. (
2019) and Chalkidis et al. (
2020) in order to compare their performance on categorisation and forecasting tasks (see below) for a smaller subset of ECtHR cases, and achieved an F1-score of up to 92% when categorising judgements from 2019. The scores, however, varied across the years; for example, categorisation of cases from 2020 did not surpass 62%. Several other categorisation studies (with accuracies ranging between 69 and 88%) also focused on the facts of ECtHR cases, but likewise did not investigate the best predictors (Kaur and Bozic
2019; O’Sullivan and Beel
2019; Condevaux
2020). Malik et al. (
2021) used neural methods to develop a system that categorised Indian Supreme Court decisions, achieving 77% accuracy. As their main focus was to develop an explainable system, they used an approach that allowed them to investigate the importance of their features, somewhat similar to the approach of Chalkidis et al. (
2020).
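To illustrate how such predictors can be surfaced from a transparent model, the sketch below trains a linear support vector machine on TF-IDF word n-grams and lists the highest-weighted n-grams per outcome, broadly in the spirit of the approaches of Medvedeva et al. (2020a) and Strickson and De La Iglesia (2020). The documents, labels and settings are invented placeholders, not the pipelines of the cited studies.

```python
# Minimal sketch: extracting top n-gram predictors from a linear SVM
# (placeholder data; not the exact pipeline of any study discussed above).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical 'facts' sections and their outcomes.
facts = [
    "the applicant was detained without judicial review",
    "the applicant received a fair and public hearing",
    "the length of the proceedings exceeded ten years",
    "the domestic courts examined the complaint promptly",
]
labels = ["violation", "no_violation", "violation", "no_violation"]

# Word uni- and bigrams weighted by TF-IDF.
vectoriser = TfidfVectorizer(ngram_range=(1, 2))
X = vectoriser.fit_transform(facts)

clf = LinearSVC()
clf.fit(X, labels)

# In a binary linear model, strongly negative weights point towards the first
# class and strongly positive weights towards the second.
features = np.array(vectoriser.get_feature_names_out())
order = np.argsort(clf.coef_.ravel())
print("Top predictors for", clf.classes_[0], ":", features[order[:5]])
print("Top predictors for", clf.classes_[1], ":", features[order[-5:]])
```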
Manually annotated data was used by Kaufman et al. (
2019), who focused on data from the US Supreme Court (SCOTUS) Database (Spaeth et al.
2014) and achieved an accuracy of 75% using statistical methods (i.e. AdaBoosted decision trees). However, they did not investigate the most informative predictors. Shaikh et al. (
2020) also used manually annotated data to categorise the decisions in murder cases of the Delhi District Court with an accuracy of up to 92% using classification and regression trees. These authors manually annotated 18 features, including whether the injured party is dead or alive, the type of evidence, the number of witnesses, et cetera. Importantly, they analysed the impact of each type of feature on each type of outcome.
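For this annotation-based line of work, a rough sketch of how a decision tree over hand-annotated case features can be trained and its feature importances inspected is given below. The features and data are invented for illustration, and scikit-learn’s decision tree is used as a stand-in for the classification and regression trees of Shaikh et al. (2020).

```python
# Minimal sketch of categorisation from manually annotated case features
# (hypothetical features and data; not Shaikh et al.'s actual 18 features).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Each row represents one case, annotated by hand.
cases = pd.DataFrame({
    "injured_deceased": [1, 0, 1, 0, 1, 0],
    "eyewitness_count": [2, 0, 1, 3, 0, 1],
    "weapon_recovered": [1, 0, 1, 0, 0, 1],
})
outcomes = ["convicted", "acquitted", "convicted", "acquitted", "acquitted", "convicted"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(cases, outcomes)

# feature_importances_ shows how much each annotated feature contributes to
# the tree's splits, i.e. which predictors matter most for the outcome.
for name, importance in zip(cases.columns, tree.feature_importances_):
    print(f"{name}: {importance:.2f}")
```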
Finally, Salaün et al. (
2020) essentially combined the two types of predictors by not only extracting a number of characteristics from cases of the Rental Tribunal of Quebec (including the court location, judge, types of parties, et cetera), but also using the raw text of the facts (as well as the complete text excluding the verdict), achieving a performance of at most 85% with a French BERT model, FlauBERT.
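A minimal sketch of combining structured case characteristics with the raw text of the facts, in the spirit of Salaün et al. (2020), is shown below. The fields and data are invented, and a simple TF-IDF plus logistic regression pipeline is used rather than their FlauBERT-based model.

```python
# Minimal sketch of combining structured metadata with raw text
# (invented fields and data; Salaün et al. use a FlauBERT-based model instead).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({
    "facts": [
        "tenant failed to pay rent for three months",
        "landlord did not carry out urgent repairs",
        "tenant caused damage to the dwelling",
        "landlord increased the rent without notice",
    ],
    "claimant": ["landlord", "tenant", "landlord", "tenant"],
    "location": ["Montreal", "Quebec City", "Montreal", "Laval"],
})
outcomes = ["granted", "granted", "dismissed", "dismissed"]

# TF-IDF features for the raw text, one-hot encoding for the structured fields.
features = ColumnTransformer([
    ("text", TfidfVectorizer(), "facts"),
    ("meta", OneHotEncoder(handle_unknown="ignore"), ["claimant", "location"]),
])
model = Pipeline([("features", features), ("clf", LogisticRegression())])
model.fit(data, outcomes)
print(model.predict(data.head(1)))
```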
Notably, the performance of Sert et al. (
2021) was very high. Despite the high success rate of their system, however, the authors warn against using it for decision-making. Nevertheless, they do suggest that their system could potentially be used to prioritise cases that are more likely to end in a violation. This suggestion mirrors the proposition made by Aletras et al. (
2016) to potentially use their system to prioritise cases involving human rights violations. In both cases, however, the experiments were conducted using data extracted from the final judgements of the court, and the performance of these systems on data compiled before the verdict was reached (i.e. the information that would be needed to prioritise cases) is unknown. Making these types of recommendations is therefore potentially problematic.
Many categorisation papers shown in Table
2 claim to be useful for legal aid. However, as we argued above, categorisation as such is not a useful task, given that the verdict can simply be read in the judgement text. To be useful, categorisation performance must be supplemented with the most characteristic features (i.e. predictors). Unfortunately, only a minority of studies provides this information, and even when they do, the resulting features, especially when using the raw text (i.e. characteristic words or phrases), may not be particularly meaningful.
In an attempt to be maximally explainable, Collenette et al. (
2020) suggest using an Abstract Dialectical Framework instead of machine learning. They apply this framework to deducing the verdict from the text of ECtHR judgements regarding Article 6 of the ECHR (the right to a fair trial). The system requires the user to answer a range of questions, and on the basis of the provided answers, the model determines whether or not there was a violation of the right to a fair trial. The questions for the system were derived by legal experts, and legal expertise is also required to answer them (Collenette et al.
2020). While their system seemed to perform flawlessly when tested on ten cases, we face the same issue as with the machine learning systems: the main input data is based on the final decision that has already been made by the judge. For instance, one of the questions that the model requires to be answered is whether the trial was independent and impartial, which is precisely a question that has to be decided by the judge. While this type of tool may potentially one day be used for judicial support, for example as a checklist for a judge when making a specific decision, it is unable to actually forecast decisions in advance, or to point to external factors that are not identified by legal experts.
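To make the contrast with machine learning concrete, such a checklist-style procedure can be caricatured as follows. The questions below are hypothetical simplifications (the actual Abstract Dialectical Framework of Collenette et al. (2020) encodes many more expert-derived questions and their dependencies), but they illustrate that answering the questions already presupposes the very judgements on which the verdict depends.

```python
# Rough sketch of a checklist-style decision procedure; the questions are
# hypothetical simplifications of the expert-derived questions used by
# Collenette et al. (2020), not their actual framework.
QUESTIONS = {
    "independent_impartial": "Was the trial independent and impartial?",
    "public_hearing": "Did the applicant receive a public hearing?",
    "reasonable_time": "Were the proceedings concluded within a reasonable time?",
}

def assess_article6(answers: dict) -> str:
    """Return a verdict based on expert-provided yes/no answers.

    Note that answering these questions (e.g. on impartiality) already
    requires the kind of legal judgement the final decision depends on.
    """
    if all(answers[key] for key in QUESTIONS):
        return "no violation of Article 6"
    return "violation of Article 6"

example = {"independent_impartial": True, "public_hearing": True, "reasonable_time": False}
print(assess_article6(example))  # -> violation of Article 6
```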