
Open Access 28.11.2024 | Original Research

Mining EU consultations through AI

Authors: Fabiana Di Porto, Paolo Fantozzi, Maurizio Naldi, Nicoletta Rangone

Published in: Artificial Intelligence and Law


Abstract

Consultations are key to gathering evidence that informs rulemaking. When analysing the feedback received, it is essential for the regulator to appropriately cluster stakeholders' opinions, as misclustering may alter the representativeness of the positions, making some of them appear majoritarian when they might not be. The European Commission's (EC) approach to clustering opinions in consultations lacks a standardized methodology, leading to reduced procedural transparency, while making use of computational tools only sporadically. This paper explores how natural language processing (NLP) technologies may enhance the way opinion clustering is currently conducted by the EC. We examine 830 responses to three legislative proposals (the Artificial Intelligence Act, the Digital Markets Act and the Digital Services Act) using both a lexical and a semantic approach. We find that some groups (like small and medium companies) have low similarity across all datasets and methodologies despite being clustered into one opinion group by the EC. The same happens for citizens and consumer associations in the consultation run on the DSA. These results suggest that computational tools can help reduce the misclustering of stakeholders' opinions and consequently allow greater representativeness of the different positions expressed in consultations. They further suggest that the EC could identify a convergent methodology for all its consultations, in which such tools are employed in a consistent and replicable manner rather than occasionally. Ideally, it should also explain when one methodology is preferred to another. This effort should find its way into the Better Regulation toolbox (EC 2023). Our analysis also paves the way for further research towards a transparent and consistent methodology for group clustering.
Notes
This article is the result of a collaborative effort. However, Fabiana Di Porto composed paragraphs 2 and 6; Nicoletta Rangone authored paragraphs 1 and 3; while Maurizio Naldi and Paolo Fantozzi contributed paragraphs 4 and 5.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

Consultations serve diverse functions. Above all, they are an input of legitimacy for policy-makers (Braun and Busuioc 2020), because they allow for participation in rulemaking. At the same time, they allow for the collection of evidence needed to inform regulation (Shapiro 2018). With the advent of Artificial Intelligence (AI), scholars have been investigating whether such tools could be used for consultation purposes. Already in 2004, Coglianese contended that computational tools can help categorize and analyze comments (Coglianese 2004). More recently, Livermore et al. have suggested that computational tools can help enhance participation, especially in consultations attracting a large number of responses, e.g., by "aggregating and analyzing comments to identify emergent content that is only apparent when comments are understood in relationship to each other and not simply read as individual, atomized responses to a regulatory proposal" (Livermore et al. 2017) (see also Rangone 2023).
The idea that such tools can help policymakers identify the main topics has finally been acknowledged by the European Commission’s JRC (European Commission 2021). Nonetheless, the EC itself shows a mixed attitude as to the use of computational tools in consultations (Di Porto 2022). On the one hand, it occasionally employs text processing technologies (like Atlas.ti.1,2 or gensim.summarization3) to analyze comments collected in consultations with large numbers of participants. On the other hand, it expressly recommends using data analysis software (such as STATA or CODA4) only for the sake of identifying the existence of campaigns by organized interests (see p. 476 in EC (2023)). The same happens in the U.S., where technology is widely employed by federal agencies, especially to detect duplicated, misattributed and computer-generated comments (Balla et al. 2022).
What is remarkable, however, is that the Commission employs such tools in neither a continuous nor a standardized way, despite the availability of an EU expert body that could provide support in this specific area: the Commission Competence Centre on Text Mining and Analysis (CC-TMA).5 AI tools can effectively be used not only to identify duplicates but also to summarize the overall sentiment of comments. Hence, they not only allow for significant time savings (Engstrom et al. 2020), but may also prevent decision-makers from being affected by information overload bias, which could impede an adequate assessment of comments (Rangone 2022).
A completely neglected question is that of using computational tools to cluster stakeholders' positions by identifying similar ideas and grouping them into opinion clusters. In fact, to report the stakeholders' positions on an issue, the EC combines different groups showing similar views. For instance, in the following sentence, the EC is pooling together big, medium, small and micro businesses (see p. 480 in EC (2023)): 'Although 74% of industry respondents considered that the legislative framework was delivering benefits, only 32% of citizens agreed with this view.' However, it does not clarify how this pooling is done, nor whether any special tool is used for it.
Identifying and clustering groups in consultation feedback is a sensitive and crucial task, and it can be performed through computational tools for both closed and open comments, as the EC toolbox stresses.6 The purpose of analyzing the collected comments is first to inform rule-makers of the different views expressed by the different stakeholder groups and then to incorporate them into the final legislative proposals. Therefore, grouping the ideas and positions of stakeholders vis-à-vis a regulatory proposal serves accountability purposes: it helps "inform stakeholders on how their input has been considered" and "explain why certain suggestions could not be taken up" (see tool 54, p. 477 in EC (2023)). Moreover, the EC expressly asks policymakers, when reporting on the outcome of a consultation, to explain the methodologies and tools used, for transparency reasons (see p. 477 in EC (2023)).
Notwithstanding these acknowledgements, the EC barely mentions its computational methodologies in the synopsis reports attached to EU legislative proposals. What is certain is that the EC performs stakeholder clustering in a non-standardized manner, proceeding on a case-by-case basis.
Although the clustering of stakeholders' opinions is understudied, it is essential to ensure it is done effectively, so as not to alter the representativeness of the various positions and to provide rule-makers with the right information: misclustering groups may make an opinion appear prevalent when in fact it is not. For instance, identifying "industry" as the relevant cluster instead of (the much smaller) "big companies" may lead a position to prevail, while, if correctly attributed, that opinion would be minoritarian.
Against this background, our analysis suggests that the EC and rule-makers in general should consider using computational tools to perform stakeholder clustering when analyzing consultation feedback. By employing Natural Language Processing (NLP) techniques in combination with the methodologies already in use, we contend that rule-makers may better identify interest groups' positions while making truly data-informed regulation possible. Indeed, because NLP techniques allow computers to recognize and analyze text, they can help improve the identification of interest groups in consultations. In particular, we address the following research question: To what extent do NLP techniques add to the existing group identification approaches employed by the European Commission? In order to address this question, we have applied different computational techniques to analyze the content of 830 replies to the consultations run by the EC on the Digital Services Act (DSA),7 the Digital Markets Act (DMA)8 and the Artificial Intelligence Act (AIA) proposal.9
We find that computational tools successfully support the task of opinion group clustering by automatically identifying similarities and dissimilarities of opinions. Employing both lexical and semantic approaches, we find, for instance, that small and medium companies have low similarity across all methodologies and corpora while being clustered together in all analyses run by the EC. Moreover, trade unions tend to focus on topics of their own, quite apart from those of other stakeholder categories. Surprisingly, citizens share very little with consumer organizations in all three consultations. Another unexpected result is the similarity of positions between NGOs and large companies in the DSA consultation, which is confirmed by all the methodologies we employed.
The paper is structured as follows. We begin by reporting how the EC clusters opinions in the DSA, DMA and AI Act (Sect. 2). Afterwards, we explain our datasets in Sect. 3 and our methodology in Sect. 4. Then, we present our results in Sect. 5 and discuss how they contribute to the broader debate about the effectiveness of consultations to correctly inform rulemaking (Sect. 6).

2 Case studies: stakeholders' opinion clustering in the DSA, DMA, and AI Act

Of the three legislative procedures we consider, the European Commission carried out computational text analysis only for some of the documents examined in this paper.
In the AIA's Impact Assessment (European Commission 2021), a word cloud reporting the most frequent words appearing across all comments is shown, alongside a summary of the major comments, with no explanation of the clustering methodology employed. For instance, when it affirms that "Especially NGOs, EU citizens and others preferred the combination of [regulatory] option 2 'voluntary labelling' and sub-option 3b" (see p. 20 in European Commission (2021)), the EC does not explain how it merged the two groups (NGOs and citizens) into a single opinion. Similarly, no indication of the methodology employed to analyse the responses to both closed and open questions is given in the DSA's Impact Assessment (European Commission 2020).
Only the DMA's Impact Assessment (European Commission 2020) refers to the employment of linguistic computational tools. Here, tokenization, stemming, and cosine similarity are listed as tools to identify duplicate responses; in other words, such tools are used specifically to recognise campaigns. An automatic summarization tool (the gensim.summarization module in Python (Vinnarasu and Jose 2019)) is instead employed to summarise the lengthier comments and help analyse their content.
This demonstrates that the EC does not adopt a standardized and harmonized approach to assessing comments and forming opinion groups. Moreover, this results in a non-transparent approach that does not allow replicability or control.
Overall, none of the three response reports attempts to identify similarities of opinions across stakeholder categories, as we do in this paper.

3 Datasets

We considered the comments on three major documents released by the European Union: the AIA, the DMA, and the DSA. These legislative acts reflect the EU’s commitment to shaping the digital future.
The proposed AIA regulates the development and deployment of artificial intelligence technologies within the EU. It sets out a harmonized legal framework to foster trust, promote innovation, and safeguard fundamental rights. It classifies AI systems into four categories: unacceptable risk, high risk (or systemic risk for general-purpose AI models), limited risk, and minimal risk. High-risk AI systems, such as those used in critical infrastructure or for biometric identification, are subject to strict obligations, including transparency, accountability, and human oversight. The AIA also establishes a European Artificial Intelligence Board (EAIB) to oversee its implementation and coordinate cooperation between member states.
The DMA addresses the challenges arising from the market power of large online platforms, commonly known as gatekeepers. It aims to ensure fair and open digital markets by setting out obligations for gatekeepers, including prohibitions on unfair practices, data access, and interoperability requirements. Gatekeepers are subject to enhanced regulatory scrutiny, and fines for non-compliance can reach up to 10% of their global turnover. The DMA also establishes a Digital Markets Advisory Committee (DMAC) to provide guidance and expertise to the European Commission. The DMA aims to create a level playing field, foster competition, and protect the rights of users and businesses in the digital space.
The DSA updates the legal framework governing digital services. It aims to strike a balance between protecting fundamental rights and user safety on one side and promoting innovation on the other side. It introduces new obligations for online platforms, including transparency requirements for content moderation practices, enhanced notice and action mechanisms, and measures to address the spread of illegal content. It also establishes a Digital Services Coordinator within each member state to ensure effective implementation and cooperation at the national level. The DSA focuses on promoting accountability, reducing risks associated with online platforms, and safeguarding user trust.
We collected all publicly available comments (both replies to questionnaires and feedback comments) and their attachments, when uploaded, in PDF format. Since no API allows the data to be collected in plain text, we had to convert the PDFs into a machine-workable format. PDF-to-text conversion is not error-free, so we manually corrected text affected by conversion errors. In addition, some comments, though marked as expressed in English, were actually written in a different language.
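For illustration, a minimal sketch of this conversion step, assuming the pypdf library and a local folder of downloaded submissions (illustrative names, not necessarily the exact tooling we used), could look as follows:

```python
# Minimal sketch of the PDF-to-text step; folder names are illustrative.
from pathlib import Path
from pypdf import PdfReader

def pdf_to_text(pdf_path: Path) -> str:
    """Concatenate the text of all pages; the output may still need manual fixes."""
    reader = PdfReader(str(pdf_path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)

out_dir = Path("comments_txt")
out_dir.mkdir(exist_ok=True)
for pdf_file in Path("comments_pdf").glob("*.pdf"):
    out_file = out_dir / (pdf_file.stem + ".txt")
    out_file.write_text(pdf_to_text(pdf_file), encoding="utf-8")
```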
We collected comments both on the inception impact assessments (phase 1) and the draft legislative proposals (phase 2). Namely, we scraped publicly available documents from the publication of the DSA, DMA and AIA’s Inception Impact Assessments10 until the closure of consultations over them. We further collected documents starting with the publication of the DSA, DMA and AIA proposals11 and ending on July 19, 2022. This date was chosen as it marked the final adoption of both the DSA and DMA proposals and, at the same time, included the expiration of the deadline for submitting comments to the AIA final proposal.12 The overall composition of our dataset is shown in Table 1. The figures reported in that table refer to the number of comments, where each comment and its possible attachment count as one.
Table 1
Size of the dataset of publicly available documents
DSA    DMA    AIA    Total
272    215    384    871
The documents in the datasets are of quite varying sizes. Table 2 shows the main statistics: some documents consist of just one character, while 10% of the whole dataset is more than 30–40 thousand characters long.
Table 2
Length statistics for the documents (number of characters)
Consultation    Min    Max        Mean      90th percentile
AIA             5      784,854    23,271    45,452
DMA             1      558,176    24,488    47,844
DSA             291    112,801    15,570    32,190
The categories we employ are those emerging from the corresponding field in the comment, with some grouping by size due to the small number of documents in certain categories:
  • Business associations (distinguished by size as large, medium, small, and micro);
  • Companies (again distinguished by size as large, medium, small, and micro);
  • Citizens;
  • Consumer organizations;
  • Non-governmental organizations (NGOs);
  • Others;
  • Public Authorities;
  • Trade unions.
The size classes for companies, business associations, and trade unions are shown in Table 3, where size is measured by the number of employees for companies and by the number of members for associations and trade unions. For trade unions, we have neglected the distinction by size, as we do not have enough documents for each size group to draw statistically reliable conclusions.
Table 3
Size of stakeholders
Category    Size
Micro       1–9
Small       10–49
Medium      50–249
Large       \(\ge \) 250
We can decompose the overall number of documents belonging to each corpus by the nature of the stakeholder authoring them. We report the results in Table 4. Some documents, though contributing to the overall figures in Table 1, are not counted in Table 4 as they do not include the stakeholder category that authored them and so cannot be attributed to any specific category. It is to be noted that these are a different group than that labelled as Others, since the latter refers to the comments where users have expressly selected the label Others from the available menu. Also, we have not analysed further the documents pertaining to categories with an extremely low number of documents (Non-EU citizens).
Table 4
No. of documents by stakeholder category and corpus
Category                          AIA    DMA    DSA
Academic Research Institutions     26      9     11
Business Associations             110     60     94
Company                           104     60     66
Consumer Organisations              7      6      3
EU Citizens                        14     12      3
NGO                                64     38     50
Non EU Citizens                     0      0      2
Other                              25      9     14
Public Authorities                  6      6      7
Trade Unions                       15      3      6
Total (by consultation)           371    203    256
Overall total                     830

4 Methodology

Our aim is to identify similarities of interests between stakeholders when replying to consultations, so as to identify homogeneous opinion groups and to help the EC correctly address them in rulemaking. We adopt both a lexical approach, where similarities of interests are identified through similarities of words, and a semantic approach. This may lead to possible aggregations of stakeholder categories, meaning that different stakeholders may belong to the same opinion group because they talk about similar topics and can thus be jointly addressed by the EC. At the same time, subjects belonging to the same category (e.g., public authorities at the international, national, regional and local level) may show dissimilarities in their positions, making them belong to different opinion groups.
To help perform such clustering through a lexical and a semantic approach, we consider both unigrams and n-grams automatically extracted from the documents, where unigrams (1-grams) are single words (e.g., regulation; monopoly), while n-grams are sequences of n words (e.g., the bigram "regulation monopoly"). Here we relax the more stringent definition by which n-grams are sequences of consecutive words and allow them to also be non-consecutive, provided they appear in the same document. We consider both unigrams and n-grams since the former highlight the relevance of each word per se, while the latter account for the relevance of the context, where a combination of words appearing jointly may provide more information than the set of single words in it. Hereafter, we describe the steps we go through, considering unigrams first (Sect. 4.1) and then n-grams (Sect. 4.2). In the following, we use the symbols reported in Table 5.
Table 5
Glossary of symbols
Symbol          Meaning
\(d\)           Document
\(f_i(g,s)\)    Frequency of the word \(w_i(g)\) in the documents submitted by the stakeholder category \(s\) in the collection \(g\)
\(G\)           Collection of documents (AIA, DSA, or DMA)
\(n_D(g)\)      Number of documents in collection \(g\)
\(n_G\)         Number of collections (3 in this paper)
\(n_S\)         Number of stakeholder categories
\(n_W(g)\)      Number of distinct words in collection \(g\)
\(w_i(g)\)      \(i\)-th word in collection \(g\)

4.1 Extracting unigrams

We consider a corpus D of documents (the generic document in the corpus is indicated hereafter as d), which are grouped into \(n_G\) collections. In our case, the documents are the comments provided by the stakeholders at each consultation round and are grouped around the three acts described in Sect. 3. Each collection contains \(n_D (g)\) documents and \(n_W(g)\) distinct words, \(g=1,2,\ldots , n_G\), as obtained after pre-processing each document by removing stopwords and lemmatizing the remaining words. The i-th word in the collection g is \(w_i (g)\). Stopword removal and lemmatization have been carried out in Python through the spaCy library,13 using the pre-trained model en_core_web_lg.
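A minimal sketch of this pre-processing step, using the spaCy model mentioned above (the helper name is ours):

```python
# Sketch of the pre-processing: stopword removal and lemmatization with spaCy,
# using the pre-trained en_core_web_lg model.
import spacy

nlp = spacy.load("en_core_web_lg")

def preprocess(text: str) -> list[str]:
    """Return the lemmas of a document, dropping stopwords, punctuation and whitespace."""
    doc = nlp(text)
    return [tok.lemma_.lower() for tok in doc
            if not tok.is_stop and not tok.is_punct and not tok.is_space]
```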
In order to quantify the contents of comments for each stakeholder, we use two indicators: the relative frequency and the TF-IDF (Term Frequency–Inverse Document Frequency). As to the relative frequency, we can compute the frequency \(f_i (g,s)\) of each word \(w_i (g)\) in the subset of collection g containing the comments posted by contributors falling in the stakeholder class \(s=1,2, \ldots , n_S\). The associated relative frequency, which can serve as an estimate of its probability, is
$$\begin{aligned} p_i (g,s) =\frac{f_i (g,s)}{\sum _{j=1}^{n_W(g)}f_j (g,s)}. \end{aligned}$$
(1)
The TF-IDF is instead
$$\begin{aligned} t_i (g,s) = p_i (g,s) \log \frac{n_D (g)}{\vert \{ d\in G: w_i(g) \in d \}\vert +1} \end{aligned}$$
(2)
The rationale for the relative frequency is that frequency quantifies the occurrence of each unigram, and unigrams with higher occurrence are considered more relevant. However, since some unigrams may appear frequently without pertaining to a specific stream of comments, we correct that shortcoming by using the TF-IDF approach, which weighs unigram frequency by the number of documents where the unigram appears. Unigrams appearing in all documents are considered less relevant than unigrams appearing in just a subset of the collection of documents (Aizawa 2003). For example, consider a corpus of 1000 documents containing just two unigrams, each appearing 1000 times. Their relative frequency is the same, equal to 0.5. However, the first unigram appears in 500 out of the 1000 documents, while the second unigram appears in just two documents. We may conclude that the second unigram is more specific. Indeed, the TF-IDF computation through Eq. 2 returns 0.252 for the first unigram and 1.261 for the second unigram, confirming our expectations.
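A minimal sketch of the two indicators of Eqs. 1 and 2, assuming documents have already been reduced to lists of lemmas (helper names are ours; the logarithm base is an assumption, as it only rescales the scores):

```python
# Sketch of Eqs. (1)-(2). `docs` is the list of pre-processed documents (token
# lists) of the whole collection g; `stake_docs` is the subset authored by category s.
import math
from collections import Counter

def relative_frequencies(stake_docs: list[list[str]]) -> dict[str, float]:
    counts = Counter(tok for doc in stake_docs for tok in doc)   # f_i(g, s)
    total = sum(counts.values())
    return {w: f / total for w, f in counts.items()}             # Eq. (1): p_i(g, s)

def tf_idf(word: str, p: dict[str, float], docs: list[list[str]]) -> float:
    n_docs_with_word = sum(1 for doc in docs if word in doc)
    # Eq. (2); base-10 logarithm assumed here
    return p.get(word, 0.0) * math.log10(len(docs) / (n_docs_with_word + 1))
```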
A sample TF-IDF distribution, reporting the top 20 unigrams for medium-sized companies in the AIA corpus, is shown in Fig. 1. On the other hand, examples of unigrams exhibiting a very low TF-IDF are the following: surprise, subcontracting, refocus, notation, and experts. It is to be noted that a low TF-IDF value may be due to either a low frequency of the unigram or a large number of documents where the unigram appears, making it less specific, and thus less interesting for the sake of grouping opinions.
We end up with a frequency distribution (respectively, a TF-IDF distribution) for each collection and each stakeholder, which provides an indication of the themes dealt with by each stakeholder, including their relevance. We can use the frequency distributions to assess the similarity of interests between stakeholders. We assume that two stakeholders exhibiting similar frequency distributions (i.e., using similar words with similar frequencies) pursue similar interests. In order to assess the similarity of frequency distributions, we resort to the Ruzicka index (Cha 2007), defined as follows for the two stakeholder categories j and k and the collection g:
$$\begin{aligned} R = \frac{\sum _{i=1}^{n_W(g)}\min (p_i(g,j),p_i(g,k))}{\sum _{i=1}^{n_W(g)}\max (p_i(g,j),p_i(g,k))} \end{aligned}$$
(3)
This index takes values in the [0,1] range, where \(R=0\) for two stakeholders whose sets of words are disjoint (i.e., they have no words in common), while \(R=1\) for two stakeholders that use exactly the same words with the same frequencies. We can, therefore, use R as a measure of similarity.
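A minimal sketch of the index of Eq. 3, computed between two unigram frequency distributions stored as word-to-frequency dictionaries (the function name is ours):

```python
# Sketch of the Ruzicka index of Eq. (3) between the relative-frequency
# distributions of two stakeholder categories.
def ruzicka(p_j: dict[str, float], p_k: dict[str, float]) -> float:
    words = set(p_j) | set(p_k)                       # union of the two vocabularies
    num = sum(min(p_j.get(w, 0.0), p_k.get(w, 0.0)) for w in words)
    den = sum(max(p_j.get(w, 0.0), p_k.get(w, 0.0)) for w in words)
    return num / den if den else 0.0                  # 0 = disjoint sets, 1 = identical
```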

4.2 Extracting n-grams

We want to extract keywords representing each document, so we apply the method described by Sharma and Li (2019). For extracting n-grams, we resort to an embedding approach. We convert each word \(w_{i,d}\) into a vector of embeddings \(e_{i,d}\), which describes the context where the word appears. We accomplish the embedding by fine-tuning an SBERT pre-trained model (Reimers and Gurevych 2019). SBERT is an improvement of the BERT transformer architecture proposed by Devlin et al. (2018), optimized to generate embeddings that are similar for similar texts. We start from the pre-trained all-roberta-large-v1 model.14 After extracting the embeddings for each document, we aggregate all the embedding vectors for a category (consultation, stakeholder type and stakeholder size) as \(e_d = \sum _i e_{i,d}\). We further compute the sum of embeddings pertaining to adjacent words for different values of the summing window size s, i.e., \(T_{d,i,s} = \sum _{j=0}^{s} e_{i+j, d}\). Finally, we compute the cosine similarity \(C(i,s)=T_{d,i,s}\cdot e_d\) for each considered window size s. The n-grams corresponding to the maximum cosine similarity are retained as the most significant keyphrases, \(p_d=w_{k,d} \,\Vert \, w_{k+1,d} \,\Vert \cdots \Vert \, w_{k+s,d}\) with \((k,s)=\underset{i,s}{\mathop {\mathrm {arg\,max}}}\, C(i,s)\), where \(\Vert \) is the concatenation operator.15
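A minimal sketch of this keyphrase-ranking step, using the off-the-shelf all-roberta-large-v1 model (i.e., without the fine-tuning described above) and returning a single top-ranked n-gram per document; function and parameter names are ours:

```python
# Sketch of the keyphrase extraction: word-level SBERT embeddings, a document
# embedding given by their sum, sliding-window sums as candidate n-grams, and
# cosine similarity to the document embedding to rank them.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-roberta-large-v1")

def top_ngram(words: list[str], max_window: int = 4) -> tuple[str, float]:
    emb = model.encode(words)                    # e_{i,d}: one vector per word
    doc_emb = emb.sum(axis=0)                    # e_d
    best, best_score = "", -1.0
    for s in range(1, max_window + 1):           # window sizes considered
        for i in range(len(words) - s + 1):
            window = emb[i:i + s].sum(axis=0)    # T_{d,i,s}
            score = float(np.dot(window, doc_emb)
                          / (np.linalg.norm(window) * np.linalg.norm(doc_emb)))
            if score > best_score:
                best, best_score = " ".join(words[i:i + s]), score
    return best, best_score
```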
Though the combined use of word embeddings and cosine similarity works differently from the way humans understand natural language, their use has been established for some time (Wang et al. 2019). On the other hand, the use of human annotators to achieve the same task, i.e., judging the similarity of two documents, exhibits significant subjectivity issues, especially when long texts are involved, as shown by Alm (2011) and Wu et al. (2024). However, in order to assess the degree of consistency between human annotators and our algorithm, we have computed the ranking consistency index for a subset of documents. The full analysis is reported in Appendix A, where we have employed two human annotators. For the two annotators, we get a ranking consistency index equal to 61.5% and 60%, respectively, which is reasonably good.
An example of the Top 10 n-grams obtained for medium companies from the AIA consultation is shown in Table 6. In comparison with the unigrams shown in Fig. 1, n-grams provide more information and allow a better understanding of the claims stakeholders (here, medium companies) are making. For example, the examination of the n-grams for medium companies in the AIA collection shows a high occurrence of words from the regulation semantic field, highlighting that stakeholder category's strong concern to establish a regulatory framework for AI and its call for clarity of rules.
Table 6
Top 10 n-grams for medium companies in AIA
N-gram                                        Cosine similarity
Practical regulatory implications AI agree    0.740
Need enhanced regulatory clarity AI           0.726
Appropriate regulatory options noted AI       0.723
Requires regulatory framework AI              0.717
AI regulation proposal explaining             0.716
Proposed AI regulation clearly                0.713
AI regulation summarized follows              0.713
AI regulation proposal conclusion             0.711
AI governance forthcoming                     0.708
Proposed AI regulation explanatory            0.707

4.3 Comparing the n-grams of stakeholders

After obtaining the most significant 20 n-grams for each category of stakeholders, we compute the similarity between each pair of stakeholders using the Jaccard index. The Jaccard index is defined as the ratio of the number of n-grams that the two categories have in common to the overall number of distinct n-grams appearing in at least one category. It takes values in the [0,1] range, where we get 0 if the two stakeholders have no n-grams in common, and 1 if they share exactly the same Top 20 n-grams.
We explored two different approaches, both based on the Top 20 n-grams: the words approach and the shingles approach. The Jaccard similarities returned by the two methods differ: the words approach yields a coarse-grained comparison, while the shingles approach, applied to the same data, yields a finer-grained one. For instance, consider the two sentences "I'm studying hard" and "I studied harder". Applying the words approach (with n=2), we obtain the n-grams I'm studying and studying hard for the first sentence, and I studied and studied harder for the second sentence, yielding a Jaccard similarity equal to 0. Using the shingles approach instead (with shingles of length 3), we obtain shingles like I'm, stu, ard, der, ied, ing, etc., resulting in a Jaccard similarity between the two sentences equal to 0.2608. In the words approach, we considered an n-gram \(g_{s,i}\), contained in the Top 20 n-grams of the stakeholder s, as a sequence of words \(g_{s,i} = (w_{s,i,0}, w_{s,i,1},\; \dots , w_{s,i,n})\), so we computed the set of distinct words of the stakeholder as \(W_s = \{w_{s,j,k} \quad |\quad \forall w_{s,j,k} \quad \not \exists \, (j, k) \ne (j^\prime , k^\prime ),\; w_{s,j^\prime ,k^\prime } = w_{s,j,k}\}\). After that step, we computed the Jaccard similarity \(J^w_{s, s^\prime }\) between all the pairs \((W_s, W_{s^\prime }) \;|\; s \ne s^\prime \).
In the shingles approach,16 we considered an n-gram \(g_{s,i}\), contained in the Top20 n-grams of the stakeholder s, as a simple sequence of characters \(g_{s,i} = (c_{s,i,0}, c_{s,i,1},\; \dots \;, c_{s,i,n})\), and we considered a shingle \(h_{s, i, j}\) of length l as a sub-sequence of these characters \(h_{s, i, j} = (c_{s,i,j}, c_{s,i,j+1},\; \dots , c_{s,i,j+l-1})\). Then we computed the set of shingles of length l for a stakeholder s as \(H_s = \{h_{s,i,j} \quad |\quad \forall h_{s,i,j} \quad \not \exists \, (i, j) \ne (i^\prime , j^\prime ),\; h_{s,i^\prime ,j^\prime } = h_{s,i,j}\}\). After that step, we computed the Jaccard similarity \(J^h_{s, s^\prime }\) between all the pairs \((H_s, H_{s^\prime }) \;|\; s \ne s^\prime \).
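A minimal sketch of the two variants, applied to the toy sentences of the example above (helper names are ours):

```python
# Sketch of the two Jaccard variants: sets of words vs sets of character shingles
# built from a stakeholder's Top 20 n-grams (here simply lists of strings).
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 0.0

def word_set(ngrams: list[str]) -> set[str]:
    return {w for g in ngrams for w in g.split()}                 # W_s

def shingle_set(ngrams: list[str], length: int = 5) -> set[str]:
    return {g[i:i + length] for g in ngrams                       # H_s
            for i in range(len(g) - length + 1)}

a, b = ["I'm studying hard"], ["I studied harder"]
print(jaccard(word_set(a), word_set(b)))              # words: no overlap
print(jaccard(shingle_set(a, 3), shingle_set(b, 3)))  # length-3 shingles: partial overlap
```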
We recommend using both the words and the shingles approach, because they highlight different similarities within the same set of data. This is even more evident with longer texts (like the feedback comments we are analysing here): with the words approach we compare the use of whole words, together with their context across the documents, while with the shingles approach we compare something closer to the roots and endings of words, and thus something closer to their meaning.
At the same time, if we consider the comparison carried out on the distributions of unigram frequencies using the Ruzicka index, we obtain a different similarity score: here we focus on the type of language used in the document. Indeed, the frequency distribution of the words used in a document can reveal many aspects of language use: for instance, it makes it easy to tell whether a text is formal or informal. In our case, we are interested in the way the same topic is described in different documents. For instance, a more critical view will result in a different frequency distribution than a positive view of the same topic. Even though we are not interested in labelling the types of distribution, we can still use them to compare the stakeholders' views. By joining the results obtained from the three methods, we can therefore perform a deeper comparison between the viewpoints of the different stakeholder categories.

5 Results

In this section, we report the results obtained by applying the techniques described in Sect. 4. We consider the results grouped by consultation, i.e., the corpus of feedback comments to AIA, DSA, and DMA, illustrating the results for unigrams first.
In order to represent the similarities for all pairs of stakeholders, we resort to heatmaps, where colours express the degree of similarity: high similarity is represented in green, and low similarity in red. Rows and columns in the heatmaps show the stakeholder categories being compared. The list of abbreviations is shown in Table 7.
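As an illustration, such a similarity matrix could be rendered with a sketch like the following, assuming seaborn/matplotlib and a pandas DataFrame sim indexed by the abbreviations of Table 7 (all names here are illustrative):

```python
# Sketch of the heatmap rendering: green for high similarity, red for low.
import seaborn as sns
import matplotlib.pyplot as plt

def plot_similarity(sim):
    """`sim` is a square DataFrame of pairwise similarity scores in [0, 1]."""
    ax = sns.heatmap(sim, cmap="RdYlGn", vmin=0.0, vmax=1.0, annot=True, fmt=".2f")
    ax.set_title("Pairwise similarity between stakeholder categories")
    plt.tight_layout()
    plt.show()
```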
Table 7
Abbreviations for stakeholder categories
Abbreviation    Meaning
BA-L            Large Business Associations
BA-M            Medium Business Associations
BA-S            Small Business Associations
BA-US           Ultra Small Business Associations
C-L             Large Companies
C-M             Medium Companies
C-S             Small Companies
C-US            Ultra Small Companies
CIT             EU Citizens
CO              Consumers' Organizations
NGO             Non-governmental Organizations
OTH             Others
PA              Public Authorities
TU-L            Large Trade Unions
TU-M            Medium Trade Unions
TU-S            Small Trade Unions
TU-US           Ultra Small Trade Unions
We aim to verify whether the clustering of opinions expressed by stakeholders taking part in EC consultations can be done automatically. We then wish to assess whether the EC performs that clustering accurately (see Sect. 2). In fact, when the EC asserts that a majority group, composed, for instance, of stakeholder categories X and Y, agrees or disagrees on an issue, it should have read all comments and grouped them by opinion (i.e., cutting across the stakeholder categories). For instance, the DSA Impact Assessment underlines that "businesses and business associations bemoaned that regulatory oversight is neither clear nor foreseeable" (see part. 1, p. 22–23 of European Commission (2020)). It also mentions that a majority of stakeholder groups ("including business associations, academic institutions and the general public") are sceptical about mandating cooperation of all platforms with national authorities (see part. 1, p. 46–47 of European Commission (2020)). The document does not mention the technique used to reorganise and assemble the different positions expressed by the authors. This only happens with the DMA, where details about how the EC combines opinions are given (see part. 2, p. 28–29 of European Commission (2020)). We contend not only that this operation can be done, but also that it should be, by employing a more objective and effective approach through n-grams and multiple indices. In order to carry out our analysis, we use the same stakeholder categories identified by the EC in the three consultations: business associations, companies, citizens, NGOs, public authorities, trade unions, and others.
There are some limitations to our procedure. When comments from one category are too few or too short to gather information, we do not consider them in the comparison (this happened, for instance, for comments by non-EU citizens, which appeared in only two documents). Another limitation of the comparison is that the EC contemplates a residual category of commentators named "Others", whose exact composition is unclear. Nonetheless, because the comments in this category are numerically relevant, we do consider them in our analysis, although the inferences we draw from them may not be significant.

5.1 What do similarities of single words (unigrams) tell us?

We analyze unigrams by considering first the weighted frequency (TF-IDF), in Sect. 5.1.1, and then the unweighted frequency lists (TF), in Sect. 5.1.2. While TF tells us how often a term appears in a document, TF-IDF (where the importance of a term is inversely related to its frequency across documents) tells us about the relative rarity of the documents where a term appears: a term that appears several times but in fewer documents is more specific to the debated issue.

5.1.1 Analysis of weighted frequency (TF-IDF)

a. TF-IDF for AIA
In Table 8, we observe the largest similarities of content (Ruzicka index equal to or above 0.50) between business associations of adjacent size (e.g., large with medium ones, and medium with small ones). A slightly surprising outcome is the similarity between small and medium business associations on one side and large companies on the other, which is unexpected if one considers that they usually represent very different interests.
Also, a significant similarity is observed between large companies and NGOs (R=0.52). This similarity is unexpected and may mean that they are looking into the same critical issues. This outcome is nonetheless very relevant, as it could confirm, by computational means, the EC's statement that a "vast majority of views", based "on the feedback mostly of business associations, companies and NGOs", prefer the second (out of three17) regulatory option proposed (see page 20 of European Commission (2020)). One should caution that conclusions cannot be derived from the analysis of unigrams alone, as they cannot tell much about the context. For this reason, we resort to a deeper analysis using n-grams in Sect. 5.1.3.
A significant similarity (\(R=0.56\)) also emerges, unexpectedly, between ultra-small business associations and large companies. The similarities between business associations are in any case larger than those between the other categories, which means that the positions expressed by the different stakeholder categories over the AIA tend, overall, to be quite divergent.
We find, unlike the EC, that business associations have high similarities with large companies (with a Ruzicka index ranging between 0.49 and 0.56). As to small and medium companies, our analysis shows that they do not share similar interests with large companies. Moreover, small and medium companies show different patterns from each other, making the merging of their positions dubious (see p. 15 in European Commission (2021)). We shall, therefore, check whether these results are confirmed using other methodologies (below, in Sect. 5.1.3).
At the other end, we find stakeholders that do not share significant interests with other categories, represented by values of the Ruzicka index lower than 0.1. This is the case for public authorities, which appear to be interested in issues of their own, quite apart from the interests of other stakeholders. We cannot make any hypothesis on this discrepancy, given that we do not know the methods used by the EC. To verify that our findings are correct, in the following we employ other methodologies.
Table 8
Ruzicka index in AIA for TF-IDF
https://static-content.springer.com/image/art%3A10.1007%2Fs10506-024-09426-6/MediaObjects/10506_2024_9426_Tab8_HTML.png
b. TF-IDF for DMA
A more limited set of similarities is observed in Table 9 for the DMA. No stakeholders exhibit \(R>0.50\), with the highest similarity (\(R=0.50\)) being observed, again, between small and micro business associations, and between small business associations and large companies. However, the similarities between business associations appear larger than those between other categories.
Our analysis and that of the EC yield quite opposite results for small and medium companies (which tend to be users of large platforms). While we find that they only have a Ruzicka index of \(R=0.26\), hinting that they do not share interests (this result is confirmed below, in Sect. 5.1.2.b), the EC credits these two stakeholders with sharing relevant opinions (see p. 16 in European Commission (2020)).
Moreover, we observe that citizens have low similarities with consumer organizations (\(R=0.25\)), which is surprising since those two groups should theoretically share the same interests and concerns.
Table 9
Ruzicka index in DMA for TF-IDF
https://static-content.springer.com/image/art%3A10.1007%2Fs10506-024-09426-6/MediaObjects/10506_2024_9426_Tab9_HTML.png
c. TF-IDF for DSA
The same picture appears in Table 10 for the similarities between business associations in the DSA collection. Moreover, we observe that citizens have an even lower similarity with consumer organizations (\(R=0.13\)) than what was observed in the DMA. This is surprising since those two groups should theoretically share the same interests and concerns (this is confirmed below, in Sect. 5.1.2c). If one considers that the DSA is aimed at protecting consumers and citizens, this result provides an alert that calls for further investigation.
Table 10
Ruzicka index in DSA for TF-IDF
https://static-content.springer.com/image/art%3A10.1007%2Fs10506-024-09426-6/MediaObjects/10506_2024_9426_Tab10_HTML.png

5.1.2 Analysis of raw term frequency (TF)

What changes when we consider unweighted frequency lists, i.e., the raw frequency of words rather than TF-IDF? We examine again the Ruzicka index by corpus.
a. Word count in the AIA
In Table 11, we see the data for AIA, which are to be compared with those in Table 8. Here we observe an even larger and more significant set of similarities between business associations (especially the large ones) and companies (of large and medium size). This is a relevant result since medium companies are generally associated with small and micro ones as to the type of interests they express, while our computation shows that they tend to align with large businesses instead (see some early results in Di Porto (2021)). With regard to the discrepancy we found in the analysis of business associations and medium companies (Table 8), the TF seems to confirm the EC instead. Indeed, besides large business associations, all other groups have values larger than 0.50. These conflicting results between the TF-IDF and TF methodologies generate an alert for further investigation through n-gram analysis (see below, Sect. 5.1.3).
Concerning NGOs, citizens and Others, the EC identifies a convergence of opinions (European Commission 2021). We find instead that similarity exists between citizens and NGOs (\(R=0.51\)) as well as NGOs and Others (\(R=0.55\)) but not so much between Citizens and Others (\(R=0.49\)). This finding is relevant as it highlights a possible limitation in the way these three groups of stakeholders have been merged to support one opinion. At the same time, however, we know little about who the ‘Others’ are, and therefore, we cannot infer much.
At the bottom end, trade unions share little or nothing with the other categories, confirming what we found above (paragraph a in Sect. 5.1.1 and Table 8).
Table 11
Ruzicka index in AIA for word counts
https://static-content.springer.com/image/art%3A10.1007%2Fs10506-024-09426-6/MediaObjects/10506_2024_9426_Tab11_HTML.png
b. Word count in the DMA
We can carry out the same comparison for the DMA, comparing Table 12 against Table 9. Here, we observe the same increase in similarity for business associations and companies, though on a smaller scale than for the AIA. The isolation of citizens and the small overlap of interests of trade unions are confirmed, as is the low similarity of opinions between small and medium companies (R=0.28), in line with what we found in Sect. 5.1.1.b.
Table 12
Ruzicka index in DMA for word counts
https://static-content.springer.com/image/art%3A10.1007%2Fs10506-024-09426-6/MediaObjects/10506_2024_9426_Tab12_HTML.png
c. Word count in the DSA
Finally, we have the DSA, where we compare the results for the raw frequency lists in Table 13 against the TF-IDF data of Table 10. Again, we observe a larger correlation between the interests of business associations (of medium/small size) and large companies when we consider raw frequencies. The set of isolated stakeholders (citizens and trade unions) is observed again, as is the low level of convergence between consumer organizations and citizens that we found in Sect. 5.1.1.c (R=0.15).
Table 13
Ruzicka index in DSA for word counts
https://static-content.springer.com/image/art%3A10.1007%2Fs10506-024-09426-6/MediaObjects/10506_2024_9426_Tab13_HTML.png
d. Overall considerations on the similarity of unigrams
Concluding the analysis of unigrams, we can observe that raw frequencies and TF-IDF data mostly agree, which means that there is a significant uniformity of opinions across feedback comments (i.e., there are no interests showing up in just one comment or very few comments).
A wider similarity is observed between medium/small business associations on one side and large companies on the other side. Instead, trade unions and citizens appear to show interests of their own, quite detached from the other stakeholders.
Also, this approach confirms our results (above, Sect. 5.1.1.a), showing (unlike the EC) divergent results for the AIA consultation, where big companies and business associations (of all sizes) show large similarities of interests. Also, medium and small companies appear not to share much, making them two separate opinion groups in most cases. In particular, medium companies seem to share more with large ones than with small and micro enterprises. This happens especially with the DMA, but is also observable in the DSA. A puzzling result is that of business associations (of all sizes) and medium companies: while the TF-IDF data contrast with the EC's analysis, the TF data seem to confirm it. Finally, unlike the EC's analysis, we find that citizens and consumer organizations exhibit a very low similarity of interests in the DSA.

5.1.3 Similarity of n-grams

We can now turn to n-grams. We report the similarity assessed by the Jaccard index for the three corpora, as we did for unigrams, applying both methods described in Sect. 4. N-grams allow us to capture more specific interests that cannot be properly represented by single words. We limit the analysis to the Top 20 n-grams, considering those as representative of the contents highlighted by stakeholders, while we consider lower-ranked n-grams less relevant.
a. N-grams in the AIA
We start with AIA in Table 14, which shows the results for the words approach, where words in the n-gram may be non-consecutive (see the description of the two approaches in Sect. 4). Here, we observe a similar picture to what we saw with unigrams in Tables 8 and 11. Even larger similarities are observed between medium-to-micro business associations and companies. The correlation is certainly stronger for large companies, but it appears to be significant even for medium and micro companies. On the other hand, trade unions are confirmed as having quite different interests than other stakeholders.
Additional correlations appear when we look for similarity considering shingles. In Table 15, we see that trade unions exhibit the greatest share of interests with business associations and micro companies, though the Jaccard index is anyway low (in the [0.29 - 0.43] range for large and medium business associations, respectively). We do not mention the large values appearing for the relationship between business associations and the Other category since the latter is a miscellaneous category, and we would not be able to draw significant conclusions.
Unlike the unigrams analysis (Tables 8 and 11), the n-grams analysis shows low similarity between large companies and NGOs (0.41 with the words approach and 0.39 using shingles, respectively, as can be seen in Tables 14 and 15). This inconsistency may be due to the fact that the two values are not directly comparable to each other, as the n-grams analysis based on the Jaccard index does not take into account the frequency of word combinations, but just their presence. This is a case where human intervention would be recommended.
In the same vein, the n-grams approach confirms the similarity between business associations (of all sizes) and medium companies resulting from the TF (but not the TF-IDF) methodology. An exception is that of large business associations, which show values lower than 0.50 in all the methodologies we employed.
Table 14
Jaccard index in AIA for single words
https://static-content.springer.com/image/art%3A10.1007%2Fs10506-024-09426-6/MediaObjects/10506_2024_9426_Tab14_HTML.png
Table 15
Jaccard index in AIA for shingles of size 5 in top 20 ngrams
https://static-content.springer.com/image/art%3A10.1007%2Fs10506-024-09426-6/MediaObjects/10506_2024_9426_Tab15_HTML.png
b. N-grams in the DMA and DSA
The closeness of interests is lower when we examine the DSA and DMA instead. In the case of single words (see Tables 16 and 17), we observe, however, values above 0.5 for the following pairs:
  • Small business associations and micro business associations (both DSA and DMA corpora);
  • Small business associations and large/medium companies (DMA only);
  • Citizens with small/micro business associations and large/medium companies (DMA only);
  • Large and medium companies (DMA only);
  • Large companies with NGOs (DSA only).
Table 16
Jaccard index in DMA for words in top 20 ngrams
https://static-content.springer.com/image/art%3A10.1007%2Fs10506-024-09426-6/MediaObjects/10506_2024_9426_Tab16_HTML.png
Table 17
Jaccard index in DSA for words in top 20 ngrams
https://static-content.springer.com/image/art%3A10.1007%2Fs10506-024-09426-6/MediaObjects/10506_2024_9426_Tab17_HTML.png
When we look at shingles in Tables 18 and 19, we observe the following connections:
  • Citizens with small/micro business associations (DMA only);
  • Small business associations with large/medium companies (DMA);
  • Micro business associations with larger business associations and large companies (DSA).
In both the shingles and the words approach, small and medium companies have very low similarity, confirming that these two categories of stakeholders should not be clustered together.
Table 18
Jaccard index in DMA for shingles of size 5 in top 20 ngrams
https://static-content.springer.com/image/art%3A10.1007%2Fs10506-024-09426-6/MediaObjects/10506_2024_9426_Tab18_HTML.png
Table 19
Jaccard index in DSA for shingles of size 5 in top 20 ngrams
https://static-content.springer.com/image/art%3A10.1007%2Fs10506-024-09426-6/MediaObjects/10506_2024_9426_Tab19_HTML.png

5.2 Topic modelling

In addition to extracting the topics of messages through n-grams, we have also applied topic modelling techniques, namely Latent Dirichlet Allocation (LDA) (Blei et al. 2003). Though LDA could appear preferable to extracting topics from n-grams, it has shortcomings that make it less suitable here. Those shortcomings are mainly related to the degree of subjective choice involved in employing it. First, the number of topics is left to the user, who typically has to set a trade-off between the internal consistency of topics and the need for a parsimonious description of the contents. Further, labelling the topics is again left to the user, who must find a topic name reflecting the contents, which may not be easy. Finally, LDA (the most well-known topic modelling technique) requires the removal of stopwords (which is typically accomplished by resorting to lists), but here there is a significant group of domain-specific stopwords to be removed, for which we did not employ a dedicated list (a comprehensive list of such stopwords is provided in Di Porto et al. (2022)).
In order to examine the impact of choosing the number of topics, we show the results obtained with 5, 10, and 20 topics in Tables 21, 22, and 23, respectively. Those results were obtained by jointly considering all the comments related to the three consultations, so as to reach a number of documents significant for LDA. For each topic, we show the top five words. The use of LDA highlights some problems in the set of comments. As can be seen, some topics (namely Topic 3 in the 5-topic case, Topic 0 in the 10-topic case, and Topics 3 and 15 in the 20-topic case) include words in languages other than English. This occurs despite the fact that the comments were labelled as expressed in English; being in the original text, those non-English words make their way into the LDA topics. If we look at the simpler 5-topic case, we see a significant degree of overlap. For example, Topic 0 and Topic 1 share three of the top five words, and Topics 2 and 4 share two words. Similar overlaps are observed in the 10-topic case, where the word data appears in four topics. Increasing the number of topics to 20 helps some significant subjects emerge, like Topic 16, which clearly highlights the competition theme. Other topics gain a specific connotation due to a single word or a couple of words, like Topic 14 with illegal or Topic 17 with intelligence and economic. On the other hand, extensive overlapping is even more present, e.g., between Topics 8, 9, and 13, which share four words out of five.
In order to provide a quantitative measure of the capability of LDA to give us an internally consistent set of topics, we have adopted the \(C_v\) metric proposed in Röder et al. (2015). This metric is based on a sliding window over words, with the windows treated as separate virtual documents used to estimate, for each word, a probability given by the ratio of documents where the word appears. The vectors obtained for the words are then grouped and compared in pairs using a cosine similarity score, and the results are averaged to obtain a final coherence score. We report the results for the topics output by LDA in our case in Table 20. We obtain values roughly around 0.5, which are neither very good nor exceptionally bad, but certainly do not qualify LDA as the best topic-identification tool in this context.
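A minimal sketch of how the LDA topics and the \(C_v\) coherence can be computed with gensim, assuming the same pre-processed token lists used above (parameter choices here are illustrative):

```python
# Sketch of LDA topic modelling plus C_v coherence with gensim.
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

def lda_with_coherence(texts: list[list[str]], num_topics: int):
    """`texts` is the list of pre-processed documents (lists of tokens)."""
    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, random_state=0)
    coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    return lda, coherence        # e.g., compare coherence for 5, 10, and 20 topics
```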
Table 20
Internal coherence measure for LDA topics
Topics    \(C_v\)
5         0.446
10        0.477
20        0.512
Table 21
Top five words for each topic given by applying LDA with 5 topics
Topic 0: AI (0.02762), Systems (0.01582), Risk (0.00906), Rights (0.00896), Use (0.00643)
Topic 1: AI (0.03261), Risk (0.00878), Systems (0.00830), Data (0.00827), European (0.00592)
Topic 2: Platform (0.00754), Online (0.00551), Platforms (0.00503), Digital (0.00401), Competition (0.00369)
Topic 3: de (0.01874), data (0.01515), Digital (0.01246), et (0.01236), la (0.01106)
Topic 4: services (0.00980), Platforms (0.00918), Data (0.00822), Online (0.00738), Content (0.00722)
Table 22
Top five words for each topic given by applying LDA with 10 topics
Topic 0: de (0.04042), et (0.02654), la (0.02375), des (0.02036), les (0.02028)
Topic 1: Ethics (0.00605), Human (0.00580), IEEE (0.00548), Global (0.00530), Systems (0.00529)
Topic 2: Children (0.00726), Opinion (0.00595), EN (0.00551), Digital (0.00345), Social (0.00336)
Topic 3: MVNO (0.00315), Mobile (0.00253), l (0.00241), AI (0.00222), MVNOs (0.00181)
Topic 4: Platforms (0.01080), Content (0.01070), Online (0.01054), Services (0.00880), Commission (0.00644)
Topic 5: Impact (0.00872), Assessment (0.00863), und (0.00850), Competition (0.00694), der (0.00683)
Topic 6: Data (0.01190), Rights (0.00763), AI (0.00584), ADM (0.00530), Processing (0.00525)
Topic 7: Data (0.02168), Platform (0.01051), Services (0.00993), Digital (0.00986), Business (0.00692)
Topic 8: AI (0.01001), Rights (0.00984), Data (0.00838), Systems (0.00775), Article (0.00724)
Topic 9: AI (0.04278), Systems (0.01228), Risk (0.01189), Data (0.00782), High (0.00740)
Table 23
Top five words for each topic given by applying LDA with 20 topics
Topic 0: Marketplaces (0.01197), Toys (0.00891), EU (0.00507), Toy (0.00428), Agoria (0.00397)
Topic 1: Ethics (0.00696), IEEE (0.00640), Systems (0.00618), Global (0.00589), systems (0.00575)
Topic 2: Opinion (0.02133), EN (0.01730), European (0.00707), 2 (0.00641), standards (0.00466)
Topic 3: de (0.04736), et (0.03113), la (0.02794), des (0.02397), les (0.02377)
Topic 4: Content (0.01312), Platforms (0.01217), Online (0.01147), Services (0.01033), Commission (0.00818)
Topic 5: Impact (0.01610), Assessment (0.01595), Competition (0.01143), DMA (0.00971), Innovation (0.00864)
Topic 6: Data (0.01583), AI (0.01562), Health (0.01410), Healthcare (0.01312), Patients (0.01007)
Topic 7: Data (0.01662), Platform (0.01264), Services (0.01254), Users (0.00942), Business (0.00854)
Topic 8: AI (0.02002), Systems (0.01366), Rights (0.01203), Data (0.00782), Risk (0.00732)
Topic 9: AI (0.04470), Risk (0.01241), Systems (0.01179), High (0.00764), Data (0.00749)
Topic 10: AI (0.02623), 2019 (0.00763), Bias (0.00730), Human (0.00607), Social (0.00514)
Topic 11: Processing (0.01040), AIaaS (0.00982), Data (0.00835), Providers (0.00820), Content (0.00623)
Topic 12: AI (0.00208), Systems (0.00103), Data (0.00079), European (0.00070), System (0.00055)
Topic 13: AI (0.02857), Data (0.01248), Risk (0.00774), systems (0.00641), Use (0.00612)
Topic 14: Online (0.01640), Platforms (0.00905), Illegal (0.00660), EU (0.00656), European (0.00525)
Topic 15: und (0.02134), der (0.01830), Die (0.01600), Von (0.01219), KI (0.01053)
Topic 16: Market (0.01119), Microsoft (0.01068), Competition (0.00834), Platform (0.00581), Antitrust (0.00573)
Topic 17: Data (0.03231), Digital (0.02597), Intelligence (0.00923), AI (0.00844), economic (0.00809)
Topic 18: BDVA (0.01275), DAIRO (0.00988), AI (0.00824), Data (0.00577), Ingka (0.00526)
Topic 19: Data (0.01274), Education (0.01224), Rights (0.01122), ADM (0.00902), European (0.00818)
All in all, LDA does not seem capable of capturing the topics actually discussed in the consultations, as these are buried in a much larger number of unspecific words. We deem the n-gram analysis more useful in this context.

6 Discussion and conclusions

To uncover similarity and dissimilarity of stakeholders' opinions expressed in consultations over EU legislative proposals, we proposed a methodology based on the use of embeddings (accomplished through the transformer architecture SBERT) and similarity indices (cosine similarity coupled with the Jaccard index, for sheer presence, or the Ruzicka index, to measure frequency of appearance).
We examined several variants, based on the use of TF-IDF versus the unweighted frequency of tokens, unigrams versus n-grams, and sequences of words versus sequences of characters (shingles), and further complemented the analysis with the LDA approach. These variants are not meant to be mutually exclusive: they provide different viewpoints and may highlight different similarities of interest. We deem that conclusions should be based on the joint use of those techniques, complementing traditional (human) ones. The different methodologies we employed provide some initial insights that question the way the EC performs group clustering. This is particularly relevant, as misclustering groups may alter the representativeness of stakeholders' positions, making some opinions appear majoritarian when they might not be. This, in turn, may degrade the quality of the information that feeds rule-making, rendering regulation based on that information possibly ineffective.
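The vectorisation variants can be illustrated with a short sketch; the scikit-learn vectorizers and the specific n-gram ranges and shingle length shown here are assumptions for illustration only.

```python
# Minimal sketch of the vectorisation variants mentioned above.
# Parameter choices (n-gram ranges, shingle length) are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "first stakeholder feedback on the proposed regulation",
    "second stakeholder feedback on the proposed regulation",
]

variants = {
    "raw unigram counts": CountVectorizer(ngram_range=(1, 1)),
    "tf-idf unigrams": TfidfVectorizer(ngram_range=(1, 1)),
    "tf-idf word n-grams (1-3)": TfidfVectorizer(ngram_range=(1, 3)),
    "tf-idf character shingles (5)": TfidfVectorizer(analyzer="char", ngram_range=(5, 5)),
}

for name, vectorizer in variants.items():
    matrix = vectorizer.fit_transform(docs)
    sim = cosine_similarity(matrix[0], matrix[1])[0, 0]
    print(f"{name}: cosine similarity = {sim:.3f}")
```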
As to the specific case studies reported here for the triplet of corpora (AIA, DMA, and DSA), we have seen that the isolation of trade unions (i.e., the uniqueness of their interests) stands out across all documents and methodologies. However, we should caution that the number of comments authored by trade unions is very small (15 for the AIA, 3 for the DMA and 6 for the DSA). In these cases, it would be advisable to complement the notice-and-comment methodology of consultation (which the EC used) with other means, such as focus groups or expert panels.
A striking result is the low similarity between citizens and consumer associations, which occurs across all datasets and methodologies. Here too, however, the small corpus (14 comments for the AIA, 12 for the DMA and only 3 for the DSA) might affect the output and, more generally, calls for greater investment in individual engagement (e.g. through seminars). This result is worth noting and should be carefully considered by the EC.
A low similarity can also be observed between small and medium companies, a result that stands out and is confirmed in all methodologies and across all consultations. This result is especially worth noting as it questions whether the very cluster of SMEs really exists. SMEs are considered cumulatively in the Better Regulation toolbox. For instance, the mandatory SME test requires an assessment of the impact of prospective regulation on small and medium companies taken together (see Tool 23 in EC (2023)). This is relevant because there are plenty of cases where regulation is targeted at SMEs taken together, such as mitigation measures, exemptions, or exceptions. Our results suggest that regulation could instead be targeted by considering small companies and medium ones as belonging to two different clusters.
On the other hand, significant similarities of interests exist, roughly across all documents, between business associations (excluding the large ones) and large companies.
More insights can be gathered by looking at each consultation separately. With reference to the DSA, large companies and NGOs have high similarity indices in all the approaches (ranging from 0.49 to 0.55). This finding is unexpected, since these two groups should represent opposing interests, with NGOs protecting consumers' rights against large companies by flagging illegal and harmful activities for removal. In the DMA, instead, we register contrasting results for the same pair (large companies and NGOs): while they seem to share a lot in the unigrams analysis (with the index ranging from 0.52 to 0.55), the similarity index drops to 0.39–0.41 using the wider focus associated with the n-grams approach. This inconsistency suggests further investigation through human intervention.
In a few other cases as well, our approaches provide diverging outcomes. This is the case with the NGOs–large companies pair, where high similarity is observed in the unigrams approach while very low similarity emerges from the n-grams technique.
When diverging results emerge, further investigation is needed. This means that any automatic analysis must be supported by human supervision and intervention, given that the two approaches look at different things. We also notice opposite results for the business associations–medium companies pair (for the AIA): while the TF-IDF approach shows low similarity, all other techniques indicate that business associations (except for the large ones) share interests with medium companies. In such cases, we should understand whether the lower values obtained through TF-IDF are due to the extremely low number of documents containing the signals (words or their combinations) of opinion-sharing, and whether those documents are really significant or represent isolated views.
These observations could support a rearrangement of the way the EC currently clusters stakeholders' opinions. In some cases, this may call for aggregation, which would avoid excessive fragmentation of the set of stakeholders' opinions unsupported by real differences in interests. Though the figures provided by our methodologies span a continuum of values in the (0,1) range, the analysis suggests that where the similarity index is larger than a given threshold (e.g. 0.5) group clustering might not be problematic. Conversely, if the index is very low (e.g. 0.2), special attention should be paid before proceeding with group clustering.
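As an illustration only, such thresholds could be operationalised as in the following sketch; the group labels and index values are hypothetical, while the 0.5 and 0.2 cut-offs follow the discussion above.

```python
# Minimal sketch: flag group pairs by similarity thresholds.
# The pair labels and index values below are hypothetical placeholders;
# only the 0.5 / 0.2 cut-offs come from the discussion above.
pair_similarity = {
    ("group A", "group B"): 0.62,
    ("group C", "group D"): 0.34,
    ("group E", "group F"): 0.18,
}

for (left, right), index in pair_similarity.items():
    if index >= 0.5:
        verdict = "clustering together looks unproblematic"
    elif index <= 0.2:
        verdict = "special attention needed before clustering together"
    else:
        verdict = "inconclusive: human review advised"
    print(f"{left} / {right}: {index:.2f} -> {verdict}")
```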
Finally, we admit that the dataset we employed is far from clean, compromising the quality of the data and, eventually, of the results. Relying on pdf-to-text conversion has introduced errors in the text, in addition to the inaccuracies already present in the original documents. It would be essential to work on native plain-text versions in the future, so as to feed our tool with error-free text.
In order to have clean datasets and reduce the risk of errors, the EC should ensure that inputs to consultations are always in plain text (e.g. avoiding the use of logos or PDF files for attachments). For instance, it could make available an open data platform where all comments are uploaded and long feedback is copy-and-pasted (instead of attached as PDF files). This would greatly enhance transparency while allowing researchers (vetted or not) to evaluate the EC's action (achieving greater accountability). Participation would also increase, as no comments would be discarded for technical reasons.
The EC is advised to disclose all computational tools and methodologies employed to analyze inputs to consultations. Ideally, it should avoid using computational tools on an occasional basis and instead employ convergent methodologies in all its consultations. Ultimately, it could identify a set of replicable techniques and explain when one should be preferred to another. This effort should find its way into the Better Regulation toolbox (EC 2023).
In conclusion, by allowing the correct clustering of opinions, these techniques can help reduce the misclustering of stakeholder groups and consequently allow greater representativeness of the different positions expressed in consultations (Fig. 2).
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Appendices

Appendix A: Comparison with human annotators

In supervised learning, it is common to compare the performance of classification tasks carried out by machines with the output of human annotators, assuming the latter provide the ground truth. Here, we are faced with the task of verifying the effectiveness of the present categorization of parties according to the established EU scheme, based on the contents of their submissions. We have completed that task through text mining techniques and further validated the results through human annotation. For the purpose of annotation, we selected a subset of comments to form couples and submitted them to human annotators for their judgment about the similarity of their contents. It is to be noted that the number of couples we may form with n documents grows quadratically with n (precisely, we have \(\frac{n(n-1)}{2}\) couples). For example, for 300 documents, we would end up with 44,850 couples to submit to our human annotators. For that reason, we selected a number of documents compatible with the available annotation resources.
Further, our algorithm outputs a similarity judgment based on an index that may take any value in the [0,1] range. We cannot ask human annotators to deliver their judgments on such a fine scale; we must resort to a discrete-value output, for which we can use a Likert scale. In this paper, we adopt a 5-level Likert scale, where 1 corresponds to not similar at all and 5 corresponds to very similar, with similarity referring to the contents rather than the narrative structure. We also have to consider that the documents in our dataset may be very long, as shown in Table 2, and human readers find it difficult to go over a long document and faithfully record the most important topics discussed in it (Sharma and Deep 2014).
In order to select the documents to be submitted to human annotators, we first identified the groups containing more than 20 documents each. The rationale for that preliminary selection is that we wish to have a number of documents large enough to obtain a statistically significant comparison. The number of groups having enough documents turned out to be 6 for the AIA, 3 for the DMA, and 4 for the DSA. For each of those groups, we selected 5 documents, precisely those exhibiting the largest similarity with the average document in the group, as they are the most representative documents in the group. Since we can match any document in a group with all the documents in the other groups within the same consultation, and we can perform \(5^2=25\) comparisons between the 5 documents in a group and the 5 documents in another group, the number of couples for g groups is \(25\frac{g(g-1)}{2}\). The resulting number of couples is then 375 for the AIA, 75 for the DMA, and 150 for the DSA, for a grand total of 600 couples. In order to grant greater independence of judgment and reduce the incidence of biases, we ran two rounds of annotation per feedback pair (two annotators assessed 300 feedback pairs and two other annotators assessed the remaining 300).
We can then compare the cosine similarity values output by our algorithm with the levels assigned by human annotators for the same couples. As hinted above, we are comparing a continuous numerical output provided by the algorithm with the ordinal output provided by humans, so the comparison is far from perfect. Nonetheless, we expect the algorithm output to reflect the human output, in that the cosine similarity should grow with the level in the Likert scale.
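To make the selection and counting steps concrete, here is a minimal sketch under stated assumptions: embeddings and group assignments are taken as already available, and the helper names (representative_docs, couples_for_groups) are hypothetical rather than functions from our pipeline.

```python
# Minimal sketch of the document selection for human annotation:
# pick the k documents closest to the group centroid and count the
# resulting cross-group couples. Embeddings are assumed to be available.
import numpy as np

def representative_docs(embeddings, k=5):
    """Indices of the k documents most similar (cosine) to the group average."""
    centroid = embeddings.mean(axis=0)
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(centroid)
    sims = embeddings @ centroid / norms
    return np.argsort(-sims)[:k]

def couples_for_groups(num_groups, docs_per_group=5):
    """Cross-group couples: docs_per_group^2 * g * (g - 1) / 2."""
    return docs_per_group ** 2 * num_groups * (num_groups - 1) // 2

# hypothetical embeddings for one group of 30 documents
rng = np.random.default_rng(0)
demo_embeddings = rng.normal(size=(30, 8))
print(representative_docs(demo_embeddings))

# 6 eligible groups for AIA, 3 for DMA, 4 for DSA -> 375 + 75 + 150 = 600 couples
print([couples_for_groups(g) for g in (6, 3, 4)])
```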
We represent the comparison by drawing boxplots for all the consultations and the five levels of the Likert scale. The bounds of each box represent the first and third quartiles, while the horizontal line within the box represents the median. The whiskers represent the 10th and 90th percentiles, respectively, with outliers below the 10th percentile and above the 90th percentile represented by empty circles. We report the results in Figs. 3 and 4 for the AIA, Figs. 5 and 6 for the DMA, and Figs. 7 and 8 for the DSA. In most cases, a monotonic trend is observed for the median, though a significant degree of overlap exists between the cosine similarity values for different Likert levels. At the same time, we must observe that, while the Likert values output by humans are scattered among the five levels, the similarity figures output by the algorithm are concentrated within a narrow range of values, roughly between 0.88 and 0.96. Human annotators are thus called to distinguish between cases where the algorithmic distance may be very small.
In order to obtain a numerical index assessing the degree of consistency between the human annotations and the results of word embedding and cosine similarity, we employed a ranking agreement index, computed as follows. For each couple of documents we have both a cosine similarity value and a Likert-scale similarity level, and we expect the Likert-based level to be consistent with the cosine similarity. We can measure the degree of consistency when we consider a different couple of documents, for which we again have a cosine similarity value and a Likert level. We call those two couples of documents Couple A and Couple B, respectively. Consistency is achieved if the Likert level of Couple B (A) is at least as high as that of Couple A (B) when the cosine similarity of Couple B (A) is larger than that of Couple A (B). We compute the fraction of times consistency is achieved over all the Couples A and B that we can form, using the difference in cosine similarity as a weight (the larger the cosine similarity difference, the more we expect the Likert levels to be consistent). The result is the percentage of consistency, which, of course, varies in the [0, 100%] range. We performed the consistency computation for both human annotators, obtaining 61.5% and 60%, respectively.
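A minimal sketch of this agreement computation is given below; the function name consistency_percentage and the example inputs are hypothetical, but the weighting scheme follows the description above.

```python
# Minimal sketch of the ranking agreement index described above:
# for every pair of couples, check whether the ordering of Likert levels
# matches the ordering of cosine similarities, weighting each check by the
# cosine-similarity difference. Inputs are one value per annotated couple.
from itertools import combinations

def consistency_percentage(cosine_values, likert_levels):
    agree_weight = 0.0
    total_weight = 0.0
    for i, j in combinations(range(len(cosine_values)), 2):
        diff = abs(cosine_values[i] - cosine_values[j])
        if diff == 0:
            continue  # no ordering to check when similarities are equal
        if cosine_values[i] > cosine_values[j]:
            consistent = likert_levels[i] >= likert_levels[j]
        else:
            consistent = likert_levels[j] >= likert_levels[i]
        total_weight += diff
        agree_weight += diff if consistent else 0.0
    return 100.0 * agree_weight / total_weight if total_weight else 0.0

# hypothetical example: cosine values per couple and annotator Likert levels
print(consistency_percentage([0.91, 0.88, 0.95], [4, 2, 5]))
```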
Footnotes
1 See their webpage at https://atlasti.com.
2 Used in the consultation over the DMA: below, see p. 29 in European Commission (2020).
4 Both software packages are recommended to identify and remove duplicates. See https://community.coda.io/t/button-to-delete-duplicates/24138 and http://www.stata.com.
6 For closed questions, see tool 54, pt. 1.3.1 in EC (2023): "when analysing and presenting the results, distinction should be made between the different stakeholder categories that contributed to the consultation." For open questions, see tool 54, pt. 1.3.2 in EC (2023): textual input to open questions is, "compared to quantitative data, richer and more complex and therefore" ... can only be treated through "systematic analysis" (i.e. not statistically). The latter helps prevent the biases to which qualitative data, more than quantitative data, is extremely prone. Under the approach to basic analysis, "responses would most commonly be grouped into broad stakeholder groups (typically citizens/NGOs, international, national, local and/or regional authorities, industry, and others). Under the simplest approach, responses from a particular group for a particular question could then be quickly read to get an overview of the two or three most recurrent points being made."
7 Recorded as Regulation (EU) 2022/2065 of 19 October 2022 (http://data.europa.eu/eli/reg/2022/2065/oj).
8 Recorded as Regulation (EU) 2022/1925 of 14 September 2022 (http://data.europa.eu/eli/reg/2022/1925/oj).
9 Recorded as the Proposal for a Regulation laying down harmonised rules on Artificial Intelligence of 21 April 2021, COM/2021/206 final (https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=celex:52021PC0206).
12 6 August 2021.
15 Text concatenation is the operation of joining words end-to-end. For example, the concatenation of snow and ball is snowball.
16 Shingles are contiguous subsequences of characters within a document.
17 The three regulatory options proposed were: (i) legislation for specific AI applications; (ii) a horizontal framework for high-risk AI applications; and (iii) a horizontal framework for all AI applications.
References
Aizawa A (2003) An information-theoretic perspective of TF-IDF measures. Inf Process Manag 39(1):45–65
Alm CO (2011) Subjective natural language problems: motivations, applications, characterizations, and implications. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp 107–112
Balla SJ, Bull R, Dooling BC, Hammond E, Livermore MA, Herz M, Noveck BS (2022) Responding to mass, computer-generated, and malattributed comments. Admin Law Rev 74(1):95
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3(Jan):993–1022
Braun C, Busuioc M (2020) Stakeholder engagement as a conduit for regulatory legitimacy? Taylor & Francis, Oxfordshire
Cha S-H (2007) Comprehensive survey on distance/similarity measures between probability density functions. City 1(2):1
Coglianese C (2004) The internet and citizen participation in rulemaking. ISJLP 1:33
Devlin J, Chang M-W, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
Di Porto F (2021) Artificial intelligence and competition law. A computational analysis of the DMA and DSA. Concurrences 3
Di Porto F, Grote T, Volpi G, Invernizzi R (2022) Talking at cross purposes? A computational analysis of the debate on informational duties in the Digital Services and the Digital Markets Acts. Technology and Regulation 10
Engstrom DF, Ho DE, Sharkey CM, Cuéllar M-F (2020) Government by algorithm: artificial intelligence in federal administrative agencies. NYU School of Law, Public Law Research Paper (20–54)
Livermore MA, Eidelman V, Grom B (2017) Computationally assisted regulatory participation. Notre Dame L Rev 93:977
Rangone N (2022) Improving consultation to ensure the European Union's democratic legitimacy: from traditional procedural requirements to behavioural insights. European Law Journal
Rangone N (2023) Artificial intelligence challenging core state functions: a focus on law-making and rule-making. Revista de Derecho Público: teoría y método 8:95–126
Röder M, Both A, Hinneburg A (2015) Exploring the space of topic coherence measures. In: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining – WSDM '15. ACM Press, New York
Shapiro S (2018) Can analysis of policy decisions spur participation? J Benefit-Cost Anal 9(3):435–461
Sharma AD, Deep S (2014) Too long–didn't read: a practical web based approach towards text summarization. In: Applied Algorithms: First International Conference, ICAA 2014, Kolkata, India, 2014, Proceedings 1, pp 198–208. Springer
Vinnarasu A, Jose DV (2019) Speech to text conversion and summarization for effective understanding and documentation. Int J Electr Comput Eng 9(5):3642
Wang B, Wang A, Chen F, Wang Y, Kuo C-CJ (2019) Evaluating word embedding models: methods and experimental results. APSIPA Trans Signal Inf Process 8:19
Wu W, Chen W, Zhang C, Woodland P (2024) Modelling variability in human annotator simulation. In: Findings of the Association for Computational Linguistics ACL 2024, pp 1139–1157
Metadata
Title: Mining EU consultations through AI
Authors: Fabiana Di Porto, Paolo Fantozzi, Maurizio Naldi, Nicoletta Rangone
Publication date: 28.11.2024
Publisher: Springer Netherlands
Published in: Artificial Intelligence and Law
Print ISSN: 0924-8463
Electronic ISSN: 1572-8382
DOI: https://doi.org/10.1007/s10506-024-09426-6