
About this book

Focusing on methodologies, applications and challenges of textual data analysis and related fields, this book gathers selected and peer-reviewed contributions presented at the 14th International Conference on Statistical Analysis of Textual Data (JADT 2018), held in Rome, Italy, on June 12-15, 2018. Statistical analysis of textual data is a multidisciplinary field of research that has been mainly fostered by statistics, linguistics, mathematics and computer science. The respective sections of the book focus on techniques, methods and models for text analytics, dictionaries and specific languages, multilingual text analysis, and the applications of text analytics. The interdisciplinary contributions cover topics including text mining, text analytics, network text analysis, information extraction, sentiment analysis, web mining, social media analysis, corpus and quantitative linguistics, statistical and computational methods, and textual data in sociology, psychology, politics, law and marketing.

Table of Contents


Techniques, Methods and Models


Text Analytics: Present, Past and Future

Text analytics is a broad umbrella covering countless techniques, models and methods for the automatic and quantitative analysis of textual data. Its development can be traced back to the introduction of the computer, although its precursors date back further. The importance of text analysis has grown over time and has been greatly enriched by the spread of the Internet and social media, which constitute an important flow of information, also in support of official statistics. This paper aims to describe, through a timeline, the past, the present and the possible future of text analysis. Moreover, the main macro-steps of a practical study are illustrated.
Domenica Fioredistella Iezzi, Livia Celardo

Unsupervised Analytic Strategies to Explore Large Document Collections

The technological revolution of recent years has made it possible to process different kinds of data to study several real-world phenomena. Alongside traditional data sources, textual data have become more and more critical in many research domains, posing new challenges to scholars working with documents written in natural language. In this paper, we explain how to prepare a set of documents for quantitative analyses and compare the different approaches widely used to extract information automatically, discussing their advantages and disadvantages.
Michelangelo Misuraca, Maria Spano
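
As a rough illustration of what preparing a set of documents for quantitative analysis can involve, the sketch below tokenizes two toy documents and builds a document-term matrix. The stop-word list, the tokenization rule and the sample documents are assumptions made for this example and are not taken from the chapter.

```python
import re
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

# Toy documents (invented for the example).
docs = [
    "The analysis of text data.",
    "Text data and text mining in the social sciences.",
]
vocab, dtm = document_term_matrix(docs)
```

Each row of `dtm` describes one document and each column one vocabulary term, which is the usual starting point for clustering or factorial methods.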

Studying Narrative Flows by Text Analysis and Network Text Analysis

The Case Study of Italian Young People’s Perception of Work in Digital Contexts
The paper presents a joint use of text analysis and network text analysis in order to study narrations. Text analysis makes it possible to detect the main themes in the narrations and hence the processes of signification, while network text analysis makes it possible to track down the relations between the linguistic expressions of a text, thereby identifying the path of the flow of thoughts. Using the two methods jointly, it is possible not only to explore the content of narrations but also, starting from the words and concepts with higher semantic strength, to identify the processes of signification. To this purpose, we present a study aiming to understand high school students’ perception of employment precariousness in Italy.
Cristiano Felaco, Anna Parola

Key Passages: From Statistics to Deep Learning

This contribution compares statistical analysis and deep learning approaches to textual data. The extraction of key passages using statistics and deep learning is implemented using the Hyperbase software. An evaluation of the underlying calculations is given by using examples from two different languages, French and Latin. Our hypothesis is that deep learning is not only sensitive to word frequency but also to more complex phenomena containing linguistic features that pose problems for statistical approaches. These linguistic patterns, also known as motives (Mellet and Longrée, Belg J Linguist 23:161–173, 2009 [9]), are essential for highlighting key passages. If confirmed, this hypothesis would provide us with a better understanding of the deep learning black box. Moreover, it would bring new ways of understanding and interpreting texts. Thus, this paper introduces a novel approach to explore the hidden layers of a convolutional neural network, trying to explain which linguistic features the network relies on to perform the classification task. This explanation attempt is the major contribution of this work. Finally, in order to show the potential of our deep learning approach, when testing it on the two corpora (French and Latin), we compare the obtained linguistic features with those highlighted by a standard text mining technique (z-score computing).
Laurent Vanni, Marco Corneli, Dominique Longrée, Damon Mayaffre, Frédéric Precioso
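
The z-score computation mentioned in the abstract can be sketched as follows. This is a minimal illustration that assumes the common normal approximation to the binomial model of word specificity; the exact formula used in Hyperbase is not given in the abstract, and the function name and sample figures are hypothetical.

```python
import math

def z_score(k, n, F, N):
    """Z-score of a word observed k times in a part of n tokens,
    given F occurrences in the whole corpus of N tokens.
    Assumes a normal approximation to the binomial distribution."""
    p = F / N                           # corpus-wide relative frequency
    expected = n * p                    # expected count in the part
    sd = math.sqrt(n * p * (1 - p))     # binomial standard deviation
    return (k - expected) / sd

# Invented figures: the word occurs 30 times in a 1,000-token part
# and 100 times in a 10,000-token corpus.
z = z_score(k=30, n=1000, F=100, N=10000)
```

A large positive value marks a word as over-represented in the part relative to the whole corpus; a value near zero means its frequency is about what the corpus-wide rate predicts.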

Concentration Indices for Dialogue Dominance Phenomena in TV Series: The Case of the Big Bang Theory

Dialogues in a TV series (especially in sitcoms) represent the main interaction among characters. Dialogues may exhibit concentration, with some characters dominating, or instead show a choral action, where all characters contribute equally to the conversation. The degree of concentration represents a distinctive feature (a signature) of the TV series. In this paper, we advocate the use of a concentration index (the Hirschman–Herfindahl Index) to examine dominance phenomena in TV series and apply it to the Big Bang Theory TV series. The use of the concentration index allows us to reveal a declining trend in dialogue concentration as well as the decline of some characters and the emergence of others. We find the decline in dominance to be highly correlated with a decline in popularity. A stronger concentration is present for episodes (i.e. when analysing the concentration of episodes rather than of speaking lines), where the number of characters that dominate episodes is quite small.
Andrea Fronzetti Colladon, Maurizio Naldi
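
The Hirschman–Herfindahl Index is the sum of the squared shares of each participant, so it ranges from 1/n (all n characters speak equally) to 1 (a single character speaks). The sketch below computes it on invented speaking-line counts; the character labels and figures are hypothetical and do not come from the chapter.

```python
from collections import Counter

def hhi(counts):
    """Hirschman-Herfindahl Index: sum of squared shares (between 1/n and 1)."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)

# Hypothetical speaking-line counts per character in one episode.
lines = Counter({"A": 50, "B": 30, "C": 20})
index = hhi(lines.values())

# A perfectly choral episode with four equally active characters.
choral = hhi([10, 10, 10, 10])
```

Here the shares are 0.5, 0.3 and 0.2, so the index is 0.25 + 0.09 + 0.04 = 0.38, while the choral case gives the minimum value 1/4.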

A Conversation Analysis of Interactions in Personal Finance Forums

Online forums represent a widely used means of getting information and making decisions concerning personal finance. The dynamics of conversations taking place in a forum may be examined through social network analysis (SNA). Two major personal finance forums are analysed here through an SNA approach, also borrowing metrics from Industrial Economics such as the Hirschman–Herfindahl Index (HHI) and the CR4 index. The major aim is to analyse the presence of dominance phenomena among the speakers. A social network is built out of the sequence of posts and replies. Though no heavy dominance is found on the basis of the HHI, a few speakers submit most posts and exhibit aggressive behaviour by engaging in monologues (submitting a chain of posts) and tit-for-tats (immediate reply by a speaker to a post replying to that same speaker). Most replies occur within a short timeframe (within half an hour).
Maurizio Naldi
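
The CR4 index and the two behaviours named in the abstract can be sketched on a toy thread. The example assumes, for simplicity, that a thread is an ordered list of author names in which every post replies to the one before it; the chapter builds a full social network from posts and replies, so this is only an approximation, and the author names are invented.

```python
from collections import Counter
from itertools import groupby

def cr4(counts):
    """CR4: combined share of the four largest contributors."""
    total = sum(counts)
    return sum(sorted(counts, reverse=True)[:4]) / total

def monologues(authors):
    """Runs of two or more consecutive posts by the same author, as (author, length)."""
    runs = ((a, len(list(g))) for a, g in groupby(authors))
    return [(a, n) for a, n in runs if n >= 2]

def tit_for_tats(authors):
    """Count A, B, A patterns (A != B), assuming each post replies to the previous one."""
    triples = zip(authors, authors[1:], authors[2:])
    return sum(1 for x, y, z in triples if x == z and x != y)

# Hypothetical thread: one author name per post, in posting order.
thread = ["ann", "ann", "bob", "ann", "cat", "dan", "eve", "bob"]
posts = Counter(thread)
share_top4 = cr4(posts.values())
```

In this thread the four most active authors write 7 of the 8 posts, there is one monologue (two consecutive posts by the same author) and one tit-for-tat.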

Dictionaries and Specific Languages


Big Corpora and Text Clustering: The Italian Accounting Jurisdiction Case

Currently, big corpora coming from Open Government Data projects make a large number of documents quickly available in several research areas. In the legal field, a great deal of information is stored, which has made it necessary to develop a specific dictionary to classify text data. The paper aims to present a workflow to visualize and classify big legal corpora, identifying the main steps of a chain process. The analyzed corpus is composed of 123,989 judgments, in Italian, published by the Court of Audit from 2010 to 2018.
Domenica Fioredistella Iezzi, Rosamaria Berté

Lexicometric Paradoxes of Frequency: Comparing VoBIS and NVdB

An emblematic “anti-spoken” canon, the linguistic variety offered as the standard model in Italian schools following national unification, while showing some signs of evolution over time, has remained relatively artificial and resistant to change. The lexical frequencies documented for scholastic Italian can therefore be intrinsically unaligned with those of the base vocabulary as well as with data for apparently similar varieties of Italian. This implies a need for interpretive models that assess quantitative data in the light of the complex paradigmatic relations among potential competing usages as well as the multi-layered connections between the number and type of meanings observed across different contexts of use. In this paper, I review the scholastic Italian modelled by teachers in the first 150 years after unification, with a view to assessing the strengths and weaknesses of applying lexicometric parameters to a linguistic variety that was targeted at an inexpert audience but specialist in nature, informed by lofty ideals but conditioned by practical educational needs, and constantly evolving yet resistant to the pull of contemporary living varieties.
Luisa Revelli

Emotions and Dense Words in Emotional Text Analysis: An Invariant or a Contextual Relationship?

Emotional textual analysis (ETA) analyses the symbolic level of texts as a part of applied research and interventions. It is based on the study of the association of dense words, words that convey most of the emotional components of texts. In this approach, language is thought of as an organizer of the relationship between the individual contributor of the text and his or her context, rather than as a detector of the individual’s emotions. This paper presents the findings of a research project whose goal is to advance the reflection on the theoretical construct of dense words based on its use in psychosocial interventions. Seventy-nine dictionaries of 79 ETAs, which were conducted over a period of 11 years by two different groups of researchers, were analysed in order to: (1) measure the concordance between the two groups of researchers in identifying dense and non-dense words in the analysed texts; and (2) explore, by correspondence factorial analysis, the relationship between the range of consistency of the words and their presence as dense or non-dense in the groups of dictionaries that were aggregated whenever their original corpuses were thematically similar. The results of these analyses show the level of agreement among researchers in identifying emotional density, confirm the hypothesis that there are affective invariants in words and suggest that the relationship between words and emotions can be represented along a continuum in which the context assumes a discriminating function in identifying density or non-density. Thus, it would seem that the construct of density has fuzzy boundaries and is not dichotomously polarized as in sentiment analysis, which classifies words as positive or negative.
Nadia Battisti, Francesca Romana Dolcetti
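
The abstract does not name the statistic used to measure concordance between the two groups of researchers. As one common option for two raters and categorical labels, the sketch below computes Cohen's kappa; the choice of kappa, the function name and the sample labels are all assumptions made for illustration.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Hypothetical classifications of six words by the two groups.
group_1 = ["dense", "dense", "non-dense", "non-dense", "dense", "non-dense"]
group_2 = ["dense", "non-dense", "non-dense", "non-dense", "dense", "dense"]
kappa = cohen_kappa(group_1, group_2)
```

The groups agree on 4 of 6 words while chance agreement is 0.5, so kappa is (0.667 - 0.5) / 0.5, that is one third.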

Text Mining of Public Administration Documents: Preliminary Results on Judgments

In the public administration (PA), most of the information (both qualitative and quantitative) is contained in texts in natural language, and the texts often “hide” within them many numerical or quantifiable data. Among the documents of the PA, the judgments, which can contain information on criminal events that have an economic dimension, are particularly important: there is a flow of money generated by corruption or bribery, and public procurement can be subject to auction disruption, just to mention some topics often brought to public opinion by the media. A possible source of data is the sentences issued by the Court of Cassation, which can be obtained from the Web site http://www.italgiure.giustizia.it/sncass/. In this study, we present the preliminary results on 308 sentences of the Court of Cassation: information obtained with text mining (the names of companies, their legal form and the company name) is linked with additional information from business registers, creating the conditions for future economic analysis of criminal events.
Maria Francesca Romano, Antonella Baldassarini, Pasquale Pavone

Using the First Axis of a Correspondence Analysis as an Analytic Tool

Application to Establish and Define an Orality Gradient for Genres of Medieval French Texts
Our corpus of medieval French texts is divided into 59 discourse units (DUs) which cross text genres and spoken versus non-spoken text chunks (as tagged with q and sp TEI tags). A correspondence analysis (CA) performed on selected POS tags indicates orality as the main dimension of variation across DUs. Orality prevails over textual features which could fit in a one-dimensional model as well, such as text form (verse vs. prose) or time (composition century). We then design several methodological paths to investigate this gradient as computed by the CA first axis. Bootstrap is used to check the stability of observations; gradient-ordered barplots provide both a synthetic and analytic view of the correlation of any variable with the gradient; a way is also found to characterize the gradient poles (here, more-oral or less-oral poles) not only with the POS used for the CA analysis, but also with word forms, in order to get a more accurate and lexical description. This methodology could be transposed to other data with a potential gradient structure.
Bénédicte Pincemin, Alexei Lavrentiev, Céline Guillot-Barbance

Discursive Functions of French Modal Forms: What Can Correspondence Analysis Tell Us About Genre and Diachronic Variation?

Our aim is to describe discursive functions of a set of French modal forms by establishing their combinatory profiles based on their co-occurrence with different connectors. We then compare these profiles using correspondence analysis in order to find evidence of genre and diachronic variation. The use of these forms is explored in contexts of informative discourse within two distinctly different genres—contemporary written press and encyclopedic discourse—as well as within two diachronic spans.
Corinne Rossari, Ljiljana Dolamic, Annalena Hütsch, Claudia Ricci, Dennis Wandel

Multilingual Text Analysis


How to Think About Finding a Sign for a Multilingual and Multimodal French-Written/French Sign Language Platform?

This article examines the access to the signs in French sign language (LSF) within a corpus taken from the collaborative platform Ocelles, from a multilingual French bijective/LSF perspective. There is currently no monolingual dictionary in SL, so deaf users must necessarily master the written language of by relying on the maximal resemblance sequence of signs/experience, or use the lexical unit without resemblance with the referent. This model, which is also integrative, therefore takes into account the diachronic link existing within language under the influence of pressures between transfer structures and lexical units. The morphemic approach to the study of lexical units is in this case legitimate since their compositionality does not rely on strict phonology but, in the first place, on complex morphology. First of all, we shall present our paradigm and the origins of the Ocelles multilingual and multimodal platform (written, oral, and signed languages), out of which our French-written/LSF corpus is built. We will then describe a process likely to enable users to search for an LSF signifier and to relate this result to that of the corresponding written French signifier.
Cédric Moreau

Corpus in “Natural” Language Versus “Translation” Language: LBC Corpora, A Tool for Bilingual Lexicographic Writing

The aim of this paper is to describe the work done to exploit the LBC database for the purpose of translation analysis as a resource to edit the bilingual lexical sections of our dictionaries of cultural heritage (in nine languages). This database, made up of nine corresponding corpora, contains texts whose subject is cultural heritage, ranging from technical texts on art history to books on art appreciation, such as tour guides, and travel books highlighting Italian art and culture. We will illustrate the different questions with the SketchEngine LBC French corpus, made up at the moment of 3,000,000 words. Our particular interest here is in research that not only orients lexical choices for translators but that also precedes the selection of bilingual quotations (from our Italian/French parallel corpus) and that we rely on for editing an optional element of the file called “translation notes.” We will rely on this as much for works on “universals of translation” already described by Baker (Corpus linguistics and translation studies. Implications and applications. In Baker M et al (eds) Text and technology. Benjamins, Amsterdam/Philadelphia, pp 233–250 (1993)) as for studies aimed at improving translation quality assessment (TQA). We will show how a targeted consultation of different corpora and subcorpora that the database allows us to distinguish (“natural language” vs “translation,” “technical texts” vs “popularization texts” or “literary texts”) can help us identify approximations or translation errors, so as to build quality comparative lexicographical information.
Annick Farina, Riccardo Billero

The Conditional Perfect, A Quantitative Analysis in English-French Comparable-Parallel Corpora

The frequency of the conditional perfect in English and French was observed in a corpus of almost 12 million words consisting of four 2.9-million-word comparable and parallel subcorpora, tagged by POS and lemma, and analyzed using regular expressions. Intralinguistically, authors and translators were compared using the Wilcoxon–Mann–Whitney test and Cliff's delta as a measurement of effect size, while potential interlinguistic influences were assessed by means of Spearman's correlation test. French-translated-from-English was found to be distinct from original French in its use of all conditional perfect forms, but significant differences between translators and authors were observed in both languages when COULD, MIGHT and POUVOIR were used as auxiliaries.
Daniel Henkel
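
Cliff's delta, the effect size named in the abstract, compares every value of one sample with every value of the other: it is the proportion of pairs where the first is larger minus the proportion where it is smaller. The sketch below also computes the Mann-Whitney U statistic to show the identity delta = 2U/(nm) - 1. The sample values are invented and do not reproduce the chapter's data.

```python
def cliffs_delta(xs, ys):
    """Cliff's delta: (#pairs with x > y minus #pairs with x < y) / (n * m)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    less = sum(1 for x in xs for y in ys if x < y)
    return (greater - less) / (len(xs) * len(ys))

def mann_whitney_u(xs, ys):
    """U statistic for xs: pairs where x > y, with ties counted as one half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in xs for y in ys)

# Hypothetical per-text frequencies of a form (per 1,000 words).
authors = [1.2, 0.8, 1.5, 1.1]
translators = [1.6, 1.9, 1.4, 2.0]

delta = cliffs_delta(authors, translators)
u = mann_whitney_u(authors, translators)
```

Only one of the 16 pairs has the author value above the translator value, so delta is (1 - 15) / 16 = -0.875, a strong effect in the direction of higher frequencies among translators in this toy sample.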

Repeated and Anaphoric Segments Applied to Trilingual Knowledge Extraction

In a context of globalized societies, multilingualism is becoming an economic and social phenomenon. Translation constitutes a crucial element for communication. A good translation guarantees the quality of the transmission of all information. However, can translation on its own be used to face the challenge of multilingual information monitoring? With the advent of the digital age and the integration of many new technologies, corporate governance is undergoing a complete metamorphosis. One of the priorities remains the efficient exploitation of accumulated big data. This paper endeavors to shed light on the need for multilingual monitoring and on current or future developments of text mining, for three major languages (French, English, and Chinese), in crucial areas for the world's future and to describe the specificity and efficiency of anaphora.
Lionel Shen
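
The abstract does not describe how repeated segments are extracted. One simple reading, sketched below, treats a repeated segment as any sequence of consecutive tokens that occurs at least a given number of times in a text. The length bounds, the frequency threshold and the sample sentence are assumptions for this example.

```python
from collections import Counter

def repeated_segments(tokens, min_len=2, max_len=4, min_freq=2):
    """Return token sequences of min_len..max_len tokens occurring at least min_freq times."""
    counts = Counter(
        tuple(tokens[i:i + n])
        for n in range(min_len, max_len + 1)
        for i in range(len(tokens) - n + 1)
    )
    return {seg: f for seg, f in counts.items() if f >= min_freq}

# Toy text (invented for the example).
tokens = "big data analysis and big data mining".split()
segments = repeated_segments(tokens)
```

In this toy text the only sequence that recurs is the bigram "big data", which appears twice.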

Applications

Looking for Topics: A Brief Review

This paper presents a brief review of several endeavors to identify latent variables (axes or clusters). When dealing with textual data, these latent variables are sometimes designated ex ante by the term “topic”. The first attempts to identify interpretable latent variables date back to factor analysis at the beginning of the last century. Recent years have witnessed a series of algorithmic attempts such as non-negative matrix factorization (NMF) or latent Dirichlet allocation (LDA). In the meantime, latent variables are also identified through several hybridizations and synergies of principal axes methods and clustering techniques. A single medium-sized classical corpus (Shakespeare’s 154 Sonnets) will serve as a benchmark to sketch and compare in a compact way some characteristic features of several methods.
Ludovic Lebart

Where Are the Social Sciences Going to? The Case of the EU-Funded SSH Research Projects

This article investigates how the emergence of European research funding affects the directions and processes of scientific knowledge production. It does so by examining the extended abstracts of the EU-funded research projects (2007–2013) realized within the broad domain of the Social Sciences and Humanities (SSH). By using automated content analysis techniques, the essay tries to shed light on the interplay between the policy layer and the SSH field, taking into account the programmatic and legal structures inherent in the 7FP, on the one side, and the national scientific systems and the disciplinary structure of the SSH field, on the other.
Matteo Gerli

Topic Modeling of Twitter Conversations: The Case of the National University of Colombia

Topic modeling provides a useful method for finding symbolic representations of ongoing social events, which represent claims about the shape of social reality, its causes and the responsibility for action such causes imply. It has received special attention among social researchers in the last decade. During this time, Twitter has acted as the most common platform for people to share narratives about social events. This study proposes Latent Dirichlet Allocation (LDA) based topic modeling of Twitter conversations to determine the topics shared on Twitter about the financial crisis at the National University of Colombia. We downloaded all tweets that included the hashtag #CrisisUNAL (UNAL is the Spanish acronym for National University of Colombia) using the Twitter API and analyzed over 45,000 tweets published between 2011 and 2015. The results illustrate the strength of topic modeling for analyzing large text corpora and provide a way to study new emerging information that people share on Twitter.
Eliana Sanandres, Raimundo Abello, Camilo Madariaga

Analyzing Occupational Safety Culture Through Mass Media Monitoring

In recent years, a group of researchers within the National Institute for Insurance against Accidents at Work (INAIL) has launched a pilot project on mass media monitoring in order to find out how the press deals with the culture of safety and health at work. To monitor mass media, the institute has created a relational database of news concerning occupational injuries and diseases, which was filled with information obtained from newspaper articles about work-related accidents and incidents, including the text of the articles themselves. The ultimate objective is to identify the major lines for awareness-raising actions on safety and health at work. The hypothesis is that, for different kinds of accidents, journalists use a different language to narrate the events. To verify it, the news items have been preprocessed, and several analyses have been implemented on these data using automatic text analysis techniques; our purpose is to find language distinctions connected to groups of similar injuries. The identification of various ways of reporting the events could provide new elements to describe safety knowledge, and could also help establish collaborations with journalists in order to enhance communication and raise people's attention toward workers’ safety. In the end, the results obtained have confirmed the starting hypothesis, and they have also provided useful tools to understand media communication related to safety in workplaces.
Livia Celardo, Rita Vallerotonda, Daniele De Santis, Claudio Scarici, Antonio Leva

What Volunteers Do? A Textual Analysis of Voluntary Activities in the Italian Context

Over the past decades, the complex phenomenon of volunteering has been mainly analyzed within the field of economic literature with respect to its “economic value added,” that is, the capability of these activities to increase the productivity level of some specific goods or services. The paper adopts a different point of view, in that voluntary organizations are analyzed as places of innovation, where new jobs arise and people acquire new skills. Thus, volunteering can be understood as a “social innovation” factor. In order to gain a deeper insight into the types of voluntary work, we have used data coming from the Istat survey “Multiscopo, Aspetti della vita quotidiana” (Multipurpose survey, daily life aspects), released in the year 2013. In our textual analysis, we have utilized the answers that respondents gave to the open-ended question on the tasks they performed individually as volunteers. After stemming, lemmatization, and cleaning, the data have been analyzed by means of community detection based on semantic network analysis, with the purpose of identifying job patterns, then by a clustering procedure on answers (Reinert approach to textual clustering), and lastly through correspondence analysis on generalized aggregated lexical tables (CA-GALT) with the aim of exploring volunteers’ profiles. In particular, we have singled out differences in gender, age, educational level, region of residence, and type of voluntary association.
Francesco Santelli, Giancarlo Ragozini, Marco Musella

Free Text Analysis in Electronic Clinical Documentation

This study is motivated by the now wide availability of clinical documentation stored in electronic form to track a patient’s health status along the care path. The spread of these practices makes available many biomedical collections of electronic data, easily accessible at low cost, that can be used for research purposes in observational epidemiological studies, in analogy with what was historically done in studies based on the review of medical records. However, since these collections are not organized according to specific survey schemes, they sometimes do not allow the index events to be discriminated automatically, with the necessary reliability, between one source and another. This poses the problem of finding effective methods for the critical rereading of texts that are capable of remedying this deficiency and, where possible, of bringing the words or segments back into subdomains that can be analyzed statistically. The paper addresses the problem by presenting study criteria and an empirical experience consistent with the needs of a biomedical context.
Antonella Bitetto, Luigi Bollani

Educational Culture and Job Market: A Text Mining Approach

The paper explores teachers’ and students’ training culture in clinical psychology in an Italian university in order to understand whether the educational context supports the development of useful professional skills that support the entry of newly graduated psychologists into the job market. To this aim, transcriptions of student and teacher interviews (n = 47) were collected into two corpora. Both corpora underwent Emotional Text Mining (ETM), a text mining procedure allowing for the identification of the social mental functioning (culture) that shapes people’s behaviors, sensations, expectations, attitudes, and communication. The results show 4 clusters and 3 factors for each corpus, highlighting similarity in the representation of the training culture between students and teachers. Education and professional training are perceived as two different processes with different goals. The text mining procedure turned out to be an enlightening tool to explore the university training culture: the research results were presented to the Master Coordinator, who decided to plan a group activity for the students aiming to improve the educational process.
Barbara Cordella, Francesca Greco, Paolo Meoli, Vittorio Palermo, Massimo Grasso

