
About this book

This book of peer-reviewed contributions presents the latest findings in classification, statistical learning, data analysis and related areas, including supervised and unsupervised classification, clustering, statistical analysis of mixed-type data, big data analysis, statistical modeling, graphical models and social networks. It covers methodological aspects as well as applications to a wide range of fields such as economics, architecture, medicine, data management, consumer behavior and the gender gap. In addition, it describes the basic features of the software behind the data analysis results, and provides links to the corresponding codes and data sets where necessary.

This book is intended for researchers and practitioners who are interested in the latest developments and applications in the field of data analysis and classification. It gathers selected and peer-reviewed contributions presented at the 11th Scientific Meeting of the Classification and Data Analysis Group of the Italian Statistical Society (CLADAG 2017), held in Milan, Italy, on September 13–15, 2017.

Table of Contents


Clustering and Classification


Cluster Weighted Beta Regression: A Simulation Study

In several application fields, we have to model a response that takes values in a limited range. When these values can be transformed into rates, proportions or concentrations, that is, into continuous values in the unit interval, beta regression may be the appropriate choice. In the presence of unobserved heterogeneity, for example when the population of interest is composed of different subgroups, finite mixtures of beta regression models can be useful. When the conditions of exogeneity of the covariate set are not met, extended modeling approaches are needed. For this purpose, we discuss the class of cluster-weighted beta regression models.
Marco Alfó, Luciano Nieddu, Cecilia Vitiello
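The modeling building blocks mentioned in the abstract can be made concrete in a few lines. The following minimal sketch (ours, not the authors' code) shows the mean-precision parameterization commonly used in beta regression, a logit link for the mean, and the log-likelihood of one observation under a finite mixture of beta components; all function names are illustrative.

```python
import math

def beta_logpdf(y, mu, phi):
    """Log-density of the beta distribution in the mean-precision
    parameterization used in beta regression: the shape parameters are
    a = mu*phi and b = (1 - mu)*phi, with 0 < y < 1 and 0 < mu < 1."""
    a, b = mu * phi, (1.0 - mu) * phi
    return (math.lgamma(phi) - math.lgamma(a) - math.lgamma(b)
            + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y))

def inv_logit(eta):
    """Logit link: maps a linear predictor eta to the unit interval."""
    return 1.0 / (1.0 + math.exp(-eta))

def mixture_logpdf(y, weights, mus, phi):
    """Log-likelihood of one observation under a finite mixture of beta
    regressions: a log-sum-exp of the component log-densities weighted
    by the mixing proportions."""
    terms = [math.log(w) + beta_logpdf(y, m, phi) for w, m in zip(weights, mus)]
    mx = max(terms)
    return mx + math.log(sum(math.exp(t - mx) for t in terms))
```

With mu = 0.5 and phi = 2 the density reduces to the uniform on (0, 1), so the log-density is zero everywhere; this is a convenient sanity check on the parameterization.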

Detecting Wine Adulterations Employing Robust Mixture of Factor Analyzers

An authentic food is one that is what it claims to be. Nowadays, more and more attention is devoted to the food market: stakeholders throughout the value chain need to receive exact information about the specific product they are trading in. To ascertain varietal genuineness and distinguish potentially doctored food, in this paper we propose a robust mixture estimation method. In particular, in a wine authenticity framework with unobserved heterogeneity, we jointly perform genuine wine classification and contamination detection. Our methodology models the data as arising from a mixture of Gaussian factors and flags the observations with the lowest contributions to the overall likelihood as illegal samples. The advantage of using robust estimation on a real wine dataset is shown in comparison with several other classification approaches. Moreover, simulation results confirm the effectiveness of our approach in dealing with an adulterated dataset.
Andrea Cappozzo, Francesca Greselin

Simultaneous Supervised and Unsupervised Classification Modeling for Assessing Cluster Analysis and Improving Results Interpretability

In the unsupervised classification field, the unknown number of clusters and the lack of assessment and interpretability of the final partition by means of inferential tools are important limitations that can negatively influence the reliability of the final results. In this work, we propose to combine unsupervised classification with supervised methods in order to enhance the assessment and interpretation of the obtained partition. In particular, the approach consists in combining the k-means (KM) clustering method with logistic regression (LR) modeling to obtain an algorithm that allows an evaluation of the partition identified through KM, assesses the correct number of clusters, and verifies the selection of the most important variables. An application to real data is presented to clarify the utility of the proposed approach.
Mario Fordellone, Maurizio Vichi
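The combined strategy described in the abstract can be sketched as follows, using scikit-learn as an illustrative library choice (this is our sketch, not the authors' code): cluster with k-means, then treat the cluster labels as the response of a logistic regression, whose in-sample fit and coefficients help judge the separation of the partition and the relevance of each variable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two well-separated synthetic groups in two variables (toy data).
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(3.0, 0.3, (50, 2))])

# Step 1: unsupervised partition via k-means.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 2: supervised model on the KM labels; a high in-sample accuracy
# suggests the clusters are well separated in the original variables,
# and the coefficients indicate which variables drive the separation.
lr = LogisticRegression().fit(X, labels)
accuracy = lr.score(X, labels)
```

Repeating the procedure for different numbers of clusters, and inspecting how well the LR model reproduces each partition, is one simple way to operationalize the assessment step the abstract refers to.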

A Parametric Version of Probabilistic Distance Clustering

The probabilistic distance (PD) clustering method is grounded on the basic assumption that, for each statistical unit, the product between the probability of the unit belonging to a cluster and the distance between the unit and the cluster center is constant. This constant is a measure of the classifiability of the point, and the sum of the constant over units is referred to as the joint distance function (JDF). The parameters that minimize the JDF maximize the classifiability of the units. The goal of this paper is to introduce a new distance measure based on a probability density function; specifically, we use the multivariate Gaussian and Student-t distributions. We show on two simulated data sets that a distance based on these two density functions improves the performance of PD clustering.
Christopher Rainey, Cristina Tortora, Francesco Palumbo
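The constancy assumption stated above pins down the membership probabilities in closed form: since p_k * d_k must be the same for every cluster k, p_k is proportional to the product of the distances to all the other centers. A minimal sketch of this computation (ours, for illustration; the distances are assumed already computed):

```python
def pd_memberships(dists):
    """Membership probabilities in probabilistic distance clustering.
    p_k is proportional to the product of the distances to all the
    OTHER centers, so that p_k * d_k equals the same constant for
    every cluster k."""
    prods = []
    for k in range(len(dists)):
        prod = 1.0
        for j, d in enumerate(dists):
            if j != k:
                prod *= d
        prods.append(prod)
    total = sum(prods)
    return [p / total for p in prods]

def jdf_at_point(dists):
    """The constant p_k * d_k at one unit; summing it over all units
    gives the joint distance function (JDF)."""
    p = pd_memberships(dists)
    return p[0] * dists[0]
```

For instance, with distances 1 and 3 to two centers, the memberships are 3/4 and 1/4, and both products p_k * d_k equal 3/4, the unit's contribution to the JDF.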

An Overview on the URV Model-Based Approach to Cluster Mixed-Type Data

In this paper, we provide an overview of the underlying response variable (URV) model-based approach to cluster, and optionally simultaneously reduce, ordinal and possibly continuous variables. We summarize and compare its main features, discussing some key issues. An application to real data is illustrated, comparing and discussing clustering performances.
Monia Ranalli, Roberto Rocci

Exploratory Data Analysis


Preference Analysis of Architectural Façades by Multidimensional Scaling and Unfolding

The methods of paired comparison and ranking play an important role in the analysis of preference data. In this study, we first show how asymmetric multidimensional scaling allows us to represent in a diagram the preference order that emerges from a paired-comparison task concerning architectural façades. A ranking task involving the same stimuli and the same sample of subjects further enriched the preference analysis, because multidimensional unfolding applied to the ranking data matrix allows us to detect the relationships between subjects and architectural façades. The results show that the highly curved façade is the most preferred, followed by the medium curved, angular and rectilinear ones. The rectilinear stimuli, rather than the angular ones as expected, were always the least preferred.
Giuseppe Bove, Nicole Ruta, Stefano Mastandrea

Community Structure in Co-authorship Networks: The Case of Italian Statisticians

Community detection is a very appealing topic in network analysis. A precise definition of community is still lacking, so the comparison of different methods is not a simple task. This paper shows exploratory results by adopting two well-known community detection methods and a new proposal to discover groups of scientists in the co-authorship network of Italian academic statisticians.
Domenico De Stefano, Maria Prosperina Vitale, Susanna Zaccarin

Analyzing Consumers’ Behavior in Brand Switching

Asymmetric multidimensional scaling is extended to represent differences among consumers in brand switching. The method, based on the singular value decomposition, represents asymmetric relationships among brands by introducing an outward tendency, corresponding to the left singular vector, and an inward tendency, corresponding to the right singular vector. The resulting configuration is represented in a plane spanned by the left and right singular vectors, where each brand is represented as a point. Each dimension (component) has its own plane, or two-dimensional configuration. The asymmetric multidimensional scaling is extended so that each consumer is also represented as a point in the plane. The joint configuration of brands and consumers represents how each consumer, or group of consumers, relates to brands in brand switching. The procedure is applied successfully to brand-switching data among potato snacks.
Akinori Okada, Hiroyuki Tsurumi
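The SVD decomposition underlying the method can be sketched directly. In the toy example below (our illustration, with a hypothetical 3-brand switching matrix, not the authors' data), entry (i, j) counts consumers switching from brand i to brand j; for each dimension, the left singular vector carries the outward tendencies of the brands and the right singular vector the inward tendencies.

```python
import numpy as np

# Hypothetical brand-switching matrix: rows = origin brand,
# columns = destination brand, entries = switching counts.
S = np.array([[30.0,  8.0,  2.0],
              [ 3.0, 25.0, 12.0],
              [ 1.0,  4.0, 20.0]])

# Singular value decomposition of the (asymmetric) switching matrix.
U, s, Vt = np.linalg.svd(S)

# First component: each brand gets an outward coordinate (left singular
# vector) and an inward coordinate (right singular vector), scaled by
# the singular value; together they span the first two-dimensional plane.
outward = U[:, 0] * np.sqrt(s[0])
inward = Vt[0, :] * np.sqrt(s[0])
```

Note that the signs of singular vectors are arbitrary in numerical SVD routines, so the configuration may need a reflection before interpretation.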

Evaluating the Quality of Data Imputation in Cardiovascular Risk Studies Through the Dissimilarity Profile Analysis

Missing data handling is one of the crucial problems in statistical analyses and is almost always addressed by imputation. Although the literature is rich in imputation approaches, the problem of assessing the quality of imputation, i.e., appraising whether the imputed values or categories are plausible for variables and units, seems to have received less attention. This issue is critical in every field of application, such as the medical context considered here, i.e., the assessment of cardiovascular disease risk. We faced the problem of comparing the results obtained with different imputation methods and assessing the quality of imputation through dissimilarity profile analysis (DPA), a multivariate exploratory method for the analysis of dissimilarity matrices. We also combined DPA with traditional profile analysis for data matrices in order to improve understanding of the components of differentiation among imputation methods.
Nadia Solaro

Statistical Modeling


Measuring Economic Vulnerability: A Structural Equation Modeling Approach

Macroeconomic vulnerability is currently measured by the United Nations through a weighted average of eight variables related to exposure to shocks and frequency of shocks, known as the Economic Vulnerability Index (EVI). In this paper we propose to extend this measure by taking into account additional variables related to resilience, i.e., the ability of a country to recover after a shock. Since vulnerability can be considered a latent variable, we explore the possibility of using the structural equation modeling (SEM) approach as an alternative to an index based on arbitrary weights. Using data from a panel of 98 countries over 19 years, we compare the ability of the indices based on weighted averages, or on the SEM, to explain the growth rate of real GDP per capita.
Ambra Altimari, Simona Balzano, Gennaro Zezza

Bayesian Inference for a Mixture Model on the Simplex

The Flexible Dirichlet (Ongaro and Migliorati, J. Multivar. Anal. 114:412–426, 2013) is a distribution for compositional data (i.e., data whose support is the simplex) that can fit data better than the classical Dirichlet distribution, thanks to its mixture structure and to additional parameters that allow for a more flexible modeling of the covariance matrix. This contribution presents two Bayesian procedures, both based on Gibbs sampling, to estimate its parameters. A simulation study has been conducted to evaluate the performance of the proposed estimation algorithms in several parameter configurations. Data are generated from a Flexible Dirichlet with D = 3 components and representative parameter configurations.
Roberto Ascari, Sonia Migliorati, Andrea Ongaro

Stochastic Models for the Size Distribution of Italian Firms: A Proposal

What determines the size distribution of business firms? What kind of firm dynamics may be underlying observed firm size distributions? Which candidate distributions may be used for fitting purposes? We here address these questions from a stochastic model perspective. We construct a firm dynamics process that leads to a Dagum distribution of firm size at equilibrium. An empirical study shows that the proposed model captures the empirical regularities of firm size distributions with considerable accuracy.
Anna Maria Fiori, Anna Motta

Modeling Return to Education in Heterogeneous Populations: An Application to Italy

The Mincer human capital earnings function is a regression model that relates an individual's earnings to schooling and experience. It has been used to explain individual behavior with respect to educational choices and to indicate productivity in a large number of countries and across many different demographic groups. However, recent empirical studies have shown that the population of interest often embeds latent homogeneous subpopulations, with different returns to education across subpopulations, rendering a single Mincer regression inadequate. Moreover, whatever (concomitant) information is available about the nature of such heterogeneity should be incorporated in an appropriate manner. We propose a mixture of Mincer models with concomitant variables: it provides a flexible generalization of the Mincer model, a breakdown of the population into several homogeneous subpopulations, and an explanation of the unobserved heterogeneity. The proposal is motivated and illustrated via an application to data from the Bank of Italy's Survey of Household Income and Wealth in 2012.
Angelo Mazza, Michele Battisti, Salvatore Ingrassia, Antonio Punzo

Changes in Couples’ Bread-Winning Patterns and Wife’s Economic Role in Japan from 1985 to 2015

The trend towards dual-income families can be detected in recent years in many industrialized countries. However, despite the continuing rise in Japanese women's rates of participation in the economy over the period of industrialization and beyond, the notion of a gendered division of labour has been seen as "normal" in Japanese society. The aim of this paper is to examine whether the determinants of married women's labour force participation have changed over the past several decades. Based upon national sample social surveys conducted in Japan in 1985, 1995, 2005, and 2015, we analyse the income-provision role types of dual-income couples and examine the change or stability of the factors that differentiate couples where the husband provides the majority of the couple's income from equal providers. We find changing effects of women's own human capital on their contribution to household income. On the other hand, the division of labour within households has not changed much over the past several decades.
Miki Nakai

Weighted Optimization with Thresholding for Complete-Case Analysis

Complete-case analysis, also known as listwise deletion (LD), is a relatively popular technique for handling datasets with incomplete entries. It is known to be effective when data are missing completely at random. However, by reducing the size of the dataset it can weaken the final statistical analysis. We present an optimization algorithm that improves the size of the final dataset after applying LD. It is based on a constrained weighted optimization technique that determines the maximum number of variables and respondents from the initial dataset that are preserved after applying LD. Its main feature is that the method allows for selecting a specific set of variables (or respondents) that must be kept during the optimization, while balancing their relative importance by means of suitable weights. Moreover, we provide analytic formulas for the optimal solution that can easily be evaluated numerically, reducing the computational complexity associated with the use of off-the-shelf packages for solving similar large constrained optimization problems. We illustrate the application of our weighted optimization method to some examples and real datasets.
Graziano Vernizzi, Miki Nakai
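The trade-off the abstract describes can be made concrete with a deliberately naive sketch (ours, not the paper's method, which instead derives analytic formulas): choose which variables to keep, always including a required set, so that a weighted sum of kept variables and of rows surviving listwise deletion is maximal. The exhaustive search below is exponential in the number of variables and is meant only to make the objective tangible.

```python
from itertools import combinations

def best_ld_subset(data, required, w_var=1.0, w_row=1.0):
    """Brute-force illustration of the listwise-deletion trade-off.
    `data` is a list of dicts with None marking a missing entry;
    `required` lists variables that must be kept; w_var and w_row
    weigh the relative importance of kept variables vs. kept rows."""
    all_vars = sorted({v for row in data for v in row})
    optional = [v for v in all_vars if v not in required]
    best, best_score = None, float("-inf")
    for r in range(len(optional) + 1):
        for extra in combinations(optional, r):
            kept = list(required) + list(extra)
            # Rows that survive listwise deletion on the kept variables.
            complete = sum(all(row.get(v) is not None for v in kept)
                           for row in data)
            score = w_var * len(kept) + w_row * complete
            if score > best_score:
                best, best_score = sorted(kept), score
    return best, best_score
```

Varying w_var and w_row shifts the optimum between keeping more variables and keeping more respondents, which is exactly the balancing role the weights play in the paper's formulation.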

Graphical Models


Measurement Error Correction by Nonparametric Bayesian Networks: Application and Evaluation

In this paper a procedure for measurement error correction based on nonparametric Bayesian networks is proposed. The performance of the proposed method is evaluated using a validation sample collected by Banca d'Italia and a major Italian banking group to investigate the measurement error mechanism in the amounts of the main financial variables observed in the Banca d'Italia Survey on Household Income and Wealth. Specifically, in this paper attention is focused on bond amounts. By means of Uninet's programmatic engine, invoked directly from R, the data can be corrected unit by unit by sampling from the nonparametric Bayesian network. Thanks to the validation sample, the distances between the true and the imputed values are computed and the procedure is evaluated.
Daniela Marella, Paola Vicard, Vincenzina Vitale, Dan Ababei

Copula Grow-Shrink Algorithm for Structural Learning

The PC algorithm is the best-known constraint-based algorithm for learning a directed acyclic graph using conditional independence tests. For Gaussian distributions the tests are based on Pearson correlation coefficients. A version of the PC algorithm for data drawn from a Gaussian copula model, Rank PC, has recently been introduced and is based on the Spearman correlation. Here, we present a modified version of the Grow-Shrink algorithm, named Copula Grow-Shrink; it is based on the recovery of the Markov blanket and on the Spearman correlation. Simulations show that the Copula Grow-Shrink algorithm performs better than the PC and Rank PC algorithms according to the structural Hamming distance. Finally, the new algorithm is applied to Italian energy market data.
Flaminia Musella, Paola Vicard, Vincenzina Vitale
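Rank-based structure learning for Gaussian copula models typically estimates the latent Gaussian correlation from the Spearman coefficient via the mapping rho = 2 sin(pi * rho_S / 6), which the correlation-based independence tests then use in place of the Pearson coefficient. A minimal sketch of that estimator (ours, assuming no ties in the data for brevity):

```python
import math

def ranks(xs):
    """1-based ranks of the values in xs (assumes no ties)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank_pos, i in enumerate(order, start=1):
        r[i] = float(rank_pos)
    return r

def pearson(x, y):
    """Sample Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def gauss_copula_corr(x, y):
    """Spearman correlation (Pearson correlation of the ranks) mapped
    to the latent Gaussian correlation via rho = 2*sin(pi*rho_S/6)."""
    rho_s = pearson(ranks(x), ranks(y))
    return 2.0 * math.sin(math.pi * rho_s / 6.0)
```

Because the estimator depends on the data only through the ranks, it is invariant under the monotone marginal transformations that define the Gaussian copula model, which is what makes the rank-based tests valid there.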

Context-Specific Independencies Embedded in Chain Graph Models of Type I

For a set of variables collected in a contingency table, we focus on a particular kind of relationship, the context-specific independencies. These are conditional independencies that hold only for particular values of the conditioning set. Given the advantages of graphical models, we use them to represent different relationships among the variables, including the context-specific independencies. In particular, we enrich chain graph models with labelled arcs. Furthermore, we exploit the well-known relationships between chain graph models and hierarchical multinomial marginal models, and we introduce new constraints on the parameters in order to describe the context-specific relationships. Finally, we provide an application to the study of innovation in Italy by comparing two different periods.
Federica Nicolussi, Manuela Cazzaro

Big Data Analysis


Big Data and Network Analysis: A Combined Approach to Model Online News

In recent years, large volumes of data have been generated by automatic information extraction, innovative data mining, and predictive analytics. This paper proposes an innovative approach that combines Big Data with the analysis of relational structures in order to improve actionable analytics-driven decision patterns. From the website of one of the largest Italian online newspapers, interactions among users and their comments about the 2016 Italian constitutional review bill are organized in a Big Data audience model. Readers' sentiments are measured, and relational patterns are classified by descriptive measurements and clustering structures implemented in network analysis methods.
Giovanni Giuffrida, Simona Gozzo, Francesco Mazzeo Rinaldi, Venera Tomaselli

Experimental Design Issues in Big Data: The Question of Bias

Data can be collected in scientific studies via a controlled experiment or by passive observation. Big data are often collected in a passive way, e.g. from social media. In studies of causation, great efforts are made to guard against bias and against hidden confounders or feedback, which can destroy the identification of causation by corrupting or omitting counterfactuals (controls). Various solutions to these problems are discussed, including randomisation.
Elena Pesce, Eva Riccomagno, Henry P. Wynn