
2015 | Book


Advances in Statistical Models for Data Analysis

Edited by: Isabella Morlini, Tommaso Minerva, Maurizio Vichi

Publisher: Springer International Publishing

Book series: Studies in Classification, Data Analysis, and Knowledge Organization

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"


About this book

This edited volume focuses on recent research results in classification, multivariate statistics and machine learning and highlights advances in statistical models for data analysis. The volume provides both methodological developments and contributions to a wide range of application areas such as economics, marketing, education, social sciences and environment. The papers in this volume were first presented at the 9th biennial meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society, held in September 2013 at the University of Modena and Reggio Emilia, Italy.

Table of contents

Frontmatter

Using the dglars Package to Estimate a Sparse Generalized Linear Model

Abstract

dglars is a publicly available R package that implements the method proposed in Augugliaro et al. (J. R. Statist. Soc. B 75(3), 471–498, 2013) developed to study the sparse structure of a generalized linear model (GLM). This method, called dgLARS, is based on a differential geometrical extension of the least angle regression method. The core of the dglars package consists of two algorithms implemented in Fortran 90 to efficiently compute the solution curve.

Luigi Augugliaro, Angelo M. Mineo

A Depth Function for Geostatistical Functional Data

Abstract

In this paper we introduce a depth measure for geostatistical functional data. The aim is to provide a tool that yields a center-outward ordering of functional data recorded by sensors placed over a geographic area. Although the topic of ordering functional data has already been addressed in the literature, no proposal analyzes the case in which there is spatial dependence among the curves. To this end, we extend a well-known depth measure for functional data by introducing a new component in the measurement, which accounts for the spatial covariance. An application of the proposed method to a wide range of simulated cases shows its effectiveness in discovering a useful ordering of the spatially located curves.

Antonio Balzanella, Elvira Romano

Robust Clustering of EU Banking Data

Abstract

In this paper we present an application of robust clustering to the European Union (EU) banking system. Banks may differ in several aspects, such as size, business activities and geographical location. After the latest financial crisis, it has become of paramount importance for European regulators to identify common features and issues in the EU banking system and address them in all Member States (or at least those of the Euro area) in a harmonized manner. A key issue is to identify, using publicly available information, those banks more involved in risky activities, in particular trading, which may need to be restructured to improve the stability of the whole EU banking sector. In this paper we show how robust clustering can help in achieving this purpose. In particular, we look for a sound method able to clearly cut the two-dimensional space of trading volumes and their shares over total assets into two subsets, one containing safe banks and the other the risky ones. The dataset, built using banks’ balance sheets, includes 245 banks from all EU27 countries except Estonia, plus a Norwegian bank. With appropriate parameters, the TCLUST routine could provide better insight into the data and suggest proper thresholds for regulators.

Jessica Cariboni, Andrea Pagano, Domenico Perrotta, Francesca Torti

Sovereign Risk and Contagion Effects in the Eurozone: A Bayesian Stochastic Correlation Model

Abstract

This research proposes a Bayesian multivariate stochastic volatility (MSV) model to analyze the dynamics of sovereign risk in eurozone CDS markets during the recent financial crisis. We follow an MCMC approach to parameter and latent-variable estimation and provide evidence of significant volatility shifts in asset returns, strong simultaneous increases in cross-market correlations, as well as sharp declines in correlation patterns. Overall, these findings are highly consistent with various empirical characterizations of contagion put forward in the literature, allowing us to conclude that the recent financial crisis generated severe contagion effects in sovereign debt markets of eurozone countries.

Roberto Casarin, Marco Tronzano, Domenico Sartore

Female Labour Force Participation and Selection Effect: Southern vs Eastern European Countries

Abstract

The aim of this paper is to explore the main determinants of women’s job search propensity as well as the mechanism underlying the selection effect across the four European countries (Italy, Greece, Hungary and Poland) with the lowest female labour force participation. The potential bias due to the overlap in some unobserved characteristics is addressed via a bivariate probit model. Significant selection effects of opposite signs are found for the Greek and Polish labour markets.

Rosalia Castellano, Gennaro Punzo, Antonella Rocca

Asymptotics in Survey Sampling for High Entropy Sampling Designs

Abstract

The aim of the paper is to establish asymptotics in sampling finite populations. Asymptotic results are first established for an analogue of the empirical process based on the Hájek estimator of the population distribution function and then extended to Hadamard-differentiable functions. As an application, asymptotic normality of estimated quantiles is provided.

Pier Luigi Conti, Daniela Marella
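
As a rough illustration of the estimator named in the abstract above, the Hájek estimator of the population distribution function can be sketched as a ratio of inverse-probability-weighted sums. This is a minimal sketch, not the authors' code: the function names, the toy sample and the inclusion probabilities are all hypothetical, and the quantile is taken as the smallest sampled value whose estimated distribution function reaches the requested level.

```python
def hajek_cdf(values, inclusion_probs, y):
    """Hajek estimate of the population distribution function at y.

    Each sampled unit is weighted by the inverse of its inclusion
    probability, and the weights are normalised by their own sum.
    """
    weights = [1.0 / p for p in inclusion_probs]
    numerator = sum(w for v, w in zip(values, weights) if v <= y)
    return numerator / sum(weights)


def hajek_quantile(values, inclusion_probs, level):
    """Smallest sampled value whose estimated CDF is at least `level`."""
    for v in sorted(values):
        if hajek_cdf(values, inclusion_probs, v) >= level:
            return v
    return max(values)


# Hypothetical toy sample (not from the paper).
sample_values = [2.0, 5.0, 7.0, 9.0]
sample_probs = [0.5, 0.25, 0.25, 0.5]

cdf_at_5 = hajek_cdf(sample_values, sample_probs, 5.0)
median_estimate = hajek_quantile(sample_values, sample_probs, 0.5)
```

With these toy inputs the weights are 2, 4, 4 and 2, so units with small inclusion probabilities count for more in the estimated distribution function.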

A Note on the Use of Recursive Partitioning in Causal Inference

Abstract

A tree-based approach for identification of a balanced group of observations in causal inference studies is presented. The method uses an algorithm based on a multidimensional balance measure criterion applied to the values of the covariates to recursively split the data. Starting from an ad-hoc resampling scheme, observations are finally partitioned into subsets characterized by different degrees of homogeneity, and causal inference is carried out on the most homogeneous subgroups.

Claudio Conversano, Massimo Cannas, Francesco Mola

Meta-Analysis of Poll Accuracy Measures: A Multilevel Approach

Abstract

Following a meta-analysis approach as a special case of multilevel modelling, we identify potential sources of dissimilarities in accuracy measures of pre-election polls carried out during Parliamentary elections in Italy from 2001 to 2008. The predictive accuracy measure, computed to compare the pre-electoral poll result to the actual result, is the dependent variable; the poll characteristics are the explanatory variables and are introduced in a hierarchical model. In the model each outcome is affected by a specific sampling error assumed to have a normal distribution and a known variance. The multilevel approach decomposes the variance components in the same way as random-effects meta-analysis models. We propose a multilevel approach in order to make the estimation procedure easier and more flexible than in a traditional meta-analysis approach.

Rosario D’Agata, Venera Tomaselli

Families of Parsimonious Finite Mixtures of Regression Models

Abstract

Finite mixtures of regression (FMR) models offer a flexible framework for investigating heterogeneity in data with functional dependencies. These models can be conveniently used for unsupervised learning on data with clear regression relationships. We extend such models by imposing an eigen-decomposition on the multivariate error covariance matrix. By constraining parts of this decomposition, we obtain families of parsimonious mixtures of regressions and mixtures of regressions with concomitant variables. These families of models account for correlations between multiple responses. An expectation-maximization algorithm is presented for parameter estimation and performance is illustrated on simulated and real data.

Utkarsh J. Dang, Paul D. McNicholas

Quantile Regression for Clustering and Modeling Data

Abstract

This paper proposes an innovative approach to identify a typology in a quantile regression model. Quantile regression is a regression technique that makes it possible to focus on the effects that a set of explanatory variables has on the entire conditional distribution of a dependent variable. The proposal concerns the use of multivariate techniques to simultaneously cluster and model data, and it is illustrated using an empirical analysis. This analysis concerns the impact of student features on the university outcome, measured by the degree mark. The analysis is based on the idea that the dependence structure could be different for units belonging to different groups.

Cristina Davino, Domenico Vistocco
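
To make concrete how quantile regression targets different parts of a distribution, as described in the abstract above, the following sketch uses the standard check (pinball) loss in the simplest intercept-only case, where minimising the loss recovers a sample quantile. It is an illustration under stated assumptions only: the data are hypothetical, the function names are invented for this example, and the clustering step proposed in the paper is not reproduced.

```python
def pinball_loss(residual, tau):
    """Check loss: residuals are weighted by tau if positive, 1 - tau if negative."""
    return residual * (tau - (1.0 if residual < 0 else 0.0))


def total_loss(data, candidate, tau):
    """Sum of check losses when every observation is predicted by `candidate`."""
    return sum(pinball_loss(y - candidate, tau) for y in data)


def best_constant(data, tau):
    """Observed value minimising the total check loss (intercept-only fit).

    A minimiser of the check loss can always be found among the
    observed values, so searching over them is sufficient here.
    """
    return min(sorted(data), key=lambda c: total_loss(data, c, tau))


# Hypothetical toy data with one extreme value.
marks = [1.0, 2.0, 3.0, 4.0, 100.0]

lower_fit = best_constant(marks, 0.25)
median_fit = best_constant(marks, 0.5)
upper_fit = best_constant(marks, 0.9)
```

Changing `tau` moves the fitted value along the distribution, which is the property a full quantile regression extends to models with explanatory variables.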

Nonmetric MDS Consensus Community Detection

Abstract

Community detection methods for the analysis of complex networks are increasingly important in the modern literature, yet community detection remains an open problem. The approach proposed in this work adopts an ensemble procedure to obtain a consensus matrix, to which a nonmetric MDS is applied, followed by a clustering algorithm that yields a consensus partition of the nodes. The simulation study offers some interesting insights into the procedure because it shows that it is possible to identify the key nodes and the stable communities by considering different algorithms. The proposed approach is also applied to real data related to a network of patents.

Carlo Drago, Antonio Balzanella
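
The ensemble step in the abstract above can be illustrated with a common construction of a consensus matrix: the share of partitions in which two nodes fall in the same community. This is a minimal sketch assuming that construction; the abstract does not specify the authors' exact definition, and the partitions and function names below are hypothetical. The nonmetric MDS and clustering steps are not shown.

```python
def consensus_matrix(partitions):
    """Fraction of partitions in which each pair of nodes shares a community.

    `partitions` is a list of label lists, one per community detection
    run, all indexed by the same node order.
    """
    n_nodes = len(partitions[0])
    n_runs = len(partitions)
    counts = [[0] * n_nodes for _ in range(n_nodes)]
    for labels in partitions:
        for i in range(n_nodes):
            for j in range(n_nodes):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / n_runs for c in row] for row in counts]


def consensus_dissimilarity(matrix):
    """One minus the consensus, usable as input dissimilarities for an MDS."""
    return [[1.0 - value for value in row] for row in matrix]


# Hypothetical output of three different algorithms on four nodes.
runs = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]

consensus = consensus_matrix(runs)
dissimilarity = consensus_dissimilarity(consensus)
```

Label values are only compared within a run, so the third run agrees with the first even though its labels are swapped.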

The Performance of the Gradient-Like Influence Measure in Generalized Linear Mixed Models

Abstract

A gradient-like statistic, recently introduced as an influence measure, has been proven to work well in large samples, thanks to its asymptotic properties. In this work, through small-scale simulation schemes, the performance of such a diagnostic measure is further investigated in terms of concordance with the main influence measures used for outlier identification. The simulation studies are performed by using generalized linear mixed models (GLMMs).

Marco Enea, Antonella Plaia

New Flexible Probability Distributions for Ranking Data

Abstract

Recently, several models have been proposed for analysing the ranks assigned by people to some object. These models summarize the degree of liking towards the object, possibly with respect to a set of explanatory variables. Some recent works have suggested the use of the Shifted Binomial and of the Inverse Hypergeometric distribution for modelling the approval rate, while mixture models have been considered for taking into account the uncertainty in the ranking process. We propose two new probability distributions, the Discrete Beta and the Shifted-Beta Binomial, which offer considerable flexibility and allow the joint modelling of the scale (approval rate) and the shape (uncertainty) parameters of the rank distribution.

Salvatore Fasola, Mariangela Sciandra

Robust Estimation of Regime Switching Models

Abstract

It is well known that generalized-M (GM) estimators for linear models are consistent and lead to a small loss of efficiency with respect to the least squares (LS) estimator. When they are extended to threshold models, the consistency of GM estimators is guaranteed only under certain objective functions. In this paper we explore, in a simulation experiment, the loss of consistency of the GM-SETAR estimator under different objective functions, time-series lengths, parameter combinations and types of contamination. Finally, the best robust estimator is applied to study the dynamics of electricity prices, where regime switching and high spikes are widely observed features.

Luigi Grossi, Fany Nan

Incremental Visualization of Categorical Data

Abstract

Multiple correspondence analysis (MCA) is a well-established dimension reduction method to explore the associations within a set of categorical variables and it consists of a singular value decomposition (SVD) of a suitably transformed matrix. The high computational and memory requirements of ordinary SVD make its application impractical on massive or sequential data sets that characterize several modern applications. The aim of the present contribution is to allow for incremental updates of existing MCA solutions, which lead to an approximate yet highly accurate solution; this makes it possible to track, via MCA, the association structures in data flows. To this end, an incremental SVD approach with desirable properties is embedded in the context of MCA.

Alfonso Iodice D’Enza, Angelos Markos

A New Proposal for Tree Model Selection and Visualization

Abstract

The most common approach to building a decision tree is based on a two-step procedure: growing a full tree and then pruning it back. The goal is to identify the tree with the lowest error rate. Alternative pruning criteria have been proposed in the literature. Within the framework of recursive partitioning algorithms by tree-based methods, this paper contributes to both the visual representation of the data partition in a geometrical space and the selection of the decision tree. In our visual approach the best tree and the weakest links can be identified immediately by graphical analysis of the tree structure, without considering the pruning sequence. The results in terms of error rate are very similar to the ones returned by the classification and regression trees (CART) procedure, showing that this new way of selecting the best tree is a valid alternative to the well-known cost-complexity pruning.

Carmela Iorio, Massimo Aria, Antonio D’Ambrosio

Object-Oriented Bayesian Network to Deal with Measurement Error in Household Surveys

Abstract

In this paper we propose to use the object-oriented Bayesian network (OOBN) architecture to model measurement errors in the Italian survey on household income and wealth (SHIW) 2008 when the variable of interest is categorical. The network is used to stochastically impute microdata for households. Imputation is performed both assuming a misreport probability constant over the whole population and learning a Bayesian network for estimating such a probability. Finally, the potential and possible extensions of this approach are discussed.

Daniela Marella, Paola Vicard

Comparing Fuzzy and Multidimensional Methods to Evaluate Well-Being in European Regions

Abstract

We suggest a new criterion based on fuzzy set theory in order to evaluate well-being in European regions at NUTS 2 level. With reference to the various domains of this vague and multidimensional concept, a subset of 16 variables available in the Eurostat database is selected. After a fuzzy transformation, the variables are aggregated into a fuzzy synthetic indicator, considering different weighting criteria. For each region the fuzzy indicator value, in the range [0, 1], may be interpreted as a membership degree to the subset of the areas with the highest well-being. The results are compared with the ones obtained by principal component analysis (PCA) and k-means cluster analysis applied to the same dataset. Furthermore, the relationships of the fuzzy indicator with GDP per capita and with the human development index (HDI) are highlighted. The advantages and the drawbacks of the suggested approach are discussed.

Maria Adele Milioli, Lara Berzieri, Sergio Zani
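
The two steps named in the abstract above, a fuzzy transformation followed by a weighted aggregation into an indicator in [0, 1], can be sketched as follows. The abstract does not state which membership function or weights the authors use, so this sketch assumes a linear membership between the observed minimum and maximum and a weighted mean; the region names, values and weights are hypothetical.

```python
def linear_membership(x, low, high):
    """Assumed membership function: 0 at `low`, 1 at `high`, linear in between."""
    if high == low:
        return 0.0
    return min(1.0, max(0.0, (x - low) / (high - low)))


def fuzzy_indicator(memberships, weights):
    """Weighted mean of membership degrees, so the result stays in [0, 1]."""
    return sum(m * w for m, w in zip(memberships, weights)) / sum(weights)


# Hypothetical data: two variables observed on three regions.
regions = {"A": [10.0, 50.0], "B": [25.0, 60.0], "C": [30.0, 90.0]}

n_vars = 2
lows = [min(v[k] for v in regions.values()) for k in range(n_vars)]
highs = [max(v[k] for v in regions.values()) for k in range(n_vars)]

memberships = {
    name: [linear_membership(v[k], lows[k], highs[k]) for k in range(n_vars)]
    for name, v in regions.items()
}

equal_weights = {n: fuzzy_indicator(m, [1.0, 1.0]) for n, m in memberships.items()}
unequal_weights = {n: fuzzy_indicator(m, [3.0, 1.0]) for n, m in memberships.items()}
```

Comparing `equal_weights` with `unequal_weights` shows how the choice of weighting criterion changes a region's membership degree while leaving the extremes at 0 and 1.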

Cluster Analysis of Three-Way Atmospheric Data

Abstract

Classification of meteorological time series is important for the analysis of climate variability and climate change. The clustering of several years into groups that are homogeneous with reference to the amount of precipitation and to the atmospheric conditions can aid in understanding the structure of precipitation and may be important in developing hydrological models. In this paper we propose a cluster analysis of multivariate time series based on a dissimilarity measure that considers the functional form of the data. The units to be classified are 148 years, from 1861 to 2008, and the variables are the values of precipitation, the minimum temperature, and the maximum temperature on different occasions (days or months) in the province of Modena (Northern Italy).

Isabella Morlini, Stefano Orlandini

Asymmetric CLUster Analysis Based on SKEW-Symmetry: ACLUSKEW

Abstract

A cluster analysis procedure for asymmetric similarities is introduced, where the similarity from one object to another is not necessarily equal to the similarity from the latter to the former. The procedure analyzes one-mode two-way asymmetric similarities among objects to classify objects into clusters. Each cluster consists of a dominant (central) object and the other (noncentral) objects. The central object of a cluster represents the cluster and dominates the other objects in the cluster. In the present procedure, differences between two conjugate similarities (twice the skew-symmetric component) are weighted by multiplying them by the sum of the two corresponding similarities. Thus the larger the similarity between two objects is, the more prominently the difference is evaluated. The present procedure is applied to car switching data among car categories, and the result is compared with the result obtained by analyzing unweighted differences between two conjugate similarities. The comparison shows that the weighting is reasonable.

Akinori Okada, Satoru Yokoyama
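
The weighting described in the ACLUSKEW abstract above, where the difference between two conjugate similarities is multiplied by their sum, can be written out directly. This is a minimal sketch of that single step only, with a hypothetical similarity matrix and an invented function name; the clustering into central and noncentral objects is not reproduced.

```python
def weighted_skew_differences(similarity):
    """Weight each conjugate difference by the sum of the two similarities.

    Entry (i, j) is (s_ij - s_ji) * (s_ij + s_ji), so pairs with larger
    overall similarity have their asymmetry emphasised.
    """
    n = len(similarity)
    return [
        [
            (similarity[i][j] - similarity[j][i])
            * (similarity[i][j] + similarity[j][i])
            for j in range(n)
        ]
        for i in range(n)
    ]


# Hypothetical asymmetric similarities among three objects.
similarity = [
    [0, 8, 2],
    [4, 0, 1],
    [1, 3, 0],
]

weighted = weighted_skew_differences(similarity)
```

The result is itself skew-symmetric: the entry for (i, j) is the negative of the entry for (j, i), and the sign indicates which object of the pair dominates.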

Parsimonious Generalized Linear Gaussian Cluster-Weighted Models

Abstract

Mixtures with random covariates are statistical models which can be applied for clustering and for density estimation of a random vector composed of a response variable and a set of covariates. In this class, the generalized linear Gaussian cluster-weighted model (GLGCWM) assumes, in each mixture component, an exponential family distribution for the response variable and a multivariate Gaussian distribution for the vector of real-valued covariates. For the sake of parsimony, a family of fourteen models is introduced here by applying some constraints on the eigen-decomposed covariance matrices of the Gaussian distribution. The EM algorithm is described to find maximum likelihood estimates of the parameters for these models. This novel family of models is finally applied to a real data set where a good classification performance is obtained, especially when compared with other well-established mixture-based approaches.

Antonio Punzo, Salvatore Ingrassia

New Perspectives for the MDC Index in Social Research Fields

Abstract

The great interest in quantitative social research has led to the development of specific statistical techniques suitable for dealing with dependence between variables, including in the presence of ordinal data. A specific index, hereafter called the monotonic dependence coefficient (MDC), was provided as a monotonic dependence measure. Due to its properties and specific features, MDC improves on Pearson’s correlation coefficient, since it captures not only linear dependence relationships but also any general monotonic one. The adequacy of MDC is validated by a simulation study assessing its performance with respect to the traditional Pearson’s correlation coefficient. Finally, an application of MDC to real data is also illustrated.

Emanuela Raffinetti, Pier Alda Ferrari

Clustering Methods for Ordinal Data: A Comparison Between Standard and New Approaches

Abstract

The literature on cluster analysis has a long and rich history in several different fields. In this paper, we provide an overview of the best-known clustering methods frequently used to analyse ordinal data. We summarize and compare their main features, discussing some key issues. Finally, an application to real data is illustrated, comparing and discussing the clustering performance of the different methods.

Monia Ranalli, Roberto Rocci

Novelty Detection with One-Class Support Vector Machines

Abstract

In this paper we apply a one-class support vector machine (OC-SVM) to identify potential anomalies in financial time series. We view anomalies as deviations from a prevalent distribution which is the main source behind the original signal. We are interested in detecting changes in the distribution and the timing of the occurrence of the anomalous behaviour in financial time series. The algorithm is applied to synthetic and empirical data. We find that our approach detects changes in anomalous behaviour in synthetic data sets and in several empirical data sets. However, it requires further work to ensure a satisfactory level of consistency and theoretical rigour.

John Shawe-Taylor, Blaž Žličar

Using Discrete-Time Multistate Models to Analyze Students’ University Pathways

Abstract

The methodologies adopted in the last decades to analyze students’ university careers using cohort studies focus mainly on the risk of observing one of the possible competing states, specifically dropout or graduation, after several years of follow-up. In this perspective all the other event types that may prevent the occurrence of the target event are treated as censored observations. A broader analysis of students’ university careers from undergraduate to postgraduate status reveals that several competing and noncompeting events may occur, some of which have been denoted as absorbing and others as intermediate. In this study we propose to use multistate models to analyze the complexity of students’ careers and to assess how the risk of experiencing different states varies over time for students with different profiles. An application is provided to show the usefulness of this approach.

Isabella Sulis, Francesca Giambona, Nicola Tedesco
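
As a very reduced illustration of the discrete-time multistate idea in the abstract above, the sketch below estimates transition probabilities between states from observed period-by-period pathways by simple counting. It is only an assumption-laden toy: the state names and pathways are hypothetical, the probabilities are taken as constant over time and across student profiles, and no covariates are modelled, unlike in the approach the authors propose.

```python
from collections import Counter, defaultdict


def transition_probabilities(pathways):
    """Empirical one-step transition probabilities from observed pathways.

    Each pathway is the sequence of states occupied in consecutive
    periods. States that are never left (absorbing in the sample)
    do not appear as keys of the result.
    """
    counts = defaultdict(Counter)
    for path in pathways:
        for current, following in zip(path, path[1:]):
            counts[current][following] += 1
    return {
        state: {target: c / sum(row.values()) for target, c in row.items()}
        for state, row in counts.items()
    }


# Hypothetical pathways, one state per academic year.
pathways = [
    ["enrolled", "enrolled", "graduated"],
    ["enrolled", "dropout"],
    ["enrolled", "enrolled", "enrolled", "graduated"],
]

probabilities = transition_probabilities(pathways)
```

Here "graduated" and "dropout" behave as absorbing states, while "enrolled" is an intermediate state that can be occupied for several periods.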

Title: Advances in Statistical Models for Data Analysis
Edited by: Isabella Morlini
Tommaso Minerva
Maurizio Vichi
Publisher: Springer International Publishing
Electronic ISBN: 978-3-319-17377-1
Print ISBN: 978-3-319-17376-4
DOI: https://doi.org/10.1007/978-3-319-17377-1