
2011 | Book

New Perspectives in Statistical Modeling and Data Analysis

Proceedings of the 7th Conference of the Classification and Data Analysis Group of the Italian Statistical Society, Catania, September 9–11, 2009

Edited by: Salvatore Ingrassia, Roberto Rocci, Maurizio Vichi

Publisher: Springer Berlin Heidelberg

Book series: Studies in Classification, Data Analysis, and Knowledge Organization


About this book

This volume provides recent research results in data analysis, classification and multivariate statistics and highlights perspectives for new scientific developments within these areas. Particular attention is devoted to methodological issues in clustering, statistical modeling and data mining. The volume also contains significant contributions to a wide range of applications such as finance, marketing, and social sciences. The papers in this volume were first presented at the 7th Conference of the Classification and Data Analysis Group (ClaDAG) of the Italian Statistical Society, held at the University of Catania, Italy.

Table of Contents

Frontmatter

Data Modeling for Evaluation

Frontmatter
Evaluating the Effects of Subsidies to Firms with Nonignorably Missing Outcomes

In the paper, the effects of subsidies to Tuscan handicraft firms are evaluated; the study is affected by missing outcome values, which cannot be assumed missing at random. We tackle this problem within a causal inference framework. By exploiting Principal Stratification and the availability of an instrument for the missing mechanism, we conduct a likelihood-based analysis, proposing a set of plausible identification assumptions. Causal effects are estimated on (latent) subgroups of firms, characterized by their response behavior.

Fabrizia Mealli, Barbara Pacini, Giulia Roli
Evaluating Lecturer’s Capability Over Time. Some Evidence from Surveys on University Course Quality

The attention towards the evaluation of the Italian university system has prompted an increasing interest in collecting and analyzing longitudinal data on students' assessments of courses, degree programs and faculties. This study focuses on students' opinions gathered in three contiguous academic years. The main aim is to test a suitable method to evaluate a lecturer's performance over time, considering students' assessments of several features of the lecturer's capabilities. The use of the same measurement instrument allows us to shed some light on changes that occur over time and to attribute them to specific characteristics. Multilevel analysis is combined with Item Response Theory in order to build up specific trajectories of performance of the lecturer's capability. The result is a random-effects ordinal regression model for four-level data that assumes an ordinal logistic regression function. It allows us to take into account several factors which may influence the variability in the assessed quality over time.

Isabella Sulis, Mariano Porcu, Nicola Tedesco
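
As a hedged illustration of the kind of model the abstract describes (a sketch in illustrative notation, not the authors' exact specification), a random-effects cumulative logit for ordinal ratings with nested random effects can be written as

\[
\operatorname{logit} P(Y_{hijk} \le c) = \gamma_c - \left( \mathbf{x}_{hijk}'\boldsymbol{\beta} + u_i + v_{ij} + w_{ijk} \right), \qquad u_i \sim N(0,\sigma_u^2),\; v_{ij} \sim N(0,\sigma_v^2),\; w_{ijk} \sim N(0,\sigma_w^2),
\]

where the \(\gamma_c\) are ordered thresholds and the three random effects capture variability at the successive levels of nesting above the individual rating (e.g., lecturer, year within lecturer, student within year); the paper's actual nesting and covariates differ in detail.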
Evaluating the External Effectiveness of the University Education in Italy

This paper aims at assessing whether the external effectiveness of groups of course programs across all Italian universities can be measured, taking into account both the characteristics of individuals and the context factors that affect the Italian regions differently. We perform the analysis using a multilevel logistic model on a data set from the survey on job opportunities of Italian 2004 graduates, conducted in 2007 by the Italian National Institute of Statistics.

Matilde Bini
Analyzing Research Potential through Redundancy Analysis: the case of the Italian University System

The paper proposes a multivariate approach to study the dependence of the scientific productivity on the human research potential in the Italian University system. In spite of the heterogeneity of the system, Redundancy Analysis is exploited to analyse the University research system as a whole. The proposed approach is embedded in an exploratory data analysis framework.

Cristina Davino, Francesco Palumbo, Domenico Vistocco
A Participative Process for the Definition of a Human Capital Indicator

In this paper, we discuss a method for defining the hierarchical structure of a composite indicator of graduate human capital that could be used to measure the educational effectiveness of Italian universities. The structure and weights of the dimensions of graduate human capital, and the set and weights of the elementary indicators, were determined using a three-round Delphi-like procedure. We contacted the rectors, the presidents of the evaluation boards and other qualified professors at Italian universities, as well as representatives of worker unions and entrepreneur associations. Our exercise shows that most dimensions of graduate human capital are related to the educational role of universities and that weights and indicators of the dimensions can plausibly be measured with the participation of the concerned individuals.

Luigi Fabbris, Giovanna Boccuzzo, Maria Cristiana Martini, Manuela Scioni
Using Poset Theory to Compare Fuzzy Multidimensional Material Deprivation Across Regions

In this paper, a new approach to the fuzzy analysis of multidimensional material deprivation data is provided, based on partial order theory. The main feature of the methodology is that the information needed for the deprivation assessment is extracted directly from the relational structure of the dataset, avoiding any kind of scaling and aggregation procedure, so as to respect the ordinal nature of the data. An example based on real data is worked out, pertaining to material deprivation in Italy for the year 2004.

Marco Fattore, Rainer Brüggemann, Jan Owsiński
Some Notes on the Applicability of Cluster-Weighted Modeling in Effectiveness Studies

In the nineties, numerous authors proposed the use of Multilevel Models in effectiveness studies. However, this approach has been strongly criticized. Cluster-Weighted Modeling (CWM) is a flexible statistical framework, which is based on weighted combinations of local models. While Multilevel Models provide rankings of the institutions, in the CWM approach many models of effectiveness are estimated, each of them being valid for a certain subpopulation of users.

Simona C. Minotti
Impact Evaluation of Job Training Programs by a Latent Variable Model

We introduce a model for categorical panel data which is tailored to the dynamic evaluation of the impact of job training programs. The model may be seen as an extension of the dynamic logit model in which unobserved heterogeneity between subjects is taken into account by the introduction of a discrete latent variable. For the estimation of the model parameters we use an EM algorithm and we compute standard errors on the basis of the numerical derivative of the score vector of the complete data log-likelihood. The approach is illustrated through the analysis of a dataset containing the work histories of the employees of the private firms of the province of Milan between 2003 and 2005, some of whom attended job training programs supported by the European Social Fund.

Francesco Bartolucci, Fulvia Pennoni
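
A hedged sketch of the extended dynamic logit the abstract outlines, with a discrete latent effect \(u_i\) taking one of \(k\) support points (notation illustrative, not taken from the paper):

\[
\operatorname{logit} P(y_{it} = 1 \mid u_i, y_{i,t-1}, d_{it}) = u_i + \gamma\, y_{i,t-1} + \beta\, d_{it} + \mathbf{x}_{it}'\boldsymbol{\delta},
\]

where \(d_{it}\) indicates participation in a training program, \(\gamma\) measures state dependence, and the distribution of \(u_i\) over its support points is estimated together with the other parameters via EM.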

Data Analysis in Economics

Frontmatter
Analysis of Collaborative Patterns in Innovative Networks

This paper focuses on territorial innovative networks, where a variety of actors (firms, institutions and research centers) are involved in research activities, and aims to set up a strategy for the analysis of such networks. The strategy is twofold and relies, on the one hand, on the secondary data available from administrative databases and, on the other, on survey data related to the organizations involved in innovative networks. In order to describe the peculiar structures of innovative networks, the proposed strategy adopts the techniques suggested in the framework of Social Network Analysis. In particular, the main goal of the analysis is to highlight the network characteristics (interactions between industry, university and local government) that can influence network efficiency in terms of knowledge exchange and diffusion of innovation. Our strategy will be discussed in the framework of an Italian technological district, i.e., a type of innovative network.

Alfredo Del Monte, Maria Rosaria D’Esposito, Giuseppe Giordano, Maria Prosperina Vitale
The Measure of Economic Re-Evaluation: a Coefficient Based on Conjoint Analysis

During the last 40 years conjoint analysis has been used to solve a wide variety of concerns in market research. Recently, a number of studies have begun to use conjoint analysis for the economic valuation of non-market goods. This paper discusses how to extend the conjoint analysis area of application by introducing a coefficient to measure economic re-evaluation on the basis of utility scores and the relative importances of attributes provided by conjoint analysis. We utilise the suggested coefficient for the economic valuation of a typical non-market good, such as a worldwide cultural event, to reveal the trade-offs between its attributes in terms of revenue variation. Our findings indicate the most valuable change to be made to the existing status quo to generate economic surplus.

Paolo Mariani, Mauro Mussini, Emma Zavarrone
Do Governments Effectively Stabilize Fuel Prices by Reducing Specific Taxes? Evidence from Italy

After the sharp increase of oil prices experienced in recent years, many countries, in order to stabilize fuel prices, experimented with automatic fiscal mechanisms consisting of a one-to-one reduction in specific taxes matching the rise in input prices. This study investigates the impact of these mechanisms on wholesale gasoline and motor diesel prices. Our estimates highlight that fiscal sterilization brings about a rise in final wholesale prices that more than compensates for the reduction in taxes. Hence, these "flexible" taxation mechanisms may not be a proper policy for stabilizing price levels in fuel markets.

Marina Di Giacomo, Massimiliano Piacenza, Gilberto Turati
An Analysis of Industry Sector Via Model Based Clustering

The paper presents an unsupervised procedure for the evaluation of a firm's financial status, aiming at identifying a potentially weak level of solvency of a company through its positioning in a segmented sector. Model Based Clustering is used here to segment real datasets concerning sectoral samples of industrial companies listed on five European stock exchange markets.

Carmen Cutugno
Impact of Exogenous Shocks on Oil Product Market Prices

The presence in Italy of a high number of vertically integrated energy companies has led us to investigate the effects that the adoption of new price policies and geopolitical events have on the mechanisms of price transmission in the Italian wholesale and retail gasoline markets, using weekly data from January 3, 2000 to November 28, 2008. The interactions among crude oil prices, gasoline spot prices and before-tax gasoline retail prices have been considered. The results show that industrial policies have a significant role in explaining gasoline prices. More specifically, a shock in the retail market plays an important role in the increasing price of gasoline.

Antonio Angelo Romano, Giuseppe Scandurra

Nonparametric Kernel Estimation

Probabilistic Forecast for Northern New Zealand Seismic Process Based on a Forward Predictive Kernel Estimator

In seismology, predictive properties of the estimated intensity function are often pursued. For this purpose, we propose an estimation procedure in the time, longitude, latitude and depth domains, based on the subsequent increments of likelihood obtained by adding one observation at a time. On the basis of this estimation approach, a forecast of earthquakes in a given area of Northern New Zealand is provided, assuming that future earthquake activity may be based on the smoothing of past earthquakes.

Giada Adelfio, Marcello Chiodi
Discrete Beta Kernel Graduation of Age-Specific Demographic Indicators

Several approaches have been proposed in the literature for the kernel graduation of age-specific demographic indicators. Nevertheless, although age is pragmatically a discretized variable with finite support (typically age at last birthday is considered), commonly used methods employ continuous kernel functions. Moreover, symmetric kernels, which introduce further bias at the support boundaries (the so-called problem of boundary bias), are routinely adopted. In this paper we propose a discrete kernel smooth estimator specifically conceived for the graduation of discrete finite functions, such as age-specific indicators. Kernel functions are chosen from a family of conveniently discretized and re-parameterized beta densities; since their support matches the age range, the issue of boundary bias is eliminated. An application to 1999–2001 mortality data from the Valencia Region (Spain) is also presented.

Angelo Mazza, Antonio Punzo
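
A minimal Python sketch of the general idea, assuming a Chen-style beta kernel re-parameterized around the evaluation age and discretized by renormalizing over the integer age grid (the paper's exact parameterization may differ):

import numpy as np
from scipy.stats import beta

def discrete_beta_weights(x, ages, h):
    # Beta kernel with mode near the evaluation age x (Chen-type
    # parameterization); the [0, 1] support is mapped onto the age range,
    # so no kernel mass falls outside it and boundary bias disappears.
    m = ages.max()
    t = ages / m
    a = (x / m) / h + 1.0
    b = (1.0 - x / m) / h + 1.0
    w = beta.pdf(t, a, b)
    return w / w.sum()            # discretization: renormalize on the grid

def graduate(raw_rates, ages, h=0.1):
    # graduated rate at each age = kernel-weighted average of raw rates
    return np.array([discrete_beta_weights(x, ages, h) @ raw_rates
                     for x in ages])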
Kernel-Type Smoothing Methods of Adjusting for Unit Nonresponse in Presence of Multiple and Different Type Covariates

This paper deals with the nonresponse problem in the estimation of the mean of a finite population, following a nonparametric approach. Weighting adjustment is a popular method for handling unit nonresponse. It operates by increasing the sampling weights of the respondents in the sample using estimates of their response probabilities. Typically, these estimates are obtained by fitting parametric models relating response occurrences and auxiliary variables. An alternative solution is the nonparametric estimation of the response probabilities. The aim of this paper is to investigate, via simulation experiments, the small-sample properties of kernel regression estimation of the response probabilities when the auxiliary information consists of a mix of continuous and discrete variables. Furthermore, the practical behavior of the method is evaluated on data from a web survey on accommodation facilities in the province of Florence.

Emilia Rocco
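
To make the weighting-adjustment step concrete, here is a small illustrative sketch (a stand-in under simplifying assumptions, not the paper's estimator): response probabilities estimated by Nadaraya-Watson kernel regression, then used to inflate the design weights of respondents.

import numpy as np

def nw_response_prob(X, r, x0, h=0.5):
    # Nadaraya-Watson estimate of P(response | x) at point x0;
    # X: (n, p) auxiliary variables, r: 0/1 response indicators
    d2 = ((X - x0) ** 2).sum(axis=1)
    k = np.exp(-0.5 * d2 / h ** 2)        # Gaussian kernel
    return k @ r / k.sum()

def adjusted_mean(X, r, y, w, h=0.5):
    # Hajek-type estimator: design weights of respondents divided by
    # their estimated response probabilities (y observed for respondents)
    phi = np.array([nw_response_prob(X, r, x, h) for x in X])
    resp = r == 1
    wa = w[resp] / phi[resp]
    return wa @ y[resp] / wa.sum()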

Data Analysis in Industry and Services

Measurement Errors and Uncertainty: A Statistical Perspective

Evaluation of measurement systems is necessary in many industrial contexts. The literature on this topic is mainly focused on how to measure uncertainties for systems that yield continuous output. Few references are available for categorical data, and they are briefly recalled in this paper. Finally, a new proposal to measure uncertainty when the output is bounded ordinal is introduced.

Laura Deldossi, Diego Zappa
Measurement Uncertainty in Quantitative Chimerism Monitoring after Stem Cell Transplantation

Following allogeneic stem cell transplantation, graft status is often inferred from values for DNA chimerism in blood or bone marrow. Timely assessment of graft status is critical to determine proper management after cell transplantation. A common methodology for chimerism testing is based on STR-PCR, i.e. PCR amplification of Short Tandem DNA Repeats. This is a complex technology platform for indirect DNA measurement. It is however associated with inherent variability originating from preparation, amplification of the DNA, and uncalibrated product detection. Nonetheless, these semi-quantitative measurements of DNA quantity are used to determine graft status from estimated percent chimerism [%Chim]. Multiplex PCR partially overcomes this limitation by using a set of simultaneously amplified STR markers, which enables computing a [mean%Chim] value for the sample. Quantitative assessment of measurement variability and sources of error in [mean%Chim] is particularly important for longitudinal monitoring of graft status. In such cases, it is necessary to correctly interpret differential changes of [mean%Chim] as reflective of the biological status of the graft, and not mere error of the assay. This paper presents a systematic approach to assessing different sources of STR measurement uncertainty in the tracking of chimerism. Severe procedural and cost constraints make this assessment nontrivial. We present our results in the context of Practical Statistical Efficiency (PSE), the practical impact of statistical work, and InfoQ, the Information Quality encapsulated in ChimerTrack®, a software application tracking chimerism.

Ron S. Kenett, Deborah Koltai, Don Kristt
Satisfaction, Loyalty and WOM in Dental Care Sector

We propose two different measures for the dental care industry: (a) the BALG matrix, a new instrument to measure patient loyalty and its extent; (b) a SERVQUAL-based approach to measure patient satisfaction. Further investigation concerns the link between patient satisfaction and loyalty. The results show that patient loyalty in the dental care industry is similar to consumer behaviour in all the other B2B and B2C services; furthermore, the results highlight a low dependency of patient satisfaction on loyalty.

Paolo Mariani, Emma Zavarrone
Controlled Calibration in Presence of Clustered Measures

In the context of statistical controlled calibration we introduce the 'multilevel calibration estimator' in order to account for clustered measurements. To tackle this issue more closely, results from a simulation study will be extensively discussed. Finally, an application from the building industry will be presented.

Silvia Salini, Nadia Solaro

Visualization of Relationships

Frontmatter
Latent Ties Identification in Inter-Firms Social Networks

Social networks are usually analyzed through manifest variables. However, there are latent social aspects that strongly qualify such networks. This paper aims to propose a statistical methodology to identify latent variables in inter-firm social networks. A multidimensional scaling technique is proposed to measure a latent variable as a combination of an appropriate set of two or more manifest relational aspects. This method, tested on an inter-firm social network in the Marche region (Italy), is a new way to grasp social aspects with quantitative tools and could be implemented under several different conditions, also using other variables.

Patrizia Ameli, Federico Niccolini, Francesco Palumbo
A Universal Procedure for Biplot Calibration

Biplots form a useful tool for the graphical exploration of multivariate data sets. A wide variety of biplots has been described for quantitative data sets, contingency tables, correlation matrices and matrices of regression coefficients. These are produced by principal component analysis (PCA), correspondence analysis (CA), canonical correlation analysis (CCO) and redundancy analysis (RDA). The information content of a biplot can be increased by adding scales with tick marks to the biplot arrows, a process called calibration. We describe a general procedure for obtaining scales that is based on finding an optimal calibration factor by generalized least squares. This procedure allows automatic calibration of axes in all aforementioned biplots. Use of the optimal calibration factor produces graduations that are identical to Gower's predictive scales. A procedure for automatically shifting calibrated axes towards the margins of the plot is presented.

Jan Graffelman
Analysis of Skew-Symmetry in Proximity Data

Skew-symmetric data matrices can be represented in graphical displays in different ways. Some simple procedures that can be easily performed by standard software will be proposed and applied in this communication. Other methods based on spatial models that need ad hoc computational programs will be also reviewed emphasizing advantages and disadvantages in their applications.

Giuseppe Bove
Social Stratification and Consumption Patterns: Cultural Practices and Lifestyles in Japan

The aim of this paper is to examine the relationship between cultural consumption and social stratification. Based upon a nationally representative 2005 Japanese sample (N = 2,915), we uncovered the association between a wide range of cultural capital and social class in Japan. In doing so, we re-examined conventional occupational schemes and developed a detailed occupational classification. Correspondence analysis revealed that both men and women who are well-educated and have a higher occupational position have more cultural capital. The results also indicate gender-specific cultural consumption patterns. For women, highbrow culture is important for distinguishing themselves and maintaining social position. In contrast, highbrow culture is defined as an irrelevant waste of time for men of higher position and instead business culture, which is characterized by a mixture of enterprise and rationality, prevails.

Miki Nakai
Centrality of Asymmetric Social Network: Singular Value Decomposition, Conjoint Measurement, and Asymmetric Multidimensional Scaling

The centrality of an asymmetric social network, where relationships among actors are asymmetric, is investigated by singular value decomposition, conjoint measurement, and asymmetric multidimensional scaling. They were applied to asymmetric relationships among managers of a small firm. Two sets of outward and inward centralities were derived by singular value decomposition. The first set is similar to the centralities obtained by conjoint measurement and by asymmetric multidimensional scaling. The second set represents a different aspect from the centralities of the first set, as well as from those derived by conjoint measurement and asymmetric multidimensional scaling.

Akinori Okada

Classification

Frontmatter
Some Perspectives on Multivariate Outlier Detection

We provide a selective view of some key statistical concepts that underlie the different approaches to multivariate outlier detection. Our hope is that appreciation of these concepts will help to establish a unified and widely accepted framework for outlier detection.

Andrea Cerioli, Anthony C. Atkinson, Marco Riani
Spatial Clustering of Multivariate Data Using Weighted MAX-SAT

Incorporating geographical constraints is one of the main challenges of spatial clustering. In this paper we propose a new algorithm for clustering of spatial data using a conjugate Bayesian model and weighted MAX-SAT solvers. The fast and flexible Bayesian model is used to score promising partitions of the data. However, the partition space is huge and it cannot be fully searched, so here we propose an algorithm that naturally incorporates the geographical constraints to guide the search over the space of partitions. We illustrate our proposed method on a simulated dataset of social indexes.

Silvia Liverani, Alessandra Petrucci
Clustering Multiple Data Streams

In recent years, data stream analysis has gained a lot of attention due to the growth of applicative fields generating huge amounts of temporal data. In this paper we focus on the clustering of multiple streams. We propose a new strategy which aims at grouping similar streams and, at the same time, at computing summaries of the incoming data. This is performed by means of a divide and conquer approach where a continuously updated graph collects information on incoming data and an off-line partitioning algorithm provides the final clustering structure. An application on real data sets corroborates the effectiveness of the proposal.

Antonio Balzanella, Yves Lechevallier, Rosanna Verde
Notes on the Robustness of Regression Trees Against Skewed and Contaminated Errors

Regression trees represent one of the most popular tools in predictive data mining applications. However, previous studies have shown that their performances are not completely satisfactory when the dependent variable is highly skewed, and severely degrade in the presence of heavy-tailed error distributions, especially for grossly mis-measured values of the dependent variable. In this paper the lack of robustness of some classical regression trees is investigated by addressing the issue of highly-skewed and contaminated error distributions. In particular, the performances of some non-robust regression trees are evaluated through a Monte Carlo experiment and compared to those of some trees, based on M-estimators, recently proposed in order to robustify this kind of method. Finally, the results obtained from the analysis of a real dataset are presented.

Giuliano Galimberti, Marilena Pillati, Gabriele Soffritti
A Note on Model Selection in STIMA

The Simultaneous Threshold Interaction Modeling Algorithm (STIMA) has recently been introduced in the framework of statistical modeling as a tool enabling automatic selection of interactions in a Generalized Linear Model (GLM) through the estimation of a suitably defined tree structure called a 'trunk'. STIMA integrates a GLM with a classification tree algorithm or a regression tree one, depending on the nature of the response variable (nominal or numeric). Accordingly, it can be based on the Classification Trunk Approach (CTA) or on the Regression Trunk Approach (RTA). In both cases, interaction terms are expressed as 'threshold interactions' instead of traditional cross-products. Compared with standard tree-based algorithms, STIMA is based on a different splitting criterion, as well as on the possibility to 'force' the first split of the trunk by manually selecting the first splitting predictor. This paper focuses on model selection in STIMA and introduces an alternative model selection procedure based on a measure which evaluates the trade-off between goodness of fit and accuracy. Its performance is compared with that of the current implementation of STIMA by analyzing two real datasets.

Claudio Conversano
Conditional Classification Trees by Weighting the Gini Impurity Measure

This paper introduces the concept of conditional impurity in the framework of tree-based models in order to deal with the analysis of three-way data, where a response variable and a set of predictors are measured on a sample of objects on different occasions. The conditional impurity in the definition of the splitting criterion is defined as a classical impurity measure weighted by a predictability index.

Antonio D’Ambrosio, Valerio A. Tutore
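
A hedged sketch of what such a weighted criterion can look like (illustrative notation; the paper defines its own predictability index): the Gini impurity at node t within occasion s is downweighted by a predictability index \(\tau_s\),

\[
i_G(t) = 1 - \sum_j p(j \mid t)^2, \qquad i_C(t) = \tau_s \, i_G(t),
\]

so that candidate splits are evaluated conditionally on how predictable the response is within each occasion.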

Analysis of Financial Data

Visualizing and Exploring High Frequency Financial Data: Beanplot Time Series

In this paper we deal with the problem of visualizing and exploring specific time series such as high-frequency financial data. These data present unique features, absent in classical time series, which make it necessary to search for and analyse an aggregate behaviour. Therefore, we define peculiar aggregated time series called beanplot time series. We show the advantages of using them instead of scalar time series when the data have a complex structure. Furthermore, we underline the interpretative properties of beanplot time series by comparing different types of aggregated time series. In particular, with simulated and real examples, we illustrate the different statistical performances of beanplot time series with respect to boxplot time series.

Carlo Drago, Germana Scepi
Using Partial Least Squares Regression in Lifetime Analysis

The problem of collinearity among right-censored data is considered in multivariate linear regression by combining mean imputation and Partial Least Squares (PLS) methods. The purpose of this paper is to investigate the performance of PLS regression when the explanatory variables are strongly correlated financial ratios. It is shown that ignoring the presence of censoring in the data can cause a bias. The proposed methodology is applied to a data set describing the financial status of some small and medium-sized Tunisian firms. The derived model is able to predict the lifetime of a firm until the occurrence of the failure event.

Intissar Mdimagh, Salwa Benammou
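
A minimal sketch of the two ingredients, assuming scikit-learn's PLSRegression and a deliberately naive imputation rule as a stand-in for the paper's mean-imputation step:

import numpy as np
from sklearn.cross_decomposition import PLSRegression

def impute_censored(y, delta):
    # hypothetical mean-imputation scheme: a right-censored lifetime is
    # replaced by the mean of the observed failure times that exceed it
    y = y.astype(float).copy()
    for i in np.where(delta == 0)[0]:
        larger = y[(delta == 1) & (y > y[i])]
        if larger.size:
            y[i] = larger.mean()
    return y

# X: matrix of (correlated) financial ratios; y: lifetimes; delta: 1 = failure
# pls = PLSRegression(n_components=2).fit(X, impute_censored(y, delta))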
Robust Portfolio Asset Allocation

Selection of stocks in a portfolio of shares represents a very interesting problem of 'optimal classification'. Often such optimal allocation is determined by second-order conditions which are very sensitive to outliers. Classical Markowitz estimators of the covariance matrix seem to provide poor results in financial management, so we propose an alternative way of weighting observations by using a forward search approach. An application to real data, which shows the advantages of the proposed approach, is given at the end of this work.

Luigi Grossi, Fabrizio Laurini
A Dynamic Analysis of Stock Markets through Multivariate Latent Markov Models

A correct classification of financial products represents the essential and required step for achieving optimal investment decisions. The first goal in portfolio analysis should be the allocation of each asset into a class which groups investment opportunities characterized by a homogenous risk-return profile. Furthermore, the second goal should be the assessment of the stability of the class composition. In this paper we address both objectives by means of latent Markov models, which allow us to investigate the dynamic pattern of financial time series through an innovative framework. First, we propose to exploit the potential of latent Markov models in order to obtain latent classes able to group stocks with similar risk-return profiles. Second, we interpret the transition probabilities estimated within latent Markov models as the probabilities of switching between the well-known states of financial markets: the upward trend, the downward trend and the lateral phases. Our results allow us both to discriminate stocks' performance following a powerful classification approach and to assess stocks' dynamics by predicting which state they are going to experience next.

Michele Costa, Luca De Angelis
A MEM Analysis of African Financial Markets

In the last few years, international institutions have stressed the role of African financial markets in diversifying investors' risk. Focusing on the volatility of financial markets, this paper analyses the relationships between developed markets (US, UK and China) and some Sub-Saharan African (SSA) emerging markets (Kenya, Nigeria and South Africa) in the period 2004–2009 using a Multiplicative Error Model (MEM). We model the dynamics of the volatility in one market including interactions from other markets, and we build a fully interdependent model. Results show that South Africa and China have a key role in all African markets, while the influence of the UK and the US is weaker. Developments in China turn out to be (fairly) independent of both UK and US markets. With the help of impulse-response functions, we show how recent turmoil hit African countries, increasing the fragility of their infant financial markets.

Giorgia Giovannetti, Margherita Velucchi
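
For readers unfamiliar with the MEM class, a univariate sketch with one cross-market spillover term (illustrative; the paper's fully interdependent specification generalizes this):

\[
x_t = \mu_t \varepsilon_t, \qquad \mu_t = \omega + \alpha x_{t-1} + \beta \mu_{t-1} + \gamma \tilde{x}_{t-1}, \qquad \varepsilon_t \ge 0,\; E[\varepsilon_t] = 1,
\]

where \(x_t\) is the (non-negative) volatility measure of one market, \(\tilde{x}_{t-1}\) that of another market, and the multiplicative error \(\varepsilon_t\) is typically Gamma-distributed.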
Group Structured Volatility

In this work we investigate the presence of 'group' structures in financial markets. We show how this information can be used to simplify the volatility modelling of large portfolios. Our testing dataset is composed of all the stocks listed on the S&P500 index.

Pietro Coretto, Michele La Rocca, Giuseppe Storti

Functional Data Analysis

Frontmatter
Clustering Spatial Functional Data: A Method Based on a Nonparametric Variogram Estimation

In this paper we propose an extended version of a model-based strategy for clustering spatial functional data. The strategy we refer to aims simultaneously at classifying spatially dependent curves and at obtaining a spatial functional model prototype for each cluster. Fitting these models involves estimating a variogram function, the trace-variogram. Our proposal is to introduce an alternative estimator for the trace-variogram function: a kernel variogram estimator. This adapts better to spatially varying features of the functional data pattern. Experimental comparisons show this approach has some advantages over the previous one.

Elvira Romano, Rosanna Verde, Valentina Cozza
Prediction of an Industrial Kneading Process via the Adjustment Curve

This work addresses the problem of predicting a binary response associated to a stochastic process. When observed data are of functional type, a new method based on the definition of special Random Multiplicative Cascades is introduced to simulate the stochastic process. The adjustment curve is a decreasing function which gives the probability that a realization of the process is adjustable at each time before the end of the process. For real industrial processes, this curve can be used for monitoring and predicting the quality of the outcome before completion. Results of an application to data from an industrial kneading process are presented.

Giuseppina D. Costanzo, Francesco Dell’Accio, Giulio Trombetta
Dealing with FDA Estimation Methods

In many research fields, such as medicine, physics and economics, the real phenomena observed at each statistical unit are described by a curve or an assigned function. In this framework, a suitable statistical approach is Functional Data Analysis based on the use of basis functions. An alternative method, using Functional Analysis tools, is considered in order to estimate functional statistics. Assuming a parametric family of functional data, we investigate the problem of computing summary statistics of the same parametric form when the set of all functions having that form does not constitute a linear space. The central idea is to compute statistics on the parameters instead of on the functions themselves.

Tonio Di Battista, Stefano A. Gattone, Angela De Sanctis

Computer Intensive Methods

Frontmatter
Testing for Dependence in Mixed Effect Models for Multivariate Mixed Responses

In regression modelling for multivariate responses of mixed type, the association between outcomes may be modeled through dependent, outcome-specific, latent effects. Parametric specifications of this model already exist in the literature; in this paper, we focus on model parameter estimation in a Finite Mixture (FM) framework. A relevant issue arises when independence should be tested versus dependence. We review the performance of the LRT and penalized likelihood criteria to assess the presence of dependence between outcome-specific random effects. The model behavior, investigated through the analysis of simulated datasets, shows that AIC and BIC are of little help to test for dependence, while the bootstrapped LRT statistic performs well even with small sample sizes and a limited number of bootstrap samples.

Marco Alfó, Luciano Nieddu, Donatella Vicari
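
A generic parametric-bootstrap LRT skeleton of the kind such simulations rely on (the three callables are placeholders for model-specific fitting code, not an API from the paper):

import numpy as np

def bootstrap_lrt(loglik_null, loglik_alt, simulate_null, data, B=199, seed=0):
    # LR = 2 * (maximized alt. log-lik - maximized null log-lik); its null
    # distribution is approximated by refitting both models on datasets
    # simulated from the fitted null model
    rng = np.random.default_rng(seed)
    lr_obs = 2.0 * (loglik_alt(data) - loglik_null(data))
    lr_boot = [2.0 * (loglik_alt(d) - loglik_null(d))
               for d in (simulate_null(data, rng) for _ in range(B))]
    p_value = (1 + sum(lr >= lr_obs for lr in lr_boot)) / (B + 1)
    return lr_obs, p_value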
Size and Power of Tests for Regression Outliers in the Forward Search

The Forward Search is a method for detecting masked outliers and for determining their effect on models fitted to the data. We have estimated the actual statistical size and power of the Forward Search in regression through a large number of simulations, for a wide set of sample sizes and several dimensions. Special attention is given here to the statistical size. The work confirms for regression the excellent Forward Search properties shown in the multivariate context by Riani et al. (Journal of the Royal Statistical Society. Series B 71:1–21, 2009).

Francesca Torti, Domenico Perrotta
Using the Bootstrap in the Analysis of Fractionated Screening Designs

In recent years the bootstrap has been favored for the analysis of replicated designs, both full and fractional factorials. Unfortunately, its application to fractionated designs has some limitations if interactions are considered. In this paper the bootstrap is used in a more effective way in the analysis of fractional designs, arguing that its application is closely connected with the projective properties of the design used. A two-stage approach is proposed for two-level orthogonal designs of projectivity P = 3, both replicated and unreplicated, which are especially useful in screening experiments. This approach combines a preliminary search for active factors with a thorough study of factor effects, including interactions, via a bootstrap analysis. It is especially useful in non-standard situations such as with nonnormal data, outliers, and heteroscedasticity, but depends heavily on the assumption of factor sparsity. Three examples are used for illustration.

Anthony Cossari
CRAGGING Measures of Variable Importance for Data with Hierarchical Structure

This paper focuses on algorithmic Variable Importance measurement when hierarchically structured data sets are used. Ensemble learning algorithms (such as Random Forest or Gradient Boosting Machine), which are frequently used to assess Variable Importance, are unsuitable for exploring hierarchical data. For this reason an ensemble learning algorithm called CRAGGING has recently been proposed. The aim of this paper is to introduce the CRAGGING Variable Importance measures and to inspect how they perform empirically.

Marika Vezzoli, Paola Zuccolotto
Regression Trees with Moderating Effects

This paper proposes a regression tree methodology that considers the relationships among variables belonging to different levels of a data matrix characterized by a hierarchical structure. In this way we consider two kinds of partitioning criteria dealing with nonparametric regression analysis. The proposal is based on a generalization of the Classification and Regression Trees (CART) algorithm that considers the different role played by moderating variables. Some applications on real and simulated datasets are shown to compare the proposal with classical approaches.

Gianfranco Giordano, Massimo Aria
Data Mining for Longitudinal Data with Different Treatments

The CLAssification and Data Analysis Group (CLADAG) of the Italian Statistical Society recently organised a competition, the ‘Young Researcher Data Mining Prize’ sponsored by the SAS Institute. This paper was the winning entry and in it we detail our approach to the problem proposed and our results. The main methods used are linear regression, mixture models, Bayesian autoregressive and Bayesian dynamic models.

Mouna Akacha, Thaís C. O. Fonseca, Silvia Liverani

Data Analysis in Environmental and Medical Sciences

Supervised Classification of Thermal High-Resolution IR Images for the Diagnosis of Raynaud’s Phenomenon

This paper proposes a supervised classification approach for the differential diagnosis of Raynaud's Phenomenon on the basis of functional infrared imaging (IR) data. The segmentation and registration of IR images are briefly discussed and two texture analysis techniques are introduced in a spatial framework to deal with the feature extraction problem. The classification of data from healthy subjects and from patients suffering from primary and secondary Raynaud's Phenomenon is performed by using Stepwise Linear Discriminant Analysis (LDA) on a large number of features extracted from the images. The results of the proposed methodology are shown and discussed for images related to 44 subjects.

Graziano Aretusi, Lara Fontanella, Luigi Ippoliti, Arcangelo Merla
A Mixture Regression Model for Resistin Levels Data

Resistin is a mainly adipose-derived peptide hormone that reduces insulin sensitivity in adipocytes, skeletal muscles and hepatocytes. Only in recent years has resistin been studied in liver disease, and the related literature is very sparse. Following recent studies that consider resistin as a clinical biomarker in the assessment of liver cirrhosis, we propose an application of a finite mixture regression model with a concomitant variable in order to identify factors that influence resistin levels in patients affected by different virus hepatitis. The estimated model shows the existence of two separate components differing in the intercept and in some covariates; moreover, high serum resistin levels do not seem to be associated with liver histological lesions by C virus, but only by B virus hepatitis.

Romana Gargano, Angela Alibrandi
Interpreting Air Quality Indices as Random Quantities

Synthetic indices are a way of condensing complex situations to give one single value. A very common example of this in environmental studies is that of air quality indices; in their construction, statistics is helpful in summarizing multidimensional information. In this work, we are going to consider synthetic air-quality indices as random quantities, and investigate their main properties by comparing the confidence bands of their cumulative distribution functions.

Francesca Bruno, Daniela Cocchi
Comparing Air Quality Indices Aggregated by Pollutant

In this paper a new aggregate Air Quality Index (AQI), useful for describing the global air pollution situation for a given area, is proposed. The index, unlike most currently used AQIs, takes into account the combined effects of all the considered pollutants on human health. Its good performance, tested by means of a simulation plan, is confirmed by a comparison with two other indices proposed in the literature, one of which is based on the Relative Risk of daily mortality, considering an application to real data.

Mariantonietta Ruggieri, Antonella Plaia
Identifying Partitions of Genes and Tissue Samples in Microarray Data

An important challenge in microarray data analysis is the detection of genes which are differentially expressed across different types of experimental conditions. We provide a finite mixture model aimed at clustering genes and experimental conditions, where the partition of experimental conditions may be known or unknown. In particular, the idea is to adopt a finite mixture approach with mean/covariance reparameterization, where an explicit distinction among up-regulated genes, down-regulated genes, non-regulated genes (with respect to a reference probe) is made; moreover, within each of these groups genes that are differentially expressed between two or more types of experimental conditions may be identified.

Francesca Martella, Marco Alfò

Analysis of Categorical Data

Frontmatter
Assessing Balance of Categorical Covariates and Measuring Local Effects in Observational Studies

This paper presents a data-driven approach that enables one to obtain a global measure of imbalance and to test it in a multivariate way. The main idea is based on the general framework of Partial Dependence Analysis (Daudin, 1981) and thus of Conditional Multiple Correspondence Analysis (Escofier, B. (1988). Analyse des correspondances multiples conditionnelle. La Revue de Modulad) as tools for investigating the dependence relationship between a set of observed categorical covariates (X) and an assignment-to-treatment indicator variable (T), in order to obtain a global imbalance measure (GI) according to their dependence structure. We propose the use of such a measure within a strategy whose aim is to compute treatment effects by subgroups. A toy example is presented to illustrate the performance of this promising approach.

Furio Camillo, Ida D’Attoma
Handling Missing Data in Presence of Categorical Variables: a New Imputation Procedure

In this paper we propose a new method to deal with missingness in categorical data. The new proposal is a forward imputation procedure and is presented in the context of the Nonlinear Principal Component Analysis, used to obtain indicators from a large dataset. However, this procedure can be easily adopted in other contexts, and when other multivariate techniques are used. We discuss the statistical features of our imputation technique in connection with other treatment methods which are popular among Nonlinear Principal Component Analysis users. The performance of our method is then compared to the other methods through a simulation study which involves the application to a real dataset extracted from the Euro-barometer survey. Missing data are created in the original data matrix and then the comparison is performed in terms of how close the Nonlinear Principal Component Analysis outcomes from missing data treatment methods are to the ones obtained from the original data. The new procedure is seen to provide better results than the other methods under the different conditions considered.

Pier Alda Ferrari, Alessandro Barbiero, Giancarlo Manzi
The Brown and Payne Model of Voter Transition Revisited

We attempt a critical assessment of the assumptions, in terms of voting behavior, underlying the Goodman (1953) and the Brown and Payne (1986) models of voting transitions. We argue that the first model is only a slightly simpler version of the second which, however, is fitted in a rather inefficient way. We also provide a critical assessment of the approach inspired by King et al. (1999), which has become popular among sociologists and political scientists. An application to the 2009 European and local elections in the borough of Perugia is discussed.

Antonio Forcina, Giovanni M. Marchetti
On the Nonlinearity of Homogeneous Ordinal Variables

The paper aims at evaluating the nonlinearity existing in homogeneous ordinal data with a one-dimensional latent variable, using Linear and NonLinear Principal Components Analysis. The results of a simulation study with Probabilistic and Monte Carlo gauges show that, when variables are linearly related, a source of nonlinearity can affect each single variable, but the nonlinearity of the global solution is negligible and, therefore, can be left out when constructing a measure of the latent trait underlying homogeneous data.

Maurizio Carpita, Marica Manisera
New Developments in Ordinal Non Symmetrical Correspondence Analysis

For the study of association in two- and three-way contingency tables, the literature offers a large number of techniques that can be considered. When there is an asymmetric dependence structure between the variables, the Goodman-Kruskal and Marcotorchino indices (with respect to the Pearson chi-squared statistic) can be used to measure the strength of association in two- and three-way contingency tables, respectively. In recent years, special attention has been paid to the graphical representation of the dependence structure between two or more variables, preserving the information arising from the ordinal structure of the modalities. In this paper, the authors synthesize the main proposals falling within the framework called Ordinal Non Symmetrical Correspondence Analysis for two- and three-way contingency tables.

Biagio Simonetti, Luigi D’Ambra, Pietro Amenta
Correspondence Analysis of Surveys with Multiple Response Questions

Correspondence Analysis (CA) of surveys studies the relationships between several categorical variables defined with respect to a certain population. However, one of the main sources of information is the type of survey in which it is usual to find multiple response questions and/or conditioned questions that do not need to be answered by the whole population. In these cases, the data coded as 0 (category of no chosen response) and 1 (category of chosen response) can be expressed by means of an incomplete disjunctive table (IDT). The direct application of standard CA to this type of table could lead to inappropriate results. We therefore propose a new methodology for the analysis of incomplete disjunctive tables.

Amaya Zárraga, Beatriz Goitisolo

Multivariate Analysis

Frontmatter
Control Sample, Association and Causality

We introduce the control sample in survey sampling as a tool for measuring the association between a possible cause X and a particular effect Y. The elements of the target population are divided in two groups according to whether X is present or not; the absence of X identifies the control group. A random sample is selected from each group with the aim of measuring the association between X and Y. We propose an unbiased estimator of the associational risk difference between groups and then generalize our approach to the problem of estimating the causal risk difference.

Riccardo Borgoni, Donata Marasini, Piero Quatto
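
A hedged sketch of the associational estimand and its natural sample analogue (written under simple random sampling within each group; the paper's estimator accounts for the actual survey design):

\[
RD = P(Y = 1 \mid X = 1) - P(Y = 1 \mid X = 0), \qquad \widehat{RD} = \frac{1}{n_1}\sum_{i \in s_1} y_i - \frac{1}{n_0}\sum_{i \in s_0} y_i,
\]

where \(s_1\) and \(s_0\) are the samples drawn from the exposed group and from the control group.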
A Semantic Based Dirichlet Compound Multinomial Model

This contribution deals with the methodological study of a generative approach for the analysis of textual data. Instead of creating heuristic rules for the representation of documents and word counts, we employ a distribution able to model words along text considering different topics. In this regard, following Minka's proposal, we implement a Dirichlet Compound Multinomial distribution, that is, a mixture of random variables over words and topics. Moving from this approach we propose an extension called sbDCM that takes into account the different latent topics that compound the document. The number of topics to be inserted can be known or unknown in advance, depending on the application context. Without loss of generality, we present the case where the number and characteristics of topics are properly evaluated on the basis of available data.

Paola Cerchiello, Elvio Concetto Bonafede
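
For reference, the Dirichlet Compound Multinomial (as in Minka's formulation) integrates a multinomial over a Dirichlet prior on word probabilities; for a document of length n with word counts \(x_w\),

\[
p(\mathbf{x} \mid \boldsymbol{\alpha}) = \frac{n!}{\prod_w x_w!} \; \frac{\Gamma\!\left(\sum_w \alpha_w\right)}{\Gamma\!\left(n + \sum_w \alpha_w\right)} \prod_w \frac{\Gamma(x_w + \alpha_w)}{\Gamma(\alpha_w)},
\]

and the sbDCM extension described in the abstract structures \(\boldsymbol{\alpha}\) through latent topics.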
Distance-Based Approach in Multivariate Association

We show how to relate two data sets, where the observations are taken on the same individuals. We study some measures of multivariate association based only on distances between individuals. A permutation test is proposed to decide whether the association is significant. With these measures we can handle very general data.

Carles M. Cuadras
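
A small illustration of the distance-based, permutation-tested flavour of association the abstract describes, using a Mantel-type statistic as a stand-in (not Cuadras's specific measures):

import numpy as np
from scipy.spatial.distance import pdist, squareform

def mantel_test(X, Y, n_perm=999, seed=0):
    # correlate the two inter-individual distance matrices, then build the
    # permutation null by relabelling the individuals in one of them
    rng = np.random.default_rng(seed)
    DX, DY = squareform(pdist(X)), squareform(pdist(Y))
    iu = np.triu_indices_from(DX, k=1)
    r_obs = np.corrcoef(DX[iu], DY[iu])[0, 1]
    n, hits = DX.shape[0], 0
    for _ in range(n_perm):
        p = rng.permutation(n)
        hits += np.corrcoef(DX[iu], DY[p][:, p][iu])[0, 1] >= r_obs
    return r_obs, (hits + 1) / (n_perm + 1)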
New Weighted Similarity Indexes for Market Segmentation Using Categorical Variables

In this paper we introduce new similarity indexes for binary and polytomous variables, employing the concept of "information content". In contrast to traditionally used similarity measures, we suggest considering the frequency of the categories of each attribute in the sample. This feature is useful when dealing with rare categories, since it makes sense to evaluate the pairwise presence of a rare category differently from the pairwise presence of a widespread one. We also propose a weighted index for dependent categorical variables. The suitability of the proposed measures from a marketing research perspective is shown using two real data sets.

Isabella Morlini, Sergio Zani
Causal Inference with Multivariate Outcomes: a Simulation Study

Within the framework of the Rubin Causal Model, Principal Stratification is used to address post-treatment complications in randomized experiments, such as noncompliance, unintended missing outcomes, and truncation by death of the outcomes. We focus on a likelihood approach, exploiting the properties of multivariate finite mixture models in order to relax some of the usual identifying assumptions. These include monotonicity and exclusion restrictions hypotheses. A simulation study is conducted to show that the simultaneous modeling of more than one outcome may improve model identification and efficiency.

Paolo Frumento, Fabrizia Mealli, Barbara Pacini
Using Multilevel Models to Analyse the Context of Electoral Data

Multilevel models are used to analyse contextual effects in hierarchical structures in order to explore the relationships among nested units. This study aims to observe the links among territorial micro-units nested in higher levels. We examine electoral data in two stages, defined for first-level units inside nested structures. In these, we use economic, demographic and social variables in order to characterize the context and explore its effects upon the electoral profile of territorial units.

Rosario D’Agata, Venera Tomaselli
A Geometric Approach to Subset Selection and Sparse Sufficient Dimension Reduction

Sufficient dimension reduction methods allow one to estimate lower-dimensional subspaces while retaining most of the information about the regression of a response variable on a set of predictors. However, it may happen that only a subset of the predictors is needed. We propose a geometric approach to subset selection by imposing sparsity constraints on some coefficients. The proposed method can be applied to most existing dimension reduction methods, such as sliced inverse regression and sliced average variance estimation, and may help to improve estimation accuracy and facilitate interpretation. Simulation studies are presented to show the effectiveness of the proposed method applied to two popular dimension reduction methods, namely SIR and SAVE, and a comparison is made with the LASSO and stepwise OLS regression.

Luca Scrucca
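
As background for one of the two methods mentioned, a compact sliced inverse regression (SIR) sketch (the standard algorithm, independent of the paper's sparsity constraints):

import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=2):
    # whiten X, slice the observations by y, average the whitened rows per
    # slice, and eigen-decompose the between-slice covariance of the means
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S = np.cov(Xc, rowvar=False)
    L = np.linalg.cholesky(np.linalg.inv(S))   # L L' = S^{-1}, so Xc @ L is whitened
    Z = Xc @ L
    M = np.zeros((p, p))
    for idx in np.array_split(np.argsort(y), n_slices):
        m = Z[idx].mean(axis=0)
        M += len(idx) / n * np.outer(m, m)
    _, vecs = np.linalg.eigh(M)                # eigenvalues in ascending order
    return L @ vecs[:, ::-1][:, :n_dirs]       # map directions back to X scale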
Local Statistical Models for Variables Selection

The objective of this paper is to find the most frequent itemsets in a database made up of categorical explanatory variables and a continuous response variable. To achieve this aim we propose to extend local data mining techniques based on association rules. We assess the performance of our model by developing appropriate model indicators derived from classical concentration measures.

Silvia Figini
Backmatter
Metadata
Titel
New Perspectives in Statistical Modeling and Data Analysis
Edited by
Salvatore Ingrassia
Roberto Rocci
Maurizio Vichi
Copyright year
2011
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-11363-5
Print ISBN
978-3-642-11362-8
DOI
https://doi.org/10.1007/978-3-642-11363-5
