
2013 | Book

Statistical Models for Data Analysis

Edited by: Paolo Giudici, Salvatore Ingrassia, Maurizio Vichi

Publisher: Springer International Publishing

Book Series: Studies in Classification, Data Analysis, and Knowledge Organization


About this Book

The papers in this book cover issues related to the development of novel statistical models for the analysis of data. They offer solutions for relevant problems in statistical data analysis and contain the explicit derivation of the proposed models as well as their implementation. The book assembles the selected and refereed proceedings of the biennial conference of the Italian Classification and Data Analysis Group (CLADAG), a section of the Italian Statistical Society.

Table of Contents

Frontmatter
Ordering Curves by Data Depth

Application of depth methods to functional data provides new tools of analysis, in particular an ordering of curves from the center outwards. Two specific depth definitions are band depth and half-region depth (López-Pintado & Romo (2009). Journal of the American Statistical Association, 104, 718–734; López-Pintado & Romo (2011). Computational Statistics & Data Analysis, 55, 1679–1695). Another research area is local depth (Agostinelli & Romanazzi (2011). Journal of Statistical Planning and Inference, 141, 817–830), aimed at identifying multiple centers and dense subsets of the space. In this work we suggest local versions of both band and half-region depth and illustrate an application with real data.
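Band depth with J = 2 admits a direct, if naive, implementation: a curve's depth is the fraction of pairs of sample curves whose pointwise envelope fully contains it. A minimal sketch (the local variants proposed in the paper are not reproduced here; the function name is ours):

```python
import numpy as np
from itertools import combinations

def band_depth(curves):
    """Band depth (J = 2): for each curve, the fraction of pairs of
    sample curves whose pointwise envelope fully contains it.
    `curves` is an (n_curves, n_points) array."""
    n = curves.shape[0]
    pairs = list(combinations(range(n), 2))
    depths = np.zeros(n)
    for k in range(n):
        inside = 0
        for i, j in pairs:
            lo = np.minimum(curves[i], curves[j])  # lower envelope
            hi = np.maximum(curves[i], curves[j])  # upper envelope
            if np.all((lo <= curves[k]) & (curves[k] <= hi)):
                inside += 1
        depths[k] = inside / len(pairs)
    return depths
```

Curves deeper in the sample receive higher values, yielding the center-outward ordering the abstract refers to.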

Claudio Agostinelli, Mario Romanazzi
Can the Students’ Career be Helpful in Predicting an Increase in Universities Income?

Students’ academic failure and delays in obtaining the final degree are a significant issue for Italian universities and their stakeholders. Based on indicators proposed by the Italian Ministry of University, universities are awarded a financial incentive if they reduce student attrition and failure. In this paper we analyze students’ career performance using (1) aggregate data and (2) individual data. The first analysis compares the performance of Italian universities using the measures and indicators proposed by the Ministry. The second analyzes students’ careers through an indicator based on the credits earned by each student over seven academic years. The primary goal of this paper is to highlight elements that policy makers can use to improve the careers of university students.

Massimo Attanasio, Giovanni Boscaino, Vincenza Capursi, Antonella Plaia
Model-Based Classification Via Patterned Covariance Analysis

This work deals with the classification problem in the case that groups are known and both labeled and unlabeled data are available. The classification rule is derived using Gaussian mixtures whose covariance matrices are specified according to a multiple testing procedure which assesses a pattern among heteroscedasticity, homometroscedasticity, homotroposcedasticity, and homoscedasticity. The mixture models are then fitted using all available data (labeled and unlabeled), adopting the EM and the CEM algorithms. The performance of the proposed procedure is evaluated by a simulation study.

Luca Bagnato
Data Stream Summarization by Histograms Clustering

In this paper we introduce a new strategy for summarizing a fast-changing data stream. Evolving data streams are generated by non-stationary processes, which require adapting the knowledge discovery process to newly emerging concepts. To deal with this challenge we propose a clustering algorithm in which each cluster is summarized by a histogram and data are allocated to clusters through a Wasserstein-derived distance. Histograms are a well-known graphical tool for representing the frequency distribution of data and are widely used in data stream mining; however, unlike existing methods, we discover a set of histograms where each one represents a main concept in the data. In order to evaluate the performance of the method, we have performed extensive tests on simulated data.
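The Wasserstein-derived distance at the heart of the allocation step can be illustrated for one-dimensional histograms, where the L2 Wasserstein distance reduces to an integral of squared differences between quantile functions. A rough numerical sketch, assuming common bin edges (the function name and discretization are ours, not the paper's):

```python
import numpy as np

def wasserstein2_histograms(bin_edges, w1, w2, n_grid=200):
    """Squared L2 Wasserstein distance between two histograms sharing
    the same bin edges, via their (piecewise-linear) quantile functions."""
    # cumulative distribution values at the bin edges
    c1 = np.concatenate([[0.0], np.cumsum(w1) / np.sum(w1)])
    c2 = np.concatenate([[0.0], np.cumsum(w2) / np.sum(w2)])
    # midpoint grid of probabilities in (0, 1)
    p = (np.arange(n_grid) + 0.5) / n_grid
    # quantile functions by inverting the piecewise-linear CDFs
    q1 = np.interp(p, c1, bin_edges)
    q2 = np.interp(p, c2, bin_edges)
    return float(np.mean((q1 - q2) ** 2))
```

For two histograms that are pure shifts of each other, the distance equals the squared shift, which is what makes it well suited to comparing distribution-valued cluster prototypes.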

Antonio Balzanella, Lidia Rivoli, Rosanna Verde
Web Panel Representativeness

Web panels are becoming more and more popular for data collection. They present specific problems and advantages with respect to the usual modes of collection. This paper analyzes possible re-weighting adjustments for non-response in panel data. Different weighting schemes are evaluated by means of a simulation study based on real data.

Annamaria Bianchi, Silvia Biffignandi
Nonparametric Multivariate Inference Via Permutation Tests for CUB Models

A new approach for modelling discrete choices in rating or ranking problems is represented by a class of mixture models with covariates (Combination of Uniform and shifted Binomial distributions, CUB models), proposed by Piccolo (2003, Quaderni di Statistica, 5, 85–104), D’Elia & Piccolo (2005, Computational Statistics & Data Analysis, 49, 917–934), Piccolo (2006, Quaderni di Statistica, 8, 33–78) and Iannario (2010, Metron, LXVIII, 87–94). In the case of a univariate response, a permutation solution to test for covariate effects has been discussed in Bonnini et al. (2012, Communications in Statistics: Theory and Methods), together with parametric inference. We propose an extension of this nonparametric test to deal with the multivariate case. The good performance of the method is shown through a simulation study, and the procedure is applied to real data regarding the evaluation of the Ski School of Sesto Pusteria (Italy).

Stefano Bonnini, Luigi Salmaso, Francesca Solmi
Asymmetric Multidimensional Scaling Models for Seriation

Singular value decomposition (SVD) of skew-symmetric matrices was proposed to represent asymmetry of proximity data. Some authors considered the plane (bimension or hedron) determined by the first two singular vectors to detect orderings (seriation) for preference or dominance data. Following these approaches, in this paper some procedures of asymmetric multidimensional scaling useful for seriation are proposed, focusing on a model that is a particular case of the rank-2 SVD model. An application to Thurstone’s paired comparison data on the relative seriousness of crime is also presented.

Giuseppe Bove
An Approach to Ranking the Hedge Fund Industry

Due to the complexity and heterogeneity of hedge fund strategies, the evaluation of their performance and risk is a challenging task. Starting from the standard mutual fund industry, the literature has evolved in the direction of refining traditional measures (e.g. the Sharpe Ratio) or introducing new ones. This paper develops an approach, based on Principal Component Analysis, to uncover the relevant information for performance measurement and combine it into a unique rank.
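The abstract does not spell out the exact construction, but a generic PCA-based composite ranking, with components weighted by explained variance, might be sketched as follows (all names are ours; note that the sign of each principal component is arbitrary, so in practice components are first sign-aligned with the underlying indicators):

```python
import numpy as np

def pca_composite_rank(X):
    """Rank items (rows of X) by a composite score: PCA on the
    standardized indicators, with component scores combined using
    explained-variance weights."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize indicators
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    scores = Z @ Vt.T                          # principal component scores
    expl = s ** 2 / np.sum(s ** 2)             # explained-variance weights
    composite = scores @ expl                  # one score per item
    order = np.argsort(-composite)             # best-scoring item first
    return composite, order
```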

Riccardo Bramante
Correction of Incoherences in Statistical Matching

Statistical matching is studied within a coherent setting, focusing on the problem of removing inconsistencies. When structural zeros among the involved variables are present, incoherences in the parameter estimates can arise. The aim is to compare different methods, based on specific pseudo-distances, for removing such incoherences. The comparison is given through an illustrative example of 100 simulations from a known population with three categorical variables, which brings to light peculiarities of the statistical matching problem.

Andrea Capotorti, Barbara Vantaggi
The Analysis of Network Additionality in the Context of Territorial Innovation Policy: The Case of Italian Technological Districts

Evidence from economic literature suggests that innovative activities based on extensive interactions between industry, universities and local government can yield high levels of economic performance. In many countries, therefore, steps have been taken at an institutional level to set up innovation networks and, in particular, regional technological districts. Our paper deals with Italian Technological Districts: we aim to analyse the network additionality for territorial innovation determined by district policy. The analysis is based on a priori structural regional characteristics and on Social Network Analysis techniques.

Carlo Capuano, Domenico De Stefano, Alfredo Del Monte, Maria Rosaria D’Esposito, Maria Prosperina Vitale
Clustering and Registration of Multidimensional Functional Data

In order to find similarity between multidimensional curves, we consider the application of a procedure that provides a simultaneous assignment to clusters and alignment of such functions. In particular, we look for clusters of multivariate seismic waveforms based on an EM-type procedure and functional data analysis tools.

M. Chiodi, G. Adelfio, A. D’Alessandro, D. Luzio
Classifying Tourism Destinations: An Application of Network Analysis

Tourism is basically a spatial phenomenon, one which implies moving consumption within space. Starting from the assumption that destinations are nodes of a network, we are able to reconstruct a spatial grid in which each locality shows different degrees and types of centrality. The analysis, focusing on the spatial dimension, shows clusters of locations. By shifting interest from single locations to destination networks, the study points out the structural features of each network. Employing traditional network analysis measures, we classify destinations considering the routes of a sample of self-organized tourists who visited more than one destination in Sicily.

Rosario D’Agata, Venera Tomaselli
On Two Classes of Weighted Rank Correlation Measures Deriving from the Spearman’s ρ

Weighted Rank Correlation indices are useful for measuring the agreement of two rankings when the top ranks are considered more important than the lower ones. This paper investigates, from a descriptive perspective, the behaviour of (i) five existing indices that introduce suitable weights in the simplified formula of Spearman’s ρ and (ii) an additional five indices we derive using the same weights in Pearson’s product-moment correlation index between ranks. For their evaluation, we consider that a good Weighted Rank Correlation index should (1) differ from ρ, if computed on the same pair of rankings, and (2) assume a broad variety of values in the range $$[-1,+1]$$, in order to better discriminate amongst different reorderings of the ranks. Results suggest that linear weights should be avoided and show that the indices in (ii) do not coincide with ρ and are more sensitive.
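Scheme (ii), plugging rank-dependent weights into a weighted Pearson correlation between ranks, can be sketched generically. The specific weight functions studied in the paper are not reproduced here; the linear weight below is only an example (and, per the paper's findings, one to avoid):

```python
import numpy as np

def weighted_rank_corr(r1, r2, weight):
    """Weighted Pearson correlation between two rankings.
    `weight(r)` maps a rank to its importance (top ranks heavier);
    each observation's weight combines the weights of its two ranks."""
    r1, r2 = np.asarray(r1, float), np.asarray(r2, float)
    w = weight(r1) * weight(r2)
    w = w / w.sum()
    m1, m2 = np.sum(w * r1), np.sum(w * r2)
    cov = np.sum(w * (r1 - m1) * (r2 - m2))
    s1 = np.sqrt(np.sum(w * (r1 - m1) ** 2))
    s2 = np.sqrt(np.sum(w * (r2 - m2) ** 2))
    return cov / (s1 * s2)

# example: linear weights giving rank 1 the largest importance
n = 8
lin = lambda r: n - r + 1
```

Identical rankings give +1 and fully reversed rankings give -1 for any positive weight function, since the reversal is an affine transformation of the ranks; the weights matter only for partial disagreements.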

Livia Dancelli, Marica Manisera, Marika Vezzoli
Beanplot Data Analysis in a Temporal Framework

We propose in this work a new approach for modelling, forecasting and clustering beanplot financial time series. Beanplot time series, like histogram time series or interval time series, can be very useful for modelling the intra-period variability of a series. These new types of time series are particularly useful with high-frequency financial data, which are often collected as irregularly spaced observations.

Carlo Drago, Carlo Lauro, Germana Scepi
Supervised Classification of Facial Expressions

Over the last decade, the statistical analysis of facial expressions has become an active research topic that finds potential applications in many areas. As facial expression plays a remarkable role in social interaction, the development of a system that accomplishes the task of automatic classification is challenging. In this work, we thus consider the problem of classifying facial expressions through shape variables represented by log-transformed Euclidean distances computed among a set of anatomical landmarks.

S. Fontanella, C. Fusilli, L. Ippoliti
Grouping Around Different Dimensional Affine Subspaces

Grouping around affine subspaces and other types of manifolds is receiving a lot of attention in the literature due to its interest in several fields of application. Allowing for different dimensions is needed in many applications. This work extends the TCLUST methodology to deal with the problem of grouping data around different dimensional linear subspaces in the presence of noise. Two ways of considering error terms in the orthogonal of the linear subspaces are considered.

L. A. García-Escudero, A. Gordaliza, C. Matrán, A. Mayo-Iscar
Hospital Clustering in the Treatment of Acute Myocardial Infarction Patients Via a Bayesian Semiparametric Approach

In this work, we develop Bayes rules for several families of loss functions for hospital report cards under a Bayesian semiparametric hierarchical model. Moreover, we present some robustness analysis with respect to the choice of the loss function, focusing on the number of hospitals our procedure identifies as “unacceptably performing”. The analysis is carried out on a case study dataset arising from the MOMI² (MOnth MOnitoring Myocardial Infarction in MIlan) survey on patients admitted with ST-Elevation Myocardial Infarction to the hospitals of the Milan Cardiological Network. The major aim of this work is the ranking of the health-care providers’ performances, together with the assessment of the role of patients’ and providers’ characteristics on survival outcome.

Alessandra Guglielmi, Francesca Ieva, Anna Maria Paganoni, Fabrizio Ruggeri
A New Fuzzy Method to Classify Professional Profiles from Job Announcements

In recent years, universities have created placement offices to facilitate the employability of graduates. University placement offices select, for companies offering a job and/or training position, a large number of graduates based only on degree and grades.

We adapt the c-means algorithm to discover professional profiles from job announcements. We analyse 1,650 job announcements collected in DB SOUL from January 1st, 2010 to April 5th, 2011.

Domenica Fioredistella Iezzi, Mario Mastrangelo, Scipione Sarlo
A Metric Based Approach for the Least Square Regression of Multivariate Modal Symbolic Data

In this paper we propose a linear regression model for multivariate modal symbolic data. The observed variables are probabilistic modal variables according to the definition given by Bock and Diday (2000, Analysis of symbolic data: Exploratory methods for extracting statistical information from complex data, Springer), i.e. variables whose realizations are frequency or probability distributions. The parameters are estimated through a Least Squares method based on a suitable squared distance between the predicted and the observed modal symbolic data: the squared ℓ2 Wasserstein distance. Measures of goodness of fit are also presented, and an application on real data corroborates the proposed method.

Antonio Irpino, Rosanna Verde
A Gaussian–Von Mises Hidden Markov Model for Clustering Multivariate Linear-Circular Data

A multivariate hidden Markov model is proposed for clustering mixed linear and circular time-series data with missing values. The model integrates von Mises and normal densities to describe the distribution that the data take under different latent regimes, with parameters that depend on the evolution of an unobserved Markov chain. Estimation is facilitated by an EM algorithm that treats the states of the latent chain and missing values as different sources of incomplete information. The model is exploited to identify sea regimes from multivariate marine data.

Francesco Lagona, Marco Picone
A Comparison of Objective Bayes Factors for Variable Selection in Linear Regression Models

This paper deals with the variable selection problem in linear regression models and its solution by means of Bayes factors. If substantive prior information is lacking or impractical to elicit, which is often the case in applications, objective Bayes factors come into play. These can be obtained by means of different methods, featuring Zellner–Siow priors, fractional Bayes factors, and intrinsic priors. The paper reviews such methods and investigates their finite-sample ability to identify the simplest model supported by the data, introducing the notion of full discrimination power. The results obtained are relevant to the structural learning of Gaussian DAG models, where large spaces of sets of recursive linear regressions are to be explored.

Luca La Rocca
Evolutionary Customer Evaluation: A Dynamic Approach to a Banking Case

Today, a bank’s most important asset is its customers; the main targets for management are therefore to know their needs, to anticipate their concerns, and to distinguish itself in their eyes. Aware that a satisfied customer is a highly profitable asset, banks strive to provide a satisfactory service by diversifying their offerings. This paper aims to analyze the evolution of customers’ evaluations of the main attributes of banking services, in order to catch differences among clusters and time lags, through a dynamic factorial model. We propose a new system of weights for assessing the dynamic factor reduction, which is not optimal for all the instances considered across different waves. An empirical study is illustrated, based on customer satisfaction data from a national bank with a network spread throughout Italy, which wanted to analyze its reduced competitiveness in retail services, probably due to low customer satisfaction.

Caterina Liberati, Paolo Mariani
Measuring the Success Factors of a Website: Statistical Methods and an Application to a “Web District”

In this paper we propose a statistical methodology to address the issue of measuring the success factors of an e-commerce application, and in particular of a regional e-marketplace, using a measurement framework based on customer satisfaction. In the first part of the paper, two different ranking methods are compared in order to identify the more appropriate tool for analysing the opinions expressed by visitors: a novel nonparametric index, named the Stochastic Dominance Index (SDI), built on the basis of the cumulative distribution function alone, and a qualitative ranking based on the median and on the Leti Index. The SDI proved more convenient for comparison purposes, and according to this measurement tool the highest satisfaction was expressed for the quality of the products. A logistic regression was then performed to understand the impact of the different satisfaction factors on overall satisfaction. The empirical evidence confirms the literature on the importance of the different success factors, showing that website user-friendliness and information about purchase mechanisms have the greatest impact on overall satisfaction.

Eleonora Lorenzini, Paola Cerchiello
Component Analysis for Structural Equation Models with Concomitant Indicators

A new approach to structural equation modelling based on so-called Extended Redundancy Analysis has recently been proposed in the literature, enhanced with the added characteristic of generalizing Redundancy Analysis and Reduced-Rank Regression models to more than two blocks. However, in the presence of direct effects linking exogenous and endogenous variables, the latent composite scores are estimated by ignoring the specified direct effects. In this paper, we extend Extended Redundancy Analysis, permitting us to specify and fit a variety of relationships among latent composites and endogenous variables. In particular, covariates are allowed to affect endogenous indicators indirectly through the latent composites and/or directly.

Pietro Giorgio Lovaglio, Giorgio Vittadini
Assessing Stability in NonLinear PCA with Hierarchical Data

Composite indicators of latent variables can be constructed by NonLinear Principal Components Analysis when data are collected by multiple-item scales. The aim of this paper is to establish the stability of the contribution made by each item to the composite indicator, by means of a resampling-based procedure able to take account of the hierarchical structure that often exists in the data, that is, when individuals are nested in groups. The procedure modifies the standard nonparametric bootstrap technique and was applied to real data on job satisfaction from the most extensive survey on Italian social cooperatives.

Marica Manisera
Using the Variation Coefficient for Adaptive Discrete Beta Kernel Graduation

Various approaches have been proposed in the literature for the kernel graduation of mortality rates. Among them, this paper considers, as a starting point, the fixed-bandwidth discrete beta kernel estimator, a recent proposal conceived to intrinsically reduce boundary bias, in which age is pragmatically treated as a discrete variable. An adaptive variant of this estimator also exists, which allows the bandwidth to vary with age according to the reliability of the data as expressed only by the amount of exposure. This paper presents a further adaptive version, obtained by measuring reliability via the reciprocal of the variation coefficient, which is a function of both the amount of exposure and the observed mortality rates. A simulation study is carried out to evaluate the gain in performance of the new estimator with respect to its predecessors.

Angelo Mazza, Antonio Punzo
On Clustering and Classification Via Mixtures of Multivariate t-Distributions

The use of mixture models for clustering and classification has received renewed attention within the literature since the mid-1990s. The multivariate Gaussian distribution has been at the heart of this body of work, but approaches that utilize the multivariate t-distribution have burgeoned into viable and effective alternatives. In this paper, recent work on classification and clustering using mixtures of multivariate t-distributions is reviewed and discussed, along with related issues. The paper concludes with a summary and suggestions for future work.

Paul D. McNicholas
Simulation Experiments for Similarity Indexes Between Two Hierarchical Clusterings

In this paper we report the results of a series of simulation experiments aimed at comparing the behavior of different similarity indexes proposed in the literature for comparing two hierarchical clusterings on the basis of whole dendrograms. Simulations are carried out over different experimental conditions.
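One simple member of this family of comparisons — the Rand index between k-cluster cuts of two dendrograms built on the same data — can be sketched as follows. The indices actually compared in the paper operate on whole dendrograms and may differ; function names here are ours:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def rand_index(a, b):
    """Rand index between two flat partitions: the fraction of object
    pairs on whose co-membership the two partitions agree."""
    a, b = np.asarray(a), np.asarray(b)
    same_a = np.equal.outer(a, a)
    same_b = np.equal.outer(b, b)
    iu = np.triu_indices(len(a), 1)          # each unordered pair once
    return float(np.mean(same_a[iu] == same_b[iu]))

def dendrogram_similarity(X, methods=("single", "complete"), k=3):
    """Compare two hierarchical clusterings of the same data by the
    Rand index of their k-cluster cuts (one slice of each dendrogram)."""
    labels = [fcluster(linkage(X, method=m), t=k, criterion="maxclust")
              for m in methods]
    return rand_index(labels[0], labels[1])
```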

Isabella Morlini
Performance Measurement of Italian Provinces in the Presence of Environmental Goals

The spread of the sustainable development concept suggests a vision of an ecologically balanced society, in which it is necessary to preserve environmental resources and to integrate economics and the environment in decision-making. Consequently, there has been increasing recognition in developed nations of the importance of good environmental performance, in terms of reducing the environmental disamenities generated as outputs of production processes and increasing environmental benefits. In this context, the aim of the present work is to evaluate the environmental efficiency of Italian provinces by using a non-parametric approach to efficiency measurement, the Data Envelopment Analysis (DEA) technique. To this purpose, we propose a two-step methodology that improves the discriminatory power of DEA in the presence of heterogeneity in the sample. In the first phase, provinces are classified into groups with similar characteristics. Then, efficiency measures are computed for each cluster.

Eugenia Nissi, Agnese Rapposelli
On the Simultaneous Analysis of Clinical and Omics Data: A Comparison of Globalboosttest and Pre-validation Techniques

In medical research, biostatisticians are often confronted with supervised learning problems involving different kinds of predictors including, e.g., classical clinical predictors and high-dimensional “omics” data. The question of the added predictive value of high-dimensional omics data, given that classical predictors are already available, has long been under-considered in the biostatistics and bioinformatics literature. This issue is characterized by a lack of guidelines and a huge number of conceivable approaches. Two existing methods addressing this important issue are systematically compared in the present paper. The globalboosttest procedure (Boulesteix & Hothorn (2010). BMC Bioinformatics, 11, 78) examines the additional predictive value of high-dimensional molecular data via boosting regression including a clinical offset, while the pre-validation method sums up omics data in the form of a new cross-validated predictor that is finally assessed in a standard generalized linear model (Tibshirani & Efron (2002). Statistical Applications in Genetics and Molecular Biology, 1, 1). Globalboosttest and pre-validation are introduced and discussed, then assessed in a simulation study with survival data, and finally applied to breast cancer microarray data for illustration. R code to reproduce our results and figures is available from http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/020_professuren/boulesteix/gbtpv/index.html.

Margret-Ruth Oelker, Anne-Laure Boulesteix
External Analysis of Asymmetric Multidimensional Scaling Based on Singular Value Decomposition

An asymmetric similarity matrix among objects, for example a brand switching matrix of consumers, can be analyzed by asymmetric multidimensional scaling. Suppose that n brands exist and that m new brands are introduced. While brand switching from existing to new brands can be observed, neither brand switching from new to existing brands nor that among new brands can be observed soon after the introduction of the new brands. The present study analyzed the n × n similarity matrix by asymmetric multidimensional scaling based on singular value decomposition. The analysis gives outward and inward tendencies of the existing brands. Using the obtained outward tendency of the n existing brands, the inward tendency of the m new brands is derived. An application to brand switching data among margarine brands is presented.

Akinori Okada, Hiroyuki Tsurumi
The Credit Accumulation Process to Assess the Performances of Degree Programs: An Adjusted Indicator Based on the Result of Entrance Tests

Within the framework of performance indicators, this paper aims to assess the bias produced by micro-level Potential Confounding Factors (PCFs) by comparing the results observed using adjusted and unadjusted outcome measures. Results of university entrance tests, together with previous school experiences, are used as proxies of students’ competencies at the beginning of their academic career. The regularity of the schooling process is monitored using as outcome variables the students’ status (dropped out, still enrolled) and the number of credits gathered after one academic year. Adjusted indicators of the regularity of students’ careers are obtained using the results of zero-augmented models to investigate the relationships between the outcome measures and the PCFs that are not directly associated with the learning process under evaluation.

Mariano Porcu, Isabella Sulis
The Combined Median Rank-Based Gini Index for Customer Satisfaction Analysis

Quality assessment is a relevant topic in several real contexts. Currently, firms and service suppliers pay particular attention to customer satisfaction surveys in order to investigate “perceived quality”. Typically, quality questionnaires are a useful tool for obtaining information about the degree of customer satisfaction, and their use implies that the collected data are mostly ordinal in nature. This paper provides a contribution to dealing with ordinal data. Here, we propose a novel Gini measure built on ranks. By combining it with the median index, one can depict the degree of customer satisfaction by exploiting the information coming from the responses to the quality questionnaire items.

Emanuela Raffinetti
A Two-Phase Clustering Based Strategy for Outliers Detection in Georeferenced Curves

A two-phase clustering method for the detection of geostatistical functional outliers is proposed. It first clusters data by a modified version of a Dynamic Clustering algorithm for geostatistical functional data, and then detects groups of outliers according to a cut-off value defined by a measure of spatial deviation in a minimum spanning tree. The performance of the proposed procedure is analyzed by several simulation studies.

Elvira Romano, Antonio Balzanella
High-Dimensional Bayesian Classifiers Using Non-Local Priors

Common goals in classification problems are (i) obtaining predictions and (ii) identifying subsets of highly predictive variables. Bayesian classifiers quantify the uncertainty in all steps of the prediction. However, common Bayesian procedures can be slow in excluding features with no predictive power (Johnson & Rossell, 2010), and in certain high-dimensional setups the posterior probability assigned to the correct set of predictors converges to 0 (Johnson & Rossell, 2012). We study the use of non-local priors (NLPs), which overcome the above-mentioned limitations. We introduce a new family of NLPs and derive efficient MCMC schemes.

David Rossell, Donatello Telesca, Valen E. Johnson
A Two Layers Incremental Discretization Based on Order Statistics

Large amounts of data are produced today: network logs, web data, social network data… The volume of data and its arrival speed make it impossible to store everything. Such data are called streaming data. The specificities of a stream are that (i) data are visible just once and (ii) they are ordered by arrival time. As these data cannot be kept in memory and read afterwards, the usual data mining techniques cannot be applied. Building a classifier in this context therefore requires either doing it incrementally and/or keeping a subset of the information seen and then building the classifier. This paper focuses on the second option and proposes a two-layer approach based on order statistics. The first layer uses the Greenwald and Khanna quantile summary and the second layer a supervised method such as MODL.
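The Greenwald–Khanna summary maintains provable rank-error guarantees through careful per-tuple bookkeeping. The toy class below is only a stand-in illustrating the first-layer idea (bounded memory, periodic compression of the buffer to quantiles); it does not reproduce GK's error bounds, and all names are ours, not the paper's:

```python
import numpy as np

class NaiveQuantileSummary:
    """Bounded-memory summary of a stream's order statistics.

    A toy stand-in for the Greenwald-Khanna summary: it buffers
    incoming values and periodically compresses the buffer down to
    evenly spaced empirical quantiles. Unlike GK, it keeps no rank
    error bounds, so its answers are only approximate."""

    def __init__(self, size=100):
        self.size = size
        self.buf = []

    def insert(self, x):
        self.buf.append(x)
        if len(self.buf) > 4 * self.size:
            self._compress()

    def _compress(self):
        # collapse the buffer to `size` evenly spaced quantiles
        probs = np.linspace(0.0, 1.0, self.size)
        self.buf = list(np.quantile(self.buf, probs))

    def query(self, q):
        """Approximate q-th quantile of everything seen so far."""
        return float(np.quantile(self.buf, q))
```

A second layer would then fit a supervised discretization or classifier on the summarized values rather than on the raw stream.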

Christophe Salperwyck, Vincent Lemaire
Interpreting Error Measurement: A Case Study Based on Rasch Tree Approach

This paper describes the appropriateness of Differential Item Functioning (DIF) analysis performed via mixed-effects Rasch models. Groups of subjects with homogeneous Rasch item parameters are found automatically by model-based partitioning (the Rasch tree model). The unifying framework offers the advantage of including the terminal nodes of the Rasch tree in the multilevel formulation of Rasch models. In such a way we are able to handle different measurement issues. The approach is illustrated with a cross-national survey on attitudes towards female stereotypes. Evidence of group DIF was detected and presented, as well as the estimates of model parameters.

Annalina Sarra, Lara Fontanella, Tonio Di Battista, Riccardo Di Nisio
Importance Sampling: A Variance Reduction Method for Credit Risk Models

The asymmetric behaviour and fat tails of portfolios of credit-risky corporate assets, such as bonds, have become a very important problem, owing to the impact of both defaults and migration from one rating class to another. This paper discusses the use of different copulas for credit risk management. Usual Monte Carlo (MC) techniques are compared with a variance reduction method, Importance Sampling (IS), in order to reduce the variability of the estimators of the tails of the Profit & Loss distribution of a portfolio of bonds. This speeds up the computation of the economic capital, located in the rare-event quantile of the loss distribution, that must be held in reserve by a lending institution for solvency. An application to a simulated portfolio of bonds concludes the paper.

Gabriella Schoier, Federico Marsich
A MCMC Approach for Learning the Structure of Gaussian Acyclic Directed Mixed Graphs

Graphical models are widely used to encode conditional independence constraints and causal assumptions, the directed acyclic graph (DAG) being one of the most common families of models. However, DAGs are not closed under marginalization: that is, if a distribution is Markov with respect to a DAG, several of its marginals might not be representable with another DAG unless one discards some of the structural independencies. Acyclic directed mixed graphs (ADMGs) generalize DAGs so that closure under marginalization is possible. In a previous work, we showed how to perform Bayesian inference to infer the posterior distribution of the parameters of a given Gaussian ADMG model, where the graph is fixed. In this paper, we extend this procedure to allow for priors over graph structures.

Ricardo Silva
Symbolic Cluster Representations for SVM in Credit Client Classification Tasks

Credit client scoring on medium-sized data sets can be accomplished by means of Support Vector Machines (SVM), a powerful and robust machine learning method. However, real-life credit client data sets are usually huge, containing up to hundreds of thousands of records, with good credit clients vastly outnumbering the defaulting ones. Such data pose severe computational barriers for SVM and other kernel methods, especially if all pairwise data point similarities are requested. Hence, methods which avoid extensive training on the complete data are in high demand. A possible solution is a combined cluster and classification approach. Computationally efficient clustering can compress the information in the large data set in a robust way, especially in conjunction with a symbolic cluster representation. Credit client data clustered with this procedure are then used to estimate classification models.

Ralf Stecking, Klaus B. Schebesch
A Further Proposal to Perform Multiple Imputation on a Bunch of Polytomous Items Based on Latent Class Analysis

This work advances an imputation procedure for categorical scales which relies on the results of Latent Class Analysis and Multiple Imputation Analysis. The procedure allows us to use the information stored in the joint multivariate structure of the data set and to take into account the uncertainty related to the true unobserved values. The accuracy of the results is validated in the Item Response Models framework by assessing the accuracy in the estimation of key parameters in a data set in which observations are simulated Missing at Random. The sensitivity of the multiple imputation method is assessed with respect to the following factors: the number of latent classes set up in the Latent Class Model and the rate of missing observations in each variable. The relative accuracy in estimation is assessed against the Multiple Imputation by Chained Equations missing-data handling method for categorical variables.

Isabella Sulis
A New Distance Function for Prototype Based Clustering Algorithms in High Dimensional Spaces

High-dimensional data analysis poses some interesting and counter-intuitive problems. One of these problems is that some clustering algorithms do not work, or work only very poorly, if the dimensionality of the feature space is high. The reason for this is an effect called distance concentration. In this paper, we show that the effect can be countered for prototype-based clustering algorithms by a suitable alteration of the distance function. We demonstrate the success of this approach by applying it to (but not restricting it to) fuzzy c-means (FCM). A useful side effect is that our method can also be used to estimate the number of clusters in a data set.

Roland Winkler, Frank Klawonn, Rudolf Kruse
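To make distance concentration concrete, here is a small self-contained illustration (not the authors' altered distance function): the relative contrast between the farthest and nearest distance in a random sample shrinks as the dimension grows, which is what degrades distance-based prototype updates.

```python
import numpy as np

rng = np.random.default_rng(1)

def relative_contrast(dim, n=500):
    """(d_max - d_min) / d_min for distances from the origin to n points
    drawn uniformly from [-1, 1]^dim; small values mean all points look
    roughly equally far away (distance concentration)."""
    pts = rng.uniform(-1.0, 1.0, size=(n, dim))
    d = np.linalg.norm(pts, axis=1)
    return (d.max() - d.min()) / d.min()

low = relative_contrast(2)      # large contrast in 2 dimensions
high = relative_contrast(1000)  # contrast collapses in 1000 dimensions
```

In two dimensions nearest and farthest points differ by orders of magnitude; in a thousand dimensions all pairwise distances cluster within a few percent of each other.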
A Simplified Latent Variable Structural Equation Model with Observable Variables Assessed on Ordinal Scales

This contribution is related to a wide empirical research project promoted by the Università Cattolica del Sacro Cuore of Milan (UCSC), aimed at gaining insight into the real work possibilities of its graduates over the last seven years, as well as the appreciation and satisfaction of the firms which offered them a job position. The group of 1,264 firms which have a special connection with UCSC regarding new job appointments was considered, and they were given a questionnaire administered and returned via the web. The analysis of the 203 complete answers was conducted by means of a structural equation model with latent variables.

Angelo Zanella, Giuseppe Boari, Andrea Bonanomi, Gabriele Cantaluppi
Optimal Decision Rules for Constrained Record Linkage: An Evolutionary Approach

Record Linkage (RL) aims at identifying pairs of records coming from different sources and representing the same real-world entity. Probabilistic RL methods assume that the pairwise distances computed in the record-comparison process obey a well-defined statistical model, and exploit the statistical inference machinery to draw conclusions on the unknown Match/Unmatch status of each pair. Once model parameters have been estimated, classical Decision Theory results (e.g. the MAP rule) can generally be used to obtain a probabilistic clustering of the pairs into Matches and Unmatches. Constrained RL tasks (arising whenever one knows in advance that either or both of the data sets to be linked do not contain duplicates) represent a relevant exception. In this paper we propose an Evolutionary Algorithm to find optimal decision rules according to arbitrary objectives (e.g. maximum complete-likelihood) while fulfilling 1:1, 1:N and N:1 matching constraints. We also present some experiments on real-world constrained RL instances, showing the accuracy and efficiency of our approach.

Diego Zardetto, Monica Scannapieco
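The role of the 1:1 constraint can be appreciated on a toy example. The sketch below is not the paper's evolutionary algorithm: it uses hypothetical pairwise match log-likelihoods and a brute-force search over complete assignments (feasible only for tiny instances) to show how the unconstrained MAP rule can violate a 1:1 matching constraint, while the constrained complete-likelihood optimum cannot.

```python
import itertools

# Hypothetical match log-likelihoods for 3 records in file A (rows)
# against 3 records in file B (columns); larger = more likely a match.
logL = [
    [-0.1, -2.0, -3.0],
    [-0.2, -0.3, -2.5],
    [-2.2, -0.4, -0.5],
]

# Unconstrained MAP: each record in A independently picks its best
# partner in B, so two records may grab the same partner.
unconstrained = [max(range(3), key=lambda j, r=row: r[j]) for row in logL]

# 1:1-constrained optimum: search complete assignments (permutations)
# for the one maximising the complete log-likelihood.
best = max(itertools.permutations(range(3)),
           key=lambda p: sum(logL[i][p[i]] for i in range(3)))
```

Here the unconstrained rule assigns both of the first two A-records to the same B-record, while the constrained search returns a proper one-to-one matching; for realistic problem sizes the permutation space is astronomically large, which is what motivates heuristic search such as the authors' evolutionary approach.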
On Matters of Invariance in Latent Variable Models: Reflections on the Concept, and its Relations in Classical and Item Response Theory

An overview is provided of the author’s program of research on measurement invariance. Two questions are addressed. First, when do theoreticians and practitioners talk about invariance, and what is it that we are talking about? Second, is invariance only a property of latent variable models such as IRT, or is there invariance in classical test theory as well? If so, what is it for the observed-score and latent-variable formulations?

Bruno D. Zumbo
Backmatter
Metadata
Title: Statistical Models for Data Analysis
Edited by: Paolo Giudici, Salvatore Ingrassia, Maurizio Vichi
Copyright Year: 2013
Publisher: Springer International Publishing
Electronic ISBN: 978-3-319-00032-9
Print ISBN: 978-3-319-00031-2
DOI: https://doi.org/10.1007/978-3-319-00032-9