
2010 | Book

Proceedings of COMPSTAT'2010

19th International Conference on Computational Statistics, Paris, France, August 22-27, 2010. Keynote, Invited and Contributed Papers

Editors: Yves Lechevallier, Gilbert Saporta

Publisher: Physica-Verlag HD


About this book

Proceedings of the 19th international symposium on computational statistics, held in Paris, August 22-27, 2010. Together with 3 keynote talks, there were 14 invited sessions and more than 100 peer-reviewed contributed communications.

Table of Contents

Frontmatter

Keynote

Frontmatter
Complexity Questions in Non-Uniform Random Variate Generation

In this short note, we recall the main developments in non-uniform random variate generation, and list some of the challenges ahead.

Luc Devroye
Computational Statistics Solutions for Molecular Biomedical Research: A Challenge and Chance for Both

Computational statistics, supported by computing power and the availability of efficient methodology, techniques and algorithms on the statistical side, and by the perception of the need for valid data analysis and data interpretation on the biomedical side, has in a very short time invaded many cutting-edge research areas of molecular biomedicine. Two salient cutting-edge biomedical research questions demonstrate the increasing role and decisive impact of computational statistics. The role of well-designed and well-communicated simulation studies is emphasized, and computational statistics is put into the framework of the International Association for Statistical Computing (IASC) and of the special issues on Computational Statistics within Clinical Research launched by the journal Computational Statistics and Data Analysis (CSDA).

Lutz Edler, Christina Wunder, Wiebke Werft, Axel Benner
The Laws of Coincidence

Anomalous events often lie at the roots of discoveries in science and of actions in other domains. Familiar examples are the discovery of pulsars, the identification of the initial signs of an epidemic, and the detection of faults and fraud. In general, they are events which are seen as so unexpected or improbable that one is led to suspect there must be some underlying cause. However, to determine whether such events are genuinely improbable, one needs to evaluate their probability under normal conditions. It is all too easy to underestimate such probabilities. Using the device of a number of ‘laws’, this paper describes how apparent coincidences should be expected to happen by chance alone.

David J. Hand

ABC Methods for Genetic Data

Frontmatter
Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation

Approximate Bayesian Computation encompasses a family of likelihood-free algorithms for performing Bayesian inference in models defined in terms of a generating mechanism. The different algorithms rely on simulations of some summary statistics under the generative model and a rejection criterion that determines whether a simulation is rejected or not. In this paper, we incorporate Approximate Bayesian Computation into a local Bayesian regression framework. Using an empirical Bayes approach, we provide a simple criterion for 1) choosing the threshold above which a simulation should be rejected, 2) choosing the subset of informative summary statistics, and 3) choosing whether a summary statistic should be log-transformed or not.

Michael G.B. Blum
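For readers new to the rejection step discussed above, here is a minimal sketch of plain ABC rejection sampling in Python. It is not the paper's local regression framework; the toy normal model, uniform prior and mean summary statistic are illustrative assumptions.

```python
# Minimal ABC rejection sketch: draw from the prior, simulate summaries,
# keep the draws whose summaries fall closest to the observed one.
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n=100):
    """Generative model: here, simply Normal(theta, 1) samples."""
    return rng.normal(theta, 1.0, size=n)

def summary(x):
    """Summary statistic; the sample mean in this toy example."""
    return x.mean()

def abc_rejection(observed, n_draws=10_000, quantile=0.01):
    s_obs = summary(observed)
    thetas = rng.uniform(-10, 10, size=n_draws)          # prior draws
    s_sim = np.array([summary(simulate(t)) for t in thetas])
    dist = np.abs(s_sim - s_obs)                         # discrepancy to observed summary
    eps = np.quantile(dist, quantile)                    # acceptance threshold
    return thetas[dist <= eps]                           # approximate posterior sample

posterior = abc_rejection(rng.normal(2.0, 1.0, size=100))
print(posterior.mean(), posterior.std())
```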
Integrating Approximate Bayesian Computation with Complex Agent-Based Models for Cancer Research

Multi-scale agent-based models such as hybrid cellular automata and cellular Potts models are now being used to study mechanisms involved in cancer formation and progression, including cell proliferation, differentiation, migration, invasion and cell signaling. Due to their complexity, statistical inference for such models is a challenge. Here we show how approximate Bayesian computation can be exploited to provide a useful tool for inferring posterior distributions. We illustrate our approach in the context of a cellular Potts model for a human colon crypt, and show how molecular markers can be used to infer aspects of stem cell dynamics in the crypt.

Andrea Sottoriva, Simon Tavaré

Algorithms for Robust Statistics

Frontmatter
Robust Model Selection with LARS Based on S-estimators

We consider the problem of selecting a parsimonious subset of explanatory variables from a potentially large collection of covariates. We are concerned with the case when data quality may be unreliable (e.g. there might be outliers among the observations). When the number of available covariates is moderately large, fitting all possible subsets is not a feasible option. Sequential methods like forward or backward selection are generally “greedy” and may fail to include important predictors when these are correlated. To avoid this problem Efron et al. (2004) proposed the Least Angle Regression algorithm to produce an ordered list of the available covariates (sequencing) according to their relevance. We introduce outlier robust versions of the LARS algorithm based on S-estimators for regression (Rousseeuw and Yohai (1984)). This algorithm is computationally efficient and suitable even when the number of variables exceeds the sample size. Simulation studies show that it is also robust to the presence of outliers in the data and compares favourably to previous proposals in the literature.

Claudio Agostinelli, Matias Salibian-Barrera
Robust Methods for Compositional Data

Many practical data sets in environmental sciences, official statistics and various other disciplines are in fact compositional data because only the ratios between the variables are informative. Compositional data are represented in the Aitchison geometry on the simplex, and for applying statistical methods designed for the Euclidean geometry they need to be transformed first. The isometric logratio (ilr) transformation has the best geometrical properties, and it avoids the singularity problem introduced by the centered logratio (clr) transformation. Robust multivariate methods based on robust covariance estimation can thus only be used with ilr-transformed data. However, the results are usually difficult to interpret because the ilr coordinates are formed by non-linear combinations of the original variables. We show for different multivariate methods how robustness can be managed for compositional data, and provide algorithms for the computation.

Peter Filzmoser, Karel Hron
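For concreteness, below is a minimal sketch of one standard pivot-coordinate form of the ilr transformation mentioned above; a robust covariance estimator would then be applied to the resulting coordinates. The particular coordinate choice is the usual pivot form, not necessarily the exact variant used in the paper.

```python
# ilr pivot coordinates: z_i = sqrt(i/(i+1)) * log(gm(x_1..x_i) / x_{i+1})
import numpy as np

def ilr(X):
    """Map compositions (rows of X, strictly positive parts) to D-1 ilr coordinates."""
    X = np.asarray(X, dtype=float)
    D = X.shape[1]
    Z = np.empty((X.shape[0], D - 1))
    for i in range(1, D):
        gm = np.exp(np.log(X[:, :i]).mean(axis=1))   # geometric mean of the first i parts
        Z[:, i - 1] = np.sqrt(i / (i + 1)) * np.log(gm / X[:, i])
    return Z

comp = np.array([[0.2, 0.3, 0.5],
                 [0.1, 0.6, 0.3]])
print(ilr(comp))
```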
Detecting Multivariate Outliers Using Projection Pursuit with Particle Swarm Optimization

Detecting outliers in the context of multivariate data is known to be an important but difficult task, and several detection methods already exist. Most of the proposed methods are based either on the Mahalanobis distance of the observations to the center of the distribution or on a projection pursuit (PP) approach. In the present paper we focus on the one-dimensional PP approach, which may be of particular interest when the data are not elliptically symmetric. We give a survey of the statistical literature on PP for multivariate outlier detection and investigate the pros and cons of the different methods. We also propose the use of a recent heuristic optimization algorithm called Tribes for multivariate outlier detection in the projection pursuit context.

Anne Ruiz-Gazen, Souad Larabi Marie-Sainte, Alain Berro
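The sketch below illustrates the one-dimensional projection pursuit idea in its simplest form: random candidate directions and a robust standardization of the projected points. The Tribes/PSO search, the projection index and the cutoff used in the paper are replaced here by illustrative choices.

```python
# Naive projection pursuit outlier search over random directions.
import numpy as np

def pp_outliers(X, n_dir=2000, cutoff=3.0, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    n, d = X.shape
    best_val, best_scores = -np.inf, None
    for _ in range(n_dir):
        a = rng.normal(size=d)
        a /= np.linalg.norm(a)                        # random unit direction
        p = X @ a
        med = np.median(p)
        mad = np.median(np.abs(p - med)) * 1.4826     # robust scale of the projection
        scores = np.abs(p - med) / mad
        index = scores.max()                          # projection index: most extreme point
        if index > best_val:
            best_val, best_scores = index, scores
    return np.where(best_scores > cutoff)[0]          # indices of flagged observations

X = np.random.default_rng(1).normal(size=(200, 5))
X[:3] += 6                                            # three planted outliers
print(pp_outliers(X))
```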

Brain Imaging

Frontmatter
Imaging Genetics: Bio-Informatics and Bio-Statistics Challenges

The IMAGEN study, a very large European Research Project, seeks to identify and characterize biological and environmental factors that influence teenagers' mental health. To this aim, the consortium plans to collect data for more than 2000 subjects at 8 neuroimaging centres. These data comprise neuroimaging data, behavioral tests (for up to 5 hours of testing), and also white blood samples which are collected and processed to obtain 650 k single nucleotide polymorphisms (SNP) per subject. Data for more than 1000 subjects have already been collected. We describe the statistical aspects of these data and the challenges, such as the multiple comparison problem, created by such a large imaging genetics study (i.e., 650 k for the SNP, 50 k data per neuroimage). We also suggest possible strategies, and present some first investigations using uni- or multivariate methods in association with re-sampling techniques. Specifically, because the number of variables is very high, we first reduce the data size and then use multivariate (CCA, PLS) techniques in association with re-sampling techniques.

Jean-Baptiste Poline, Christophe Lalanne, Arthur Tenenhaus, Edouard Duchesnay, Bertrand Thirion, Vincent Frouin
The NPAIRS Computational Statistics Framework for Data Analysis in Neuroimaging

We introduce the role of resampling and prediction (p) metrics for flexible discriminant modeling in neuroimaging, and highlight the importance of combining these with measurements of the reproducibility (r) of extracted brain activation patterns. Using the NPAIRS resampling framework we illustrate the use of (p, r) plots as a function of the size of the principal component subspace (Q) for a penalized discriminant analysis (PDA) to: optimize processing pipelines in functional magnetic resonance imaging (fMRI), and measure the global SNR (gSNR) and dimensionality of fMRI data sets. We show that the gSNRs of typical fMRI data sets cause the optimal Q for a PDA to often lie in a phase transition region between gSNR ≃ 1 with large optimal Q versus SNR ≫ 1 with small optimal Q.

Stephen Strother, Anita Oder, Robyn Spring, Cheryl Grady

Computational Econometrics

Frontmatter
Bootstrap Prediction in Unobserved Component Models

One advantage of state space models is that they deliver estimates of the unobserved components and predictions of future values of the observed series and their corresponding Prediction Mean Squared Errors (PMSE). However, these PMSE are obtained by running the Kalman filter with the true parameters substituted by consistent estimates and, consequently, they do not incorporate the uncertainty due to parameter estimation. This paper reviews new bootstrap procedures to estimate the PMSEs of the unobserved states and to construct prediction intervals of future observations that incorporate parameter uncertainty and do not rely on particular assumptions about the error distribution. The new bootstrap PMSEs of the unobserved states have smaller biases than those obtained with alternative procedures. Furthermore, the prediction intervals have better coverage properties. The results are illustrated by obtaining prediction intervals of the quarterly mortgage changes and of the unobserved output gap in the USA.

Alejandro F. Rodríguez, Esther Ruiz

Computer-Intensive Actuarial Methods

Frontmatter
A Numerical Approach to Ruin Models with Excess of Loss Reinsurance and Reinstatements

The present paper studies some computational challenges for the determination of the probability of ruin of an insurer, if excess of loss reinsurance with reinstatements is applied. In the setting of classical risk theory, a contractive integral operator is studied whose fixed point is the ruin probability of the cedent. We develop and implement a recursive algorithm involving high-dimensional integration to obtain a numerical approximation of this quantity. Furthermore we analyze the effect of different starting functions and recursion depths on the performance of the algorithm and compare the results with the alternative of stochastic simulation of the risk process.

Hansjörg Albrecher, Sandra Haas
Computation of the Aggregate Claim Amount Distribution Using R and Actuar

actuar is a package providing additional Actuarial Science functionality to the R statistical system. This paper presents the features of the package targeted at risk theory calculations. Risk theory refers to a body of techniques to model and measure the risk associated with a portfolio of insurance contracts. The main quantity of interest for the actuary is the distribution of total claims over a fixed period of time, modeled using the classical collective model of risk theory. actuar provides functions to discretize continuous distributions and to compute the aggregate claim amount distribution using many techniques, including the recursive method and simulation. The package also provides various plotting and summary methods to ease working with aggregate models.

Vincent Goulet
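To make the recursive method concrete, here is a minimal Python sketch of Panjer's recursion for Poisson claim counts. It mirrors what a recursive aggregate-claim computation does but does not use the actuar package itself; the Poisson parameter and the already-discretized severity are illustrative assumptions.

```python
# Panjer recursion for S = X_1 + ... + X_N with N ~ Poisson(lam) and
# severity X discretized on the non-negative integers.
import numpy as np

def panjer_poisson(lam, fx, smax):
    """fx[k] = P(single claim = k); returns P(S = s) for s = 0..smax."""
    fs = np.zeros(smax + 1)
    fs[0] = np.exp(lam * (fx[0] - 1.0))               # Poisson pgf evaluated at fx[0]
    for s in range(1, smax + 1):
        j = np.arange(1, min(s, len(fx) - 1) + 1)
        fs[s] = (lam / s) * np.sum(j * fx[j] * fs[s - j])
    return fs

severity = np.array([0.0, 0.5, 0.3, 0.2])             # discretized severity on {0,1,2,3}
agg = panjer_poisson(lam=2.0, fx=severity, smax=30)
print(agg.sum(), agg[:5])                             # total mass should be close to 1
```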
Applications of Multilevel Structured Additive Regression Models to Insurance Data

Models with structured additive predictor provide a very broad and rich framework for complex regression modeling. They can deal simultaneously with nonlinear covariate effects and time trends, unit- or cluster-specific heterogeneity, spatial heterogeneity and complex interactions between covariates of different types. In this paper, we discuss a hierarchical version of regression models with structured additive predictor and its applications to insurance data. That is, the regression coefficients of a particular nonlinear term may obey another regression model with structured additive predictor. The proposed model may be regarded as an extended version of a multilevel model with nonlinear covariate terms in every level of the hierarchy. We describe several highly efficient MCMC sampling schemes that allow complex models with several hierarchy levels and a large number of observations to be estimated typically within a couple of minutes. We demonstrate the usefulness of the approach with applications to insurance data.

Stefan Lang, Nikolaus Umlauf

Data Stream Mining

Frontmatter
Temporally-Adaptive Linear Classification for Handling Population Drift in Credit Scoring

Classification methods have proven effective for predicting the creditworthiness of credit applications. However, the tendency of the underlying populations to change over time, population drift, is a fundamental problem for such classifiers. The problem manifests as decreasing performance as the classifier ages and is typically handled by periodic classifier reconstruction. To maintain performance between rebuilds, we propose an adaptive and incremental linear classification rule that is updated on the arrival of new labeled data. We consider adapting this method to suit credit application classification and demonstrate, with real loan data, that the method outperforms static and periodically rebuilt linear classifiers.

Niall M. Adams, Dimitris K. Tasoulis, Christoforos Anagnostopoulos, David J. Hand
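A minimal sketch of one common way to obtain an adaptive, incrementally updated linear scorer is recursive least squares with a forgetting factor, shown below. This illustrates the general idea of down-weighting old cases under drift; it is not the paper's specific adaptive rule, and the forgetting factor, simulated drift and all names are assumptions.

```python
# Forgetting-factor recursive least squares for a drifting linear scoring rule.
import numpy as np

class AdaptiveLinearScorer:
    def __init__(self, d, forget=0.99, delta=10.0):
        self.w = np.zeros(d)
        self.P = delta * np.eye(d)        # (scaled) inverse information matrix
        self.lam = forget                 # forgetting factor in (0, 1]

    def update(self, x, y):
        """One labelled case (x, y); y is 0/1 creditworthiness."""
        Px = self.P @ x
        k = Px / (self.lam + x @ Px)                  # gain vector
        self.w = self.w + k * (y - x @ self.w)        # update coefficients
        self.P = (self.P - np.outer(k, Px)) / self.lam

    def score(self, x):
        return x @ self.w

rng = np.random.default_rng(0)
model = AdaptiveLinearScorer(d=3)
true_w = np.array([1.0, -1.0, 0.5])
for t in range(5000):
    if t == 2500:
        true_w = np.array([-0.5, 1.0, 0.5])           # simulated population drift
    x = rng.normal(size=3)
    y = float(x @ true_w + 0.3 * rng.normal() > 0)
    model.update(x, y)
print(model.w)
```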
Large-Scale Machine Learning with Stochastic Gradient Descent

During the last decade, data sizes have grown faster than the speed of processors. In this context, the capabilities of statistical machine learning methods are limited by the computing time rather than the sample size. A more precise analysis uncovers qualitatively different tradeoffs for the case of small-scale and large-scale learning problems. The large-scale case involves the computational complexity of the underlying optimization algorithm in non-trivial ways. Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems. In particular, second-order stochastic gradient and averaged stochastic gradient are asymptotically efficient after a single pass on the training set.

Léon Bottou
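The following is a minimal sketch of averaged stochastic gradient descent on a logistic-regression loss, illustrating the single-pass averaging idea mentioned above. The step-size schedule and synthetic data are illustrative assumptions, not the paper's exact recipes.

```python
# Averaged SGD for logistic regression: keep a running average of the iterates.
import numpy as np

def asgd_logistic(X, y, epochs=1, lr0=0.5):
    """One or a few passes of SGD with iterate averaging; y in {0, 1}."""
    rng = np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    w_bar = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            lr = lr0 / (1.0 + lr0 * t) ** 0.75        # slowly decaying step size
            p = 1.0 / (1.0 + np.exp(-X[i] @ w))
            w -= lr * (p - y[i]) * X[i]               # stochastic gradient of the log-loss
            w_bar += (w - w_bar) / t                  # running average of iterates
    return w_bar

rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=5000) > 0).astype(float)
print(asgd_logistic(X, y))
```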

Functional Data Analysis

Frontmatter
Anticipated and Adaptive Prediction in Functional Discriminant Analysis

Linear discriminant analysis with binary response is considered when the predictor is a functional random variable $$X=\{X_{t},t\in [0,T]\}$$, $$T \in\mathbb{R}$$. Motivated by a food industry problem, we develop a methodology to anticipate the prediction by determining the smallest $$T^{*}$$, $$T^{*} \leq T$$, such that $$X^{*} = \{X_{t}, t\in [0,T^{*}]\}$$ and $$X$$ give similar predictions. The adaptive prediction concerns the observation of a new curve $$\omega$$ on $$[0, T^{*}(\omega)]$$ instead of $$[0,T]$$ and answers the question “How long should we observe $$\omega$$ ($$T^{*}(\omega)=?$$) to obtain the same prediction as on $$[0,T]$$?”. We answer this question by defining a conservation measure with respect to the class in which the new curve is predicted.

Cristian Preda, Gilbert Saporta, Mohamed Hadj Mbarek
Bootstrap Calibration in Functional Linear Regression Models with Applications

Our work focuses on the functional linear model given by $$Y=\langle\theta,X\rangle+\epsilon,$$ where $$Y$$ and $$\epsilon$$ are real random variables, $$X$$ is a zero-mean random variable valued in a Hilbert space $$(\mathcal{H},\langle\cdot,\cdot\rangle)$$, and $$\theta\in\mathcal{H}$$ is the fixed model parameter. Using an initial sample $$\{(X_i,Y_i)\}_{i=1}^n$$, a bootstrap resampling $$Y_i^{*}=\langle\hat{\theta},X_i\rangle+\hat{\epsilon}_i^{*}$$, $$i=1,\ldots,n$$, is proposed, where $$\hat{\theta}$$ is a general pilot estimator, and $$\hat{\epsilon}_i^{*}$$ is a naive or wild bootstrap error. The consistency obtained for the bootstrap allows us to calibrate distributions such as $$P_X\{\sqrt{n}(\langle\hat{\theta},x\rangle-\langle\theta,x\rangle)\leq y\}$$ for a fixed $$x$$, where $$P_X$$ is the probability conditional on $$\{X_i\}_{i=1}^n$$. Different applications illustrate the usefulness of the bootstrap for testing different hypotheses related to $$\theta$$, and a brief simulation study is also presented.

Wenceslao González-Manteiga, Adela Martínez-Calvo
Empirical Dynamics and Functional Data Analysis

We review some recent developments on modeling and estimation of dynamic phenomena within the framework of Functional Data Analysis (FDA). The focus is on longitudinal data which correspond to sparsely and irregularly sampled repeated measurements that are contaminated with noise and are available for a sample of subjects. A main modeling assumption is that the data are generated by underlying but unobservable smooth trajectories that are realizations of a Gaussian process. In this setting, with only a few measurements available per subject, classical methods of Functional Data Analysis that are based on presmoothing individual trajectories will not work. We review the estimation of derivatives for sparse data, the PACE package to implement these procedures, and an empirically derived stochastic differential equation that the processes satisfy and that consists of a linear deterministic component and a drift process.

Hans-Georg Müller

Kernel Methods

Frontmatter
Indefinite Kernel Discriminant Analysis

Kernel methods for data analysis are frequently considered to be restricted to positive definite kernels. In practice, however, indefinite kernels arise, e.g., from problem-specific kernel construction or optimized similarity measures. We therefore present formal extensions of some kernel discriminant analysis methods which can be used with indefinite kernels. In particular these are the multi-class kernel Fisher discriminant and the kernel Mahalanobis distance. The approaches are empirically evaluated in classification scenarios on indefinite multi-class datasets.

Bernard Haasdonk, Elżbieta Pȩkalska
Data Dependent Priors in PAC-Bayes Bounds

One of the central aims of Statistical Learning Theory is the bounding of the test set performance of classifiers trained with i.i.d. data. For Support Vector Machines the tightest technique for assessing this so-called generalisation error is known as the PAC-Bayes theorem. The bound holds independently of the choice of prior, but better priors lead to sharper bounds. The priors leading to the tightest bounds to date are spherical Gaussian distributions whose means are determined from a separate subset of data. This paper gives another turn of the screw by introducing a further data dependence on the shape of the prior: the separate data set determines a direction along which the covariance matrix of the prior is stretched in order to sharpen the bound. In addition, we present a classification algorithm that aims at minimizing the bound as a design criterion and whose generalisation can be easily analysed in terms of the new bound.

The experimental work includes a set of classification tasks preceded by a bound-driven model selection. These experiments illustrate how the new bound acting on the new classifier can be much tighter than the original PAC-Bayes Bound applied to an SVM, and lead to more accurate classifiers.

John Shawe-Taylor, Emilio Parrado-Hernández, Amiran Ambroladze

Monte Carlo Methods in System Safety, Reliability and Risk Analysis

Frontmatter
Some Algorithms to Fit some Reliability Mixture Models under Censoring

Estimating the unknown parameters of a reliability mixture model may be a more or less intricate problem, especially if durations are censored. We present several iterative methods based on Monte Carlo simulation that make it possible to fit parametric or semiparametric mixture models, provided they are identifiable. We show for example that the well-known data augmentation algorithm may be used successfully to fit semiparametric mixture models under right censoring. Our methods are illustrated by a reliability example.

Laurent Bordes, Didier Chauveau
Computational and Monte-Carlo Aspects of Systems for Monitoring Reliability Data

Monitoring plays a key role in today’s business environment, as large volumes of data are collected and processed on a regular basis. The ability to detect the onset of new data regimes and patterns quickly is considered an important competitive advantage. Of special importance is the area of monitoring product reliability, where timely detection of unfavorable trends typically offers considerable opportunities for cost avoidance. We will discuss detection systems for reliability issues built by combining Monte-Carlo techniques with modern statistical methods rooted in the theory of Sequential Analysis, Change-point theory and Likelihood Ratio tests. We will illustrate applications of these methods in the computer industry.

Emmanuel Yashchin

Optimization Heuristics in Statistical Modelling

Frontmatter
Evolutionary Computation for Modelling and Optimization in Finance

In the last decades, there has been a tendency to move away from mathematically tractable but simplistic models towards more sophisticated, real-world models in finance. However, the consequence of the improved sophistication is that model specification and analysis are no longer mathematically tractable. Instead, solutions need to be numerically approximated. For this task, evolutionary computation heuristics are the appropriate means, because they do not require any rigid mathematical properties of the model. Evolutionary algorithms are search heuristics, usually inspired by Darwinian evolution and Mendelian inheritance, which aim to determine the optimal solution to a given problem by competition and alteration of candidate solutions in a population. In this work, we focus on credit risk modelling and financial portfolio optimization to point out how evolutionary algorithms can easily provide reliable and accurate solutions to challenging financial problems.

Sandra Paterlini

Spatial Statistics / Spatial Epidemiology

Frontmatter
Examining the Association between Deprivation Profiles and Air Pollution in Greater London using Bayesian Dirichlet Process Mixture Models

Standard regression analyses are often plagued with problems encountered when one tries to make inference going beyond main effects, using datasets that contain dozens of variables that are potentially correlated. This situation arises, for example, in environmental deprivation studies, where a large number of deprivation scores are used as covariates, yielding a potentially unwieldy set of interrelated data from which teasing out the joint effect of multiple deprivation indices is difficult. We propose a method, based on Dirichlet-process mixture models, that addresses these problems by using, as its basic unit of inference, a profile formed from a sequence of continuous deprivation measures. These deprivation profiles are clustered into groups and associated via a regression model to an air pollution outcome. The Bayesian clustering aspect of the proposed modeling framework has a number of advantages over traditional clustering approaches in that it allows the number of groups to vary, uncovers clusters and examines their association with an outcome of interest, and fits the model as a unit, allowing a region’s outcome potentially to influence cluster membership. The method is demonstrated with an analysis of UK Indices of Deprivation and PM10 exposure measures corresponding to super output areas (SOAs) in Greater London.

John Molitor, Léa Fortunato, Nuoo-Ting Molitor, Sylvia Richardson
Assessing the Association between Environmental Exposures and Human Health

In environmental health studies, health effects, environmental exposures, and potential confounders are seldom collected during the study on the same set of units. Some, if not all, of the variables are often obtained from existing programs and databases. Suppose environmental exposure is measured at points, but health effects are recorded on areal units. Further assume that a regression analysis that explores the association between health and environmental exposure is to be conducted at the areal level. Prior to analysis, the information collected on exposure at points is used to predict exposure at the areal level, introducing uncertainty in exposure for the analysis units. Estimation of the regression coefficient associated with exposure and of its standard error is considered here. A simulation study is used to provide insight into the effects of predicting exposure. Open issues are discussed.

Linda J. Young, Carol A. Gotway, Kenneth K. Lopiano, Greg Kearney, Chris DuClos

ARS Session (Financial) Time Series

Frontmatter
Semiparametric Seasonal Cointegrating Rank Selection

This paper considers the issue of seasonal cointegrating rank selection by information criteria as the extension of Cheng and Phillips (The Econometrics Journal (2009), Vol. 12, pp. S83–S104). The method does not require the specification of lag length in vector autoregression, is convenient in empirical work, and is in a semiparametric context because it allows for a general short memory error component in the model with only lags related to error correction terms. Some limit properties of usual information criteria are given for the rank selection and small Monte Carlo simulations are conducted to evaluate the performances of the criteria.

Byeongchan Seong, Sung K. Ahn, Sinsup Cho
Estimating Factor Models for Multivariate Volatilities: An Innovation Expansion Method

We introduce an innovation expansion method for the estimation of factor models for the conditional variance (volatility) of a multivariate time series. We estimate the factor loading space and the number of factors by a stepwise optimization algorithm based on expanding the “white noise space”. A simulation and a real data example are given for illustration.

Jiazhu Pan, Wolfgang Polonik, Qiwei Yao
Multivariate Stochastic Volatility Model with Cross Leverage

The Bayesian estimation method using Markov chain Monte Carlo is proposed for a multivariate stochastic volatility model that is a natural extension of the univariate stochastic volatility model with leverage, where we further incorporate cross leverage effects among stock returns.

Tsunehiro Ishihara, Yasuhiro Omori

KDD Session: Topological Learning

Frontmatter
Bag of Pursuits and Neural Gas for Improved Sparse Coding

Sparse coding employs low-dimensional subspaces in order to encode high-dimensional signals. Finding the optimal subspaces is a difficult optimization task. We show that stochastic gradient descent is superior in finding the optimal subspaces compared to MOD and K-SVD, which are both state-of-the-art methods. The improvement is most significant in the difficult setting of highly overlapping subspaces. We introduce the so-called “Bag of Pursuits”, which is derived from Orthogonal Matching Pursuit. It provides an improved approximation of the optimal sparse coefficients, which, in turn, significantly improves the performance of the gradient descent approach as well as of MOD and K-SVD. In addition, the “Bag of Pursuits” makes it possible to employ a generalized version of the Neural Gas algorithm for sparse coding, which finally leads to an even more powerful method.

Kai Labusch, Erhardt Barth, Thomas Martinetz
On the Role and Impact of the Metaparameters in t-distributed Stochastic Neighbor Embedding

Similarity-based embedding is a paradigm that recently gained interest in the field of nonlinear dimensionality reduction. It provides an elegant framework that naturally emphasizes the preservation of the local structure of the data set. An emblematic method in this trend is t-distributed stochastic neighbor embedding (t-SNE), which is acknowledged to be an efficient method in the recent literature. This paper aims at analyzing the reasons of this success, together with the impact of the two metaparameters embedded in the method. Moreover, the paper shows that t-SNE can be interpreted as a distance-preserving method with a specific distance transformation, making the link with existing methods. Experiments on artificial data support the theoretical discussion.

John A. Lee, Michel Verleysen
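As a quick way to explore the metaparameter question empirically, the snippet below runs scikit-learn's t-SNE over a few perplexity values; the data set and the perplexity grid are arbitrary choices for illustration, and the second metaparameter analyzed in the paper (the degrees of freedom of the Student t) is not exposed by this implementation.

```python
# Vary the perplexity metaparameter of t-SNE and inspect the resulting embeddings.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X = load_digits().data[:500]                  # small subset to keep runtimes short
for perplexity in (5, 30, 50):
    emb = TSNE(n_components=2, perplexity=perplexity, random_state=0).fit_transform(X)
    print(perplexity, emb.shape, emb.std(axis=0))
```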

IFCS Session: New Developments in Two- or Higher-Mode Clustering; Model-Based Clustering and Reduction for High-Dimensional Data

Frontmatter
Multiple Nested Reductions of Single Data Modes as a Tool to Deal with Large Data Sets

The increased accessibility and concerted use of novel measurement technologies give rise to a data tsunami with matrices that comprise both a high number of variables and a high number of objects. As an example, one may think of transcriptomics data pertaining to the expression of a large number of genes in a large number of samples or tissues (as included in various compendia). The analysis of such data typically implies ill-conditioned optimization problems, as well as major challenges on both a computational and an interpretational level.

In the present paper, we develop a generic method to deal with these problems. This method was originally briefly proposed by Van Mechelen and Schepers (2007). It implies that single data modes (i.e., the set of objects or the set of variables under study) are subjected to multiple (discrete and/or dimensional) nested reductions.

We first formally introduce the generic multiple nested reductions method. Next, we show how a few recently proposed modeling approaches fit within the framework of this method. Subsequently, we briefly introduce a novel instantiation of the generic method, which simultaneously includes a two-mode partitioning of the objects and variables under study (Van Mechelen et al. (2004)) and a low-dimensional, principal component-type dimensional reduction of the two-mode cluster centroids. We illustrate this novel instantiation with an application on transcriptomics data for normal and tumourous colon tissues.

In the discussion, we highlight multiple nested mode reductions as a key feature of the novel method. Furthermore, we contrast the novel method with other approaches that imply different reductions for different modes, and with approaches that imply a hybrid dimensional/discrete reduction of a single mode. Finally, we show in which way the multiple reductions method allows a researcher to deal with the challenges implied by the analysis of large data sets as outlined above.

Iven Van Mechelen, Katrijn Van Deun
The Generic Subspace Clustering Model

In this paper we present an overview of methods for clustering high dimensional data in which the objects are assigned to mutually exclusive classes in low dimensional spaces. To this end, we will introduce the generic subspace clustering model. This model will be shown to encompass a range of existing clustering techniques as special cases. As such, further insight is obtained into the characteristics of these techniques and into their mutual relationships. This knowledge facilitates selecting the most appropriate model variant in empirical practice.

Marieke E. Timmerman, Eva Ceulemans
Clustering Discrete Choice Data

When clustering discrete choice (e.g. customers by products) data, we may be interested in partitioning individuals into disjoint classes which are homogeneous with respect to product choices and, given the availability of individual- or outcome-specific covariates, in investigating how these affect the likelihood of being in certain categories (i.e. of choosing certain products). Here, a model for the joint clustering of statistical units (e.g. consumers) and variables (e.g. products) is proposed in a mixture modeling framework, and the corresponding (modified) EM algorithm is sketched. The proposed model can be easily linked to similar proposals that have appeared in various contexts, such as co-clustering gene expression data or clustering words and documents in web-mining data analysis.

Donatella Vicari, Marco Alfò

Selected Contributed Papers

Frontmatter
Application of Local Influence Diagnostics to the Buckley-James Model

This article reports the development of local influence diagnostics for the Buckley-James model, consisting of variance perturbation, response variable perturbation and independent variables perturbation. The proposed diagnostics improve on previous ones by allowing both censored and uncensored observations to be identified as influential. Note that, in the previous diagnostics for the Buckley-James model, influential observations merely come from uncensored observations in the data set. An example based on the Stanford heart transplant data is used for illustration. A data set with three covariates is considered in an attempt to show how the proposed diagnostics can handle more than one covariate, which is a concern to us as it is more difficult to identify peculiar observations in a multiple-covariate setting.

Nazrina Aziz, Dong Qian Wang
Multiblock Method for Categorical Variables. Application to the Study of Antibiotic Resistance

We address the problem of describing several categorical variables with a prediction purpose. We focus on methods in the multiblock modelling framework, each block being formed of the indicator matrix associated with each qualitative variable. We propose a method, called categorical multiblock Redundancy Analysis, based on a well-identified global optimization criterion which leads to an eigensolution. In comparison with usual procedures, such as logistic regression, the method is well adapted to the case of a large number of redundant explanatory variables. Practical uses of the proposed method are illustrated using an empirical example in the field of epidemiology.

Stéphanie Bougeard, El Mostafa Qannari, Claire Chauvin
A Flexible IRT Model for Health Questionnaire: an Application to HRQoL

The aim of this study is to formulate a suitable Item Response Theory (IRT) based model to measure HRQoL (as a latent variable) using a mixed-responses questionnaire and relaxing the hypothesis of a normally distributed latent variable. The new model is a combination of two models: a latent trait model for mixed responses and an IRT model for a skew-normal latent variable. It is developed in a Bayesian framework. The proposed model was tested on a questionnaire composed of 5 discrete items and one continuous item to measure HRQoL in children. The new model has better performance in terms of Deviance Information Criterion, Markov chain Monte Carlo convergence times and precision of the estimates.

Serena Broccoli, Giulia Cavrini
Multidimensional Exploratory Analysis of a Structural Model Using a Class of Generalized Covariance Criteria

Our aim is to explore a structural model: several variable groups describing the same observations are assumed to be structured around latent dimensions that are linked through a linear model that may have several equations. This type of model is commonly dealt with by methods assuming that the latent dimension in each group is unique. However, conceptual models generally link concepts which are multidimensional. We propose a general class of criteria suitable for measuring the quality of a Structural Equation Model (SEM). This class contains the covariance criteria used in PLS Regression and the Multiple Covariance criterion of the SEER method. It also contains quartimax-related criteria. All criteria in the class must be maximized under a unit norm constraint. We give an equivalent unconstrained maximization program, and algorithms to solve it. This maximization is used within a general algorithm named THEME (Thematic Equation Model Exploration), which searches the group structures for all dimensions useful to the model. THEME extracts locally nested structural component models.

Xavier Bry, Thomas Verron, Patrick Redont
Semiparametric Models with Functional Responses in a Model Assisted Survey Sampling Setting : Model Assisted Estimation of Electricity Consumption Curves

This work adopts a survey sampling point of view to estimate the mean curve of large databases of functional data. When storage capacities are limited, selecting a small fraction of the observations with survey techniques is an interesting alternative to signal compression techniques. We propose here to take into account real or multivariate auxiliary information available at low cost for the whole population, with semiparametric model-assisted approaches, in order to improve the accuracy of Horvitz-Thompson estimators of the mean curve. We first estimate the functional principal components from a design-based point of view in order to reduce the dimension of the signals, and then propose semiparametric models to obtain estimations of the curves that are not observed. This technique is shown to be very effective on a real dataset of 18902 electricity meters measuring electricity consumption every half hour during two weeks.

Hervé Cardot, Alain Dessertaine, Etienne Josserand
Stochastic Approximation for Multivariate and Functional Median

We propose a very simple algorithm to estimate the geometric median, also called the spatial median, of multivariate (Small (1990)) or functional data (Gervini (2008)) when the sample size is large. A simple and fast iterative approach based on the Robbins-Monro algorithm (Duflo (1997)), as well as its averaged version (Polyak and Juditsky (1992)), is shown to be effective for large samples of high-dimensional data. They are very fast and only require O(Nd) elementary operations, where N is the sample size and d is the dimension of the data. The averaged approach is shown to be more effective and less sensitive to the tuning parameter. The ability of this new estimator to estimate the geometric median accurately and rapidly (about thirty times faster than the classical estimator) is illustrated on a large sample of 18902 electricity consumption curves measured every half hour during one week.

Hervé Cardot, Peggy Cénac, Mohamed Chaouch
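Below is a minimal sketch of an averaged Robbins-Monro type recursion for the geometric median, in the spirit of the abstract above; the step-size constants and the heavy-tailed test data are illustrative assumptions.

```python
# Single-pass stochastic approximation of the geometric (spatial) median,
# with Polyak-Ruppert averaging of the iterates; O(N d) operations overall.
import numpy as np

def averaged_stochastic_median(Z, c=1.0, alpha=0.75):
    """Z: (N, d) data matrix, processed once in a single pass."""
    N, d = Z.shape
    m = Z[0].astype(float)            # Robbins-Monro iterate
    m_bar = m.copy()                  # averaged iterate
    for n in range(1, N):
        diff = Z[n] - m
        norm = np.linalg.norm(diff)
        if norm > 0:
            m = m + (c / n ** alpha) * diff / norm    # step towards the new point
        m_bar += (m - m_bar) / (n + 1)                # running average
    return m_bar

rng = np.random.default_rng(0)
Z = rng.standard_t(df=2, size=(20000, 10))            # heavy-tailed sample
print(averaged_stochastic_median(Z)[:5])
```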
A Markov Switching Re-evaluation of Event-Study Methodology

This paper reconsiders event-study methodology in light of evidence showing that the Cumulative Abnormal Return (CAR) can result in misleading inferences about financial market efficiency and pre(post)-event behavior. In particular, CAR can be biased downward, due to the increased volatility on the event day and within event windows. We propose the use of Markov Switching Models to capture the effect of an event on security prices. The proposed methodology is applied to a set of 45 historical series on Credit Default Swap (CDS) quotes subject to multiple credit events, such as reviews for downgrading. Since CDSs provide insurance against the default of a particular company or sovereign entity, this study checks whether the market anticipates reviews for downgrading and evaluates the time period by which the announcements lag behind the market.

Rosella Castellano, Luisa Scaccia
Evaluation of DNA Mixtures Accounting for Sampling Variability

In the conventional evaluation of DNA mixtures, the allele frequencies are often taken as constants. But they are in fact estimated from a sample taken from a population, and thus the variability of the estimates has to be taken into account. Within a Bayesian framework, the evaluation of DNA mixtures accounting for sampling variability in the population database of allele frequencies is discussed in this paper. Concise and general formulae are provided for calculating the likelihood ratio when the people involved are biologically related. The implementation of the formulae is demonstrated on the analysis of a real example. The resulting formulae are shown to be more conservative, which is generally more favorable to the defendant.

Yuk-Ka Chung, Yue-Qing Hu, De-Gang Zhu, Wing K. Fung
Monotone Graphical Multivariate Markov Chains

In this paper, we show that a deeper insight into the relations among the marginal processes of a multivariate Markov chain can be gained by testing hypotheses of Granger non-causality, contemporaneous independence and monotone dependence coherent with a stochastic ordering. The tested hypotheses, associated with a multi-edge graph, are proven to be equivalent to equality and inequality constraints on the interactions of a multivariate logistic model parameterizing the transition probabilities. As the null hypothesis is specified by inequality constraints, the likelihood ratio statistic has a chi-bar-square asymptotic distribution whose tail probabilities can be computed by simulation. The introduced hypotheses are tested on real categorical time series.

Roberto Colombi, Sabrina Giordano
Using Observed Functional Data to Simulate a Stochastic Process via a Random Multiplicative Cascade Model

Considering functional data and an associated binary response, a method based on the definition of special Random Multiplicative Cascades to simulate the underlying stochastic process is proposed. We consider a class S of stochastic processes whose realizations are real continuous piecewise linear functions with a constraint on the increments, and the family R of all binary responses Y associated to a process X in S. Considering data from a continuous phenomenon evolving in a time interval [0, T] which can be simulated by a pair (X, Y) ∈ S × R, a prediction tool which makes it possible to predict Y at each point of [0, T] is introduced. An application to data from an industrial kneading process is considered.

G. Damiana Costanzo, S. De Bartolo, F. Dell’Accio, G. Trombetta
A Clusterwise Center and Range Regression Model for Interval-Valued Data

This paper aims to adapt clusterwise regression to interval-valued data. The proposed approach combines the dynamic clustering algorithm with the center and range regression method for interval-valued data in order to identify both the partition of the data and the relevant regression models, one for each cluster. Experiments with a car interval-valued data set show the usefulness of combining both approaches.

Francisco de A.T. de Carvalho, Gilbert Saporta, Danilo N. Queiroz
Contributions to Bayesian Structural Equation Modeling

Structural equation models (SEMs) are multivariate latent variable models used to model causality structures in data. A Bayesian estimation and validation of SEMs is proposed, and the identifiability of parameters is studied. The latter study shows that latent variables should be standardized in the analysis to ensure identifiability. This heuristic is in fact introduced to deal with complex identifiability constraints. To illustrate the point, identifiability constraints are calculated in a marketing application, in which posterior draws of the constraints are derived from the posterior conditional distributions of the parameters.

Séverine Demeyer, Nicolas Fischer, Gilbert Saporta
Some Examples of Statistical Computing in France During the 19th Century

Statistical computing emerged as a recognised topic in the seventies. Remember the first COMPSTAT symposium held in Vienna (1974)! But the need for proper computations in statistics arose much earlier. Indeed, the contributions by Laplace (1749-1829) and Legendre (1752-1833) to statistical estimation in linear models are well known. But further works of computational interest originated in the structuring of the concept of regression during the 19th century. While some were fully innovative, others now appear unsuccessful but are nevertheless informative. The paper discusses, from a French perspective, the computational aspects of selected examples.

Antoine de Falguerolles
Imputation by Gaussian Copula Model with an Application to Incomplete Customer Satisfaction Data

We propose the idea of imputing missing values based on conditional distributions, which requires knowledge of the joint distribution of all the data. The Gaussian copula is used to find a joint distribution and to implement the conditional distribution approach.

The focus remains on the examination of the appropriateness of an imputation algorithm based on the Gaussian copula.

In the present paper, we generalize and apply the copula model to incomplete correlated data using the imputation algorithm given by Käärik and Käärik (2009a).

The empirical context in the current paper is an imputation model using incomplete customer satisfaction data. The results indicate that the proposed algorithm performs well.

Meelis Käärik, Ene Käärik
On Multiple-Case Diagnostics in Linear Subspace Method

In this paper, we discuss sensitivity analysis in the linear subspace method, with a focus on multiple-case diagnostics.

The linear subspace method of Watanabe (1973) is a useful discriminant method in the field of pattern recognition. We have previously proposed sensitivity analyses for it, with single-case diagnostics and multiple-case diagnostics based on PCA.

We propose a modified multiple-case diagnostics using clustering and discuss its effectiveness with numerical simulations.

Kuniyoshi Hayashi, Hiroyuki Minami, Masahiro Mizuta
Fourier Methods for Sequential Change Point Analysis in Autoregressive Models

We develop a procedure for monitoring changes in the error distribution of autoregressive time series. The proposed procedure, unlike standard procedures which are also referred to, utilizes the empirical characteristic function of properly estimated residuals. The limit behavior of the test statistic is investigated under the null hypothesis, while computational and other relevant issues are addressed.

Marie Hušková, Claudia Kirch, Simos G. Meintanis
Computational Treatment of the Error Distribution in Nonparametric Regression with Right-Censored and Selection-Biased Data

Consider the regression model $$Y = m(X) + \varphi(X)\varepsilon,$$ where $$m(X) = E[Y\mid X]$$ and $$\varphi^{2}(X) = \mathrm{Var}[Y\mid X]$$ are unknown smooth functions and the error $$\varepsilon$$ (with unknown distribution) is independent of $$X$$. The pair $$(X, Y)$$ is subject to parametric selection bias and the response to right censoring. We construct a new estimator for the cumulative distribution function of the error $$\varepsilon$$, and develop a bootstrap technique to select the smoothing parameter involved in the procedure. The estimator is studied via extended simulations and applied to real unemployment data.

Géraldine Laurent, Cédric Heuchenne
Mixtures of Weighted Distance-Based Models for Ranking Data

Ranking data has applications in different fields of study, such as marketing, psychology and politics. Over the years, many models for ranking data have been developed. Among them, distance-based ranking models, which originate from the classical rank correlations, postulate that the probability of observing a ranking of items depends on the distance between the observed ranking and a modal ranking. The closer to the modal ranking, the higher the ranking probability is. However, such a model basically assumes a homogeneous population, and the single dispersion parameter may not be able to describe the data very well.

To overcome these limitations, we consider new weighted distance measures which allow different weights for different ranks, in order to formulate more flexible distance-based models. Mixtures of weighted distance-based models are also studied for analyzing heterogeneous data. Simulation results are included, and we apply the proposed methodology to analyze a real-world ranking dataset.

Paul H. Lee, Philip L. H. Yu
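To fix ideas, here is a small sketch of a weighted distance-based ranking model on four items: the probability of a ranking decays exponentially with a Kendall-type distance in which each discordant pair is weighted by the more important (smaller) rank it involves in the modal ranking. The weights, modal ranking and this particular weighting scheme are illustrative assumptions, not the paper's exact measures.

```python
# Weighted distance-based ranking model: P(pi) proportional to exp(-d_w(pi, pi0)).
import numpy as np
from itertools import permutations

def weighted_kendall(pi, pi0, w):
    """pi[i] is the rank of item i; each discordant pair contributes the weight
    of the smaller (more important) rank it involves in the modal ranking pi0."""
    d = 0.0
    n = len(pi)
    for i in range(n):
        for j in range(i + 1, n):
            if (pi[i] - pi[j]) * (pi0[i] - pi0[j]) < 0:   # discordant pair
                d += w[min(pi0[i], pi0[j]) - 1]
    return d

pi0 = (1, 2, 3, 4)                   # modal ranking (rank assigned to each item)
w = np.array([3.0, 2.0, 1.0, 1.0])   # heavier weight on disagreements near the top
rankings = list(permutations(range(1, 5)))
probs = np.array([np.exp(-weighted_kendall(p, pi0, w)) for p in rankings])
probs /= probs.sum()                 # normalize over all 24 rankings
print(rankings[int(np.argmax(probs))], probs.max())   # the modal ranking is most probable
```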
Fourier Analysis and Swarm Intelligence for Stochastic Optimization of Discrete Functions

A new methodology for solving discrete optimization problems by the continuous approach has been developed in this study. A discrete Fourier series method was derived and used for re-formulation of discrete objective functions as continuous functions. Particle Swarm Optimization (PSO) was then applied to locate the global optimal solutions of the continuous functions derived. The continuous functions generated by the proposed discrete Fourier series method correlated almost exactly with their original model functions. The PSO algorithm was observed to be highly successful in achieving global optimization of all such objective functions considered in this study. The results obtained indicated that the discrete Fourier series method coupled to the PSO algorithm is indeed a promising methodology for solving discrete optimization problems via the continuous approach.

Jin Rou New, Eldin Wee Chuan Lim
Global Hypothesis Test to Simultaneously Compare the Predictive Values of Two Binary Diagnostic Tests in Paired Designs: a Simulation Study

The positive and negative predictive values of a binary diagnostic test are measures of the clinical accuracy of the diagnostic that depend on the sensitivity and the specificity of the binary test and on the disease prevalence. Moreover, the positive predictive value and the negative predictive value are not parameters which are independent of each other. In this article, a global hypothesis test is studied to simultaneously compare the positive and negative predictive values of two binary diagnostic tests in paired designs.

J. A. Roldán Nofuentes, J. D. Luna del Castillo, M. A. Montero Alonso
Modeling Operational Risk: Estimation and Effects of Dependencies

Operational risk modeling, still in its early stages, has so far mainly concentrated on the marginal distributions of frequencies and severities within the context of the Loss Distribution Approach (LDA). In this study, drawing on a fairly large real-world data set, we analyze the effects of competing strategies for dependence modeling. In particular, we estimate tail dependence both via copulas and nonparametrically, and analyze its effect on aggregate risk-capital estimates.

Stefan Mittnik, Sandra Paterlini, Tina Yener
Learning Hierarchical Bayesian Networks for Genome-Wide Association Studies

We describe a novel probabilistic graphical model customized to represent the statistical dependencies between genetic markers in the Human genome. Our proposal relies on a forest of hierarchical latent class models. The motivation is to reduce the dimension of the data to be further submitted to statistical association tests with respect to diseased/non-diseased status. A generic algorithm, CFHLC, has been designed to tackle the learning of both the forest structure and the probability distributions. A first implementation has been shown to be tractable on benchmarks describing 10^5 variables for 2000 individuals.

Raphaël Mourad, Christine Sinoquet, Philippe Leray
Exact Posterior Distributions over the Segmentation Space and Model Selection for Multiple Change-Point Detection Problems

In segmentation problems, inference on change-point position and model selection are two difficult issues due to the discrete nature of change-points. In a Bayesian context, we derive exact, non-asymptotic, explicit and tractable formulae for the posterior distribution of variables such as the number of change-points or their positions. We also derive a new selection criterion that accounts for the reliability of the results. All these results are based on an efficient strategy to explore the whole segmentation space, which can be very large. We illustrate our methodology on both simulated data and a comparative genomic hybridisation profile.

G. Rigaill, E. Lebarbier, S. Robin
Parcellation Schemes and Statistical Tests to Detect Active Regions on the Cortical Surface

Activation detection in functional Magnetic Resonance Imaging (fMRI) datasets is usually performed by thresholding activation maps in the brain volume or, better, on the cortical surface. However, basing the analysis on a site-by-site statistical decision may be detrimental both to the interpretation of the results and to the sensitivity of the analysis, because a perfect point-to-point correspondence of brain surfaces from multiple subjects cannot be guaranteed in practice. In this paper, we propose a new approach that first defines anatomical regions such as cortical gyri outlined on the cortical surface, and then segments these regions into functionally homogeneous structures using a parcellation procedure that includes an explicit between-subject variability model, i.e. random effects. We show that random effects inference can be performed in this framework. Our procedure allows an exact control of the specificity using permutation techniques, and we show that the sensitivity of this approach is higher than the sensitivity of voxel- or cluster-level random effects tests performed on the cortical surface.

Bertrand Thirion, Alan Tucholka, Jean-Baptiste Poline
Robust Principal Component Analysis Based on Pairwise Correlation Estimators

Principal component analysis tries to explain and simplify the structure of multivariate data. For standardized variables, the principal components correspond to the eigenvectors of their correlation matrix. To obtain a robust principal component analysis, we estimate this correlation matrix componentwise by using robust pairwise correlation estimates. We show that the approach based on pairwise correlation estimators does not need a majority of outlier-free observations, which is very useful for high-dimensional problems. We further demonstrate that the “bivariate trimming” method works especially well in this setting.

Stefan Van Aelst, Ellen Vandervieren, Gert Willems
Ordinary Least Squares for Histogram Data Based on Wasserstein Distance

Histogram data is a kind of symbolic representation which allows an individual to be described by an empirical frequency distribution. In this paper we introduce a linear regression model for histogram variables. We present a new Ordinary Least Squares approach for estimating the linear model, using the Wasserstein metric between histograms. We suppose that the regression coefficients are scalar values. After illustrating competing approaches, we corroborate the proposed estimation method with an application to a real dataset.

Rosanna Verde, Antonio Irpino
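Since the model rests on the Wasserstein metric between histograms, a small sketch of that ingredient may help: under within-bin uniformity, the squared 2-Wasserstein distance is the L2 distance between quantile functions, approximated below on a probability grid. The bins, frequencies and grid size are illustrative assumptions.

```python
# Squared 2-Wasserstein distance between two histograms via quantile functions.
import numpy as np

def quantile_function(edges, freqs, probs):
    """Piecewise-linear quantile function of a histogram (uniform within bins)."""
    freqs = np.asarray(freqs, float) / np.sum(freqs)
    cum = np.concatenate(([0.0], np.cumsum(freqs)))   # cumulative probabilities at bin edges
    return np.interp(probs, cum, edges)

def wasserstein2_sq(h1, h2, n_grid=1000):
    p = (np.arange(n_grid) + 0.5) / n_grid            # probability grid
    q1 = quantile_function(*h1, p)
    q2 = quantile_function(*h2, p)
    return np.mean((q1 - q2) ** 2)                    # approximates the integral over (0, 1)

hist_a = (np.array([0.0, 1.0, 2.0, 4.0]), np.array([0.2, 0.5, 0.3]))
hist_b = (np.array([1.0, 2.0, 3.0, 5.0]), np.array([0.1, 0.6, 0.3]))
print(wasserstein2_sq(hist_a, hist_b))
```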
DetMCD in a Calibration Framework

The minimum covariance determinant (MCD) method is a robust estimator of multivariate location and scatter (Rousseeuw (1984)). Computing the exact MCD is very hard, so in practice one resorts to approximate algorithms. Most often the FASTMCD algorithm of Rousseeuw and Van Driessen (1999) is used. The FASTMCD algorithm is affine equivariant but not permutation invariant. Recently a deterministic algorithm, denoted DetMCD, has been developed which does not use random subsets and which is much faster (Hubert et al. (2010)). In this paper DetMCD is illustrated in a calibration framework. We focus on robust principal component regression and partial least squares regression, two very popular regression techniques for collinear data. We also apply DetMCD to data with missing elements after plugging it into the M-RPCR technique of Serneels and Verdonck (2009).

Tim Verdonck, Mia Hubert, Peter J. Rousseeuw
Separable Two-Dimensional Linear Discriminant Analysis

Several two-dimensional linear discriminant analysis (2DLDA) methods have received much attention in recent years. Among them, the 2DLDA introduced by Ye, Janardan and Li (2005) is an important development. However, it is found that their proposed iterative algorithm does not guarantee convergence. In this paper, we assume a separable covariance matrix of 2D data and propose separable 2DLDA, which provides a neat analytical solution similar to that of classical LDA. Empirical results on face recognition demonstrate the superiority of our proposed separable 2DLDA over 2DLDA in terms of classification accuracy and computational efficiency.

Jianhua Zhao, Philip L.H. Yu, Shulan Li
List of Supplementary Contributed and Invited Papers Only Available on springerlink.com

Clustering of Waveforms-Data Based on FPCA Direction

Giada Adelfio, Marcello Chiodi, Antonino D’Alessandro, Dario Luzio

Backmatter
Metadata
Title
Proceedings of COMPSTAT'2010
Editors
Yves Lechevallier
Gilbert Saporta
Copyright Year
2010
Publisher
Physica-Verlag HD
Electronic ISBN
978-3-7908-2604-3
Print ISBN
978-3-7908-2603-6
DOI
https://doi.org/10.1007/978-3-7908-2604-3
