
2011 | Book

Classification and Multivariate Analysis for Complex Data Structures

Editors: Bernard Fichet, Domenico Piccolo, Rosanna Verde, Maurizio Vichi

Publisher: Springer Berlin Heidelberg

Book Series: Studies in Classification, Data Analysis, and Knowledge Organization


About this book

The growing capability to generate and collect data has created an urgent need for new techniques and tools to analyze, classify and summarize statistical information, as well as to discover and characterize trends and to automatically flag anomalies. This volume presents the latest advances in data analysis methods for multidimensional data which can present a complex structure. The book offers a selection of papers presented at the first Joint Meeting of the Société Francophone de Classification and the Classification and Data Analysis Group of the Italian Statistical Society. Special attention is paid to new methodological contributions, from both the theoretical and the applied points of view, in the fields of Clustering, Classification, Time Series Analysis, Multidimensional Data Analysis, Knowledge Discovery from Large Datasets, and Spatial Statistics.

Table of Contents

Frontmatter

Key notes

Principal Component Analysis for Categorical Histogram Data: Some Open Directions of Research

In recent years, the analysis of symbolic data, where the units are categories, classes or concepts described by intervals, distributions, sets of categories and the like, has become a challenging task, since many applicative fields generate massive amounts of data that are difficult to store and to analyze with traditional techniques [1]. In this paper we propose a strategy for extending standard PCA to such data in the case where the variables' values are “categorical histograms” (i.e. a set of categories called bins together with their relative frequencies). These variables are a special case of “modal” variables (see, for example, Diday and Noirhomme [5]) or of “compositional” variables (Aitchison [1]) where the weights are not necessarily frequencies. First, we introduce “metabins”, which mix together bins of the different histograms and enhance interpretability. Standard PCA applied to the bins of such a data table loses the histogram constraints and assumes independence between the bins, whereas copulas take care of the probabilities and the underlying dependencies. Then, we give several ways of representing the units (called “individuals”), the bins, the variables and the metabins when the number of categories is not the same for each variable. A way of representing the variation of the individuals and of obtaining histograms in output is given. Finally, some theoretical results allow the representation of the categorical histogram variables inside a hypercube covering the correlation sphere.

Edwin Diday
Factorial Conjoint Analysis Based Methodologies

The aim of this paper is to underline the main contributions in the context of Factorial Conjoint Analysis. The integration of Conjoint Analysis with the exploratory tools of Multidimensional Data Analysis is the basis of different research strategies, proposed by the authors, combining the common estimation method with its geometrical representation. Here we present a systematic and unitary review of some of these methodologies, taking into account their contribution to several open problems.

Giuseppe Giordano, Carlo Natale Lauro, Germana Scepi
Ordering and Scaling Objects in Multivariate Data Under Nonlinear Transformations of Variables

An integrated iterative method is presented for the optimal ordering and scaling of objects in multivariate data, where the variables themselves may be transformed in the process of optimizing the objective function. Given an ordering of objects, optimal transformation of variables is guaranteed by the combined use of majorization (a particular (sub)gradient optimization method) and projection methods. The optimal sequencing is a combinatorial task and should not be carried out by applying standard optimization techniques based on gradients, because these are known to result in severe problems of local optima. Instead, a combinatorial data analysis strategy is adopted that amounts to a cyclic application of a number of local operations. A crucial objective for the overall method is the graphical display of the results, which is implemented by spacing the object points optimally over a one-dimensional continuum. An indication is given for how the overall process converges to a (possibly local) optimum. As an illustration, the method is applied to the analysis of a published observational data set.

Jacqueline J. Meulman, Lawrence J. Hubert, Phipps Arabie
Statistical Models to Predict Academic Churn Risk

This paper describes research conducted on university students' careers. The purpose is to study, describe and prevent the phenomenon of abandonment (churn). Results from predictive models may be employed to start personalized tutoring activities aimed at preventing this phenomenon.

Paolo Giudici, Emanuele Dequarti
The Poisson Processes in Cluster Analysis

This paper aims to review some uses of point processes in cluster analysis. The homogeneous Poisson process is, in many ways, the simplest point process, and it plays a role in point process theory in most respects analogous to that of the normal distribution in the study of random variables. We first propose a statistical model for cluster analysis based on the homogeneous Poisson process. The clustering criterion is derived from that model by maximum likelihood estimation. It consists in minimizing the sum of the Lebesgue measures of the convex hulls of the clusters. We also present a generalization of that model to the non-stationary Poisson process, as well as some monothetic divisive clustering methods also based on Poisson processes. On the other hand, the central problem of cluster validation is usually considered to be the determination of the best number of natural clusters. We present two likelihood ratio tests for the number of clusters based on Poisson processes. Most of these clustering methods and tests for the number of clusters have been extended to symbolic data.
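
The clustering criterion named above has a direct computational counterpart. The sketch below is illustrative only (not the authors' code; the function name and planar example are ours): for a given partition of points in the plane, it evaluates the sum of the Lebesgue measures (areas) of the convex hulls of the clusters.

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_criterion(points, labels):
    """Sum of convex-hull areas over clusters (2-D Lebesgue measure)."""
    total = 0.0
    for k in np.unique(labels):
        cluster = points[labels == k]
        if len(cluster) >= 3:                    # a 2-D hull needs at least 3 points
            total += ConvexHull(cluster).volume  # for 2-D data, .volume is the area
    return total

# Illustrative comparison: a well-separated partition scores lower than a random one.
rng = np.random.default_rng(0)
pts = np.vstack([rng.uniform(0, 1, (50, 2)), rng.uniform(2, 3, (50, 2))])
print(hull_criterion(pts, np.repeat([0, 1], 50)))
print(hull_criterion(pts, rng.integers(0, 2, 100)))
```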

André Hardy
TWO-CLASS Trees for Non-Parametric Regression Analysis

This paper shows that a regression tree problem can be turned into a classification tree problem, reducing the computational cost and providing useful interpretation aids. A TWO-CLASS tree methodology for non-parametric regression analysis is introduced. The data are as follows: a numerical response variable and a set of predictors (of categorical and/or numerical type) are measured on a sample of objects, with no probability assumption; thus a non-parametric approach is proposed. The concepts of prospective and retrospective splits are considered. The main idea is to grow a binary partition of the sample of objects such that, at each node of the tree structure, the numerical response is recoded into a dummy or two-class variable (called the theoretical response) on the basis of the optimal partition of the objects into two groups within the set of retrospective splits. A two-stage splitting criterion with a fast algorithm is applied: the best split of the objects is found in the set of candidate (prospective) splits of each predictor's modalities by maximizing the predictability of the two-class response. Applications to real-world cases and a simulation study demonstrate that the two-class splitting procedure is computationally less intensive than standard regression trees such as CART. Furthermore, the final partitions obtained by the two-class procedure and the standard one are very similar to each other, in terms of the percentage of objects belonging to the same terminal node. Some interpretation aids describe the response variable distribution in the terminal nodes.
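
As a rough illustration of the node-level idea described above (not the authors' TWO-CLASS algorithm; all names and the simple accuracy measure are our assumptions), the sketch below recodes the numerical response into a two-class theoretical response at the cut minimizing the within-group sum of squares, and then selects, among the candidate splits of a numerical predictor, the one most predictive of that two-class variable.

```python
import numpy as np

def best_response_dichotomy(y):
    """Cut on y minimizing the total within-group sum of squares (retrospective split)."""
    ys = np.sort(np.unique(y))
    best_cut, best_ss = None, np.inf
    for cut in (ys[:-1] + ys[1:]) / 2:
        left, right = y[y <= cut], y[y > cut]
        ss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if ss < best_ss:
            best_cut, best_ss = cut, ss
    return best_cut

def best_predictor_split(x, two_class):
    """Among prospective splits 'x <= c', the one most predictive of the two-class response."""
    xs = np.sort(np.unique(x))
    best_c, best_acc = None, -1.0
    for c in (xs[:-1] + xs[1:]) / 2:
        pred = x <= c
        acc = max((pred == two_class).mean(), (pred != two_class).mean())
        if acc > best_acc:
            best_c, best_acc = c, acc
    return best_c, best_acc

# Usage at a single node: recode y, then split on the predictor x.
# cut = best_response_dichotomy(y); c, acc = best_predictor_split(x, y <= cut)
```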

Roberta Siciliano, Massimo Aria

Classification and discrimination

Efficient Incorporation of Additional Information to Classification Rules

We propose and discuss improved classification rules for the case where a subset of the predictors is known to be ordered. We compare the performance of the new rules with other standard rules in a restricted normal setting, using simulation experiments and real data, and show their good performance.

Miguel Fernández, Cristina Rueda, Bonifacio Salvador
The Choice of the Parameter Values in a Multivariate Model of a Second Order Surface with Heteroscedastic Error

The paper describes an experimental procedure to choose the values of a multivariate vector x under these conditions: the average of Y(x) equal to a target value, and least variance of Y(x), which is linked to x by a second order model with a heteroscedastic error. The procedure consists of two steps. In the first step an experimental design is performed in the feasible space $$\mathcal {X}$$ of the control factors to estimate, by an iterative method, the parameters characterizing the response surface of the mean. Then a second experimental design is performed on a set $$\mathcal {A}$$, a subset of $$\mathcal {X}$$ satisfying a condition on the average of Y(x). This second step determines the choice of $$\mathbf {x}$$ by using a classification criterion based on the ordering of the sample mean squared errors. The research belongs to the theory of optimal design of experiments [2], which is employed in the Taguchi Methods used in off-line control [6].

Umberto Magagnoli, Gabriele Cantaluppi
Mixed Mode Data Clustering: An Approach Based on Tetrachoric Correlations

In this paper we face the problem of clustering mixed mode data by assuming that the observed binary variables are generated from latent continuous variables. We perform a principal components analysis on the matrix of tetrachoric correlations; we then estimate the scores of each latent variable and construct a data matrix with continuous variables to be used in fully Gaussian mixture models or in k-means cluster analysis. The calculation of the expected a posteriori (EAP) estimates may proceed by simply considering a limited number of quadrature points. Main results from a simulation study and from a real data set are reported.
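
A hedged sketch of one building block of the approach above: the maximum-likelihood tetrachoric correlation of two binary variables, assuming they are dichotomizations of a latent bivariate normal. Function names are ours; the chapter then runs PCA on the matrix of such correlations and clusters the estimated latent scores.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

def tetrachoric(x, y):
    """ML estimate of the latent correlation between two 0/1 variables."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    # Thresholds on the latent normals from the marginal proportions of ones.
    tau_x, tau_y = norm.ppf(1 - x.mean()), norm.ppf(1 - y.mean())
    n11 = np.sum(x & y); n10 = np.sum(x & ~y)
    n01 = np.sum(~x & y); n00 = np.sum(~x & ~y)

    def neg_loglik(rho):
        cov = [[1.0, rho], [rho, 1.0]]
        p00 = multivariate_normal.cdf([tau_x, tau_y], mean=[0, 0], cov=cov)
        p01 = norm.cdf(tau_x) - p00          # x below threshold, y above
        p10 = norm.cdf(tau_y) - p00          # x above threshold, y below
        p11 = 1 - norm.cdf(tau_x) - norm.cdf(tau_y) + p00
        probs = np.clip([p00, p01, p10, p11], 1e-12, 1)
        return -(n00 * np.log(probs[0]) + n01 * np.log(probs[1])
                 + n10 * np.log(probs[2]) + n11 * np.log(probs[3]))

    return minimize_scalar(neg_loglik, bounds=(-0.99, 0.99), method="bounded").x
```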

Isabella Morlini
Optimal Scaling Trees for Three-Way Data

The framework of this paper is that of tree-based models for three-way data. Three-way data are measurements of variables on a sample of objects on different occasions (i.e. space, time, factor categories), and they arise when prior information plays a role in the analysis.

Three-way data can be analyzed by exploratory methods, i.e., the factorial approach (TUCKER, PARAFAC, CANDECOMP, etc.), as well as by confirmatory methods, i.e., the modelling approach (log-trilinear association models, simultaneous latent budget models, etc.).

Recently, we have introduced a methodology for classification and regression trees that deals specifically with three-way data. The main idea is to use a stratifying or instrumental variable to distinguish either groups of variables or groups of objects. As a result, prior information plays a role in the analysis, providing a new framework of classification and regression trees for three-way data.

In this paper we introduce a tree-based method based on optimal scaling in order to account for the presence of non-linearly correlated groups of variables. The results of a real-world application on Tourist Satisfaction Analysis in Naples will also be presented.

Valerio A. Tutore

Data mining

A Study on Text Modelling via Dirichlet Compound Multinomial

This contribution deals with a generative approach for the analysis of textual data. Instead of creating heuristic rules for the representation of documents and word counts, we employ a distribution able to model words along a text considering different topics. In this regard, following Minka's proposal [5], we implement a Dirichlet compound Multinomial distribution, which is a mixture of random variables over words and topics. On the basis of this model we evaluate the predictive performance of the distribution by using seven different classifiers and taking into account the count of words in common between a text document and a reference class.

Concetto Elvio Bonafede, Paola Cerchiello
Automatic Multilevel Thresholding Based on a Fuzzy Entropy Measure

Histogram thresholding is an image processing technique whose aim is to separate the objects and the background of the image into non-overlapping regions. In gray scale images this task is accomplished by properly detecting, on the corresponding gray-level histogram, the valleys that space out the concentrations of pixels around the characteristic gray levels of the different image structures. In this paper, a novel procedure is discussed that exploits fuzzy set theory and fuzzy entropy to find automatically the optimal number of thresholds and their locations in the image histogram.
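
The sketch below illustrates the fuzzy-entropy idea for a single threshold, in the spirit of the classical Huang–Wang criterion; it is an assumption-laden illustration of the entropy measure itself, not the chapter's automatic multilevel procedure.

```python
import numpy as np

def fuzzy_entropy_threshold(hist):
    """hist: counts of gray levels 0..L-1; returns the threshold minimizing fuzzy entropy."""
    hist = np.asarray(hist, float)
    levels = np.arange(len(hist))
    C = levels[-1] - levels[0]                      # normalization constant for memberships
    best_t, best_h = None, np.inf
    for t in range(1, len(hist) - 1):
        w0, w1 = hist[:t + 1].sum(), hist[t + 1:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (levels[:t + 1] * hist[:t + 1]).sum() / w0   # background mean gray level
        mu1 = (levels[t + 1:] * hist[t + 1:]).sum() / w1   # object mean gray level
        # Membership of each gray level to its own region (equal to 1 at the region mean).
        mu = np.where(levels <= t,
                      1.0 / (1.0 + np.abs(levels - mu0) / C),
                      1.0 / (1.0 + np.abs(levels - mu1) / C))
        mu = np.clip(mu, 1e-12, 1 - 1e-12)
        s = -(mu * np.log(mu) + (1 - mu) * np.log(1 - mu))  # Shannon-type fuzzy entropy
        h = (hist * s).sum() / hist.sum()
        if h < best_h:
            best_t, best_h = t, h
    return best_t
```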

D. Bruzzese, U. Giani
Some Developments in Forward Search Clustering

The Forward Search (FS) represents a useful tool for clustering data that include outlying observations, because it provides a robust clustering method in conjunction with graphical tools for outlier identification. In this paper, we show that recasting FS clustering in the framework of normal mixture models can introduce some improvements: the problem of choosing a metric for clustering is avoided; membership degree is assessed by posterior probability; a testing procedure for outlier detection can be devised.

Daniela G. Calò
Spectral Graph Theory Tools for Social Network Comparison

The problem faced in this paper is the comparison between two undirected networks on n actors. The actors are in two different configurations G_k (k = 1, 2). The comparison is based on evaluating how the relational node distances evolve in the passage from the first network (G_1) to the second one (G_2). The procedure consists of two steps: (i) define an appropriate relational distance among the nodes of the two networks; (ii) compare the corresponding distance matrices. The first step is based on the so-called Euclidean Commute-Time Distance among the n nodes, computed from a random walk on the graph and the Laplacian matrix. The second step concerns the comparison between the obtained distance matrices using Multidimensional Scaling techniques. The procedure has a wide range of applications, especially for experimental purposes in social network studies, where this issue has not been treated systematically.
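
A minimal sketch of step (i), assuming the standard formula for the Euclidean Commute-Time Distance in terms of the Moore–Penrose pseudo-inverse of the graph Laplacian; step (ii) would then compare the two resulting distance matrices, for instance via multidimensional scaling.

```python
import numpy as np

def commute_time_distance(A):
    """A: symmetric adjacency (or weight) matrix. Returns the ECTD matrix."""
    d = A.sum(axis=1)
    L = np.diag(d) - A                     # graph Laplacian
    L_pinv = np.linalg.pinv(L)             # Moore-Penrose pseudo-inverse
    vol = d.sum()                          # volume of the graph
    diag = np.diag(L_pinv)
    # Average commute time n(i,j) = vol * (L+_ii + L+_jj - 2 L+_ij); ECTD is its square root.
    n_ct = vol * (diag[:, None] + diag[None, :] - 2 * L_pinv)
    return np.sqrt(np.clip(n_ct, 0, None))

# Hypothetical comparison of two networks A1, A2 on the same actors:
# D1, D2 = commute_time_distance(A1), commute_time_distance(A2)
# print(np.linalg.norm(D1 - D2))           # or run classical MDS on each matrix
```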

Domenico De Stefano
Improving the MHIST-p Algorithm for Multivariate Histograms of Continuous Data

In many different applications (ranging from OLAP databases to query optimization), having an approximate distribution of the values in a data set is an important improvement that allows a relevant saving of time or resources during computations. Histograms are a good solution, offering a good balance between computational cost and accuracy. Multidimensional data require more sophisticated handling in order to keep both requirements at a useful level. In this paper we propose an improvement of the MHIST-p algorithm for the generation of multidimensional histograms and compare it with other approaches from the literature.

Mauro Iacono, Antonio Irpino
On Building and Visualizing Proximity Graphs for Large Data Sets with Artificial Ants

We present in this paper a new incremental and bio-inspired algorithm that builds proximity graphs for large amounts of data (e.g. 1 million items). It is inspired by the self-assembly behavior of real ants, where each ant progressively becomes attached to an existing support and then successively to other attached ants. The ants that we have defined similarly build a complex hierarchical graph structure. Each artificial ant represents one data item. The way ants move and connect depends on the similarity between data. Our hierarchical extension, for huge amounts of data, gives encouraging running times compared to other incremental building methods and is particularly well adapted to the visualization of groups of data (i.e. clusters) thanks to the super-node structure. In addition, the visualization using a force-directed algorithm respects the real distances between data.

Julien Lavergne, Hanane Azzag, Christiane Guinot, Gilles Venturini
Including Empirical Prior Information in Test Administration

In this work, the issue of using prior information in test administration is considered. The focus is on the development of procedures to include background variables that are strongly related to the latent ability, adopting a Bayesian approach. Because the desirability of prior information for ability estimation in item response modelling depends on the goals of the test, only some kinds of educational tests might profit from this approach. The procedures are evaluated in an empirical context and some recommendations about the use of prior information are given.

Mariagiulia Matteucci, Bernard P. Veldkamp

Robustness and classification

Italian Firms’ Geographical Location in High-tech Industries: A Robust Analysis

Recent debates in economic-statistical research concern the relationship between firms' performance and their capability to develop new technologies and products. Several studies argue that economic performance and geographical proximity strongly affect firms' level of technology. The aim of the paper is twofold. Firstly, we propose to generalize this approach and to develop a model to identify the relationship between a firm's technology level and some of its characteristics. Secondly, we use an outlier detection method to identify units that affect the results of the analysis and the stability of the estimates. This analysis is implemented using a generalized regression model with a diagnostic robust approach based on the forward search. The method we use reveals how the fitted regression model depends on individual observations, and the results show how firms' technology level is influenced by their geographical proximity.

Matilde Bini, Margherita Velucchi
Robust Tests for Pareto Density Estimation

A common practice to determine the extension and heaviness of heavy tails of income, return and size distributions is the sequential estimation and fitting of one or several models, starting from a group of the largest observations and adding one observation at a time [14]. In the early stages this kind of procedure shows high sensitivity of the shape parameter estimates to single observations, the end of the search being fixed when the shape parameter value estimates reach a plateau. In this paper we propose a stepwise fitting of a heavy-tailed model, the Pareto II distribution [1], previously applied to the size distribution of business firms. The procedure, based on the forward search technique [2], is data-driven since observations to be added at each iteration are determined according to the results of the estimation carried out at the preceding step and not, as in sequential fitting, according to their rank.

Aldo Corbellini, Lisa Crosato
Bootstrap and Nonparametric Predictors to Impute Missing Data

A new nonparametric technique to impute missing data is proposed in order to obtain a completed data matrix, together with a degree of reliability for the imputations. Without relying on strong assumptions, we introduce multiple imputations using bootstrap and nonparametric predictors. It is shown that, in this manner, we can obtain better imputations than with other known methods, producing a more reliable completed data matrix. Using two simulations, we show that the proposed technique can be generalized to non-monotone patterns of missing data with interesting results.
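
As a hedged illustration of the general idea (not the authors' procedure), the sketch below combines bootstrap resampling with a nonparametric predictor, here a k-nearest-neighbours regressor chosen by us, so that the spread of the multiple imputations indicates their reliability.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def bootstrap_impute(X_obs, y_obs, x_missing, n_boot=200, k=5, seed=0):
    """Return a point imputation and its variability for a record whose y is missing."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_obs), len(y_obs))        # bootstrap resample
        model = KNeighborsRegressor(n_neighbors=k).fit(X_obs[idx], y_obs[idx])
        draws.append(model.predict(x_missing.reshape(1, -1))[0])
    draws = np.array(draws)
    return draws.mean(), draws.std()                          # imputation and its spread
```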

Agostino Di Ciaccio
On the Use of Boosting Procedures to Predict the Risk of Default

Statistical models have been widely applied with the aim of evaluating the risk of default of enterprises. However, a typical problem is that the occurrence of the default event is rare, and this class imbalance strongly affects the performance of traditional classifiers. Boosting is a general class of methods which iteratively improves the accuracy of any weak learner, but it suffers from some drawbacks in the presence of unbalanced classes. The performance of standard boosting procedures in dealing with unbalanced classes is discussed and a new algorithm is proposed.
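
For reference, the sketch below implements the kind of standard boosting baseline whose behaviour under class imbalance the chapter discusses, plain AdaBoost with decision stumps; it is not the new algorithm proposed there.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=50):
    """y must be coded in {-1, +1}. Returns the list of (weight, stump) pairs."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # observation weights, re-weighted each round
    ensemble = []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()
        if err >= 0.5 or err == 0:
            break
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred)         # up-weight the misclassified observations
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    score = sum(a * s.predict(X) for a, s in ensemble)
    return np.sign(score)
```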

Giovanna Menardi, Federico Tedeschi, Nicola Torelli

Categorical data and latent class approach

Assessing Similarity of Rating Distributions by Kullback-Leibler Divergence

A mixture model for ordinal data (denoted CUB) has recently been proposed in the literature. Specifically, ordinal data are represented by means of a discrete random variable which is a mixture of a Uniform and a shifted Binomial random variable. This article proposes a testing procedure based on the Kullback-Leibler divergence in order to compare CUB models and detect similarities in the structure of the judgements that raters express on a set of items.
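
A small sketch of the two ingredients named above: the probability mass function of a CUB model (a mixture of a shifted Binomial and a discrete Uniform over m categories) and the Kullback–Leibler divergence between two rating distributions; the parameter values below are purely illustrative.

```python
import numpy as np
from scipy.stats import binom

def cub_pmf(m, pi, xi):
    """P(R = r), r = 1..m, for a CUB(pi, xi) model."""
    r = np.arange(1, m + 1)
    shifted_binomial = binom.pmf(r - 1, m - 1, 1 - xi)   # shifted Binomial component
    return pi * shifted_binomial + (1 - pi) / m           # mixture with the Uniform

def kl_divergence(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical comparison of the rating distributions implied by two fitted CUB models.
p = cub_pmf(7, pi=0.8, xi=0.3)
q = cub_pmf(7, pi=0.6, xi=0.4)
print(kl_divergence(p, q), kl_divergence(q, p))
```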

Marcella Corduas
Sector Classification in Stock Markets: A Latent Class Approach

Stock indices related to specific economic sectors play a major role in portfolio diversification. Notwithstanding its importance, the traditional sector classification shows several flaws and may not be able to properly discriminate the risk-return profile of financial assets. We propose a latent class approach in order to correctly classify stock companies into groups that are homogeneous in their risk-return profile and to obtain sector indices which are consistent with standard portfolio theory. Our results allow us to introduce a methodological dimension into stock classification and to improve the reliability of sector portfolio diversification.

Michele Costa, Luca De Angelis
Partitioning the Geometric Variability in Multivariate Analysis and Contingency Tables

Most methods of multivariate analysis obtain and interpret an appropriate decomposition of the variability. In canonical variate analysis, multidimensional scaling and correspondence analysis, the variability of the data is measured in terms of distances. Then the geometric variability (inertia) plays an important role. We present a unified approach for describing four methods for representing categorical data in a contingency table. We define the generalized Pearson contingency coefficient and show situations where this measure can be different from the geometric variability.
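
For orientation, the sketch below computes the classical quantities the chapter builds on: the total inertia of a contingency table (geometric variability under the chi-square distance) and Pearson's contingency coefficient. The generalized coefficient defined in the chapter is not reproduced here, and the table is illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

N = np.array([[20, 30, 10],
              [10, 40, 40]])                 # illustrative contingency table
chi2, _, _, _ = chi2_contingency(N, correction=False)
n = N.sum()
inertia = chi2 / n                            # total inertia used in correspondence analysis
pearson_C = np.sqrt(chi2 / (chi2 + n))        # classical Pearson contingency coefficient
print(inertia, pearson_C)
```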

Carles M. Cuadras, Daniel Cuadras
One-Dimensional Preference Data Imputation Through Transition Rules

Preferences may be elicited with methods based either on pairwise comparison between items or on ordering/sorting one or more items out of a given set. In both cases, the multivariate analysis of preferences requires that preferability is expressed for all pairs of items, so that an irreducible dominance matrix can be defined and mathematically processed. In this paper we present, apply and evaluate a new transition rule for the estimation of the empty cells of a dominance matrix. The method was applied to preference data on students' guidance services. The new methodology proved to be more reliable than other methods in the literature.

Luigi Fabbris
About a Type of Quasi Linear Estimating Equation Approach

In this work, a type of quasi-linear system is presented which is able to identify the “true” value of the parameter profile in the setup of “generalized linear mixed models”. A type of quasi-linearization of the link function is used, which would preserve basic sampling properties of the conditioned moments of the random latent profile. An estimation approach is then outlined: it uses a weighted quasi-linear estimating system which is exactly unbiased. Owing to the quasi-linearization, it might be solved by using easy-to-implement recursive procedures.

Giulio D’Epifanio
Causal Inference Through Principal Stratification: A Special Type of Latent Class Modelling

Principal stratification is an increasingly adopted framework for drawing counterfactual causal inferences in complex situations. After outlining the framework, with special emphasis on the case of truncation by death, I describe an application of the methodology where the analysis is based on a parametric model with latent classes. Then, I discuss the special features of latent class models derived within the principal strata framework. I argue that the concept of principal stratification gives latent class models a solid theoretical basis and helps to solve some specification and fitting issues.

Leonardo Grilli
Scaling the Latent Variable Cultural Capital via Item Response Models and Latent Class Analysis

One of the main tasks of an educational system is to enrich the Cultural Capital of its students. The Cultural Capital linked to social origins is considered crucial in determining students' social life and subsequent professional achievement. This work starts from an ad hoc survey carried out on a sample of students who enrolled or applied for an entrance test at the university. The Cultural Capital is treated as a latent variable which students are supposed to possess to a greater or lesser degree. Latent Class Analysis is adopted in order to provide a non-arbitrary scaling of Cultural Capital and to sort out mutually exclusive classes of students. Moreover, Item Response Models are implemented to assess the calibration of the questionnaire as an instrument to measure the Cultural Capital of the surveyed population.

Isabella Sulis, Mariano Porcu, Marco Pitzalis
Assessment of Latent Class Detection in PLS Path Modeling: A Simulation Study to Evaluate the Group Quality Index Performance

Structural Equation Models assume homogeneity across the entire sample; in other words, all the units are supposed to be well represented by a single model. Not taking heterogeneity among units into account may lead to biased results in terms of model parameters. That is why, nowadays, more attention is focused on techniques able to detect unobserved heterogeneity in Structural Equation Models. However, once a unit partition has been obtained according to the chosen clustering method, it is important to assess whether taking local models into account provides better results than using a single model for the whole sample. Here, a new index to assess the detected unit partition is presented: the Group Quality Index. A simulation study involving two different simulation schemes (one simulating the so-called null hypothesis of homogeneity among units, and the other addressing the heterogeneous sample case) is presented.

Laura Trinchera

Latent Variables and related methods

Non-Linear Relationships in SEM with Latent Variables: Some Theoretical Remarks and a Case Study

The aim of this work is to take non-linear relationships into account in path analysis models with latent variables. Some theoretical remarks are made to introduce the context in which the presence of non-linearity is to be considered, with reference to both the inner and the outer model.

Diagnostic tools to test for the existence of a non-linear relationship are also presented, mainly with reference to the so-called Kano model. In particular, a procedure based upon the regression of the response variable on properly defined dummy variables is considered.

An application to data coming from a survey on the customers of a financial organization is finally presented.

Giuseppe Boari, Gabriele Cantaluppi, Stefano Bertelli
Multidimensional Scaling Versus Multiple Correspondence Analysis When Analyzing Categorization Data

Categorization is a cognitive process in which subjects are asked to group a set of objects according to their similarities. This task was first used in psychology and is now becoming more and more popular in sensory analysis. Categorization data are usually analyzed by multidimensional scaling (MDS). In this article we propose an original approach based on multiple correspondence analysis (MCA); this new methodology, which provides new insights into the data, is compared with one specific procedure of MDS.

Marine Cadoret, Sébastien Lê, Jérôme Pagès
Multidimensional Scaling as Visualization Tool of Web Sequence Rules

Web Mining can be defined as the application of Data Mining processes to Web data. In the field of Web Mining, we distinguish among Web Content Mining, Web Structure Mining and Web Usage Mining. Web Content Mining is the Web Mining process which analyzes various aspects related to the contents of a web site, such as text, banners, graphics, etc. Web Structure Mining is the branch of Web Mining that analyzes the structure of the Net (or a sub-part of it) in terms of the connections among web pages and their linkage design. Finally, the goal of Web Usage Mining is to understand the usage behaviors of web site users. Within the context of Web Usage Mining, pattern discovery and pattern analysis make it possible to profile users and their preferences. Sequence rules are association rules ordered in time. Given a data set coming from a web site, characterized by a sequence of visits, the proposal is to understand the differences among browsing sessions through a Multidimensional Scaling solution, and then to obtain a graphical tool which allows the sequence rules to be visualized in a new way. The resulting application is halfway between Web Usage Mining and Web Structure Mining.

Antonio D’Ambrosio, Marcello Pecoraro
Partial Compliance, Effect of Treatment on the Treated and Instrumental Variables

Under the assumption that treatment assignment has no direct effect on the response, a non parametric probabilistic model of the distribution involving the latent confounder under partial compliance leads to a generalized definition of the effect of treatment on the treated and reveals that the instrumental variable estimand equals a suitable average of such causal effects only when certain restrictions hold. An application to a popular data set concerning reduction of cholesterol level is used as an illustration.

Antonio Forcina
Method of Quantification for Qualitative Variables and their Use in the Structural Equations Models

This article addresses the problem of the treatment of qualitative variables in Structural Equation Models, with attention to the case of Partial Least Squares Path Modeling. In the literature there are some proposals based on the application of known statistical techniques to quantify the qualitative variables. Starting from these works, we propose an external quantification of the qualitative variables only, via Alternating Least Squares, which yields the optimal quantification (vectors of optimal scaling); a future objective is to develop an algorithm that computes simultaneously the vectors of optimal scaling and the optimal regression coefficients between the variables. We present an application of our method to a real dataset.

C. Lauro, D. Nappo, M.G. Grassia, R. Miele
Monitoring Panel Performance Within and Between Sensory Experiments by Multi-Way Analysis

In sensory analysis a panel of trained assessors evaluates a set of samples according to specific sensory descriptors. The training improves objectivity and reliability of assessments. However, there can be individual differences between assessors left after the training that should be taken into account in the analysis. Monitoring panel performance is then crucial for optimal sensory evaluations. The present work proposes to analyze the panel performance within single sensory evaluations and between consecutive evaluations. The basic idea is to use multi-way models to handle the three-way nature of the sensory data.

Rosaria Romano, Jannie S. Vestergaard, Mohsen Kompany-Zareh, Wender L.P. Bredie
A Proposal for Handling Categorical Predictors in PLS Regression Framework

To regress one or more quantitative response variables on a set of predictor variables of different nature, it is necessary to transform the non-quantitative predictors in such a way that they can be analyzed together with the other variables measured on an interval scale. Here, a new proposal to cope with this issue in the Partial Least Squares (PLS) regression framework is presented. The approach consists in quantifying each non-quantitative predictor according to Hayashi's first quantification method, using the dependent variable (or, in the multivariate case, a linear combination of the response variables) as an external criterion. The PLS weight of each variable quantified according to the proposed approach is coherent with the statistical relationship between its original non-quantitative variable and the response variable(s), as expressed in terms of Pearson's correlation ratio. Firstly, the case where one variable depends on a set of both categorical and quantitative variables is discussed; then, a modified PLS algorithm, called PLS-CAP, is proposed to obtain the quantifications of the categorical predictors in the multi-response case. An application to real data is presented in order to highlight the properties of the quantification approach based on PLS-CAP with respect to the classical approach based on dummy coding of the categorical variables.
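
A hedged sketch of the quantification step described above, for the single-response case: each category of a qualitative predictor is scored by the mean of the response within that category (Hayashi's first quantification with the response as external criterion), so that the squared correlation between the quantified predictor and the response equals Pearson's correlation ratio. Function names are ours; the full PLS-CAP algorithm is not reproduced.

```python
import numpy as np

def quantify(categories, y):
    """Replace each category label by the within-category mean of y."""
    categories = np.asarray(categories)
    y = np.asarray(y, float)
    scores = {c: y[categories == c].mean() for c in np.unique(categories)}
    return np.array([scores[c] for c in categories])

def correlation_ratio(categories, y):
    """Pearson's eta^2: between-group sum of squares over total sum of squares."""
    q = quantify(categories, y)
    return ((q - y.mean()) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# The quantified column can then enter a standard PLS regression alongside the
# numerical predictors (illustrative usage, not the chapter's PLS-CAP algorithm).
```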

Giorgio Russolillo, Carlo Natale Lauro

Symbolic, multivalued and conceptual data analysis

On the Use of Archetypes and Interval Coding in Sensory Analysis

Archetypal analysis is a statistical method aimed at synthesizing a set of multivariate observations through a few points that are not necessarily observed. On the other hand, coding data as interval values allows variability and variation to be included in the data itself. This work proposes the use of archetypal analysis for interval-coded sensory data, in order to synthesize profiling data while taking assessor panel variability into account.

Maria Rosaria D’Esposito, Francesco Palumbo, Giancarlo Ragozini
From Histogram Data to Model Data Analysis

The aim of this work is to propose a new approach for dealing with histogram data in the symbolic data analysis framework. The idea is to approximate histogram data using B-spline functions in order to synthesize the information within the data through some characteristic function parameters. These parameters will be the new data, which could subsequently be analyzed with methodologies of multidimensional data analysis.

Marina Marino, Simona Signoriello
Use of Genetic Algorithms When Computing Variance of Interval Data

In many areas of science and engineering it is of great interest to compute different statistics under interval uncertainty. Unfortunately, this task often turns out to be very complex. For example, finding the bounds of the interval that includes all possible values produced by the calculation of quantities like the variance or covariance of an interval-valued dataset is an NP-hard task. In this paper a genetic algorithm is proposed to tackle this problem. An application of the algorithm is presented and compared with the result of an exhaustive search on the same data, performed on a grid computing infrastructure.
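
An illustrative genetic algorithm (not the authors' implementation; all parameter choices are ours) for the upper bound of the sample variance of interval data: since the variance is a convex function, its maximum over the box of feasible configurations is attained at a vertex, so candidate solutions are coded as bit strings selecting, for each observation, its lower or upper endpoint.

```python
import numpy as np

rng = np.random.default_rng(1)

def variance_upper_bound(lo, hi, pop_size=60, generations=200, p_mut=0.02):
    """Approximate max of the variance when each observation lies in [lo_i, hi_i]."""
    n = len(lo)
    pop = rng.integers(0, 2, (pop_size, n))              # 0 -> lower endpoint, 1 -> upper

    def fitness(bits):
        return np.where(bits == 1, hi, lo).var()

    for _ in range(generations):
        fit = np.array([fitness(ind) for ind in pop])
        parents = pop[np.argsort(-fit)[:pop_size // 2]]   # truncation selection
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(0, len(parents), 2)]
            cut = rng.integers(1, n)                       # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n) < p_mut                   # mutation
            child[flip] = 1 - child[flip]
            children.append(child)
        pop = np.vstack([parents, children])
    return max(fitness(ind) for ind in pop)

# Illustrative interval data: [lo_i, hi_i] for each observation.
lo = rng.uniform(0, 5, 12)
hi = lo + rng.uniform(0, 3, 12)
print(variance_upper_bound(lo, hi))
```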

Jaromír Antoch, Raffaele Miele
Spatial Visualization of Conceptual Data

Numerous data mining methods have been designed to help extract relevant and significant information from large datasets. Computing concept lattices allows clustering data according to their common features and making all relationships between them explicit. However, the size of such lattices increases exponentially with the volume of data and its number of dimensions. This paper proposes to use spatial (pixel-oriented) and tree-based visualizations of these conceptual structures in order to optimally exploit their expressivity.

Michel Soto, Bénédicte Le Grand, Marie-Aude Aufaure

Spatial, temporal, streaming and functional data analysis

A Test of LBO Firms’ Acquisition Rationale: The French Case

We investigate whether the characteristics of Leveraged Buy-Out (LBO) targets before the deal differ from those of targets that have undergone another type of transfer of shares. Specifically, we examine the size, value, industry, quotation and profitability of French targets involved in transfers of shares between 1996 and 2004. Using two different methods (a classical logit regression and a mixed discriminant analysis), results show that LBO targets are more profitable, that they are more frequently unquoted, and that they more often belong to manufacturing industries in comparison with the targets involved in other types of transfers of shares.

R. Abdesselam, S. Cieply, A.L. Le Nadant
Kernel Intensity for Space-Time Point Processes with Application to Seismological Problems

When dealing with data coming from a space-time inhomogeneous process, there is often the need for semi-parametric estimates of the conditional intensity function; isotropic or anisotropic multivariate kernel estimates can be used, with window sizes h. The properties of the intensities estimated with this choice of h are not always good for specific fields of application; we could try to choose h in order to obtain good predictive properties of the estimated intensity function. Since a direct ML approach cannot be followed, we propose a computationally intensive estimation procedure based on the subsequent increments of likelihood obtained by adding one observation at a time. The first results obtained are very encouraging. Some applications in statistical seismology are presented.
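
A minimal sketch of a separable space-time kernel intensity estimator of the kind discussed above, with Gaussian kernels and illustrative bandwidths h_s (space) and h_t (time); the chapter's contribution, the predictive likelihood-increment-based choice of the bandwidths, is not reproduced here.

```python
import numpy as np

def kernel_intensity(events, x, y, t, h_s=1.0, h_t=1.0):
    """events: array of shape (n, 3) with columns (x_i, y_i, t_i).
    Returns the estimated intensity at the space-time point (x, y, t)."""
    dx = x - events[:, 0]
    dy = y - events[:, 1]
    dt = t - events[:, 2]
    k_space = np.exp(-(dx**2 + dy**2) / (2 * h_s**2)) / (2 * np.pi * h_s**2)
    k_time = np.exp(-dt**2 / (2 * h_t**2)) / (np.sqrt(2 * np.pi) * h_t)
    return float(np.sum(k_space * k_time))
```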

Giada Adelfio, Marcello Chiodi
Summarizing and Mining Streaming Data via a Functional Data Approach

In recent years, the analysis of data streams has become a challenging task, since many applicative fields generate massive amounts of data that are difficult to store and to analyze with traditional techniques. In this paper we propose a strategy to summarize pseudo-periodic streaming data affected by noise and sampling problems by means of functional profiles. It is a clustering strategy performed in a divide-and-conquer manner. In the on-line step, a set of summarization structures collects statistical information on the data. Starting from these, in the off-line step, the final clustering structure and the set of functional profiles are computed.

Antonio Balzanella, Elvira Romano, Rosanna Verde
Clustering Complex Time Series Databases

Time series data account for a large fraction of the data stored in financial, medical and scientific databases. As a consequence, in the last decade there has been an explosion of interest in mining time series data, and several new algorithms to index, classify, cluster and segment time series have been introduced. In this paper we focus on the clustering of time series from a large database provided by a major Italian electric company, in which the power consumption of a specific class of power users, namely business and industrial customers, is measured. The aim of this paper is to propose an effective clustering technique in the frequency domain, where the need for computational and memory resources is much reduced, in order to make the algorithm efficient for large and complex temporal databases.
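
The sketch below illustrates the general frequency-domain idea in the simplest possible way, summarizing each series by its normalized periodogram and clustering those spectral features with k-means; it is an assumption-laden stand-in, not the chapter's algorithm, and the synthetic curves are only illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def periodogram_features(series, n_freq=20):
    """series: array (n_series, length). Returns (n_series, n_freq) spectral features."""
    spec = np.abs(np.fft.rfft(series, axis=1)) ** 2
    spec = spec[:, 1:n_freq + 1]                        # drop the zero frequency
    return spec / spec.sum(axis=1, keepdims=True)       # normalize total power

# Illustrative usage on synthetic consumption-like curves.
rng = np.random.default_rng(0)
t = np.arange(96)                                       # e.g., 96 quarter-hour readings per day
slow = np.sin(2 * np.pi * t / 96) + rng.normal(0, .3, (30, 96))
fast = np.sin(2 * np.pi * t / 12) + rng.normal(0, .3, (30, 96))
X = periodogram_features(np.vstack([slow, fast]))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)
```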

Francesco Giordano, Michele La Rocca, Maria Lucia Parrella
Use of a Flexible Weight Matrix in a Local Spatial Statistic

Most local indices of spatial autocorrelation use a classical adjacency matrix as the interconnection system. In this paper we attempt to use a generalized matrix of spatial weights for measuring local autocorrelation. The work concludes with a comparison of local autocorrelation indices under different hypotheses of neighborhood.

Massimo Mucciardi
Constrained Variable Clustering and the Best Basis Problem in Functional Data Analysis

Functional data analysis involves data described by regular functions rather than by a finite number of real-valued variables. While some robust data analysis methods can be applied directly to the very high dimensional vectors obtained from a fine grid sampling of functional data, all methods benefit from a prior simplification of the functions that reduces the redundancy induced by the regularity. In this paper we propose to use a clustering approach that targets variables rather than individuals to design a piecewise constant representation of a set of functions. The contiguity constraint induced by the functional nature of the variables allows a polynomial complexity algorithm to give the optimal solution.
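
A hedged sketch of the kind of polynomial-time algorithm alluded to above: a dynamic program that finds the optimal partition of a regularly sampled function into K contiguous segments (a piecewise constant representation) by minimizing the total within-segment sum of squares. The names and interface are ours, not the chapter's.

```python
import numpy as np

def optimal_segmentation(f, K):
    """f: sampled function values; K: number of contiguous segments.
    Returns the sorted list of segment start indices."""
    f = np.asarray(f, float)
    n = len(f)
    cs = np.cumsum(np.insert(f, 0, 0.0))            # cumulative sums for O(1) segment costs
    cs2 = np.cumsum(np.insert(f ** 2, 0, 0.0))

    def cost(i, j):
        """Within-segment sum of squares of f[i..j] around its own mean."""
        s, s2, m = cs[j + 1] - cs[i], cs2[j + 1] - cs2[i], j - i + 1
        return s2 - s * s / m

    dp = np.full((K + 1, n), np.inf)                 # dp[k][j]: best cost of f[0..j] in k segments
    back = np.zeros((K + 1, n), dtype=int)
    dp[1] = [cost(0, j) for j in range(n)]
    for k in range(2, K + 1):
        for j in range(k - 1, n):
            cands = [dp[k - 1][i - 1] + cost(i, j) for i in range(k - 1, j + 1)]
            best = int(np.argmin(cands))
            dp[k][j] = cands[best]
            back[k][j] = best + (k - 1)              # start index of the last segment
    starts, j = [0], n - 1
    for k in range(K, 1, -1):                        # backtrack the segment starts
        i = back[k][j]
        starts.append(i)
        j = i - 1
    return sorted(starts)
```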

Fabrice Rossi, Yves Lechevallier

Bio and health science

Plaid Model for Microarray Data: an Enhancement of the Pruning Step

Microarrays have become a standard tool for studying gene functions. For example, we can investigate whether a subset of genes shows a coherent expression pattern under different conditions. The plaid model, a model-based biclustering method, can be used to incorporate the additive structure used for the microarray experiment. In this paper we describe an enhancement of the plaid model algorithm based on the theory of the false discovery rate.

Luigi Augugliaro, Angelo M. Mineo
Classification of the Human Papilloma Viruses

In this study we present a whole-genome phylogenetic classification of the human papilloma virus (HPV) family. We found that the taxa with a high risk of carcinogenicity are clustered together. The most likely insertion and deletion (indel) scenarios of HPV nucleotides were computed. We also searched for relationships between the number of indels that occurred during the evolution of the HPV family and the degree of carcinogenicity of the considered taxa. Linear and polynomial redundancy analyses (RDA) were carried out to relate HPV carcinogenicity to the number of insertions, deletions and conservations.

Abdoulaye Baniré Diallo, Dunarel Badescu, Mathieu Blanchette, Vladimir Makarenkov
Toward the Discovery of Itemsets with Significant Variations in Gene Expression Matrices

Gene expression matrices are numerical tables that describe the level of expression of genes in different situations, characterizing their behaviour. Biologists are interested in identifying groups of genes presenting similar quantitative variations of expression. This paper presents new syntactic constraints for itemset mining, in particular in Boolean gene expression matrices. A two-dimensional gene expression profile representation is introduced and adapted to itemset mining, allowing one to control gene expression. Syntactic constraints are used to discover itemsets with significant expression variations from a large collection of gene expression profiles.

Mehdi Kaytoue, Sébastien Duplessis, Amedeo Napoli
Metadata
Title
Classification and Multivariate Analysis for Complex Data Structures
Editors
Bernard Fichet
Domenico Piccolo
Rosanna Verde
Maurizio Vichi
Copyright Year
2011
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-13312-1
Print ISBN
978-3-642-13311-4
DOI
https://doi.org/10.1007/978-3-642-13312-1
