
About this book

International Federation of Classification Societies

The International Federation of Classification Societies (IFCS) is an agency for the dissemination of technical and scientific information concerning classification and data analysis in the broad sense and in as wide a range of applications as possible. It was founded in 1985 in Cambridge (UK) by the following scientific societies and groups: British Classification Society - BCS; Classification Society of North America - CSNA; Gesellschaft für Klassifikation - GfKl; Japanese Classification Society - JCS; Classification Group of the Italian Statistical Society - COSIS; Société Francophone de Classification - SFC. The IFCS now also includes the following societies: Dutch-Belgian Classification Society - VOC; Polish Classification Section - SKAD; Portuguese Classification Association - CLAD; Group-at-Large; Korean Classification Society - KCS.

Biannual Meeting of the Classification and Data Analysis Group of SIS

The biannual meeting of the Classification and Data Analysis Group of the Società Italiana di Statistica (SIS) was held in Pescara, July 3-4, 1997. The 69 papers presented were divided into 17 sessions. Each session was organized by a chairperson, with two invited speakers and two contributed papers selected from a call for papers. All the papers were refereed, and a discussant was provided for each session during the meeting. A short version of the papers (4 pages) was published before the conference.

Table of contents

Frontmatter

Classification

Frontmatter

Methodologies in Classification

Measuring the Influence of Individual Observations and Variables in Cluster Analysis

In this paper we address some issues in the field of cluster stability. In particular, we study the effect of deleting individual cases and variables on the results of a (nonhierarchical) cluster analysis. We do not restrict ourselves to computing a single influence measure for each data point or variable, but analyze how individual influence varies as the number of clusters changes. For this purpose we suggest the use of simple deletion diagnostics computed by cross-validation. The suggested approach is applied to real data and the results are displayed by means of a simple tool of modern multivariate data visualization. Furthermore, the performance of our diagnostics is assessed through Monte Carlo simulations, both under the null hypothesis of well-behaved data and under the alternative hypothesis of isolated contamination.

Andrea Cerioli
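
A minimal sketch of the case-deletion idea described above, assuming a k-means clustering and the adjusted Rand index as the agreement measure (choices not prescribed by the paper); looping over several values of n_clusters shows how individual influence varies with the number of clusters:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    def deletion_influence(X, n_clusters, random_state=0):
        """Influence of each case: drop it, recluster, and compare the
        partition of the remaining cases with the original one."""
        base = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=random_state).fit_predict(X)
        influence = np.empty(len(X))
        for i in range(len(X)):
            mask = np.arange(len(X)) != i
            labels = KMeans(n_clusters=n_clusters, n_init=10,
                            random_state=random_state).fit_predict(X[mask])
            # low agreement between the two partitions = high influence of case i
            influence[i] = 1.0 - adjusted_rand_score(base[mask], labels)
        return influence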

Consensus Classification for A Set of Multiple Time Series

In multiple time series analysis, when there is a very large number of series, a classification into homogeneous clusters may be useful to reduce the problem’s complexity and eliminate possible redundancies (Zani, 1983). Furthermore, when we have different classifications, one for each statistical unit (e.g. spatial units), a consensus classification allows one to obtain a classification which summarizes the given ones. The present paper focuses on the problem of identifying consensus classifications in a set of multiple time series (panel data), using a consensus method (Vichi, 1993, 1994). First, a distance among time series is defined and a hierarchical classification of the time series, for each temporal lag and for each unit, is performed. Then, a consensus classification among different units for the same temporal lag is carried out. Finally, a hierarchical classification among the different consensus classifications, with the same temporal lag, is carried out.

Pierpaolo D’Urso, Maria Grazia Pittau
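
A sketch of the first step only (a hierarchical classification of the series), using the plain Euclidean distance between series as a stand-in for the lag-dependent distance defined in the paper:

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage, fcluster

    def cluster_series(series, n_clusters):
        """series: array of shape (n_series, T), one row per time series."""
        d = pdist(series, metric='euclidean')  # placeholder distance between series
        tree = linkage(d, method='average')    # hierarchical classification
        return fcluster(tree, t=n_clusters, criterion='maxclust')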

A Bootstrap Method for Adaptive Cluster Sampling

In adaptive designs defined by selecting an initial simple random sample with or without replacement, the sample mean estimator is unbiased only if the initial sample is used, whereas it is biased when the sample obtained at the end of the adaptive procedure is considered. In the latter situation the estimator has been suitably modified (Thompson and Seber 1996). However, for several estimators other than the mean, such as the variance, the construction of a corresponding unbiased estimator in adaptive designs remains an open problem. In this paper the BACS (Bootstrap for Adaptive Cluster Sampling) procedure, based on resampling, is proposed to estimate the bias of an estimator.

Tonio Di Battista, Domenico Di Spalatro
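
A generic illustration of bias estimation by resampling, in the spirit of the BACS idea; the actual procedure resamples in a way that respects the adaptive design, which this sketch does not attempt to reproduce:

    import numpy as np

    def bootstrap_bias(sample, estimator, n_boot=1000, seed=0):
        """Estimate the bias of `estimator` by resampling with replacement."""
        rng = np.random.default_rng(seed)
        theta_hat = estimator(sample)
        boot = [estimator(rng.choice(sample, size=len(sample), replace=True))
                for _ in range(n_boot)]
        return np.mean(boot) - theta_hat  # bootstrap bias estimate

    # e.g. bias of the plug-in variance estimator:
    # bootstrap_bias(x, lambda s: s.var(), n_boot=2000)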

Forecasting a Classification

This paper focuses on the problem of forecasting a classification given a panel data set formed by a (multiple) time series of partitions of the same set of units. As far as we know, this is a completely new problem in classification and time series analysis. A methodology based on a vector autoregressive model is proposed here to directly forecast a partition given a (multiple) time series of partitions with the same fixed number of classes at each time. Two real panel data sets have been analysed with this new procedure. Open problems are discussed in a final section.

Domenica Fioredistella Iezzi, Maurizio Vichi

Fuzzy Clustering and Fuzzy Methods

Neural Networks as a Fuzzy Semantic Network of Events

The paper deals with the interpretation of a neural network in terms of a semantic network able to describe a cluster of events. Some general remarks on neural networks and semantic networks are proposed, and the interpretation of semantic networks as clusters of events is considered as a new way of understanding cluster analysis in a database, provided we deal with fuzzy events.

Antonio Bellacicco

Hierarchical Fuzzy Clustering: An Example of Spatio-Temporal Analysis

This work describes the hierarchical classification procedure called ‘fuzzy average linkage’ which provides a fuzzy partition of a group of units. The basic principle is that the average similarity of units linked to the same group must be greater than or equal to a certain pre-set similarity level. This method is applied to mortality rates by cause of death for men and women in the 1970s, 1980s and 1990s.

Loredana Cerbara

A New Algorithm for Semi-Fuzzy Clustering

This paper presents a new algorithm for semi-fuzzy clustering that does not require objects to belong to all the clusters: they may belong to only one of them. The advantage of this new method is that fuzziness is not introduced for all objects but only for those that cannot be classified as belonging to a single cluster. The performance of the new algorithm, compared to the fuzzy c-means algorithm, is shown by an application to a data set.

Giampaolo Iacovacci
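
The following sketch illustrates the general semi-fuzzy idea rather than the paper’s actual algorithm: standard fuzzy c-means memberships are computed from the distances to the cluster centres and then hardened for objects whose largest membership exceeds a hypothetical threshold cut:

    import numpy as np

    def semi_fuzzy_memberships(dist, m=2.0, cut=0.7):
        """dist: (n, c) strictly positive distances from n objects to c centres.
        Fuzzy c-means memberships, hardened for unambiguous objects."""
        # u[i, k] = 1 / sum_j (d_ik / d_ij)^(2 / (m - 1))
        ratio = (dist[:, :, None] / dist[:, None, :]) ** (2.0 / (m - 1.0))
        u = 1.0 / ratio.sum(axis=2)
        top = u.argmax(axis=1)
        crisp = u.max(axis=1) >= cut      # unambiguous objects...
        u[crisp] = 0.0
        u[crisp, top[crisp]] = 1.0        # ...are assigned to a single cluster
        return u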

Fuzzy Classification and Hyperstructures: An Application to Evaluation of Urban Projects

In this paper we consider the possibility of applying the theories of fuzzy sets and algebraic hyperstructures to the feasibility evaluation of urban requalification projects. We study a mathematical model that helps either to verify the validity of certain choices at each design stage, or to single out the best project among a series of possible alternatives, by estimating the degree to which the projects attain economic, psychological, cultural and technological objectives.

Antonio Maturo, Barbara Ferri

Variable Selection In Fuzzy Clustering

The aim of the present paper is to discuss methods for selecting a subset of the initially observed variables in the context of fuzzy clustering. The suggested procedure is based on the optimization of an objective function which is specified differently according to the purpose of the selection. Measures of cluster validity, a generalization of the Rand index, and the distance between dissimilarity matrices are then proposed as suitable functions to optimize.

Maria Adele Milioli

Other Approaches for Classification

Frontmatter

Discrimination and Classification

Discriminant Analysis Using Markovian Automodels

Spatially distributed observations occur naturally in a number of empirical situations; their analysis represents a significant source of theoretical challenge due to the multidirectional dependence among neighbouring observations. The presence of such dependence often causes standard statistical methods, which are based on independence assumptions, to fail badly. This paper concerns the problem of discrimination and classification of spatial binary data. It presents a suitable discrimination function based on Markovian automodels and suggests a solution to the allocation problem through a Gibbs sampler-based procedure.

Marco Alfò, Paolo Postiglione
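
A minimal illustration of the Gibbs sampling machinery for a binary automodel, assuming a first-order autologistic field on a regular lattice with known parameters alpha and beta (the paper’s discrimination and allocation procedure is more elaborate):

    import numpy as np

    def gibbs_autologistic(shape, alpha, beta, sweeps=100, seed=0):
        """Gibbs sampler for a binary autologistic field with 4-neighbour
        interactions; each site is updated from its full conditional."""
        rng = np.random.default_rng(seed)
        x = rng.integers(0, 2, size=shape)
        for _ in range(sweeps):
            for i in range(shape[0]):
                for j in range(shape[1]):
                    s = sum(x[a, b]
                            for a, b in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                            if 0 <= a < shape[0] and 0 <= b < shape[1])
                    p = 1.0 / (1.0 + np.exp(-(alpha + beta * s)))  # P(x_ij = 1 | neighbours)
                    x[i, j] = rng.random() < p
        return x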

Discretization of Continuous-Valued Data in Symbolic Classification Learning

Symbolic data analysis aims at extending classical data analysis to data representing classes of individuals instead of single individuals. A major problem in symbolic data analysis is discrimination, that is the generation of data representing classes. Such data can be expressed as classification rules, which are learned from training examples. The paper addresses the problem of learning classification rules from examples described by both numeric and symbolic attributes, so that the discretization of the continuous-valued attributes is performed during the learning process. The proposed technique has been embedded into a classification learning system, named INDUBI/CSL, and tested on several data sets.

F. Esposito, D. Malerba, G. Semeraro, S. Caggese

Logistic Discrimination by Kullback-Leibler type Distance Measures

We consider the problem of parameter estimation in logistic discrimination. Our approach exploits the minimization of an error function based on distance measures between the posterior probability distributions of the classes. In this context we analyze statistical properties of the Kullback-Leibler directed divergence and of the Euclidean distance, from both theoretical and applied points of view.

Salvatore Ingrassia
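
For concreteness, the two kinds of error function compared in the paper can be sketched as follows, with p the target class distributions and q the modelled posteriors (minimizing either over the logistic parameters would be done with a numerical optimizer; this is not the paper’s implementation):

    import numpy as np

    def kl_error(p, q, eps=1e-12):
        """Directed Kullback-Leibler divergence between target distributions p
        and modelled posteriors q (rows sum to one), averaged over cases."""
        p, q = np.clip(p, eps, 1.0), np.clip(q, eps, 1.0)
        return np.sum(p * np.log(p / q), axis=1).mean()

    def euclidean_error(p, q):
        """Squared Euclidean distance between the same distributions."""
        return np.sum((p - q) ** 2, axis=1).mean()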

An Empirical Discrimination Algorithm based on Projection Pursuit Density Estimation

In this paper a nonparametric method for discriminant analysis is proposed, based on a group-separation-oriented version of projection pursuit density estimation. Each population is separated in turn from the remaining ones, considered as a whole, by approximating the boundary between them through the composition of some informative directions, chosen according to an appropriate discrimination criterion. A coherent allocation rule is also proposed. Simulation studies have shown that this method represents a valid solution for problems where parametric approaches are not flexible enough and sample sizes are too small for classical nonparametric methods.

Angela Montanari, Daniela G. Calò

Regression Tree and Neural Networks

Notes on Methods for Improving Unstable Classifiers

Methods for improving the predictive power of unstable classifiers by combining multiple versions of them have received much attention in the last few years. The aim of this paper is to compare some of the proposed methods, with a focus on neural network classifiers. Experimental results are provided to illustrate, on different data sets, the performance of different methods of combining the output of several neural classifiers.

Rossella Miglio, Marilena Pillati
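
A sketch of one of the combination methods the paper compares, bagging with majority voting, assuming integer class labels and a small feed-forward network as the base classifier:

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def bagged_vote(X, y, X_new, n_models=25, seed=0):
        """Train each network on a bootstrap replicate of the training set
        and combine the predictions by majority vote."""
        rng = np.random.default_rng(seed)
        votes = []
        for b in range(n_models):
            idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample
            net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=500,
                                random_state=b).fit(X[idx], y[idx])
            votes.append(net.predict(X_new))
        votes = np.array(votes)
        # majority vote over the ensemble, one column per new observation
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)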

Selection of Cut Points in Generalized Additive Models

This paper offers, in the framework of generalized additive models (GAM), a proposal for cut point selection for GAM smoothers that stems from CART-like regression tree procedures. The proposal allows one to find a parsimonious bin smoother (regressogram) and a new smoother based on the well-known loess smoother, and moreover provides the user with additional information inherited from the regression tree methodology. The problem of the choice of the span parameter is considered too.

Francesco Mola
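
A minimal illustration of a CART-like cut point for a single predictor: the split that minimises the total within-bin residual sum of squares; applied recursively, this yields a parsimonious regressogram of the kind the abstract mentions:

    import numpy as np

    def best_cut(x, y):
        """Return the cut point of x minimising the within-bin RSS of y."""
        order = np.argsort(x)
        xs, ys = x[order], y[order]
        best, cut = np.inf, None
        for i in range(1, len(xs)):
            if xs[i] == xs[i - 1]:
                continue                  # no valid cut between tied x values
            left, right = ys[:i], ys[i:]
            rss = ((left - left.mean()) ** 2).sum() \
                + ((right - right.mean()) ** 2).sum()
            if rss < best:
                best, cut = rss, (xs[i - 1] + xs[i]) / 2.0
        return cut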

Latent Budget Trees for Multiple Classification

This paper provides a methodology to grow classification trees when a multiple qualitative response is considered as the criterion variable. The latent budget model is used recursively to find ever finer partitions of cases into an a priori fixed number of groups. The Akaike statistic is used to select the most predictive model at each node of the tree. A fruitful interpretation of the final decision rule is given through the Bayes rule. The proposed approach is also convenient for dealing with multiple questions through the use of compound variables in the latent budget model. An application of the proposed approach to a data set taken from a survey of the Bank of Italy is finally shown.

Roberta Siciliano

Multivariate and Multidimensional Data Analysis

Frontmatter

Proximity Analysis and Multidimensional Scaling

Methods for Asymmetric Three-way Scaling

A review of methods for asymmetric three-way scaling is presented focusing on their graphical capabilities. A general strategy of analysis is outlined with an example of application to import-export data.

Giuseppe Bove, Roberto Rocci

Comparison of Euclidean Approximations of non-Euclidean Distances

The different techniques used for the Euclidean approximation of distances are discussed. In the special case of points in a Euclidean space whose distances are biased due to measurement errors, accepting negative eigenvalues may help in the interpretation of results that are less biased than those obtained through an additive constant solution. Numerical examples are given.

Sergio Camiz
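
For reference, classical (Torgerson) scaling makes the role of negative eigenvalues explicit; the sketch below returns the full spectrum, so that negative eigenvalues, which signal a non-Euclidean distance matrix, can be inspected rather than hidden by an additive constant:

    import numpy as np

    def torgerson(D):
        """Classical scaling of a distance matrix D; negative eigenvalues
        indicate that D is not Euclidean."""
        n = len(D)
        J = np.eye(n) - np.ones((n, n)) / n    # centring operator
        B = -0.5 * J @ (D ** 2) @ J            # doubly centred matrix
        w, V = np.linalg.eigh(B)
        w, V = w[::-1], V[:, ::-1]             # descending eigenvalues
        coords = V * np.sqrt(np.clip(w, 0, None))
        return coords, w                       # configuration, full spectrum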

Analysing Dissimilarities through Multigraphs

In this paper a very general way of modelling dissimilarities is proposed, based on ideas derived from multigraph theory. The proposed method admits Gower’s dissimilarity index as a special case and makes it possible to cope with measurement errors and within-subject variability when computing dissimilarities. The approach also allows one to assign a different importance to the same dissimilarity value on different regions of the variable domain.

Angela Montanari, Gabriele Soffritti

Professional Positioning based on Dominant Eigenvalue Scores (DES), Dimensional Scaling (DS) and Multidimensional Scaling (MDS) Synthesis of Binary Evaluations Matrix of Experts

This paper stems from wider research aimed at obtaining timely evaluations of the professional market in the province of Naples through in-depth interviews with experts. It shows some results on one set of professions, according to the DS and MDS procedures.

Claudio Quintano

Non-Metric Full-Multidimensional Scaling

This paper focuses on some solutions for non-metric full-Multidimensional Scaling (MDS), minimizing the STRESS and S-STRESS loss functions. In particular, the linear transformations of dissimilarities into Euclidean distances minimizing the two loss functions are given. A non-trivial result for S-STRESS with a quadratic transformation of dissimilarities, constraining its coefficients, is also obtained.

Maurizio Vichi
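
In their raw (unnormalized) form the two loss functions can be written as follows, with delta the dissimilarities in the condensed form returned by pdist; the paper’s constrained transformations of the dissimilarities are not sketched here:

    import numpy as np
    from scipy.spatial.distance import pdist

    def stress(X, delta):
        """Raw STRESS: squared error between dissimilarities and distances."""
        return np.sum((delta - pdist(X)) ** 2)

    def s_stress(X, delta):
        """Raw S-STRESS: the same error computed on squared quantities."""
        return np.sum((delta ** 2 - pdist(X) ** 2) ** 2)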

Factorial Methods

Dynamic Factor Analysis

This paper presents an extension of the Dynamic Factor Analysis (AFD) models proposed in the ’70s by Coppi and Zannella. AFD models are specific to data arrays whose third dimension is time. They consider time as an explicit element which gives rise to part of the observed variability. AFD models integrate two different strategies. The first aims at studying the relationships between variables and units, averaged over time, by factorial analysis of specific covariance matrices. The second aims at studying the time evolution of both variables and units by time regression and autoregressive models.

Isabella Corazziari

A Non Symmetrical Generalised Co-Structure Analysis for Inspecting Quality Control Data

The paper provides a contribution to factorial methods in multidimensional data analysis, filling the gap concerning graphical representations of statistical units on which a multiple set of response variables, as well as a common set of explanatory variables, is observed. By joining the features of multiple Co-Inertia analysis with those of a geometrical non-symmetrical approach, the proposed technique gains remarkable advantages in identifying a typology of statistical units generated by the above dependence structure.

Vincenzo Esposito, Germana Scepi

Principal Surfaces Constrained Analysis

In this paper we present a nonparametric adaptive procedure for the nonlinear representation of principal components (Principal Surfaces, PS; LeBlanc and Tibshirani, 1994) in Constrained Principal Component Analysis (CPCA; D’Ambra and Lauro 1982, 1992; see also Principal Component Analysis with Instrumental Variables, Rao 1964, and Redundancy Analysis, van den Wollenberg 1977).

R. Lombardo, G. Tessitore

Generalised Canonical Analysis on Symbolic Objects

In this paper we propose an extension of Generalised Canonical Analysis to the study of symbolic objects. The aim is to analyse symbolic objects on a factorial plane. In the reduced sub-space, the symbolic objects are represented by polygons instead of the points used in classical data analysis. This kind of representation seems consistent with their original meaning as complex information. Furthermore, we propose a symbolic interpretation of the factorial axes and an evaluation of the quality of the images of the symbolic objects on the factorial plane.

Rosanna Verde

Analysis of Qualitative Variables in Structural Models with Unique Solutions

A new method based on Multidimensional Scaling and Restricted Regression Component Decomposition is proposed in order to obtain unique solutions for structural models with mixed variables.

Giorgio Vittadini

Spatial Analysis

Exploring Multivariate Spatial Data: Line Transect Data

In this paper we describe an exploratory technique based on the diagonalization of cross-variogram matrices. Our aim is to describe the behavior of a multivariate set of spatial data in a dimensionally reduced space in such a way that the information on the spatial variation is preserved. Furthermore, we propose a definition for the range of “variograms” in the multivariate case. Simulation studies and an application to botanical data collected on line transects are reported.

Alessandra Capobianchi, Giovanna Jona-Lasinio

On the Assessment of Geographical Survey Units using Constrained Classification

Surveys of spatially distributed phenomena are often conducted using geographical areas as strata. If CATI (computer assisted telephone interviewing) methodology is used to contact the units, it is possible to choose between two different methods of selecting the units to be interviewed: either from a full list of the population units or by a selection based on the RDD (random digit dialling) technique. In the latter case it seems natural to use telephone exchange areas as strata, which could be a very interesting solution from many points of view. The aim of this paper is to find a methodology to assess the opportunity of using such a pre-defined geographical stratification, in comparison with the usual clustering methods that define the strata from a set of auxiliary variables correlated with the phenomena under study. The choice will depend on the use of some measure of similarity and/or on the evaluation of the homogeneity of the strata for the specific phenomenon to be analysed (in our application the loss of homogeneity is evaluated with respect to a hypothetical set of variables under study).

Antonio Giusti, Alessandra Petrucci

Kalman Filter Applied to Non-Causal Models for Spatial Data

This paper addresses the problem of applying filtering and smoothing algorithms, in particular Kalman filtering, to spatially dependent data. We take into account the case of first- and second-order homogeneous Gauss-Markov Random Fields (GMRF) and address the question of parameter estimation for this class of spatial processes; then we consider the possibility of expressing these processes as unilateral ones, so that they can be written in state-space form; and, finally, we present a “classical” Kalman filter algorithm, which is particularly suitable for the case of satellite images contaminated by additive Gaussian noise.

Luca Romagnoli
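
For readers unfamiliar with the machinery, the “classical” Kalman recursions for a linear Gaussian state-space model x_t = F x_{t-1} + w_t, y_t = H x_t + v_t are sketched below; rewriting a GMRF as a unilateral process so that it fits this form is the nontrivial step addressed in the paper:

    import numpy as np

    def kalman_filter(y, F, H, Q, R, x0, P0):
        """Standard predict/update recursions; w ~ N(0, Q), v ~ N(0, R)."""
        x, P, out = x0, P0, []
        for obs in y:
            x, P = F @ x, F @ P @ F.T + Q          # prediction step
            S = H @ P @ H.T + R                    # innovation covariance
            K = P @ H.T @ np.linalg.inv(S)         # Kalman gain
            x = x + K @ (obs - H @ x)              # measurement update
            P = (np.eye(len(x)) - K @ H) @ P
            out.append(x)
        return np.array(out)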

Multiway Data Analysis

A Paradigmatic Path for Statistical Content Analysis Using an Integrated Package of Textual Data Treatment

In this paper the different phases of text treatment are sketched, in order to link them both with some lexical characteristics of the analysed corpus and with multidimensional techniques useful for its statistical content analysis. Our proposal is directed towards keeping the system of meanings present in the corpus intact and improving the degree of monosemy of words. In this way a corpus vocabulary of mixed units of analysis is realised.

Sergio Bolasco, Adolfo Morrone, Francesco Baiocchi

The Analysis of Auxological Data by Means of Nonlinear Multivariate Growth Curves

In this paper we treat the problem of analysing a data set constituted by multivariate growth curves for different subjects; in this context we therefore deal with 3-way data tables. Nevertheless, it is not possible to use the factorial techniques proposed for 3-way data matrices, because the observations are generally not equally spaced; moreover, a multilevel approach founded on polynomial models is not suitable for intrinsically nonlinear models. We propose a non-factorial technique to analyse auxological data sets using an intrinsically nonlinear multivariate growth model with autocorrelated errors. The application to a real data set of growing children gave easily interpretable results.

Marcello Chiodi, Angelo M. Mineo

The Kalman Filter on Three Way Data Matrix for Missing Data: A Case Study on Sea Water Pollution

This paper proposes a method for the reconstruction of missing data in a three-way data array, based on six modified procedures of the optimum Kalman filter in relation to the structural data analysis. The case study concerns environmental data on sea water pollution observed in the Adriatic Sea.

Mauro Coli, Luigi Ippoliti, Eugenia Nissi

Three-Way Data Arrays with Double Neighbourhood Relations as a Tool to Analyze a Contiguity Structure

In this paper we present two methods to analyze three-way data arrays with double neighbourhood relations. The first procedure uses the Kronecker product of graph matrices to construct a neighbourhood operator; some of the most significant eigenvectors of this operator allow modelling of the underlying phenomena. The second method takes the Kronecker product of the neighbourhood operators of the graph matrices and is equivalent to a particular STATIS. A comparison between the two procedures on an ecological data set is then performed.

Pierre-André Cornillon, Pietro Amenta, Robert Sabatier
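
A minimal sketch of the first procedure, assuming symmetric graph (neighbourhood) matrices for the two contiguity relations; the leading eigenvectors of the Kronecker-product operator are the ones used for modelling:

    import numpy as np

    def kronecker_operator(W_space, W_time, k=3):
        """Neighbourhood operator for a double contiguity structure and
        its k most significant eigenvectors (W assumed symmetric)."""
        W = np.kron(W_space, W_time)
        w, V = np.linalg.eigh(W)
        order = np.argsort(np.abs(w))[::-1]    # by decreasing |eigenvalue|
        return w[order[:k]], V[:, order[:k]]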

Firm Performance Analysis with Panel Data

This paper deals with an “ecumenical” approach to the analysis of productive processes. In particular, we suggest a research strategy based on five known estimation methods for the production frontier function, each one characterised by its own type of flexibility in the evaluation of results. In this way we obtain, through a bootstrap methodology, an interval estimate of the efficiency score of each firm.

Achille Lemmi, Duccio Stefano Gazzei

Multivariate Data Analysis

Detection of Multivariate Outliers by Convex Hulls

This paper deals with the problem of identifying multiple outliers in multivariate data. Detection of anomalous values is achieved by looking at the variations in the convex hull of the data set as blocks of observations are deleted.

Maria Rosaria D’Esposito, Giancarlo Ragozini
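
A sketch of the convex-hull ingredient only, using scipy’s ConvexHull: successive hulls are peeled, and points removed in the first layers are flagged as candidate outliers (the paper itself studies the variation of the hull under block deletion, which is not reproduced here):

    import numpy as np
    from scipy.spatial import ConvexHull

    def hull_depths(X, layers=3):
        """Peel successive convex hulls; low depth = candidate outlier."""
        depth = np.full(len(X), layers, dtype=int)
        idx = np.arange(len(X))
        for d in range(layers):
            if len(idx) <= X.shape[1] + 1:    # too few points to build a hull
                break
            hull = ConvexHull(X[idx])
            depth[idx[hull.vertices]] = d     # layer at which a point is peeled
            idx = np.delete(idx, hull.vertices)
        return depth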

Reducing Dimensionality Effects on Kernel Density Estimation: The Bivariate Gaussian Case

It is well known that the kernel estimation of multidimensional densities is a difficult task due to the so-called “curse of dimensionality”: the greater the data dimension, the greater the sample size required to obtain efficient estimates. To reduce such dimensionality effects, we introduce further smoothing sources in addition to the usual bandwidth parametrization. In particular, preliminary kernel estimates are interpreted as smoothed samples and form the basis for successive density estimates, whose average (with weights given by empirical likelihoods of the observed sample) defines the proposed sequential density estimator.

Marco Di Marzio, Giovanni Lafratta

Shewhart’s Control Chart: Some Observations

Data analysis in Shewhart’s control chart, using the original m samples of size n, is the main subject of this paper. Given the m × n intensities, we examine three alternatives for synthesizing the variability: (a) the arithmetic mean of the m standard deviations; (b) the root mean square of the m variances; (c) the global dispersion. We prefer (c), the global dispersion, to estimate the parent population σ². As an alternative, we suggest analysing all the items of a single random sample, dimensioned so as to obtain an efficient estimate of σ². A second proposal introduced here is to use the factory’s requirements (P0, P1, α, β, L and U). Some examples are given in the last section of the paper.

Massimiliano Giacalone

Projection Pursuit Regression with Mixed Variables

The aim of this paper is to extend projection pursuit regression to the case of mixed predictors, according to two different approaches. The former consists in converting each categorical regressor into dummy variables. The latter consists in preliminarily transforming the predictors by means of principal coordinate analysis. In the presence of strongly non-linear regression functions and interactions between predictors, both procedures improve on the results obtained by multiple linear regression, distance-based regression, MORALS and ACE. In particular, projection pursuit regression in conjunction with principal coordinate analysis shows a very satisfactory performance.

Annalisa Laghi, Laura Lizzani

Recursive Estimation of System Parameter in Environmental Time Series Models

Dealing with high-frequency time series, such as environmental ones, raises important inferential and computational problems. Environmental monitoring and forecasting, for instance, require statistical procedures giving reliable estimates of unknown parameters and forecasts in real time. In this paper we consider dynamic linear models as a basic tool for the analysis of this kind of data and propose a recursive estimator for the system parameter. A comparison of this estimator with some other estimation methods is provided via Monte Carlo simulations. The estimator we propose is computationally efficient and very easy to implement. Moreover, in our simulation study, it exhibits good asymptotic properties.

P. Mantovan, A. Pastore, S. Tonellato

Kernel Methods For Estimating Covariance Functions From Curves

We propose kernel methods for estimating covariance functions, when the data consists of a collection of curves. Every curve is modelled as an independent realization of a stochastic process with unknown mean and covariance structure. We consider a kernel density estimator, which has the positive semi-definiteness property on the “time” points and also in the continuum. We describe a cross-validation procedure, which leaves out an entire curve at a time, to choose the bandwidth (smoothing parameter) automatically from the observed collection of curves.

Andrea Pallini

Detection of Subsamples in Link-Free Regression Analysis

A regression analysis could fail if the sample is actually composed of several subsamples. We show that the regression function plot is a powerful tool to detect such a feature in the data. Its behaviour when several subpopulations are present is investigated in the framework of link-free regression analysis. A dynamic graphics procedure to detect the coexistence of several subsamples in the data is proposed.

Giovanni C. Porzio

Asymptotic Prior to Posterior Analysis for Graphical Gaussian Models

In this paper we derive the asymptotic posterior distribution, in a conjugate analysis, of the marginal and partial correlation coefficients in a graphical Gaussian model. An example of prior to posterior analysis is given and the problem of the specification of the hyperparameters is discussed.

Alberto Roverato

Case Studies

Frontmatter

Applied Classification and Data Analysis

Using Qualitative Information and Neural Networks for Forecasting Purposes in Financial Time Series

In the Italian financial market, stock fluctuations are highly dependent on political and economic events. For this reason, any realistic forecast should consider this kind of information. In this paper we show a way to include economic and political events in the forecasting of a financial time series. We then apply neural networks, econometric analysis and some recent non-parametric regression models to empirical data observed over a period of 61 weeks, and compare the respective performances of the different approaches.

Simone Borra, Agostino Di Ciaccio

A New Approach to the Stock Location Assignment Problem by Multidimensional Scaling and Seriation

The problem of the best stock location assignment in a warehouse plays a fundamental role in optimising picking activities. In the present paper, this problem is faced by considering seven variables to compute the similarity between items. In this context, the problem of choosing the most adequate similarity (or dissimilarity) measure between units when applying Multidimensional Scaling (MDS) is examined. Besides the right metric, the possibility of applying a seriation algorithm is also considered. By using both MDS and seriation, not just a single target can be considered: we are able to handle many variables, whereas the Operational Research techniques used in the literature observe just a single variable and can therefore achieve just a single goal. A wide discussion of the results is presented.

Angelo M. Mineo, Antonella Plaia

Food Coding in Nutritional Surveys

Nutritional studies are aimed at evaluating the implications of food behaviour in order to detect possible health problems. The results can also be used to plan educational campaigns, regulatory interventions, and so on. In this context, food classification can vary according to different criteria; therefore, food coding systems must be flexible enough to satisfy the various requirements. This approach was adopted in the INN-CA study carried out by the Istituto Nazionale della Nutrizione (INN) in 1995, whose characteristics and first results are discussed in the present paper.

Aida Turrini

UNAIDED: a PC System for Binary and Ternary Segmentation Analysis

UNAIDED is a software program for segmentation analysis. The program implements several techniques and criteria for segmenting a set of units, whatever the measurement scale of the criterion variable. At present, the techniques available are: binary and ternary segmentation, monotone vs. free analysis, ranking of predictors, and “look-ahead” search for the best split. The analytical criterion may be chosen from a large set of implemented criteria.

Claudio Capiluppi, Luigi Fabbris, Michele Scarabello

Backmatter
