
About this Book

This volume contains a selection of papers presented at the biennial meeting of the Classification and Data Analysis Group of the Società Italiana di Statistica, which was held in Rome, July 5-6, 1999. From the originally submitted papers, a careful review process led to the selection of 45 papers, presented in four parts as follows:

CLASSIFICATION AND MULTIDIMENSIONAL SCALING: Cluster analysis; Discriminant analysis; Proximity structures analysis and multidimensional scaling; Genetic algorithms and neural networks. MULTIVARIATE DATA ANALYSIS: Factorial methods; Textual data analysis; Regression models for data analysis; Nonparametric methods. SPATIAL AND TIME SERIES DATA ANALYSIS: Time series analysis; Spatial data analysis. CASE STUDIES.

INTERNATIONAL FEDERATION OF CLASSIFICATION SOCIETIES: The International Federation of Classification Societies (IFCS) is an agency for the dissemination of technical and scientific information concerning classification and data analysis, in the broad sense and in as wide a range of applications as possible. It was founded in 1985 in Cambridge (UK) by the following scientific societies and groups: British Classification Society - BCS; Classification Society of North America - CSNA; Gesellschaft für Klassifikation - GfKl; Japanese Classification Society - JCS; Classification Group of the Italian Statistical Society - CGSIS; Société Francophone de Classification - SFC. The IFCS now also includes the following societies: Dutch-Belgian Classification Society - VOC; Polish Classification Society - SKAD; Associação Portuguesa de Classificação e Análise de Dados - CLAD; Korean Classification Society - KCS; Group-at-Large.

Table of Contents

Frontmatter

Classification and Multidimensional Scaling

Frontmatter

Cluster Analysis

Galois Lattices of Modal Symbolic Objects

In this paper we propose a classification structure for modal symbolic objects based on Galois lattices and on the concepts of credibility and capacity. This classification structure, which is hierarchical and non-partitive, has been developed in the framework of formal concept analysis in order to model knowledge in the form of a context. The graph representation of a Galois lattice allows us to associate homogeneity measures with the nodes and transition measures with the edges.

D. Bruzzese, A. Irpino

Exploratory Methods for Detecting High Density Regions in Cluster Analysis

In this paper we propose some simple diagnostics which can prove useful for detecting high density regions in ℜP, for p ≥ 2. Our approach does not require full estimation of the multivariate density and exploits the spatial contiguity information which can be attached to objects in ℜP. The suggested method could be routinely applied as a preliminary step in nonhierarchical cluster analysis, where it provides useful guidance both in choosing the appropriate number of clusters and in selecting the values of initial cluster seeds.

Andrea Cerioli, Sergio Zani

A k-means Consensus Classification

The aim of this paper is to detect social and territorial structures from the census data of the administrative units in Italy. The variables describing each administrative unit can be divided into three distinguishable groups. Three approaches to analysing these data are proposed and compared. The first two are essentially two-step ‘tandem analyses’, consisting of a preliminary factorial analysis followed by a cluster analysis on the first few factors. The third approach produces three different classifications, one for each group of variables, and then establishes a consensus classification via the k-means clustering criterion.

Giuseppina Damiana Costanzo
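The consensus step sketched in the abstract above can be illustrated in code. This is a minimal, hypothetical reading of the idea: each unit is recoded by the concatenated one-hot memberships it received in the separate classifications, and a k-means run on this coding yields the consensus partition. The toy data, the deterministic farthest-point seeding and all function names are illustrative assumptions, not details from the paper.

```python
def sqd(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def one_hot(labels):
    values = sorted(set(labels))
    return [[1.0 if l == v else 0.0 for v in values] for l in labels]

def kmeans(X, k, iters=50):
    # deterministic toy k-means: farthest-point seeding + Lloyd iterations
    seeds = [0]
    while len(seeds) < k:
        seeds.append(max(range(len(X)),
                         key=lambda i: min(sqd(X[i], X[s]) for s in seeds)))
    cents = [list(X[s]) for s in seeds]
    labels = [0] * len(X)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sqd(x, cents[c])) for x in X]
        for c in range(k):
            members = [X[i] for i in range(len(X)) if labels[i] == c]
            if members:
                cents[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def consensus_kmeans(partitions, k):
    # each unit is recoded by the concatenated one-hot memberships it received
    # in the separate classifications; k-means on this coding gives the consensus
    coded = [sum(rows, []) for rows in zip(*(one_hot(p) for p in partitions))]
    return kmeans(coded, k)

# three classifications of four administrative units (toy data)
p1, p2, p3 = [0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1]
labels = consensus_kmeans([p1, p2, p3], 2)
```

Units that are grouped together in most of the input classifications end up close in the indicator coding, so the final k-means recovers the shared structure.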

MIXISO: a Non-Hierarchical Clustering Method for Mixed-Mode Data

In this paper we build on the approach of the hierarchical clustering method MIXCLAS to obtain a new non-hierarchical method, which we name MIXISO. This method can analyze mixed-mode data with a large number of units, even in the presence of missing data. We demonstrate the application of the method on artificial and real data sets with missing values.

Agostino Di Ciaccio

“Stable Clusters”: a new Approach for Clustering Binary Variables

As pointed out by Openshaw (1980), monothetic divisive algorithms often require some means of post-classification relocation. We introduce a new criterion for improving the results of a (particular) monothetic divisive algorithm for clustering individuals classified on the basis of Q binary variables. This is done by referring to an agglomerative algorithm that has close similarities with the monothetic divisive algorithm under consideration. The new algorithm, developed by combining information arising from the two procedures, is compared with a relocation algorithm proposed by Openshaw.

Raffaella Piccarreta

Double k-means Clustering for Simultaneous Classification of Objects and Variables

In this paper a general model for the simultaneous classification of objects and variables of a two-mode data matrix is proposed. The model can identify both different classification structures of objects and variables (e.g., partitions, coverings) and different classification types (hard or fuzzy). The model is identified by the numerical solution of a least-squares problem, which is typically NP-hard. An alternating least-squares algorithm is introduced to give an efficient solution to the fitting problem. The proposed classification model is named double k-means since its features recall to some extent the well-known k-means clustering methodology. Some real data sets are analyzed with the proposed methodology to illustrate its features.

Maurizio Vichi
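A minimal sketch of the alternating least-squares idea behind double k-means: rows and columns of a two-mode matrix are partitioned simultaneously so that the matrix is approximated by a small grid of block means. The initialisation, function names and toy data below are assumptions for illustration, not the paper's algorithm.

```python
def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def fp_labels(vectors, k):
    # farthest-point seeding followed by nearest-seed assignment
    seeds = [0]
    while len(seeds) < k:
        seeds.append(max(range(len(vectors)),
                         key=lambda i: min(sqdist(vectors[i], vectors[s]) for s in seeds)))
    return [min(range(k), key=lambda c: sqdist(v, vectors[seeds[c]])) for v in vectors]

def block_means(X, U, V, kr, kc):
    S = [[0.0] * kc for _ in range(kr)]
    n = [[0] * kc for _ in range(kr)]
    for i, row in enumerate(X):
        for j, x in enumerate(row):
            S[U[i]][V[j]] += x
            n[U[i]][V[j]] += 1
    return [[S[p][q] / n[p][q] if n[p][q] else 0.0 for q in range(kc)] for p in range(kr)]

def double_kmeans(X, kr, kc, iters=20):
    # alternate: update block means, reassign rows, update, reassign columns
    cols = [list(c) for c in zip(*X)]
    U, V = fp_labels(X, kr), fp_labels(cols, kc)
    for _ in range(iters):
        C = block_means(X, U, V, kr, kc)
        U = [min(range(kr), key=lambda p: sum((x - C[p][V[j]]) ** 2
                                              for j, x in enumerate(row)))
             for row in X]
        C = block_means(X, U, V, kr, kc)
        V = [min(range(kc), key=lambda q: sum((row[j] - C[U[i]][q]) ** 2
                                              for i, row in enumerate(X)))
             for j in range(len(X[0]))]
    return U, V

# toy two-mode matrix with an obvious 2x2 block structure
X = [[8, 8, 1, 1], [8, 8, 1, 1], [1, 1, 3, 3], [1, 1, 3, 3]]
U, V = double_kmeans(X, 2, 2)
```

Each alternating step cannot increase the least-squares loss, which is the usual argument for convergence of this family of algorithms.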

Discriminant Analysis

Categorical FDA Under Prospective Sampling Scheme: a Proposal for Variable Selection

Given a population described by p explanatory categorical variables and one dependent categorical variable, we assume that the dependent variable defines a partition of the population into g groups. Discriminant analysis studies the relation between the p explanatory variables and the dependent variable, finding the subset of variables with the most predictive power. Generally, in categorical discriminant analysis, the a priori probabilities associated with the g groups are assumed known. In this paper we summarise some suitable approaches under the hypothesis of unknown group a priori probabilities and propose a new variable selection algorithm.

Francesco Palumbo

The Effect of Telephone Survey Design on Discriminant Analysis

In this paper the effect of telephone survey sample design on discriminant analysis is studied. The population is divided into two groups and a superpopulation model with multivariate normal distribution is assumed for the variables. The learning sample is composed of the randomly generated telephone numbers which correspond to the eligible population units (RDD). The rate of misclassification of the discriminant function and the effect of sample design on this rate of misclassification are discussed.

Alessandra Petrucci, Monica Pratesi

Proximity Structures Analysis and Multidimensional Scaling

A Dissimilarity Measure between Probabilistic Symbolic Objects

This paper presents an approach to calculating the dissimilarity between probabilistic symbolic objects. The proposed dissimilarity measure is based on both a comparison function and an aggregation function. The comparison function is a proximity coefficient based on the statistical information given by each probabilistic elementary event. The aggregation function is a proximity index, related to the Minkowski measure, which aggregates the results given by the comparison functions.

Laura Bocci
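The comparison/aggregation scheme described above can be sketched in code. Purely as an assumption for illustration, total variation distance plays the role of the comparison function, objects are lists of probability distributions (one per elementary variable), and a Minkowski-type index aggregates the variable-wise results; none of these choices is taken from the paper.

```python
def compare(p, q):
    # comparison function (assumed): total variation distance between distributions
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def dissimilarity(a, b, r=2):
    # Minkowski-type aggregation of the variable-wise comparison results
    return sum(compare(p, q) ** r for p, q in zip(a, b)) ** (1.0 / r)

# two objects described by distributions over two categorical variables
a = [[0.7, 0.3], [0.2, 0.5, 0.3]]
b = [[0.7, 0.3], [0.6, 0.1, 0.3]]
d = dissimilarity(a, b, r=1)
```

With r = 1 the index is a plain sum of the comparisons; larger r emphasises the variables on which the two objects differ most.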

Recent Experimentation on Euclidean Approximations of Biased Euclidean Distances

Given a set of 16 points on a grid, a set of randomly biased distance matrices is built and ten methods for their Euclidean approximation are compared to identify which minimizes the stress. The Principal Coordinates Analysis of Torgerson’s (1958) matrix of biased distances, limited to positive eigenvalues, proved more effective than methods based on monotonic transformations aimed at making the corresponding Torgerson (1958) matrix positive semidefinite prior to PCoA. Its behaviour was close to that of Kruskal’s Non-Metric Multidimensional Scaling and Bennani Dosse’s (1998) Optimal Scaling, with the advantage of identifying a posteriori the suitable dimension.

Sergio Camiz, Georges Le Calvé

Comparing Capital Structure through Similarity Analysis: Evidence about two Industrial Districts

We compare fundamental aspects of the capital structure of companies located in two different industrial districts. Starting from the aggregate balance sheets of both districts, from which capital structure ratios are calculated, the emerging differences and similarities are investigated by means of a statistical analysis of similarity. The problem is worked out through three different approaches: a modified version of a simple Gini index, a generalised distance between densities, and the overlapping area between them. All these measures are used in a nonparametric sense: the Gini index does not assume any particular distribution, and we compute the other measures using kernel estimates of the densities involved. The data are two 4-year panels of 843 and 187 balance sheets of industrial textile companies of Prato and Biella.

Fabrizio Cipollini, Piero Ganugi

The Geometric Approach to the Comparison of Multivariate Time Trajectories

Different approaches might be envisaged to compare time trajectories. In this paper, following the geometric approach, several dissimilarity measures between time trajectories are considered. An empirical comparison of these dissimilarity measures is also presented.

Renato Coppi, Pierpaolo D’Urso

Ultramine Spaces in Classification

Statistics is the first field of science where the notion of ultrametricity appeared outside mathematics. In fact, the particular geometric configuration of the ultrametric space finds optimal application in hierarchical cluster analysis methods. Moreover, it is interesting to consider the dual space of ultrametrics, which is induced by the “dual” ultrametric inequality, known as ultramine inequality. In this paper some properties of ultramine functions are analyzed and some algorithms are proposed to derive different ultramine approximation matrices of the dissimilarities between elements.

Donatella Vicari
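The two inequalities contrasted in this abstract are easy to check numerically. A small sketch, with toy matrices chosen (as an illustration, not from the paper) so that each satisfies exactly one of the two conditions:

```python
def is_ultrametric(D):
    # d(i,k) <= max(d(i,j), d(j,k)) for every triple of distinct indices
    n = len(D)
    return all(D[i][k] <= max(D[i][j], D[j][k]) + 1e-12
               for i in range(n) for j in range(n) for k in range(n)
               if len({i, j, k}) == 3)

def is_ultramine(D):
    # the dual ("ultramine") inequality: d(i,k) >= min(d(i,j), d(j,k))
    n = len(D)
    return all(D[i][k] >= min(D[i][j], D[j][k]) - 1e-12
               for i in range(n) for j in range(n) for k in range(n)
               if len({i, j, k}) == 3)

ultrametric = [[0, 2, 4], [2, 0, 4], [4, 4, 0]]   # every triangle is isosceles with
                                                  # the two largest sides equal
ultramine = [[0, 2, 4], [2, 0, 2], [4, 2, 0]]     # satisfies only the dual condition
```

In the ultrametric matrix the two largest sides of every triangle are equal, which is exactly the geometry exploited by hierarchical clustering dendrograms.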

Genetic Algorithms and Neural Networks

Genetic Algorithms and Clustering: an Application to Fisher’s Iris Data

Fisher’s iris data constitute a hard benchmark for clustering procedures, and have attracted much work based on statistical methods and on new approaches related to evolutionary algorithms and neural networks. We suggest two genetic algorithms that are effective for simultaneously determining both the optimal number of groups and the assignment of items to groups. The grouping genetic algorithm proposed by Falkenauer (1998) forms the basis of our method, where the variance ratio criterion and Marriott’s method provide two fitness functions that both allow fast computation and include the number of groups explicitly as a parameter. Specialized crossover operators, specific to each of the two fitness functions, are designed to accelerate convergence and minimize the number of iterations. Some simple implementations of our genetic algorithms are presented that classify correctly as many iris plants as the best alternative procedures proposed for this data set. Genetic algorithms therefore seem to be a good alternative for handling clustering problems.

Roberto Baragona, Claudio Calzini, Francesco Battaglia
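The variance ratio criterion mentioned above as a fitness function can be written down directly; this is the standard Calinski-Harabasz form, with toy data and names that are illustrative assumptions rather than details of the paper's implementation:

```python
def variance_ratio(X, labels):
    # Calinski-Harabasz variance ratio criterion (B/(k-1)) / (W/(n-k));
    # in a grouping genetic algorithm this can serve as the fitness of a
    # candidate partition, with the number of groups k free to vary
    n, d = len(X), len(X[0])
    grand = [sum(x[j] for x in X) / n for j in range(d)]
    groups = sorted(set(labels))
    k = len(groups)
    W = B = 0.0
    for g in groups:
        members = [x for x, l in zip(X, labels) if l == g]
        centre = [sum(x[j] for x in members) / len(members) for j in range(d)]
        W += sum(sum((x[j] - centre[j]) ** 2 for j in range(d)) for x in members)
        B += len(members) * sum((centre[j] - grand[j]) ** 2 for j in range(d))
    return (B / (k - 1)) / (W / (n - k))

X = [[0, 0], [0, 1], [10, 10], [10, 11]]
good = variance_ratio(X, [0, 0, 1, 1])   # the "natural" grouping
bad = variance_ratio(X, [0, 1, 0, 1])    # a deliberately poor grouping
```

Because k enters the criterion explicitly, partitions with different numbers of groups can be compared on the same fitness scale.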

Using Radial Basis Function Networks for Classification Problems

The multi-layer perceptron is now widely used in classification problems, whereas radial basis function networks (RBFNs) appear to be rather less well known. The purpose of this work is to briefly recall RBFNs and to summarize their best features. The relationships between these networks and other well-developed methodological tools for classification, both in neural computing and in statistics, are shown. An application of these networks to the forensic glass data set, which is not new in the literature (Ripley, 1994; 1996), tries to lay out what is common and what is distinctive in these networks relative to other competitive methods and to show, through empirical validation, the networks’ performance.

Isabella Morlini
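A minimal RBFN of the kind recalled here: Gaussian basis functions around fixed centres, with linear output weights (plus bias) fitted by least squares, and class decided by thresholding the output. The centres, width and toy data are assumptions for illustration, not Ripley's forensic glass setup.

```python
import math

def gauss_solve(A, b):
    # solve A w = b by Gaussian elimination with partial pivoting (small systems)
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][j] * w[j] for j in range(r + 1, n))) / M[r][r]
    return w

def rbf_features(x, centers, width):
    # Gaussian basis functions around fixed centres
    return [math.exp(-sum((a - b) ** 2 for a, b in zip(x, c)) / (2 * width ** 2))
            for c in centers]

def train_rbfn(X, y, centers, width):
    # linear output weights (plus bias) fitted by least squares (normal equations)
    Phi = [rbf_features(x, centers, width) + [1.0] for x in X]
    m = len(Phi[0])
    A = [[sum(row[a] * row[b] for row in Phi) for b in range(m)] for a in range(m)]
    rhs = [sum(row[a] * yi for row, yi in zip(Phi, y)) for a in range(m)]
    return gauss_solve(A, rhs)

def predict(x, w, centers, width):
    phi = rbf_features(x, centers, width) + [1.0]
    return sum(wi * p for wi, p in zip(w, phi))

# toy two-class problem; class is decided by thresholding the output at 0.5
X = [[0, 0], [0.5, 0], [4, 4], [4, 4.5]]
y = [0, 0, 1, 1]
centers = [[0, 0], [4, 4]]
w = train_rbfn(X, y, centers, 1.0)
```

Because only the output layer is fitted, training reduces to a linear least-squares problem, which is one of the features that distinguishes RBFNs from the multi-layer perceptron.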

Unconditional Latent Budget Analysis: a Neural Network Approach

The latent budget model is a reduced-rank model for the analysis of compositional data. The model can also be understood as a supervised neural network model with weights interpreted as conditional probabilities. The main advantage of this approach is that a classification rule for budget data can be defined for newly observed cases. In this paper, a constrained (weighted) least-squares algorithm, alternative to the one already introduced in the literature for the standard latent budget model, is proposed for the estimation of the parameters. A distinction is made between conditional latent budget analysis (the standard approach) and unconditional latent budget analysis (the neural network approach).

Roberta Siciliano, Ab Mooijaart

Multivariate Data Analysis

Frontmatter

Factorial methods

Generalized Constrained Principal Component Analysis

This paper deals with a non-symmetrical analysis of two multiple data sets in order to study the structure of dependence among sets of variables which play different roles in the analysis. This approach represents a generalization of Constrained Principal Component Analysis (CPCA) (D’Ambra and Lauro, 1982).

Pietro Amenta, Luigi D’Ambra

Interaction Terms in Homogeneity Analysis: Higher Order Non-Linear Multiple Correspondence Analysis

This article aims to generalise Homogeneity Analysis (HA, van Rijckevorsel & de Leeuw 1988) by the introduction and optimal selection of interaction terms as linear manifolds of univariate and multivariately transformed variables (Costanzo & van Rijckevorsel 1994). Optimal interaction terms are a general way to introduce higher order non-linear Multiple Correspondence Analysis.

Rosaria Lombardo, Jan van Rijckevorsel

Perturbation Models for Principal Component Analysis of Rainwater Pollution Data

The study of the correlation matrix between ion concentrations and the principal component analysis of the related covariance matrix are widely used to explore the presence of contamination patterns in rainwater. The paper shows that the covariance between ion concentrations is a perturbed measure, and that the total conductivity can be interpreted as the perturbation factor. The paper then describes some strategies for measuring and removing the perturbation and shows how, by removing this effect, correct contamination patterns can be identified. A summary of the results of an application to data measured by the monitoring network of the Veneto region is given.

Pietro Mantovan, Andrea Pastore

Core Matrix Rotation to Natural Zeros in Three-Mode Factor Analysis

This paper presents a new rotation method to simplify the interpretation of the core matrix in three-mode factor analysis. The rotated solution is compared, theoretically and empirically, with the TUCKALS solution (Kroonenberg, 1994).

Roberto Rocci

Textual Data Analysis

A Factorial Technique for Analysing Textual Data with External Information

The paper proposes a method for taking contextual information into account when analysing lexical tables by means of factorial techniques such as Correspondence Analysis. Such information, external to the main data structure, can concern where and how words are used and, moreover, their (syntactical, grammatical, etc.) role inside the corpus. The methodological tool proposed here is a technique based on projections onto subspaces spanned by two sets of variables related to fragments and words. The matrix to be analysed, called the inter-reference matrix, measures the strength of the association between the external information on words and fragments. The final outputs are graphical representations that enrich the results of textual data analysis.

Simona Balbi, Giuseppe Giordano

Subjects on Using Open and Closed-Ended Questions

To carry the study of differences between open and closed-ended answers beyond simple univariate comparison, we examine the relations between open and closed-ended alternatives. The open-answer texts are coded by means of manual post-coding and textual analysis techniques. Since open and closed-ended information can be observed for the same respondents, the aim of this work is to understand the differences in the frequencies collected with the two data collection tools, and new methods of analysis are proposed.

Arjuna Tuzzi

Regression Models for Data Analysis

Alternative Error Term Specifications in the Log-Tobit Model

In this paper a logarithmic transformation of the standard Tobit model is proposed. The model represents an interesting tool for specifying alternative forms of data heteroskedasticity. The properties of the applied estimators are compared by a set of Monte Carlo experiments.

Rosa Bernardini Papalia, Francesca Di Iorio

A Customer Satisfaction Approach for User-oriented Comparative Evaluations of Services

A goal-driven multi-criteria approach to evaluation is proposed to help a potential new client choose among competing suppliers of a type of service (or product). The actor of the evaluation is an independent agency, perhaps a magazine, which interprets the point of view of a type of potential client. Evaluations are grounded on the surveyed perceived satisfaction of old clients, but on the dimensions of interest for the type of potential client, as accounted for by the criterion he adopts. An approach which uses structured stochastic models is presented, but we emphasize a new interpretation of the role of modeling in evaluations. In our setup, evaluations are fully conditioned on the accounting criterion - even a subjective one - which the actor of the evaluations adopts as grounds for its evaluations.

Giulio D’Epifanio

Mixture Models for Maximum Likelihood Estimation from Incomplete Values

In this paper we consider methods for the analysis of the relationship between input and output variables when missing values occur in the input data. In such situations the incomplete cases cannot simply be suppressed, so the missing values must be estimated on the basis of some suitable statistical model. This problem is approached here by means of mixture distributions whose parameters are estimated using likelihood-based methods. Applications to neural network training from incomplete data are discussed and the results are compared with those obtained using the mean imputation method. These results lead to some practical criteria for choosing between the two methods when training neural networks from incomplete data.

Filippo Domma, Salvatore Ingrassia

Robust Inference in the Logistic Regression Model

Empirical likelihood is extended to a class of robust estimators for the parameter vector of the logistic regression model, so as to improve on both the known inference procedures based on empirical likelihood, which are not robust, and the usual robust inference procedures based on the normal approximation.

Michele La Rocca

A Plot for Submodel Selection in Generalized Linear Models

In applied regression analysis, model selection criteria are usually used to identify a set of submodels for further study. In this paper, we present a method for a graphical comparison of models that helps in selecting among submodels. The method, based on comparisons of fitted functions projected on two-dimensional surfaces, is offered in a generalized linear models framework, and it is explored in the binomial regression case.

Giovanni C. Porzio

On the Use of Multivariate Regression Models in the Context of Multilevel Analysis

The use of Multivariate Regression Models with mixed data to evaluate and decompose relative effectiveness of different social agencies presents numerous problems. The solution proposed is to use the Seemingly Unrelated Equations Models (SURE) in the framework of Multilevel Analysis, following quantification of the response variables by means of simultaneous Multidimensional Scaling methods. An example is provided.

Giorgio Vittadini

Nonparametric Methods

Nonparametric Estimation Methods for Sparse Contingency Tables

The problems related to multinomial sparse data analysis have been widely discussed in the statistical literature in recent years. For the estimation of the mass distribution, the use of nonparametric methods has become widespread, particularly in the framework of ordinal variables. The aim of this paper is to evaluate the performance of kernel estimators for sparse contingency tables with ordinal variables, comparing them with alternative methodologies. Moreover, an approach based on a kernel estimator is proposed for estimating the mass distribution of nominal variables. Finally, a case study in the actuarial field is presented.

Riccardo Borgoni, Corrado Provasi
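One simple way to smooth a sparse ordinal table with a discrete kernel can be sketched as follows. The geometric kernel, its bandwidth and the toy counts are assumptions for illustration; the estimators studied in the paper may differ.

```python
def smooth_ordinal(counts, lam):
    # geometric discrete kernel: cell j borrows weight lam**|i-j| from cell i,
    # then the smoothed masses are renormalised to sum to one
    k = len(counts)
    raw = [sum(counts[i] * lam ** abs(i - j) for i in range(k)) for j in range(k)]
    total = sum(raw)
    return [r / total for r in raw]

# a sparse ordinal table: several cells have zero observed counts
probs = smooth_ordinal([5, 0, 0, 3, 0], 0.5)
```

With lam = 0 the raw relative frequencies are recovered, while larger lam spreads mass into the empty cells, which is the point of kernel estimation for sparse tables.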

Reduction of Prediction Error by Bagging Projection Pursuit Regression

In this paper we consider the application of Bagging to Projection Pursuit Regression and we study the impact of this technique on the reduction of prediction error. Using artificial and real-data sets, we investigate the predictive performance of this method with respect to the number of aggregated predictors, the number of functions in the single Projection Pursuit model and the signal-to-noise ratio of the sample data.

Simone Borra, Agostino Di Ciaccio
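The bagging scheme itself can be sketched independently of the base learner. Here a trivial mean predictor stands in for the Projection Pursuit Regression model used in the paper, and all names and data are illustrative:

```python
import random

def bagging_predict(X, y, x_new, fit, B=25, seed=0):
    # fit B models on bootstrap resamples and average their predictions
    rng = random.Random(seed)
    preds = []
    for _ in range(B):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        model = fit([X[i] for i in idx], [y[i] for i in idx])
        preds.append(model(x_new))
    return sum(preds) / B

def fit_mean(Xb, yb):
    # stand-in base learner (the paper aggregates Projection Pursuit models)
    m = sum(yb) / len(yb)
    return lambda x: m

X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 2.0, 3.0, 4.0]
pred = bagging_predict(X, y, [1.5], fit_mean)
```

Averaging over resamples reduces the variance component of the prediction error, which is why bagging helps most with unstable, high-variance learners such as Projection Pursuit Regression.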

Selecting Regression Tree Models: a Statistical Testing Procedure

This paper provides a statistical testing approach to the validation of the pruning process in regression tree construction. In particular, the testing procedure, based on the F distribution, is applied to the CART sequence of pruned subtrees, providing a single-tree prediction rule which is statistically reliable and might not coincide with any tree in the sequence.

Carmela Cappelli, Francesco Mola, Roberta Siciliano

Linear Fuzzy Regression Analysis with Asymmetric Spreads

We discuss a regression model for the study of asymmetrical fuzzy data and provide a method for numerical estimation of the relevant regression parameters. The proposed model is based on a new approach and has the capability to take into account the possible relationships between the size of the spreads and the magnitude of the centers of the fuzzy observations. Two illustrative examples are also presented.

Pierpaolo D’Urso, Tommaso Gastaldi

Spatial and Time Series Data Analysis

Frontmatter

Time Series Analysis

Forecasting Non-Linear Time Series: Empirical Evidence on Financial Data

This paper presents a preliminary comparison of the forecasting performance of alternative non-linear methods using daily returns from the Italian stock market. In particular, some non-linear models and non-parametric techniques are considered. The accuracy of the forecasts is evaluated using the sign prediction criterion, the mean square error and the mean absolute error.

Alessandra Amendola, Francesco Giordano, Cira Perna

Dynamics and Comovements of Regional Exports in Italy

This paper investigates the dynamic behaviour of regional exports in Italy. After aggregating the Italian regions into six macro-areas, we use a recently developed multivariate technique to analyse the dynamics and comovements of the corresponding quarterly export time series. Our empirical findings indicate that north-east exports move separately in the long run from those of the remaining eastern regions, and that the western regions are linked by a single long-run relation. Moreover, there is evidence of strong similarities between the propagation mechanisms of shocks within central and northern exports respectively.

Gianluca Cubadda, Pierluigi Daddi

Large-sample Properties of Neural Estimators in a Regression Model with ϕ-mixing Errors

In this paper the large-sample properties of neural network estimators in a regression model with ϕ-mixing errors are investigated. In particular, using the theory of M-estimators, it is proved that the minimum squared error estimators of the connection weights and of the fitted values are consistent and asymptotically normal.

Francesco Giordano, Cira Perna

Subseries Length in the MBB Procedure for α-mixing Processes

In this paper we propose a new procedure to determine the length of the subseries in the MBB bootstrap which takes into account the structure of the model and the strength of dependence in the observed series. It can be easily implemented and easily extended to much more complex structures: multivariate ARMA processes, non-linear models and STARMA processes.

Michele La Rocca, Cosimo Vitale
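The moving block bootstrap resampling step that the subseries length feeds into can be sketched as follows; the block length and toy series are illustrative values, and the rule for choosing the length is exactly what the paper addresses.

```python
import random

def mbb_resample(series, block_len, rng):
    # draw overlapping blocks of fixed length and concatenate to the original length
    n = len(series)
    out = []
    while len(out) < n:
        start = rng.randrange(n - block_len + 1)
        out.extend(series[start:start + block_len])
    return out[:n]

rng = random.Random(42)
boot = mbb_resample(list(range(10)), 3, rng)
```

Keeping observations together in blocks preserves the short-range dependence of the series inside each block, which is why the choice of block (subseries) length matters so much for dependent data.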

Modelling a Change of Classification by a Structural Time Series Approach

The change of classification problem for economic sectoral time series data is examined by a conversion matrix approach. A state space form for data reconstruction by structural time series models is proposed. The Doran (1992) methodology of constraining the Kalman filter to satisfy time varying restrictions is applied to increase efficiency of the estimates. Results of an application on Italian Quarterly Accounts are discussed.

Filippo Moauro

Spatial Data Analysis

Spatial Discriminant Analysis Using Covariates Information

The analysis of spatially distributed observations raises a number of theoretical problems due to the multidirectional dependence among neighbouring sites. The presence of such dependence often causes standard statistical methods, which are based on independence assumptions, to provide inefficient estimates or even to fail badly. This paper concerns the problem of discrimination and classification of spatial polytomous data. It extends the approach discussed by Alfò and Postiglione (1999) for binary observations to polytomous data, presents a discrimination function based on Markovian automodels and suggests a natural solution to the resulting allocation problem through a Gibbs sampler based procedure. The proposed approach is contrasted with standard logistic discrimination and applied to a real data set consisting of a remotely sensed image from the Nebrodi mountains (Italy).

Marco Alfò, Paolo Postiglione

Some Aspects of Multivariate Geostatistics

Cokriging allows the use of data on correlated variables to enhance the estimation of a primary variable or, more generally, to enhance the estimation of all variables. However, in order to apply the estimation procedure, some problems must be solved: the variogram matrix must be conditionally negative definite, and the cross-variogram must be modelled. The aim of this paper is to present a flexible solution to these problems. A case study is also presented.

Sandra De Iaco, Donato Posa

Exploring Multivariate Spatial Data: an Application to Election Data

In this paper we report a brief account of an application to spatial data, observed on an irregular grid, of an exploratory technique based on the diagonalization of cross-variogram matrices. Our aim is to describe the behavior of a multivariate set of spatial data in a dimensionally reduced space in such a way that the information on the spatial variation is preserved. We adapt an exploratory technique built for the analysis of quantitative data to frequency data. We give special attention to the choice of a distance measure that well describes the type of “connection” between sites considered in the analysis of this specific situation. The application aims at characterizing the districts of the city of Rome according to the electoral behavior of their inhabitants; special attention is given to the growing phenomenon of abstention.

Giovanna Jona-Lasinio, Fabio Mancuso

Measures of Distances for Spatial Data: an Application to the Banking Deposits in the Italian Provinces

A theoretical and practical approach is presented for extending both the Mahalanobis and the Euclidean distances to spatially correlated data. Departing from the consideration that in some cases the choice of a particular distance is suggested by the nature of the data, we propose new measures of distance for spatial observations based on the distances between the interpretative models of the data. An application to the banking deposits of the Italian provinces in 1995 is given.

Eugenia Nissi
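For reference, the classical (non-spatial) Mahalanobis distance that the paper extends can be written as below; the spatial extension itself is not reproduced here, and the toy covariance matrices are illustrative.

```python
def mahalanobis(x, y, cov_inv):
    # sqrt of (x-y)' S^-1 (x-y), with the inverse covariance S^-1 supplied directly
    d = [a - b for a, b in zip(x, y)]
    q = sum(d[i] * sum(cov_inv[i][j] * d[j] for j in range(len(d)))
            for i in range(len(d)))
    return q ** 0.5

euclid = mahalanobis([0, 0], [3, 4], [[1, 0], [0, 1]])      # identity: Euclidean case
scaled = mahalanobis([0, 0], [0, 4], [[1, 0], [0, 0.25]])   # variance 4 on the 2nd axis
```

With the identity covariance the measure reduces to the Euclidean distance, which makes explicit the relation between the two distances that the paper generalises to spatially correlated observations.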

Applications and Case Studies

Frontmatter

Statistical Analysis of Papal Encyclicals

The purpose of this work is to demonstrate some possibilities offered by a statistical investigation of theological matters. The object of this paper is a textual analysis of the corpus obtained from the encyclical letters of the last five popes. Statistical analysis of the documents makes it possible to arrive at conclusions, or questions, about “theology” in the encyclical letters.

Bruno Bisceglia, Alfredo Rizzi

Life Courses as Sequences: an Experiment in Clustering via Monothetic Divisive Algorithms

We consider the problem of clustering demographic life courses, consistently with the so-called sequence representation. We use a monothetic divisive algorithm, which allows better readability of the results than the increasingly common approach based on optimal matching analysis. The algorithm eases the interpretation of the splitting procedure and of the determinants of group membership when the data represent the occurrence or non-occurrence of non-renewable events. We then apply the algorithm to the transition to adulthood in Italy, using retrospective individual-level data from a Fertility and Family Survey.

Francesco C. Billari, Raffaella Piccarreta

Significance of the Classification for the Italian Service Sector Activities

In current economic studies on enterprises carried out at ISTAT (the Italian National Institute of Statistics), the selection of the domain of interest is mainly based on the Classification of Economic Activities (ATECO’91), which is an official classification. In this paper we analyse the coherence of ATECO’91 in representing microdata concerning economic variables. We then determine the statistical information, expressed by the coefficient CI, corresponding to different levels of detail of the above classification and to different characters. For this purpose we consider a set of data collected by the survey on the Economic Accounts of the Enterprises, including both structural and economic variables.

Anna Rita Giorgi, Roberto Moro

A Neural Net Model to Predict High Tides in Venice

In this research we design and apply a neural network model to predict the tidal levels in the Venetian lagoon. We use an evolutionary computational approach to select the net topology within the class of multilayered feedforward networks. We build a genetic algorithm, which evolves both the number of predictors and the best set of predictors for the model. The results of this approach are compared to the results we achieve with a linear model based on the same set of candidate predictor variables, whose specification is also obtained with a genetic algorithm. The predictions resulting from the genetically evolved neural net model are more accurate for both tidal levels and extreme values (“the high waters”).

Tommaso Minerva, Irene Poli

Backmatter
