
About this Book

This volume contains a selection of papers presented at the Seventh Conference of the International Federation of Classification Societies (IFCS-2000), which was held in Namur, Belgium, July 11-14, 2000. From the originally submitted papers, a careful review process involving two reviewers per paper led to the selection of 65 papers that were considered suitable for publication in this book. The present book contains original research contributions, innovative applications and overview papers in various fields within data analysis, classification, and related methods. Given the fast publication process, the research results are still up to date and coincide with their actual presentation at the IFCS-2000 conference. The topics covered are:

• Cluster analysis
• Comparison of clusterings
• Fuzzy clustering
• Discriminant analysis
• Mixture models
• Analysis of relationships data
• Symbolic data analysis
• Regression trees
• Data mining and neural networks
• Pattern recognition
• Multivariate data analysis
• Robust data analysis
• Data science and sampling

The IFCS (International Federation of Classification Societies) promotes the dissemination of technical and scientific information concerning data analysis, classification, related methods, and their applications.



Cluster Analysis


Cluster Analysis and Mixture Models

Classifier Probabilities

In statistical clustering, we usually devise probability models that begin by specifying joint distributions of data and possible classifications and end in reporting classifications that are probable given the data. Yet the art and practice of classification is more fundamental and prior to probabilistic analysis, and so it is worthwhile to ask how one might derive probabilities from classifications, rather than derive classifications from probabilities. In this scheme, a classifier is assumed able to express any knowledge as a classification consisting of a number of statements of the form x ∈ y, in words, x is a member of y. We specify an inductive probability distribution over all such classifications. Probabilities for future outcomes are determined by the probabilities of the classifications formed by the classifier corresponding to those outcomes. Particular examples studied are coin tossing, recognition, the globular cluster Messier 5, and the next president of the United States.

J. A. Hartigan

Cluster Analysis Based on Data Depth

A data depth depth(y, X) measures how deep a point y lies in a set X. The corresponding α-trimmed regions D_α(X) = {y : depth(y, X) ≥ α} are monotonely decreasing in α, that is, α > β implies D_α ⊂ D_β. We introduce clustering procedures based on weighted averages of volumes of α-trimmed regions. The hypervolume method turns out to be a special case of these procedures. We investigate the performance in a simulation study.

Richard Hoberg
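As an illustration of the depth idea in the abstract above (not Hoberg's actual procedure), the sketch below uses Mahalanobis depth, one common choice of data depth, and approximates an α-trimmed region by the sample points whose depth is at least α; the nesting of the regions in α can then be checked directly.

```python
import numpy as np

def mahalanobis_depth(y, X):
    """Mahalanobis depth of point y w.r.t. data set X:
    1 / (1 + (y - mean)' Cov^{-1} (y - mean)); maximal (= 1) at the mean."""
    mu = X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d = y - mu
    return 1.0 / (1.0 + d @ S_inv @ d)

def alpha_trimmed_region(X, alpha):
    """Sample points whose depth is at least alpha: a discrete
    proxy for the trimmed region D_alpha."""
    depths = np.array([mahalanobis_depth(x, X) for x in X])
    return X[depths >= alpha]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
inner = alpha_trimmed_region(X, 0.4)
outer = alpha_trimmed_region(X, 0.2)
# Monotonicity: a larger alpha gives a smaller (nested) region.
```

The volumes (here, point counts) of such nested regions are what the weighted-average clustering criteria in the paper are built from.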

An Autonomous Clustering Technique

The basic idea of this paper is that it is possible to construct clusters by moving each pair of objects closer together or farther apart according to their relative similarity to all of the objects. For this purpose, regarding the set of objects as a set of autonomous agents, each agent decides its actions toward the other agents by taking account of the similarity between itself and the others. Consequently, we obtain the clusters autonomously.

Yoshiharu Sato

Unsupervised Non-hierarchical Entropy-based Clustering

We present an unsupervised non-hierarchical clustering method which realizes a partition of unlabelled objects into K non-overlapping clusters. The interest of this method rests on the convexity of the entropy-based clustering criterion, which is demonstrated here. This criterion makes it possible to reach an optimal partition, independently of the initial conditions, with a step-by-step iterative Monte-Carlo process. Several data sets serve to illustrate the main properties of this clustering.

M. Jardino
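A minimal sketch of an entropy-based criterion of this kind, with a greedy Monte-Carlo reassignment loop; the toy word-count objects and the exact criterion (cluster size times within-cluster entropy) are assumptions for illustration, not Jardino's formulation.

```python
import math, random

def criterion(clusters):
    """Entropy criterion: sum over clusters of n_k * H_k, where H_k is
    the entropy of the pooled word distribution inside cluster k."""
    total = 0.0
    for members in clusters:
        counts, n = {}, 0
        for obj in members:
            for word, c in obj.items():
                counts[word] = counts.get(word, 0) + c
                n += c
        if n:
            total += -sum(c * math.log(c / n) for c in counts.values())
    return total

# Toy objects: word-count vectors drawn from two obvious "topics".
objs = [{"a": 3}, {"a": 2, "b": 1}, {"b": 3}, {"b": 2, "a": 1}]
K = 2

def clusters_of(assign):
    cl = [[] for _ in range(K)]
    for obj, k in zip(objs, assign):
        cl[k].append(obj)
    return cl

random.seed(1)
assign = [random.randrange(K) for _ in objs]
start = criterion(clusters_of(assign))
for _ in range(200):                       # step-by-step Monte-Carlo search
    i, k = random.randrange(len(objs)), random.randrange(K)
    trial = assign[:]; trial[i] = k
    if criterion(clusters_of(trial)) <= criterion(clusters_of(assign)):
        assign = trial                     # keep non-worsening moves
best = criterion(clusters_of(assign))
```

Each accepted move can only lower (or preserve) the criterion, so the search never ends worse than it started.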

Improving the Additive Tree Representation of a Dissimilarity Matrix Using Reticulations

This paper addresses the problem of approximating a dissimilarity matrix by means of a reticulogram. A reticulogram represents an evolutionary structure in which the objects may be related in a non-unique way to a common ancestor. Dendrograms and additive (phylogenetic) trees are particular cases of reticulograms. The reticulogram is obtained by adding edges (reticulations) to an additive tree, gradually improving the approximation of the dissimilarity matrix. We constructed a reticulogram representing the evolution of 12 primates. The reticulogram not only improved the data approximation provided by the phylogenetic tree, but also depicted the homoplasy contained in the data, which cannot be expressed by a tree topology. The algorithm for reconstructing reticulograms is part of the T-Rex software package, available at URL <>.

Vladimir Makarenkov, Pierre Legendre

Double Versus Optimal Grade Clusterings

Two clustering methods based on grade correspondence analysis will be compared on a real data example. Special attention will be paid to the interpretation aspects versus the formal inference based on clustering quality measures. The discussed example shows that formally similar solutions may differ significantly from the interpretation point of view.

Alicja Ciok

The Effects of Initial Values and the Covariance Structure on the Recovery of some Clustering Methods

Some clustering methods are compared in a simulation study. The data used in the analysis are generated in a mixture modeling framework. The methods included are some hierarchical methods, k-means as implemented in the FASTCLUS procedure of SAS, and cluster analysis by means of normal mixtures with the NORMIX program. We demonstrate that the poor recovery found in some studies for the normal mixture type of clustering is partly due to the use of bad initial values and partly due to the specification of the covariance structure within the cluster. We further find that an important factor in the relative success of FASTCLUS lies in the initial seed selection.

Istvan Hajnal, Geert Loosveldt

What Clusters Are Generated by Normal Mixtures?

Model based cluster analysis is often carried out by estimation of the parameters of a normal mixture. But mixture components do not necessarily reflect the idea of a “cluster”. I discuss how to formalize the concept of “clusters” w.r.t. probability distributions on the real line by means of fixed point clusters, i.e., sets that do not contain any outlier and with respect to which the rest of the real line consists of outliers. The concept is applied to some normal mixtures.

Christian Hennig

A Bootstrap Procedure for Mixture Models

A bootstrap procedure useful in latent class models has been developed to determine the sufficient number of latent classes required to account for systematic group differences in the data. The procedure is illustrated in the context of a multidimensional scaling latent class model, CLASCAL. Real and artificial data are presented. The bootstrap procedure for selecting a sufficient number of classes appears to select the correct number of latent classes at both low and high error levels. At higher error levels it outperforms Hope’s (1968) procedure.

Suzanne Winsberg, Geert deSoete

Fuzzy Clustering

A New Criterion for Obtaining a Fuzzy Partition from a Hybrid Fuzzy/Hierarchical Clustering Method

Classical fuzzy clustering methods are not able to compute a partition of a set of points when classes have non-convex shapes. Furthermore, we know that in this case the usual criteria of class validity, such as fuzzy hypervolume or compactness-separability, do not allow one to find the optimal partition. The purpose of our paper is to provide a criterion allowing one to find the optimal fuzzy partition in a set of points including classes of any shape. To that effect, we use the Fuzzy C-Means algorithm to divide the set of points into an overspecified number of subclasses. A fuzzy relation is established between them in order to extract the structure of the set of points. The subclasses are merged according to this relation, and the criterion that we propose allows one to find the optimal regrouping.

Arnaud Devillez, Patrice Billaudel, Gérard Villermain Lecolier

Application of Fuzzy Mathematical Morphology for Unsupervised Color Pixels Classification

In this paper, we present a new color image segmentation algorithm which is based on fuzzy mathematical morphology. After a color pixel projection into an attribute space, segmentation consists of detecting the different modes associated with homogeneous regions. In order to detect these modes, we show how a color image can be viewed as a fuzzy set with its associated membership function corresponding to a mode which is defined by a color cooccurrence matrix and by mode concavity properties. A new developed fuzzy morphological transformation is then applied to this membership function in order to identify the modes. The performance of our proposed fuzzy morphological approach is then presented using a test color image, and is then compared to the competitive learning algorithm.

A. Gillet, C. Botte-Lecocq, L. Macaire, J.-G. Postaire

A Hyperbolic Fuzzy k-Means Clustering Algorithm for Neural Networks

A new fuzzy k-means clustering algorithm is proposed by introducing crisp regions of clusters. The boundaries of the regions are determined by hyperbolas, and membership values are one or zero in each region. The area between crisp regions is a fuzzy region, where membership values are proportional to the distances to the crisp regions. Though the traditional hard k-means is a limit of the usual fuzzy k-means, the results of the latter are fuzzy and thus not the same as the results of the former. The new method, on the other hand, can produce the same results as the hard k-means. An algorithm for neural networks is given and a numerical example is illustrated.

Norio Watanabe, Tadashi Imaizumi, Toshiko Kikuchi

Special Purpose Classification Procedures and Applications

A Toolkit for Development of the Domain-Oriented Dictionaries for Structuring Document Flows

An approach to thematic document classification, clusterization and investigation of document flows and collections based on domain-oriented dictionaries (DODs) is considered. It is simple enough to be used by, say, a secretary that frequently needs to classify and search large amounts of documents. However, for good results, such an approach requires a solid technology for construction and maintenance of the DODs; this task is to be performed by experts or advanced users. A DOD represents a specific subject topic and is constructed on the basis of the analysis of a collection of documents representing this topic, selected by a group of experts. The toolkit facilitates the development of a hierarchical system of DODs by the application of a set of heuristic criteria for the selection of the keywords from such a document collection representing one subject domain. In the paper, the application of the DODs developed with the toolkit for information retrieval is illustrated with examples.

Pavel P. Makagonov, Mikhail A. Alexandrov, Konstantin Sboychakov

Classification of Single Malt Whiskies

Tasting notes in 10 recently published books on malt whisky were coded and analysed for 84 single malt whiskies. Over 400 aromatic and taste descriptors were identified and grouped into 12 sensory features, from which a synonymy of the whisky literature was developed. The 84 malt whiskies were then clustered into 10 groups using the FocalPoint clustering method in ClustanGraphics. An industry survey to validate the classification is described, and applications in product design, brand management and marketing are discussed. A tutored tasting of selected single malt whiskies follows the technical presentation.

David Wishart

Robust Approach in Hierarchical Clustering: Application to the Sectorisation of an Oil Field

Production data of oil fields are provided as decline curves (oil and water production vs. time), which the user wants to gather into a limited number of clusters. Preprocessing of the data is required to remove noise and provides a complete data set, involving, for each statistical unit (well), the extraction of attributes from smoothed or modeled curves. Hierarchical clustering is performed in two steps to avoid small or outlier clusters; first the centroid clustering method is used to recognize and then discard clusters having a low frequency, and this is followed by application of the Ward method. Finally, using the central part of these previous (Ward) clusters, discriminant analysis is performed, including all the discarded units. This sequence avoids the disturbing influence of outlying units and also gives the probability for each unit to be classified in the clusters.

Jean-Paul Valois

A Minimax Solution for Sequential Classification Problems

The purpose of this paper is to derive optimal rules for sequential classification problems. In a sequential classification test, for instance, in an educational context, the decision is to classify a student as a master, a partial master, a nonmaster, or continue testing and administering another random item. The framework of minimax sequential decision theory is used by minimizing the maximum expected losses associated with all possible decision rules at each stage of testing. The main advantage of this approach is that costs of testing can be explicitly taken into account.

Hans J. Vos
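A one-stage version of the minimax rule described above can be sketched as follows; the loss table is hypothetical, and the full sequential procedure would additionally compare these worst-case losses against the cost of administering another item at each stage.

```python
# Hypothetical loss table: rows are decisions, columns are the true
# mastery states (nonmaster, partial master, master).
losses = {
    "nonmaster":      [0, 5, 10],
    "partial master": [4, 0, 4],
    "master":         [10, 5, 0],
}

def minimax_decision(losses, testing_cost=0.0):
    """Minimax rule: pick the decision whose worst-case loss (plus the
    cost of the items already administered) is smallest."""
    worst = {d: max(row) + testing_cost for d, row in losses.items()}
    decision = min(worst, key=worst.get)
    return decision, worst[decision]

decision, value = minimax_decision(losses)
```

With this table, "partial master" has the smallest worst-case loss (4 versus 10 for the two extreme classifications), which mirrors the intuition that the middle decision is the safest when nothing is yet known about the student.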

Verification and Comparison of Clusterings

Comparison of Ultrametrics Obtained with Real Data, Using the P_L and VAL_Aw Coefficients

We compare 20 ultrametric matrices generated by the classifications obtained from 20 similarity indices for binary variables on the same group of data, which were studied by Hubálek (1982). To measure the similarity between the ultrametric matrices we use the P_L coefficient proposed by Le Calvé (1977) and the validity of affinity coefficient, VAL_Aw, proposed by Bacelar-Nicolau (1988). By means of hierarchical cluster analysis and principal component analysis on the similarity matrices obtained with these two coefficients, we draw conclusions about the 20 similarity indices and compare the results for the P_L and VAL_Aw coefficients. The results obtained with these two coefficients are very similar and are also similar to the results obtained by Hubálek. Finally, we introduce into this ultrametrics/coefficients comparative study the simple matching coefficient of Sokal and Michener (1958) and observe, using the P_L or VAL_Aw coefficients, its particular behaviour in relation to the other indices.

Isabel Pinto Doria, Georges Le Calvé, Helena Bacelar-Nicolau

Numerical Comparisons of two Spectral Decompositions for Vertex Clustering

We study multi-way partitioning algorithms of a hypergraph which are based on its prior transformation into a geometric object by constructing a one-to-one mapping between the vertex set and a point set in a Euclidean space. The coordinates of the points are generated by a spectral decomposition of a positive semi-definite matrix. Here, we compare the decomposition of the discrete Laplacian of a graph associated with the hypergraph to that of the Torgerson matrix associated with a dissimilarity coefficient. Numerical results are presented on standard test cases of large sizes from the integrated circuit design literature.

P. Kuntz, F. Henaux

Measures to Evaluate Rankings of Classification Algorithms

Due to the wide variety of algorithms for supervised classification originating from several research areas, selecting one of them to apply on a given problem is not a trivial task. Recently several methods have been developed to create rankings of classification algorithms based on their previous performance. Therefore, it is necessary to develop techniques to evaluate and compare those methods. We present three measures to evaluate rankings of classification algorithms, give examples of their use and discuss their characteristics.

Carlos Soares, Pavel Brazdil, Joaquim Costa
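The three measures themselves are not spelled out in this abstract; as an illustrative stand-in, one natural way to score a recommended ranking of algorithms against the ranking actually observed on a new problem is Spearman's rank correlation.

```python
def spearman(rank_a, rank_b):
    """Spearman rank correlation between two rankings of the same
    algorithms (each given as a list of ranks 1..n)."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

observed    = [1, 2, 3, 4, 5]  # ranking by measured performance
recommended = [2, 1, 3, 4, 5]  # ranking produced by a ranking method
score = spearman(observed, recommended)
```

A score of 1 means the recommendation reproduces the observed ranking exactly; swapping only the two best algorithms, as above, still yields a high score.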

A General Approach to Test the Pertinence of a Consensus Classification

Many techniques have been proposed to combine classifications defined on the same set of objects. All the methods that have been developed are designed to return a solution, but validation of the solution is seldom performed. In this paper we propose a general approach to test the pertinence of a consensus classification and discuss the choices that one has to make at each step of the method.

Guy Cucumel, François-Joseph Lapointe

Dissimilarity Measures

On a Class of Aggregation-invariant Dissimilarities Obeying the Weak Huygens’ Principle

We propose a complete characterization of a certain class of aggregation-invariant dissimilarities between row (or column) profiles. This class (for which row and column dispersions coincide) contains the chi-square, ratio, Kullback-Leibler, Hellinger, Cressie-Read dissimilarities, as well as a presumably new “type s” class of dissimilarities. Distinguishing between two forms of Huygens’ principle from Classical Mechanics, we show “type s” dissimilarities to satisfy the weak Huygens’ principle; the strong Huygens’ principle however holds for a single member of the class, namely the chi-square dissimilarity. Extending the concept of dissimilarity to “type s” divergences restores the strong principle.

F. Bavaud
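For concreteness, the chi-square dissimilarity between row profiles, the one member of the class stated above to satisfy the strong Huygens' principle, can be computed as follows; the small contingency table is an assumed example.

```python
def row_profiles(table):
    """Normalize each row of a contingency table to a profile."""
    return [[v / sum(row) for v in row] for row in table]

def chi_square_dissimilarity(p, q, col_weights):
    """Chi-square dissimilarity between two row profiles; each column's
    squared difference is weighted by the inverse of its marginal weight."""
    return sum((pi - qi) ** 2 / w for pi, qi, w in zip(p, q, col_weights))

table = [[10, 20, 30], [30, 20, 10]]
total = sum(map(sum, table))
col_w = [sum(row[j] for row in table) / total for j in range(3)]
p, q = row_profiles(table)
d = chi_square_dissimilarity(p, q, col_w)
```

Here the two profiles are mirror images of each other, and the weighting by column marginals is exactly what distinguishes the chi-square member from, say, the plain Euclidean distance between profiles.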

A Short Optimal Way for Constructing Quasi-ultrametrics From Some Particular Dissimilarities

Recently, Diatta established a lower maximal quasi-ultrametric approximation of a dissimilarity fulfilling the inclusion condition. The approach is purely algorithmic, but incidentally the solution is characterised by a formula. From this formula, we give here two straightforward and short proofs of the result. One is based on the properties of the dissimilarities under consideration, and the second derives from the bijection between quasi-ultrametrics and indexed quasi-hierarchies.

B. Fichet

Missing Data in Cluster Analysis

Estimating Missing Values in a Tree Distance

In phylogeny, one tries to approximate a given dissimilarity by a tree distance. In some cases, especially when comparing biological sequences, some dissimilarity values cannot be evaluated, and only a partial dissimilarity with undefined values is available. In that case one can develop a sequential method to reconstruct a weighted tree or to evaluate the missing values using a tree model. In this paper we study the latter approach and measure the quality of the estimated values using simulated noisy tree distances.

A. Guénoche, S. Grandcolas

Estimating Trees From Incomplete Distance Matrices: A Comparison of Two Methods

In the present paper, we compare two methods (TRIANGLE and MW) for estimating trees from incomplete distance matrices through simulations. Our results illustrate that MW performs better for recovering path-length distances, whereas TRIANGLE is superior in terms of topological recovery. Recommendations are provided as to which method should be used with real experimental data.

Claudine Levasseur, Pierre-Alexandre Landry, François-Joseph Lapointe

Zero Replacement in Compositional Data Sets

The sample space of compositional data is the open simplex. Therefore, zeros in a compositional data set are either identified with below-detection-limit values or lead to a division of the data set into different subpopulations with correspondingly lower-dimensional sample spaces. Most multivariate data analysis techniques require complete data matrices, thus calling for a strategy of imputation of zeros in the first case. Existing replacement methods for rounded zeros are reviewed, and a new method is proposed, whose properties are analyzed and illustrated. The method is applied in a hierarchical cluster analysis of compositional data.

J. A. Martín-Fernández, C. Barceló-Vidal, V. Pawlowsky-Glahn
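One standard replacement strategy for rounded zeros is multiplicative replacement, sketched below for a composition summing to 1; the abstract does not say whether this coincides with the paper's new method, so treat it as an illustration of the problem being solved.

```python
def multiplicative_replacement(x, delta):
    """Replace each (rounded) zero in the composition x by delta and
    shrink the non-zero parts multiplicatively so that the parts
    still sum to 1. Ratios between non-zero parts are preserved."""
    z = sum(1 for v in x if v == 0)
    return [delta if v == 0 else v * (1 - z * delta) for v in x]

x = [0.5, 0.3, 0.2, 0.0]        # a 4-part composition with one zero
y = multiplicative_replacement(x, 0.01)
```

Preserving the ratios between the non-zero parts matters because most compositional techniques (log-ratio transforms in particular) operate on those ratios.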

EM Algorithm for Partially Known Labels

Mixture models are widely used for clustering or discrimination problems. Estimating the parameters of such models can be viewed as an incomplete data problem and has thus often been handled by the Expectation-Maximization (EM) algorithm. It has been shown that this method can integrate additional information such as the label of some observations. In this paper we propose a generalization of this approach which can take into account partial information about the observation labels. An example illustrates the relevance of the proposed method for mixture density estimation.

C. Ambroise, G. Govaert
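A minimal sketch of EM with label information, for the simplest special case in which a subset of observations has fully known labels (the generalization described above handles partial information about labels); a two-component one-dimensional Gaussian mixture is assumed.

```python
import math, random

def em_partial_labels(x, labels, n_iter=50):
    """EM for a 2-component 1-D Gaussian mixture. labels[i] is 0 or 1
    if observation i is labeled, None otherwise. Labeled points get
    their E-step responsibility clamped to the known label."""
    mu, sigma, pi = [min(x), max(x)], [1.0, 1.0], [0.5, 0.5]
    for _ in range(n_iter):
        resp = []                                   # E-step
        for xi, lab in zip(x, labels):
            if lab is not None:
                r = [0.0, 0.0]; r[lab] = 1.0        # known label
            else:
                dens = [pi[k] / (sigma[k] * math.sqrt(2 * math.pi))
                        * math.exp(-0.5 * ((xi - mu[k]) / sigma[k]) ** 2)
                        for k in (0, 1)]
                s = sum(dens)
                r = [d / s for d in dens]
            resp.append(r)
        for k in (0, 1):                            # M-step
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(x)
            mu[k] = sum(r[k] * xi for r, xi in zip(resp, x)) / nk
            var = sum(r[k] * (xi - mu[k]) ** 2
                      for r, xi in zip(resp, x)) / nk
            sigma[k] = math.sqrt(max(var, 1e-6))
    return mu, sigma, pi

random.seed(0)
x = ([random.gauss(0, 1) for _ in range(100)]
     + [random.gauss(5, 1) for _ in range(100)])
labels = [0] * 5 + [None] * 95 + [1] * 5 + [None] * 95  # 5 labels per class
mu, sigma, pi = em_partial_labels(x, labels)
```

Only ten of the 200 observations are labeled, yet clamping their responsibilities anchors the components, so the estimated means land near the true values of 0 and 5.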

Discrimination, Regression Trees, and Data Mining


Discriminant Analysis

Detection of Company Failure and Global Risk Forecasting

Work on detecting and monitoring company risk has greatly increased over the last ten years, both in credit reference agencies or associations of credit managers and in banks or public bodies which monitor the national and international economic situation. The Banque de France’s work on risk is part of its credit supervision and banking system refinancing responsibilities. Since 1982, the Banque de France’s research into operational credit scoring has intensified with the use of increasingly vast and reliable databases. The aim is to describe risk in statistical terms and create medium-term forecasts for firms in France. The tools created provide both an individual risk diagnosis for each firm and an overview of company risk as a whole. This paper gives a progress report on the work carried out and some indications of the prospects for the future.

Mireille Bardos

Discriminant Analysis by Hierarchical Coupling in EDDA Context

Friedman (1996) proposed a strategy for the multigroup classification problem. He builds a classifier independently for each pair of classes and then combines all the pairwise decisions to form the final decision. We suggest an alternative approach in the context of EDDA models. Our technique, hierarchical coupling, is also based on pairwise decisions, but we abandon independence and work on nested pairs of classes. We evaluate the performance of hierarchical coupling on simulated and real datasets.

Isabel Brito, Gilles Celeux

Discrete Discriminant Analysis: The Performance of Combining Models by a Hierarchical Coupling Approach

We are concerned with combining models in discrete discriminant analysis in the multiclass (K > 2) case. Our approach consists of decomposing the multiclass problem into several biclass problems embedded in a binary tree. The affinity coefficient (Matusita (1955); Bacelar-Nicolau (1981, 1985)) is proposed for the choice of the hierarchical couples, at each level of the tree, among all possible forms of merging. For the combination of models we consider a single coefficient, a measure of the relative performance of models, the integrated likelihood coefficient (Ferreira et al., 1999), and we evaluate its performance.

Ana Sousa Ferreira, Gilles Celeux, Helena Bacelar-Nicolau

Discrimination Based on the Atypicity Index versus Density Function Ratio

We propose a method of discrimination based on the atypicity index and the density function. After a short survey of the atypicity index, we show that the presence of “critical regions” when we apply Bayesian quadratic discrimination, under some hypotheses, leads to misclassifications. The performance of the proposed method versus quadratic and linear discrimination is assessed via simulation. It is generally shown that discrimination based on the ratio (atypicity index/density function) consistently yields a noticeably higher percentage of well-classified individuals relative to the traditional methods. The method is illustrated with a numerical example and is compared to quadratic discrimination.

H. Chamlal, S. Slaoui Chah

Decision and Regression Trees

A Third Stage in Regression Tree Growing: Searching for Statistical Reliability

This paper suggests the introduction of a third stage in the regression tree growing approach. To this aim, a statistical testing procedure based on the F statistic is proposed. In particular, the testing procedure is applied to the CART sequence of pruned subtrees, resulting in a single final tree-structured prediction rule which is statistically reliable and might not coincide with any tree in the sequence itself.

Carmela Cappelli, Francesco Mola, Roberta Siciliano

A New Sampling Strategy for Building Decision Trees from Large Databases

We propose a fast and efficient sampling strategy to build decision trees from a very large database, even when there are many numerical attributes which must be discretized at each step. Successive samples are used, one on each tree node. Applying the method to a simulated database (virtually infinite size) confirms that when the database is large and contains many numerical attributes, our strategy of fast sampling on each node (with sample size about n = 300 or 500) speeds up the mining process while maintaining the accuracy of the classifier.

J. H. Chauchat, R. Rakotomalala

Generalized Additive Multi-Model for Classification and Prediction

In this paper we introduce a methodology based on a combination of classification/prediction procedures derived from different approaches. In particular, starting from a general definition of a classification/prediction model named Generalized Additive Multi-Model (GAM-M) we will demonstrate how it is possible to obtain different types of statistical models based on parametric, semiparametric and nonparametric methods. In our methodology the estimation procedure is based on a variant of the backfitting algorithm used for Generalized Additive Models (GAM). The benchmarking of our methodology will be shown and the results will be compared with those derived from the applications of GAM and Tree procedures.

Claudio Conversano, Roberta Siciliano, Francesco Mola

Radial Basis Function Networks and Decision Trees in the Determination of a Classifier

In this paper a nonparametric classifier which combines radial basis function networks and binary classification trees is proposed. The joint use of the two methods may be preferable not only with respect to radial basis function networks, but also to recursive partitioning techniques, as it may help to integrate the knowledge acquired by the single classifiers. A simulation study, based on a two-class problem, shows that this method represents a valid solution, particularly in the presence of noise variables.

Rossella Miglio, Marilena Pillati

Clustered Multiple Regression

This paper describes a new method for dealing with multiple regression problems. This method integrates a clustering technique with regression trees, leading to what we have named as clustered regression trees. We use the clustering method to form sub-samples of the given data that are similar in terms of the predictor variables. By proceeding this way we aim at facilitating the subsequent regression modeling process based on the assumption of a certain smoothness of the regression surface. For each of the found clusters we obtain a different regression tree. These clustered regression trees can be used to predict the response value for a query case by an averaging process based on the cluster membership probabilities of the case. We have carried out a series of experimental comparisons of our proposal that have shown a significant predictive accuracy advantage over the use of a single regression tree.

Luis Torgo, J. Pinto da Costa

Neural Networks and Data Mining

Constructing Artificial Neural Networks for Censored Survival Data from Statistical Models

A general approach to the design and training of ANNs for censored survival data is presented, with statistical models used as building blocks. This provides efficient initialization and an aid to interpretation.

Antonio Ciampi, Yves Lechevallier

Visualisation and Classification with Artificial Life

Systems that possess the ability of emergence through self-organization are a particularly promising approach to data mining. In this paper, we describe a novel approach to emergent self-organizing systems: artificial life forms, called DataBots, simulated in a computer, show collective behavioural patterns that correspond to structural features in a high-dimensional input space. Movement strategies for DataBots have been found and tested on a real-world data set. Important structural properties could be found and visualized through the collective organisation of the artificial life forms.

Alfred Ultsch

Pattern Recognition and Geometrical Statistics

Exploring the Periphery of Data Scatters: Are There Outliers?

Outliers are observations that are particularly discordant with respect to others, lying hence on the periphery of the data region. In the literature, many tools have been proposed with the aim of detecting multiple outliers. Most of the recent and attractive methods are based on some measure of the distance of each data point from a center. However, they are really effective only if the shape of the data scatter is symmetrical with respect to such a center. Otherwise, asymmetry will make these measures misleading. For this reason, we propose a method that allows direct exploration of the periphery of the data scatter, without considering any center. The methodology we propose is based on a two-step procedure that exploits the sample convex hull and radial projections. It explores gaps in the data scatter and proximities to its boundary, highlighting how the data structure is sparse at its periphery. A complementary graphical display is finally offered as a useful tool to visualize boundary features.

Giovanni C. Porzio, Giancarlo Ragozini

Discriminant Analysis Tools for Non Convex Pattern Recognition

Estimation of non-convex domains when inside and outside observations are available is often needed in current research applications. The key idea of this paper is to propose a solution based on convex and discriminant analysis tools, even when non-convex domains are considered. Simulations are done, and comparisons are made with a natural candidate for the estimation of non-convex bodies, based on the Voronoï tessellation. However, this candidate has irregularity problems. The question of how to get a smooth estimate of the unknown non-convex domain is the core of this research. Our solution gives a smooth estimate of the domain and a gain of around 40 percent with respect to the symmetric difference criterion.

Marcel Rémon

A Markovian Approach to Unsupervised Multidimensional Pattern Classification

This paper proposes a new method for core cluster detection prior to unsupervised automatic classification. Based upon a Markov random field model, this approach transforms the set of multidimensional observations into a normalised discrete binary set, which represents the observable field. The field of classes is then represented by connected components corresponding to the cores, or prototypes, inside the samples. Classification results on artificially generated data are compared with results obtained by a classical clustering method.

A. Sbihi, A. Moussa, B. Benmiloud, J.-G. Postaire

Multivariate and Multidimensional Data Analysis


Multivariate Data Analysis

An Algorithm with Projection Pursuit for Sliced Inverse Regression Model

In this paper, we investigate a conditional density function of sliced response variables and propose an algorithm for the sliced inverse regression (SIR) model with projection pursuit. The SIR model is a general model for dimension reduction of explanatory variables in regression analysis. Several algorithms for the SIR model have been proposed: SIR, SIR2, and bivariate SIR. We apply these algorithms to some typical data sets; they cannot find suitable reductions for all of the data sets. The proposed algorithm obtains reasonable results for all of them.

Masahiro Mizuta, Hiroyuki Minami

Testing Constraints and Misspecification in VAR-ARCH Models

Vector autoregressive models with conditional heteroskedastic errors (abbreviated as VAR-ARCH models) have become increasingly important for applications in financial econometrics. In this paper, we propose likelihood ratio and Wald tests for constraints and the White (1982) misspecification test for VAR-ARCH models which are estimated by the maximum likelihood (ML) method. The tests are discussed for a general class of multivariate conditional heteroskedastic time series models including the VAR-ARCH models. We derive the exact analytic expression for the gradient vector and the conditional information matrix from the log-likelihood function under the normality assumption.

Wolfgang Polasek, Shuangzhe Liu

Goodness of Fit Measure Based on Sample Isotone Regression of Mokken Double Monotonicity Model

Based on the concepts of the Mokken Double Monotonicity model (1971, 1997) and Sample Isotone Regression (Barlow, Bartholomew, Bremner & Brunk, 1972), a goodness of fit measure for the model is defined. It permits interpretation of the global deviation from Double Monotonicity in a set of dichotomous response items. To this end, based on the order induced by the difficulty of the items, the disparity function associated with the proportions of positive and negative responses to pairs of items (given in the matrices P11 and P00) is defined. In each matrix, the global deviation from Double Monotonicity is obtained as the sum of discrepancies between the proportions of responses observed on pairs of items and the disparities associated with these proportions.

Teresa Rivas Moya

Multiway Data Analysis

Fuzzy Time Arrays and Dissimilarity Measures For Fuzzy Time Trajectories

In this paper we define a fuzzy extension of a time array. The algebraic and geometric characteristics of the fuzzy time array are analyzed. Furthermore, considering the object space ℜ^(J+1), where J is the number of variables and the remaining dimension is related to time, we suggest different dissimilarity measures for fuzzy time trajectories.

Renato Coppi, Pierpaolo D’Urso

Three-Way Partial Correlation Measures

Analysis of linear relations between variables, given a third one, can be investigated for three-way three-mode data, by defining new measures of linear dependence between occasions. In this paper, two partial correlation coefficients between matrices are proposed. Their properties are analyzed, in particular with respect to the absence of conditional linear dependence.

Donatella Vicari

Analysis of Network and Relationship Data and Multidimensional Scaling

Statistical Models for Social Networks

Recent developments in statistical models for social networks reflect an increasing theoretical focus in the social and behavioral sciences on the interdependence of social actors in dynamic, network-based social settings (e.g., Abbott, 1997; White, 1992, 1995). As a result, growing importance has been accorded to the problem of modeling the dynamic and complex interdependencies among network ties and the actions of the individuals whom they link. Included in this problem is the identification of cohesive subgroups, or classifications of the individuals. The early focus of statistical network modeling on the mathematical and statistical properties of Bernoulli and dyad-independent random graph distributions has now been replaced by efforts to construct theoretically and empirically plausible parametric models for structural network phenomena and their changes over time.

Stanley Wasserman, Philippa Pattison

Application of Simulated Annealing in some Multidimensional Scaling Problems

We apply simulated annealing as a combinatorial optimization heuristic in several multidimensional scaling (MDS) contexts for the minimization of Stress: metric MDS, MDS with restrictions on the configuration, and INDSCAL parameter estimation. The application of this technique is based on a discretization of the representation space by a grid. The results obtained are compared with those of well-known algorithms and are shown to be better in most cases.

Javier Trejos, William Castillo, Jorge González, Mario Villalobos
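The grid-discretized annealing idea can be illustrated for plain metric MDS. The sketch below, with the neighbourhood moves and geometric cooling schedule chosen by us rather than taken from the paper, minimizes raw Stress by Metropolis moves of one object at a time on an integer grid.

```python
import numpy as np

def anneal_mds(D, grid, n_iter=20000, t0=1.0, cooling=0.9995, seed=0):
    """Simulated annealing for metric MDS on a discretized plane:
    each object occupies a grid point; a move shifts one object to a
    neighbouring point and is accepted by the Metropolis rule."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    pos = rng.integers(0, grid, size=(n, 2)).astype(float)

    def stress(P):
        diff = np.sqrt(((P[:, None, :] - P[None, :, :]) ** 2).sum(-1))
        return ((D - diff) ** 2).sum() / 2.0

    s = stress(pos)
    best, best_s, t = pos.copy(), s, t0
    for _ in range(n_iter):
        i = rng.integers(n)
        cand = pos.copy()
        cand[i] = np.clip(cand[i] + rng.integers(-1, 2, size=2), 0, grid - 1)
        s_new = stress(cand)
        # Metropolis acceptance: always take improvements, sometimes worse
        if s_new < s or rng.random() < np.exp((s - s_new) / t):
            pos, s = cand, s_new
            if s < best_s:
                best, best_s = pos.copy(), s
        t *= cooling  # geometric cooling
    return best, best_s

# Toy example: exact interpoint distances of the corners of a 3x3 square,
# which the grid can represent with zero Stress
pts = np.array([[0, 0], [0, 3], [3, 0], [3, 3]], float)
D = np.sqrt(((pts[:, None] - pts[None]) ** 2).sum(-1))
conf, best_stress = anneal_mds(D, grid=10)
```

Because a perfect configuration exists on the grid, the best Stress found should be near zero after the run.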

Data Analysis Based on Minimal Closed Subsets

The aim of this paper is to provide a framework which enables us to treat structural analysis problems. This framework is based on pretopological theory. We apply the concepts of pseudoclosure and minimal closed subsets to bring out the structural information. In order to illustrate our method, an application to co-authorships of publications between French geographical areas is displayed.

S. Bonnevay, C. Largeron-Leteno

Robust Multivariate Methods

A Robust Method for Multivariate Regression

We introduce a new method for multivariate regression based on robust estimation of the location and scatter matrix of the joint response and explanatory variables. The resulting method has good equivariance properties and the same breakdown value as the initial estimator for location and scatter. We also derive a general expression for the influence function at elliptical distributions. We compute asymptotic variances and compare them to finite-sample efficiencies obtained by simulation.

Stefan Van Aelst, Katrien Van Driessen, Peter J. Rousseeuw
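The plug-in construction (regression coefficients read off a robust estimate of the joint location and scatter of responses and explanatory variables) can be sketched as follows. A crude concentration-style trimming stands in here for a proper robust scatter estimator such as the MCD; all function names are ours and the sketch is illustrative, not the paper's estimator.

```python
import numpy as np

def trimmed_location_scatter(Z, h_frac=0.75, n_steps=5):
    """Crude robust location/scatter: iterate between Mahalanobis
    distances and the mean/covariance of the h closest points
    (a concentration step in the spirit of the MCD, not FAST-MCD)."""
    n = Z.shape[0]
    h = int(h_frac * n)
    mu, S = Z.mean(0), np.cov(Z, rowvar=False)
    for _ in range(n_steps):
        d = np.einsum('ij,jk,ik->i', Z - mu, np.linalg.inv(S), Z - mu)
        keep = np.argsort(d)[:h]
        mu, S = Z[keep].mean(0), np.cov(Z[keep], rowvar=False)
    return mu, S

def robust_multireg(X, Y):
    """Plug-in multivariate regression from a robust estimate of the
    joint (X, Y) location mu and scatter S: B = Sxx^{-1} Sxy."""
    p = X.shape[1]
    mu, S = trimmed_location_scatter(np.hstack([X, Y]))
    B = np.linalg.solve(S[:p, :p], S[:p, p:])
    a = mu[p:] - B.T @ mu[:p]       # intercept from the locations
    return B, a

# Toy example: Y = 2*X1 - X2 + noise, with 10% gross outliers in Y
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
Y = X @ np.array([[2.0], [-1.0]]) + 0.1 * rng.normal(size=(500, 1))
Y[:50] += 50.0                      # contamination
B, a = robust_multireg(X, Y)
```

Despite the contamination, the estimated slopes should stay close to (2, -1), which a least-squares fit on the same data would not.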

Robust Methods for Complex Data Structures

It is well known that the results of classical statistical methods may be affected by model deviations as for example the occurrence of outliers. The more complex a data structure or statistical procedure, the more complicated might be the mechanism of how outliers influence the analysis. The impact of spurious observations becomes less transparent with growing complexity of models and methods. Detailed sensitivity analyses and the development of suitable robustness concepts are therefore necessary.

Ursula Gather, Claudia Becker, Sonja Kuhnt

Robust Methods for Canonical Correlation Analysis

Canonical correlation analysis studies associations between two sets of random variables. Its standard computation is based on sample covariance matrices, which are however very sensitive to outlying observations. In this note we introduce, discuss and compare four different ways for performing a robust canonical correlation analysis. One method uses robust estimators of the involved covariance matrices, another one uses the signs of the observations, a third approach is based on projection pursuit, and finally an alternating regression algorithm for canonical analysis is proposed.

Catherine Dehon, Peter Filzmoser, Christophe Croux

Data Science


Data Science and Data Collection

From Data Analysis to Data Science

This paper discusses the significance of the term “data science” to the Japanese Classification Society (JCS) and the international relevance of the JCS's research. In 1992, the author argued that there was an urgent need to grasp the concept of “data science”. Despite the emergence of concepts such as data mining, this issue has not been addressed. The discussion emphasizes the history of the methods of data analysis proposed by J. Tukey, as well as the interaction between Japan and, particularly, France in the development of data analysis.

Noboru Ohsumi

Evaluation of Data Quality and Data Analysis

The practical evaluation of data quality is discussed. Such evaluations are essential if we intend to carry out useful data analyses. Here we treat this problem in the context of cross-societal (comparative social) surveys.

Chikio Hayashi

Collapsibility and Collapsing Multidimensional Contingency Tables—Perspectives and Implications

Collapsing multidimensional contingency tables is a necessary procedure in all kinds of research. Since collapsibility is subject to severe conditions, collapsing is often not admissible without incurring severe interpretative errors. After having discussed the main contributions to the statistical specification of the concept, we shall point out the logical conditions for collapsing multidimensional contingency tables.

Stefano De Cantis, Antonino M. Oliveri

Sampling and Internet Surveys

Data Collected on the Web

Despite its relatively short existence, Web-assisted data collection has already been widely applied and, increasingly, data analysts have to work with data collected on the Web. However, we are faced with relatively contradictory views on this data collection mode. In particular, with respect to Web surveys, opinions vary from a belief that the Web will revolutionise the survey industry to the opinion that it does not represent a valid mode of data collection. This paper provides an overview of the methodology of professional Web surveys. Three essential components (self-administration, HTML basis, and automatic transmission) of the Web survey mode are defined and separated from related solicitation and selection procedures. As an illustration, the segmentation/clustering of Internet users with respect to the survey mode (telephone and Web) is presented. In addition, new possibilities for data collection on the Web are discussed, particularly concerning data used in network and clustering analysis.

Vasja Vehovar, Katja Lozar Manfreda, Zenel Batagelj

Some Experimental Surveys on the WWW Environments in Japan

To assess and analyze the characteristics of surveys on the World Wide Web as objectively as possible, we simultaneously conducted several experimental surveys on three Web sites and two ordinary surveys. A comparison of survey results revealed some interesting characteristics of surveys conducted on the Web. Responses were stable and uniform, though systematically biased, across the three Web sites surveyed, in spite of the low response rates. In addition, respondents to the Web surveys showed a general willingness to participate in further surveys conducted through the WWW. The findings imply that it may be feasible and beneficial to conduct longitudinal surveys on the Web.

Osamu Yoshimura, Noboru Ohsumi

Bootstrap Goodness-of-fit Tests for Complex Survey Samples

A method to implement goodness of fit tests in survey sampling, where the common independence assumption fails, is proposed. A bootstrap test following an approach similar to those of Bickel and Freedman (1984), Rao and Wu (1988), and Sitter (1992) is defined and applied to stratified and two-stage sampling. Extensive Monte Carlo simulations show the good behavior of the test and the conditions needed to achieve satisfactory power against reasonable alternatives.

Andrea Scagni
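The overall scheme (compute a design-weighted goodness-of-fit statistic, then resample within strata and recentre so the bootstrap statistic mimics the null distribution) can be sketched as follows. This naive with-replacement version omits the finite-population rescaling of Bickel and Freedman (1984) and Rao and Wu (1988) that the paper builds on; all names are ours.

```python
import numpy as np

def strat_chi2(samples, probs, weights):
    """Weighted chi-square statistic: compare weighted category
    proportions from stratified samples with hypothesised probs."""
    K = len(probs)
    p_hat = np.zeros(K)
    for s, w in zip(samples, weights):
        p_hat += w * np.bincount(s, minlength=K) / len(s)
    n = sum(len(s) for s in samples)
    return n * ((p_hat - probs) ** 2 / probs).sum()

def bootstrap_gof(samples, probs, weights, B=500, seed=0):
    """Bootstrap GOF p-value for stratified data: resample with
    replacement within each stratum, and centre the bootstrap
    statistic at the observed proportions p_hat."""
    rng = np.random.default_rng(seed)
    t_obs = strat_chi2(samples, probs, weights)
    K = len(probs)
    p_hat = np.zeros(K)
    for s, w in zip(samples, weights):
        p_hat += w * np.bincount(s, minlength=K) / len(s)
    t_boot = np.empty(B)
    for b in range(B):
        res = [rng.choice(s, size=len(s), replace=True) for s in samples]
        t_boot[b] = strat_chi2(res, p_hat, weights)  # centred at p_hat
    return t_obs, (t_boot >= t_obs).mean()

# Toy example: two strata, both actually drawn from the hypothesised law
rng = np.random.default_rng(2)
probs = np.array([0.5, 0.3, 0.2])
samples = [rng.choice(3, size=300, p=probs),
           rng.choice(3, size=200, p=probs)]
t, pval = bootstrap_gof(samples, probs, weights=[0.6, 0.4])
```

Since the data are generated under the null hypothesis here, the p-value should typically be non-extreme.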

Symbolic Data Analysis


Classification and Analysis of Symbolic Data

Regression Analysis for Interval-Valued Data

When observations in large data sets are aggregated into smaller, more manageable data sets, the resulting classifications of observations invariably involve symbolic data. In this paper, covariance and correlation functions are introduced for interval-valued symbolic data. These and their associated terms are then used to fit linear regression models to such data. The methods are illustrated with an example from cardiology.

L. Billard, E. Diday
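A slope and intercept for interval-valued data can be obtained from symbolic moment formulas. The sketch below uses the uniform-within-interval moments common in this literature (each interval contributes midpoint-based first moments and a within-interval spread to the variance); it is an illustrative assumption, not necessarily the exact covariance definition of the paper, and all function names are ours.

```python
import numpy as np

def interval_mean(lo, hi):
    """Symbolic sample mean of an interval variable: each interval
    [lo, hi] contributes its midpoint."""
    return (lo + hi).mean() / 2

def interval_var(lo, hi):
    """Symbolic sample variance under a uniform-within-interval
    assumption: E[X^2] uses (a^2 + a*b + b^2)/3 per interval [a, b]."""
    return ((lo**2 + lo*hi + hi**2) / 3).mean() - interval_mean(lo, hi)**2

def interval_cov(xlo, xhi, ylo, yhi):
    """Midpoint-based symbolic covariance of two interval variables."""
    mx, my = interval_mean(xlo, xhi), interval_mean(ylo, yhi)
    return ((xlo + xhi) * (ylo + yhi)).mean() / 4 - mx * my

def interval_regression(xlo, xhi, ylo, yhi):
    """Simple linear regression for interval-valued data via the
    symbolic moments: slope = Cov(X, Y) / Var(X)."""
    b = interval_cov(xlo, xhi, ylo, yhi) / interval_var(xlo, xhi)
    a = interval_mean(ylo, yhi) - b * interval_mean(xlo, xhi)
    return a, b

# Toy example: interval midpoints follow y ~ 2x + 1
rng = np.random.default_rng(3)
c = rng.uniform(0, 10, size=100)            # x midpoints
xlo, xhi = c - 0.5, c + 0.5
ymid = 2 * c + 1 + 0.1 * rng.normal(size=100)
ylo, yhi = ymid - 1.0, ymid + 1.0
a, b = interval_regression(xlo, xhi, ylo, yhi)
```

Note that the interval widths enter Var(X), so the fitted slope is slightly shrunk relative to a midpoint-only regression.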

Symbolic Approach to Classify Large Data Sets

The aim of this work is to present an approach to classifying large data sets based on a Boolean symbolic classifier as described in Yaguchi et al. (1996) and Ichino et al. (1996). Compared with the latter classifier, our system keeps the concept of mutual neighbours between examples (Ichino et al., 1996) but introduces some modifications in both the learning step (generalisation tools) and the allocation step (matching functions). As an example of large data set processing, a particular kind of simulated image is classified with this approach.

Francisco de A.T. de Carvalho, Cezar A. de F. Anselmo, Renata M. C. R. de Souza

Factorial Methods with Cohesion Constraints on Symbolic Objects

In this paper we generalize some results of Factorial Analysis to the complex data structures defined in Symbolic Data Analysis. The proposed treatments are based on a multi-step symbolic-numerical-symbolic procedure and on the geometric interpretation of the results. The paper generalizes the constrained Factorial Approach (Lauro and Palumbo, 2000), which makes it possible to take the symbolic data structure into account in the analysis of the coded data.

N. C. Lauro, R. Verde, F. Palumbo

A Dynamical Clustering Algorithm for Multi-nominal Data

In this paper we present a dynamical clustering algorithm to partition a set of multi-nominal data into k classes. This kind of data can be considered a particular description of symbolic objects. In this algorithm, the representation of each class is given by a prototype that generalizes the characteristics of the elements belonging to the class. A suitable (context-dependent) allocation function is used to assign an object to a class. The final classes are described by the distributions associated with the multi-nominal variables of the elements belonging to each class; that representation corresponds to the usual description of so-called modal symbolic objects.

Rosanna Verde, Francisco de A. T. de Carvalho, Yves Lechevallier
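The representation/allocation alternation can be sketched for nominal data with distribution prototypes. The log-probability allocation rule and the Laplace smoothing below are our own choices for illustration, not the paper's context-dependent allocation function.

```python
import numpy as np

def dyn_cluster_nominal(X, k, n_iter=30, seed=0):
    """Dynamical-clustering sketch for nominal data: each class
    prototype is the per-variable category distribution of its
    members; objects are reallocated to the prototype giving the
    highest total log-probability of their observed categories."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_cats = X.max(axis=0) + 1
    labels = rng.integers(k, size=n)
    for _ in range(n_iter):
        # Representation step: smoothed category distributions per class
        protos = []
        for c in range(k):
            member = X[labels == c]
            counts = [np.bincount(member[:, j], minlength=n_cats[j]) + 1.0
                      for j in range(p)]          # Laplace smoothing
            protos.append([q / q.sum() for q in counts])
        # Allocation step: assign each object to its best prototype
        score = np.zeros((n, k))
        for c in range(k):
            for j in range(p):
                score[:, c] += np.log(protos[c][j][X[:, j]])
        new = score.argmax(axis=1)
        if (new == labels).all():
            break
        labels = new
    return labels

# Toy example: two groups with different dominant categories
rng = np.random.default_rng(4)
A = rng.choice(3, size=(100, 4), p=[0.8, 0.1, 0.1])
B = rng.choice(3, size=(100, 4), p=[0.1, 0.1, 0.8])
X = np.vstack([A, B])
labels = dyn_cluster_nominal(X, k=2)
```

The final per-class category distributions (`protos` in the last iteration) play the role of the modal symbolic description of each class.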


DB2SO: A Software for Building Symbolic Objects from Databases

The SODAS project, funded by the EC, has developed software for extending statistical data analysis methods to more complex objects. The objects processed by these methods are complex in the sense that they represent groups of individuals and capture the variation within each group. Within the context of the SODAS project, these complex objects are called symbolic objects. In this paper, we present the part of the SODAS software that enables the user to acquire data sets of symbolic objects by extracting information from relational databases.

Georges Hébrail, Yves Lechevallier

Symbolic Data Analysis and the SODAS Software in Official Statistics

The need to extract new knowledge from complex data contained in relational databases is increasing. It therefore becomes a task of first importance to summarise huge data sets by their underlying concepts in order to extract useful knowledge. These concepts can only be described by a more complex data type called “symbolic data”. We define “Symbolic Data Analysis” (SDA) as the extension of standard Data Analysis to symbolic data tables. The Symbolic Data Analysis theory is now supported by a new software tool called “SODAS”, which results from the effort of 17 European teams (sponsored by EUROSTAT). This is shown by several applications in Official Statistics.

Raymond Bisdorff, Edwin Diday

Strata Decision Tree SDA Software

The SDT and SDTEDITOR software are presented. SDT (Strata Decision Tree) implements a generalised recursive tree-building algorithm for populations partitioned into strata and described by symbolic data, that is, data structures more complex than classical data. Symbolic objects describe the decisional nodes and strata. SDTEDITOR is a graph editor for strata decision trees. SDT and SDTEDITOR are modules integrated into the SODAS software (Symbolic Official Data Analysis System), partially supported by ESPRIT-20821 SODAS.

M. Carmen Bravo

Marking and Generalization by Symbolic Objects in the Symbolic Official Data Analysis Software

In this paper we propose an automatic method for generating symbolic objects in the following framework: the description of a partition by symbolic objects that takes two aspects into account, which may be called the homogeneity and discrimination criteria. This method belongs to a family of algorithms named MGS (Marking and Generalization by Symbolic Objects), which may be applied to the interpretation of Factorial Analysis, to the interpretation of partitions, or to the summarization of huge databases.

Mireille Gettler Summa

