
2006 | Book

Data Analysis, Classification and the Forward Search

Proceedings of the Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society, University of Parma, June 6–8, 2005

Edited by: Prof. Sergio Zani, Prof. Andrea Cerioli, Prof. Marco Riani, Prof. Maurizio Vichi

Publisher: Springer Berlin Heidelberg

Book series: Studies in Classification, Data Analysis, and Knowledge Organization


About this book

This volume contains revised versions of selected papers presented at the biennial meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society, which was held in Parma, June 6-8, 2005. Sergio Zani chaired the Scientific Programme Committee and Andrea Cerioli chaired the Local Organizing Committee. The scientific programme of the conference included 127 papers, 42 in specialized sessions, 68 in contributed paper sessions and 17 in poster sessions. Moreover, it was possible to recruit five notable and internationally renowned invited speakers (including the 2004-2005 President of the International Federation of Classification Societies) for plenary talks on their current research work. Among the specialized sessions, two were organized by Wolfgang Gaul with five talks by members of the GfKl (German Classification Society), and one by Jacqueline J. Meulman (Dutch/Flemish Classification Society). Thus, the conference provided a large number of scientists and experts from home and abroad with an attractive forum for discussion and mutual exchange of knowledge. The topics of all plenary and specialized sessions were chosen to fit, in the broadest possible sense, the mission of CLADAG, the aim of which is "to further methodological, computational and applied research within the fields of Classification, Data Analysis and Multivariate Statistics". A peer-review refereeing process led to the selection of 46 extended papers, which are contained in this book.

Table of contents

Frontmatter

Clustering and Discrimination

Frontmatter
Genetic Algorithms-based Approaches for Clustering Time Series

Cluster analysis is among the most popular data mining techniques. Cluster analysis of time series has received great attention only recently, mainly because of the several difficult issues involved. Among the available methods, genetic algorithms have proved able to handle this task efficiently. Several partitions are considered and iteratively selected according to some adequacy criterion. In this artificial "struggle for survival" partitions are allowed to interact and mutate to improve and produce a "high quality" solution. Given a set of time series, two genetic algorithms are considered for clustering (the number of clusters is assumed unknown). Both algorithms require a model to be fitted to each time series to obtain model parameters and residuals. These methods are applied to a real data set concerning the visitor flows recorded in state-owned museums with paid admission in the Lazio region of Italy.

Roberto Baragona, Salvatore Vitrano
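
To make the "struggle for survival" concrete, here is a minimal sketch of a genetic algorithm that clusters series by a fitted model parameter. This is an illustration under simplifying assumptions, not the authors' algorithm: each series is reduced to a least-squares AR(1) coefficient, fitness is the negative within-cluster sum of squares of those coefficients, and the population evolves by tournament selection and mutation only (crossover is omitted for brevity). All names and data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_coef(x):
    # least-squares AR(1) coefficient: the "model fitted to each series"
    return np.dot(x[:-1], x[1:]) / np.dot(x[:-1], x[:-1])

def fitness(labels, feats):
    # negative within-cluster sum of squares of the fitted parameters
    return -sum(((feats[labels == g] - feats[labels == g].mean()) ** 2).sum()
                for g in np.unique(labels))

def ga_cluster(series, k_max=4, pop=30, gens=60, mut=0.1):
    feats = np.array([ar1_coef(s) for s in series])
    P = rng.integers(0, k_max, size=(pop, len(series)))   # random partitions
    for _ in range(gens):
        f = np.array([fitness(ind, feats) for ind in P])
        i, j = rng.integers(0, pop, (2, pop))             # tournament selection
        P = P[np.where(f[i] >= f[j], i, j)].copy()
        m = rng.random(P.shape) < mut                     # mutation
        P[m] = rng.integers(0, k_max, m.sum())
    f = np.array([fitness(ind, feats) for ind in P])
    return P[f.argmax()]

def ar1_sim(phi, n=200):
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.normal()
    return x

series = [ar1_sim(0.8) for _ in range(5)] + [ar1_sim(-0.7) for _ in range(5)]
print(ga_cluster(series))          # cluster labels for the 10 toy series
```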
On the Choice of the Kernel Function in Kernel Discriminant Analysis Using Information Complexity

In this short paper we consider Kernel Fisher Discriminant Analysis (KFDA), which extends the idea of Linear Discriminant Analysis (LDA) to a nonlinear feature space. We present a new method of choosing the optimal kernel function and study its effect on the KDA classifier using an information-theoretic complexity measure.

Hamparsum Bozdogan, Furio Camillo, Caterina Liberati
Growing Clustering Algorithms in Market Segmentation: Defining Target Groups and Related Marketing Communication

This paper outlines innovative techniques for the segmentation of consumer markets. It compares a new self-controlled growing neural network with a recent growing k-means algorithm. A critical issue is the identification of the "right" number of clusters, which is externally validated by the JUMP criterion. The empirical application counters several objections recently raised against the use of cluster analysis for market segmentation.

Reinhold Decker, Sören W. Scholz, Ralf Wagner
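
The JUMP criterion of Sugar and James (2003) can be computed directly from k-means distortions: transform the average distortion d_K by a negative power and pick the K with the largest jump. A minimal sketch on synthetic data, assuming the usual transformation power p/2; it illustrates the validation criterion only, not the paper's growing algorithms.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
p = X.shape[1]
Y = p / 2.0                  # transformation power suggested by Sugar and James

d = [np.inf]                 # K = 0: transformed distortion is 0 (inf ** -Y)
for K in range(1, 9):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
    d.append(km.inertia_ / (p * len(X)))   # average per-dimension distortion

d = np.array(d)
jumps = d[1:] ** (-Y) - d[:-1] ** (-Y)     # J_K = d_K^{-Y} - d_{K-1}^{-Y}
print("estimated number of clusters:", int(np.argmax(jumps)) + 1)
```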
Graphical Representation of Functional Clusters and MDS Configurations

We deal with graphical representations of the results of functional clustering and functional multidimensional scaling (MDS). Ramsay and Silverman (1997, 2005) proposed functional data analysis, which enlarges the range of statistical data analysis; however, it is not easy to represent the results of functional data analysis techniques. We focus on two methods: functional clustering and functional MDS. In the first part of this paper we show graphical representations for functional hierarchical clustering and the functional k-means method. In the second part, a graphical representation of the results of functional MDS, the functional configuration, is presented.

Masahiro Mizuta
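
One standard way to implement a functional k-means, and to draw a simple graphical summary of the clusters, is to project each curve onto a common basis and cluster the coefficient vectors. The sketch below uses a Legendre polynomial basis on synthetic curves; the basis choice and data are assumptions for illustration, not Mizuta's procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 101)

# synthetic functional data: two groups of noisy curves
curves = np.array(
    [np.sin(2 * np.pi * t) + rng.normal(0, 0.2, t.size) for _ in range(10)]
    + [np.cos(2 * np.pi * t) + rng.normal(0, 0.2, t.size) for _ in range(10)])

# project each curve on a common basis; the coefficients become the features
B = np.polynomial.legendre.legvander(2 * t - 1, deg=7)   # basis on the grid
coef, *_ = np.linalg.lstsq(B, curves.T, rcond=None)      # one column per curve

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coef.T)

# a simple graphical summary: the mean curve of each functional cluster
for g in np.unique(labels):
    mean_curve = B @ coef[:, labels == g].mean(axis=1)
    print("cluster", g, "mean curve starts:", mean_curve[:3].round(2))
```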
Estimation of the Structural Mean of a Sample of Curves by Dynamic Time Warping

Following our previous work, in which an improved dynamic time warping (DTW) algorithm was proposed and motivated, especially in the multivariate case, for computing the dissimilarity between curves, in this paper we modify the classical DTW in order to obtain discrete warping functions and to estimate the structural mean of a sample of curves. With the suggested methodology we analyze series of daily measurements of some air pollutants in Emilia-Romagna (a region in Northern Italy). We compare the results with those obtained with other flexible and nonparametric approaches used in functional data analysis.

Isabella Morlini, Sergio Zani
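
For reference, the classical DTW that the paper modifies computes an optimal monotone alignment by dynamic programming. A minimal sketch with an absolute-difference local cost (an assumption); the authors' modifications for discrete warping functions and the structural-mean estimation are not reproduced here.

```python
import numpy as np

def dtw(x, y):
    # classical DTW: cumulative cost matrix over insert/delete/match steps
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # backtrack the optimal discrete warping path
    path, (i, j) = [], (n, m)
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j), (i, j - 1), (i - 1, j - 1)], key=lambda q: D[q])
    return D[n, m], path[::-1]

x = np.sin(np.linspace(0.0, 3.0, 50))
y = np.sin(np.linspace(0.3, 3.3, 60))
cost, path = dtw(x, y)
print(round(cost, 3), path[:3])
```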
Sequential Decisional Discriminant Analysis

We describe a sequential discriminant analysis method whose aim is essentially to classify evolutionary data. This decision-making method is based on the search for the principal axes of a configuration of points in the individual space endowed with a relational inner product. We face a discriminant analysis problem in which the decision must be taken as the evolutionary information on the observations of the statistical unit to be classified becomes partially known. We show how the knowledge gained from observing the global reference sample over the entire period can benefit the classification decision for supplementary statistical units about which we only have partial information. An analysis of real data using this method is described.

Rafik Abdesselam
Regularized Sliced Inverse Regression with Applications in Classification

Consider the problem of classifying a number of objects into one of several groups or classes based on a set of characteristics. This problem has been extensively studied under the general subject of discriminant analysis in the statistical literature, or supervised pattern recognition in the machine learning field. Recently, dimension reduction methods, such as SIR and SAVE, have been used for classification purposes. In this paper we propose a regularized version of the SIR method which is able to gain information from both the structure of class means and class variances. Furthermore, the introduction of a shrinkage parameter allows the method to be applied in under-resolution problems, such as those found in gene expression microarray data. The REGSIR method is illustrated on two different classification problems using real data sets.

Luca Scrucca

Multidimensional Data Analysis and Multivariate Statistics

Frontmatter
Approaches to Asymmetric Multidimensional Scaling with External Information

In this paper some possible approaches to asymmetric multidimensional scaling with external information are presented to analyze graphically asymmetric proximity matrices. In particular, a proposal to incorporate external information in the biplot method is provided. The methods considered allow joint or separate analyses of symmetry and skew-symmetry. A final application to Morse code data is performed to emphasize advantages and shortcomings of the different methods proposed.

Giuseppe Bove
Variable Architecture Bayesian Neural Networks: Model Selection Based on EMC

This work addresses the problem of selecting appropriate architectures for Bayesian Neural Networks (BNN). Specifically, it proposes a variable architecture model where the number of hidden units is selected by using a variant of the real-coded Evolutionary Monte Carlo algorithm developed by Liang and Wong (2001) for inference and prediction in fixed architecture Bayesian Neural Networks.

Silvia Bozza, Pietro Mantovan
Missing Data in Optimal Scaling

We propose a procedure to assess a measure of a latent phenomenon, starting from the observation of a wide set of ordinal variables affected by missing data. The proposal is based on the Nonlinear PCA technique, used jointly with an ad hoc imputation method for the treatment of missing data. The procedure is particularly suitable when dealing with ordinal, or mixed, variables which are strongly interrelated and in the presence of specific patterns of missing observations.

Pier Alda Ferrari, Paola Annoni
Simple Component Analysis Based on RV Coefficient
Michele Gallo, Pietro Amenta, Luigi D’Ambra
Baum-Eagon Inequality in Probabilistic Labeling Problems

This work illustrates an approach to the study of labeling, also known as “object classification”. This kind of parallel computing problem is well suited to AI applications (pattern recognition, edge detection, etc.). Our goal is to simplify an overly computationally costly algorithm proposed by Faugeras and Berthod: using the Baum-Eagon theorem, we obtain a reduced algorithm which produces results comparable with those of other, more complex approaches.

Crescenzio Gallo, Giancarlo de Stasio
Monotone Constrained EM Algorithms for Multinormal Mixture Models

We investigate the spectral decomposition of the covariance matrices of a multivariate normal mixture distribution in order to construct constrained EM algorithms which guarantee the monotonicity property. Furthermore, we propose different sets of constraints which can be simply implemented. These procedures have been tested in many numerical experiments.

Salvatore Ingrassia, Roberto Rocci
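
One constraint set that is "simply implemented" in this spirit is to bound the eigenvalues of every component covariance. The sketch below runs a plain EM and, after each M-step, clips the spectral decomposition of each covariance to an interval [a, b]; the bounds, data and initialization are illustrative assumptions, not the authors' exact constraints or their monotonicity proof.

```python
import numpy as np
from scipy.stats import multivariate_normal

def constrained_em(X, k=2, a=0.1, b=10.0, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w, mu = np.full(k, 1 / k), X[rng.choice(n, k, replace=False)]
    S = np.array([np.cov(X.T)] * k)
    for _ in range(iters):
        # E-step: posterior responsibilities
        dens = np.column_stack([w[j] * multivariate_normal.pdf(X, mu[j], S[j])
                                for j in range(k)])
        r = dens / np.maximum(dens.sum(axis=1, keepdims=True), 1e-300)
        # M-step with constrained covariances
        nj = r.sum(axis=0)
        w, mu = nj / n, (r.T @ X) / nj[:, None]
        for j in range(k):
            Xc = X - mu[j]
            Sj = (r[:, j, None] * Xc).T @ Xc / nj[j]
            lam, V = np.linalg.eigh(Sj)                  # spectral decomposition
            S[j] = V @ np.diag(np.clip(lam, a, b)) @ V.T # eigenvalues kept in [a, b]
    return w, mu, S

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
print(constrained_em(X)[1].round(2))                     # the two estimated means
```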
Visualizing Dependence of Bootstrap Confidence Intervals for Methods Yielding Spatial Configurations

Several techniques (like MDS and PCA) exist for summarizing data by means of a graphical configuration of points in a low-dimensional space. Usually, such analyses are applied to data for a sample drawn from a population. To assess how accurate the sample-based plot is as a representation of the population, confidence intervals or ellipsoids can be constructed around each plotted point, using the bootstrap procedure. However, such a procedure ignores the dependence of the variation of different points across bootstrap samples. To display how the variations of different points depend on each other, we propose to visualize bootstrap configurations in a bootstrap movie.

Henk A. L. Kiers, Patrick J. F. Groenen
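
The idea can be sketched for PCA: recompute the configuration on each bootstrap sample, align every replicate to a reference configuration by an orthogonal Procrustes rotation (configurations are only determined up to rotation/reflection), and keep the aligned frames as the movie. A minimal sketch on synthetic data; the authors' actual tool is an interactive animation.

```python
import numpy as np

def pca_points(X, d=2):
    # variable points (loadings scaled by singular values) in d dimensions
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return (Vt[:d] * s[:d, None]).T

def align(A, B):
    # orthogonal Procrustes rotation of B onto the reference A
    U, _, Vt = np.linalg.svd(B.T @ A)
    return B @ (U @ Vt)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 5))   # synthetic sample
ref = pca_points(X)

frames = np.array([align(ref, pca_points(X[rng.integers(0, len(X), len(X))]))
                   for _ in range(200)])                  # one frame per replicate

# the dependence that per-point ellipses ignore: co-movement of two points
print(np.corrcoef(frames[:, 0, 0], frames[:, 1, 0])[0, 1])
```

The final line quantifies exactly the kind of joint variation between plotted points that separate confidence ellipsoids would hide.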
Automatic Discount Selection for Exponential Family State-Space Models

In a previous paper (Pastore, 2004), a method for selecting the discount parameter in a Gaussian state-space model was introduced. The method is based on a sequential optimization of a Bayes factor and is intended for on-line modelling purposes. In this paper, these results are extended to state-space models where the distribution of the observable variable belongs to the exponential family.

Andrea Pastore
A Generalization of the Polychoric Correlation Coefficient

The polychoric correlation coefficient is a measure of association between two ordinal variables. It is based on the assumption that two latent bivariate normally distributed random variables generate couples of ordinal scores. Categories of the two ordinal variables correspond to intervals of the corresponding continuous variables. Thus, measuring the association between ordinal variables means estimating the product moment correlation between the underlying normal variables (Olsson, 1979). When the hypothesis of latent bivariate normality is empirically or theoretically implausible, other distributional assumptions can be made. In this paper a new and more flexible polychoric correlation coefficient is proposed, assuming that the underlying variables are skew-normally distributed (Roscino, 2005). The skew normal (Azzalini and Dalla Valle, 1996) is a family of distributions which includes the normal distribution as a special case, but with an extra parameter to regulate the skewness. As for the original polychoric correlation coefficient, the new coefficient is estimated by maximizing the log-likelihood function with respect to the thresholds of the continuous variables, the skewness and the correlation parameters. The new coefficient was then tested on samples from simulated populations differing in the number of ordinal categories and the distribution of the underlying variables. The results were compared with those of the original polychoric correlation coefficient.

Annarita Roscino, Alessio Pollice
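
For the normal special case that the paper generalizes, the coefficient can be estimated by maximizing the likelihood of the observed contingency table over the latent correlation. The sketch below uses the common two-step variant (thresholds fixed from the margins, then rho maximized); the paper instead maximizes jointly and allows skew-normal margins. The table is made up.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

def polychoric(table):
    table = np.asarray(table, float)
    n = table.sum()
    # thresholds from the marginal cumulative proportions; +/-6 stands in
    # for +/-infinity for numerical convenience
    a = np.clip(norm.ppf(np.append(0, table.sum(1).cumsum() / n)), -6, 6)
    b = np.clip(norm.ppf(np.append(0, table.sum(0).cumsum() / n)), -6, 6)

    def negloglik(rho):
        F = lambda x, y: multivariate_normal.cdf([x, y], mean=[0, 0],
                                                 cov=[[1, rho], [rho, 1]])
        ll = 0.0
        for i in range(table.shape[0]):
            for j in range(table.shape[1]):
                # rectangle probability of cell (i, j) under the latent normal
                pij = (F(a[i + 1], b[j + 1]) - F(a[i], b[j + 1])
                       - F(a[i + 1], b[j]) + F(a[i], b[j]))
                ll += table[i, j] * np.log(max(pij, 1e-12))
        return -ll

    return minimize_scalar(negloglik, bounds=(-0.99, 0.99), method="bounded").x

tab = [[20, 10, 5], [8, 25, 12], [3, 11, 26]]
print(round(polychoric(tab), 3))
```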
The Effects of MEP Distributed Random Effects on Variance Component Estimation in Multilevel Models

An in-depth investigation of maximum likelihood estimators for variance components is proposed, where the reference is a multilevel model with misspecification of the random-effect distribution. The multivariate distributions here introduced for the random effects belong to the family of Multivariate Exponential Power (MEP) distributions. Our primary interest is the variability of such estimators, since the MEPs have a noteworthy influence upon it.

Nadia Solaro, Pier Alda Ferrari
Calibration Confidence Regions Using Empirical Likelihood

The literature on multivariate calibration shows an increasing interest in nonparametric or semiparametric methods. Using Empirical Likelihood (EL), we present a semiparametric approach to find multivariate calibration confidence regions and we show how a unique optimum calibration point may be found by weighting the EL profile function. In addition, a freeware VBA for Excel© program has been implemented to solve the many relevant computational problems. An example taken from a process of a semiconductor industry is presented.

Diego Zappa

Robust Methods and the Forward Search

Frontmatter
Random Start Forward Searches with Envelopes for Detecting Clusters in Multivariate Data

During a forward search the plot of minimum Mahalanobis distances of observations not in the subset provides a test for outliers. However, if clusters are present in the data, their simple identification requires that there are searches that initially include a preponderance of observations from each of the unknown clusters. We use random starts to provide such searches, combined with simulation envelopes for precise inference about clustering.

Anthony Atkinson, Marco Riani, Andrea Cerioli
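
A single forward search from a random start can be sketched in a few lines: grow the subset one unit at a time by Mahalanobis distance, recording at each step the minimum distance among units outside the subset. The simulation envelopes needed for formal inference are not reproduced; the data, subset size and number of starts are illustrative assumptions.

```python
import numpy as np

def forward_search(X, m0=5, seed=None):
    # record, at each subset size, the minimum Mahalanobis distance among
    # the units not yet in the subset
    rng = np.random.default_rng(seed)
    n, p = X.shape
    subset = rng.choice(n, m0, replace=False)
    mins = []
    for m in range(m0, n):
        mu = X[subset].mean(axis=0)
        Ci = np.linalg.pinv(np.cov(X[subset].T))      # pinv guards tiny subsets
        d2 = np.einsum('ij,jk,ik->i', X - mu, Ci, X - mu)
        outside = np.setdiff1d(np.arange(n), subset)
        mins.append(np.sqrt(d2[outside].min()))
        subset = np.argsort(d2)[:m + 1]               # grow by one unit
    return np.array(mins)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 3)), rng.normal(5, 1, (20, 3))])  # 2 clusters
curves = [forward_search(X, seed=s) for s in range(5)]                 # random starts
# a peak in a curve flags the step where units of another cluster begin to enter
print(np.round(curves[0][-25:-15], 2))
```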
Robust Transformation of Proportions Using the Forward Search

The aim of this work is to detect the best transformation parameters to normality when the data are proportions. To this purpose we extend the forward search algorithm introduced by Atkinson and Riani (2000) and Atkinson et al. (2004) to the transformation proposed by Aranda-Ordaz (1981). The procedure, implemented by the authors in R, is applied to the analysis of a particular characteristic of Tuscan industries. The data derive from the Italian industrial census conducted in 2001 by the Italian National Statistical Institute (ISTAT).

Matilde Bini, Bruno Bertaccini
The Forward Search Method Applied to Geodetic Transformations

In geodesy, one of the most frequent problems to solve is coordinate transformation. This means that it is necessary to estimate the coefficients of the equations that transform the planimetric coordinates defined in one reference system to the corresponding ones in a second reference system. This operation, performed in a 2D space, is called planar transformation. The main problem is that if outliers are included in the data, adopting non-robust methods of adjustment to calculate the coefficients causes an arbitrarily large change in the estimates. Traditional methods require close analysis, by the operator, of the computation progress and of the statistical indicators provided in order to identify possible outliers.

In this paper the application of the Forward Search in geodesy is discussed and the results are compared with those computed with traditional adjustment methods.

Alessandro Carosio, Marco Piras, Dante Salvini
An R Package for the Forward Analysis of Multivariate Data

We describe the R package Rfwdmv (R package for the forward multivariate analysis) which implements the forward search for the analysis of multivariate data. The package provides functions useful for detecting atypical observations and/or subsets in the data and for testing in a robust way whether the data should be transformed. Additionally, the package contains functions for performing robust principal component analyses and robust discriminant analyses as well as a range of graphical tools for interactively assessing fitted forward searches on multivariate data.

Aldo Corbellini, Kjell Konis
A Forward Search Method for Robust Generalised Procrustes Analysis

One drawback of Procrustes Analysis is its lack of robustness. To overcome this limitation, a procedure that applies the Generalised Procrustes method by way of a progressive sequence inspired by the “forward search” was developed. Starting from an initial centroid, defined by the partial point configuration satisfying the LMS principle, the configuration is extended by joining, at every step, a restricted subset of the remaining points. At every insertion, the updated centroid, redetermined by the newly considered points, is compared with the previous one by means of the common elements. If significant variations of the similarity transformation parameters occur, they reveal the presence of outliers or non-stationary points among the elements just inserted.

Fabio Crosilla, Alberto Beinat
A Projection Method for Robust Estimation and Clustering in Large Data Sets

A projection method for robust estimation of shape and location in multivariate data and cluster analysis is presented. The key idea of the procedure is to search for heterogeneity in univariate projections on directions that are obtained both randomly, using a modification of the Stahel-Donoho procedure, and by maximizing and minimizing the kurtosis coefficient of the projected data, as proposed by Peña and Prieto (2005). We show in a Monte Carlo study that the resulting procedure works well for robust estimation. Also, it preserves the good theoretical properties of the Stahel-Donoho method.

Daniel Peña, Francisco J. Prieto
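
The kurtosis part of the procedure can be sketched directly: optimize the kurtosis coefficient of the projected data over unit-norm directions. Minimizing tends to reveal cluster structure, maximizing tends to point at outliers (Peña and Prieto, 2005). The sketch uses a generic optimizer on synthetic data; the paper's specialized algorithm and the modified Stahel-Donoho random directions are not reproduced.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, (100, 4)), rng.normal(3, 1, (100, 4))])

def proj_kurtosis(a, X, sign=1.0):
    a = a / np.linalg.norm(a)            # optimize over unit-norm directions
    z = X @ a
    z = (z - z.mean()) / z.std()
    return sign * np.mean(z ** 4)

# minimal kurtosis: bimodal projections (clusters); maximal: heavy tails (outliers)
d_min = minimize(proj_kurtosis, rng.normal(size=4), args=(X, +1.0)).x
d_max = minimize(proj_kurtosis, rng.normal(size=4), args=(X, -1.0)).x
print("min-kurtosis value:", round(proj_kurtosis(d_min, X), 2))  # near 1 for 2 groups
print("max-kurtosis value:", round(proj_kurtosis(d_max, X), 2))
```

Heterogeneity is then sought in the one-dimensional projections X @ d_min and X @ d_max, for instance by looking for gaps.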
Robust Multivariate Calibration

Multivariate calibration uses an estimated relationship between a multivariate response Y and an explanatory vector X to predict an unknown X in the future from further observed responses. Up to now very little has been written about robust calibration. One approach can be based on outlier deletion methods; an alternative is to employ robust procedures. The purpose of this paper is to present multivariate calibration methods which are able to detect and investigate those observations which differ from the bulk of the data, or to identify subgroups of observations. Particular attention will be paid to the forward search approach.

Silvia Salini

Data Mining Methods and Software

Frontmatter
Procrustes Techniques for Text Mining

This paper aims at exploring the capability of so-called Latent Semantic Analysis in a multilingual context. In particular, we are interested in weighing how useful it can be in solving linguistic problems, starting from a statistical point of view. Here we focus on the possibility of evaluating the goodness of a translation by comparing the latent structures of the original text and its version in another natural language. Procrustes rotations are introduced in a statistical framework as a tool for reaching this goal. An application to one year of Le Monde Diplomatique and the corresponding Italian edition shows the effectiveness of our proposal.

Simona Balbi, Michelangelo Misuraca
Building Recommendations from Random Walks on Library OPAC Usage Data

In this contribution we describe a new way of building a recommender service based on OPAC web-usage histories. The service is based on a clustering approach with restricted random walks. This algorithm has some properties of single linkage clustering and suffers from the same deficiency, namely bridging. By introducing the idea of a walk context (see Franke and Thede (2005) and Franke and Geyer-Schulz (2004)) the bridging effect can be considerably reduced and small clusters suitable as recommendations are produced. The resulting clustering algorithm scales well for the large data sets in library networks. It complements behavior-based recommender services by supporting the exploration of the revealed semantic net of a library network’s documents and it offers the user the choice of the trade-off between precision and recall. The architecture of the behavior-based system is described in Geyer-Schulz et al. (2003).

Markus Franke, Andreas Geyer-Schulz, Andreas Neumann
A Software Tool via Web for the Statistical Data Analysis: R-php

The spread of the Internet and the growing demand for services from web users have changed, and are still changing, the way work and study are organized. Nowadays most information and many services are on the web, and software is moving in the same direction: the use of software implemented via the web is ever-increasing, with a client-server logic that enables the “centralized” use of software installed on a server. In this paper we describe the structure and the operation of R-php, an environment for statistical analysis, freely accessible through the World Wide Web and based on the statistical environment R. R-php comprises two modules: a base module and a point-and-click module. The statistical analyses implemented so far in the point-and-click module include ANOVA, linear regression and some data analysis methods such as cluster analysis and PCA (Principal Component Analysis).

Angelo M. Mineo, Alfredo Pontillo
Evolutionary Algorithms for Classification and Regression Trees

Optimization problems are of growing importance for many statistical methodologies, and particularly for Data Mining. For a certain class of problems it is not feasible to exhaustively examine all possible solutions, which has drawn researchers’ attention to a class of algorithms called heuristics. Some of these heuristics (in particular Genetic Algorithms and Ant Colony Optimization algorithms), which are inspired by natural phenomena, have captured the attention of the scientific community in many fields. In this paper Evolutionary Algorithms are presented in order to tackle two well-known problems that affect Classification and Regression Trees.

Francesco Mola, Raffaele Miele
Variable Selection Using Random Forests

One of the main topics in the development of predictive models is the identification of variables which are predictors of a given outcome. Automated model selection methods, such as backward or forward stepwise regression, are classical solutions to this problem, but are generally based on strong assumptions about the functional form of the model or the distribution of residuals. In this paper an alternative selection method, based on the technique of Random Forests, is proposed in the context of classification, with an application to a real dataset.

Marco Sandri, Paola Zuccolotto
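
A minimal version of the idea: fit a random forest and rank predictors by the impurity-based importance scores, keeping those above a cutoff. The cutoff rule and the synthetic data are illustrative assumptions; the paper's selection procedure is more refined.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# data with 5 informative and 15 noise predictors (synthetic stand-in)
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# rank predictors by importance; keep those above the mean importance
# (one simple cutoff among many possible rules)
imp = rf.feature_importances_
selected = np.where(imp > imp.mean())[0]
print("selected predictors:", selected)
```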
Boosted Incremental Tree-based Imputation of Missing Data

Tree-based procedures have recently been considered as nonparametric tools for missing data imputation when dealing with large data structures and no probabilistic assumptions. A previous work used an incremental algorithm based on cross-validated decision trees and a lexicographic ordering of the single data to be imputed. This paper considers an ensemble method where a tree-based model is used as the learner. Furthermore, the incremental imputation concerns the missing data of each variable in turn. As a result, the proposed method allows more accurate imputations through a more efficient algorithm. A simulation case study shows the overall good performance of the proposed method against some competitors. A MatLab implementation enriches the Tree Harvest software for non-standard classification and regression trees.

Roberta Siciliano, Massimo Aria, Antonio D’Ambrosio
Sensitivity of Attributes on the Performance of Attribute-Aware Collaborative Filtering

Collaborative Filtering (CF), the most commonly used technique for recommender systems, does not make use of object attributes. Several hybrid recommender systems have been proposed that aim at improving recommendation quality by incorporating attributes in a CF model.

In this paper, we conduct an empirical study of the sensitivity of attributes for several existing hybrid techniques, using a movie dataset with an augmented movie attribute set. In addition, we propose two attribute selection measures to select informative attributes for attribute-aware CF algorithms.

Karen H. L. Tso, Lars Schmidt-Thieme

Multivariate Methods for Customer Satisfaction and Service Evaluation

Frontmatter
Customer Satisfaction Evaluation: An Approach Based on Simultaneous Diagonalization

Several methods have been proposed in the literature for service quality evaluation. These models measure the gap between customers’ expectations of excellence and their perceptions of the actual service offered. In this paper we propose an extension of a technique which allows the expectations and perceptions data to be analyzed jointly.

Pietro Amenta, Biagio Simonetti
Analyzing Evaluation Data: Modelling and Testing for Homogeneity

In the evaluation process of a given service, several issues are worth analyzing. First, it is interesting to assess how the evaluation responses change over time and whether there is an effect of the raters’ features. Secondly, when the service is made up of different items, it is important to verify whether the satisfaction of the users/consumers is the same with respect to all the dimensions. To this end, the paper proposes a modelling approach for analyzing and testing ordinal/rating data. Evidence from the evaluation of University services shows the usefulness of this procedure in a real case study.

Angela D’Elia, Domenico Piccolo
Archetypal Analysis for Data Driven Benchmarking

In this work, adopting an exploratory and graphical approach, we suggest considering archetypal analysis as the basis for a data-driven benchmarking procedure. The procedure aims at defining some reference performers, understanding their features, and comparing observed performances with them. Since archetypes are extreme points, we propose to consider them as reference performers. We then offer a set of graphical tools to describe these archetypal benchmarks and to evaluate the observed performances with respect to them.

Giovanni C. Porzio, Giancarlo Ragozini, Domenico Vistocco
Determinants of Secondary School Dropping Out: a Structural Equation Model

In this work we present the main results of a research program on dropping out of secondary school, carried out for the Labor Bureau of the Campania Region in Italy. We exploited structural equation modeling to identify the determinants of the phenomenon under study. We adopt a social system perspective, considering data coming from official statistics for the 103 Italian provinces. We provide some details of the model specification and the estimated parameters. Some relevant issues related to the estimation process, due to the small sample size and the non-normality of the variables, are also discussed.

Giancarlo Ragozini, Maria Prosperina Vitale
Testing Procedures for Multilevel Models with Administrative Data

Recent relative effectiveness studies of the health sector have strongly criticized hierarchical rankings of hospitals. As an alternative, they propose a multi-faceted approach which evaluates the quality and characteristics of hospital services. In this direction, the use of administrative data has proven highly useful. Such data are less precise than clinical data but perform more effectively in describing general situations. The size of the population renders all the parameters significant in linear model tests. We must therefore use resampling schemes in order to verify the hypotheses concerning the significance of the parameters in suitably drawn subsamples.

Giorgio Vittadini, Maurizio Sanarico, Paolo Berta
Multidimensional Versus Unidimensional Models for Ability Testing

Over the last few years the need for an objective way of evaluating student performance has rapidly increased, due to the growing call for the evaluation of tests administered at the end of a teaching process and during guidance phases. Performance evaluation can be achieved using the Item Response Theory (IRT) approach. In this work we compare the performance of an IRT model defined first on a multidimensional ability space and then on a unidimensional one. The aim of the comparison is to assess the results obtained in the two situations through a simulation study in terms of student classification based on ability estimates. The comparison uses the two-parameter model defined within the more general framework of Generalized Linear Latent Variable Models (GLLVM), since it allows the inclusion of more than one ability (latent variable). The simulation highlights that the importance of the dimensionality of the ability space increases as the number of items referring to more than one ability increases.

Stefania Mignani, Paola Monari, Silvia Cagnone, Roberto Ricci

Multivariate Methods in Applied Science

Frontmatter

Economics

A Spatial Mixed Model for Sectorial Labour Market Data

A vast literature has been recently concerned with the analysis of variation in overdispersed counts across geographical areas. In this paper, we extend the univariate semiparametric models introduced by Biggeri et al. (2003) to the analysis of multiple spatial counts. The proposed approach is applied to modeling the geographical distribution of employees by economic sectors of the manufacturing industry in Teramo province (Abruzzo) during 2001.

Marco Alfó, Paolo Postiglione
The Impact of the New Labour Force Survey on the Employed Classification

Regulation n. 577/1998 of the European Council gives the rules to be used by Community countries to design and conduct the Labour Force Survey (LFS). In order to apply this regulation, the Italian LFS has been completely revised with regard to several aspects of the survey, such as frequency, definitions, questionnaire, survey design and interviewer network. All these changes caused a break in the time series of the main labour force estimates. The aim of this work is to describe and evaluate these differences and their impact on the classification of the employed.

Claudio Ceccarelli, Antonio R. Discenza, Silvia Loriga
Using CATPCA to Evaluate Market Regulation

One of the most interesting research areas in economics concerns the measurement of the relative competitiveness of different economic systems. Among the several proposed indicators, a particularly relevant one is the Product Market Regulation (PMR) indicator proposed by the OECD, calculated on the basis of a rich database. This paper uses the same database to compute alternative indicators. The main difference from the OECD indicator is that we propose a less invasive statistical methodology (CATPCA), suitable for the treatment of qualitative data. In addition, we remove several arbitrary manipulations of the basic data. The calculation delivers a new ranking of the 21 countries analyzed and some interesting new evidence.

Giuseppe Coco, Massimo Alfonso Russo
Credit Risk Management Through Robust Generalized Linear Models

In this work, a robust methodology is developed for the classification of a sample of small and medium firms on the basis of their default probability. The importance of this classification procedure is emphasized by the New Basel Capital Accord (Basel II) for the capital adequacy of internationally active banks. The Basel accord introduces the possibility of adopting internal rating models for the estimation of the default probability of banks’ customers. The reference framework of this paper is the class of generalized linear models, which allows units to be classified while avoiding strict assumptions such as those required by linear discriminant analysis. Another advantage of generalized linear models is the possibility of exploring different links between the expected value of the dependent variable and the linear predictor. Parameters are estimated using balance-sheet ratios and data coming from the Centrale dei Rischi for a set of firms which are customers of a medium-sized bank in Northern Italy. Finally, we perform a robust analysis of the model estimates through the forward search, in order to monitor the influence of outliers on the final classification.

Luigi Grossi, Tiziano Bellini
Classification of Financial Returns According to Thresholds Exceedances

The properties of a panel of financial time series are explored, with the aim of classifying market shares according to the behaviour of their extremal returns. Existing methods for optimal portfolio selection involve estimation of the correlation coefficient, whose properties as a measure of dependence in financial time series are questionable. Alternatively, for stationary processes of financial returns, the mean cluster size of threshold exceedances leads to a measure of extremal dependence more accurate than correlation. Further functionals that might help optimal portfolio selection are, for instance, the total loss incurred by a stock during an extreme event or the duration of a loss in a stress period. By combining functionals of financial returns it is possible to cluster shares properly and to set up a tool for portfolio selection. The performance of this method is assessed, through an application to real financial time series, by means of the standard Markowitz theory of optimal share selection.

Fabrizio Laurini
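
The mean cluster size of threshold exceedances can be computed with simple runs declustering: exceedances separated by more than r non-exceedances are assigned to different clusters. A sketch on a toy series; the run length r, the threshold level and the data-generating recipe are all assumptions, and this is only one of the functionals the abstract combines.

```python
import numpy as np

def mean_cluster_size(x, u, r=5):
    # runs declustering: exceedances of u separated by more than r
    # non-exceedances start a new cluster; the reciprocal of the mean
    # cluster size estimates the extremal index
    exc = np.flatnonzero(x > u)
    if exc.size == 0:
        return np.nan
    n_clusters = 1 + np.sum(np.diff(exc) > r)
    return exc.size / n_clusters

rng = np.random.default_rng(0)
e = rng.standard_t(df=4, size=2000)
x = np.abs(e) * (1 + 0.5 * np.abs(np.roll(e, 1)))   # crude volatility clustering
u = np.quantile(x, 0.95)
print("mean cluster size:", round(mean_cluster_size(x, u), 2))
```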

Environmental and Medical Sciences

Nonparametric Clustering of Seismic Events

In this paper we propose a clustering technique based on the maximization of the likelihood function defined from a generalization of a model for seismic activity (the ETAS model, Ogata (1988)), iteratively changing the partitioning of the events. In this context it is useful to apply models that distinguish between independent events (i.e. the background seismicity) and strongly correlated ones. This technique develops nonparametric estimation methods for the point process intensity function. To evaluate the goodness of fit of the model from which the clustering method is implemented, residual process analysis is used.

Giada Adelfio, Marcello Chiodi, Luciana De Luca, Dario Luzio
A Non-Homogeneous Poisson Based Model for Daily Rainfall Data

In this paper we report some results of the application of a new stochastic model to daily rainfall data. Poisson models, characterized only by the expected rate of events (impulse occurrences, that is, the mean number of impulses per unit time) and the assigned probability distribution of the phenomenon’s magnitude, do not take into consideration the duration of the occurrences, which is fundamental from a hydrological point of view. In order to describe the phenomenon in a way more consistent with its physical nature, we propose a new, simple and manageable model. This model takes into account another random variable, representing the duration of the rainfall due to the same occurrence. Estimated parameters of both models and related confidence regions are obtained.

Alberto Lombardo, Antonina Pirrotta
A Comparison of Data Mining Methods and Logistic Regression to Determine Factors Associated with Death Following Injury

A comparison of techniques for analysing trauma injury data collected over ten years at a hospital trauma unit in the U.K. is reported. The analysis compares four data mining techniques to determine factors associated with death following injury: a classification and regression tree algorithm, a classification algorithm, a neural network and logistic regression. In addition to the techniques within the data mining framework, conventional logistic regression modelling is included for comparison. Results are compared in terms of sensitivity, specificity, positive predictive value and negative predictive value.

Kay Penny, Thomas Chesney
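
The four reported measures come straight from the confusion matrix. A minimal sketch comparing two of the model families on synthetic data (the trauma registry itself is not public, so the dataset here is a stand-in):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.datasets import make_classification

# imbalanced synthetic outcome, loosely mimicking a mortality endpoint
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9],
                           random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("classification tree", DecisionTreeClassifier(max_depth=4))]:
    yhat = model.fit(Xtr, ytr).predict(Xte)
    tn, fp, fn, tp = confusion_matrix(yte, yhat).ravel()
    print(f"{name}: sens={tp/(tp+fn):.2f} spec={tn/(tn+fp):.2f} "
          f"ppv={tp/(tp+fp):.2f} npv={tn/(tn+fn):.2f}")
```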
Backmatter
Metadata
Title
Data Analysis, Classification and the Forward Search
Edited by
Prof. Sergio Zani
Prof. Andrea Cerioli
Prof. Marco Riani
Prof. Maurizio Vichi
Copyright year
2006
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-35978-4
Print ISBN
978-3-540-35977-7
DOI
https://doi.org/10.1007/3-540-35978-8