## About this book

This volume presents theoretical developments, applications and computational methods for analysis and modeling in the behavioral and social sciences, where data are usually complex to explore and investigate. The contributions connect statistical methodology with the social domain, paying particular attention to computational issues, in order to address complicated data analysis problems effectively.
The papers in this volume stem from contributions initially presented at the joint international meeting JCS-CLADAG held in Anacapri (Italy) where the Japanese Classification Society and the Classification and Data Analysis Group of the Italian Statistical Society had a stimulating scientific discussion and exchange.

## Table of contents

### Time-Frequency Filtering for Seismic Waves Clustering

This paper introduces a new technique for clustering seismic events based on processing, in the time-frequency domain, the waveforms recorded by seismographs. Clusters of waveforms are detected by a *k*-means-like algorithm which analyzes, at each iteration, the time-frequency content of the signals in order to optimally remove the non-discriminant components that could compromise the grouping of waveforms. This step is followed by the allocation step and by the computation of the cluster centroids on the basis of the filtered signals. The effectiveness of the method is shown on a real dataset of seismic waveforms.

### Modeling Longitudinal Data by Latent Markov Models with Application to Educational and Psychological Measurement

I review a class of models for longitudinal data, showing how it may be applied in a meaningful way to the analysis of data collected by administering a series of items intended for educational or psychological measurement. In this class of models, the unobserved individual characteristics of interest are represented by a sequence of discrete latent variables, which follows a Markov chain. Inferential problems involved in the application of these models are discussed, considering in particular maximum likelihood estimation based on the Expectation-Maximization algorithm, model selection, and hypothesis testing. Most of these problems are common to hidden Markov models for time-series data. The approach is illustrated by different applications in education and psychology.

Francesco Bartolucci

### Clustering of Stratified Aggregated Data Using the Aggregate Association Index: Analysis of New Zealand Voter Turnout (1893–1919)

Recently, the Aggregate Association Index (AAI) was proposed to identify the likely association structure between two dichotomous variables of a 2×2 contingency table when only aggregate, or equivalently the marginal, data are available. In this paper we shall explore the utility of the AAI and its features for analysing gendered New Zealand voting data in 11 national elections held between 1893 and 1919. We shall demonstrate that, by using these features, one can identify clusters of homogeneous electorates that share similar voting behaviour between the male and female voters. We shall also use these features to compare the association between gender and voting behaviour across each of the 11 elections.

Eric J. Beh, Duy Tran, Irene L. Hudson, Linda Moore

### Estimating a Rasch Model via Fuzzy Empirical Probability Functions

The joint maximum likelihood estimation of the parameters of the Rasch model is hampered by several drawbacks, the most relevant of which are that: (1) the estimates are not available for items or persons with perfect scores; (2) the item parameter estimates are severely biased, especially for short tests. To overcome both these problems, this paper proposes a new method, based on a fuzzy extension of the empirical probability function and the minimum Kullback–Leibler divergence estimation approach. The new method guarantees the existence of finite estimates for both person and item parameters and proves very effective in reducing the bias of joint maximum likelihood estimates.

Lucio Bertoli-Barsotti, Tommaso Lando, Antonio Punzo
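As background for the abstract above, the following is a minimal sketch of the standard dichotomous Rasch response probability, not of the fuzzy estimation method the chapter proposes. The function name `rasch_probability` and the parameter names `theta` (person ability) and `b` (item difficulty) are illustrative assumptions.

```python
from math import exp


def rasch_probability(theta, b):
    """Probability of a correct response under the dichotomous Rasch model.

    theta: person ability (hypothetical parameter name)
    b:     item difficulty (hypothetical parameter name)
    """
    # Logistic function of the difference between ability and difficulty.
    return 1.0 / (1.0 + exp(-(theta - b)))


# A person whose ability equals the item difficulty succeeds half the time.
p_equal = rasch_probability(0.0, 0.0)
# Higher ability on the same item gives a higher success probability.
p_higher = rasch_probability(2.0, 0.0)
```

Because this probability increases monotonically in `theta`, the likelihood of a person who answers every item correctly keeps growing as `theta` grows, which is why no finite joint maximum likelihood estimate exists for perfect scores.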

### Scale Reliability Evaluation for A-Priori Clustered Data

According to classical measurement theory, the reliability of a set of indicators related to a latent variable describing a true measure can be assessed through Cronbach's α index. Cronbach's α can be used for τ-equivalent measures and for parallel measures, and represents a lower bound for the reliability value in the presence of congeneric measures, for which the assessment can properly be made only ex post, once the loading coefficients have been estimated, e.g. by means of a structural equation model with latent variables. Assuming the existence of an a priori segmentation based upon a categorical variable *Z*, we test whether the construct is reliable across all groups. In this case the measurement model is the same across groups, which means that loadings are equal within each group and do not vary across groups. A formulation of Cronbach's α coefficient is considered according to the decomposition of pairwise covariances in a clustered framework, and a test procedure assessing the possible presence of congeneric measures in a multigroup framework is proposed.

Giuseppe Boari, Gabriele Cantaluppi, Marta Nai Ruscone
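For readers unfamiliar with the index discussed above, here is a minimal sketch of the classical single-group Cronbach's α, computed from item variances and the variance of the total score. It is not the clustered formulation or the test procedure proposed in the chapter; the function name `cronbach_alpha` and the row-per-respondent data layout are assumptions made for illustration.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Classical Cronbach's alpha.

    scores: list of rows, one per respondent, each row holding the
            respondent's scores on the k items (assumed layout).
    """
    k = len(scores[0])
    # Sample variance of each item (column).
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of the total score per respondent.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Two perfectly parallel items yield alpha = 1.
alpha_parallel = cronbach_alpha([[1, 1], [2, 2], [3, 3]])
```

The formula uses only item variances and the total-score variance, so it needs no estimated loadings; that is also why, as the abstract notes, it gives only a lower bound when the measures are congeneric.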

### An Evaluation of Performance of Territorial Services Center (TSC) by a Nonparametric Combination Ranking Method

The IQuEL Italian Project

This work presents some results from the national project IQuEL-2010, which aimed to address problems associated with the digital divide through Territorial Services Centers (TSCs). A dedicated survey was carried out on a sample of local operators in three Italian provinces (Padova, Parma, Pesaro-Urbino). We applied a nonparametric combination (NPC) ranking method to a set of nine dimensions related to the public services supplied. The results show important differences among the provinces for at least six of the nine TSC abilities or performances, and yield a global satisfaction ranking.

Mario Bolzan, Livio Corain, Valeria De Giuli, Luigi Salmaso

### A New Index for the Comparison of Different Measurement Scales

In the psychometric sciences, a common problem is the choice of a good response scale. Every scale has, by its nature, a propensity to lead a respondent towards mainly positive or mainly negative ratings. This paper investigates possible causes of the discordance between two ordinal scales evaluating the same goods or services. In the psychometric literature, Cohen's Kappa, in particular in its weighted version, is one of the most important indices for evaluating the strength of agreement, or disagreement, between two nominal variables. In this paper, a new index is proposed. A procedure to determine the lower and upper triangle in a non-square table is also implemented, so as to generalize the index to the comparison of two scales with a different number of categories. A test is set up to verify whether one scale tends to produce different ratings from another. A study with real data is conducted.

Andrea Bonanomi
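The abstract above takes Cohen's Kappa as its point of reference. As a reminder of what that baseline computes, here is a minimal sketch of the classical unweighted Kappa for a square table of counts. It is not the new index proposed in the chapter, and the function name `cohen_kappa` and the nested-list table layout are illustrative assumptions.

```python
def cohen_kappa(table):
    """Unweighted Cohen's kappa from a square agreement table of counts.

    table[i][j]: number of items rated category i on the first scale
                 and category j on the second (assumed layout).
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement: share of counts on the main diagonal.
    p_obs = sum(table[i][i] for i in range(k)) / n
    # Marginal proportions of each scale.
    row_prop = [sum(row) / n for row in table]
    col_prop = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # Agreement expected by chance if the two ratings were independent.
    p_exp = sum(r * c for r, c in zip(row_prop, col_prop))
    return (p_obs - p_exp) / (1 - p_exp)


kappa = cohen_kappa([[20, 5], [10, 15]])
```

This version requires a square table, which is the restriction the chapter's procedure for non-square tables is meant to relax.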

### Asymmetries in Organizational Structures

Relationships in organizational structures are frequently asymmetric (e.g., the number of e-mail messages that an employee sends to a colleague is usually different from the number of e-mail messages he or she receives from that colleague). Organizational data are therefore usually represented by asymmetric square matrices that cannot be analyzed by standard symmetric approaches. For this reason, methods based on Singular Value Decomposition and Asymmetric Multidimensional Scaling were proposed to analyze these types of matrices. In many situations information concerning hierarchies or aggregations in the organizational structure is available and can be used in the analysis of the data (e.g., professional levels or department membership). In this paper three-way unfolding is proposed to take this additional information into consideration and is applied to Krackhardt (Social Networks 9:109–134, 1987) data on advice-giving and getting in an organization.

Giuseppe Bove

### A Generalized Additive Model for Binary Rare Events Data: An Application to Credit Defaults

We propose a new model for binary rare events, i.e. a binary dependent variable with a very small number of ones. We extend the Generalized Extreme Value (GEV) regression model proposed by Calabrese and Osmetti (Journal of Applied Statistics 40(6):1172–1188, 2013) to a Generalized Additive Model (GAM). We suggest using the quantile function of the GEV distribution as a link function in a GAM, which gives the Generalized Extreme Value Additive (GEVA) model. In order to estimate the GEVA model, a modified version of the local scoring algorithm of GAM is proposed. Finally, to model default probability, we apply our proposal to empirical data on Italian Small and Medium Enterprises (SMEs). The results show that the GEVA model has a higher predictive accuracy in identifying the rare event than the logistic additive model.

Raffaella Calabrese, Silvia Angela Osmetti

### The Meaning of forma in Thomas Aquinas: Hierarchical Clustering from the Index Thomisticus Treebank

We apply word hierarchical clustering techniques to collect the occurrences of the lemma *forma* that show a similar contextual behaviour in the works of Thomas Aquinas into the same or closely related groups. Our results will support the lexicographers of a data-driven new lexicon of Thomas Aquinas in their task of writing the lexical entry of *forma*. We use two datasets: the *Index Thomisticus* (IT), a corpus containing the opera omnia of Thomas Aquinas, and the *Index Thomisticus* Treebank, a syntactically annotated subset of the IT.

Results are evaluated against a manually labeled subset of the occurrences of *forma*.

Gabriele Cantaluppi, Marco Passarotti

### The Estimation of the Parameters in Multi-Criteria Classification Problem: The Case of the Electre Tri Method

In this work we address the estimation of the parameters of the well-known Electre Tri method, used to model the ordinal sorting problem. This is a multi-criteria classification problem in which the classes are in a strict preference relation. The parameters are profiles, thresholds, weights, and cutting level; they are linked to each other, either directly or indirectly, by mathematical relations. We propose a new two-phase procedure, taking into account that the core of the analysis is the estimation of the profiles, which is obtained by solving a linear programming problem.

Renato De Leone, Valentina Minnetti

### Dynamic Clustering of Financial Assets

In this work we propose a procedure for time-varying clustering of financial time series. We use a dissimilarity measure based on the lower tail dependence coefficient, so that the resulting groups are homogeneous in the sense that the joint bivariate distributions of two series belonging to the same group are highly associated in the lower tail. In order to obtain a dynamic clustering, tail dependence coefficients are estimated by means of copula functions with a time-varying parameter. The basic assumption for the dynamic pattern of the copula parameter is the existence of an association between tail dependence and the volatility of the market. A case study with real data is examined.

Giovanni De Luca, Paola Zuccolotto

### A Comparison of χ² Metrics for the Assessment of Relational Similarities in Affiliation Networks

Factorial techniques are widely used in Social Network Analysis to analyze and visualize networks. When the purpose is to represent the relational similarities, simple correspondence analysis is the most frequently used technique. However, in the case of affiliation networks, its use can be criticized because the χ² distance involved does not adequately reflect the actual relational patterns. In this paper we perform a simulation study to compare the metric involved in Correspondence Analysis with the one in Multiple Correspondence Analysis. Analytical results and simulation outcomes show that Multiple Correspondence Analysis allows a proper graphical appraisal of the underlying two-mode relational structure.

Maria Rosaria D’Esposito, Domenico De Stefano, Giancarlo Ragozini

### Influence Diagnostics for Meta-Analysis of Individual Patient Data Using Generalized Linear Mixed Models

In meta-analysis, generalized linear mixed models (GLMMs) are usually used when heterogeneity is present and individual patient data (IPD) are available, and they accept binary, discrete and continuous response variables. In the present paper some measures of influence diagnostics based on the log-likelihood are suggested and discussed. A known measure is approximated to obtain a simpler form, for which the information matrix is no longer necessary. The performance of the proposed measure is assessed through a diagnostic analysis on simulated data reproducing a possible meta-analytical context of IPD with influential outliers. The proposed measure is shown to work well and to have a form similar to the recently introduced gradient statistic.

Marco Enea, Antonella Plaia

### Social Networks as Symbolic Data

Starting from the main idea of Symbolic Data Analysis to extend Statistics and Data Mining methods from first-order to second-order objects, we focus on network data, as defined in the framework of Social Network Analysis, to define a graph structure and the underlying network in the context of complex data objects. A Network Symbolic description is defined according to the statistical characterization of the network topological properties. We use suitable network measures, which are represented by means of symbolic variables. Their study through multidimensional data analysis allows for the synthetic representation of a network as a point in a metric space. The proposed approach is discussed on the basis of a simulation study considering three classical network growth processes.

Giuseppe Giordano, Paula Brito

### Statistical Assessment for Risk Prediction of Endoleak Formation After TEVAR Based on Linear Discriminant Analysis

Over the past decade, therapy for thoracic aneurysms involving the use of a stent-graft has gained popularity as an alternative to surgical treatment. This therapy is considered to be safe and efficient, and achieves satisfactory short-to-midterm results. However, a clinical side effect called endoleak has often been observed after this alternative therapy. Based on the empirical findings of doctors, it is believed that the risk of endoleak formation increases if a stent-graft is inserted into a part of the aorta that shows large curvature on the patient's aortic angiography. To understand the relationship between this risk and the aortic curvature, we set up a two-class discriminant problem involving no-endoleak and endoleak groups, and apply linear discriminant analysis to it with a quantitative dataset associated with the curvature of the aortic angiography and the insertion position of the stent-graft. Next, we propose a diagnostic procedure based on the sign of the sample influence function for the average discriminant score in each class. In addition, we apply the proposed diagnostic procedure to the prediction result of the two-class linear discriminant analysis, and detect highly influential individuals in order to improve the prediction accuracy for the endoleak group. With our approach, we determine the relation between the curvature of the aorta and the risk of endoleak formation.

Kuniyoshi Hayashi, Fumio Ishioka, Bhargav Raman, Daniel Y. Sze, Hiroshi Suito, Takuya Ueda, Koji Kurihara

### Fuzzy c-Means for Web Mining: The Italian Tourist Forum Case

E-tourism is growing steadily and becoming one of the leading sectors in the *e*-commerce world. Social media and mobile technologies are playing an increasingly important role in the procurement processes of tourism, both by providing access to real-time information and by promoting the exchange of experiences. Web mining allows the collection of new unstructured data and the building of users' profiles based on electronic word of mouth. We apply a soft approach to resolve lexical ambiguity and build a vocabulary for the tourism sector. Specifically, we propose a new version of the fuzzy *c*-means algorithm to detect the best cluster centroids, and we choose the final partition according to three validation indices (the partition coefficient, the classification entropy, and the Xie-Beni index). We use this method to classify 525 posts published on the Italian tourism forum from January 2010 to April 2012.

Domenica Fioredistella Iezzi, Mario Mastrangelo
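Two of the three validation indices named in the abstract above depend only on the fuzzy membership matrix, so they are easy to illustrate. The sketch below uses their usual textbook definitions and is not the authors' modified fuzzy *c*-means algorithm; the function names and the row-per-object membership layout are assumptions.

```python
from math import log


def partition_coefficient(u):
    """Mean of squared memberships; 1 for a crisp partition, 1/c if uniform.

    u: list of rows, one per object, each row holding that object's
       membership degrees in the c clusters (assumed layout, rows sum to 1).
    """
    return sum(m * m for row in u for m in row) / len(u)


def classification_entropy(u):
    """Mean membership entropy; 0 for a crisp partition, log(c) if uniform."""
    # Zero memberships contribute nothing (limit of m * log(m) as m -> 0).
    return -sum(m * log(m) for row in u for m in row if m > 0) / len(u)


crisp = [[1.0, 0.0], [0.0, 1.0]]
fuzzy = [[0.5, 0.5], [0.5, 0.5]]
pc_crisp = partition_coefficient(crisp)
ce_fuzzy = classification_entropy(fuzzy)
```

Under these definitions a higher partition coefficient and a lower classification entropy both point to a less fuzzy partition. The Xie-Beni index also needs the data and the centroids, so it is left out of this sketch.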

### On Joint Dimension Reduction and Clustering of Categorical Data

There exist several methods for clustering high-dimensional data. One popular approach is to use a two-step procedure. In the first step, a dimension reduction technique is used to reduce the dimensionality of the data. In the second step, cluster analysis is applied to the data in the reduced space. This method may be referred to as the tandem approach. An important drawback of this method is that the dimension reduction may distort or hide the cluster structure. As an alternative, various authors have proposed joint dimension reduction and clustering approaches. In this paper we review some of these existing joint dimension reduction and clustering methods for categorical data in a unified framework that facilitates comparison.

Alfonso Iodice D’Enza, Michel Van de Velden, Francesco Palumbo

### A SVM Applied Text Categorization of Academia-Industry Collaborative Research and Development Documents on the Web

A method for automatically extracting Japanese documents describing University-Industry (U-I) relations from the Web is proposed. The proposed method consists of Japanese text processing and support vector machine (SVM) classification. The SVM feature selection was customized for U-I relations documents. The best experimental result was an accuracy of 79.95 and an F-measure of 81.17.

Kei Kurakawa, Yuan Sun, Nagayoshi Yamashita, Yasumasa Baba

### Dynamic Customer Satisfaction and Measure of Trajectories: A Banking Case

The most important company asset seems to be Customer Satisfaction (CS), which banks have frequently analyzed in recent years. To this end, dynamic Factor Analysis offers an effective way of merging information about clients and the evolution of their preferences. In our work we performed a dynamic Customer Satisfaction study by means of a three-way factorial analysis, and we also introduced a new index of shift and shape (SSI) to synthesize information about every customer, cluster or typology. We considered the case of a national bank with a widespread network, evaluating results provided by a questionnaire framed according to the SERVQUAL model. The information employed was obtained via a Customer Satisfaction survey repeated three times (waves). We fitted the dynamic factorial model and illustrated the use of the SSI as a new measure of the dissimilarity of trajectories. Finally, we present results that highlight the promising performance of our index.

Caterina Liberati, Paolo Mariani

### The Analysis of Partnership Networks in Social Planning Processes

This article focuses on using social network analysis (SNA) to evaluate social planning. A case study on a regional policy, the Territorial Youth Plans (PTGs), is presented. PTGs were introduced by the Campania Region (Southern Italy) in 2009 as an effort to reform the regional youth policies to foster increased participation in decision-making processes. In our case study, we use a combination of SNA tools and multivariate data analysis techniques to analyze *if* and *how* the structure of interactions between actors at the local level shapes the quality of planning in terms of coherence and innovation.

The relational data have been gathered through the analysis of official documents. Attention has been directed to the following aspects: (1) describing the network characteristics, (2) detecting networks with a typical and homogeneous structure configuration, and (3) determining the relationship between several network structure configurations and different forms of social planning, assuming that relational structures of networks shape the coherence of social planning activity and local innovation capacity.

Rosaria Lumino, Concetta Scolorato

### Evaluating the Effect of New Brand by Asymmetric Multidimensional Scaling

Brand switching data among potato snacks are analyzed to evaluate a new brand. The brand switching matrix among existing brands is analyzed by asymmetric multidimensional scaling. The analysis shows the outward and the inward tendencies of existing brands, which tell the strength of switching from the corresponding brand to the other brands and the strength of switching to the corresponding brand from the other brands respectively. The inward tendency of a new brand is estimated by analyzing the brand switching data from the existing brands to the new brand obtained soon after its introduction based on the outward tendency of existing brands. The estimated inward tendency of the new brand is compared with those derived by analyzing the brand switching matrix including the new brand obtained 2 months after the introduction of the new brand. The comparison shows the estimated inward tendency is similar to the actually derived one.

### Statistical Characterization of the Virtual Water Trade Network

The water that is used in the production process of a product (a supply, commodity or service) is called the “virtual water” contained in the product. If one country (or region, company, individual, etc.) exports a water intensive product to another country, it exports water in virtual form. Virtual water trade as both a policy instrument and practical means to balance the local, national and global water budget has received much attention in recent years. Several studies have been conducted by researchers from various disciplines including engineers, economists and demographers. The aim of this paper is to improve the statistical characterization of the virtual water flow networks by suggesting a statistical modeling approach for examining their stochastic properties.

Alessandra Petrucci, Emilia Rocco

### A Pre-specified Blockmodeling to Analyze Structural Dynamics in Innovation Networks

In recent decades economic theory has highlighted the benefits produced by networks of organizations in fostering innovation. A number of public policies were put in place to favor these innovation networks throughout Europe. The top-down institution of a number of specialized technological districts in Italy in the mid-2000s has been one of the main outcomes of this new wave of policies. The aim of this paper is to explore what impact the institution of technological districts had on collaborative patterns over time. Using pre-specified blockmodeling, observed network configurations obtained from the co-participation in R&D projects undertaken by organizations involved in a technological district are compared with a theoretical core-periphery structure over an 8-year time interval. The analyses of networks over time show that collaborative patterns have evolved from a core-periphery structure towards a complete network in which each research group is connected with the others.

Laura Prota, Maria Prosperina Vitale

### The RCI as a Measure of Monotonic Dependence

This paper provides a statistical interpretation of a recent measure, called the "*Rank-based Concordance Index*" (*RCI*), in terms of a monotonic dependence relationship between a non-negative dependent variable and a quantitative independent one. Due to its rank-based construction, the measure presents properties and features that make it suitable also in an ordinal context of analysis. In applied research many data sets contain observations from ordinal variables rather than continuous ones. In such situations, the study of the dependence relationship among variables represents an interesting issue, since ordinal variables are not specified according to a metric scale. The proposal discussed here can thus contribute to solving this problem.

Emanuela Raffinetti, Pier Alda Ferrari

### A Value Added Approach in Upper Secondary Schools of Lombardy by OECD-PISA 2009 Data

In the last decade a great deal of interest at the national and international level has been shown in measuring the impact of schools on student achievement. Standardized tests and the Value Added Methodology have emerged as the appropriate instruments for this purpose. The aim of this paper is to find a value added measure for upper secondary schools of the Lombardy region from the OECD-PISA 2009 data. The initial cognitive level of the student, which is necessary for the analysis, has been obtained by summarizing different teachers' evaluations through a Rasch analysis. A multilevel model has been fitted to control for the student and school factors affecting the reading results. In particular, the reading enjoyment variable has also been considered, since it explains a large share of the variability in student performance. The ranking of the upper secondary schools based on the value added measures is compared with the one obtained using raw data, showing significantly different results.

Isabella Romeo, Brunella Fiore

### Algorithmic-Type Imputation Techniques with Different Data Structures: Alternative Approaches in Comparison

In recent years, with the widespread availability of large datasets from multiple sources, increasing attention has been devoted to the treatment of missing information. Recent approaches have paved the way to the development of powerful new algorithmic techniques, in which imputation is performed through computer-intensive procedures. Although most of these approaches are attractive for many reasons, less attention has been paid to the problem of which method should be preferred according to the data structure at hand. This work addresses the problem by comparing the two methods *missForest* and *IPCA* with a new method we developed within the forward imputation approach. We carried out comparisons by considering different data patterns with varying skewness and correlation of variables, in order to ascertain in which situations a given method produces more satisfactory results.

Nadia Solaro, Alessandro Barbiero, Giancarlo Manzi, Pier Alda Ferrari

### Changes in Japanese EFL Learners’ Proficiency: An Application of Latent Rank Theory

In the present study the authors compared achievements of Japanese learners of English as a foreign language (EFL) at the end of 6 years of formal instruction based on their test performance in the English section of the National Center Tests for University Admissions administered in 1990, 1997, and 2004. Direct comparisons were made possible by equating the scales of these three tests using the common subject design. In addition to 121 Japanese EFL learners who took the tests prepared by the researchers for the equating purpose, 10,000 cases were randomly sampled from each year's actual test-takers. Their test performance was analyzed based on the Latent Rank Theory (Shojima, Neural test theory: A latent rank theory for analyzing test data (DNC Research Note, 08-01). Retrieved from http://www.rd.dnc.ac.jp/~shojima/ntt/Shojima2008RN08-01.pdf, 2008; Shojima, Neural test theory. In: K. Shigemasu, A. Okada, T. Imaizumi, & T. Hoshino (Eds.), *New trends in psychometrics*, pp. 417–426, 2009. Tokyo: University Academic Press; Shojima, Neural test theory. In: M. Ueno & K. Shojima (Eds.), *Gakushu Hyoka no Shin-choryu [New trends in evaluation of learning]*, pp. 83–111, 2010. Tokyo: Asakura Shoten). The results indicate that the test-takers in 1990 are unique both in their membership to the latent ranks and in the knowledge that characterizes the high-achievers. Implications of the present study are discussed in the last section.

Naoki Sugino, Kojiro Shojima, Hiromasa Ohba, Kenichi Yamakawa, Yuko Shimizu, Michiko Nakano

### Robustness and Stability Analysis of Factor PD-Clustering on Large Social Data Sets

Factor clustering methods were proposed to cluster large data sets. Among them, factor probabilistic distance clustering (FPDC) shows interesting performance. The method is based on two main steps: a Tucker3 decomposition of the distance array and probabilistic distance (PD) clustering on the resulting factors. The aim of this paper is to apply FPDC to large behavioral and social data sets, to obtain homogeneous and well-separated clusters of individuals. The goal is to evaluate the stability and the robustness of the method when dealing with these data sets. Stability refers to the invariance of the results across iterations of the method. Robustness refers to the sensitivity of the method to errors in the data. These characteristics of the method are evaluated using bootstrap resampling.

Cristina Tortora, Marina Marino

### A Box-Plot and Outliers Detection Proposal for Histogram Data: New Tools for Data Stream Analysis

In this paper, we propose a method for monitoring the evolution of data described by histograms of values. Our proposal consists in defining new order statistics on the quantile functions associated with the empirical distributions represented by the histogram data. We introduce the Median, the First and the Third Quartile quantile functions, as well as a generalized representation of the box-and-whiskers plot. For example, the proposed representations and indices are useful for identifying and classifying outliers arriving over time in a data stream environment.

Rosanna Verde, Antonio Irpino, Lidia Rivoli

### Assessing Cooperation in Open Systems: An Empirical Test in Healthcare

This paper aims to detect the social mechanisms underlying cooperation in organizational communities. To this end, it proposes to apply a longitudinal Social Network Analysis approach based on Stochastic Actor-Oriented Models for network dynamics to Web 2.0 data on interpersonal interaction. The paper claims and demonstrates that such an approach alleviates some limitations of current studies. It overcomes the issue of relational missing data. It also directly models the network structure as the outcome of actors' selection of counterparts in their neighbourhood. The application concerns a virtual community of Italian oncologists who collaborate in resolving diagnoses. Using repository and field data, we reconstruct a network, with clinicians as nodes and emails exchanged as ties. Then, we model cooperation longitudinally. Evidence is provided that emergent behaviors are effectively captured, and the advantages of this approach are discussed.

Paola Zappa