
2012 | Book

Advanced Statistical Methods for the Analysis of Large Data-Sets

Editors: Agostino Di Ciaccio, Mauro Coli, José Miguel Angulo Ibáñez

Publisher: Springer Berlin Heidelberg

Book Series: Studies in Theoretical and Applied Statistics


About this book

The theme of the meeting was “Statistical Methods for the Analysis of Large Data-Sets”. In recent years there has been growing interest in this subject: huge quantities of information are often available, but standard statistical techniques are usually not well suited to managing data of this kind. The conference served as an important meeting point for European researchers working on the topic, and a number of European statistical societies participated in the organization of the event.

The book includes 45 papers, selected from the 156 accepted for presentation and discussed at the conference on “Advanced Statistical Methods for the Analysis of Large Data-Sets”.

Table of Contents

Frontmatter

Clustering Large Data-Sets

Frontmatter
Clustering Large Data Set: An Applied Comparative Study

The aim of this paper is to analyze different strategies for clustering large data sets drawn from a social context. Effective and efficient clustering methods for large databases have been investigated only in recent years, with the emergence of the data mining field. In this paper a sequential approach based on a multiobjective genetic algorithm is proposed as the clustering technique. The proposed strategy is applied to a real-life data set of approximately 1.5 million workers, and the results are compared with those obtained by other methods in order to identify an unambiguous partitioning of the data.

Laura Bocci, Isabella Mingo
Clustering in Feature Space for Interesting Pattern Identification of Categorical Data

Standard clustering methods fail when data are characterized by non-linear associations. A suitable solution consists in mapping data in a higher dimensional feature space where clusters are separable. The aim of the present contribution is to propose a new technique in this context to identify interesting patterns in large datasets.

Marina Marino, Francesco Palumbo, Cristina Tortora
Clustering Geostatistical Functional Data

In this paper we present two strategies for clustering spatially dependent functional data. The first aims to classify spatially dependent curves and to obtain a spatio-functional model prototype for each cluster. It is based on a Dynamic Clustering Algorithm built on an optimization problem that minimizes the spatial variability among the curves in each cluster. The second looks simultaneously for an optimal partition of a spatial functional data set and for a set of bivariate functional regression models associated with each cluster. These models take into account both the interactions among different functional variables and the spatial relations among the observations.

Elvira Romano, Rosanna Verde
Joint Clustering and Alignment of Functional Data: An Application to Vascular Geometries

We show an application of the k-mean alignment method presented in Sangalli et al. (Comput. Stat. Data Anal. 54:1219–1233). This is a method for the joint clustering and alignment of functional data that places in a unique framework two widely used methods of functional data analysis: Procrustes continuous alignment and functional k-mean clustering. These two methods turn out to be special cases of the new method. In detail, we use this algorithm to analyze 65 internal carotid arteries in relation to the presence and rupture of cerebral aneurysms. Some interesting issues pointed out by the analysis and amenable to a biological interpretation are briefly discussed.

Laura M. Sangalli, Piercesare Secchi, Simone Vantini, Valeria Vitelli

Statistics in Medicine

Frontmatter
Bayesian Methods for Time Course Microarray Analysis: From Genes’ Detection to Clustering

Time-course microarray experiments are an increasingly popular approach for understanding the dynamical behavior of a wide range of biological systems. In this paper we discuss some recently developed functional Bayesian methods specifically designed for time-course microarray data. The methods allow one to identify differentially expressed genes, to rank them, to estimate their expression profiles and to cluster the genes associated with the treatment according to their behavior across time. The methods successfully deal with various technical difficulties that arise in this type of experiment, such as a large number of genes, a small number of observations, non-uniform sampling intervals, missing or multiple data, and temporal dependence between observations for each gene. The procedures are illustrated using both simulated and real data.

Claudia Angelini, Daniela De Canditiis, Marianna Pensky
Longitudinal Analysis of Gene Expression Profiles Using Functional Mixed-Effects Models

In many longitudinal microarray studies, the gene expression levels in a random sample are observed repeatedly over time under two or more conditions. The resulting time courses are generally very short, high-dimensional, and may have missing values. Moreover, for every gene, a certain amount of variability in the temporal profiles, among biological replicates, is generally observed. We propose a functional mixed-effects model for estimating the temporal pattern of each gene, which is assumed to be a smooth function. A statistical test based on the distance between the fitted curves is then carried out to detect differential expression. A simulation procedure for assessing the statistical power of our model is also suggested. We evaluate the model performance using both simulations and a real data set investigating the human host response to BCG exposure.

Maurice Berk, Cheryl Hemingway, Michael Levin, Giovanni Montana
A Permutation Solution to Compare Two Hepatocellular Carcinoma Markers

In the medical literature, alpha-fetoprotein (AFP) is the most commonly used marker for hepatocellular carcinoma (HCC) diagnosis. Some studies have shown an over-expression of insulin-like growth factor (IGF)-II in HCC tissue, especially in small HCC. Against this background, our study investigates the diagnostic utility of IGF-II in HCC. Serum levels of IGF-II and AFP were determined in 96 HCC patients, 102 cirrhotic patients and 30 healthy controls. The application of the NPC test, stratified for small and large tumours, showed that IGF-II and AFP levels in HCC were significantly higher than in cirrhotic patients and controls, while IGF-II levels in cirrhotic patients were significantly lower than in controls. The optimal cut-off values for diagnosing HCC were determined with ROC curves. The sensitivity, specificity and diagnostic accuracy of AFP and IGF-II were estimated for the diagnosis of HCC and, subsequently, of small or large HCC. Jointly using the two markers significantly increases diagnostic accuracy and sensitivity while retaining high specificity. Serum IGF-II can therefore be considered a useful tumour marker to be used jointly with AFP, especially for the diagnosis of small HCC.

Agata Zirilli, Angela Alibrandi

Integrating Administrative Data

Frontmatter
Statistical Perspective on Blocking Methods When Linking Large Data-sets

The combined use of data from different sources is widespread. Record linkage is a complex process aimed at recognizing the same real-world entity represented differently across data sources. Many problems, both computational and statistical, arise when dealing with large data-sets. The well-known blocking methods can reduce the number of record comparisons to a manageable size. In this context, research and debate are very animated among information technology scientists; by contrast, the statistical implications of different blocking methods are often neglected. This work highlights the advantages and disadvantages of the main blocking methods for successfully carrying out a probabilistic record linkage process on large data-sets, stressing the statistical point of view.

Nicoletta Cibella, Tiziana Tuoto
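As a minimal sketch of the blocking idea discussed in this abstract: records are compared only within blocks that share a key, so the number of candidate pairs drops from the full quadratic cross product to the sum of within-block pairs. The record fields and the blocking key below are hypothetical, chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    # Toy blocking key: first letter of the surname plus year of birth.
    return (record["surname"][0].lower(), record["birth_year"])

def candidate_pairs(records, key=blocking_key):
    """Return index pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        blocks[key(rec)].append(i)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs

records = [
    {"surname": "Rossi",   "birth_year": 1970},
    {"surname": "Russo",   "birth_year": 1970},
    {"surname": "Rossi",   "birth_year": 1970},
    {"surname": "Bianchi", "birth_year": 1985},
]
pairs = candidate_pairs(records)
full = len(records) * (len(records) - 1) // 2
print(len(pairs), "of", full)  # blocking keeps only 3 of the 6 possible pairs
```

The statistical trade-off the chapter stresses is visible even here: the pair (Rossi, Russo) survives only because the two surnames share an initial, so a poorly chosen key silently drops true matches.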
Integrating Households Income Microdata in the Estimate of the Italian GDP

National accounts statistics are the result of the integration of several data sources. At present, in Italy, sample survey data on household income are not included in the estimation process of national accounts aggregates. In this paper we investigate the possibility of using such data within an independent estimate of GDP, based on the income approach. The aim of the paper is to assess whether (and to what extent) sample survey microdata on household income may contribute to the estimate of GDP. To this end, survey variables are recoded and harmonized according to national accounting concepts and definitions in order to point out discrepancies or similarities with respect to national accounts estimates. The analysis focuses particularly on compensation of employees. Applications are based on the EU statistics on income and living conditions and on the Bank of Italy survey on income and wealth.

Alessandra Coli, Francesca Tartamella
The Employment Consequences of Globalization: Linking Data on Employers and Employees in the Netherlands

Globalization – or the increased interconnectedness of nations, peoples and economies – is often illustrated by the strong growth of international trade, foreign direct investment (FDI) and multinational enterprises (MNEs). At the moment, more firms, in more industries and countries than ever before, are expanding abroad through direct investment and trade. The advent of globalization has been paired with intense debates among policy makers and academics about its consequences for a range of social issues related to employment, labor conditions, income equality and overall human wellbeing. On the one hand, the growing internationalization of production may lead to economic growth, increased employment and higher wages. In setting up affiliates and hiring workers, MNEs directly and indirectly affect employment, wages and labor conditions in host countries (see e.g. Driffield 1999; Görg 2000; and Radosevic et al. 2003). On the other hand, fears are often expressed that economic growth may be decoupled from job creation, partly due to increased competition from low-wage countries, or through outsourcing and offshoring activities of enterprises (Klein 2000; Korten 1995). These concerns about the employment consequences of globalization are not entirely unwarranted, as studies by Kletzer (2005) and Barnet and Cavenagh (1994) have shown.

Fabienne Fortanier, Marjolein Korvorst, Martin Luppes
Applications of Bayesian Networks in Official Statistics

In this paper recent results on the application of Bayesian networks to official statistics are presented. Bayesian networks are multivariate statistical models able to represent and manage complex dependence structures. Here they are proposed as a useful and unified framework for dealing with many problems typical of survey data analysis. In particular, we focus on categorical variables and show how to derive classes of contingency table estimators in the case of stratified sampling designs. With this technology, poststratification, integration and missing data imputation become possible. Furthermore, we briefly discuss how to use Bayesian networks for decision support, to monitor and manage the data production process.

Paola Vicard, Mauro Scanu

Outliers and Missing Data

Frontmatter
A Correlated Random Effects Model for Longitudinal Data with Non-ignorable Drop-Out: An Application to University Student Performance

Empirical study of university student performance is often complicated by missing data, due to students dropping out of the university. If drop-out is non-ignorable, i.e. it depends on either unobserved values or an underlying response process, it may be a pervasive problem. In this paper, we tackle the relation between the primary response (student performance) and the missing data mechanism (drop-out) with a suitable random effects model, jointly modeling the two processes. We then use data from the individual records of the Faculty of Statistics at Sapienza University of Rome to perform the empirical analysis.

Filippo Belloc, Antonello Maruotti, Lea Petrella
Risk Analysis Approaches to Rank Outliers in Trade Data

The paper discusses ranking methods for outliers in trade data, based on statistical information, with the objective of prioritizing anti-fraud investigation activities. It presents a ranking method based on a risk analysis framework and discusses a comprehensive trade fraud indicator that aggregates a number of individual numerical criteria.

Vytis Kopustinskas, Spyros Arsenis
Problems and Challenges in the Analysis of Complex Data: Static and Dynamic Approaches

This paper summarizes results in the use of the Forward Search in the analysis of corrupted datasets, and those with mixtures of populations. We discuss new challenges that arise in the analysis of large, complex datasets. Methods developed for regression and clustering are described.

Marco Riani, Anthony Atkinson, Andrea Cerioli
Ensemble Support Vector Regression: A New Non-parametric Approach for Multiple Imputation

The complex case in which several variables contain missing values needs to be handled by means of an iterative procedure. The imputation methods most commonly employed, however, rely on parametric assumptions. In this paper we propose a new non-parametric method for multiple imputation based on Ensemble Support Vector Regression. The procedure works under quite general assumptions and has been tested with different simulation schemes. We show that the results obtained in this way are better than those obtained with other methods usually employed to complete a data set.

Daria Scacciatelli
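The iterative scheme underlying this kind of imputation can be sketched as follows: fit a regression on the currently completed data, re-impute the missing values from the fit, and repeat until the imputations stabilise. To keep the example self-contained, a plain least-squares line stands in for the Ensemble Support Vector Regression base learner; the data and the linear stand-in are illustrative assumptions, not the author's method.

```python
def fit_line(xs, ys):
    # Ordinary least squares for y = a + b*x (closed form).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def impute(x, y, sweeps=5):
    """Iteratively refit on observed values plus current imputations, then re-impute."""
    miss = [i for i, v in enumerate(y) if v is None]
    obs = [i for i in range(len(y)) if y[i] is not None]
    # Start the missing entries from the observed mean.
    y = [y[i] if y[i] is not None else sum(y[j] for j in obs) / len(obs)
         for i in range(len(y))]
    for _ in range(sweeps):
        a, b = fit_line(x, y)
        for i in miss:
            y[i] = a + b * x[i]
    return y

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, None, 8.0]
print(impute(x, y))  # the missing value converges towards 6, consistent with y = 2x
```

Each sweep moves the imputation closer to the value implied by the fitted relation; with a flexible non-parametric learner in place of `fit_line`, the same loop becomes the chained scheme the abstract describes.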

Time Series Analysis

Frontmatter
On the Use of PLS Regression for Forecasting Large Sets of Cointegrated Time Series

This paper proposes a methodology for forecasting cointegrated time series using many predictors. In particular, we show that Partial Least Squares can be used to estimate single-equation models that take into account possible long-run relations among the predicted variable and the predictors. Based on Helland (Scand. J. Stat. 17:97–114, 1990) and Helland and Almøy (J. Am. Stat. Assoc. 89:583–591, 1994), we discuss the conditions under which Partial Least Squares regression provides a consistent estimate of the conditional expected value of the predicted variable. Finally, we apply the proposed methodology to a well-known dataset of US macroeconomic time series (Stock and Watson, J. Am. Stat. Assoc. 97:1167–1179, 2002). The empirical findings suggest that the new method improves over existing approaches to data-rich forecasting, particularly as the forecasting horizon becomes larger.

Gianluca Cubadda, Barbara Guardabascio
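A one-component PLS1 fit can be sketched in a few lines: the weight vector is proportional to X'y (the direction most covarying with the response), scores are projections onto that direction, and an inner regression maps scores to the response. This is the classical NIPALS construction, not the chapter's full forecasting methodology, and the toy data are invented for illustration.

```python
import math

def pls1_one_component(X, y):
    """Fit a one-component PLS1 regression (single NIPALS step) on centred data."""
    n, p = len(X), len(X[0])
    xbar = [sum(row[j] for row in X) / n for j in range(p)]
    ybar = sum(y) / n
    Xc = [[row[j] - xbar[j] for j in range(p)] for row in X]
    yc = [v - ybar for v in y]
    # Weight vector w proportional to X'y: the direction most covarying with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    tt = sum(v * v for v in t)
    q = sum(yc[i] * t[i] for i in range(n)) / tt  # inner regression coefficient

    def predict(x):
        t_new = sum((x[j] - xbar[j]) * w[j] for j in range(p))
        return ybar + q * t_new
    return predict

X = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]  # two perfectly collinear predictors
y = [2.0, 4.0, 6.0]
predict = pls1_one_component(X, y)
print(predict([4.0, 4.0]))  # recovers the linear trend: 8.0 (up to float error)
```

The collinear toy data make the point that motivates PLS in data-rich settings: ordinary least squares would fail here (singular X'X), while the single PLS component captures the shared direction and predicts exactly.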
Large-Scale Portfolio Optimisation with Heuristics

Heuristic optimisation techniques make it possible to optimise financial portfolios with respect to different objective functions and constraints, essentially without any restrictions on their functional form. Still, these methods are not widely applied in practice. One reason for this slow acceptance is that heuristics do not provide the “optimal” solution, but only a stochastic approximation of the optimum. For a given problem, the quality of this approximation depends on the chosen method, but also on the amount of computational resources spent (e.g., the number of iterations): more iterations lead (on average) to a better solution. In this paper, we investigate this convergence behaviour for three different heuristics: Differential Evolution, Particle Swarm Optimisation, and Threshold Accepting. Particular emphasis is put on the dependence of the solutions’ quality on the problem size; thus we test these heuristics in large-scale settings with hundreds or thousands of assets, and thousands of scenarios.

Manfred Gilli, Enrico Schumann
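Of the three heuristics compared in this chapter, Threshold Accepting is the simplest to sketch: a candidate move is accepted whenever it worsens the objective by less than a threshold, and the threshold is lowered towards zero over the run. The toy problem below, minimum-variance weights for two uncorrelated assets, is an illustrative assumption and not one of the chapter's test cases.

```python
import random

def variance(w, cov):
    """Portfolio variance w' * cov * w."""
    n = len(w)
    return sum(w[i] * w[j] * cov[i][j] for i in range(n) for j in range(n))

def threshold_accepting(cov, steps=2000, seed=0):
    """Minimise portfolio variance over long-only weights summing to one."""
    rng = random.Random(seed)
    n = len(cov)
    w = [1.0 / n] * n
    best, best_f = w[:], variance(w, cov)
    thresholds = [0.01, 0.001, 0.0001, 0.0]  # decreasing acceptance thresholds
    for tau in thresholds:
        for _ in range(steps):
            i, j = rng.sample(range(n), 2)
            step = rng.uniform(0, min(0.1, w[i]))  # shift mass i -> j, stay long-only
            cand = w[:]
            cand[i] -= step
            cand[j] += step
            if variance(cand, cov) - variance(w, cov) < tau:
                w = cand
                if variance(w, cov) < best_f:
                    best, best_f = w[:], variance(w, cov)
    return best, best_f

cov = [[0.04, 0.00], [0.00, 0.01]]  # two uncorrelated assets
w, f = threshold_accepting(cov)
print([round(v, 2) for v in w], round(f, 4))
```

For this two-asset case the analytic optimum is w = (0.2, 0.8) with variance 0.008, so the gap between the returned `f` and 0.008 directly measures the stochastic approximation error the abstract discusses.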
Detecting Short-Term Cycles in Complex Time Series Databases

Time series characterize a large part of the data stored in financial, medical and scientific databases. The automatic statistical modelling of such data may be a very hard problem when the time series show “complex” features, such as nonlinearity, local nonstationarity, high frequency, long memory and periodic components. In this context, the aim of this paper is to analyze the problem of automatically detecting the different periodic components in the data, with particular attention to the short-term components (weekly, daily and intra-daily cycles). We focus on the analysis of real time series from a large database provided by an Italian electric company. This database shows complex features, both for its high dimension and for the structure of the underlying process. A classification procedure we proposed recently, based on a spectral analysis of the time series, is applied to the data. Here we perform a sensitivity analysis for the main tuning parameters of the procedure. A method for the selection of the optimal partition is then proposed.

F. Giordano, M. L. Parrella, M. Restaino
Assessing the Beneficial Effects of Economic Growth: The Harmonic Growth Index

In this paper we introduce the multidimensional notion of harmonic growth as a situation of diffused well-being associated with an increase in per capita GDP. We say that a country experienced harmonic growth if, during the observed period, all the key indicators, proxies of the endogenous and exogenous forces driving population well-being, show a significantly common pattern with the income dynamics. The notion is operationalized via an index of time series harmony that follows the functional data analysis approach. This Harmonic Growth Index (HGI) is based on comparisons between the coefficients from cubic B-spline interpolation. Such indices are then synthesized in order to provide the global degree of harmony in growth within a country. With an accurate selection of the key indicators, the index can also be used to rank countries, thus offering useful complementary information to the Human Development Index from the UNDP. An exemplification is given for the Indian economy.

Daria Mendola, Raffaele Scuderi
Time Series Convergence within I(2) Models: the Case of Weekly Long Term Bond Yields in the Four Largest Euro Area Countries

The purpose of the paper is to suggest a modelling strategy for studying the process of pairwise convergence within time series analysis. Starting from the works of Bernard (1992) and Bernard and Durlauf (1995), we specify an I(1) cointegrated model characterized by broken linear trends, and we identify the driving force leading to convergence as a common stochastic trend, but the results are unsatisfactory. We then address the same question of time series convergence within I(2) cointegration analysis, allowing for broken linear trends and an I(2) common stochastic trend as the driving force. The results obtained with this second specification are encouraging and satisfactory. The suggested modelling strategy is applied to the convergence of long-term bond markets in the Economic and Monetary Union (EMU), observed during the second stage, i.e. the period from 1993 to the end of 1998, before the introduction of the euro. During the third stage, which started in 1999 and is still ongoing, the markets show a tendency to move together and to behave similarly.

Giuliana Passamani

Environmental Statistics

Frontmatter
Anthropogenic CO2 Emissions and Global Warming: Evidence from Granger Causality Analysis

This note reports an updated analysis of global climate change and its relationship with carbon dioxide (CO2) emissions: advanced methods rooted in econometrics are applied to bivariate climatic time series. We found strong evidence for the absence of Granger causality from CO2 emissions to global surface temperature: our findings point out that the hypothesis of anthropogenically induced climate change still needs conclusive confirmation using the most appropriate methods for data analysis.

Massimo Bilancia, Domenico Vitale
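A pairwise Granger-causality check of the kind referred to above can be illustrated with a lag-1 F test: compare the residual sum of squares of an autoregression of y with and without the lagged x. The simulated series below are invented for illustration; the note's actual data and methods are more elaborate.

```python
import random

def ols_rss(X, y):
    """Residual sum of squares from OLS via normal equations (Gaussian elimination)."""
    p = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    for c in range(p):  # forward elimination with partial pivoting
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        b[c], b[piv] = b[piv], b[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for k in range(c, p):
                A[r][k] -= f * A[c][k]
            b[r] -= f * b[c]
    beta = [0.0] * p
    for c in reversed(range(p)):
        beta[c] = (b[c] - sum(A[c][k] * beta[k] for k in range(c + 1, p))) / A[c][c]
    return sum((yi - sum(be * xi for be, xi in zip(beta, r))) ** 2
               for r, yi in zip(X, y))

def granger_f(x, y, lag=1):
    """F statistic for 'lagged x improves the AR(lag) forecast of y'."""
    rows_r, rows_u, target = [], [], []
    for t in range(lag, len(y)):
        ylags = [y[t - k] for k in range(1, lag + 1)]
        xlags = [x[t - k] for k in range(1, lag + 1)]
        rows_r.append([1.0] + ylags)          # restricted: own lags only
        rows_u.append([1.0] + ylags + xlags)  # unrestricted: add lagged x
        target.append(y[t])
    rss_r, rss_u = ols_rss(rows_r, target), ols_rss(rows_u, target)
    n, k = len(target), len(rows_u[0])
    return ((rss_r - rss_u) / lag) / (rss_u / (n - k))

rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [0.0]
for t in range(1, 200):
    y.append(0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.gauss(0, 1))
print(round(granger_f(x, y), 1))  # large F: lagged x clearly improves the forecast of y
```

Here y is built to depend on lagged x, so the F statistic is far above any conventional critical value; "absence of Granger causality", as concluded in the note, corresponds to an F statistic near one.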
Temporal and Spatial Statistical Methods to Remove External Effects on Groundwater Levels

This paper illustrates a project on monitoring groundwater levels developed jointly with officers from Regione Piemonte. Groundwater levels are strongly affected by external predictors, such as rainfall, neighboring waterways and local irrigation ditches. We discuss a kriging and transfer-function approach, applied to monthly and daily series of piezometric levels, to model these neighboring effects. The aims of the study are to reconstruct a virgin groundwater level, as an indicator of the state of health of the groundwater itself, and to provide important regulatory tools to the local government.

Daniele Imparato, Andrea Carena, Mauro Gasparini
Reduced Rank Covariances for the Analysis of Environmental Data

In this work we propose a Monte Carlo estimator for non-stationary covariances of large incomplete lattice or irregularly distributed data. In particular, we propose a method called “reduced rank covariance” (RRC), based on the multiresolution approach, for reducing the dimensionality of spatial covariances. The basic idea is to estimate the covariance on a lower-resolution grid, starting from a stationary model (such as the Matérn covariance), and to use the multiresolution property of wavelet bases for evaluating the covariance on the full grid. Since the method doesn’t need to compute the wavelet coefficients, it is very fast at estimating covariances in large data sets. The spatial forecasting performance of the method is described through a simulation study. Finally, the method is applied to two environmental data sets: the aerosol optical thickness (AOT) satellite data observed in Northern Italy and the ozone concentrations in the eastern United States.

Orietta Nicolis, Doug Nychka
Radon Level in Dwellings and Uranium Content in Soil in the Abruzzo Region: A Preliminary Investigation by Geographically Weighted Regression

Radon is a noble gas produced by the natural decay of uranium. It can migrate from the underlying soil into buildings, where very high concentrations can sometimes be found, particularly in basements or at ground floor, and it contributes up to about 50% of the ionizing radiation dose received by the population, constituting a real health hazard. In this study, we use the geographically weighted regression (GWR) technique to detect spatial non-stationarity of the relationship between indoor radon concentration and the radioactivity content of soil in the province of L’Aquila, in the Abruzzo region (Central Italy). Radon measurements have been taken in a sample of 481 dwellings. Local estimates are obtained and discussed. The significance of the spatial variability in the local parameter estimates is examined by performing a Monte Carlo test.

Eugenia Nissi, Annalina Sarra, Sergio Palermi
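The core of GWR is an ordinary regression refitted at each location with kernel weights that decay with distance, so the coefficients are allowed to vary over space. A minimal sketch with a Gaussian kernel and a single covariate follows; all coordinates and values are invented for illustration, not taken from the radon study.

```python
import math

def gwr_coef(points, y, x, loc, bandwidth):
    """Local intercept and slope at loc via Gaussian-kernel weighted least squares."""
    w = [math.exp(-((px - loc[0]) ** 2 + (py - loc[1]) ** 2) / (2 * bandwidth ** 2))
         for px, py in points]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    b = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y)) / \
        sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    return my - b * mx, b

# Two spatial clusters with different local slopes (2 in the west, 5 in the east).
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
xv = [1, 2, 3, 4, 1, 2, 3, 4]
yv = [2, 4, 6, 8, 5, 10, 15, 20]
print(gwr_coef(pts, yv, xv, (0.5, 0.5), 1.0)[1])    # local slope near the west cluster: ~2
print(gwr_coef(pts, yv, xv, (10.5, 10.5), 1.0)[1])  # local slope near the east cluster: ~5
```

Refitting at a grid of locations and mapping the slopes is what reveals spatial non-stationarity; the Monte Carlo test mentioned in the abstract then asks whether the observed variation in such local slopes exceeds what random relabelling would produce.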

Probability and Density Estimation

Frontmatter
Applications of Large Deviations to Hidden Markov Chains Estimation

Consider a hidden Markov model where observations are generated by an underlying Markov chain plus a perturbation; the perturbation and the Markov process may be dependent on each other. We apply large deviations results to obtain an approximate confidence interval for the stationary distribution of the underlying Markov chain.

Fabiola M. Greco
Multivariate Tail Dependence Coefficients for Archimedean Copulae

We analyze the multivariate upper and lower tail dependence coefficients, obtained by extending the existing definitions for the bivariate case. We provide their expressions for a popular class of copula functions, the Archimedean one. Finally, we apply the formulae to some well-known copula functions used in many financial analyses.

Giovanni De Luca, Giorgia Rivieccio
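For the two best-known Archimedean families the bivariate tail dependence coefficients have simple closed forms, lambda_L = 2^(-1/theta) for Clayton and lambda_U = 2 - 2^(1/theta) for Gumbel; the multivariate coefficients studied in the paper generalize these. The sketch below also checks Clayton numerically against the defining limit lambda_L = lim_{u -> 0} C(u, u)/u.

```python
def clayton_lower_tail(theta):
    """Lower tail dependence of the bivariate Clayton copula (theta > 0); upper is 0."""
    return 2 ** (-1.0 / theta)

def gumbel_upper_tail(theta):
    """Upper tail dependence of the bivariate Gumbel copula (theta >= 1); lower is 0."""
    return 2 - 2 ** (1.0 / theta)

def clayton_copula(u, v, theta):
    """Bivariate Clayton copula C(u, v) = (u^-theta + v^-theta - 1)^(-1/theta)."""
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

theta = 2.0
u = 1e-6
print(clayton_lower_tail(theta))        # 2**(-1/2) ~ 0.7071
print(clayton_copula(u, u, theta) / u)  # C(u, u)/u at small u agrees with the closed form
```

The contrast between the two families is the practical point for financial analyses: Clayton concentrates dependence in joint losses (lower tail), Gumbel in joint gains (upper tail).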
A Note on Density Estimation for Circular Data

We discuss kernel density estimation for data lying on the d-dimensional torus (d ≥ 1). We consider a specific class of product kernels, and formulate exact and asymptotic L2 properties for the estimators equipped with these kernels. We also obtain the optimal smoothing for the case when the kernel is defined by the product of von Mises densities. A brief simulation study illustrates the main findings.

Marco Di Marzio, Agnese Panzera, Charles C. Taylor
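For d = 1 the estimator with a von Mises kernel has an explicit normalizing constant, 2*pi*I0(kappa). The sketch below (data points invented for illustration) implements it with a series expansion of I0 and checks numerically that the estimate integrates to one over the circle.

```python
import math

def bessel_i0(kappa, terms=30):
    """Modified Bessel function I0 via its power series sum_m (kappa/2)^(2m) / (m!)^2."""
    total, term = 1.0, 1.0
    for m in range(1, terms):
        term *= (kappa / 2.0) ** 2 / m ** 2
        total += term
    return total

def vm_kde(x, data, kappa):
    """Kernel density estimate on the circle with von Mises kernels of concentration kappa."""
    c = 2.0 * math.pi * bessel_i0(kappa)
    return sum(math.exp(kappa * math.cos(x - xi)) for xi in data) / (len(data) * c)

data = [0.1, 0.3, 6.2, 3.1]  # angles in radians
# The estimate should integrate to one over [0, 2*pi):
grid = [2 * math.pi * k / 1000 for k in range(1000)]
mass = sum(vm_kde(t, data, kappa=5.0) for t in grid) * (2 * math.pi / 1000)
print(round(mass, 4))  # ~ 1.0
```

The concentration parameter kappa plays the role of an inverse bandwidth, and the optimal choice of kappa is exactly the smoothing problem the abstract refers to.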
Markov Bases for Sudoku Grids

In this paper we show how to describe sudoku games in the language of the design of experiments, and how to translate sudoku grids into contingency tables. We then apply some techniques from Algebraic Statistics to describe the structure of sudoku grids, at least for 4×4 grids. We also show that this approach has interesting applications to both complete and partially filled grids.

Roberto Fontana, Fabio Rapallo, Maria Piera Rogantin

Application in Economics

Frontmatter
Estimating the Probability of Moonlighting in Italian Building Industry

It is well known that the black-market economy, and especially undeclared work, undermines the financing of national social security programs and hinders efforts to boost economic growth. This paper sheds light on the phenomenon by using statistical models to detect which companies are more likely to hire off-the-books workers. We used databases from different administrative sources and linked them together in order to build an informative system able to capture all aspects of firm activity. We then used both parametric and non-parametric models to estimate the probability that a firm uses moonlighters. We chose to study the building industry both because of its importance in the economy of a country and because moonlighting is a widespread problem in that sector.

Maria Felice Arezzo, Giorgio Alleva
Use of Interactive Plots and Tables for Robust Analysis of International Trade Data

This contribution concerns the analysis of international trade data through a robust approach for the identification of outliers and regression mixtures called the Forward Search. The focus is on interactive tools that we have developed to dynamically connect the information coming from different robust plots and from the trade flows in the input datasets. The work originated from the need to provide the statistician with new robust exploratory data analysis tools and the end-user with an instrument that simplifies the production and interpretation of results. We argue that with the proposed interactive graphical tools the end-user can effectively combine subject-matter knowledge with the information provided by the statistical method and draw conclusions of relevant operational value.

Domenico Perrotta, Francesca Torti
Generational Determinants on the Employment Choice in Italy

The aim of the paper is to explore some crucial factors playing a significant role in employment decision-making in Italy. In particular, we investigate the influence of family background on the choice to be self-employed rather than salaried; to this end, a series of regression models for categorical data is tested both on the sampled workers taken as a whole and separately by gender. In this light, to test whether the employment choice is context-dependent, environmental attributes are also modeled. In addition to a diversity of determinants, our results shed light on some differences between first- and second-generation self-employed workers.

Claudio Quintano, Rosalia Castellano, Gennaro Punzo
Route-Based Performance Evaluation Using Data Envelopment Analysis Combined with Principal Component Analysis

Frontier analysis methods, such as Data Envelopment Analysis (DEA), investigate the technical efficiency of productive systems that employ input factors to deliver outcomes. In the economic literature one can find extreme opinions about the role of input/output systems in assessing performance. For instance, it has been argued that if all inputs and outputs are included in assessing the efficiency of the units under analysis, then they will all be fully efficient. Discrimination can therefore be increased by being parsimonious in the number of factors. To deal with this drawback, we suggest employing Principal Component Analysis (PCA) to aggregate input and output data. In this context, the aim of the present paper is to evaluate the performance of an Italian airline for 2004 by applying a model based upon PCA and DEA techniques.

Agnese Rapposelli

WEB and Text Mining

Frontmatter
Web Surveys: Methodological Problems and Research Perspectives

This paper presents a framework of current problems and related literature on Internet/web surveys, and proposes a new classification of research topics on challenging issues, taking into account the role that these surveys play in different research contexts (official statistics, academic research, market research). In addition, critical research areas, open questions and trends are identified. Furthermore, a specific section is devoted to bias estimation, which is a critical point in this type of survey; in particular, original bias and variance definitions are proposed.

Silvia Biffignandi, Jelke Bethlehem
Semantic Based DCM Models for Text Classification

This contribution deals with the problem of document classification. The proposed approach is probabilistic and is based on a mixture of a Dirichlet and a Multinomial distribution. Our aim is to build a classifier able not only to take into account word frequencies, but also the latent topics contained within the available corpora. This new model, called sbDCM, allows us to directly specify the number of topics (known or unknown) that compose a document, without losing the “burstiness” phenomenon or classification performance. The distribution is implemented and tested in two different settings: in the first, the number of latent topics is defined by experts in advance; in the second, that number is unknown.

Paola Cerchiello
Probabilistic Relational Models for Operational Risk: A New Application Area and an Implementation Using Domain Ontologies

The application of probabilistic relational models (PRM) to the statistical analysis of operational risk is presented. We explain the basic components of PRM, domain theories and dependency models. We discuss two real application scenarios from the IT services domain. Finally, we provide details on an implementation of the PRM approach using semantic web technologies.

Marcus Spies

Advances on Surveys

Frontmatter
Efficient Statistical Sample Designs in a GIS for Monitoring the Landscape Changes

The process of land planning, aimed at synthesizing development goals with appropriate policies for the preservation and management of territorial resources, requires a detailed analysis of the territory, carried out using data stored in Geographic Information Systems (GISs). A detailed analysis of changes in the landscape is time consuming; it can thus be carried out only on a sample of the whole territory, and an efficient procedure is needed for selecting a sample of area units. In this paper we apply two recently proposed sample selection procedures to a study area, comparing them in terms of efficiency as well as operational advantages, in order to set up a methodology enabling an efficient estimate of the change in the main landscape features over wide areas.

Elisabetta Carfagna, Patrizia Tassinari, Maroussa Zagoraiou, Stefano Benni, Daniele Torreggiani
Studying Foreigners’ Migration Flows Through a Network Analysis Approach

The aim of the paper is to highlight the advantages of measuring and representing migration flows through the network analysis approach. The data we use, in two different kinds of analysis, concern the changes of residence of the foreign population; they are individual data collected by Istat through the Municipal Registers. In the first step, we consider inflows from abroad (average 2005–2006). The countries of origin are identified as “sending nodes” and the local labour market areas as “receiving nodes”. In the second step, we examine the internal flows of immigrants between local labour market areas. The analysis is focused on specific citizenships.

Cinzia Conti, Domenico Gabrielli, Antonella Guarneri, Enrico Tucci
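The sending-node/receiving-node structure described above can be sketched as a weighted directed graph, for instance with `networkx` (the origins, areas and counts below are hypothetical, not the Istat register data):

```python
import networkx as nx

# Hypothetical origin-destination flows: (country of origin,
# local labour market area, number of residence changes)
flows = [
    ("Romania", "Rome",  120),
    ("Romania", "Milan",  80),
    ("Albania", "Rome",   60),
    ("Morocco", "Turin",  45),
]

G = nx.DiGraph()
for origin, area, n in flows:
    G.add_edge(origin, area, weight=n)

# Weighted in-degree of a receiving node = total inflow to that area
inflow_rome = G.in_degree("Rome", weight="weight")
```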
Estimation of Income Quantiles at the Small Area Level in Tuscany

The data available to measure poverty and living conditions in Italy come mainly from sample surveys, such as the Survey on Income and Living Conditions (EU-SILC). However, these data can be used to produce accurate estimates only at the national or regional level. To obtain estimates for smaller, unplanned domains, small area methodologies can be used. The aim of this paper is to provide a general framework in which the joint use of large sources of data, namely the EU-SILC and the Population Census data, can yield estimates of poverty and living conditions for Italian provinces and municipalities, such as the Head Count Ratio and the quantiles of the household equivalised income.

Caterina Giusti, Stefano Marchetti, Monica Pratesi
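As a sketch of the target quantities mentioned above, quantiles of an equivalised-income distribution and the Head Count Ratio can be computed from weighted survey data (the incomes and weights below are invented for illustration):

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """q-th quantile of a weighted sample (e.g. survey design weights)."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cdf = np.cumsum(w) / np.sum(w)
    return float(np.interp(q, cdf, v))

# Hypothetical household equivalised incomes with survey weights
income = np.array([6000.0, 12000.0, 15000.0, 21000.0, 30000.0])
weight = np.array([2.0, 1.0, 1.5, 1.0, 0.5])

median = weighted_quantile(income, weight, 0.5)
poverty_line = 0.6 * median   # 60% of the median, a common EU-SILC convention
# Head Count Ratio: weighted share of households below the poverty line
hcr = weight[income < poverty_line].sum() / weight.sum()
```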
The Effects of Socioeconomic Background and Test-taking Motivation on Italian Students’ Achievement

The aim of this work is to analyze the educational outcomes of Italian students and to explain the differences across Italian macro-regions. In addition to the “classic” determinants of student achievement (e.g. family socioeconomic background), we investigate the extent to which test-taking motivation may influence assessment-test results and partially explain the Italian territorial disparities. A two-stage approach is therefore adopted. First, data envelopment analysis (DEA) is applied to obtain a synthetic measure of test-taking motivation. Second, a multilevel regression model is employed to investigate the effect of this measure of test-taking motivation on student performance after controlling for school and student factors.

Claudio Quintano, Rosalia Castellano, Sergio Longobardi
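In the simplest case of one input and one output under constant returns to scale, a DEA efficiency score reduces to normalising each unit's output/input ratio by the best observed ratio. The sketch below uses invented indicator values; the general multi-input, multi-output case solves a linear program per unit:

```python
import numpy as np

# Hypothetical single-input, single-output test-taking indicators:
# input = time available, output = items seriously attempted
inputs  = np.array([10.0, 8.0, 12.0, 9.0])
outputs = np.array([ 7.0, 8.0,  6.0, 9.0])

ratio = outputs / inputs
efficiency = ratio / ratio.max()   # equals 1.0 for units on the frontier
```

The resulting scores would then enter the second stage as a student-level covariate in a multilevel (students-within-schools) regression.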

Multivariate Analysis

Frontmatter
Firm Size Dynamics in an Industrial District: The Mover-Stayer Model in Action

In the last decade, the District of Prato (an important industrial area near Florence, Italy) suffered a deep shrinkage of exports and value added in the textile industry, the core of its economy. In this paper we investigate whether the crisis entailed a downsizing of firms (measured as number of employees) in the same industry and, possibly, in the overall economy of the District. For this purpose we use the Mover-Stayer model, with data from two ASIA-ISTAT panels. The main results of the analysis are that: (1) the textile industry is affected by a relevant downsizing of firm size; (2) this process takes place with a slightly changed level of concentration; (3) these changes do not seem to spread to the overall economy.

F. Cipollini, C. Ferretti, P. Ganugi
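The core of the Mover-Stayer model is a two-component mixture: a share s of units never leaves its state ("stayers"), while the rest ("movers") evolves as a first-order Markov chain. With hypothetical values of s and of the mover transition matrix M over two firm-size classes, the observed t-step transition matrix is:

```python
import numpy as np

s = 0.4                                 # hypothetical share of stayers
M = np.array([[0.7, 0.3],               # hypothetical mover transitions
              [0.2, 0.8]])              # between two firm-size classes

def aggregate_transition(t):
    """Observed t-step transition matrix implied by the mixture."""
    return s * np.eye(2) + (1 - s) * np.linalg.matrix_power(M, t)

P1 = aggregate_transition(1)
P2 = aggregate_transition(2)
```

Relative to a pure Markov chain, the stayer component inflates diagonal persistence, which is what the model exploits to separate structural immobility from ordinary transition dynamics.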
Multiple Correspondence Analysis for the Quantification and Visualization of Large Categorical Data Sets

The applicability of a dimension-reduction technique to very large categorical data sets or to categorical data streams is limited by the required singular value decomposition (SVD) of properly transformed data. Applying SVD to large, high-dimensional data is infeasible because of the very long computation time and because it requires all the data to be stored in memory, so data streams cannot be analysed. The aim of the present paper is to integrate an incremental SVD procedure into a multiple correspondence analysis (MCA)-like procedure, in order to obtain a dimensionality-reduction technique feasible for very large categorical data or even categorical data streams.

Alfonso Iodice D’Enza, Michael Greenacre
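The row-update step of a generic incremental SVD can be sketched as follows: if X has truncated SVD U S Vᵀ and a new block of rows B arrives, the updated factors come from the SVD of a small core matrix. This is a standard construction under assumed block arrival, not the specific procedure integrated in the chapter:

```python
import numpy as np

def append_rows_svd(U, S, Vt, B, rank):
    """Update a (truncated) SVD when a new block of rows B arrives.
    [X; B] = blockdiag(U, I) @ [diag(S) @ Vt; B], so only the small
    core matrix K needs a fresh SVD."""
    K = np.vstack([np.diag(S) @ Vt, B])
    Uk, Sk, Vtk = np.linalg.svd(K, full_matrices=False)
    Uext = np.block([
        [U, np.zeros((U.shape[0], B.shape[0]))],
        [np.zeros((B.shape[0], U.shape[1])), np.eye(B.shape[0])],
    ])
    return (Uext @ Uk)[:, :rank], Sk[:rank], Vtk[:rank]

# Check on a small matrix processed in two blocks (full rank retained)
X = np.random.default_rng(0).normal(size=(6, 3))
U, S, Vt = np.linalg.svd(X[:4], full_matrices=False)
U2, S2, Vt2 = append_rows_svd(U, S, Vt, X[4:], rank=3)
```

Truncating to a fixed rank after each update keeps memory constant, which is what makes an MCA-like procedure applicable to data streams.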
Multivariate Ranks-Based Concordance Indexes

The theoretical contributions to a “good” taxation have focused on the relations between efficiency and vertical equity, without considering the notion of “horizontal equity”: only recently have measures connected to the equity (inequity) of a taxation been introduced in the literature. There, the taxation problem is limited to the study of two quantitative characters; however, the concordance problem can be extended to a more general context, as we present in the following sections. In particular, the aim of this contribution is to define concordance indexes, as dependence measures, in a multivariate context. To this end a k-variate (k > 2) concordance index is provided, resorting to statistical tools such as a ranks-based approach and the multiple linear regression function. All the theoretical topics involved are illustrated through a practical example.

Emanuela Raffinetti, Paolo Giudici
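The chapter builds its k-variate index from ranks and a multiple linear regression; as a classical point of comparison, a rank-based concordance measure for k > 2 variables is Kendall's coefficient of concordance W (the data below are invented, and ties are ignored for simplicity):

```python
import numpy as np

def kendalls_w(X):
    """Kendall's W for the k columns of an n x k matrix (no ties assumed).
    W = 1 for perfectly concordant rankings, W = 0 for no agreement."""
    n, k = X.shape
    R = np.argsort(np.argsort(X, axis=0), axis=0) + 1   # column-wise ranks
    Ri = R.sum(axis=1)                                  # total rank per unit
    S = np.sum((Ri - Ri.mean()) ** 2)
    return 12.0 * S / (k ** 2 * (n ** 3 - n))

# Three perfectly concordant variables over four units
X = np.array([[1, 10, 100],
              [2, 20, 200],
              [3, 30, 300],
              [4, 40, 400]])
w = kendalls_w(X)   # 1.0
```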
Methods for Reconciling the Micro and the Macro in Family Demography Research: A Systematisation

In the second half of the twentieth century, the scientific study of population changed its paradigm from the macro to the micro, so that attention focused mainly on individuals as the agents of demographic action. However, for accurate handling of all the complexities of human behaviours, the interactions between individuals and the context they belong to cannot be ignored. Therefore, in order to explain (or, at least, to understand) contemporary fertility and family dynamics, the gap between the micro and the macro should be bridged. In this contribution, we highlight two possible directions for bridging the gap: (1) integrating life-course analyses with the study of contextual characteristics, which is made possible by the emergence of the theory and tools of multi-level modelling; and (2) bringing the micro-level findings back to macro outcomes via meta-analytic techniques and agent-based computational models.

Anna Matysiak, Daniele Vignoli
Metadata
Title
Advanced Statistical Methods for the Analysis of Large Data-Sets
Editors
Agostino Di Ciaccio
Mauro Coli
Jose Miguel Angulo Ibanez
Copyright Year
2012
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-21037-2
Print ISBN
978-3-642-21036-5
DOI
https://doi.org/10.1007/978-3-642-21037-2
