## About this book

This volume collects the extended versions of papers presented at the SIS Conference “Statistics and Data Science: new challenges, new generations”, held in Florence, Italy on June 28-30, 2017. Highlighting the central role of statistics and data analysis methods in the era of Data Science, the contributions offer an essential overview of the latest developments in various areas of statistics research. The 35 contributions have been divided into six parts, each of which focuses on a core area contributing to “Data Science”. The book covers topics including strong statistical methodologies, Bayesian approaches, applications in population and social studies, studies in economics and finance, techniques of sample design and mathematical statistics. Though the book is mainly intended for researchers interested in the latest frontiers of Statistics and Data Analysis, it also offers valuable supplementary material for students of the disciplines dealt with here. Lastly, it will help Statisticians and Data Scientists recognize their counterparts’ fundamental role.

## Table of contents

### Monitoring the Spatial Correlation Among Functional Data Streams Through Moran’s Index

This paper focuses on measuring the spatial correlation among functional data streams recorded by sensor networks. In many real-world applications, spatially located sensors are used to take repeated measurements of some variable at very high frequency. Due to the spatial correlation, sensed data are more likely to be similar when measured at nearby locations than in distant places. In order to monitor such correlation over time and to deal with huge amounts of data, we propose a strategy based on computing the well-known Moran’s index and Geary’s index on summaries of the data.

A. Balzanella, E. Romano, Rosanna Verde, F. Fortuna, F. Maturo, S. A. Gattone, T. Di Battista
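As a rough illustration of the statistic named in the abstract above, the following sketch computes the classical Moran’s index, $$I = \frac{n}{S_0}\,\frac{\sum_{i,j} w_{ij}(x_i-\bar{x})(x_j-\bar{x})}{\sum_i (x_i-\bar{x})^2}$$ with $$S_0=\sum_{i,j} w_{ij}$$, on one scalar summary per sensor. The function name `morans_i`, the binary neighbour weights and the toy values are assumptions made for illustration; the abstract does not say which summaries or weights the authors use.

```python
import numpy as np

def morans_i(values, weights):
    """Classical Moran's I for one scalar value per location.

    values  : length-n array (here: a hypothetical per-sensor summary)
    weights : n x n spatial weight matrix with zero diagonal
    """
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    z = x - x.mean()                 # deviations from the mean
    s0 = w.sum()                     # sum of all spatial weights
    return (len(x) / s0) * (z @ w @ z) / (z @ z)

# Toy example (assumed data): four sensors on a line, adjacent sensors are neighbours.
w = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
smooth = np.array([1.0, 2.0, 3.0, 4.0])         # neighbours are similar
alternating = np.array([1.0, -1.0, 1.0, -1.0])  # neighbours are dissimilar

print(morans_i(smooth, w))       # positive spatial correlation
print(morans_i(alternating, w))  # negative spatial correlation
```

In a streaming setting one would presumably recompute the index on each new window of summaries and track it over time, but that monitoring loop is not shown here.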

### User Profile Construction Method for Personalized Access to Data Sources Using Multivariate Conjoint Analysis and Collaborating Filtering

Current information systems provide access to multiple, distributed, autonomous and potentially redundant data sources. Their users may not know which sources they are querying, nor their description and content. Consequently, their queries no longer reflect a need that must be satisfied but an intention that must be refined. The purpose of personalization is to facilitate the expression of users’ needs. It allows them to obtain relevant information by making the most of their preferences, which are grouped in their respective profiles. In this work, we present a collaborative filtering method based on a Multivariate Conjoint Analysis approach to obtain these profiles. The proposed strategy provides a representation of the users and of the items, according to their characteristics, on factorial planes, whereas the collaborative approach predicts the missing preferences.

Oumayma Banouar, Said Raghay

### Clustering Communities Using Interval K-Means

In large networks there is a specific need to consider particular patterns related to structured groups of nodes, which can also be defined as communities. In this work we propose an approach to cluster the different communities using interval data. This approach is relevant in the context of the analysis of large networks and, in particular, for discovering the different functionalities of the communities inside a network. The approach is illustrated on different examples of networks by means of synthetic data. The application concerns a large network, the co-authorship network in Astrophysics.

Carlo Drago

### Text Mining and Big Textual Data: Relevant Statistical Models

A general overview is provided through examples and case studies, retrieved from research experiences, to foster description and debate on effectiveness in Big Data environments. At issue are early stage case studies relating to: research publishing and research impact; literature, narrative and foundational emotional tracking; and social media, here Twitter, with a social science orientation. Central relevance and importance will be associated with the following aspects of analytical methodology: context, leading to availing of semantics; focus, motivating homology between fields of analytical orientation; resolution scale, which can incorporate a concept hierarchy and aggregation in general; and acknowledging all that is implied by this expression: correlation is not causation. Application areas are: quantitative and also qualitative assessment, narrative analysis and assessing impact, and baselining and contextualizing, statistically and in related aspects such as visualization.

Fionn Murtagh

### A Three-Way Data Analysis Approach for Analyzing Multiplex Networks

In the present contribution, the use of factorial methods for three-way data is proposed to visually explore the structure of multiplex networks, that is, networks in which several relationships are measured on a common set of nodes. Specifically, the DISTATIS technique, an extension of multidimensional scaling to three-way data, is used to analyze multiplex one-mode networks. In this procedure different types of relationships are represented in separate spaces and in a compromise space. A well-known dataset in the related literature is considered to illustrate how this procedure works in practice.

Giancarlo Ragozini, Maria Prosperina Vitale, Giuseppe Giordano

### Comparing FPCA Based on Conditional Quantile Functions and FPCA Based on Conditional Mean Function

In this work functional principal component analysis (FPCA) based on quantile functions is proposed as an alternative to the classical approach, based on the functional mean. Quantile regression characterizes the conditional distribution of a response variable and, in particular, some features like the tail behavior; smoothing splines have also been usefully applied to quantile regression to allow for more flexible modelling. This framework finds application in contexts involving multiple high-frequency time series, for which the functional data analysis (FDA) approach is a natural choice. Quantile regression is then extended to the estimation of functional quantiles, and our proposal explores the performance of the three-mode FPCA as a tool for summarizing information when functional quantiles of different orders are simultaneously considered. The methodology is illustrated and compared with the functional mean based FPCA through an application to air pollution data.

M. Ruggieri, F. Di Salvo, A. Plaia

### Statistical Archetypal Analysis for Cognitive Categorization

Human knowledge develops through complex relationships between categories. In the era of Big Data, the concept of categorization implies data summarization in a limited number of well-separated groups that must be maximally and internally homogeneous at the same time. This proposal exploits archetypal analysis capabilities by finding a set of extreme points that can summarize entire data sets in homogeneous groups. The archetypes are then used to identify the best prototypes according to Rosch’s definition. Finally, in the geometric approach to cognitive science, the Voronoi tessellation based on the prototypes is used to define categorization. An example using a well-known wine dataset by Forina et al. illustrates the procedure.

Francesco Santelli, Francesco Palumbo, Giancarlo Ragozini

### Inferring Rater Agreement with Ordinal Classification

In several contexts ranging from medical to social sciences, rater reliability is assessed in terms of intra- (or inter-) rater agreement. The extent of rater agreement is commonly characterized by comparing the value of the adopted agreement coefficient against a benchmark scale. This deterministic approach has been widely criticized since it neglects the influence of experimental conditions on the estimated agreement coefficient. In order to overcome this criticism, in this paper an inferential procedure for benchmarking is presented. The proposed procedure is based on non-parametric bootstrap confidence intervals. The statistical properties of the proposed procedure have been studied for two bootstrap confidence intervals via a Monte Carlo simulation. The simulated scenarios differ in sample size (i.e. n = 10, 30, 50, 100 items) and rating scale dimension (i.e. k = 2, 3, 5, 7 categories).

Amalia Vanacore, Maria Sole Pellegrino
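To make the idea of a non-parametric bootstrap confidence interval for an agreement coefficient concrete, here is a minimal sketch. It assumes Cohen’s kappa as the coefficient and the percentile method as the interval; the abstract above names neither, so both are illustrative choices, as are the function names and the simulated ratings.

```python
import numpy as np

def cohen_kappa(r1, r2, k):
    """Cohen's kappa for two raters using categories 0..k-1 (assumed coefficient)."""
    r1 = np.asarray(r1, dtype=int)
    r2 = np.asarray(r2, dtype=int)
    table = np.zeros((k, k))
    np.add.at(table, (r1, r2), 1)          # k x k contingency table of ratings
    p = table / table.sum()
    p_obs = np.trace(p)                    # observed agreement
    p_exp = p.sum(axis=1) @ p.sum(axis=0)  # agreement expected by chance
    if p_exp == 1.0:                       # degenerate table: kappa undefined
        return np.nan
    return (p_obs - p_exp) / (1.0 - p_exp)

def bootstrap_kappa_ci(r1, r2, k, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval obtained by resampling items with replacement."""
    r1 = np.asarray(r1, dtype=int)
    r2 = np.asarray(r2, dtype=int)
    rng = np.random.default_rng(seed)
    n = len(r1)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample item indices
        stats[b] = cohen_kappa(r1[idx], r2[idx], k)
    alpha = 1.0 - level
    lower, upper = np.nanpercentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

# Simulated ratings (assumed data): n = 30 items, k = 3 categories,
# rater B copies rater A about 80% of the time.
rng = np.random.default_rng(42)
rater_a = rng.integers(0, 3, size=30)
rater_b = np.where(rng.random(30) < 0.8, rater_a, rng.integers(0, 3, size=30))

print(cohen_kappa(rater_a, rater_b, k=3))
print(bootstrap_kappa_ci(rater_a, rater_b, k=3))
```

An inferential benchmarking rule could then compare the lower bound of the interval, rather than the point estimate, against the thresholds of a benchmark scale.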

### Bayesian Analysis of ERG Models for Multilevel, Multiplex, and Multilayered Networks with Sampled or Missing Data

Social network analysis has typically concerned analysis of one type of tie connecting nodes of the same type. It has however been recognised that people are connected through multiple types of ties and that people in addition are affiliated with multiple types of non-people nodes. Exponential random graph models (ERGMs) are a family of statistical models for social networks that at this point accommodate a number of different types of network data, including one-mode networks, bipartite networks, multiplex data, as well as multilevel network data. Multilevel networks have been proposed as a joint representation of associations between multiple types of entities or nodes, such as people and organizations, where two types of nodes give rise to three distinct types of ties. The typical roster data collection method may be impractical or infeasible when the node sets are hard to detect or define or because of the cognitive demands on respondents. Multilevel multilayered networks allow us to consider a multitude of different sources of data and to sample on different types of nodes and relations. We consider modelling multilevel multilayered networks using exponential random graph models and extend a recently developed Bayesian data-augmentation scheme to allow for partially missing data. We illustrate the proposed inference procedures for the case of multilevel snowball sampling and sampling with error based on the Noordin Top network.

Johan Koskinen, Chiara Broccatelli, Peng Wang, Garry Robins

### Bayesian Kantorovich Deconvolution in Finite Mixture Models

This chapter addresses the problem of recovering the mixing distribution in finite kernel mixture models, when the number of components is unknown, yet bounded above by a fixed number. Taking a step back to the historical development of the analysis of this problem within the Bayesian paradigm and making use of the current methodology for the study of the posterior concentration phenomenon, we show that, for general prior laws supported over the space of mixing distributions with at most a fixed number of components, under replicated observations from the mixed density, the mixing distribution is estimable in the Kantorovich or $$L^1$$-Wasserstein metric at the optimal pointwise rate $$n^{-1/4}$$ (up to a logarithmic factor), $$n$$ being the sample size.

Catia Scricciolo

### Discovering and Locating High-Energy Extra-galactic Sources by Bayesian Mixture Modelling

Discovering and locating gamma-ray sources in the whole sky map is a declared target of the Fermi Gamma-ray Space Telescope collaboration. In this paper, we carry out an unsupervised analysis of the collection of high-energy photons accumulated by the Large Area Telescope, the principal instrument on board the Fermi spacecraft, over a period of around 7.5 years using a Bayesian mixture model. A fixed, though unknown, number of parametric components identify the extra-galactic emitting sources we are searching for, while a further component represents parametrically the diffuse gamma-ray background due to both extra-galactic and galactic high-energy photon emission. We determine the number of sources, their coordinates on the map and their intensities. The model parameters are estimated using a reversible jump MCMC algorithm which implements four different types of moves. These allow us to explore the dimension of the parameter space. The possible transitions remove a source from or add a source to the model, while leaving the background component unchanged. We furthermore present a heuristic procedure, based on the posterior distribution of the mixture weights, to qualify the nature of each detected source.

Andrea Sottosanti, Denise Costantin, Denis Bastieri, Alessandra R. Brazzale

### Bayesian Estimation of Causal Effects in Carcinogenicity Tests Based upon CTA

Although more than 30,000 chemical substances are currently produced or imported in the European Union in volumes of 1 ton or more per year, most of them have yet to be tested for carcinogenicity. Cell Transformation Assays (CTAs) are cheap and fast in vitro methods developed to screen chemical substances without resorting to animal-based testing. Here we propose two models for potential outcomes to estimate causal effects of different concentrations of a candidate carcinogen on counts of Type III foci growing within Petri dishes. A comparison of our proposals with simpler alternatives suggested in the literature for the BALB/c 3T3 CTA protocol is performed using the LOO information criterion. Here we overcome data manipulations recently proposed in the literature by introducing a flexible class of models based on experts’ beliefs that do not require: (i) adding fake observations to actual data; (ii) making cumbersome transformations to original counts; (iii) constraining distributions at low concentrations to have a variance larger than the mean. Open issues are discussed in relation to the current practice adopted to perform multi-laboratory experiments on the same substance.

Federico M. Stefanini, Giulia Callegaro

### Performance Comparison of Heterogeneity Measures for Count Data Models in Bayesian Perspective

The random effects model is one of the most widely used statistical techniques for combining information from multiple independent studies and examining heterogeneity. The present study focuses on count data models, which are comparatively uncommon in such research. A further aim is to exploit the advantage of Bayesian modelling by incorporating plausible prior distributions on the parameter of interest. The study is illustrated with data on rental bikes obtained from the UC Irvine Machine Learning Repository. The results indicate the impact of prior distributions and of the choice of heterogeneity estimators in count data models.

M. Subbiah, R. Renuka Devi, M. Gallo, M. R. Srinivasan

### Sampling and Modelling Issues Using Big Data in Now-Casting

The use of Big Data and, more specifically, Google Trends data in now-casting and forecasting has become common practice, even among institutes and organizations producing official statistics worldwide. However, the use of Big Data has many neglected implications in terms of model estimation, testing and forecasting, with a significant impact on final results and their interpretation. Using a MIDAS model with Google Trends covariates, we analyse sampling error issues and time-domain effects triggered by these new digital-economy data sources.

M. Simona Andreano, Roberto Benedetti, Federica Piersimoni, Paolo Postiglione, Giovanni Savio

### Sample Design for the Integration of Population Census and Social Surveys

Starting from 2018, the Italian National Statistical Institute launched a new census system, named Permanent Census, which, integrating administrative data and data coming from surveys, will be carried out every year. This puts an end to the era of traditional decennial censuses. The census survey sample is aimed at updating the data contained in the integrated system of registers. Furthermore, the new census will be integrated with the main social surveys. The aim of this work is to compare two sampling strategies for the census survey sample. The first comprises pooling together the samples of the main social surveys, while the second consists of an ad hoc sampling design. Different estimation procedures are taken into account in order to compare the two sampling strategies.

D’Alò Michele, Falorsi Stefano, Fasulo Andrea, Solari Fabrizio

### Sampling Schemes Using Scanner Data for the Consumer Price Index

The Italian National Institute of Statistics (ISTAT) is carrying out a redesign of the Consumer Price Survey (CPS). The availability of Scanner Data (SD) from modern retail distribution, provided to ISTAT by Nielsen for a large number of stores selling food and grocery, is the starting point of this innovation. Indeed, SD represent a big opportunity for improving the computation of the Consumer Price Index (CPI). This work aims to study the properties of alternative aggregation formulas of the elementary price index in different sampling schemes implemented on SD. Bias and efficiency of the estimated indices are evaluated through a Monte Carlo simulation.

Claudia De Vitiis, Alessio Guandalini, Francesca Inglese, Marco Dionisio Terribili

### An Investigation of Hierarchical and Empirical Bayesian Small Area Predictors Under Measurement Error

In this paper we focus on small area models with measurement error in covariates. Based on data from the Measuring Morality study, a nationally representative survey of United States residents, that contains a validated behavioural measure of generosity (the dictator game) along with the household income of respondents, we define a measurement error model suitable to obtain area-level estimates of generosity at the district level. We investigate the effect of introducing the measurement error in this model, focusing on fully Bayesian as well as Empirical Bayesian (EB) estimation proposed in the recent literature. We discuss the characteristics of each of the two approaches and analyze the impact of the measurement error on the resulting estimates based on real data and a simulation study.

Silvia Polettini, Serena Arima

### Indicators for Monitoring the Survey Data Quality When Non-response or a Convenience Sample Occurs

Non-response bias has long been a concern for surveys, even more so over the past decades with the continuing decline of response rates. A similar problem concerns surveys based on non-representative samples, whose convenience and cost-effectiveness have increased with recent technological innovations that make it possible to collect large, highly non-representative samples via online surveys. In both cases it must be assumed that the bias is the result of a self-selection process and, for both, quality indicators are needed to measure the impact of this process. The goal of this research is to show the value, in each survey, of monitoring the risk of self-selection bias at two different levels: at the level of the whole survey and at the level of each statistic of interest. The combined use of two indicators is suggested and empirically evaluated under various scenarios.

Emilia Rocco

### The Propensity to Leave the Country of Origin of Young Europeans

Using data from the “Youth Project”, a survey carried out by the Toniolo Institute for Advanced Studies, we provide evidence of the determinants of the propensity to leave the native country by young Europeans and show how this phenomenon depends on the economic opportunities offered by the countries of origin. In addition, we underline the effect of individuals’ trust in the economic development of the country of residence as a main predictor of the intention of moving away.

Paolo Balduzzi, Alessandro Rosina, Emiliano Sironi

### New Insights on Student Evaluation of Teaching in Italy

This work focuses on the relationship between student evaluation of teaching and student, teacher and course specific characteristics, exploiting the richness of information collected by a new survey carried out among professors of the University of Padua. Data collected in this survey highlight teachers’ needs, beliefs and practices of teaching and learning. This makes it possible to introduce some subjective traits of the teachers into the study. The role of these new variables in explaining student evaluations is investigated in depth.

Francesca Bassi, Leonardo Grilli, Omar Paccagnella, Carla Rampichini, Roberta Varriale

### Eurostat Methodological Network: Skills Mapping for a Collaborative Statistical Office

Collaboration, interaction and exchange of knowledge among staff are important components for developing and enriching the scientific intelligence within a statistical office. The Eurostat methodological network has been built as a skills mapping tool aiming to identify in-house competencies for innovation and affordable diffusion of knowledge, and to promote collaboration on methodological issues and processes within the statistical office. In this exercise we mainly focus on staff knowledge and on working and academic experience in statistics and econometrics. Quantitative network analysis metrics are used to measure the strengths of methodological competencies within Eurostat, to identify groups of people for collaboration in providing results on specific tasks, or to characterise areas that are not fully integrated into the methodological network. By combining network visualisation and quantitative analysis, we are able to easily assess the competency level for each dimension of interest. Network analysis helps us in making decisions related to the improvement of staff communication and collaboration, by building mechanisms for information flows and filling competency gaps. Representing the data as a mathematical graph makes the general view readily visible, captures its structure, and permits us to focus on persons, competencies and the relations between them. Modernisation of ways of working leads to a more cost-effective use of resources.

Agne Bikauskaite, Dario Buono

### The Evaluation of the Inequality Between Population Subgroups

This paper illustrates the advantages of evaluating inequality between population subgroups with respect to a maximum compatible with the observed data, thus going beyond the traditional approach to the analysis of between-group inequality, where the maximum corresponds to total inequality. The new proposal improves both the measurement and the interpretation of the contribution of between-group inequality to total inequality.

Michele Costa

### Basketball Analytics Using Spatial Tracking Data

Spatial tracking data are used in sport analytics to study the players’ position during the game in order to evaluate game strategies, players’ roles, performance, also in prospect. From the broad fields of statistics, mathematics, information science and computer science it is possible to draw theories and methods useful to produce innovative results based on speed, distance, players’ separation trajectories. In basketball, spatial tracking data can be combined with play-by-play data, joining results on spatial movements to team performance. In this paper, using tracking data from basketball, we study the spatial pattern of players on the court in order to contribute to the literature of data mining methods for tracking data analysis in sports, with the final objective of suggesting new game strategies to improve team performance.

Marica Manisera, Rodolfo Metulini, Paola Zuccolotto

### New Fuzzy Composite Indicators for Dyslexia

Composite indicators should ideally identify multidimensional concepts that cannot be captured by a single variable. In this paper, we suggest a method based on fuzzy set theory for the construction of fuzzy synthetic indexes of dyslexia, using the set of manifest variables measured by means of reading tests. A few criteria for assigning values to the membership function are discussed, as well as criteria for defining the weights of the variables. An application regarding the diagnosis of dyslexia in primary and middle school in Italy is presented. In this application, the fuzzy approach is compared with the crisp approach currently used in Italy for detecting dyslexic children in compulsory school.

Isabella Morlini, Maristella Scorza

### Who Tweets in Italian? Demographic Characteristics of Twitter Users

In this paper we try for the first time to shed light on the use of Twitter by Italian-speaking users, quantifying the total audience and some relevant characteristics: in particular, gender and location. The attempt is based on publicly available API data referring both to profile documents and tweets. Through real-time calculation it is possible to infer gender, mainly using the name field of the users’ profiles, while the geo-location is deduced using the location field and the geotagged tweets.

Righi Alessandra, Mauro M. Gentile, Domenico M. Bianco

### An Approach to Developing a Scoring System for Peer-to-Peer (p2p) Lending Platform

The paper reviews the possibilities of using survival analysis tools to configure scoring systems for a p2p lending platform. Along with the Cox model, the models of log-logistic regression, accelerated failure time (AFT) model and Weibull regression were considered in this study. To test the stability of the factor influence, the models were built for discretized observation periods (12 months, 24 months and 36 months). The sample consisted of 887,379 observations for the period of 2007–2016. The study examined loans issued for the period of 36 months. Proportional hazard models were also analyzed taking into account the grouping feature of borrowers’ creditworthiness. The best model describing the state duration before the default was chosen. As a result of the analysis, the factors affecting the probability of the borrower default during the considered period of time were revealed. It was determined that the greatest influence on the default risk was exerted by the purpose of loan and the interest rate regardless of the considered dynamics. The borrower’s income also had a significant impact on the default risk.

Alexander Agapitov, Irina Lakman, Zoya Maksimenko, Natalia Efimenko

### What Do Employers Look for When Hiring New Graduates? Answers from the Electus Survey

This paper presents the main results obtained from the Electus survey targeting 471 Lombardy companies with at least 15 employees. The project aims to understand the criteria entrepreneurs use when choosing among graduates applying for a job vacancy. This study also aims to evaluate which features of a graduate’s profile employers value in potential candidates for five job positions (Administration clerk; Human Resource assistant; ICT professional; Marketing assistant; CRM assistant). In order to estimate the entrepreneurs’ preferences about skills and competencies for new hires, Conjoint Analysis is adopted. Finally, using a new definition of the relative importance of attributes, the analysis derives the monetary value of the skills possessed by the candidates.

Paolo Mariani, Andrea Marletta, Mariangela Zenga

### Modeling Household Income with Contaminated Unimodal Distributions

In many countries, income inequality has reached its highest level over the past half century. In the labor market, technological progress has widened the earnings gap between high- and low-skilled workers. Changes in the structure of households, with a growing percentage of single-headed households, and in family formation, with an increased earnings correlation among partners in couples, are contributing to increasing inequality. A key step in measuring income inequality is the estimation of the income distribution, due to the sensitivity of usual inequality measures to extreme values. To deal with this issue, we propose the use of contaminated lognormal and gamma models and we derive the formulations for computing the Gini index based on the model parameters. An application to 101 empirical income distributions that include countries at different development stages is presented.

Angelo Mazza, Antonio Punzo

### Endowments and Rewards in the Labour Market: Their Role in Changing Wage Inequality in Europe

This paper proposes a comparative analysis on how the recent structural changes in the workforce composition affect wage inequality in a set of European countries. By performing RIF regression on the EU-SILC data, we assess how much of the overall Gini gap between 2005 and 2013 is due to employees’ characteristics rather than the capability of each country’s labour market to capitalise skills. The outright deterioration of all jobs, irrespective of skill levels required, and the lack of a well-defined structure of labour market may jeopardise wage distribution, and the wage structure plays a leading role in this process.

Gennaro Punzo, Mariateresa Ciommi, Gaetano Musella, Rosalia Castellano

### An Analysis of Wage Distribution Equality Dynamics in Poland Based on Linear Dependencies

This work investigates the gross wage distribution dynamics in Poland in different time periods. The study includes several stages and components. We first estimate the linear relationships between wages in adjacent time periods, along with the content analysis of the obtained constant dependency coefficients. We observe differences in the dynamics of wage growth across classes of wage-earners. We also calculate the value of Gini coefficients, as well as the characteristics of wage equality distribution. We then analyze the obtained linear dependencies with the use of dispersion elements analysis. Our findings show that the dynamics of wages distribution in Poland are in line with the government’s goals with regard to a fairer wage distribution consistent with the current stage of the country’s socio-economic development. We then analyze these findings in light of the dynamics of wage distribution. In particular, we focus on differences in wage growth across classes of wage earners. A logarithmic function of the cost effect is used for quantitative analysis. Overall, this study contributes to the literature on the patterns of wage distribution dynamics, which have important policy implications for Poland and other countries at similar stages of socio-economic development.

Viktoriya Voytsekhovska, Olivier Karl Butzbach

### Unions of Orthogonal Arrays and Their Aberrations via Hilbert Bases

We generate all the Orthogonal Arrays (OAs) of a given size n and strength t as the union of a collection of OAs which belong to an inclusion-minimal set of OAs. We derive a formula for computing the (Generalized) Word Length Pattern of a union of OAs that makes use of their polynomial counting functions. The best OAs according to the Generalized Minimum Aberration criterion can thereby be found simply by exploring a relatively small set of counting functions. The classes of OAs with 5 binary factors, strength 2, and sizes 16 and 20 are fully described.

Roberto Fontana, Fabio Rapallo

### A Copula-Based Hidden Markov Model for Toroidal Time Series

Toroidal time series are temporal sequences of bivariate angular observations that often arise in environmental and ecological studies. A hidden Markov model is proposed for segmenting these data according to a finite number of latent classes, associated with copula-based toroidal densities. The model conveniently integrates circular correlation, multimodality and temporal auto-correlation. A computationally efficient EM algorithm is proposed for parameter estimation. The proposal is illustrated on a time series of wind and sea wave directions.

Francesco Lagona

### A Biased Kaczmarz Algorithm for Clustered Equations

The Kaczmarz method is an iterative algorithm for solving overdetermined linear systems by consecutive projections onto the hyperplanes defined by the system equations. The method has a wide range of applications in signal processing, notably for biomedical imaging in X-ray tomography. It has been shown that selecting the hyperplane randomly at each iteration guarantees exponential convergence to the solution. We propose here a new implementation of the Kaczmarz method for clustered equations. When the hyperplanes are grouped into directional clusters, we draw the projection promoting sparse high-variance clusters. This leads to an improvement in performance, as we show in several numerical experiments. Some applications to image reconstruction are presented.

Alessandro Lanteri, Mauro Maggioni, Stefano Vigogna
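For readers unfamiliar with the baseline that the abstract above builds on, the sketch below implements the standard randomized Kaczmarz iteration, in which row $$i$$ is drawn with probability proportional to $$\|a_i\|^2$$ and the iterate is projected onto the hyperplane $$a_i^\top x = b_i$$. This is not the authors’ biased, cluster-aware sampling scheme, whose details the abstract does not give; the function name, iteration count and test system are assumptions.

```python
import numpy as np

def randomized_kaczmarz(A, b, n_iter=2000, seed=0):
    """Standard randomized Kaczmarz for a consistent system A x = b.

    Rows are sampled with probability proportional to their squared norm;
    each step projects the iterate onto the selected hyperplane.
    """
    A = np.asarray(A, dtype=float)
    b = np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)
    row_norms_sq = np.einsum("ij,ij->i", A, A)   # squared norm of each row
    probs = row_norms_sq / row_norms_sq.sum()
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        i = rng.choice(A.shape[0], p=probs)      # pick one equation at random
        residual = b[i] - A[i] @ x
        x += (residual / row_norms_sq[i]) * A[i] # orthogonal projection step
    return x

# Assumed test problem: an overdetermined but consistent Gaussian system.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5))
x_true = rng.standard_normal(5)
b = A @ x_true

x_hat = randomized_kaczmarz(A, b)
print(np.linalg.norm(x_hat - x_true))  # error should be close to machine precision
```

A clustered variant would replace the `probs` vector with a distribution that favours particular groups of rows; how those groups and their weights are defined is the subject of the chapter and is not reproduced here.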

### Nearly Unbiased Probability Plots for Extreme Value Distributions

Probability plots allow for a straightforward analysis of the data and interpretation of results, even by non-statisticians, and still play a central role in today’s software. In this chapter, probability plots for extreme value (EV) distributions are developed based on the generalized least-squares distribution fitting method and on convenient approximations of the first two moments of order statistics from the standard EV distributions. The proposed probability plots lead to graphical estimators of parameters that are shown to be nearly unbiased through the use of pivotal indices that avoid the massive numerical investigations usually presented for similar purposes in the recent literature. Although more efficient biased solutions can be theoretically found, the obtained parameter estimators also achieve adequate performance in terms of mean square deviation with respect to those derived through probability plots that have been presented separately in the literature as the most effective for EV distributions. Lastly, a real-case study is presented concerning wind speed data collected at a candidate wind farm site in Southern Italy. The results demonstrate how the proposed probability plot can effectively support EV analysis and assist practitioners in the selection of the turbine class to be installed.

Antonio Lepore

### Estimating High-Dimensional Regression Models with Bootstrap Group Penalties

Currently many research problems are addressed by analysing datasets characterized by a huge number of variables, with a relatively limited number of observations, especially when data are generated by experimentation. Most of the classical statistical procedures for regression analysis are often inadequate to deal with such datasets as they have been developed assuming that the number of observations is larger than the number of variables. In this work, we propose a new penalization procedure for variable selection in regression models based on Bootstrap group Penalties (BgP). This new family of penalization methods extends the bootstrap version of the LASSO approach by taking into account the grouping structure that may be present or introduced in the model. We develop a simulation study to compare the performance of this new approach with that of several existing group penalization methods in terms of both prediction accuracy and variable selection quality. The results achieved in this study show that the new procedure outperforms the other penalization procedures considered.

Valentina Mameli, Debora Slanzi, Irene Poli