## About this book

Highlighting the latest advances in nonparametric and semiparametric statistics, this book gathers selected peer-reviewed contributions presented at the 4th Conference of the International Society for Nonparametric Statistics (ISNPS), held in Salerno, Italy, on June 11-15, 2018. It covers theory, methodology, applications and computational aspects, addressing topics such as nonparametric curve estimation, regression smoothing, models for time series and more generally dependent data, varying coefficient models, symmetry testing, robust estimation, and rank-based methods for factorial design. It also discusses nonparametric and permutation solutions for several different types of data, including ordinal data, spatial data, survival data and the joint modeling of both longitudinal and time-to-event data, permutation and resampling techniques, and practical applications of nonparametric statistics.

The International Society for Nonparametric Statistics is a unique global organization, and its international conferences are intended to foster the exchange of ideas and the latest advances and trends among researchers from around the world and to develop and disseminate nonparametric statistics knowledge. The ISNPS 2018 conference in Salerno was organized with the support of the American Statistical Association, the Institute of Mathematical Statistics, the Bernoulli Society for Mathematical Statistics and Probability, the Journal of Nonparametric Statistics and the University of Salerno.

## Table of contents

### Portfolio Optimisation via Graphical Least Squares Estimation

In this paper, an unbiased estimation method called GLSE (proposed by Aldahmani and Dai [1]) for solving the linear regression problem in high-dimensional data ($$n<p$$) is applied to portfolio optimisation under the linear regression framework and compared to the ridge method. The unbiasedness of the method helps improve portfolio performance by increasing the expected return and decreasing the associated risk when $$n<p$$, thus leading to a maximisation of the Sharpe ratio. The method is verified through simulation and data analysis studies whose results are compared with those of ridge regression. GLSE is found to outperform ridge in portfolio optimisation when $$n<p$$.

Saeed Aldahmani, Hongsheng Dai, Qiao-Zhen Zhang, Marialuisa Restaino

### Change of Measure Applications in Nonparametric Statistics

Neyman [7] was the first to propose a change of measure in the context of goodness-of-fit problems. This provided an alternative density to the one for the null hypothesis. Hoeffding introduced a change of measure formula for the ranks of the observed data, which led to locally most powerful rank tests. In this paper, we review these methods and propose a new approach which, on the one hand, leads to new derivations of existing statistics and, on the other hand, yields Bayesian applications for ranking data.

Mayer Alvo

### Choosing Between Weekly and Monthly Volatility Drivers Within a Double Asymmetric GARCH-MIDAS Model

Volatility in financial markets has both low- and high-frequency components which determine its dynamic evolution. Previous modelling efforts in the GARCH context (e.g. the Spline-GARCH) were aimed at estimating the low-frequency component as a smooth function of time around which short-term dynamics evolve. Alternatively, recent literature has introduced the possibility of considering data sampled at different frequencies to estimate the influence of macro-variables on volatility. In this paper, we extend a recently developed model, here labelled the Double Asymmetric GARCH-MIDAS model, where a market volatility variable (in our context, VIX) is inserted as a daily lagged variable, and monthly variations represent an additional channel through which market volatility can influence individual stocks. We want to convey the idea that such variations (separately) affect the short- and long-run components, possibly having a separate impact according to their sign.

Alessandra Amendola, Vincenzo Candila, Giampiero M. Gallo

### Goodness-of-fit Test for the Baseline Hazard Rate

We provide a nonparametric test procedure for the baseline hazard function in the generalized Cox model in the presence of fixed given covariates. The test statistic is given by an optimal estimator of the quadratic functional of the same function. Our test procedure attains the rate $$n^{-4\alpha /(4\alpha +1)}$$ over Besov classes of functions $$B_{\alpha }^{2,\infty }(L)$$, $$\alpha ,L>0$$, which is known to be minimax optimal in the context of testing the intensity function of a Poisson process.

A. Anfriani, C. Butucea, E. Gerardin, T. Jeantheau, U. Lecleire

### Permutation Tests for Multivariate Stratified Data: Synchronized or Unsynchronized Permutations?

In the present work, we adopt a method based on permutation tests for stratified experiments. The method consists in computing permutation tests separately for each stratum and then combining the results. We know that by performing permutation tests simultaneously (synchronized) in different strata, we maintain the underlying dependence structure and can properly adopt the nonparametric combination of dependent tests procedure. But when strata have different sample sizes, performing the same permutations is not allowed. On the other hand, if units in different strata can be assumed independent, we can perform permutation tests independently (unsynchronized) for each stratum and then combine the resulting p-values. In this work, we show that when strata are independent, synchronized and unsynchronized permutations can be adopted equivalently.

Rosa Arboretti, Eleonora Carrozzo, Luigi Salmaso
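The unsynchronized route described in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure: the test statistic (difference in means), the one-sided alternative, the function names and the toy data are hypothetical, and Fisher's chi-square combination is swapped in as a simple rule for independent p-values in place of the nonparametric combination used in the chapter.

```python
import math
import random


def perm_pvalue(x, y, n_perm=2000, rng=None):
    """One-sided permutation p-value for mean(x) > mean(y) within one stratum."""
    rng = rng or random.Random(0)
    pooled = list(x) + list(y)
    n_x = len(x)
    observed = sum(x) / n_x - sum(y) / len(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # permute group labels inside this stratum only
        stat = sum(pooled[:n_x]) / n_x - sum(pooled[n_x:]) / len(y)
        if stat >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)


def fisher_combine(pvalues):
    """Fisher's combination of independent p-values (chi-square, 2k d.o.f.)."""
    k = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)  # T/2 with T = -2 * sum(log p)
    # Closed-form survival function of a chi-square with even d.o.f. 2k.
    return math.exp(-half) * sum(half ** j / math.factorial(j) for j in range(k))


# Two independent strata with different sample sizes (toy data).
strata = [
    ([5.1, 6.0, 5.5, 6.2, 5.8], [1.0, 1.2, 0.8, 1.1]),
    ([3.9, 4.4, 4.1], [2.0, 2.2, 1.9, 2.1, 2.3, 1.8]),
]
rng = random.Random(42)
p_values = [perm_pvalue(x, y, rng=rng) for x, y in strata]
combined = fisher_combine(p_values)
```

Because each stratum is permuted on its own, unequal stratum sizes pose no difficulty here, which is the situation in which synchronized permutations are unavailable.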

### An Extension of the DgLARS Method to High-Dimensional Relative Risk Regression Models

In recent years, clinical studies in which patients are routinely screened for many genomic features have become more common. The general aim of such studies is to find genomic signatures useful for treatment decisions and the development of new treatments. However, genomic data are typically noisy and high dimensional, with the number of features often exceeding the number of patients included in the study. For this reason, sparse estimators are usually used in the study of high-dimensional survival data. In this paper, we propose an extension of the differential geometric least angle regression method to high-dimensional relative risk regression models.

Luigi Augugliaro, Ernst C. Wit, Angelo M. Mineo

### A Kernel Goodness-of-fit Test for Maximum Likelihood Density Estimates of Normal Mixtures

This article contributes a methodological advance to help practitioners choose between parametric and nonparametric estimates for mixtures of normal distributions. To facilitate the decision, a goodness-of-fit test based on the integrated square error difference between the classical kernel density and the maximum likelihood estimates is introduced. Its asymptotic distribution under the null is quantified analytically, and a hypothesis test is then developed to help practitioners choose between the two estimation options. The article concludes with an example which exhibits the operational characteristics of the procedure.

Dimitrios Bagkavos, Prakash N. Patil

### Robust Estimation of Sparse Signal with Unknown Sparsity Cluster Value

In the signal+noise model, we assume that the signal has a more general sparsity structure in the sense that the majority of signal coordinates are equal to some value which is assumed to be unknown, contrary to the classical sparsity context where one knows the sparsity cluster value (typically, zero by default). We apply an empirical Bayes approach (linked to the penalization method) for inference on the signal, possibly sparse in this more general sense. The resulting method is robust in that we do not need to know the sparsity cluster value; in fact, the method extracts as much generalized sparsity as there is in the underlying signal. However, as compared to the case of known sparsity cluster value, the proposed robust method can no longer be reduced to a thresholding procedure. We propose two new procedures: the empirical Bayes model averaging (EBMA) and empirical Bayes model selection (EBMS) procedures. The former is realized by an MCMC algorithm based on the partial (mixed) normal–normal conjugacy built into our modeling stage, and the latter is based on a new optimization algorithm of $$O(n^2)$$ complexity. We perform simulations to demonstrate how the proposed procedures work and accommodate possible systematic error in the sparsity cluster value.

Eduard Belitser, Nurzhan Nurushev, Paulo Serra

### Test for Sign Effect in Intertemporal Choice Experiments: A Nonparametric Solution

In order to prove the hypothesis of sign effect in intertemporal choice experiments, the empirical studies described in the specialized literature apply univariate tests (in most cases parametric t or F tests) even when multivariate inferential procedures are more suitable given the experimental data, the study design and the goal of the analysis. Furthermore, the tests used do not take into account the possible presence of confounding effects, which are very common in this kind of experimental study. In this paper, a multivariate nonparametric method to test for sign effect in intertemporal choice is proposed. This method overcomes the limits of the tests usually applied in previous studies. A case study related to a survey performed at the University of Almeria (Spain) is presented. The methodological solution based on the nonparametric test is described, and the results of its application to the data collected in the Almeria sample survey are shown.

Stefano Bonnini, Isabel Maria Parra Oller

### Nonparametric First-Order Analysis of Spatial and Spatio-Temporal Point Processes

First-order characteristics are essential functions in point processes representing the distribution of events in the corresponding domain. For decades, the inconsistency of the first-order kernel intensity estimator has been an obstacle to performing inference in the point process context. In this work, we develop different procedures to obtain consistent estimators of the first-order intensity function, and we also propose bootstrap procedures to define effective bandwidth selectors. Moreover, these innovations are used in three testing problems: the goodness-of-fit of an appealing model in the literature of point processes with covariates, the nonparametric comparison of first-order intensity functions and a separability test for spatio-temporal point processes. We illustrate the above-mentioned procedures with two wildfire data sets in Galicia (NW Spain) and in Canada.

M. I. Borrajo, I. Fuentes-Santos, W. González-Manteiga

### Bayesian Nonparametric Prediction with Multi-sample Data

In the present paper, we address the problem of prediction within the setting of species sampling models. We consider d populations composed of different species with unknown proportions. Our goal is to predict specific features of additional and unobserved samples from the d populations by adopting a Bayesian nonparametric model. We focus on a broad class of hierarchical priors. These were introduced and investigated in [1], where an algorithm for drawing predictions is also devised, although without any specific numerical illustration. The aim of this paper is twofold: on the one hand, we provide an illustration with an actual implementation of the algorithm of [1] and, on the other hand, we discuss its relevance with respect to complex prediction problems with species sampling data.

Federico Camerlenghi, Antonio Lijoi, Igor Prünster

### Algorithm for Automatic Description of Historical Series of Forecast Error in Electrical Power Grid

The EU-FP7 iTesla project developed a toolbox that assesses dynamic security of large electric power systems in the presence of forecast uncertainties. In particular, one module extracts plausible realizations of the stochastic variables (power injections of Renewable Energy Sources (RES) and load power absorptions). It is built upon historical data series of hourly forecasts and realizations of the stochastic variables at HV (High-Voltage) nodes in the French transmission grid. The data reveal a large diversity of forecast error distributions: characterizing them makes it possible to adapt the module to the data, improving the results. The algorithm presented here aims to automatically classify all the forecast error variables and to cluster them into smoother variables. The main steps of the algorithm are: filtering out the variables with too many missing data or too low variance; outlier detection by two methods (Chebyshev inequality and quantile method); separation of unimodal variables from multimodal ones by means of a peak counting algorithm, Gaussian mixtures, comparison with asymmetrical distributions and a multimodality index; and clustering of the multimodal variables whose sum is unimodal, comparing two alternative algorithms (the former based on hierarchical clustering, accounting for correlation and geographical closeness, and the latter on the identification of the same initial characters in the identification codes).

Gaia Ceresa, Andrea Pitto, Diego Cirio, Nicolas Omont
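The outlier detection step named in this abstract can be illustrated with two generic rules. This is a minimal sketch under assumptions: the abstract does not give the thresholds or the exact form of the quantile method, so the Chebyshev tail probability, the interquartile-range fences, the function names and the toy forecast errors below are all hypothetical choices.

```python
import math
import statistics


def chebyshev_outliers(values, tail_prob=0.1):
    """Flag values farther than k standard deviations from the mean.

    Chebyshev's inequality gives P(|X - mu| >= k * sigma) <= 1 / k**2 for any
    distribution, so k = 1 / sqrt(tail_prob) bounds the flagged mass by tail_prob.
    """
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    k = 1.0 / math.sqrt(tail_prob)
    return [v for v in values if abs(v - mean) > k * sd]


def iqr_outliers(values, factor=1.5):
    """Flag values outside quartile-based fences (assumed form of a quantile rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - factor * spread, q3 + factor * spread
    return [v for v in values if v < low or v > high]


# Toy hourly forecast errors with one gross error (made-up numbers).
errors = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15, -0.05, 0.2, -0.15, 0.1,
          0.0, -0.1, 0.05, -0.05, 0.1, 0.0, -0.2, 0.2, 0.05, 12.0]

flagged_chebyshev = chebyshev_outliers(errors)
flagged_quantile = iqr_outliers(errors)
```

The Chebyshev rule makes no distributional assumption, but its mean and standard deviation are themselves inflated by gross errors; a quantile-based rule is less affected, which may be one reason to run both.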

### Linear Wavelet Estimation in Regression with Additive and Multiplicative Noise

In this paper, we deal with the estimation of an unknown function from a nonparametric regression model with both additive and multiplicative noise. The case of uniform multiplicative noise is considered. We develop a projection estimator based on wavelets for this problem. We prove that it attains a fast rate of convergence under the mean integrated square error over Besov spaces. A practical extension to automatically select the truncation parameter of this estimator is discussed. A numerical study illustrates the usefulness of this extension.

Christophe Chesneau, Junke Kou, Fabien Navarro

### Speeding up Algebraic-Based Sampling via Permutations

Algebraic sampling methods are a powerful tool to perform hypothesis tests on conditional spaces. We analyse the link of the sampling method introduced in [6] with permutation tests and we exploit this link to build a two-step sampling procedure to perform two-sample comparisons for non-negative discrete exponential families. We thus establish a link between standard permutation and algebraic-statistics-based sampling. The proposed method reduces the dimension of the space on which the MCMC sampling is performed by introducing a second step in which a standard Monte Carlo sampling is performed. The advantages of this dimension reduction are verified through a simulation study, showing that the proposed approach grants convergence in the least time and has the lowest mean squared error.

Francesca Romana Crucinio, Roberto Fontana

### Obstacle Problems for Nonlocal Operators: A Brief Overview

In this note, we give a brief overview of obstacle problems for nonlocal operators, focusing on the applications to financial mathematics. The class of nonlocal operators that we consider can be viewed as infinitesimal generators of non-Gaussian asset price models, such as Variance Gamma Processes and Regular Lévy Processes of Exponential type. In this context, we analyze the existence, uniqueness, and regularity of viscosity solutions to obstacle problems which correspond to prices of perpetual and finite expiry American options.

Donatella Danielli, Arshak Petrosyan, Camelia A. Pop

### Low and High Resonance Components Restoration in Multichannel Data

A technique for the restoration of the low resonance and high resonance components of K independently measured signals is presented. The definition of low and high resonance component is given by the Rational Dilatation Wavelet Transform (RADWT), a particular kind of finite frame that provides a sparse representation of functions with different oscillation persistence. It is assumed that the signals are measured simultaneously on several independent channels and that in each channel the underlying signal is the sum of two components: the low resonance component and the high resonance component, both sharing some common characteristic between the channels. Component restoration is performed by means of a lasso-type penalty and a backfitting algorithm. Numerical experiments show the performance of the proposed method in different synthetic scenarios, highlighting the advantage of estimating the two components separately rather than together.

Daniela De Canditiis, Italia De Feis

### Kernel Circular Deconvolution Density Estimation

We consider the problem of nonparametrically estimating a circular density from data contaminated by angular measurement errors. Specifically, we obtain a kernel-type estimator with weight functions that are reminiscent of deconvolution kernels. Here, differently from the Euclidean setting, discrete Fourier coefficients are involved rather than characteristic functions. We provide some simulation results along with a real data application.

Marco Di Marzio, Stefania Fensore, Agnese Panzera, Charles C. Taylor

### Asymptotic for Relative Frequency When Population Is Driven by Arbitrary Unknown Evolution

Strongly consistent estimates are shown, via relative frequency, for the probability of white balls inside a dichotomous urn when such a probability is an arbitrary unknown continuous time-dependent function over a bounded time interval. The asymptotic behaviour of relative frequency is studied in a nonstationary context using a Riemann-Dini type theorem for strong law of large numbers of random variables with arbitrarily different expectations; furthermore, the theoretical results concerning the strong law of large numbers can be applied for estimating the mean function of an unknown form of a general nonstationary process.

Silvano Fiorin

### Semantic Keywords Clustering to Optimize Text Ads Campaigns

In this paper, we describe how to use some well-known machine learning tools to group textual queries of similar semantic meaning. Such a clustering can be used to improve the performance of bidding algorithms for online advertising, by pooling the signal gathered by text ads displayed on result pages of search queries which share a similar meaning. Indeed, search engines organize auctions wherein participants bid on selected search terms on which they wish to display an ad. Generalist e-commerce companies such as Cdiscount bid simultaneously on millions of terms that reflect the diversity of their catalog of products, according to the expected profits associated with the ads. Methods to estimate these expected returns suffer from a sparsity of data, since most of the keywords have little or no historical signal. Grouping them and exploiting information on the most frequent keywords (short tail) to infer information on the less frequent ones (long tail) makes it possible to anticipate user behavior by semantics and to improve the bidding strategy. The plan is the following: pre-process the keywords by stemming, choose an e-commerce training corpus for the Word2Vec model, train it, and perform an embedding into a Euclidean space where we can cluster keywords with a K-means algorithm. We validate our approach on a sub-sample of the keywords for which a non-semantic distance is available. Finally, all the keywords in the same cluster share the same bid, which is computed by aggregating the cluster's historical signal.

Pietro Fodra, Emmanuel Pasquet, Bruno Goutorbe, Guillaume Mohr, Matthieu Cornec
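The last two stages of the plan in this abstract, clustering embedded keywords and sharing one bid per cluster, can be sketched as below. This is a toy illustration under assumptions, not the authors' pipeline: stemming and Word2Vec training are skipped, the two-dimensional embeddings and the click and revenue history are made up, the K-means initialisation is deliberately naive, and revenue per click stands in for whatever bid formula is actually used.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def kmeans(points, k, n_iter=20):
    """Plain Lloyd iterations; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical keyword embeddings (stand-ins for Word2Vec vectors).
embeddings = {
    "red shoes": (0.90, 0.10),
    "garden hose": (0.10, 0.90),
    "running shoes": (0.80, 0.20),
    "sneakers": (0.85, 0.15),
    "hose reel": (0.15, 0.85),
}
# Hypothetical history: (clicks, revenue); long-tail keywords have no signal.
history = {
    "red shoes": (100, 50.0),
    "garden hose": (40, 10.0),
    "running shoes": (0, 0.0),
    "sneakers": (20, 14.0),
    "hose reel": (0, 0.0),
}

keywords = list(embeddings)
labels = kmeans([embeddings[kw] for kw in keywords], k=2)

# Aggregate the historical signal per cluster, then share one bid per cluster.
totals = {}
for kw, lab in zip(keywords, labels):
    clicks, revenue = history[kw]
    prev_clicks, prev_revenue = totals.get(lab, (0, 0.0))
    totals[lab] = (prev_clicks + clicks, prev_revenue + revenue)

bids = {kw: totals[lab][1] / totals[lab][0] for kw, lab in zip(keywords, labels)}
```

In this toy run the keywords without history ("running shoes", "hose reel") inherit the aggregated value of their semantic neighbours, which is the short-tail to long-tail transfer the abstract describes.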

### A Note on Robust Estimation of the Extremal Index

Many examples in the most diverse fields of application show the need for statistical methods of analysis of extremes of dependent data. A crucial issue that appears when there is dependency is the reliable estimation of the extremal index (EI), a parameter related to the clustering of large events. The most popular EI-estimators, like the blocks’ EI-estimators, are very sensitive to anomalous cluster sizes and exhibit a high bias. The need for robust versions of such EI-estimators is the main topic under discussion in this paper.

M. Ivette Gomes, Miranda Cristina, Manuela Souto de Miranda

### Multivariate Permutation Tests for Ordered Categorical Data

The main goal of this article is to compare whether different groups with ordinal responses on the same measurement scale satisfy stochastic dominance and monotonic stochastic ordering. In the literature, the majority of inferential approaches to settle the univariate case are proposed within the likelihood framework. These solutions have very nice characterizations under their stringent assumptions. However, when the set of alternatives lies in a positive orthant with more than four dimensions, it is quite difficult to achieve proper inferences. Further, it is known that testing for stochastic dominance in multivariate cases by the likelihood approach is much more difficult than in the univariate case. This paper discusses the problem within the conditionality principle of inference through the permutation testing approach and the nonparametric combination (NPC) of dependent permutation tests. The NPC approach based on permutation theory is generally appropriate for finding good exact solutions to this kind of problem. Moreover, some solutions for a typical medical example are provided.

Huiting Huang, Fortunato Pesarin, Rosa Arboretti, Riccardo Ceccato

### Smooth Nonparametric Survival Analysis

This research proposes the local polynomial smoothing of the Kaplan–Meier estimate under the fixed design setting. This allows the development of estimates of the distribution function (equivalently the survival function) and its derivatives under the random right censoring model. The asymptotic properties of the estimate, including its asymptotic normality, are established herein.

Dimitrios Ioannides, Dimitrios Bagkavos
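For orientation, the unsmoothed Kaplan–Meier estimate that the chapter takes as its starting point can be computed as in the sketch below. It assumes right-censored data coded as (time, event indicator) pairs; the function name and the toy sample are hypothetical, and the local polynomial smoothing step itself is not reproduced here.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate under random right censoring.

    times  : observed times (event or censoring)
    events : 1 if the event was observed at that time, 0 if censored
    Returns a list of (t, S(t)) at each distinct observed event time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Gather all observations tied at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Product-limit update: multiply by 1 - d / n at each event time.
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


# Toy sample: events at times 1, 3 and 4; censoring at times 2 and 5.
km_curve = kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 1, 0])
```

The result is a step function, so it has no useful derivatives; smoothing it is what gives access to derivative estimates such as the density.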

### Density Estimation Using Multiscale Local Polynomial Transforms

The estimation of a density function with an unknown number of singularities or discontinuities is a typical example of a multiscale problem, with data observed at nonequispaced locations. The data are analyzed through a multiscale local polynomial transform (MLPT), which can be seen as a slightly overcomplete, non-dyadic alternative for a wavelet transform, equipped with the benefits from a local polynomial smoothing procedure. In particular, the multiscale transform adopts a sequence of kernel bandwidths in the local polynomial smoothing as resolution level-dependent, user-controlled scales. The MLPT analysis leads to a reformulation of the problem as a variable selection in a sparse, high-dimensional regression model with exponentially distributed responses. The variable selection is realized by the optimization of the l1-regularized maximum likelihood, where the regularization parameter acts as a threshold. Fine-tuning of the threshold requires the optimization of an information criterion such as AIC. This paper develops discussions on results in [9].

Maarten Jansen

### On Sensitivity of Metalearning: An Illustrative Study for Robust Regression

Metalearning is becoming an increasingly important methodology for extracting knowledge from a database of available training datasets and transferring it to a new (independent) dataset. While the concept of metalearning is becoming popular in statistical learning and is readily available also for the analysis of economic datasets, not much attention has been paid to its limitations and disadvantages. To the best of our knowledge, the current paper represents the first illustration of metalearning sensitivity to data contamination by noise or outliers. For this purpose, we use various linear regression estimators (including highly robust ones) over a set of 24 datasets with economic background and perform a metalearning study over them as well as over the same datasets after an artificial contamination. The results reveal that the whole process remains rather sensitive to data contamination and that some of the standard classifiers yield unreliable results. Nevertheless, using a robust classification method does not bring a desirable improvement. Thus, we conclude that the task of robustification of the whole metalearning methodology is more complex and deserves systematic future research.

Jan Kalina

### Function-Parametric Empirical Processes, Projections and Unitary Operators

We describe another approach to the theory of distribution free testing. The approach uses geometric similarity within various forms of empirical processes: whenever there is an empirical object (like the empirical distribution function), a theoretical parametric model (like a parametric model for the distribution function) and a normalised difference of the two, substitution of estimated values of the parameters leads to a projection of this difference. One can then bring some order to the multitude of these projections. We use unitary operators to describe classes of statistical problems, where one can “rotate” one projection into another, thus creating classes of equivalent problems. As a result, the behaviour of various test statistics can be investigated in only one “typical” problem from each class. Thus, the approach promises economy in analytic and numerical work. We also hope to show that the unitary operators involved in “rotations” are of simple and easily implementable form.

### Rank-Based Analysis of Multivariate Data in Factorial Designs and Its Implementation in R

Recently, a completely nonparametric rank-based approach for inference regarding multivariate data from factorial designs has been introduced, with theoretical results for two different asymptotic settings: the situation of few factor levels with large sample sizes at each level, and the situation of a large number of factor levels with small sample sizes in each group. In this article, we examine in detail how this theory can be translated into practical application. A challenge in this regard has been feasibly implementing consistent covariance matrix estimation in the setting of small sample sizes. The finite sampling distributions are approximated using moment estimators. In order to make the results widely available, we introduce the R package nparMD, which performs nonparametric analysis of multivariate data in a two-way layout. Multivariate data in a one-way layout have already been addressed by the npmv package. As in the latter, within the nparMD package no assumptions are made about the underlying distribution of the multivariate data. The components of the response vector do not necessarily have to be measured on the same scale, but they have to be at least binary or ordinal. Due to the factorial design, hypotheses to be tested include the main effects of both factors, as well as their interaction. The new R package is equipped with two versions of the testing procedure, corresponding to the two asymptotic situations mentioned above.

Maximilian Kiefel, Arne C. Bathke

### Tests for Independence Involving Spherical Data

We propose consistent procedures for testing the independence of circular variables based on the empirical characteristic function. The new methods are first specified for observations lying on a torus, i.e., for bivariate circular data, but it is shown that these methods can readily be extended to arbitrary dimension. The large-sample behavior of the test statistic is investigated under fixed alternatives. Finite-sample results are also presented.

Pierre Lafaye De Micheaux, Simos Meintanis, Thomas Verdebout

### Interval-Wise Testing of Functional Data Defined on Two-dimensional Domains

Functional Data Analysis is the statistical analysis of data sets composed of functions of a continuous variable on a given domain. Previous work in this area focuses on one-dimensional domains. In this work, we extend a method developed for the one-dimensional case, the interval-wise testing procedure (IWT), to the case of a two-dimensional domain. We first briefly explain the theory of the IWT for the one-dimensional case, followed by a proposed extension to the two-dimensional case. We also discuss challenges that appear in the two-dimensional case but do not exist in the one-dimensional case. Finally, we provide results of a simulation study to explore the properties of the new procedure in more detail.

Patrick B. Langthaler, Alessia Pini, Arne C. Bathke

### Assessing Data Support for the Simplifying Assumption in Bivariate Conditional Copulas

The paper considers the problem of establishing data support for the simplifying assumption (SA) in a bivariate conditional copula model. It is known that SA greatly simplifies the inference for a conditional copula model, but standard tools and methods for testing SA in a Bayesian setting tend to not provide reliable results. After splitting the observed data into training and test sets, the method proposed will use a flexible Bayesian model fit to the training data to define tests based on randomization and standard asymptotic theory. Its performance is studied using simulated data. The paper’s supplementary material also discusses theoretical justification for the method and implementations in alternative models of interest, e.g. Gaussian, Logistic and Quantile regressions.

### Semiparametric Weighting Estimations of a Zero-Inflated Poisson Regression with Missing in Covariates

We scrutinize the problem of missing covariates in the zero-inflated Poisson regression model. Under the assumption that some covariates for modeling the probability of the zero and the nonzero states are missing at random, the complete-case estimator is known to be biased and inefficient. Although the inverse probability weighting estimator is unbiased, it remains inefficient. We propose four types of semiparametric weighting estimations where the conditional probabilities and the conditional expected score functions are estimated using either generalized additive models (GAMs) or the Nadaraya kernel smoother method. In addition, we allow the conditional probabilities and the conditional expectations to be either of the same type or of different types. Moreover, a Monte Carlo experiment is used to investigate the merit of the proposed method.

M. T. Lukusa, F. K. H. Phoa

### The Discrepancy Method for Extremal Index Estimation

We consider the nonparametric estimation of the extremal index of stochastic processes. The discrepancy method that was proposed by the author as a data-driven smoothing tool for probability density function estimation is extended to find a threshold parameter u for an extremal index estimator in case of heavy-tailed distributions. To this end, the discrepancy statistics are based on the von Mises–Smirnov statistic and the k largest order statistics instead of an entire sample. The asymptotic chi-squared distribution of the discrepancy measure is derived. Its quantiles may be used as discrepancy values. An algorithm to select u for an estimator of the extremal index is proposed. The accuracy of the discrepancy method is checked by a simulation study.

Natalia Markovich

### Correction for Optimisation Bias in Structured Sparse High-Dimensional Variable Selection

In sparse high-dimensional data, the selection of a model can lead to an overestimation of the number of nonzero variables. Indeed, the use of an $$\ell _1$$ norm constraint while minimising the sum of squared residuals tempers the effects of false positives, so they are more likely to be included in the model. On the other hand, an $$\ell _0$$ regularisation is a non-convex problem and finding its solution is a combinatorial challenge which becomes infeasible for more than 50 variables. To overcome this situation, one can perform selection via an $$\ell _1$$ penalisation but estimate the selected components without shrinkage. This leads to an additional bias in the optimisation of an information criterion (IC) over the model size. Used as a stopping rule, this IC must be modified to take into account the deviation between the estimation with and without shrinkage. By looking into the difference between the prediction error and the expected Mallows’s Cp, previous work has analysed a correction for the optimisation bias, and an expression can be found for a signal-plus-noise model given some assumptions. A focus on structured models, in particular models with grouped variables, shows similar results, though the bias is noticeably reduced.

Bastien Marquis, Maarten Jansen
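
The gap between estimation with and without shrinkage can be illustrated in the simplest setting. The sketch below assumes a signal-plus-noise model with an identity design, in which the $$\ell _1$$-penalised least squares solution reduces to soft thresholding; the function names, the data and the threshold are hypothetical and are not taken from the chapter.

```python
def soft_threshold(y, lam):
    """l1-penalised estimate under an identity design: shrink each entry by lam."""
    return [(abs(v) - lam) * (1 if v > 0 else -1) if abs(v) > lam else 0.0
            for v in y]


def select_then_refit(y, lam):
    """Keep the l1-selected support, but estimate selected entries without shrinkage."""
    return [v if abs(v) > lam else 0.0 for v in y]


y = [3.0, -0.5, 1.2, -2.5]  # hypothetical observations
lam = 1.0                   # hypothetical threshold

shrunk = soft_threshold(y, lam)
refit = select_then_refit(y, lam)

# Both estimates select the same entries; they differ by lam (in absolute
# value) on every selected entry, which is the deviation a corrected
# information criterion would have to account for.
gap = [r - s for r, s in zip(refit, shrunk)]
```

Both estimators share the same support here, so the only difference lies in the size of the selected coefficients.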

### United Statistical Algorithms and Data Science: An Introduction to the Principles

Developing algorithmic solutions to tackle the rapidly increasing variety of data types, by now, is recognized as an outstanding open problem of modern statistics and data science. But why does this issue remain difficult to solve programmatically? Is it merely a passing trend, or does it have the potential to radically change the way we build learning algorithms? Discussing these questions without falling victim to the big data hype is not an easy task. Nonetheless, an attempt will be made to better understand the core statistical issues, in a manner to which every data scientist can relate.

### The Halfspace Depth Characterization Problem

The halfspace depth characterization conjecture states that for any two distinct (probability) measures P and Q in the d-dimensional Euclidean space, there exists a point at which the halfspace depths of P and Q differ. Until recently, it was widely believed that this conjecture holds true for all integers $$d \ge 1$$ . In several research papers dealing with this problem, partial positive results towards the complete characterization of measures by their depths can be found. We provide a comprehensive review of this literature, point out certain difficulties with some of these earlier results and construct examples of distinct (probability or finite) measures whose halfspace depths coincide at all points of the sample space, for all integers $$d > 1$$ .

Stanislav Nagy
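
For readers unfamiliar with the notion, the halfspace (Tukey) depth of a point x with respect to a measure P is the infimum of P(H) over all closed halfspaces H containing x. The minimal sketch below evaluates this for an empirical measure on the real line, where halfspaces are half-lines; the function name and the sample are illustrative assumptions, and the sketch only covers the case d = 1, not the counterexamples for $$d > 1$$ discussed in the chapter.

```python
def halfspace_depth_1d(x, sample):
    """Empirical halfspace depth on the real line.

    In one dimension the closed halfspaces containing x are the half-lines
    (-inf, x] and [x, inf), so the depth is the smaller of their two
    empirical masses.
    """
    n = len(sample)
    left_mass = sum(1 for s in sample if s <= x) / n
    right_mass = sum(1 for s in sample if s >= x) / n
    return min(left_mass, right_mass)


sample = [1, 2, 3, 4, 5]  # hypothetical data
depth_at_median = halfspace_depth_1d(3, sample)
depth_at_edge = halfspace_depth_1d(1, sample)
depth_outside = halfspace_depth_1d(0, sample)
```

The depth is largest at the median and drops to zero outside the range of the data, which is the center-outward ordering that depth functions are designed to provide.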

### A Component Multiplicative Error Model for Realized Volatility Measures

We propose a component Multiplicative Error Model (MEM) for modelling and forecasting realized volatility measures. In contrast to conventional MEMs, the proposed specification uses a multiplicative component structure in order to parsimoniously parameterize the complex dependence structure of realized volatility measures. The long-run component is defined as a linear combination of MIDAS filters moving at different frequencies, while the short-run component is constrained to follow a unit mean GARCH recursion. This particular specification of the long-run component makes it possible to reproduce very persistent oscillations of the conditional mean of the volatility process, in the spirit of Corsi’s Heterogeneous Autoregressive Model (HAR). The empirical performance of the proposed model is assessed by means of an application to the realized volatility of the S&P 500 index.

Antonio Naimoli, Giuseppe Storti

### Asymptotically Distribution-Free Goodness-of-Fit Tests for Testing Independence in Contingency Tables of Large Dimensions

We discuss the possibility of using asymptotically distribution-free goodness-of-fit tests for testing independence of two discrete or categorical random variables in contingency tables. The tables considered are of particularly large dimension, for which the conventional chi-square test becomes less reliable when the table is relatively sparse. The main idea of the method is to apply the new Khmaladze transformation to transform the vector of the chi-square statistic components into another vector whose limit distribution is free of the parameters. The transformation is one-to-one, and hence any statistic based on the transformed vector can serve as an asymptotically distribution-free test statistic for the problem of interest; we recommend the analogue of the Kolmogorov-Smirnov test. Simulations are used to show that the new test not only converges relatively quickly but is also more powerful than the chi-square test in certain cases.

Thuong T. M. Nguyen

### Incorporating Model Uncertainty in the Construction of Bootstrap Prediction Intervals for Functional Time Series

A sieve bootstrap method that incorporates model uncertainty for constructing pointwise or simultaneous prediction intervals of stationary functional time series is proposed. The bootstrap method exploits a general backward vector autoregressive representation of the time series of Fourier coefficients appearing in the well-established Karhunen-Loève expansion of the functional process. The bootstrap method generates, by running backward in time, functional bootstrap samples which adequately mimic the dependence structure of the underlying process and which all have the same conditionally fixed curves at the end of every functional bootstrap sample. The bootstrap prediction error distribution is then calculated as the difference between the model-free bootstrap generated future functional pseudo-observations and the functional forecasts obtained from a model used for prediction. In this way, the estimated prediction error distribution takes into account not only the innovation and estimation error associated with prediction, but also the possible error due to model uncertainty or misspecification. Through a simulation study, we demonstrate an excellent finite-sample performance of the proposed sieve bootstrap method.

Efstathios Paparoditis, Han Lin Shang

### Measuring and Estimating Overlap of Distributions: A Comparison of Approaches from Various Disciplines

In this work, we compare three approaches to measuring the overlap of datasets. Different research areas lead to differing definitions and interpretations of overlap. We discuss the differences, advantages and disadvantages of three methods, each of which was introduced in a different research field. Coming from a medical, a cryptographic and a statistical background, all three methods show interesting aspects of overlap. Even though they are defined quite differently, all three show reasonably interpretable results in simulations and a data example.

Judith H. Parkinson, Arne C. Bathke

### Bootstrap Confidence Intervals for Sequences of Missing Values in Multivariate Time Series

This paper aims to derive bootstrap confidence intervals tailored to missing sequences of observations in multivariate time series. The procedure is based on a spatial-dynamic model and imputes the missing values using a linear combination of the contemporaneous neighboring observations and their lagged values. The resampling procedure implements a residual bootstrap approach, which is then used to approximate the sampling distribution of the estimators of the missing values. Both the normal-based and the percentile bootstrap confidence intervals are computed. A Monte Carlo simulation study shows the good empirical coverage performance of the proposal, even in the case of long sequences of missing values.

Maria Lucia Parrella, Giuseppina Albano, Michele La Rocca, Cira Perna
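
The residual bootstrap idea can be sketched in a heavily simplified form. The code below is not the chapter's spatial-dynamic model: it assumes a single neighboring series, no lagged terms and a simple least squares fit, and every name, data value and tuning constant in it is hypothetical. It only shows how resampled residuals yield a percentile interval for an imputed value.

```python
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((u - mean_x) ** 2 for u in x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def bootstrap_imputation_interval(neighbor, target, neighbor_at_gap,
                                  n_boot=500, level=0.95, seed=0):
    """Percentile interval for the imputed value, via a residual bootstrap."""
    rng = random.Random(seed)
    a, b = fit_line(neighbor, target)
    fitted = [a + b * u for u in neighbor]
    residuals = [v - f for v, f in zip(target, fitted)]
    draws = []
    for _ in range(n_boot):
        # Rebuild a pseudo-series from the fitted values and resampled residuals,
        # refit the model, and recompute the imputed value.
        pseudo = [f + rng.choice(residuals) for f in fitted]
        a_star, b_star = fit_line(neighbor, pseudo)
        draws.append(a_star + b_star * neighbor_at_gap)
    draws.sort()
    lower = draws[int((1 - level) / 2 * n_boot)]
    upper = draws[int((1 + level) / 2 * n_boot) - 1]
    return a + b * neighbor_at_gap, lower, upper


# Hypothetical data: the target series depends linearly on one neighbor.
data_rng = random.Random(1)
neighbor = [0.1 * i for i in range(40)]
target = [1.0 + 2.0 * u + data_rng.gauss(0, 0.3) for u in neighbor]
point, lower, upper = bootstrap_imputation_interval(neighbor, target,
                                                    neighbor_at_gap=2.0)
```

A normal-based interval would instead use the standard deviation of the bootstrap draws around the point estimate.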

### On Parametric Estimation of Distribution Tails

The aim of this work is to propose a method for estimating the parameter of the continuous distribution tail based on the largest order statistics of a sample. We prove the consistency and asymptotic normality of the proposed estimator. Note especially that we do not assume the fulfillment of the conditions of the extreme value theorem.

Igor Rodionov

### An Empirical Comparison of Global and Local Functional Depths

A functional data depth provides a center-outward ordering criterion that allows the definition of measures such as median, trimmed means, central regions, or ranks in a functional framework. A functional data depth can be global or local. With global depths, the degree of centrality of a curve x depends equally on the rest of the sample observations, while with local depths the contribution of each observation in defining the degree of centrality of x decreases as the distance from x increases. We empirically compare the global and the local approaches to the functional depth problem focusing on three global and two local functional depths. First, we consider two real data sets and show that global and local depths may provide different data insights. Second, we use simulated data to show when we should expect differences between a global and a local approach to the functional depth problem.

Carlo Sguera, Rosa E. Lillo

### AutoSpec: Detecting Exiguous Frequency Changes in Time Series

Most established techniques that search for structural breaks in time series may not be able to identify slight changes in the process, especially when looking for frequency changes. The problem is that many of the techniques assume very smooth local spectra and tend to produce overly smooth estimates. The problem of over-smoothing tends to produce spectral estimates that miss slight frequency changes because frequencies that are close together will be lumped into one frequency. The goal of this work is to develop techniques that concentrate on detecting slight frequency changes by requiring a high degree of resolution in the frequency domain.

David S. Stoffer

### Bayesian Quantile Regression in Differential Equation Models

In many situations, nonlinear regression models are specified implicitly by a set of ordinary differential equations. Often, mean regression may not adequately represent the relationship between the predictors and the response variable. Quantile regression can give a more complete picture of the relationship, can avoid distributional assumptions and can naturally handle heteroscedasticity. However, quantile regression driven by differential equations has not been addressed in the literature. In this article, we consider the problem and adopt a Bayesian approach. To construct a likelihood without distributional assumptions, we consider all quantile levels simultaneously. Because of the lack of an explicit form of the regression function and the indeterminate nature of the conditional distribution, evaluating the likelihood and sampling from the posterior distribution are very challenging. We avoid the computational bottleneck by adopting a “projection posterior” method. In this approach, the implicit parametric family of regression functions of interest is embedded in the space of smooth functions, where it is modeled nonparametrically using a B-spline basis expansion. The posterior is computed in the larger space based on a prior without constraint, and a “projection” onto the parametric family using a suitable distance induces a posterior distribution on the parameter. We illustrate the method using both simulated and real datasets.

Qianwen Tan, Subhashis Ghosal

### Predicting Plant Threat Based on Herbarium Data: Application to French Data

Evaluating formal threat criteria for every organism on earth is a tremendously resource-consuming task which will take many more years to accomplish at the current rate. We propose here a method allowing for a faster and reproducible threat prediction for the 360,000+ known species of plants. Threat probabilities are estimated for each known plant species through the analysis of the data from the complete digitization of the largest herbarium in the world using machine learning algorithms, allowing for a major breakthrough in biodiversity conservation assessments worldwide. First, the full scientific names from the Paris herbarium database were matched against all the names from the international plant list using an open-source text mining search engine called Terrier. A series of statistics related to the accepted names of each plant were computed and served as predictors in a statistical learning algorithm with binary output. The training data was built from the International Union for Conservation of Nature (IUCN) global Red List assessments of plants. For each accepted name, the probability of being of least concern (LC, not threatened) was estimated with a confidence interval and a global misclassification rate of 20%. Results are presented on the world map and according to different plant traits.

Jessica Tressou, Thomas Haevermans, Liliane Bel

### Monte Carlo Permutation Tests for Assessing Spatial Dependence at Different Scales

Spatially dependent residuals arise as a result of missing or misspecified spatial variables in a model. Such dependence is observed in different areas, including environmental, epidemiological, social and economic studies. It is crucial to take the dependence into account in modelling to avoid spurious associations between variables of interest or wrong inferential conclusions due to underestimated uncertainties. Insight into the scales at which spatial dependence exists can help to comprehend the underlying physical process and to select suitable spatial interpolation methods. In this paper, we propose two Monte Carlo permutation tests to (1) assess the existence of overall spatial dependence and (2) assess spatial dependence at small scales, respectively. A p-value combination method is used to improve the statistical power of the tests. We conduct a simulation study to reveal the advantages of our proposed methods in terms of type I error rate and statistical power. The tests are implemented in an open-source R package variosig.

Craig Wang, Reinhard Furrer
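
The general mechanism of a Monte Carlo permutation test for spatial dependence can be sketched as follows. This is an illustration under stated assumptions, not the interface of the variosig package or the exact tests of the chapter: locations are one-dimensional, the test statistic is an empirical semivariance over pairs at most `max_lag` apart, and all names and data are hypothetical.

```python
import random


def small_lag_semivariance(coords, values, max_lag):
    """Half the mean squared difference over pairs at most max_lag apart."""
    total, count = 0.0, 0
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            if abs(coords[i] - coords[j]) <= max_lag:
                total += (values[i] - values[j]) ** 2
                count += 1
    return 0.5 * total / count


def permutation_p_value(coords, values, max_lag, n_perm=199, seed=0):
    """Monte Carlo p-value: permuting values over locations breaks any dependence."""
    rng = random.Random(seed)
    observed = small_lag_semivariance(coords, values, max_lag)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Positive dependence makes nearby values similar, so a small
        # semivariance is evidence against the null of no dependence.
        if small_lag_semivariance(coords, shuffled, max_lag) <= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


coords = list(range(30))            # hypothetical locations on a transect
values = [0.5 * c for c in coords]  # hypothetical, strongly dependent residuals
p_value = permutation_p_value(coords, values, max_lag=1)
```

Counting the observed statistic among the permutations, through the added one in numerator and denominator, keeps the Monte Carlo p-value strictly positive.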

### Introduction to Independent Counterfactuals

The aim of this contribution is to introduce the idea of independent counterfactuals. The technique allows one to construct a counterfactual random variable which is independent of a set of given covariates but follows the same distribution as the original outcome. The framework is fully nonparametric, and under an error exogeneity condition the counterfactuals have a causal interpretation. Using the example of a stylized linear process, I demonstrate the main mechanisms behind the method. The finite-sample properties are further tested in a simulation experiment.

Marcin Wolski

### The Potential for Nonparametric Joint Latent Class Modeling of Longitudinal and Time-to-Event Data

Joint latent class modeling (JLCM) of longitudinal and time-to-event data is a parametric approach of particular interest in clinical studies. JLCM has the flexibility to uncover complex data-dependent latent classes, but it suffers from high computational cost, and it does not use time-varying covariates in modeling time-to-event and latent class membership. In this work, we explore in more detail both the strengths and weaknesses of JLCM. We then discuss the sort of nonparametric joint modeling approach that could address some of JLCM’s weaknesses. In particular, a tree-based approach is fast to fit and can use any type of covariates in modeling both the time-to-event and the latent class membership, thus serving as an alternative to JLCM with great potential.

Ningshan Zhang, Jeffrey S. Simonoff

### To Rank or to Permute When Comparing an Ordinal Outcome Between Two Groups While Adjusting for a Covariate?

The classical parametric analysis of covariance (ANCOVA) is frequently used when comparing an ordinal outcome variable between two groups, while adjusting for a continuous covariate. However, the normality assumption might be crucial and assuming an underlying additive model might be questionable. Therefore, in the present manuscript, we consider the outcome as truly ordinal and dichotomize the covariate by a median split, in order to transform the testing problem to a nonparametric factorial setting. We propose using either a permutation-based Anderson–Darling type approach in conjunction with the nonparametric combination method or the pseudo-rank version of a nonparametric ANOVA-type test. The results of our extensive simulation study show that both methods maintain the type I error level well, but that the ANOVA-type approach is superior in terms of power for location-shift alternatives. We also discuss some further aspects which should be taken into account when deciding between the two methods. The application of both approaches is illustrated by the analysis of real-life data from a randomized clinical trial with stroke patients.

Georg Zimmermann