2025 | Book

New Frontiers in Statistics and Data Science

SPE2023, Guimarães, Portugal, October 11-14

Editors: Lígia Henriques-Rodrigues, Raquel Menezes, Luís Meira Machado, Susana Faria, Miguel de Carvalho

Publisher: Springer Nature Switzerland

Book Series: Springer Proceedings in Mathematics & Statistics

About this book

This volume showcases a collection of thirty-two peer-reviewed articles presented at the XXVI Congress of the Portuguese Statistical Society (2023). It covers a wide range of cutting-edge topics in both theoretical and applied statistics. Each contribution highlights the latest advancements and research in the field, offering valuable insights and innovative methodologies for researchers and practitioners alike. Readers with a background in mathematics and statistics will find it particularly beneficial, while researchers from various scientific disciplines can explore numerous robust applications.

Table of Contents

Frontmatter
A Note on a Parzen–Rosenblatt Type Density Estimator for Circular Data

Using the close connection between the Parzen–Rosenblatt estimator for linear data and the recently proposed Parzen–Rosenblatt type estimator for circular data, we establish some asymptotic properties of this last estimator, such as asymptotic unbiasedness, weak and strong pointwise consistency, and weak and strong uniform consistency.

Carlos Tenreiro
Population Growth and Geometrically Thinned Extreme Value Theory

Starting from the simple Beta(2,2) model, connected to the Verhulst logistic parabola, several extensions are discussed, and connections to extremal models are revealed. Aside from the classical general extreme value model, extreme value models in randomly stopped extremes schemes are also discussed. Logistic and Gompertz growth equations are the usual choice to model sustainable growth. Therefore, observing that the logistic distribution is (geo)max-stable and the Gompertz function is proportional to the Gumbel max-stable distribution, other growth models, related to classical and to geometrically thinned extreme value theory are investigated.

M. Fátima Brilhante, M. Ivette Gomes, Sandra Mendonça, Dinis Pestana, Pedro Pestana
An Additive Shared Frailty Model for Recurrent Gap Time Data in the Presence of Zero-Recurrence Subjects

A new shared frailty model for recurrent gap time data is introduced, assuming that frailty acts additively on a Weibull rate function derived from a non-homogeneous Poisson process. The frailty is included with two purposes: to handle within-subject correlation and to accommodate zero-recurrence subjects. With this intention, we assume that the frailty has a non-central chi-squared distribution with zero degrees of freedom. The proposed model has two special cases without frailty, namely the Weibull rate model and the classical homogeneous Poisson process. Furthermore, since the model is fully specified, the maximum likelihood method is applied for parameter estimation based on a marginal likelihood function. Particular attention is devoted to the likelihood construction for bivariate recurrent event data. An application to a well-known data set is provided to elucidate the practical contribution of the new survival model.

Ivo Sousa-Ferreira, Ana Maria Abreu, Cristina Rocha
Clustering and Risk Analysis for Evaluating the Water Quality of a Hydrological Basin

Water is a limited, irreplaceable and indispensable natural resource. Statistical methods are important tools for controlling and forecasting changes in the management process of water quality. The main objective of this study is to classify the water quality monitoring sites into homogeneous groups and to measure the risk for water pollution influenced by the monthly dissolved oxygen concentration, which has been selected and considered relevant to characterize the water quality. The methodologies are illustrated using a data set of the Douro River basin (in Portugal), measured monthly in 18 water quality sampling stations and recorded in the period from January 2002 to December 2013. A clustering analysis is performed to group similar sampling stations considering the environmental variable, and by taking into account the seasonal variations. Several risk measures, such as loss probability, entropy or value at risk, are determined for the considered variable in order to assess the risk of water pollution taking into account the monthly nature of the data. The risk measures are used to classify the months in order to analyse and to confirm the distinction of the dissolved oxygen concentration according to a wet and a dry period.

Ana Pedra, A. Manuela Gonçalves, Irene Brito
Green Exchange-Traded Fund Performance Evaluation Using the EU–EV Risk Model

This work evaluates the performance of green exchange-traded funds (ETFs) using the expected utility, entropy and variance (EU–EV) risk model. Data from 14 green ETFs analysed in earlier literature in the in-sample period from January 2008 to December 2010 are used. The green ETFs are ranked according to their risk, considering the returns’ expected utility, entropy and variance, and the best-ranked ETFs are selected to construct equally weighted portfolios. Then, the performance of the green ETF portfolios is evaluated and compared with that of the S&P 500 Index. Cumulative returns in in-sample and out-of-sample periods and performance metrics, such as Maximum drawdown, Sharpe ratio, Sortino ratio, Beta and Alpha, are analysed. The results show that, in general, the equally weighted portfolios formed with half the number of best-ranked ETFs outperform the benchmark index in the in-sample period and for specific time ranges in the out-of-sample periods.

Irene Brito, José Manuel Azevedo, Ana Isabel Azevedo
Risk Assessment of Vulnerabilities Exploitation

Using the Kolmogorov–Smirnov, Cramér–von Mises and Anderson–Darling tests, and the less commonly applied Vuong’s test, it is shown that a two-component hyperlog-logistic distribution, i.e., a mixture of two geo-max-stable log-logistic distributions, provides a good fit for the time from disclosure to update of vulnerabilities sampled from the CVEdetails.com database. It is also shown that the hyperlog-logistic distribution provides a better fit than a heavy-tailed distribution of maxima, a log-logistic distribution, or even a heavy-tailed two-component hyperexponential distribution. Moreover, ways of incorporating uncertainty and of modeling the vulnerability lifecycle into the Common Vulnerability Scoring System (CVSS), the most widely used score to assess the severity of vulnerabilities, are discussed, in order to obtain an improved CVSS calculator and the evolution of a score over time.

M. Fátima Brilhante, Pedro Pestana, M. Luísa Rocha, Fernando Sequeira
Bayesian Modelling of Time Series of Counts with Missing Data

The presence of missing data poses a common challenge for time series analysis in general, since the most usual requirement is that the data are equally spaced in time and therefore imputation methods are required. For time series of counts, the usual imputation methods, which typically produce real-valued observations, are not adequate. This work employs Bayesian principles for handling missing data within time series of counts, based on first-order integer-valued autoregressive (INAR) models, namely Approximate Bayesian Computation (ABC) and Gibbs sampler with Data Augmentation (GDA) algorithms. The methodologies are illustrated with synthetic and real data and the results indicate that the estimates are consistent and present less bias when the percentage of missing observations decreases, as expected.

Isabel Silva, Maria Eduarda Silva, Isabel Pereira
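
To make the model class in this abstract concrete, the following is a minimal sketch of simulating a first-order INAR process with binomial thinning. It is not the chapter's ABC or GDA algorithm. The Poisson innovations, the function names and the parameters `alpha`, `lam` and `x0` are assumptions made for illustration only.

```python
import math
import random


def poisson_draw(lam, rng):
    """Draw from Poisson(lam) by Knuth's multiplication method (suitable for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def binomial_thinning(alpha, count, rng):
    """Each of the `count` units survives independently with probability alpha."""
    return sum(1 for _ in range(count) if rng.random() < alpha)


def simulate_inar1(n, alpha, lam, x0=0, seed=None):
    """Simulate X_t = alpha o X_{t-1} + e_t, with e_t ~ Poisson(lam) (assumed innovations)."""
    rng = random.Random(seed)
    series, prev = [], x0
    for _ in range(n):
        prev = binomial_thinning(alpha, prev, rng) + poisson_draw(lam, rng)
        series.append(prev)
    return series


# Example: 10 counts from an INAR(1) process with thinning probability 0.5.
print(simulate_inar1(10, alpha=0.5, lam=2.0, seed=1))
```

Every simulated value is a non-negative integer by construction, which is why real-valued imputation does not suit this kind of series.
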
Sexual Classification Based on Orthopantomographs

Craniomandibular bone structures, as they are more resistant to taphonomy processes, are relevant in the sexual diagnosis of adult skeletons. This step is essential in the reconstruction of an unidentified corpse. Within this context, this study evaluates the performance of sexual classification methodologies based on 206 orthopantomographs. Hence, convolutional neural networks (CNN) were applied directly to the orthopantomographs, and several classification methodologies were applied to linear measurements taken on the orthopantomographs, such as logistic regression, discriminant analysis, k-nearest neighbours, naïve Bayes, support vector machines, decision trees, and random forests. The performance of each method was evaluated based on accuracy, sensitivity, specificity, predictive values, and the area under the ROC curve. The pre-trained VGG16 CNN achieved better results, revealing that it can be reliably applied in sexual classification in a Portuguese adult population within the scope of forensic science. Nonetheless, a final sexual classification model to be applied to the Portuguese population must be established in a larger sample.

João Alves, Cristiana Palmela Pereira, Rui Santos
Survapp: A Shiny Application for Survival Data Analysis

There is a substantial demand for user-friendly graphical interfaces that empower professionals with limited programming knowledge to perform statistical analysis. Although R software is widely used for statistical analysis, it lacks an adequately intuitive graphical interface for individuals without statistical and programming skills. This paper aims to address this gap by introducing an application called Survapp, enabling users, regardless of their computational knowledge, to conduct survival analysis. The development leveraged R software, RStudio, and the Shiny package to create an interactive web app. Survapp incorporates diverse methodologies for analyzing survival data, including Kaplan-Meier, log-rank tests, Cox regression models, parametric accelerated failure time models, decision trees, random forests, and competing risks analysis (a specific case of multi-state models). Survapp enables users to analyze survival data, offering example databases for various methodologies within the application. However, the primary objective is to allow users to import their own data and conduct their respective analyses in a user-friendly environment. A distinguishing aspect of Survapp is its interface, bridging the gap between complex statistical methods and users with limited statistical and programming expertise. Overall, Survapp proves to be a highly valuable tool for survival data analysis, catering to users' needs and providing a user-friendly interface with a wide range of survival analysis methods. The Shiny app is available at the Shiny Apps repository: https://emanuel-vieira.shinyapps.io/survapp

Emanuel Vieira Monteiro da Silva, Luís Filipe Meira Machado, Gustavo Domingos da Costa Coelho Soutinho
An Application of Multivariate Random Fields and Systems of Stochastic Partial Differential Equations to Wind Velocity Data

The wind is a meteorological phenomenon, resulting from air movements due to differences in pressure, or differences between earth and air temperatures. The wind flow shows a wide range of different behaviours, has great relevance in weather conditions, deeply influences the landscape, and even plays a role in the spread of infectious diseases, so it is of great importance to study and model its behaviour. The wind velocity is a vector field, so we consider in this work a multivariate spatial model that can be addressed through systems of Stochastic Partial Differential Equations (SPDEs). The main goal is to estimate the wind velocity, considering a system of SPDEs, and applying Bayesian inference, based on integrated nested Laplace approximation (INLA) methods, which are theoretically explored here for the particular multivariate case. The results are encouraging, and open new lines of investigation, such as applying statistical methods to study the velocity field of certain fluid flows, instead of solving strongly non-linear partial differential equations.

Sílvia Guerra, Fernanda Cipriano, Isabel Natário
A Direct Approach in Extremal Index Estimation

The limit distribution of the normalized maxima of stationary sequences exists under specific conditions, even in the presence of some dependence structures. Dealing with sequences of maxima, the degree of dependence between observations can be studied in the limit distribution, when it exists, through a parameter of the Extreme Value distribution, named the extremal index, EI. The EI is theoretically known for some particular models and might be interpreted in different contexts, namely, as the limit of the reciprocal of the mean cluster size of exceedances, or related to the multiplicity of a compound Poisson point process. Generally, EI estimation methods are focused on the limit mean size of clusters. In this study we investigate the direct estimation of the parameter itself as a proportion. The procedure takes into account the distribution of the inter-exceedance times and considers the proportion of strictly positive inter-exceedance times as an EI estimator. The results of a simulation study show that the method is more robust to different cluster dependence structures than the usual alternatives.

Manuela Souto de Miranda, M. Cristina Miranda, M. Ivette Gomes
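
The proportion idea described in this abstract can be sketched as follows. This is one possible reading, not the authors' estimator. It assumes that two exceedances at consecutive time steps count as a zero inter-exceedance time, so the estimate is the share of gaps longer than one step. The function name and the fixed threshold `u` are hypothetical.

```python
def extremal_index_proportion(series, u):
    """Share of inter-exceedance gaps that are strictly positive.

    Assumption: a gap of exactly one time step between two exceedances of u
    is treated as a zero inter-exceedance time (same cluster).
    """
    hits = [i for i, value in enumerate(series) if value > u]
    if len(hits) < 2:
        raise ValueError("need at least two exceedances of the threshold")
    gaps = [later - earlier for earlier, later in zip(hits, hits[1:])]
    return sum(1 for gap in gaps if gap - 1 > 0) / len(gaps)


# Toy series: exceedances of u = 1 occur at positions 1, 2, 5, 7, 8, 9.
toy = [0, 5, 5, 0, 0, 5, 0, 5, 5, 5]
print(extremal_index_proportion(toy, u=1))  # 2 of the 5 gaps exceed one step
```

A value near 1 suggests exceedances that occur in isolation, while a value near 0 suggests strong clustering.
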
When PACE-Gate Meets Sample Size Calculations

PACE-Gate is the name by which the case surrounding the high-profile PACE clinical trial became publicly known. This case ended up in controversy due to a change in the pre-trial primary endpoint and a re-definition of the treatments’ efficacy. The present paper concerns a different angle of the case: the statistical argumentation of the reported sample size calculations. The analysis of the pre-trial research protocol and the post-trial statistical analysis plan revealed inconsistencies between the theoretical assumptions underpinning the sample size calculations and the prior beliefs of the trial’s research team. The reported sample sizes also seemed inaccurate for not accounting for multiple pairwise comparisons contemplated in the trial’s objectives. In conclusion, the statistical argumentation of the trial sample size is suboptimal. The question is whether PACE is an exception or the norm with respect to incongruous sample size determination.

Nuno Sepúlveda
An Approach for Predicting Spatially Indexed Carcass Persistence Probability to Estimate Bird Mortality at Power Lines

One of the objectives of the environmental monitoring programs of transmission power lines is to quantify bird mortality. To account for carcass removal, these programs typically include field experiments that provide data on the persistence time of the carcass in the field until removal. In this study, we aim to estimate the removal bias correction factor, considering the carcass size, the season, and the location of power line projects, eliminating the need for field trials in every new project. To achieve this goal, we used the Integrated Nested Laplace Approximation (INLA) method combined with the Stochastic Partial Differential Equations (SPDE) approach to model the probability of persistence considering both fixed (carcass size and season) and random (geographic location and project) effects. The results made it possible to analyze the variation in space of bird carcass persistence and to create a tool for common users to estimate the removal correction factor for a specific location as a function of the covariates considered, in mainland Portugal. However, further improvement is required as model predictions are still unreliable in large portions of the national territory. We discuss the model limitations and offer directions for future work.

Ema Biscaia, Joana Bernardino, Regina Bispo
Extremal Behavior of Some Bivariate Integer Models

We study the extremal behaviour of some integer-valued bivariate time series. Assuming that the distributional behaviour of the innovations is the one introduced in Hüsler et al. (Methodol Comput Appl Probab 24:2373–2402, 2022), we establish asymptotics for the distribution of the bivariate normalized maxima of two max-BINAR models and of a BINMM model. Since the marginal distribution functions of these processes belong to Anderson’s class, we consider two different approaches. First, maintaining Anderson’s setup, we establish limiting lower and upper bounds for the normalized double maximum. In a second step, considering the double maximum of the first $k_n$ observations, where $\{k_n\}$ is a nondecreasing sequence of positive integers with an asymptotic geometric pattern, we obtain a well-defined limit in distribution for the double maximum, which is a bivariate max-semistable distribution function (Pancheva, Theory Probab Appl, 679–705, 1992). In both cases the asymptotic independence of maxima is established.

Sandra Dias, Maria da Graça Temido
Solar Radiation Forecasting: A Study Case in the Colombian Caribbean Region

This paper presents the forecast of monthly and daily solar radiation in the Colombian Caribbean region using time series analysis. Three models are implemented: the Seasonal Auto-Regressive Integrated Moving Average (SARIMA) for the monthly forecast, and two machine learning algorithms, support vector machine (SVM) and regression tree (RT), for the daily forecast. Performance in forecasting solar radiation is compared, with and without climatic variables. Data, including historical solar radiation and climatic variables, were collected by the Institute of Hydrology, Meteorology, and Environmental Studies in Colombia (IDEAM). Results indicate that, while the SARIMA model provides acceptable forecasts, the machine learning models perform better, which can improve decision-making in local energy planning.

Gloria Carrascal, Jhonathan Barrios, Flora Ferreira, Jairo Plaza
Sources of Bias When Assessing Seasonal Influenza Vaccine Performance: A Narrative Review

The best way to prevent influenza infection is through vaccination. Evaluating vaccine performance is essentially done through two types of study: clinical trials and the test-negative design, an observational study derived from the case-control design. While in clinical trials the sources of bias are well identified and there are specific tools for assessing them, in the test-negative design we find very varied sources of bias and a lot of scattered information, with no validated tool for assessing the risk of bias. The aim of this narrative review is to identify the most important sources of bias in both types of study and to contribute to the development of a risk assessment tool for test-negative design studies, given their major importance in evaluating the performance of the seasonal influenza vaccine.

André Miguel Martins, Luis Félix Valero Juan, Marlene Santos, João P. Martins
Peaks Over Random Thresholds (PORT) Estimation of the Weibull Tail Coefficient

The Weibull tail-coefficient (WTC) is the index of regular variation in a regularly varying cumulative hazard function. Due to the specificity of the WTC, and its deep and explicit link to a positive extreme value index (EVI), any estimator of a positive EVI, like all generalizations of the classical Hill estimator, can be used for the estimation of the WTC. These estimators are scale invariant but not location invariant, contrary to the location/scale invariance of the EVI and of the WTC parameters. With PORT standing for peaks over random thresholds, new classes of consistent PORT WTC-estimators, dependent on an extra tuning parameter s, $0 \leq s < 1$, are introduced. These WTC-estimators are highly flexible and are further studied for finite samples, through a Monte-Carlo simulation study. Possible choices of the tuning parameters in play are put forward, and some concluding remarks are provided.

M. Ivette Gomes, Frederico Caeiro, Lígia Henriques-Rodrigues
Exploring the Mutual Information Rate Decomposition in Situations of Pathological Stress

A Plateau wave (PW) represents a distinctive pattern of Intracranial Pressure (ICP) change observed in patients with severe traumatic brain injuries (TBI), marked by a sudden sustained increase in ICP. These pathological stress events are frequently linked with significant alterations in Heart Rate Variability (HRV), indicative of Autonomic Nervous System (ANS) dysfunction. This study aims to investigate the coupling between ICP and HRV by employing the Mutual Information Rate (MIR). MIR serves as an extension of Mutual Information (MI), enabling the analysis of the dynamic exchange of information across various time intervals. Furthermore, the MIR between two random processes can be decomposed into distinct entropy rate components associated with the concept of complexity, as well as conditional mutual information terms related to information transfer. The MIR and its constituent information terms are estimated through a model-free approach based on the nearest neighbors search (KNN). This framework is first validated on simulations of linear and non-linear bivariate systems, then it is applied to data consisting of RR intervals and ICP amplitude (AMP) time series measured in TBI patients with PW occurrence. The obtained results evidence that MIR decompositions are able to highlight the interdependence of HRV and ICP in PW episodes and the association of these critical phenomena with autonomic stress.

Helder Pinto, Celeste Dias, Chiara Barà, Yuri Antonacci, Luca Faes, Ana Paula Rocha
A Simulation Comparison of Spatial Models for Preferential Sampling

In some situations, the sampling locations are intentionally oversampled due to the higher/lower expected values, providing more information about specific features or characteristics of interest. This sampling strategy, named preferential sampling, is particularly relevant in ecological and environmental studies where researchers may focus on areas with expected high biodiversity, specific habitat characteristics, or other relevant factors. However, this sampling strategy poses challenges that can lead to inaccurate inferences. Therefore, this study examines preferential sampling by comparing two prominent models proposed by Diggle et al. (J. R. Stat. Soc. Series C Appl. Stat. 59(2):191–232, 2010. https://doi.org/10.1111/j.1467-9876.2009.00701.x) and Pati et al. (Biometrika 98(1):35–48, 2011. https://doi.org/10.1093/biomet/asq067), hereinafter referred to as the Diggle model and the Pati model, respectively. The comparative analysis unfolds in two steps: assessing parameter inference efficacy through empirical simulations and establishing theoretical and empirical connections between the models. In the inference of the preferentiality degree, the Pati model excelled under strong preferential sampling, while the Diggle model performed better in scenarios with moderate preferential sampling. An inverse relation between preferentiality degrees emerged, with the Pati model tending to overestimate and the Diggle model to underestimate. In estimating marginal variance, the Pati model outperformed the Diggle model under strong preferential sampling, while the Diggle model excelled at moderate preferential sampling. Both models exhibited comparable performance under negative preferentiality. The analysis revealed improved model performance under positive preferentiality, aligning theoretical predictions with empirical observations. The study highlighted the interconnectedness of model parameters, emphasizing precision in estimation.

Daniela Silva, Raquel Menezes
A Partially Reduced Bias Hill Estimator of the Extreme Value Index

The estimation of the extreme value index (EVI) plays a crucial role in modelling and predicting extreme events, such as floods, earthquakes, heatwaves or a financial crisis. The Hill estimator, defined as the average of the log-excesses of a high threshold, is a popular choice for estimating the EVI, primarily due to its simplicity. However, the Hill estimator is known to suffer from bias, particularly when the estimation is based on a large fraction of the sample size. In this paper, we propose a partially bias corrected Hill estimator that addresses this issue and provides more accurate estimates. The performance of the new estimator is illustrated with simulated and real data.

Frederico Caeiro, M. Ivette Gomes, Lígia Henriques-Rodrigues
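
The classical Hill estimator mentioned in this abstract, the average of the log-excesses over a high threshold, can be sketched as below. This is the standard uncorrected estimator, not the partially bias-corrected version proposed in the chapter. The function name, the choice of the (k+1)-th largest observation as threshold, and the simulated Pareto sample are assumptions for illustration.

```python
import math
import random


def hill_estimator(sample, k):
    """Classical Hill estimate of a positive EVI from the k largest observations."""
    if not 1 <= k < len(sample):
        raise ValueError("k must satisfy 1 <= k < len(sample)")
    ordered = sorted(sample, reverse=True)
    threshold = ordered[k]  # (k+1)-th largest value, used as the random threshold
    if threshold <= 0:
        raise ValueError("the threshold must be positive to take logarithms")
    # Average of the log-excesses of the k top order statistics over the threshold.
    return sum(math.log(x / threshold) for x in ordered[:k]) / k


# Illustration on a simulated strict Pareto sample with true EVI 0.5.
rng = random.Random(42)
pareto_sample = [(1.0 - rng.random()) ** (-0.5) for _ in range(5000)]
print(hill_estimator(pareto_sample, k=500))
```

For a strict Pareto sample the estimate is close to the true index for any `k`. For other heavy-tailed models the bias typically grows with `k`, which is the issue the chapter addresses.
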
The Importance of Experimental Design Principles in Agricultural Field Trials: A Note for Grapevine Field Trials

Agriculture has a long tradition of developing experimental designs to establish rigorous field trials, particularly in plant breeding research. The overall design of the field trial is a key point to ensure the success of the experimental process. Although the importance of the principles of experimental design is widely recognised, they are not well understood and not well implemented in practice by some researchers. For example, randomisation is a fundamental principle in research, but it is not always fully respected. Using real yield data from grapevine field trials designed to quantify within-variety variability and fitting linear mixed models, this paper illustrates that incorrect conclusions about the research being conducted may be drawn if inappropriate randomisation is considered.

Elsa Gonçalves
A New Class of Conditional Tail Expectation Estimators

Extreme value theory is a crucial tool in finance and risk management for evaluating the tail risk of a distribution. We shall focus on the conditional tail expectation (CTE) among various risk measures, as it is regarded as more informative than the value-at-risk at a level q, the upper $(1-q)$-quantile of the loss function. We consider a Pareto tail for the right-tail function and work with heavy tailed models, i.e. models with a positive extreme value index (EVI), quite common in finance. The link between the estimation of both the EVI and the CTE allows for the utilization of the class of EVI estimators based on the power mean of the log-excesses in CTE estimation. To assess the behaviour of this class in finite samples, Monte Carlo simulation experiments will be conducted.

Lígia Henriques-Rodrigues, M. Ivette Gomes, Fernanda Figueiredo, Frederico Caeiro
Tail (In)dependence: A Comparative Analysis of Estimation Methods

Extreme value theory is focused on developing methods for tail inference where data are scarce and central measures may not be suitable. Such is the case of correlation to assess linear association between two variables. For example, the bivariate Gaussian model always shows asymptotic tail independence, no matter how strong the correlation is. In this work we address the tail independence coefficient $\eta$ of Ledford and Tawn, a measure to assess the presence of an extremal residual dependence. It can be estimated as a regular variation index, for which several estimators already exist. A major problem is the selection of the optimal sample fraction to be considered in the estimation. Based on a simulation study, we will make a comparative analysis of different methodologies adapted to the estimation of $\eta$. We will finish with an illustration on real data.

Sandra Dias, Marta Ferreira
Robust Estimation for the Random Effects Panel Data Models

Panel data have been increasingly used over the past decades. They arise in various fields of study like economics, biology, marketing, finance, the environment, and others. Particularly in domains of economics and finance, panel (or longitudinal) data are frequently used. Usually, research is based on empirical studies, where the estimation of the parameters is usually obtained with classical methodologies. Real data frequently exhibit the presence of outliers. These values may have a serious effect on the classic estimates produced. This paper aims to provide robust methods of estimation for random effects in panel data, resulting in better estimates for the parameters when the data violate the assumed conditions of the classic estimation models. The properties of the proposed estimation methods are measured with Monte Carlo simulations. A real data set is used to illustrate the new suggested methodology performance.

Anabela Rocha, M. Cristina Miranda
Air Quality Data Analysis with Symbolic Principal Components

Air pollution is a global challenge with deep implications in public health and environment. We examine air quality data from a monitoring station in Entrecampos, Lisbon, Portugal, using Symbolic Data Analysis. The dataset consists of hourly concentrations of nine pollutants during three years, which are logarithmically transformed and aggregated in intervals, taking the daily minimum and maximum values. The symbolic mean and variance are estimated for each variable through the method of moments, and the pairwise dependencies are captured using a bivariate copula. Symbolic principal component scores are obtained from the estimated covariance matrix and used to fit generalized extreme value distributions. Outlier maps, based on these distributions’ quantiles, are used to identify outlying observations. A comparative analysis with daily average-based outlier detection methods is conducted. The results show the relevance of Symbolic Data Analysis in revealing new insights into air quality.

Catarina P. Loureiro, M. Rosário Oliveira, Paula Brito, Lina Oliveira
Geostatistical Models for Identifying Juvenile Fish Hotspots in Marine Conservation

Species distribution models play a pivotal role in the management and conservation of commercially significant marine species. This work focuses on investigating geostatistical models that connect species occurrence and biomass observations with environmental covariates at a limited number of locations. The main objectives are to identify hotspots of juvenile richness, and map recruitment areas and seasons. Our analysis centers on the landing per unit of effort of small sardine (Sardina pilchardus, length 11–15 cm) along the northern Portuguese coast during a period with fewer administrative fishing restrictions (2007–2011). Using a Bayesian-INLA framework, we address the complexity associated with hierarchical geostatistical models capable of handling temporally collected data. The results of this study enhance our understanding of juvenile sardine distributions and allow us to identify hotspots, contributing to the sustainability of marine ecosystems and the preservation of commercially significant species.

Raquel Menezes, Francisco Gonçalves, Daniela Silva, Inês Dias, Alexandra A. Silva
Count Models and Randomness Patterns

The observation of randomness patterns serves as guidance for the craft of probabilistic modelling. The most used count models (Binomial, Poisson, Negative Binomial) are the discrete Morris’ natural exponential families whose variance is at most quadratic in the mean, and are also members of the Katz-Panjer, Power Series and Generalized Hypergeometric families, which accounts for their many advantageous properties. Some other basic count models are also described, as well as models with less obvious but useful randomness patterns in connection with maximum entropy characterisations, such as Zipf and Good models. Simple tools, such as truncation, thinning, or parameter randomisation, are straightforward ways of constructing other count models. Some of them are useful for understanding biological phenomena, such as modelling the number of extra-pair nestlings in broods.

Sandra Mendonça, Dinis Pestana
Neurological Disease Classification Based on Gait Analysis Through Transformation-Based Multiple Linear Regression Normalization

Gait analysis plays a vital role in clinical assessments by providing clear, objective insights into how diseases progress, how impairments in walking are manifested, and the effectiveness of various treatments. Our study tackled the challenge of comparing individuals by using multiple linear regression models to account for personal physical differences. We also looked at improving these models by transforming variables. Our focus was on individuals with Parkinson’s disease, normal pressure hydrocephalus, and a control group without these conditions. We used statistical tests to select relevant features and reduced the number of variables using principal component analysis. Techniques like Random Forest, Support Vector Machine, and Multilayer Perceptron were used to understand the normalized data patterns. Additionally, the SHapley Additive exPlanations (SHAP) method helped us identify which variables were most influential before and after data normalization. This work opens new possibilities for using these techniques in both clinical settings and further research.

Jhonathan Barrios, Bárbara Araújo, Miguel Gago, Wolfram Erlhagen, Estela Bicho, Flora Ferreira
Model and Threshold Selection in the Peaks-Over-Threshold (POT) Methodology: Application to Extreme Precipitation Values in Madeira and Porto Santo Islands

The study of extremes in the Peaks-Over-Threshold (POT) method requires the analysis of observations that exceed a specified threshold. The choice of this threshold, which involves a balance between the bias of the estimators and their variances, and the choice of the distribution used to model the resulting extreme values are topics of great practical importance. In this paper, nine threshold selection methods and four tests for the choice of statistical distribution models are considered. To demonstrate these methodologies and tests, they are applied to daily precipitation values on the islands of Madeira and Porto Santo from 1999 to 2023.

Délia Gouveia-Reis, Luiz Guerreiro Lopes, Sandra Mendonça
Revisiting Estimation Methods for Some Parameters of Rare Events

The primary objective of Extreme Value Theory is to estimate the probability of events occurring beyond the range of available data. Several parameters are of particular interest, including the extreme value index, $$\xi$$, which is associated with the tail weight of the distribution. It is the basis for estimating other parameters of extreme events, such as high quantiles. In dependent situations, which are very common in practice, another parameter emerges and can influence the estimation of high quantiles. This parameter is the extremal index, $$\theta$$, which is roughly defined as the reciprocal of the mean duration of values above a high level. A concise overview of several estimators for $$\theta$$ is presented and the impact of its estimation on the estimation of high quantiles is shown. Given the challenges associated with semiparametric estimators, resampling methods are also taken into account in a brief simulation and a real case study.

Dora Prata Gomes, Manuela Neves
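
To make the rough definition of the extremal index in the abstract above concrete, here is a minimal sketch of the classical runs estimator: the number of clusters of exceedances divided by the number of exceedances, which is the reciprocal of the mean cluster size. The abstract does not say which estimators the chapter reviews, so the choice of the runs estimator, the function name `runs_extremal_index`, and the toy series are all assumptions made for illustration.

```python
def runs_extremal_index(series, threshold, run_length):
    """Runs estimator of the extremal index.

    Two exceedances belong to different clusters when at least
    `run_length` consecutive non-exceedances separate them.
    """
    exceed_idx = [i for i, x in enumerate(series) if x > threshold]
    if not exceed_idx:
        raise ValueError("no observations exceed the threshold")
    clusters = 1
    for prev, curr in zip(exceed_idx, exceed_idx[1:]):
        # curr - prev - 1 non-exceedances lie between the two exceedances.
        if curr - prev > run_length:
            clusters += 1
    return clusters / len(exceed_idx)


# Toy series (made up): three clusters containing six exceedances of 4.
series = [1, 5, 6, 1, 1, 1, 7, 1, 1, 8, 9, 9, 1]
theta_hat = runs_extremal_index(series, threshold=4, run_length=2)
```

In this toy series the mean cluster size is 2, so the estimate is 0.5. In practice the result depends on both the threshold and the run length, which is part of the difficulty the abstract alludes to.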
Clustering and Classification of Compositional Data Using Distributions Defined on the Hypersphere

We propose an approach to cluster and classify compositional data. We transform the compositional data into directional data using the square root transformation. To cluster the compositional data, we fit a mixture of Watson distributions on the hypersphere, and to classify the compositional data into predefined groups, we apply Bayes rules based on the Watson distribution to the directional data. We then compare our clustering results with those obtained with hierarchical clustering and K-means clustering using log-ratio transformations of the data, and compare our classification results with those obtained with linear discriminant analysis using log-ratio transformations of the data.

Adelaide Figueiredo
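
The square root transformation mentioned in the abstract above can be sketched in a few lines: a composition whose parts sum to 1 is mapped to a vector whose squared components sum to 1, that is, to a point on the unit hypersphere. The function name `sqrt_transform` and the example composition are hypothetical, and the closure step (dividing by the total) is an assumption added so that unnormalised inputs are also handled.

```python
import math


def sqrt_transform(composition):
    """Map a composition to a unit vector via the square root transformation."""
    total = sum(composition)
    if total <= 0 or any(part < 0 for part in composition):
        raise ValueError("parts must be non-negative with a positive total")
    # Close the composition (parts sum to 1), then take square roots.
    return [math.sqrt(part / total) for part in composition]


composition = [0.2, 0.3, 0.5]  # illustrative three-part composition
direction = sqrt_transform(composition)
squared_norm = sum(c * c for c in direction)  # equals 1 up to rounding
```

Unlike log-ratio transformations, this mapping is defined for compositions with zero parts, since the square root of zero is zero.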
Joint Models of Longitudinal Binary Responses: A Bayesian Nonparametric Approach

In longitudinal data analysis, the response is usually modeled conditionally on a random effect that often has a Gaussian distribution. Bayesian nonparametric (BNP) statistics do not generally impose a probability distribution on either the response or the random effect. The Dirichlet process (DP), which is a common BNP model, is formally presented, together with its key properties and a constructive definition from which several DP variants can be obtained (e.g., the Dependent DP). BNP models are reviewed here to select the best joint model for two binary responses associated with a longitudinal four-arm randomized parallel trial conducted in Bengo Province, exploring the effects of four interventions on wasting and stunting for 121 Angolan children with intestinal parasitic infections. Finally, we present a simulation study for evaluating the performance of the proposed BNP joint model.

André Nunes, Giovani L. Silva, Luzia Gonçalves
Backmatter
Metadata
Title
New Frontiers in Statistics and Data Science
Editors
Lígia Henriques-Rodrigues
Raquel Menezes
Luís Meira Machado
Susana Faria
Miguel de Carvalho
Copyright Year
2025
Electronic ISBN
978-3-031-68949-9
Print ISBN
978-3-031-68948-2
DOI
https://doi.org/10.1007/978-3-031-68949-9
