
2022 | Book

Advances and Innovations in Statistics and Data Science


About this Book

This book highlights selected papers from the 4th ICSA-Canada Chapter Symposium, as well as invited articles from established researchers in statistics and data science. It covers a variety of topics, including methodology development in data science (such as methods for analyzing high-dimensional data, feature screening for ultrahigh-dimensional data, and natural language ranking), statistical challenges in sampling, multivariate survival models, and contaminated data, as well as applications of statistical methods. With this book, readers can draw on frontier research methods to tackle problems in research, education, training, and consultation.

Table of Contents

Frontmatter
Correction to: Identifiability and Estimation of Autoregressive ARCH Models with Measurement Error
Mustafa Salamh, Liqun Wang

Methodology Development in Data Science

Frontmatter
MiRNA–Gene Activity Interaction Networks (miGAIN): Integrated Joint Models of miRNA–Gene Targeting and Disturbance in Signaling Pathways
Abstract
Omics data are now inexpensive to collect in vast quantities, spanning not only multiple data platforms but also distinct functional units. These bioinformatic datasets enable scientific analysis of system-level cellular processes, including complex diseases such as cancers. Recent experimental research has found significant interactions between non-coding microRNAs (miRNAs) and genes. We propose an integrated, graphical regression model to endogenize the directed miRNA–gene target interactions and control for their effects on signaling pathway disturbance. We identify prominent miRNA–gene interactions and propose a graphical representation of the targeting. We merge this network with signaling pathway networks to obtain a cross-functional graph representation of regulatory relationships between genes and miRNAs. We integrate gene expression and miRNA expression, in tandem with graphical integration of epigenetic and transcriptomic data types, and estimate a statistical model. Using a simulation study, we find that our integration approach improves statistical power. We demonstrate the integrated model with an application to disturbance of the BRAF signaling pathway across 9 cancers. We find that integrating miRNA–gene targets clarifies the differential activity between healthy and tumor tissues, which in turn reflects different roles for the pathway across the different cancers.
Henry Linder, Yuping Zhang
Robust Feature Screening for Ultrahigh-Dimensional Censored Data Subject to Measurement Error
Abstract
Feature screening is commonly used to handle ultrahigh-dimensional data prior to conducting a formal data analysis. While various feature screening methods have been developed in the literature, research gaps still exist. The existing methods usually make an implicit assumption that data are accurately measured; this requirement, however, is frequently violated in applications. In this chapter, we consider error-prone ultrahigh-dimensional survival data and propose a robust feature screening method. We develop an iterative algorithm to improve the performance of retaining all informative covariates. Theoretical results are established for the proposed method. Simulation studies are reported to assess the performance of the proposed method, together with an application to a mantle cell lymphoma microarray dataset.
Li-Pang Chen, Grace Y. Yi
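
As a rough point of reference for the screening idea in the preceding abstract, the sketch below performs plain sure independence screening, ranking covariates by their absolute marginal correlation with a fully observed response. It is a hypothetical illustration only and does not implement the chapter's robust treatment of censoring and measurement error.

```python
import numpy as np

# Plain sure-independence-screening sketch (illustrative only; not the chapter's
# robust method for censored, error-prone data): rank covariates by absolute
# marginal correlation with the response and keep the top d.
def sis_screen(X, y, d):
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = (y - y.mean()) / y.std()
    utility = np.abs(Xc.T @ yc) / len(y)     # absolute marginal correlations
    return np.argsort(utility)[::-1][:d]     # indices of the d top-ranked covariates

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1000))             # n = 200 observations, p = 1000 covariates
y = 2 * X[:, 3] - 1.5 * X[:, 7] + rng.normal(size=200)
print(sis_screen(X, y, d=20))                # the retained set should contain columns 3 and 7
```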
Simultaneous Control of False Discovery Rate and Sensitivity Using Least Angle Regressions in High-Dimensional Data Analysis
Abstract
Controlling the false discovery rate (FDR) and maintaining high sensitivity are key desiderata in post-selection inference for high-dimensional data analysis. Least Angle Regression (LARS) is an efficient variable selection method that provides a solution path along which the entered predictors always have the same absolute correlation with the current residual. In this chapter, we propose a new method, termed Cosine PoSI, to control the FDR and sensitivity simultaneously for high-dimensional post-selection inference using least angle regression. Cosine PoSI focuses on the geometric aspect of least angle regression: in each step of the LARS algorithm, it uses the angle between the entering variable and the current residual and treats this angle as a random variable that follows a cosine distribution. Given the collection of possible angles, the variable selection path is stopped using a hypothesis test based on the limiting distribution of the maximum angle, obtained through the order statistics of the cosine distribution. We show that both the sensitivity and the FDR can be controlled by this stopping criterion. Simulation studies and a real-data analysis are conducted to assess the effectiveness of the proposed method.
Bangxin Zhao, Wenqing He
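
To make the geometric idea concrete, the following sketch (with assumed notation, not the authors' implementation) computes the angle between each candidate predictor and the current residual at the first LARS-type step; the variable that enters next is the one with the smallest angle, and Cosine PoSI models such angles as cosine-distributed.

```python
import numpy as np

# Angle between each (unit-norm) candidate predictor and the current residual.
rng = np.random.default_rng(1)
n, p = 100, 50
X = rng.normal(size=(n, p))
X /= np.linalg.norm(X, axis=0)                 # unit-norm columns
beta = np.zeros(p); beta[:3] = [3.0, -2.0, 1.5]
y = X @ beta + rng.normal(scale=0.5, size=n)

residual = y - y.mean()                        # residual before any variable enters
cosines = X.T @ residual / np.linalg.norm(residual)
angles = np.degrees(np.arccos(np.abs(cosines)))
entering = int(np.argmin(angles))              # smallest angle = next entering variable
print(entering, angles[entering])
```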
Minimum Wasserstein Distance Estimator Under Finite Location-Scale Mixtures
Abstract
When a population exhibits heterogeneity, we often model it via a finite mixture: decompose it into several different but homogeneous subpopulations. Contemporary practice favors learning the mixture by maximizing the likelihood, for its statistical efficiency, with the convenient EM algorithm for numerical computation. Yet the maximum likelihood estimate (MLE) is not well defined for finite location-scale mixtures in general. We hence investigate feasible alternatives to the MLE, such as minimum distance estimators. Recently, the Wasserstein distance has drawn increased attention in the machine learning community; it has an intuitive geometric interpretation and has been successfully employed in many new applications. Do we gain anything by learning finite location-scale mixtures via a minimum Wasserstein distance estimator (MWDE)? This chapter investigates this possibility in several respects. We find that the MWDE is consistent and derive a numerical solution under finite location-scale mixtures. We study its robustness against outliers and mild model mis-specifications. Our moderately scaled simulation study shows that the MWDE generally suffers some efficiency loss against a penalized version of the MLE, without a noticeable gain in robustness. We reaffirm the general superiority of likelihood-based learning strategies even for non-regular finite location-scale mixtures.
Qiong Zhang, Jiahua Chen
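
As a small illustration of the estimator's definition (under an assumed parametrization, not the chapter's algorithm or tuning), the sketch below fits a two-component normal location-scale mixture by minimizing the one-dimensional Wasserstein distance between the data and a Monte Carlo sample from the candidate mixture; the base random draws are fixed so the objective varies smoothly in the parameters.

```python
import numpy as np
from scipy.stats import wasserstein_distance
from scipy.optimize import minimize

# Minimal MWDE sketch for a two-component normal location-scale mixture.
rng = np.random.default_rng(2)
data = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 2, 200)])

m = 5000
z = rng.normal(size=m)          # fixed standard normal draws
u = rng.random(m)               # fixed uniforms for component assignment

def mixture_sample(theta):
    w = 1 / (1 + np.exp(-theta[0]))                   # mixing weight in (0, 1)
    mu1, mu2 = theta[1], theta[2]
    s1, s2 = np.exp(theta[3]), np.exp(theta[4])       # positive scale parameters
    return np.where(u < w, mu1 + s1 * z, mu2 + s2 * z)

def objective(theta):
    return wasserstein_distance(data, mixture_sample(theta))

fit = minimize(objective, x0=[0.0, -1.0, 5.0, 0.0, 0.0], method="Nelder-Mead")
print(fit.x)
```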
An Entropy-Based Comment Ranking Method with Word Embedding Clustering
Abstract
Automatically ranking comments by their relevance plays an important role in text mining. In this chapter, we introduce a new text digitization method, the bag-of-word-clusters model: grouping semantically related words into clusters using pre-trained word2vec word embeddings and representing each comment as a distribution over word clusters. This method extracts both semantic and statistical information from texts. Next, we propose an unsupervised ranking algorithm that identifies relevant comments by their distance to the “ideal” comment, defined as the maximum general entropy comment with respect to the global word cluster distribution. The intuition is that the “ideal” comment highlights aspects of a product that many other comments frequently mention; it is therefore regarded as a standard against which to judge a comment's relevance to the product. Finally, we analyze our algorithm's performance on a real Amazon product.
Yuyang Zhang, Hao Yu
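
The sketch below illustrates the bag-of-word-clusters representation with stand-in embeddings (random vectors replace the pre-trained word2vec vectors the chapter assumes): words are grouped into clusters, each comment becomes a distribution over clusters, and the entropy of that distribution can then be computed. It covers only the representation step, not the chapter's ranking algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

# Bag-of-word-clusters sketch with random stand-in embeddings.
rng = np.random.default_rng(3)
vocab = ["battery", "charge", "screen", "display", "price", "cheap", "ship", "delivery"]
embeddings = {w: rng.normal(size=50) for w in vocab}    # stand-in for word2vec vectors

k = 3
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(np.array([embeddings[w] for w in vocab]))
word_cluster = dict(zip(vocab, km.labels_))

def comment_to_distribution(comment):
    counts = np.zeros(k)
    for w in comment.split():
        if w in word_cluster:
            counts[word_cluster[w]] += 1
    return counts / counts.sum() if counts.sum() else counts

p = comment_to_distribution("battery charge price cheap")
entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))          # entropy of the cluster distribution
print(p, entropy)
```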
A Robust Approach to Statistical Quality Control for High-Dimensional Non-Normal Data
Abstract
A recently proposed modification to the limit of Hotelling's T²-statistic for statistical control under high-dimensional settings is evaluated for its robustness to the normality assumption. The limit, evaluated for high-dimensional asymptotics, is shown to be robust under a few mild assumptions and a general multivariate model covering normality as a special case. Further, the limit holds without any dimension reduction or preprocessing. The validity of the limit is demonstrated through simulations.
M. Rauf Ahmad, S. Ejaz Ahmed
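
For orientation, the classical Hotelling's T²-statistic for a mean vector is computed below; the chapter's contribution is a modified high-dimensional limit for this statistic, which is not reproduced here.

```python
import numpy as np

# Classical Hotelling's T^2 statistic for testing H0: E[X] = mu0 (requires n > p).
def hotelling_t2(X, mu0):
    n, p = X.shape
    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)                  # sample covariance matrix
    d = xbar - mu0
    return n * d @ np.linalg.solve(S, d)

rng = np.random.default_rng(4)
X = rng.normal(loc=0.2, size=(50, 5))
print(hotelling_t2(X, mu0=np.zeros(5)))
```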

Challenges in Statistical Analysis

Frontmatter
Functional Linear Regression for Partially Observed Functional Data
Abstract
In the functional linear regression model, many methods have been proposed and studied for estimating the slope function when the functional predictor is observed on the entire domain. However, work on functional linear regression with partially observed trajectories has received less attention. In this paper, to fill this gap in the literature, we consider the scenario where an individual functional predictor may be observed only on part of the domain. Depending on whether measurement error is present in the functional predictors, two methods are developed: one is based on linear functionals of the observed part of the trajectory, and the other uses conditional principal component scores. We establish the asymptotic properties of the two proposed methods. Finite sample simulations are conducted to verify their performance. Diffusion tensor imaging (DTI) data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) study are analyzed.
Yafei Wang, Tingyu Lai, Bei Jiang, Linglong Kong, Zhongzhan Zhang
Profile Estimation of Generalized Semiparametric Varying-Coefficient Additive Models for Longitudinal Data with Within-Subject Correlations
Abstract
In this paper, we study several profile estimation methods for the generalized semiparametric varying-coefficient additive model for longitudinal data by utilizing the within-subject correlations. The model is flexible in allowing time-varying effects for some covariates and constant effects for others, and in offering a choice of link functions that can be used to analyze both discrete and continuous longitudinal responses. We investigate profile generalized estimating equation (GEE) approaches and the profile quadratic inference function (QIF) approach. The profile estimation is assisted by the local linear smoothing technique to estimate the time-varying effects. Several approaches that incorporate the within-subject correlations are investigated, including quasi-likelihood (QL), minimum generalized variance (MGV), the quadratic inference function, and weighted least squares (WLS). The proposed estimation procedures can accommodate flexible sampling schemes and provide a unified approach that works well for both discrete and continuous longitudinal responses. Finite sample performance of these methods is examined through Monte Carlo simulations under various correlation structures for both discrete and continuous longitudinal responses. The simulation results show an efficiency improvement over the working-independence approach when the within-subject correlations are utilized, and allow a comparison of the different approaches.
Yanqing Sun, Fang Fang
Sieve Estimation of Semiparametric Linear Transformation Model with Left-Truncated and Current Status Data
Abstract
In this paper, we analyze the semiparametric linear transformation model with left-truncated and current status data. A sieve maximum likelihood estimation method based on constrained Bernstein polynomials is used to obtain estimators of both the regression coefficients and the baseline survival function. Under some regularity conditions, we prove that the proposed parameter estimators are semiparametrically efficient and asymptotically normal based on the conditional likelihood given the truncation time, and that the estimator of the nonparametric function achieves the optimal rate of convergence. Simulation studies are conducted to support the theoretical results, and a real data set is analyzed using the proposed method.
Riyadh Rustam Al-Mosawi, Xuewen Lu
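
To fix ideas about the sieve, the sketch below builds a Bernstein polynomial basis on [0, 1] and evaluates a function with nondecreasing coefficients, which yields a monotone curve. The coefficient values are illustrative only, and the chapter's likelihood-based fitting is not reproduced.

```python
import numpy as np
from scipy.special import comb

# Bernstein polynomial basis B_{k,m}(t) on [0, 1].
def bernstein_basis(t, m):
    t = np.asarray(t)[:, None]
    k = np.arange(m + 1)[None, :]
    return comb(m, k) * t**k * (1 - t)**(m - k)     # shape (len(t), m + 1)

t = np.linspace(0, 1, 5)
coef = np.array([0.0, 0.1, 0.3, 0.6, 1.0])          # nondecreasing coefficients -> monotone curve
B = bernstein_basis(t, m=4)
print(B @ coef)
```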
A Review of Flexible Transformations for Modeling Compositional Data
Abstract
Vectors of non-negative components carrying only relative information, often normalized to sum to one, are referred to as compositional data, and their sample space is the simplex. Compositional data arise in many applications across a variety of disciplines such as ecology, geology, demography, and economics, to name a few. For some time, log-ratio methods have been a popular approach for analyzing compositional data and have motivated much of the recent research in the area. In this paper, we consider two recently proposed transformations for data defined on the simplex. The first, referred to as the α-transformation, transforms the data from the simplex to a subset of Euclidean space, while a more complex transformation, involving folding, results in data with a Euclidean sample space. In both cases, the transformed data are assumed to follow a multivariate normal distribution, and the parameter α provides flexibility compared to the traditional log-ratio transformations. Through an empirical study using several real-life data sets, we illustrate that the α-transformation may be sufficient and preferred in practice compared to the α-folded model, and further that it is often needed in place of the log-ratio transformation.
Michail Tsagris, Connie Stewart
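
A small sketch of the α-transformation, in one common parametrization (the exact scaling here is an assumption, not necessarily the form used in the chapter): the composition is power-transformed and renormalized, centred and scaled by α, and then mapped to Euclidean space with a Helmert sub-matrix. As α approaches 0, the transform tends to an isometric log-ratio type transformation.

```python
import numpy as np
from scipy.linalg import helmert

# Alpha-transformation sketch for a composition x on the simplex.
def alpha_transform(x, alpha):
    x = np.asarray(x, dtype=float)
    D = x.size
    u = x**alpha / np.sum(x**alpha)       # power-transformed, renormalized composition
    z = (D * u - 1.0) / alpha             # centred and scaled
    H = helmert(D, full=False)            # (D-1) x D Helmert sub-matrix
    return H @ z

print(alpha_transform([0.2, 0.3, 0.5], alpha=0.5))
```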
Identifiability and Estimation of Autoregressive ARCH Models with Measurement Error
Abstract
The autoregressive conditional heteroscedasticity (ARCH) model and its various generalizations have been widely used to analyze economic and financial data. Although many variables such as GDP, inflation, and commodity prices are imprecisely measured, research focusing on mismeasured response processes in GARCH models is sparse. We study a dynamic model with ARCH errors in which the underlying process is latent and subject to additive measurement error. We show that, in contrast to the case of covariate measurement error, this model is identifiable using observations of the proxy process only, and no extra information is needed. We construct GMM estimators for the unknown parameters that are consistent and asymptotically normally distributed under general conditions. We also propose a procedure to test for the presence of measurement error, which avoids the usual boundary problem of testing variance parameters. We carry out Monte Carlo simulations to study the impact of measurement error on the naive maximum likelihood estimators and find interesting patterns in their biases. Moreover, the proposed estimators have fairly good finite sample properties.
Mustafa Salamh, Liqun Wang
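
The data structure studied in the chapter can be illustrated with a short simulation: a latent AR(1) process with ARCH(1) errors is observed only through a proxy contaminated with additive measurement error. The parameter values below are illustrative, and the GMM estimation itself is not reproduced.

```python
import numpy as np

# Latent AR(1) process with ARCH(1) errors, observed through an additive-error proxy.
rng = np.random.default_rng(5)
T, phi, omega, alpha, sigma_u = 500, 0.6, 0.5, 0.3, 0.4

y = np.zeros(T)               # latent process
eps_prev = 0.0
for t in range(1, T):
    h_t = omega + alpha * eps_prev**2       # ARCH(1) conditional variance
    eps_t = np.sqrt(h_t) * rng.normal()
    y[t] = phi * y[t - 1] + eps_t
    eps_prev = eps_t

x = y + sigma_u * rng.normal(size=T)        # observed proxy with measurement error
print(np.var(y), np.var(x))
```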
Modal Regression for Skewed, Truncated, or Contaminated Data with Outliers
Abstract
Built on the ideas of the mean and quantiles, mean regression and quantile regression have been extensively investigated and are popularly used to model the relationship between a dependent variable Y and covariates x. However, research on regression models built on the mode is rather limited. In this article, we introduce a new regression tool, named modal regression, that aims to find the most probable conditional value (mode) of a dependent variable Y given covariates x, rather than the mean used by traditional mean regression. Modal regression can reveal interesting data structures that may be missed by the conditional mean or quantiles. In addition, modal regression is resistant to outliers and heavy-tailed data and can provide shorter prediction intervals when the data are skewed. Furthermore, unlike traditional mean regression, modal regression can be applied directly to truncated data. Modal regression is a potentially very useful regression tool that can complement traditional mean and quantile regressions.
Sijia Xiang, Weixin Yao
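
One common way to operationalize modal linear regression, shown below as a sketch (the bandwidth and starting values are ad hoc assumptions, and this is not necessarily the authors' formulation), is to choose the coefficients that maximize the average Gaussian kernel of the residuals; with skewed errors the modal fit and the least-squares fit differ mainly in the intercept.

```python
import numpy as np
from scipy.optimize import minimize

# Kernel-based modal linear regression sketch: maximize the average Gaussian
# kernel of the residuals, targeting the conditional mode rather than the mean.
rng = np.random.default_rng(6)
n = 300
x = rng.uniform(0, 5, n)
y = 1.0 + 2.0 * x + rng.exponential(scale=1.0, size=n)    # skewed errors (mode 0, mean 1)
X = np.column_stack([np.ones(n), x])

def neg_kernel_objective(beta, h=0.5):
    r = y - X @ beta
    return -np.mean(np.exp(-0.5 * (r / h) ** 2))          # negative kernel objective

ols = np.linalg.lstsq(X, y, rcond=None)[0]                # least-squares starting value
fit = minimize(neg_kernel_objective, x0=ols, method="Nelder-Mead")
print("OLS intercept/slope:  ", ols)
print("modal intercept/slope:", fit.x)
```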
Spatial Multilevel Modelling in the Galveston Bay Recovery Study Survey
Abstract
The Galveston Bay Recovery Study conducted a longitudinal survey of residents of two counties in Texas in the aftermath of Hurricane Ike, which made landfall on September 13, 2008 and caused widespread damage. An important objective was to chart the extent of symptoms of Post-Traumatic Stress Disorder (PTSD) in the resident population over the following months. Wave 1 of the survey was conducted between November 17, 2008 and March 24, 2009. Waves 2 and 3 consisted of two-month and one-year follow-ups, respectively. With the use of a stratified, 3-stage sampling design, data were collected from 658 residents. The first stage of sampling within strata was the selection of clusters, or area segments. Our objective is to model the course of the repeated PTSD measures as a function of individual characteristics and area segment, and to examine the analytical and visual evidence for spatial correlation of the area segment effect. To incorporate design information, our multilevel analysis uses the composite likelihood approach of Rao et al. (Survey Methodology, 39, 263–282, 2013) and Yi et al. (Statistica Sinica, 26, 569–587, 2016). We compare this with a Bayesian multilevel analysis and discuss the estimability of the model when the cluster-level variation has spatial dependence.
Mary E. Thompson, Gang Meng, Joseph Sedransk, Qixuan Chen, Rebecca Anthopolos
Efficient Experimental Design for Lasso Regression
Abstract
Lasso regression has attracted great attention in statistical learning and data science. However, there has been only sporadic work on constructing efficient data collection schemes for regularized regression. In this work, we propose an experimental design approach, using nearly orthogonal Latin hypercube designs, to enhance the variable selection accuracy of Lasso regression. Systematic methods for constructing such designs are presented. The effectiveness of the proposed method is illustrated with several examples.
Peter Chien, Xinwei Deng, Chunfang Devon Lin
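
As a toy pairing of a space-filling design with Lasso selection, the sketch below generates a plain Latin hypercube design and fits the Lasso on simulated responses; the plain design stands in for the nearly orthogonal Latin hypercube designs the chapter constructs, whose construction is not reproduced here.

```python
import numpy as np
from scipy.stats import qmc
from sklearn.linear_model import Lasso

# Latin hypercube design + Lasso variable selection sketch.
rng = np.random.default_rng(7)
n, p = 40, 10
sampler = qmc.LatinHypercube(d=p, seed=0)
X = qmc.scale(sampler.random(n), -1, 1)              # design points in [-1, 1]^p

beta = np.zeros(p); beta[[0, 3, 6]] = [2.0, -1.5, 1.0]
y = X @ beta + rng.normal(scale=0.2, size=n)

lasso = Lasso(alpha=0.05).fit(X, y)
print("selected columns:", np.flatnonzero(lasso.coef_))
```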
A Selective Overview of Statistical Methods for Identification of the Treatment-Sensitive Subsets of Patients
Abstract
Identification of a subset of patients who may benefit from or be sensitive to a specific type of treatment has become a very important research topic in clinical trials and other types of clinical research. Statistical methods are essential in helping clinical researchers identify such subsets. In this article, we provide a selective overview of statistical methods developed in recent years in this research area. Specifically, we first consider cases where the outcome of the clinical study is time-to-event or survival time and the subset is defined by one continuous covariate, such as the expression level of a gene, or by multiple covariates that can be continuous or categorical, such as the mutation statuses of multiple genes. Cases where the outcomes of the clinical studies are longitudinal or repeated measurements, such as patient-reported quality of life scores before, during, and after treatment, are considered next. Gaps between the needs of clinical research and the methods available in the statistical literature are identified, and future research topics to bridge these gaps are discussed based on this overview.
Xinyi Ge, Yingwei Peng, Dongsheng Tu
Backmatter
Metadata
Title
Advances and Innovations in Statistics and Data Science
Edited by
Wenqing He
Liqun Wang
Jiahua Chen
Chunfang Devon Lin
Copyright year
2022
Electronic ISBN
978-3-031-08329-7
Print ISBN
978-3-031-08328-0
DOI
https://doi.org/10.1007/978-3-031-08329-7
