2017 | Book

Big and Complex Data Analysis

Methodologies and Applications


About this Book

This volume conveys some of the surprises, puzzles and success stories in high-dimensional and complex data analysis and related fields. Its peer-reviewed contributions showcase recent advances in variable selection, estimation and prediction strategies for a host of useful models, as well as essential new developments in the field.

The continued and rapid advancement of modern technology now allows scientists to collect data of unprecedented size and complexity. Examples include epigenomic data, genomic data, proteomic data, high-resolution image data, high-frequency financial data, functional and longitudinal data, and network data. Simultaneous variable selection and estimation is one of the key statistical problems in analyzing such big and complex data.

The purpose of this book is to stimulate research and foster interaction between researchers in the area of high-dimensional data analysis. More concretely, its goals are to: 1) highlight and expand the breadth of existing methods in big data and high-dimensional data analysis and their potential for the advancement of both the mathematical and statistical sciences; 2) identify important directions for future research in the theory of regularization methods, in algorithmic development, and in methodologies for different application areas; and 3) facilitate collaboration between theoretical and subject-specific researchers.

Table of Contents

Frontmatter

General High-Dimensional Theory and Methods

Frontmatter
Regularization After Marginal Learning for Ultra-High Dimensional Regression Models
Abstract
Regularization is a popular variable selection technique for high dimensional regression models. However, under the ultra-high dimensional setting, a direct application of regularization methods tends to fail to achieve model selection consistency because of possible spurious correlations among predictors. Motivated by the ideas of screening (Fan and Lv, J R Stat Soc Ser B Stat Methodol 70:849–911, 2008) and retention (Weng et al., Manuscript, 2013), we propose a new two-step framework for variable selection: in the first step, marginal learning techniques are used to partition variables into different categories, and regularization methods are then applied. The technical conditions for model selection consistency of this broad framework relax those of the one-step regularization methods. Extensive simulations show the competitive performance of the new method.
Yang Feng, Mengjia Yu
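The two-step framework summarized above lends itself to a simple illustration. The Python sketch below is not the authors' exact procedure: it pairs a basic sure-independence-style marginal screen with a cross-validated Lasso on the retained predictors, and the n/log(n) cut-off, penalty choice, and simulated data are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Toy ultra-high dimensional data: n = 200 observations, p = 2000 predictors, 5 true signals.
rng = np.random.default_rng(0)
n, p = 200, 2000
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 2.0
y = X @ beta + rng.standard_normal(n)

# Step 1 (marginal learning): rank predictors by absolute marginal correlation with the
# response and retain the top n / log(n) of them (a common screening cut-off).
Xc = X - X.mean(axis=0)
corr = np.abs(Xc.T @ (y - y.mean())) / (n * Xc.std(axis=0) * y.std())
keep = np.argsort(corr)[::-1][: int(n / np.log(n))]

# Step 2 (regularization): fit a cross-validated Lasso on the retained predictors only.
lasso = LassoCV(cv=5).fit(X[:, keep], y)
selected = np.sort(keep[np.flatnonzero(lasso.coef_)])
print("selected predictors:", selected)
```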
Empirical Likelihood Test for High Dimensional Generalized Linear Models
Abstract
Technological advances allow scientists to collect high dimensional data sets in which the number of variables is much larger than the sample size. A representative example is genomics. Consequently, many classic statistical methods are challenged by a loss of accuracy or power when analyzing such data. In this chapter, we propose an empirical likelihood (EL) method to test regression coefficients in high dimensional generalized linear models. The EL test has an asymptotic chi-squared distribution with two degrees of freedom under the null hypothesis, and this result is independent of the number of covariates. Moreover, we extend the proposed method to test a subset of the regression coefficients in the presence of nuisance parameters. Simulation studies show that the EL tests control the type-I error rate well with moderate sample sizes and are more powerful than the direct competitor under most alternative scenarios. The proposed tests are employed to analyze the association between rheumatoid arthritis (RA) and single nucleotide polymorphisms (SNPs) on chromosome 6. The resulting p-value is 0.019, indicating that chromosome 6 has an influence on RA. With the partial test and logistic modeling, we also find that the SNPs eliminated by the sure independence screening and Lasso methods have no significant influence on RA.
Yangguang Zang, Qingzhao Zhang, Sanguo Zhang, Qizhai Li, Shuangge Ma
Random Projections for Large-Scale Regression
Abstract
Fitting linear regression models can be computationally very expensive in large-scale data analysis when both the sample size and the number of variables are very large. Random projections are widely used as a dimension reduction tool in machine learning and statistics. We discuss applications of random projections to linear regression, developed to decrease computational costs, and give an overview of theoretical guarantees on the generalization error. It can be shown that combining random projections with least squares regression leads to recovery similar to that of ridge regression and principal component regression. We also discuss possible improvements from averaging over multiple random projections, an approach that lends itself easily to parallel implementation.
Gian-Andrea Thanei, Christina Heinze, Nicolai Meinshausen
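A minimal sketch of the compressed least-squares idea discussed above, assuming a Gaussian projection matrix and simulated data; averaging the mapped-back coefficients over several independent projections illustrates the variance reduction mentioned in the abstract, not the chapter's specific estimators.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, d = 5000, 2000, 200          # sample size, dimension, projection dimension
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
y = X @ beta + rng.standard_normal(n)

def projected_ls(X, y, d, rng):
    """Least squares after compressing the p columns of X to d random directions."""
    R = rng.standard_normal((X.shape[1], d)) / np.sqrt(d)   # Gaussian random projection
    gamma, *_ = np.linalg.lstsq(X @ R, y, rcond=None)       # fit in the compressed space
    return R @ gamma                                        # map back to a p-dimensional coefficient

# Averaging the coefficient estimates over several independent projections
# typically reduces the extra variance introduced by the random compression.
beta_hat = np.mean([projected_ls(X, y, d, rng) for _ in range(20)], axis=0)
print("relative estimation error:", np.linalg.norm(beta_hat - beta) / np.linalg.norm(beta))
```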
Testing in the Presence of Nuisance Parameters: Some Comments on Tests Post-Model-Selection and Random Critical Values
Abstract
We point out that the ideas underlying some test procedures recently proposed for testing post-model-selection (and for some other test problems) in the econometrics literature have been around for quite some time in the statistics literature. We also sharpen some of these results in the statistics literature. Furthermore, we show that some intuitively appealing testing procedures, that have found their way into the econometrics literature, lead to tests that do not have desirable size properties, not even asymptotically.
Hannes Leeb, Benedikt M. Pötscher
Analysis of Correlated Data with Error-Prone Response Under Generalized Linear Mixed Models
Abstract
Measurements of variables are often subject to error for various reasons. Measurement error in covariates has been discussed extensively in the literature, while error in the response has received much less attention. In this paper, we consider generalized linear mixed models for clustered data in which measurement error is present in the response variables. We investigate the asymptotic bias induced by nonlinear response error when such error is ignored, and evaluate the performance of an intuitively appealing approach for correcting response error effects. We develop likelihood methods to correct for effects induced by response error. Simulation studies are conducted to evaluate the performance of the proposed methods, and a real data set is analyzed with the proposed methods.
Grace Y. Yi, Zhijian Chen, Changbao Wu
Bias-Reduced Moment Estimators of Population Spectral Distribution and Their Applications
Abstract
In this paper, we propose a series of bias-reduced moment estimators for the Population Spectral Distribution (PSD) of large covariance matrices, which are fundamentally important for modern high-dimensional statistics. In addition, we derive the limiting distributions of these moment estimators, which are then adopted to test the order of PSDs. The simulation study demonstrates the desirable performance of the order test in conjunction with the proposed moment estimators for the PSD of large covariance matrices.
Yingli Qin, Weiming Li

Network Analysis and Big Data

Frontmatter
Statistical Process Control Charts as a Tool for Analyzing Big Data
Abstract
Big data often take the form of data streams, with observations of certain processes collected sequentially over time. One common purpose of collecting and analyzing big data is to monitor the longitudinal performance or status of the related processes. To this end, statistical process control (SPC) charts can be a useful tool, although conventional SPC charts need to be properly modified in some cases. In this paper, we introduce some basic SPC charts and some of their modifications, and describe how these charts can be used to monitor different types of processes. Among many potential applications, dynamic disease screening and profile/image monitoring are discussed in some detail.
Peihua Qiu
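As a concrete example of a basic SPC chart applied to a data stream, here is a minimal CUSUM sketch in its standard textbook form; it is not the chapter's dynamic-screening or profile-monitoring methodology, and the reference value k, control limit h, and simulated stream are illustrative choices.

```python
import numpy as np

def cusum(stream, mu0, sigma, k=0.5, h=5.0):
    """One-sided upper/lower CUSUM statistics for an in-control mean mu0.
    k is the reference value and h the control limit, both in sigma units."""
    c_plus = c_minus = 0.0
    alarms = []
    for t, x in enumerate(stream):
        z = (x - mu0) / sigma
        c_plus = max(0.0, c_plus + z - k)
        c_minus = max(0.0, c_minus - z - k)
        if c_plus > h or c_minus > h:
            alarms.append(t)
            c_plus = c_minus = 0.0   # restart the chart after it signals
    return alarms

# Toy stream: in control for 500 observations, then the mean shifts upward by 1 sigma.
rng = np.random.default_rng(2)
stream = np.concatenate([rng.normal(0, 1, 500), rng.normal(1, 1, 200)])
print("alarm times:", cusum(stream, mu0=0.0, sigma=1.0))
```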
Fast Community Detection in Complex Networks with a K-Depths Classifier
Abstract
We introduce a notion of data depth for recovery of community structures in large complex networks. We propose a new data-driven algorithm, K-depths, for community detection using the L1-depth in an unsupervised setting. We evaluate finite sample properties of the K-depths method using synthetic networks and illustrate its performance for tracking communities on the online social media platform Flickr. The new method significantly outperforms the classical K-means and yields results comparable to the regularized K-means. Being robust to low-degree vertices, the new K-depths method is computationally efficient, requiring up to 400 times less CPU time than the currently adopted regularization procedures based on optimizing the Davis–Kahan bound.
Yahui Tian, Yulia R. Gel
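The sketch below only illustrates the L1 (spatial) depth and a naive depth-based reassignment loop on a two-dimensional point cloud; it is not the authors' K-depths algorithm, whose actual use of the spectral embedding, initialization, and regularization is described in the chapter. All function names and toy data are illustrative.

```python
import numpy as np

def l1_depth(x, cloud):
    """Spatial (L1) depth of point x with respect to a finite point cloud."""
    diff = cloud - x
    norms = np.linalg.norm(diff, axis=1)
    nz = norms > 0
    if not nz.any():
        return 1.0
    units = diff[nz] / norms[nz][:, None]
    return 1.0 - np.linalg.norm(units.mean(axis=0))

def k_depths(points, k, n_iter=20, seed=0):
    """Iteratively reassign each point to the group within which its L1 depth is largest."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(points))
    for _ in range(n_iter):
        depth = np.full((len(points), k), -np.inf)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 1:
                depth[:, j] = [l1_depth(x, members) for x in points]
        labels = depth.argmax(axis=1)
    return labels

# Toy usage on a 2-D cloud with two clear groups; in the network setting the input
# would typically be a low-dimensional spectral embedding of the adjacency matrix.
rng = np.random.default_rng(8)
pts = np.vstack([rng.normal(0.0, 0.3, (60, 2)), rng.normal(2.0, 0.3, (60, 2))])
print("group sizes:", np.bincount(k_depths(pts, k=2)))
```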
How Different Are Estimated Genetic Networks of Cancer Subtypes?
Abstract
Genetic networks provide compact representations of interactions between genes, and offer a systems perspective into biological processes and cellular functions. Many algorithms have been developed to estimate such networks based on steady-state gene expression profiles. However, the estimated networks using different methods are often very different from each other. On the other hand, it is not clear whether differences observed between estimated networks in two different biological conditions are truly meaningful, or due to variability in estimation procedures. In this paper, we aim to answer these questions by conducting a comprehensive empirical study to compare networks obtained from different estimation methods and for different subtypes of cancer. We evaluate various network descriptors to assess complex properties of estimated networks, beyond their local structures, and propose a simple permutation test for comparing estimated networks. The results provide new insight into properties of estimated networks using different reconstruction methods, as well as differences in estimated networks in different biological conditions.
Ali Shojaie, Nafiseh Sedaghat
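A stand-in for the kind of permutation test described above, assuming networks estimated with the graphical lasso and edge density as the descriptor; the chapter compares several reconstruction methods and richer descriptors, so every choice below (estimator, penalty level, descriptor, toy data) is an illustrative assumption.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def estimate_network(expr, alpha=0.3):
    """Network estimate: support of a graphical-lasso precision matrix (edges = nonzeros)."""
    prec = GraphicalLasso(alpha=alpha, max_iter=200).fit(expr).precision_
    adj = (np.abs(prec) > 1e-6).astype(int)
    np.fill_diagonal(adj, 0)
    return adj

def edge_density(adj):
    """A simple global network descriptor; richer descriptors can be substituted here."""
    p = adj.shape[0]
    return adj.sum() / (p * (p - 1))

def permutation_test(expr1, expr2, n_perm=100, seed=0):
    """Permute condition labels to judge whether the observed difference in the
    descriptor between the two estimated networks exceeds estimation variability."""
    rng = np.random.default_rng(seed)
    observed = abs(edge_density(estimate_network(expr1)) - edge_density(estimate_network(expr2)))
    pooled, n1 = np.vstack([expr1, expr2]), len(expr1)
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        diff = abs(edge_density(estimate_network(pooled[idx[:n1]]))
                   - edge_density(estimate_network(pooled[idx[n1:]])))
        hits += diff >= observed
    return (hits + 1) / (n_perm + 1)

# Toy usage: expression for 10 genes in two "subtypes" of 60 samples each.
rng = np.random.default_rng(9)
print("p-value:", permutation_test(rng.standard_normal((60, 10)),
                                   rng.standard_normal((60, 10)), n_perm=50))
```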
A Computationally Efficient Approach for Modeling Complex and Big Survival Data
Abstract
Modern data collection techniques have resulted in an increasing number of big clustered time-to-event data sets, wherein patients are often observed from a large number of healthcare providers. Semiparametric frailty models are a flexible and powerful tool for modeling clustered time-to-event data. In this manuscript, we first provide a computationally efficient approach based on a minimization–maximization algorithm to fit semiparametric frailty models in large-scale settings. We then extend the proposed method to incorporate complex data structures such as time-varying effects, for which many existing methods fail because of lack of computational power. The finite-sample properties and the utility of the proposed method are examined through an extensive simulation study and an analysis of the national kidney transplant data.
Kevin He, Yanming Li, Qingyi Wei, Yi Li
Tests of Concentration for Low-Dimensional and High-Dimensional Directional Data
Abstract
We consider asymptotic inference for the concentration of directional data. More precisely, we propose tests for concentration (1) in the low-dimensional case where the sample size n goes to infinity and the dimension p remains fixed, and (2) in the high-dimensional case where both n and p become arbitrarily large. To the best of our knowledge, the tests we provide are the first procedures for concentration that are valid in the (n, p)-asymptotic framework. Throughout, we consider parametric FvML tests, which are guaranteed to asymptotically meet the nominal level constraint under FvML distributions only, as well as “pseudo-FvML” versions of such tests, which asymptotically meet the nominal level constraint within the whole class of rotationally symmetric distributions. We conduct a Monte Carlo study to check our asymptotic results and to investigate the finite-sample behavior of the proposed tests.
Christine Cutting, Davy Paindaveine, Thomas Verdebout
Nonparametric Testing for Heterogeneous Correlation
Abstract
In the presence of weak overall correlation, it may be useful to investigate if the correlation is significantly and substantially more pronounced over a subpopulation. Two different testing procedures are compared. Both are based on the rankings of the values of two variables from a data set with a large number n of observations. The first maintains its level against Gaussian copulas; the second adapts to general alternatives in the sense that the number of parameters used in the test grows with n. An analysis of wine quality illustrates how the methods detect heterogeneity of association between chemical properties of the wine, which are attributable to a mix of different cultivars.
Stephen Bamattre, Rex Hu, Joseph S. Verducci

Statistical Learning and Applications

Frontmatter
Optimal Shrinkage Estimation in Heteroscedastic Hierarchical Linear Models
Abstract
Shrinkage estimators have had a profound impact in statistics and in scientific and engineering applications. In this article, we consider shrinkage estimation in the presence of linear predictors. We formulate two heteroscedastic hierarchical regression models and study optimal shrinkage estimators in each model. A class of shrinkage estimators, both parametric and semiparametric, based on an unbiased risk estimate (URE) is proposed and is shown to be (asymptotically) optimal under mean squared error loss in each model. A simulation study is conducted to compare the performance of the proposed methods with existing shrinkage estimators. We also apply the method to real data and obtain encouraging and interesting results.
S. C. Kou, Justin J. Yang
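To convey the flavor of URE-based shrinkage in the simplest heteroscedastic setting (a common location rather than the chapter's linear predictors), here is a hedged sketch: Y_i ~ N(theta_i, A_i) with known A_i, shrinkage weights b_i = A_i/(A_i + lambda), and (lambda, mu) chosen by minimizing an unbiased estimate of the mean squared error. The simulated data and optimizer are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

# Heteroscedastic normal-means setting: Y_i ~ N(theta_i, A_i) with known variances A_i.
rng = np.random.default_rng(3)
n = 500
A = rng.uniform(0.1, 2.0, n)                  # known sampling variances
theta = rng.normal(1.0, 0.7, n)               # true means
Y = rng.normal(theta, np.sqrt(A))

def ure(params):
    """Unbiased risk estimate for shrinking Y toward mu with weights b = A / (A + lam)."""
    lam, mu = np.exp(params[0]), params[1]    # lam > 0 via log-parameterization
    b = A / (A + lam)
    return np.mean(b**2 * (Y - mu)**2 + (1 - 2 * b) * A)

opt = minimize(ure, x0=[0.0, Y.mean()], method="Nelder-Mead")
lam_hat, mu_hat = np.exp(opt.x[0]), opt.x[1]
b = A / (A + lam_hat)
theta_hat = (1 - b) * Y + b * mu_hat          # URE-based shrinkage estimate
print("loss of Y:", np.mean((Y - theta)**2),
      "loss of shrinkage estimate:", np.mean((theta_hat - theta)**2))
```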
High Dimensional Data Analysis: Integrating Submodels
Abstract
We consider efficient prediction in sparse high dimensional data. In high dimensional settings where d ≫ n, many penalized regularization strategies have been suggested for simultaneous variable selection and estimation. However, different strategies yield different submodels with d_i < n, where d_i is the number of predictors included in the i-th submodel. Some procedures may select a submodel with a larger number of predictors than others. Due to the trade-off between model complexity and prediction accuracy, statistical inference after model selection becomes extremely important and challenging in high dimensional data analysis. For this reason, we suggest shrinkage and pretest strategies to improve the prediction performance of two selected submodels. Such a pretest and shrinkage strategy is constructed by shrinking an overfitted model estimator in the direction of an underfitted model estimator. Numerical studies indicate that our post-selection pretest and shrinkage strategy improves the prediction performance of the selected submodels.
Syed Ejaz Ahmed, Bahadır Yüzbaşı
High-Dimensional Classification for Brain Decoding
Abstract
Brain decoding involves determining a subject’s cognitive state, or an associated stimulus, from functional neuroimaging data measuring brain activity. In this setting, the cognitive state is typically characterized by an element of a finite set, and the neuroimaging data comprise voluminous amounts of spatiotemporal data measuring some aspect of the neural signal. The associated statistical problem is one of classification from high-dimensional data. We explore the use of functional principal component analysis, mutual information networks, and persistent homology for examining the data through exploratory analysis and for constructing features characterizing the neural signal for brain decoding. We review each approach from this perspective, and we incorporate the features into a classifier based on symmetric multinomial logistic regression with elastic net regularization. The approaches are illustrated in an application where the task is to infer, from brain activity measured with magnetoencephalography (MEG), the type of video stimulus shown to a subject.
Nicole Croteau, Farouk S. Nathoo, Jiguo Cao, Ryan Budney
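The classification step described above, multinomial logistic regression with an elastic net penalty, can be sketched with scikit-learn. The feature matrix below is simulated stand-in data rather than the functional PC scores, network measures, or persistent-homology features used in the chapter, and the penalty settings are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for extracted features: 200 trials, 150 features, 3 stimulus classes.
rng = np.random.default_rng(4)
X = rng.standard_normal((200, 150))
y = rng.integers(3, size=200)
X[y == 1, :10] += 1.0                         # give two classes a weak signal
X[y == 2, 10:20] -= 1.0

# Multinomial logistic regression with an elastic net penalty (saga solver).
clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga",
                       l1_ratio=0.5, C=0.5, max_iter=5000),
)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```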
Unsupervised Bump Hunting Using Principal Components
Abstract
Principal Components Analysis is a widely used technique for dimension reduction and characterization of variability in multivariate populations. Our interest lies in studying when and why the rotation to principal components can be used effectively within a response-predictor set relationship in the context of mode hunting. Specifically focusing on the Patient Rule Induction Method (PRIM), we first develop a fast version of this algorithm (fastPRIM) under normality which facilitates the theoretical studies to follow. Using basic geometrical arguments, we then demonstrate how the Principal Components rotation of the predictor space alone can in fact generate improved mode estimators. Simulation results are used to illustrate our findings.
Daniel A. Díaz-Pachón, Jean-Eudes Dazard, J. Sunil Rao
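For readers unfamiliar with PRIM, the sketch below shows basic box peeling applied after a principal-components rotation of the predictors. It is not the chapter's fastPRIM algorithm; the peeling fraction, stopping rule, and toy data are illustrative choices made only to show the rotate-then-peel idea.

```python
import numpy as np
from sklearn.decomposition import PCA

def prim_peel(X, y, peel_frac=0.05, min_support=0.05):
    """Basic PRIM peeling: at each step, trim the thin slice (along one coordinate)
    whose removal most increases the mean response inside the remaining box."""
    inside = np.ones(len(y), dtype=bool)
    while inside.mean() > min_support:
        best_gain, best_mask = 0.0, None
        current_mean = y[inside].mean()
        for j in range(X.shape[1]):
            lo, hi = np.quantile(X[inside, j], [peel_frac, 1 - peel_frac])
            for mask in (inside & (X[:, j] >= lo), inside & (X[:, j] <= hi)):
                if mask.sum() > 0 and y[mask].mean() - current_mean > best_gain:
                    best_gain, best_mask = y[mask].mean() - current_mean, mask
        if best_mask is None:          # no peel improves the box mean: stop
            break
        inside = best_mask
    return inside

# Toy example: the response peaks in one region of a correlated predictor space.
rng = np.random.default_rng(6)
X = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 5))
y = np.exp(-np.sum((X - 1.0) ** 2, axis=1) / 4.0) + 0.1 * rng.standard_normal(1000)

# Rotating the predictors to principal components before peeling aligns the box
# edges with the main directions of predictor variability.
Z = PCA().fit_transform(X)
box = prim_peel(Z, y)
print("final box support:", round(box.mean(), 3),
      "mean response inside:", round(y[box].mean(), 3))
```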
Identifying Gene–Environment Interactions Associated with Prognosis Using Penalized Quantile Regression
Abstract
In the omics era, it has been well recognized that for complex traits and outcomes, interactions between genetic and environmental factors (i.e., G×E interactions) have important implications beyond the main effects. Most existing interaction analyses have focused on continuous and categorical traits. Prognosis is of essential importance for complex diseases; however, being significantly more complex, prognosis outcomes have been less studied. In existing interaction analyses of prognosis outcomes, the most common practice is to fit marginal (semi)parametric models (for example, Cox) using likelihood-based estimation and then identify important interactions based on significance level. Such an approach has limitations. First, data contamination is not uncommon, and with likelihood-based estimation even a single contaminated observation can result in severely biased estimation and misleading conclusions. Second, when the sample size is not large, the significance-based approach may not be reliable. To overcome these limitations, in this study we adopt quantile-based estimation, which is robust to data contamination. Two techniques are adopted to accommodate right censoring. For identifying important interactions, we adopt penalization as an alternative to significance level. An efficient computational algorithm is developed. Simulation shows that the proposed method can significantly outperform the alternative. We analyze a lung cancer prognosis study with gene expression measurements.
Guohua Wang, Yinjun Zhao, Qingzhao Zhang, Yangguang Zang, Sanguo Zhang, Shuangge Ma
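A rough illustration of penalized quantile regression for G×E interactions, ignoring the right censoring that the chapter handles with dedicated techniques: median regression with an L1 penalty over environmental main effects, genetic main effects, and their interaction terms. It uses scikit-learn's QuantileRegressor (available in recent versions), and the simulated data and penalty level alpha are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import QuantileRegressor

# Toy prognosis-style data: G genes, E environmental factors, and pairwise
# G x E interaction terms as predictors of a continuous outcome (no censoring here).
rng = np.random.default_rng(5)
n, n_genes, n_env = 300, 50, 3
G = rng.standard_normal((n, n_genes))
E = rng.standard_normal((n, n_env))
GE = np.hstack([G * E[:, [j]] for j in range(n_env)])   # interaction terms
X = np.hstack([E, G, GE])
y = 0.5 * E[:, 0] + G[:, 0] + 1.5 * G[:, 0] * E[:, 0] + rng.standard_normal(n)

# Median regression (quantile = 0.5) with an L1 penalty: robust to outliers in y,
# while the penalty performs selection over main effects and interactions.
fit = QuantileRegressor(quantile=0.5, alpha=0.05).fit(X, y)
nonzero = np.flatnonzero(np.abs(fit.coef_) > 1e-8)
print("number of selected terms:", len(nonzero))
```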
A Mixture of Variance-Gamma Factor Analyzers
Abstract
The mixture of factor analyzers model is extended to variance-gamma mixtures to facilitate flexible clustering of high-dimensional data. The variance-gamma formulation utilized is a special and limiting case of the generalized hyperbolic distribution. Parameter estimation for these mixtures is carried out via an alternating expectation-conditional maximization algorithm and relies on convenient expressions for expected values of the generalized inverse Gaussian distribution. The Bayesian information criterion is used to select the number of latent factors. The mixture of variance-gamma factor analyzers model is illustrated on a well-known breast cancer data set. Finally, the place of variance-gamma mixtures within the growing body of literature on non-Gaussian mixtures is considered.
Sharon M. McNicholas, Paul D. McNicholas, Ryan P. Browne
Metadata
Title
Big and Complex Data Analysis
Edited by
S. Ejaz Ahmed
Copyright Year
2017
Electronic ISBN
978-3-319-41573-4
Print ISBN
978-3-319-41572-7
DOI
https://doi.org/10.1007/978-3-319-41573-4