2018 | Book

Handbook of Big Data Analytics

Edited by: Prof. Dr. Wolfgang Karl Härdle, Prof. Henry Horng-Shing Lu, Prof. Xiaotong Shen

Publisher: Springer International Publishing

Book series: Springer Handbooks of Computational Statistics

About this Book

Addressing a broad range of big data analytics in cross-disciplinary applications, this essential handbook focuses on the statistical prospects offered by recent developments in this field. To do so, it covers statistical methods for high-dimensional problems, algorithmic designs, computation tools, analysis flows and the software-hardware co-designs that are needed to support insightful discoveries from big data. The book is primarily intended for statisticians, computer experts, engineers and application developers interested in using big data analytics with statistics. Readers should have a solid background in statistics and computer science.

Table of Contents

Frontmatter

Overview

Frontmatter
Chapter 1. Statistics, Statisticians, and the Internet of Things
Abstract
Within the overall rubric of big data, one emerging subset holds particular promise, peril, and attraction. Machine-generated traffic from sensors, data logs, and the like, transmitted using Internet practices and principles, is being referred to as the “Internet of Things” (IoT). Understanding, handling, and analyzing this type of data will stretch existing tools and techniques, thus providing a proving ground for other disciplines to adopt and adapt new methods and concepts. In particular, new tools will be needed to analyze data in motion rather than data at rest, and there are consequences of having constant or near-constant readings from the ground-truth phenomenon as opposed to numbers at a remove from their origin. Both machine learning and traditional statistical approaches will coevolve rapidly given the economic forces, national security implications, and wide public benefit of this new area of investigation. At the same time, data practitioners will be exposed to the possibility of privacy breaches, accidents causing bodily harm, and other concrete consequences of getting things wrong in theory and/or practice. We contend that the physical instantiation of data practice in the IoT means that statisticians and other practitioners may well be seeing the origins of a post-big data era insofar as the traditional abstractions of numbers from ground truth are attenuated and in some cases erased entirely.
John M. Jordan, Dennis K. J. Lin
Chapter 2. Cognitive Data Analysis for Big Data
Abstract
Cognitive data analysis (CDA) automates and adds cognitive processes to data analysis so that the business user or data analyst can gain insights from advanced analytics. CDA is especially important in the age of big data, where the data are so complex, including both structured and unstructured sources, that it is impossible to manually examine all possible combinations. As a cognitive computing system, CDA does not simply take over the entire process. Instead, CDA interacts with the user and learns from the interactions. This chapter reviews IBM Corporation’s Cross Industry Standard Process for Data Mining (CRISP-DM) (IBM SPSS Modeler CRISP-DM guide, 2011) as a precursor of CDA. Then, continuing to develop the ideas set forth by Shyr and Spisic (“Automated data analysis for Big Data.” WIREs Comp Stats 6: 359–366, 2014), this chapter defines a new three-stage CDA process. Each stage (Data Preparation, Automated Modeling, and Application of Results) is discussed in detail. The Data Preparation stage alleviates or eliminates the data preparation burden on the user by including smart technologies such as natural language query and metadata discovery. This stage prepares the data for specific and appropriate analyses in the Automated Modeling stage, which performs descriptive as well as predictive analytics and presents the user with starting points and recommendations for exploration. Finally, the Application of Results stage considers the user’s purpose, which may be to directly gain insights for smarter decisions and better business outcomes or to deploy the predictive models in an operational system.
Jing Shyr, Jane Chu, Mike Woods

Methodology

Frontmatter
Chapter 3. Statistical Leveraging Methods in Big Data
Abstract
With the advances in science and technology over the past decade, big data have become ubiquitous in all fields. The exponential growth of big data significantly outpaces the increase in storage and computational capacity of high-performance computers. The challenge of analyzing big data calls for innovative analytical and computational methods that make better use of currently available computing power. An emerging powerful family of methods for effectively analyzing big data is statistical leveraging. In these methods, one first takes a random subsample from the original full sample and then uses the subsample as a surrogate for any computation and estimation of interest. The key to the success of statistical leveraging methods is to construct a data-adaptive sampling probability distribution that gives preference to those data points that are influential for model fitting and statistical inference. In this chapter, we review recent developments in statistical leveraging methods. In particular, we focus on various algorithms for constructing the subsampling probability distribution and on a coherent theoretical framework for investigating their estimation properties and computational complexity. Simulation studies and real data examples are presented to demonstrate applications of the methodology.
Xinlian Zhang, Rui Xie, Ping Ma
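As a minimal sketch of the leveraging idea described in this abstract (not the authors' implementation), the following Python code subsamples rows of a linear regression problem with probabilities proportional to their statistical leverage scores and refits on the reweighted subsample; the sample sizes and data are purely illustrative.

```python
import numpy as np

def leverage_subsample_ols(X, y, r, rng=None):
    """Leverage-based subsampling for OLS: draw r rows with probabilities
    proportional to their leverage scores, then solve a weighted least
    squares problem on the subsample."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    # Leverage scores h_i are squared row norms of U, where X = U S V^T.
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    h = np.sum(U**2, axis=1)                 # leverage scores, sum to p
    prob = h / h.sum()                       # data-adaptive sampling distribution
    idx = rng.choice(n, size=r, replace=True, p=prob)
    w = 1.0 / (r * prob[idx])                # inverse-probability weights
    Xs = X[idx] * np.sqrt(w)[:, None]
    ys = y[idx] * np.sqrt(w)
    beta, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta

# Illustration: 100,000 observations, 10 predictors, subsample of 1,000 rows.
rng = np.random.default_rng(0)
X = rng.standard_normal((100_000, 10))
beta_true = rng.standard_normal(10)
y = X @ beta_true + rng.standard_normal(100_000)
print(leverage_subsample_ols(X, y, r=1000, rng=1))
```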
Chapter 4. Scattered Data and Aggregated Inference
Abstract
Scattered Data and Aggregated Inference (SDAI) represents a class of problems where data cannot be stored at a centralized location while modeling and inference are pursued. Distributed statistical inference is a technique for tackling one type of such problems and has recently attracted enormous attention. Much existing work focuses on the averaging estimator, e.g., Zhang et al. (2013) among many others. In this chapter, we propose a one-step approach to enhance a simple-averaging-based distributed estimator. We derive the corresponding asymptotic properties of the newly proposed estimator and find that it enjoys the same asymptotic properties as the centralized estimator. The proposed one-step approach requires merely one additional round of communication relative to the averaging estimator, so the extra communication burden is insignificant. In finite-sample cases, numerical examples show that the proposed estimator outperforms the simple averaging estimator by a large margin in terms of mean squared error. A potential application of the one-step approach is that one can use multiple machines to speed up large-scale statistical inference with little compromise in the quality of estimators. The proposed method becomes even more valuable when data are only available on distributed machines with limited communication bandwidth. We discuss other types of SDAI problems at the end.
Xiaoming Huo, Cheng Huang, Xuelei Sherry Ni
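A minimal sketch of the one-step recipe described above, written here for logistic regression across several machines: each machine computes a local maximum likelihood estimate, the estimates are averaged, and a single aggregated Newton step is taken using one extra round of gradient and Hessian communication. This is an illustration of the general idea under these assumptions, not the authors' code.

```python
import numpy as np

def local_logistic_mle(X, y, iters=25):
    """Newton's method for logistic regression on a single machine's data."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)                          # score vector
        info = (X * (p * (1 - p))[:, None]).T @ X     # observed information
        beta = beta + np.linalg.solve(info, grad)
    return beta

def one_step_distributed(Xs, ys):
    """Simple averaging across machines followed by one aggregated Newton step."""
    beta_bar = np.mean([local_logistic_mle(X, y) for X, y in zip(Xs, ys)], axis=0)
    # Extra communication round: each machine sends its gradient and Hessian
    # evaluated at the averaged estimate.
    g_tot, H_tot = 0.0, 0.0
    for X, y in zip(Xs, ys):
        p = 1.0 / (1.0 + np.exp(-X @ beta_bar))
        g_tot = g_tot + X.T @ (y - p)
        H_tot = H_tot + (X * (p * (1 - p))[:, None]).T @ X
    return beta_bar + np.linalg.solve(H_tot, g_tot)
```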
Chapter 5. Nonparametric Methods for Big Data Analytics
Abstract
Nonparametric methods provide more flexible tools than parametric methods for modeling complex systems and discovering nonlinear patterns hidden in data. Traditional nonparametric methods are challenged by modern high-dimensional data due to the curse of dimensionality. Over the past two decades, there have been rapid advances in nonparametrics to accommodate the analysis of large-scale and high-dimensional data. A variety of cutting-edge nonparametric methodologies, scalable algorithms, and state-of-the-art computational tools have been designed for model estimation, variable selection, and statistical inference in high-dimensional regression and classification problems. This chapter provides an overview of recent advances in nonparametrics for big data analytics.
Hao Helen Zhang
Chapter 6. Finding Patterns in Time Series
Abstract
Large datasets are often time series data, and such datasets present challenging problems that arise from the passage of time reflected in the datasets. A problem of current interest is clustering and classification of multiple time series. When various time series are fitted to models, the different time series can be grouped into clusters based on the fitted models. If there are different identifiable classes of time series, the fitted models can be used to classify new time series.
For massive time series datasets, any assumption of stationarity is not likely to be met. Any useful time series model that extends over a lengthy time period must either be very weak, that is, a model in which the signal-to-noise ratio is relatively small, or else must be very complex with many parameters. Hence, a common approach to model building in time series is to break the series into separate regimes and to identify an adequate local model within each regime. In this case, the problem of clustering or classification can be addressed by use of sequential patterns of the models for the separate regimes.
In this chapter, we discuss methods for identifying changepoints in a univariate time series. We will emphasize a technique called alternate trends smoothing.
After identification of changepoints, we briefly discuss the problem of defining patterns. The objectives of defining and identifying patterns are twofold: to cluster and/or to classify sets of time series, and to predict future values or trends in a time series.
James E. Gentle, Seunghye J. Wilson
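The chapter's alternate trends smoothing technique is not reproduced here; as a rough illustration of identifying changepoints between local trend regimes, the sketch below uses generic binary segmentation with a straight-line fit per segment. The penalty value and minimum segment length are arbitrary assumptions chosen only to make the example run.

```python
import numpy as np

def sse_linear(y):
    """Sum of squared errors of a straight-line (trend) fit to y."""
    t = np.arange(len(y))
    resid = y - np.poly1d(np.polyfit(t, y, 1))(t)
    return float(resid @ resid)

def binary_segmentation(y, min_len=20, penalty=50.0):
    """Recursively split the series where a two-trend fit beats a single trend
    by more than `penalty`; returns sorted changepoint indices."""
    def split(lo, hi):
        if hi - lo < 2 * min_len:
            return []
        base = sse_linear(y[lo:hi])
        best_gain, best_k = 0.0, None
        for k in range(lo + min_len, hi - min_len):
            gain = base - sse_linear(y[lo:k]) - sse_linear(y[k:hi])
            if gain > best_gain:
                best_gain, best_k = gain, k
        if best_k is None or best_gain < penalty:
            return []
        return split(lo, best_k) + [best_k] + split(best_k, hi)
    return split(0, len(y))
```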
Chapter 7. Variational Bayes for Hierarchical Mixture Models
Abstract
In recent years, sparse classification problems have emerged in many fields of study. Finite mixture models have been developed to facilitate Bayesian inference where parameter sparsity is substantial. Classification with finite mixture models is based on the posterior expectation of latent indicator variables. These quantities are typically estimated using the expectation-maximization (EM) algorithm in an empirical Bayes approach or Markov chain Monte Carlo (MCMC) in a fully Bayesian approach. MCMC is limited in applicability where high-dimensional data are involved because its sampling-based nature leads to slow computations and hard-to-monitor convergence. In this chapter, we investigate the feasibility and performance of variational Bayes (VB) approximation in a fully Bayesian framework. We apply the VB approach to fully Bayesian versions of several finite mixture models that have been proposed in bioinformatics, and find that it achieves desirable speed and accuracy in sparse classification with finite mixture models for high-dimensional data.
Muting Wan, James G. Booth, Martin T. Wells
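To make the classification rule concrete: for a fitted Gaussian mixture, the posterior expectation of the latent indicator variables reduces to the familiar component responsibilities. The short sketch below computes them for given mixture parameters; it illustrates the classification step only, not the variational Bayes or MCMC fitting discussed in the chapter.

```python
import numpy as np
from scipy.stats import norm

def posterior_indicator_probs(x, pi, mu, sigma):
    """Posterior probability that each observation belongs to each mixture
    component: responsibilities r[i, k] proportional to pi_k * N(x_i | mu_k, sigma_k^2)."""
    dens = np.stack([pi[k] * norm.pdf(x, mu[k], sigma[k])
                     for k in range(len(pi))], axis=1)
    return dens / dens.sum(axis=1, keepdims=True)
```

Each observation would then be assigned to the component with the largest posterior probability.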
Chapter 8. Hypothesis Testing for High-Dimensional Data
Abstract
We present a systematic theory for tests of means of high-dimensional data. Our testing procedure is based on an invariance principle which provides distributional approximations of functionals of non-Gaussian vectors by those of Gaussian ones. Unlike the widely used Bonferroni approach, our procedure is dependence-adjusted and has asymptotically correct size and power. To obtain cutoff values for our test, we propose a half-sampling method which avoids estimating the underlying covariance matrix of the random vectors. The latter method is shown via extensive simulations to have excellent performance.
Wei Biao Wu, Zhipeng Lou, Yuefeng Han
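One plausible reading of the half-sampling idea, sketched for a max-type test of a zero mean vector: the difference of two half-sample means is centered regardless of the true mean and has the same scale as the full-sample mean, so it can calibrate the cutoff without estimating the covariance matrix. The exact procedure in the chapter may differ; this is an assumption-laden illustration only.

```python
import numpy as np

def max_stat(m):
    """Max-type statistic: largest absolute coordinate of a mean-type vector."""
    return np.max(np.abs(m))

def half_sampling_test(X, B=500, alpha=0.05, rng=0):
    """Test H0: E[X_i] = 0 for p-dimensional observations (rows of X).
    Cutoffs come from repeatedly splitting the sample into two halves; the
    centered difference of half-sample means mimics the null distribution
    without estimating the p x p covariance matrix."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    T = max_stat(np.sqrt(n) * X.mean(axis=0))
    null_stats = np.empty(B)
    for b in range(B):
        idx = rng.permutation(n)
        h1, h2 = idx[: n // 2], idx[n // 2:]
        diff = X[h1].mean(axis=0) - X[h2].mean(axis=0)
        null_stats[b] = max_stat(np.sqrt(n) * diff / 2)
    cutoff = np.quantile(null_stats, 1 - alpha)
    return T, cutoff, T > cutoff
```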
Chapter 9. High-Dimensional Classification
Abstract
There are three fundamental goals in constructing a good high-dimensional classifier: high accuracy, interpretable feature selection, and efficient computation. In the past 15 years, several popular high-dimensional classifiers have been developed and studied in the literature. These classifiers can be roughly divided into two categories: sparse penalized margin-based classifiers and sparse discriminant analysis. In this chapter we give a comprehensive review of these popular high-dimensional classifiers.
Hui Zou
Chapter 10. Analysis of High-Dimensional Regression Models Using Orthogonal Greedy Algorithms
Abstract
We begin by reviewing recent results of Ing and Lai (Stat Sin 21:1473–1513, 2011) on the statistical properties of the orthogonal greedy algorithm (OGA) in high-dimensional sparse regression models with independent observations. In particular, when the regression coefficients are absolutely summable, the conditional mean squared prediction error and the empirical norm of OGA derived by Ing and Lai (Stat Sin 21:1473–1513, 2011) are introduced. We then explore the performance of OGA under more general sparsity conditions. Finally, we obtain the convergence rate of OGA in high-dimensional time series models, and illustrate the advantage of our results compared to those established for Lasso by Basu and Michailidis (Ann Stat 43:1535–1567, 2015) and Wu and Wu (Electron J Stat 10:352–379, 2016).
Hsiang-Ling Hsu, Ching-Kang Ing, Tze Leung Lai
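A compact sketch of the orthogonal greedy algorithm itself (not of the theory reviewed in the chapter): at each step the predictor most correlated with the current residual is added, and the fit is re-orthogonalized by least squares on the selected set.

```python
import numpy as np

def oga(X, y, m):
    """Orthogonal greedy algorithm: pick the predictor most correlated with
    the current residual, then refit OLS on all selected predictors."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    selected, resid = [], yc.copy()
    for _ in range(m):
        corr = np.abs(Xc.T @ resid)
        corr[selected] = -np.inf            # do not reselect a predictor
        j = int(np.argmax(corr))
        selected.append(j)
        beta, *_ = np.linalg.lstsq(Xc[:, selected], yc, rcond=None)
        resid = yc - Xc[:, selected] @ beta
    return selected, beta
```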
Chapter 11. Semi-supervised Smoothing for Large Data Problems
Abstract
This chapter describes some recent developments in nonparametric semi-supervised regression and is intended for readers with a background in statistics, computer science, or data science who are familiar with local kernel smoothing (Hastie et al., The elements of statistical learning (data mining, inference and prediction), chapter 6. Springer, Berlin, 2009). In many applications, response data require substantially more effort to obtain than feature data. Semi-supervised learning approaches are designed to explicitly train a classifier or regressor using all the available responses and the full feature data. This presentation focuses on local kernel regression methods in semi-supervised learning and provides a good starting point for understanding semi-supervised methods in general.
Mark Vere Culp, Kenneth Joseph Ryan, George Michailidis
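As a rough illustration of local kernel smoothing with unlabeled data, the sketch below iterates a row-normalized Gaussian-kernel smoother over labeled and unlabeled points while clamping the observed responses. It is a generic semi-supervised smoother under these assumptions, not the specific estimator developed in the chapter.

```python
import numpy as np

def rbf_weights(X, bandwidth):
    """Gaussian kernel weight matrix over all (labeled + unlabeled) points."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def semi_supervised_smoother(X, y, labeled, bandwidth=1.0, iters=100):
    """Iterative kernel smoothing over labeled and unlabeled points.

    X: all feature rows; labeled: indices of labeled rows; y: responses for
    those rows, in the same order as `labeled`. Each pass replaces fitted
    values by kernel-weighted averages while keeping observed responses fixed."""
    W = rbf_weights(X, bandwidth)
    S = W / W.sum(axis=1, keepdims=True)     # row-normalized smoother matrix
    f = np.zeros(len(X))
    f[labeled] = y
    for _ in range(iters):
        f = S @ f
        f[labeled] = y                        # clamp observed responses
    return f
```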
Chapter 12. Inverse Modeling: A Strategy to Cope with Non-linearity
Abstract
In the big data era, discovering and modeling potentially non-linear relationships between predictors and responses may be one of the toughest challenges in modern data analysis. Most forward regression modeling procedures are seriously compromised by the curse of dimensionality. In this chapter, we show that the inverse modeling idea, which originated from Sliced Inverse Regression (SIR), can help us detect nonlinear relations effectively, and we survey a few recent advances, both algorithmic and theoretical, in which the inverse modeling idea leads to unforeseen benefits in nonlinear variable selection and nonparametric screening.
Qian Lin, Yang Li, Jun S. Liu
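The inverse modeling idea is easiest to see in the original SIR algorithm, sketched below: slice the response, average the standardized predictors within each slice, and take leading eigenvectors of the between-slice covariance as estimated directions. This is textbook SIR, offered only as background; the chapter's recent advances are not reproduced here.

```python
import numpy as np

def sir(X, y, n_slices=10, n_dirs=2):
    """Sliced Inverse Regression: estimate effective dimension reduction
    directions from slice means of the standardized predictors."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = np.cov(Xc, rowvar=False)
    # Standardize: Z = Xc Sigma^{-1/2}.
    evals, evecs = np.linalg.eigh(Sigma)
    Sigma_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = Xc @ Sigma_inv_half
    # Slice by the order of y and average Z within each slice.
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    M = np.zeros((p, p))
    for s in slices:
        m = Z[s].mean(axis=0)
        M += (len(s) / n) * np.outer(m, m)
    # Leading eigenvectors of M, mapped back to the original predictor scale.
    w, v = np.linalg.eigh(M)
    return Sigma_inv_half @ v[:, ::-1][:, :n_dirs]
```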
Chapter 13. Sufficient Dimension Reduction for Tensor Data
Abstract
With the rapid development of science and technology, large volumes of array data have been collected in areas such as genomics, finance, image processing, and Internet search. How to extract useful information from massive data is the key issue nowadays. Despite the urgent need for statistical tools to deal with such data, few methods can fully address the high-dimensional problem. In this chapter, we review the general setting of the sufficient dimension reduction framework and its generalization to tensor data. A tensor is a multi-way array, and its use is becoming increasingly important with the advancement of social and behavioral science, chemistry, and imaging technology. Vector-based statistical methods can be applied to tensor data by vectorizing a tensor into a vector. However, a vectorized tensor usually has a large dimension that may greatly exceed the number of samples. To preserve the tensor structure and reduce the dimensionality simultaneously, we revisit the tensor sufficient dimension reduction model and apply it to colorimetric sensor arrays. The tensor sufficient dimension reduction method is simple but powerful and exhibits competitive empirical performance in real data analysis.
Yiwen Liu, Xin Xing, Wenxuan Zhong
Chapter 14. Compressive Sensing and Sparse Coding
Abstract
Compressive sensing is a technique for acquiring signals at rates proportional to the amount of information in the signal, and it does so by exploiting the sparsity of signals. This chapter discusses the fundamentals of compressive sensing and how it relates to sparse coding.
Kevin Chen, H. T. Kung
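A minimal end-to-end sketch of the idea: acquire a sparse signal through far fewer random linear measurements than its length, then recover it by l1-regularized least squares solved with iterative soft-thresholding (ISTA). The step size, penalty, and iteration count are illustrative choices, not prescriptions from the chapter.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(A, y, lam=0.05, iters=500):
    """Iterative soft-thresholding for min_x 0.5*||Ax - y||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft_threshold(x + A.T @ (y - A @ x) / L, lam / L)
    return x

# Illustration: recover a 10-sparse signal of length 400 from 100 random projections.
rng = np.random.default_rng(0)
n, m, k = 400, 100, 10
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m, n)) / np.sqrt(m)
y = A @ x_true
x_hat = ista(A, y, lam=0.01, iters=2000)
```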
Chapter 15. Bridging Density Functional Theory and Big Data Analytics with Applications
Abstract
The framework of density functional theory (DFT) reveals both strong suitability and compatibility for investigating large-scale systems in the big data regime. By technically mapping the data space into physically meaningful bases, the chapter provides a simple procedure to formulate global Lagrangian and Hamiltonian density functionals to circumvent the emerging challenges of large-scale data analysis. The informative features of mixed datasets and the corresponding clustering morphologies can then be visually elucidated by evaluating the global density functionals. Simulation results of data clustering illustrate that the proposed methodology provides an alternative route for analyzing data characteristics with abundant physical insight. For a comprehensive demonstration in a high-dimensional problem without prior ground truth, the developed density functionals were also applied to the post-processing of magnetic resonance imaging (MRI), where better tumor recognition was achieved in the T1 post-contrast and T2 modes. Post-processing MRI with the proposed DFT-based algorithm could thus aid scientists in the judgment of clinical pathology. Finally, these successful high-dimensional data analyses show that the proposed DFT-based algorithm has the potential to serve as a framework for investigating large-scale complex systems and for applications in high-dimensional biomedical image processing.
Chien-Chang Chen, Hung-Hui Juan, Meng-Yuan Tsai, Henry Horng-Shing Lu

Software

Frontmatter
Chapter 16. Q3-D3-LSA: D3.js and Generalized Vector Space Models for Statistical Computing
Abstract
QuantNet is an integrated web-based environment consisting of different types of statistics-related documents and program codes. Its goal is to create reproducibility and offer a platform for sharing validated knowledge native to the social web. To increase information retrieval (IR) efficiency, there is a need to incorporate semantic information. Three text mining models are examined: the vector space model (VSM), the generalized VSM (GVSM), and latent semantic analysis (LSA). LSA has been successfully used for IR purposes as a technique for capturing semantic relations between terms and inserting them into the similarity measure between documents. Our results show that different model configurations allow adapted similarity-based document clustering and knowledge discovery. In particular, different LSA configurations together with hierarchical clustering reveal good results under M³ evaluation. QuantNet and the corresponding Data-Driven Documents (D3) based visualization can be found and applied at http://quantlet.de. The driving technology behind it is Q3-D3-LSA, the combination of the “GitHub API based QuantNet Mining infrastructure in R”, LSA, and the D3 implementation.
Lukas Borke, Wolfgang K. Härdle
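The LSA component can be sketched in a few lines: a truncated SVD of the term-document matrix gives low-dimensional document coordinates, and cosine similarity in that latent space drives clustering and retrieval. This is generic LSA, not the Q3-D3-LSA pipeline; term weighting (e.g., tf-idf) is assumed to have been applied beforehand.

```python
import numpy as np

def lsa_similarity(term_doc, k=50):
    """Latent semantic analysis: truncated SVD of the (weighted) term-document
    matrix, then cosine similarity between documents in the k-dim latent space."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    docs = (np.diag(s[:k]) @ Vt[:k]).T          # one row of latent coordinates per document
    norms = np.linalg.norm(docs, axis=1, keepdims=True)
    docs = docs / np.clip(norms, 1e-12, None)
    return docs @ docs.T                        # cosine similarity matrix
```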
Chapter 17. A Tutorial on Libra: R Package for the Linearized Bregman Algorithm in High-Dimensional Statistics
Abstract
The R package Libra stands for the LInearized BRegman Algorithm in high-dimensional statistics. The Linearized Bregman Algorithm is a simple iterative procedure that generates sparse regularization paths of model estimation. This algorithm was first proposed in applied mathematics for image restoration and is particularly suitable for parallel implementation in large-scale problems. The limit of such an algorithm is a sparsity-restricted gradient descent flow, called the Inverse Scale Space, evolving along a parsimonious path of sparse models from the null model to overfitting ones. In sparse linear regression, the dynamics with early stopping regularization can provably reach the unbiased oracle estimator under nearly the same condition as LASSO, while the latter is biased. Despite its successful applications, proving the consistency of such dynamical algorithms remains largely open except for some recent progress on linear regression. In this tutorial, algorithmic implementations in the package are discussed for several widely used sparse models in statistics, including linear regression, logistic regression, and several graphical models (Gaussian, Ising, and Potts). Besides simulation examples, various applications are demonstrated with real-world datasets such as diabetes data, publications of COPSS award winners, as well as social networks of two Chinese classic novels, Journey to the West and Dream of the Red Chamber.
Jiechao Xiong, Feng Ruan, Yuan Yao
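For orientation, the linearized Bregman iteration for sparse linear regression can be written in a few lines: a gradient update on an auxiliary variable followed by a scaled soft-thresholding, which traces a regularization path from the null model toward overfitting ones. This generic Python sketch does not use the package's API; kappa and the step size alpha are tuning parameters chosen here only for illustration.

```python
import numpy as np

def soft(z, t=1.0):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def linearized_bregman_path(X, y, kappa=100.0, alpha=None, iters=2000):
    """Linearized Bregman iteration for sparse linear regression; returns the
    regularization path (one coefficient vector per iteration)."""
    n, p = X.shape
    if alpha is None:
        alpha = n / (kappa * np.linalg.norm(X, 2) ** 2)   # small step for stability
    z = np.zeros(p)
    beta = np.zeros(p)
    path = np.zeros((iters, p))
    for t in range(iters):
        z = z + (alpha / n) * (X.T @ (y - X @ beta))      # gradient step on z
        beta = kappa * soft(z, 1.0)                       # scaled soft-thresholding
        path[t] = beta
    return path
```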

Application

Frontmatter
Chapter 18. Functional Data Analysis for Big Data: A Case Study on California Temperature Trends
Abstract
In recent years, detailed historical records, remote sensing, genomics and medical imaging applications as well as the rise of the Internet-of-Things present novel data streams. Many of these data are instances where functions are more suitable data atoms than traditional multivariate vectors. Applied functional data analysis (FDA) presents a potentially fruitful but largely unexplored alternative analytics framework that can be incorporated directly into a general Big Data analytics suite. As an example, we present a modeling approach for the dynamics of a functional data set of climatic data. By decomposing functions via a functional principal component analysis and functional variance process analysis, a robust and informative characterization of the data can be derived; this provides insights into the relationship between the different modes of variation, their inherent variance process as well as their dependencies over time. The model is applied to historical data from the Global Historical Climatology Network in California, USA. The analysis reveals that climatic time-dependent information is jointly carried by the original processes as well as their noise/variance decomposition.
Pantelis Zenon Hadjipantelis, Hans-Georg Müller
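Below is a bare-bones version of the functional principal component decomposition referred to in the abstract, assuming curves observed densely on a common grid; the chapter's climate data require more careful smoothing and a variance-process analysis that is not shown here.

```python
import numpy as np

def functional_pca(curves, n_components=3):
    """Functional PCA on densely observed curves: rows are curves evaluated on
    a common grid; returns the mean function, eigenfunctions, scores, and the
    fraction of variance explained by each retained component."""
    mean_fn = curves.mean(axis=0)
    centered = curves - mean_fn
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    eigenfunctions = Vt[:n_components]                 # modes of variation
    scores = U[:, :n_components] * s[:n_components]    # per-curve FPC scores
    var_explained = s[:n_components] ** 2 / np.sum(s ** 2)
    return mean_fn, eigenfunctions, scores, var_explained
```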
Chapter 19. Bayesian Spatiotemporal Modeling for Detecting Neuronal Activation via Functional Magnetic Resonance Imaging
Abstract
We consider recent developments in Bayesian spatiotemporal models for detecting neuronal activation in fMRI experiments. A Bayesian approach typically results in complicated posterior distributions that can be of enormous dimension for a whole-brain analysis, thus posing a formidable computational challenge. Recently developed Bayesian approaches to detecting local activation have proved computationally efficient while requiring few modeling compromises. We review two such methods and implement them on a data set from the Human Connectome Project in order to show that, contrary to popular opinion, careful implementation of Markov chain Monte Carlo methods can be used to obtain reliable results in a matter of minutes.
Martin Bezener, Lynn E. Eberly, John Hughes, Galin Jones, Donald R. Musgrove
Chapter 20. Construction of Tight Frames on Graphs and Application to Denoising
Abstract
Given a neighborhood graph representation of a finite set of points \(x_i\in \mathbb{R}^d, i=1,\ldots,n,\) we construct a frame (redundant dictionary) for the space of real-valued functions defined on the graph. This frame is adapted to the underlying geometrical structure of the \(x_i\), has finitely many elements, and these elements are localized in frequency as well as in space. This construction follows the ideas of Hammond et al. (Appl Comput Harmon Anal 30:129–150, 2011), with the key point that we construct a tight (or Parseval) frame. This means we have a very simple, explicit reconstruction formula for every function f defined on the graph from the coefficients given by its scalar product with the frame elements. We use this representation in the setting of denoising where we are given noisy observations of a function f defined on the graph. By applying a thresholding method to the coefficients in the reconstruction formula, we define an estimate of f whose risk satisfies a tight oracle inequality.
Franziska Göbel, Gilles Blanchard, Ulrike von Luxburg
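As a simplified stand-in for the construction, the sketch below denoises a function on graph nodes by thresholding its coefficients in the eigenbasis of the graph Laplacian. An orthonormal basis is trivially a Parseval frame, so the reconstruction formula is exact, but the chapter builds a redundant, frequency-localized frame with better localization properties than this plain eigenbasis.

```python
import numpy as np

def graph_denoise(W, f_noisy, threshold):
    """Denoise a function on graph nodes by hard-thresholding its coefficients
    in the eigenbasis of the combinatorial graph Laplacian L = D - W."""
    L = np.diag(W.sum(axis=1)) - W              # graph Laplacian from weights W
    evals, U = np.linalg.eigh(L)
    coef = U.T @ f_noisy                        # analysis: scalar products with basis elements
    coef[np.abs(coef) < threshold] = 0.0        # hard thresholding of small coefficients
    return U @ coef                             # simple, explicit reconstruction
```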
Chapter 21. Beta-Boosted Ensemble for Big Credit Scoring Data
Abstract
In this work we present a novel ensemble model for the credit scoring problem. The main idea of the approach is to use separate beta-binomial distributions for each of the classes to generate balanced datasets, which are further used to construct the base learners that constitute the final ensemble model. The sampling procedure is performed on two separate ranking lists, one for each class, where the ranking is based on the probability of observing the positive class. Two strategies are considered in the study: one favors mining easy examples and the other forces good classification of hard cases. The proposed solutions are tested on two big datasets from the credit scoring domain.
Maciej Zieba, Wolfgang Karl Härdle
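The following is a loose, assumption-heavy sketch of the sampling idea as described in the abstract: within each class, observations are ranked by a pilot score, indices are drawn via Beta draws over the rank scale (so the shape parameters tilt sampling toward easy or hard cases), and the resulting balanced subsets train the base learners. The authors' exact beta-binomial scheme and choice of base learner may differ; the decision tree here is just a placeholder.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def beta_balanced_ensemble(X, y, scores, n_learners=10, m=1000, a=1.0, b=3.0, rng=0):
    """Ensemble built from balanced subsamples: within each class, observations
    are ranked by `scores` (probability of the positive class from a pilot
    model) and sampled through Beta(a, b) draws over the rank scale."""
    rng = np.random.default_rng(rng)
    learners = []
    for _ in range(n_learners):
        idx = []
        for cls in (0, 1):
            members = np.where(y == cls)[0]
            ranked = members[np.argsort(scores[members])]
            # Beta draws on [0, 1] mapped to ranks -> m indices per class.
            pos = (rng.beta(a, b, size=m) * (len(ranked) - 1)).astype(int)
            idx.append(ranked[pos])
        idx = np.concatenate(idx)
        learners.append(DecisionTreeClassifier(max_depth=5).fit(X[idx], y[idx]))
    return learners

def ensemble_predict(learners, X):
    """Average the positive-class probabilities of the base learners."""
    return np.mean([l.predict_proba(X)[:, 1] for l in learners], axis=0)
```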
Metadata
Title
Handbook of Big Data Analytics
Edited by
Prof. Dr. Wolfgang Karl Härdle
Prof. Henry Horng-Shing Lu
Prof. Xiaotong Shen
Copyright year
2018
Electronic ISBN
978-3-319-18284-1
Print ISBN
978-3-319-18283-4
DOI
https://doi.org/10.1007/978-3-319-18284-1
