
About this Book

This book provides an overview of the resources and research projects that are bringing Big Data and High Performance Computing (HPC) onto converging tracks. It demystifies Big Data and HPC for the reader by covering the primary resources, middleware, applications, and tools that enable the usage of HPC platforms for Big Data management and processing. Through interesting use-cases from traditional and non-traditional HPC domains, the book highlights the most critical challenges related to Big Data processing and management, and shows ways to mitigate them using HPC resources. Unlike most books on Big Data, it covers a variety of alternatives to Hadoop, and explains the differences between HPC platforms and Hadoop. Written by professionals and researchers in a range of departments and fields, this book is designed for anyone studying Big Data and its future directions. Those studying HPC will also find the content valuable.



Chapter 1. An Introduction to Big Data, High Performance Computing, High-Throughput Computing, and Hadoop

Recent advancements in the field of instrumentation, adoption of some of the latest Internet technologies and applications, and the declining cost of storing large volumes of data have enabled researchers and organizations to gather increasingly large datasets. Such vast datasets are precious due to the potential of discovering new knowledge and developing insights from them, and they are also referred to as "Big Data". While in a large number of domains Big Data is a newly found treasure that brings new challenges, various other domains have been handling such treasures for many years now using state-of-the-art resources, techniques, and technologies. The goal of this chapter is to provide an introduction to such resources, techniques, and technologies, namely High Performance Computing (HPC), High-Throughput Computing (HTC), and Hadoop. First, each of these topics is defined and discussed individually. These topics are then discussed further in light of how they enable short time-to-discovery and, hence, their importance in conquering Big Data.
Ritu Arora
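Of the three technologies introduced in this chapter, Hadoop is built around the MapReduce programming model. As a minimal, framework-free sketch of that model (plain Python standing in for Hadoop itself; all function names here are illustrative), a word count splits into a map phase, a shuffle, and a reduce phase:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group all emitted values by key, as the framework
    # would do between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the grouped values, here by summing the counts.
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data meets hpc", "big data and hadoop"]
counts = reduce_phase(shuffle(map_phase(docs)))
# counts["big"] == 2, counts["hpc"] == 1
```

In Hadoop the same three stages run distributed over HDFS blocks; the sketch only shows the data flow between the phases.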

Chapter 2. Using High Performance Computing for Conquering Big Data

The journey of Big Data begins at its collection stage, continues through analysis, culminates in valuable insights, and may finally end in dark archives. The management and analysis of Big Data through these various stages of its life cycle present challenges that can be addressed using High Performance Computing (HPC) resources and techniques. In this chapter, we present an overview of the various HPC resources available at open-science data centers that can be used to develop end-to-end solutions for the management and analysis of Big Data. We also present techniques from the HPC domain that can be used to solve Big Data problems in a scalable and performance-oriented manner. Using a case study, we demonstrate the impact of using HPC systems on the management and analysis of Big Data throughout its life cycle.
Antonio Gómez-Iglesias, Ritu Arora

Chapter 3. Data Movement in Data-Intensive High Performance Computing

The cost of executing a floating-point operation has been decreasing for decades at a much higher rate than the cost of moving data. Bandwidth and latency, the two key metrics that determine the cost of moving data, have degraded significantly relative to processor cycle time and execution rate. Despite the limitations of sub-micron processor technology and the end of Dennard scaling, this trend will continue in the short term, making data movement a performance-limiting factor and an energy/power-efficiency concern, even more so in the context of large-scale and data-intensive systems and workloads. This chapter gives an overview of the aspects of moving data across a system, from the storage system to the computing system and down to the node and processor level, with case studies and contributions from researchers at the San Diego Supercomputer Center, Oak Ridge National Laboratory, Pacific Northwest National Laboratory, and the University of Delaware.
Pietro Cicotti, Sarp Oral, Gokcen Kestor, Roberto Gioiosa, Shawn Strande, Michela Taufer, James H. Rogers, Hasan Abbasi, Jason Hill, Laura Carrington

Chapter 4. Using Managed High Performance Computing Systems for High-Throughput Computing

This chapter explores the issue of executing High-Throughput Computing (HTC) workflows on managed High Performance Computing (HPC) systems that have been tailored for the execution of "traditional" HPC applications. We first define data-oriented workflows and HTC, and then highlight some of the common hurdles to executing these workflows on shared HPC resources. We then look at Launcher, a tool for making large HTC workflows appear, from the HPC system's perspective, to be a "traditional" simulation workflow. Launcher's various features are described, including scheduling modes and extensions for use with Intel® Xeon Phi™ coprocessor cards.
Lucas A. Wilson
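The scheduling modes mentioned in the abstract can be illustrated with a small sketch. The two functions below are hypothetical, not Launcher's actual interface; they show two common static ways of assigning a pool of independent HTC task indices to worker processes inside a single HPC job:

```python
def block_schedule(num_tasks, num_workers, rank):
    # Block mode: each worker gets one contiguous slice of the task list.
    base, extra = divmod(num_tasks, num_workers)
    start = rank * base + min(rank, extra)
    size = base + (1 if rank < extra else 0)
    return list(range(start, start + size))

def interleaved_schedule(num_tasks, num_workers, rank):
    # Interleaved mode: worker k takes tasks k, k+W, k+2W, ...
    return list(range(rank, num_tasks, num_workers))

# 10 tasks spread over 3 workers:
# block:       [0,1,2,3] / [4,5,6] / [7,8,9]
# interleaved: [0,3,6,9] / [1,4,7] / [2,5,8]
```

Block assignment keeps each worker's tasks contiguous, which can help with input locality; interleaving balances load better when task cost drifts across the index range.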

Chapter 5. Accelerating Big Data Processing on Modern HPC Clusters

Modern HPC systems and their associated middleware (such as MPI and parallel file systems) have been exploiting advances in HPC technologies (multi-/many-core architectures, RDMA-enabled networking, and SSDs) for many years. However, Big Data processing and management middleware have not fully taken advantage of such technologies. These disparities are taking HPC and Big Data processing onto divergent trajectories. This chapter provides an overview of popular Big Data processing middleware, high-performance interconnects, and storage architectures, and discusses the challenges in accelerating Big Data processing middleware by leveraging emerging technologies on modern HPC clusters. It presents case studies of advanced designs, based on RDMA and heterogeneous storage architectures, that were proposed to address these challenges for multiple components of Hadoop (HDFS and MapReduce) and for Spark. The advanced designs presented in the case studies are publicly available as part of the High-Performance Big Data (HiBD) project, an overview of which is also provided in this chapter. All of these works aim to bring HPC and Big Data processing onto a convergent trajectory.
Xiaoyi Lu, Md. Wasi-ur-Rahman, Nusrat Islam, Dipti Shankar, Dhabaleswar K. (DK) Panda

Chapter 6. dispel4py: Agility and Scalability for Data-Intensive Methods Using HPC

Today's data bonanza and increasing computational power provide many new opportunities for combining observations with sophisticated simulation results to improve complex models and make forecasts by analyzing their relationships. This should lead to well-presented, actionable information that can support decisions and contribute trustworthy knowledge. Practitioners in all disciplines (computational scientists, data scientists, and decision makers) need improved tools to realize this potential. The library dispel4py is such a tool: a Python library for describing abstract workflows for distributed data-intensive applications. It delivers a simple abstract model in familiar development environments, with a fluent path to production use that automatically addresses scale without requiring its users to reformulate their methods. This depends on optimal mappings to many current HPC and data-intensive platforms.
Rosa Filgueira, Malcolm P. Atkinson, Amrey Krause
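The idea of an abstract workflow that is described once and mapped to different platforms can be sketched in a few lines. The `Pipeline` class below is a toy abstraction, deliberately not dispel4py's actual API: each stage transforms a data stream, and only the final `run` call binds the description to an execution mechanism (here a plain sequential loop; a real engine could map the same description onto MPI processes or cluster nodes):

```python
class Pipeline:
    """A toy abstract workflow: an ordered list of stream transformations."""

    def __init__(self):
        self.stages = []

    def then(self, fn):
        # Append a stage; returning self lets descriptions read as a chain.
        self.stages.append(fn)
        return self

    def run(self, stream):
        # Sequential execution engine; a distributed engine could map
        # the same stage list onto processes or nodes instead.
        for fn in self.stages:
            stream = map(fn, stream)
        return list(stream)

workflow = Pipeline().then(lambda x: x * x).then(lambda x: x + 1)
result = workflow.run([1, 2, 3])
# result == [2, 5, 10]
```

The key design point the sketch illustrates is the separation of the workflow description (`then` calls) from its execution (`run`), which is what lets one description address scale on different platforms.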

Chapter 7. Performance Analysis Tool for HPC and Big Data Applications on Scientific Clusters

Big data is prevalent in HPC. Many HPC projects rely on complex workflows to analyze terabytes or petabytes of data. These workflows often run over thousands of CPU cores and perform simultaneous data accesses, data movements, and computation. Analyzing such performance is challenging: it involves terabytes or petabytes of workflow data or execution measurements, gathered from complex workflows spanning a large number of nodes and multiple parallel task executions. To help identify performance bottlenecks and debug performance issues in large-scale scientific applications and scientific clusters, we have developed a performance analysis framework using state-of-the-art open-source big data processing tools. Our tool can ingest system logs and application performance measurements to extract key performance features, and apply sophisticated statistical and data mining methods to the performance data. It utilizes an efficient data processing engine that allows users to interactively analyze large amounts of different types of logs and measurements. To illustrate the functionality of the framework, we conduct case studies on workflows from an astronomy project known as the Palomar Transient Factory (PTF) and on job logs from a genome-analysis scientific cluster. Our study processed many terabytes of system logs and application performance measurements collected on the HPC systems at NERSC. The implementation of our tool is generic enough to be used for analyzing the performance of other HPC systems and Big Data workflows.
Wucherl Yoo, Michelle Koo, Yi Cao, Alex Sim, Peter Nugent, Kesheng Wu

Chapter 8. Big Data Behind Big Data

There is data related to the collection and management of big data that is as relevant as the primary datasets being collected, and it can itself be very large. In this chapter, we examine two aspects of High Performance Computing (HPC) data that fall under the category of big data. The first is the collection and analysis of HPC environmental data. The second is the collection of information on how large datasets are produced by scientific research on HPC systems, so that the datasets can be processed efficiently. A team within the computational facility at NERSC created an infrastructure solution to manage and analyze the data related to monitoring of HPC systems. This solution provides a single location for storing the data, backed by a scalable, parallel time-series database. The database is flexible enough that maintenance on the system does not disrupt the data collection activity.
Elizabeth Bautista, Cary Whitney, Thomas Davis

Chapter 9. Empowering R with High Performance Computing Resources for Big Data Analytics

R is a free, powerful, open-source software package with extensive statistical computing and graphics capabilities. Due to its high-level expressiveness and its multitude of domain-specific, community-developed packages, R has become the lingua franca for many areas of data analysis. While R is clearly a "high productivity" language, it has not necessarily been a "high performance" language. Challenges remain in developing methods to effectively scale R to the power of supercomputers, and in deploying support and enabling access for end users. In this chapter, we focus on approaches available in R that can leverage high performance computing resources to provide solutions to Big Data problems. We first present an overview of current approaches and support in R for parallel and distributed computation to improve scalability and performance. We divide those approaches into two categories on the basis of their hardware requirements: single-node parallelism, which requires multiple processing cores within one computer system, and multi-node parallelism, which requires access to a computing cluster. We present a detailed study of the performance benefit of using Intel® Xeon Phi coprocessors (Xeon Phi) with R in the single-node case. This performance is also compared with using a general-purpose graphics processing unit through the HiPLAR package, and with parallel packages enabling multi-node parallelism, including SNOW and pbdR. The results show the advantages and limitations of these approaches. We further provide two use cases that demonstrate parallel computation with R in practice, and we discuss a list of challenges in improving R performance for end users.
Overall, the chapter shows the potential benefits of exploiting high performance computing with R and offers recommendations for end users applying R to big data problems.
Weijia Xu, Ruizhu Huang, Hui Zhang, Yaakoub El-Khamra, David Walling

Chapter 10. Big Data Techniques as a Solution to Theory Problems

This chapter proposes a general approach for solving a broad class of difficult optimization problems using big data techniques. We provide a general description of this approach as well as some examples. The approach is ideally suited to nonconvex optimization problems, multiobjective programming problems, and models with a large degree of heterogeneity, rich policy structure, potential model uncertainty, and potential policy-objective uncertainty. In our applications of this algorithm, we use Hierarchical Data Format (HDF5) distributed storage and I/O, as well as the Message Passing Interface (MPI), for parallel computation of a large number of small optimization problems.
Richard W. Evans, Kenneth L. Judd, Kramer Quist
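The core pattern here, many small optimization problems solved in parallel under MPI, can be sketched without an MPI installation by partitioning the problem set by rank. `solve_one` and the round-robin partitioning below are illustrative stand-ins, not the authors' code; in their setting each rank would run a real optimizer and write its results to HDF5:

```python
def local_problems(problems, rank, size):
    # Round-robin partition: rank r of size s takes problems r, r+s, r+2s, ...
    # (what each MPI rank would select using comm.Get_rank()/Get_size()).
    return problems[rank::size]

def solve_one(x):
    # Stand-in for one small optimization problem:
    # a coarse grid search for argmin over t of (t - x)^2.
    candidates = [0.25 * i for i in range(41)]   # grid over [0, 10]
    return min(candidates, key=lambda t: (t - x) ** 2)

grid = [0.5 * i for i in range(8)]               # 8 independent problems
results = []
for rank in range(4):                            # simulating 4 MPI ranks
    results.extend(solve_one(x) for x in local_problems(grid, rank, 4))
```

Because the subproblems are independent, no communication is needed until the results are gathered, which is what makes this class of problems scale so well under MPI.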

Chapter 11. High-Frequency Financial Statistics Through High-Performance Computing

Financial statistics covers a wide array of applications in the financial world, such as (high-frequency) trading, risk management, pricing and valuation of securities and derivatives, and various business and economic analytics. Portfolio allocation is one of the most important problems in financial risk management. One of its most challenging aspects is the tremendous amount of data involved and the optimization procedures that require computing power beyond currently available desktop systems. In this article, we focus on the portfolio allocation problem using high-frequency financial data, and propose a hybrid parallelization solution to carry out efficient asset allocations in a large portfolio via intra-day high-frequency data. We exploit a variety of HPC techniques, including parallel R, the Intel Math Kernel Library, and, in particular, automatic offloading to the Intel Xeon Phi coprocessor, to speed up the simulation and optimization procedures in our statistical investigations. Our numerical studies are based on high-frequency price data for stocks traded on the New York Stock Exchange in 2011. The analysis results show that portfolios constructed using the high-frequency approach generally perform well by pooling together the strengths of regularization and estimation from a risk management perspective. We also investigate the computational aspects of large-scale multiple hypothesis testing for time series data. Using a combination of software and hardware parallelism, we demonstrate a high level of performance on high-frequency financial statistics.
Jian Zou, Hui Zhang

Chapter 12. Large-Scale Multi-Modal Data Exploration with Human in the Loop

A new trend in many scientific fields is to conduct data-intensive research by collecting and analyzing a large amount of high-density, high-quality, multi-modal data streams. In this chapter we present a research framework for analyzing and mining such data streams at large scale; in particular, we exploit parallel sequential pattern mining and iterative MapReduce to enable human-in-the-loop, large-scale data exploration powered by High Performance Computing (HPC). One basic problem is that data scientists are now working with datasets so large and complex that they become difficult to process using traditional desktop statistics and visualization packages, requiring instead "massively parallel software running on tens, hundreds, or even thousands of servers" (Jacobs, Queue 7(6):10:10–10:19, 2009). Meanwhile, discovering new knowledge requires the means to exploratively analyze datasets of this scale, allowing us to freely "wander" around the data and make discoveries by combining bottom-up pattern discovery with top-down human knowledge to leverage the power of the human perceptual system. In this work, we first exploit a novel interactive temporal data mining method that allows us to discover reliable sequential patterns and precise timing information in multivariate time series. For our principal test case of detecting and extracting human sequential behavioral patterns over multiple multi-modal data streams, this suggests a quantitative and interactive data-driven way to ground social interactions in a manner that has never been achieved before. After establishing the fundamental analytics algorithms, we proceed to a research framework that can extract reliable patterns from large-scale time series using iterative MapReduce tasks. Our work exploits visual information technologies to allow scientists to interactively explore, visualize, and make sense of their data.
For example, the parallel mining algorithm running on HPC is accessible to users through an asynchronous web service. In this way, scientists can inspect the intermediate data and propose new rounds of analysis for more scientifically meaningful and statistically reliable patterns, so that statistical computing and visualization bootstrap each other. Finally, we show results from our principal user application, demonstrating our system's capability of handling massive temporal event sets within just a few minutes. All of this combines to reveal an effective and efficient way to support large-scale data exploration with a human in the loop.
Guangchen Ruan, Hui Zhang

Chapter 13. Using High Performance Computing for Detecting Duplicate, Similar and Related Images in a Large Data Collection

The detection of duplicate and related content is a critical data curation task in the context of digital research collections. This task can be challenging, if not impossible, to do manually in large, unstructured, and noisy collections. While there are many automated solutions for deduplicating data that contain large numbers of identical copies, it can be particularly difficult to find a solution for identifying redundancy within image-heavy collections that have evolved over a long span of time or have been created collaboratively by large groups. These types of collections, especially in academic research settings, in which the datasets are used for a wide range of publication, teaching, and research activities, can be characterized by (1) large numbers of heterogeneous file formats, (2) repetitive photographic documentation of the same subjects in a variety of conditions, (3) multiple copies or subsets of images with slight modifications (e.g., cropping or color-balancing), and (4) complex file structures and naming conventions that may not be consistent throughout. In this chapter, we present a scalable and automated approach for detecting duplicate, similar, and related images, along with subimages, in digital data collections. Our approach can assist in efficiently managing redundancy in any large image collection on High Performance Computing (HPC) resources. While we illustrate the approach with a large archaeological collection, it is domain-neutral and widely applicable to image-heavy collections on any HPC platform that has general-purpose processors.
Ritu Arora, Jessica Trelogan, Trung Nguyen Ba
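The exact-duplicate portion of such a pipeline reduces to content hashing, sketched below with illustrative function names (the chapter's approach additionally covers similar images and subimages, which require image features rather than byte hashes):

```python
import hashlib

def fingerprint(data: bytes) -> str:
    # A content hash: two files with identical bytes get the same
    # fingerprint regardless of file name, path, or timestamp.
    return hashlib.sha256(data).hexdigest()

def exact_duplicate_groups(files):
    # files: mapping of path -> raw bytes. Returns groups of paths
    # whose contents are byte-for-byte identical.
    groups = {}
    for path, data in files.items():
        groups.setdefault(fingerprint(data), []).append(path)
    return [paths for paths in groups.values() if len(paths) > 1]

collection = {
    "site_a/photo_001.jpg": b"\x89IMG-DATA-1",
    "site_b/copy_of_photo.jpg": b"\x89IMG-DATA-1",   # identical bytes
    "site_a/photo_002.jpg": b"\x89IMG-DATA-2",
}
dupes = exact_duplicate_groups(collection)
# one group containing the two paths with identical bytes
```

On an HPC resource, hashing parallelizes trivially across files; near-duplicate and subimage detection would instead compare perceptual hashes or extracted features.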

Chapter 14. Big Data Processing in the eDiscovery Domain

Legal Electronic Discovery (eDiscovery) is a business domain that utilizes large volumes of data, in a variety of structured and unstructured formats, to discover evidence that may be pertinent to legal proceedings, compliance needs, litigation, or other investigations. eDiscovery practitioners are typically required to produce results with short turnaround times, utilizing mostly the commodity solutions at their disposal, while still conforming to legally defensible standards of data management. Therefore, they use optimal strategies for data analysis and management to meet the time and quality requirements of their business. In addition to such strategies, the time-to-results during eDiscovery can be further reduced by taking advantage of the High-Throughput Computing (HTC) paradigm on High Performance Computing (HPC) platforms.
In this chapter, we discuss strategies for data management and analysis from the legal eDiscovery domain, and also discuss the advantages of using HTC for eDiscovery. The various techniques that eDiscovery practitioners have adopted to meet the Big Data challenge are transferable to other domains as well. Hence, the discussion of these techniques in this chapter should be relevant to a wide range of disciplines that are grappling with the deluge of Big Data.
Sukrit Sondhi, Ritu Arora

Chapter 15. Databases and High Performance Computing

Many data-intensive applications require interaction with database management systems or document management systems for processing, producing, or analyzing Big Data. Therefore, if such data-intensive applications are to be ported to High Performance Computing (HPC) platforms, the database or document management systems that they require should either be provisioned directly on the HPC platforms or be made available on a storage platform in close proximity to the HPC platform; otherwise, the time-to-results or time-to-insights can be significantly higher. In this chapter, we present an overview of the various database management systems that can be used for the processing and analysis of Big Data on HPC platforms, and include some strategies for optimizing database access. We also present the steps to install and access a relational database management system on an HPC platform.
Ritu Arora, Sukrit Sondhi
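As a minimal sketch of keeping the database close to the computation, the snippet below uses Python's bundled SQLite as a stand-in (a production HPC deployment would more likely run a server-based RDBMS on or near the platform, as the chapter describes; the table and values are illustrative):

```python
import sqlite3

# An in-memory SQLite database standing in for a relational store
# colocated with the analysis code.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE results (run_id INTEGER, metric REAL)")
con.executemany(
    "INSERT INTO results VALUES (?, ?)",
    [(1, 0.92), (2, 0.87), (3, 0.95)],
)

# Query close to the data instead of shipping raw records elsewhere.
(best,) = con.execute("SELECT MAX(metric) FROM results").fetchone()
# best == 0.95
```

The design point is the proximity of storage to compute: when the query engine runs next to the analysis, only the aggregate crosses the network, not the raw records.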

Chapter 16. Conquering Big Data Through the Usage of the Wrangler Supercomputer

Data-intensive computing brings a new set of challenges that do not completely overlap with those met by typical, and even state-of-the-art, High Performance Computing (HPC) systems. Working with "big data" can involve analyzing thousands of files that need to be rapidly opened, examined, and cross-correlated, tasks that classic HPC systems might not be designed to do. Such tasks can be efficiently conducted on a data-intensive supercomputer like the Wrangler supercomputer at the Texas Advanced Computing Center (TACC). Wrangler allows scientists to share and analyze, in a user-friendly manner, the massive collections of data being produced in nearly every field of research today. It was designed to work closely with the Stampede supercomputer, the HPC flagship of TACC, which was ranked the tenth most powerful system in the world by TOP500. Wrangler was designed to keep much of what was successful with systems like Stampede, but also to introduce new features such as a very large flash storage system, a very large distributed spinning-disk storage system, and high-speed network access. This offers a new way to access HPC resources for users with data-analysis needs that were not being fulfilled by traditional HPC systems like Stampede. In this chapter, we provide an overview of the Wrangler data-intensive HPC system along with some of the big data use cases that it enables.
Jorge Salazar