Skip to main content

Über dieses Buch

This edited book collects state-of-the-art research related to large-scale data analytics that has been accomplished over the last few years. This is among the first books devoted to this important area based on contributions from diverse scientific areas such as databases, data mining, supercomputing, hardware architecture, data visualization, statistics, and privacy.

There is increasing need for new approaches and technologies that can analyze and synthesize very large amounts of data, in the order of petabytes, that are generated by massively distributed data sources. This requires new distributed architectures for data analysis. Additionally, the heterogeneity of such sources imposes significant challenges for the efficient analysis of the data under numerous constraints, including consistent data integration, data homogenization and scaling, privacy and security preservation. The authors also broaden reader understanding of emerging real-world applications in domains such as customer behavior modeling, graph mining, telecommunications, cyber-security, and social network analysis, all of which impose extra requirements for large-scale data analysis.

Large-Scale Data Analytics is organized in 8 chapters, each providing a survey of an important direction of large-scale data analytics or individual results of the emerging research in the field. The book presents key recent research that will help shape the future of large-scale data analytics, leading the way to the design of new approaches and technologies that can analyze and synthesize very large amounts of heterogeneous data. Students, researchers, professionals and practitioners will find this book an authoritative and comprehensive resource.



Chapter 1. The Family of Map-Reduce

In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which called for a paradigm shift in the computing architecture and large scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications that can process vast amounts of data on large clusters of commodity machines. MapReduce isolates the application from the details of running a distributed program, such as issues on data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in following up work. This chapter provides a comprehensive survey for a family of approaches and mechanisms of large scale data analysis that have been implemented based on the original father idea of the MapReduce framework, and are currently gaining a lot of momentum in both research and industrial communities. Some case studies are discussed as well.
Sherif Sakr, Anna Liu

Chapter 2. Optimization of Massively Parallel Data Flows

Massively parallel data analysis is an emerging research topic that is motivated by the continuous growth of data sets and the rising complexity of data analysis tasks. To facilitate the analysis of big data, several parallel data processing frameworks, such as MapReduce and parallel data flow processors, have emerged. However, the implementation and tuning of parallel data analysis tasks requires expert knowledge and is very time-consuming and costly. Higher-level abstraction frameworks have been designed to ease the definition of analysis tasks. Optimizers can automatically generate efficient parallel execution plans from higher-level task definitions. Therefore, optimization is a crucial technology for massively parallel data analysis. This chapter presents the state of the art in optimization of parallel data flows. It covers higher-level languages for MapReduce, approaches to optimize plain MapReduce jobs, and optimization for parallel data flow systems. The optimization capabilities of those approaches are discussed and compared with each other. The chapter concludes with directions for future research on parallel data flow optimization.
Fabian Hueske, Volker Markl

Chapter 3. Mining Tera-Scale Graphs with “Pegasus”: Algorithms and Discoveries

How do we find patterns and anomalies, on graphs with billions of nodes and edges, which do not fit in memory? How to use parallelism for such Tera- or Peta-scale graphs? We propose a carefully selected set of fundamental operations, that help answer those questions, including diameter estimation, connected components, and eigenvalues. We package all these operations in Pegasus, which, to the best of our knowledge, is the first such library, implemented on the top of the Hadoop platform, the open source version of MapReduce. One of the key observations in this work is that many graph mining operations are essentially repeated matrix-vector multiplications. We describe a very important primitive for Pegasus, called GIM-V (Generalized Iterative Matrix-Vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up on the number of available machines, (b) linear running time on the number of edges, and (c) more than nine times faster performance over the non-optimized version of GIM-V.Finally, we run experiments on real graphs. Our experiments run on M45, one of the largest Hadoop clusters available to academia. We report our findings on several real graphs, including one of the largest publicly available Web graphs with 6,7 billion edges. Some of our most impressive findings are (a) the discovery of adult advertisers in the who-follows-whom on Twitter, and (b) the 7-degrees of separation in the Web graph.
U Kang, Christos Faloutsos

Chapter 4. Customer Analyst for the Telecom Industry

The telecommunications industry is particularly rich in customer data, and telecom companies want to use this data to prevent customer churn, and improve the revenue per user through personalization and customer acquisition. Massive-scale analytics tools provide an opportunity to achieve this in is a flexible and scalable way. In this context, we have developed IBM Customer Analyst, a components library to analyze customer behavioral data and enable new insights and business scenarios based on the analysis of the relationship between users and the content they create and consume. Due to the massive amount of data and large number of users, this technology is built on IBM Infosphere BigInsights and Apache Hadoop. In this work, we first describe an efficient user profiling framework, with high user profiling quality guarantees, based on mobile web browsing log analysis. We describe the use of the Open Directory Project categories to generate user profiles. We then describe an end-to-end analysis flow and discuss its challenges. Last, we validate our methods through extensive experiments based on real data sets.
David Konopnicki, Michal Shmueli-Scheuer

Chapter 5. Machine Learning Algorithm Acceleration Using Hybrid (CPU-MPP) MapReduce Clusters

The uninterrupted growth of information repositories has progressively led data-intensive applications, such as MapReduce-based systems, to the mainstream. The MapReduce paradigm has frequently proven to be a simple yet flexible and scalable technique to distribute algorithms across thousands of nodes and petabytes of information. Under these circumstances, classic data mining algorithms have been adapted to this model, in order to run in production environments. Unfortunately, the high latency nature of this architecture has relegated the applicability of these algorithms to batch-processing scenarios. In spite of this shortcoming, the emergence of massively threaded shared-memory multiprocessors, such as Graphics Processing Units (GPU), on the commodity computing market has enabled these algorithms to be executed orders of magnitude faster, while keeping the same MapReduce-based model. In this chapter, we propose the integration of massively threaded shared-memory multiprocessors into MapReduce-based clusters, creating a unified heterogeneous architecture that enables executing Map and Reduce operators on thousands of threads across multiple GPU devices and nodes, while maintaining the built-in reliability of the baseline system. For this purpose, we created a programming model that facilitates the collaboration of multiple CPU cores and multiple GPU devices towards the resolution of a data intensive problem. In order to prove the potential of this hybrid system, we take a popular NP-hard supervised learning algorithm, the Support Vector Machine (SVM), and show that a 36 ×−192× speedup can be achieved on large datasets without changing the model or leaving the commodity hardware paradigm.
Sergio Herrero-Lopez, John R. Williams

Chapter 6. Large-Scale Social Network Analysis

Social Network Analysis (SNA) is an established discipline for the study of groups of individuals with applications in several areas, like economics, information science, organizational studies and psychology. In the last fifteen years the exponential growth of online Social Network Sites (SNSs), like Facebook, QQ and Twitter has provided a new challenging application context for SNA methods. However, with respect to traditional SNA application domains these systems are characterized by very large volumes of data, and this has recently led to the development of parallel network analysis algorithms and libraries. In this chapter we provide an overview of the state of the art in the field of large scale social network analysis; in particular, we focus on parallel algorithms and libraries for the computation of network centrality metrics.
Mattia Lambertini, Matteo Magnani, Moreno Marzolla, Danilo Montesi, Carmine Paolino

Chapter 7. Visual Analysis and Knowledge Discovery for Text

Providing means for effectively accessing and exploring large textual data sets is a problem attracting the attention of text mining and information visualization experts alike. The rapid growth of the data volume and heterogeneity, as well as the richness of metadata and the dynamic nature of text repositories, add to the complexity of the task. This chapter provides an overview of data visualization methods for gaining insight into large, heterogeneous, dynamic textual data sets. We argue that visual analysis, in combination with automatic knowledge discovery methods, provides several advantages. Besides introducing human knowledge and visual pattern recognition into the analytical process, it provides the possibility to improve the performance of automatic methods through user feedback.
Christin Seifert, Vedran Sabol, Wolfgang Kienreich, Elisabeth Lex, Michael Granitzer

Chapter 8. Practical Distributed Privacy-Preserving Data Analysis at Large Scale

In this chapter we investigate practical technologies for security and privacy in data analysis at large scale. We motivate our approach by discussing the challenges and opportunities in light of current and emerging analysis paradigms on large data sets. In particular, we present a framework for privacy-preserving distributed data analysis that is practical for many real-world applications. The framework is called Peers for Privacy (P4P) and features a novel heterogeneous architecture and a number of efficient tools for performing private computation and offering security at large scale. It maintains three key properties, which are essential for real-world applications: (i) provably strong privacy; (ii) adequate efficiency at reasonably large scale; and (iii) robustness against realistic adversaries. The framework gains its practicality by decomposing data mining algorithms into a sequence of vector addition steps, which can be privately evaluated using efficient cryptographic tools, namely verifiable secret sharing over small field (e.g., 32 or 64 bits), which have the same cost as regular, non-private arithmetic. This paradigm supports a large number of statistical learning algorithms, including SVD, PCA, k-means, ID3 and machine learning algorithms based on Expectation-Maximization, as well as all algorithms in the statistical query model (Kearns, Efficient noise-tolerant learning from statistical queries. In: STOC’93, San Diego, pp. 392–401, 1993). As a concrete example, we show how singular value decomposition, which is an extremely useful algorithm and the core of many data mining tasks, can be performed efficiently with privacy in P4P. Using real data, we demonstrate that P4P is orders of magnitude faster than other solutions.
Yitao Duan, John Canny


Weitere Informationen

Premium Partner