2016 | Book

# Introduction to HPC with MPI for Data Science

Author: Frank Nielsen

Publisher: Springer International Publishing

Book Series : Undergraduate Topics in Computer Science

2016 | Book

Author: Frank Nielsen

Publisher: Springer International Publishing

Book Series : Undergraduate Topics in Computer Science

This gentle introduction to High Performance Computing (HPC) for Data Science using the Message Passing Interface (MPI) standard has been designed as a first course for undergraduates on parallel programming on distributed memory models, and requires only basic programming notions.

Divided into two parts the first part covers high performance computing using C++ with the Message Passing Interface (MPI) standard followed by a second part providing high-performance data analytics on computer clusters.

In the first part, the fundamental notions of blocking versus non-blocking point-to-point communications, global communications (like broadcast or scatter) and collaborative computations (reduce), with Amdalh and Gustafson speed-up laws are described before addressing parallel sorting and parallel linear algebra on computer clusters. The common ring, torus and hypercube topologies of clusters are then explained and global communication procedures on these topologies are studied. This first part closes with the MapReduce (MR) model of computation well-suited to processing big data using the MPI framework.

In the second part, the book focuses on high-performance data analytics. Flat and hierarchical clustering algorithms are introduced for data exploration along with how to program these algorithms on computer clusters, followed by machine learning classification, and an introduction to graph analytics. This part closes with a concise introduction to data core-sets that let big data problems be amenable to tiny data problems.

Exercises are included at the end of each chapter in order for students to practice the concepts learned, and a final section contains an overall exam which allows them to evaluate how well they have assimilated the material covered in the book.

Advertisement

Abstract

High Performance Computing, or HPC for short, is an area encompassing among others the various paradigms of parallel programming, their related programming languages and application programming interfaces (APIs), the dedicated software tools, the international specialized conferences (ACM/IEEE Super-Computing, or SC for short), etc. Loosely speaking, HPC is both the scientific and technical fields of study of “Super-Computers” (SCs).

Abstract

MPI is a standardized Application Programming Interface (API) that allows one to provide unambiguously the interface (that is, the declaration of functions, procedures, data-types, constants, etc.) with the precise semantic of communication protocols and global calculation routines, among others. Thus a parallel program using distributed memory can be implemented using various implementations of the MPI interface provided by several vendors (like the prominent OpenMPI, MPICH2, etc.). Communications can either be synchronous or asynchronous, bufferized or not bufferized, and one can define synchronization barriers where all processes have to wait for each other before further carrying computations.

Abstract

An interconnection network of computers is modelled mathematically as a graph with vertices representing computer nodes and edges depicting the communication links between those nodes. We distinguish the physical network from the logical network that is used by a parallel algorithm to operate its communications. When a logical network is different from the underlying physical network, one needs to transpose or embed the logical network onto the physical network. In that case, we seek the best transposition (or embedding) that minimizes both the dilatation (defined as the maximal distance between nodes on the physical network for neighbor nodes on the logical network) and the expansion (defined as the ratio of the number of nodes in the physical network over the number of nodes in the logical network).

Abstract

There exist plenty sequential algorithms to sort n numbers that achieve the optimal time complexity of \(\varTheta (n\log n)\). We can sort on parallel architectures with distributed memory by considering the granularity of local sorting.

Abstract

The field of algorithms covering implementations is very rich and versatile. In computer science, we ubiquitously use computational linear algebra in algorithms, often by using a dedicated software library that hides the tedious nitty-gritty details of the optimized implementations of the fundamental algorithms (mainly matrix arithmetic operations and factorization primitives).

Abstract

The system of MapReduce (or Hadoop for an equivalent open source in Java) offers a simple framework to parallelize and execute parallel algorithms on massive data sets, commonly called Big Data (with size ranging from a few gigabytes to a few terabytes or even petabytes). This dedicated MapReduce paradigm of data-intensive parallel programming was originally developed by Google in 2003. MapReduce is an abstract model of parallel programming for processing massive data sets on a cluster of computers, and a platform to execute and monitor jobs. MapReduce is straightforward to use, can be easily extended, and even more importantly MapReduce is prone to both hardware and software failures.

Abstract

Nowadays, huge size data-sets are commonly publicly available, and it becomes increasingly important to eefficiently process them to discover worthwhile structures (or “patterns”) in those seas of data. Exploratory data analysis is concerned with this challenge of finding such structural information without any prior knowledge: in this case, those techniques that consist in learning from data without prior knowledge are called generically unsupervised machine learning.

Abstract

Agglomerative hierarchical clustering differs from partition-based clustering since it builds a binary merge tree starting from leaves that contain data elements to the root that contains the full data-set. The graphical representation of that tree that embeds the nodes on the plane is called a dendrogram. To implement a hierarchical clustering algorithm, one has to choose a linkage function (single linkage, average linkage, complete linkage, Ward linkage, etc.) that defines the distance between any two sub-sets (and rely on the base distance between elements). A hierarchical clustering is monotonous if and only if the similarity decreases along the path from any leaf to the root, otherwise there exists at least one inversion. The single, complete, and average linkage criteria guarantee the monotonic property, but not the often used Ward’s criterion.

Abstract

The k-NN classification rule labels a new observation query q from a test set by choosing the dominant class among the k nearest neighbors of q in the training set. One evaluates the performance of a classifier by calculating its F-score that is the harmonic mean of the precision rate and the recall rate. This yields a single quality value that takes into account the four different cases that can occur when classifying observations in one of either two classes (false/true-positive/negative). In statistical machine learning, a classifier can never beat the optimal Bayes’ error (or the probability of error), and the 1-NN guarantees asymptotically an error factor of 2. Since the nearest neighbor queries are decomposable queries, meaning that \(\mathrm {NN}_k(q,X_1\cup X_2)=\mathrm {NN}_k(\mathrm {NN}_k(q,X_1),\mathrm {NN}_k(q,X_2))\), the k-NN classification rule can be easily parallelized on a distributed memory architecture like a computer cluster. One of the drawback of the k-NN rule is that it needs to store all the training set in order to classify new observations.

Abstract

Often one is interested to solve optimization problems on very large size data-sets. It can be computationally interesting not to solve for the exact optimal solution (or one of the optimal solutions when several such optimal solutions exist) but rather seek for a guaranteed approximation in faster time. For example, consider the k-means clustering problem: we seek to minimize the k-means cost function (i.e., weighted sum of cluster variances, see Chap. 7). It makes sense to think that under some stability hypothesis to point perturbations of the optimal clustering, an approximation of the minimization of the cost function (instead of the regular k-means objective function) will end up with a good (if not an optimal) clustering.

Abstract

Graphs are convenient to model structures in data by taking into account relationships between them (edges). Nowadays, large graph data-sets are available and commonly use when we perform social network analytics (facebook, twitter, etc.). We attest a trend of analyzing efficiently voluminous graphs in data science.