2014 | Book

Euro-Par 2014 Parallel Processing

20th International Conference, Porto, Portugal, August 25-29, 2014. Proceedings

Edited by: Fernando Silva, Inês Dutra, Vítor Santos Costa

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this Book

This book constitutes the refereed proceedings of the 20th International Conference on Parallel and Distributed Computing, Euro-Par 2014, held in Porto, Portugal, in August 2014. The 68 revised full papers presented were carefully reviewed and selected from 267 submissions. The papers are organized in 15 topical sections: support tools and environments; performance prediction and evaluation; scheduling and load balancing; high-performance architectures and compilers; parallel and distributed data management; grid, cluster and cloud computing; green high performance computing; distributed systems and algorithms; parallel and distributed programming; parallel numerical algorithms; multicore and manycore programming; theory and algorithms for parallel computation; high performance networks and communication; high performance and scientific applications; and GPU and accelerator computing.

Table of Contents

Frontmatter

Support Tools and Environments

MPI Trace Compression Using Event Flow Graphs

Understanding how parallel applications behave is crucial for using high-performance computing (HPC) resources efficiently. However, the task of performance analysis is becoming increasingly difficult due to the growing complexity of scientific codes and the size of machines. Even though many tools have been developed over the past years to help in this task, current approaches either offer only an overview of the application, discarding temporal information, or they generate huge trace files that are often difficult to handle.

In this paper we propose the use of event flow graphs for monitoring MPI applications, a new and different approach that balances the low overhead of profiling tools with the abundance of information available from tracers. Event flow graphs are captured with very low overhead, require orders of magnitude less storage than standard trace files, and can still recover the full sequence of events in the application. We test this new approach with the NERSC-8/Trinity Benchmark suite and achieve compression ratios up to 119x.

Xavier Aguilar, Karl Fürlinger, Erwin Laure
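
The storage saving comes from folding repeated call sequences into counted graph edges. A minimal sketch of the idea in Python, with illustrative names only (not the authors' tool or API):

```python
from collections import defaultdict

class EventFlowGraph:
    """Toy event flow graph: nodes are event types, edges carry counts."""
    def __init__(self):
        self.edges = defaultdict(int)   # (previous_event, event) -> count
        self.prev = None

    def record(self, event):
        # called from an interception wrapper on every MPI event
        if self.prev is not None:
            self.edges[(self.prev, event)] += 1
        self.prev = event

g = EventFlowGraph()
for _ in range(10_000):                 # a tight communication loop...
    g.record("MPI_Send")
    g.record("MPI_Recv")

# ...compresses to a 2-edge cycle with large counters instead of 20,000
# flat trace records; walking the cycle while decrementing the counters
# recovers the full event sequence.
print(dict(g.edges))
```
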
ScalaJack: Customized Scalable Tracing with In-situ Data Analysis

Root cause diagnosis of large-scale HPC applications often fails because tools, specifically trace-based ones, can no longer record all the metrics they measure. We address this problem by combining customized tracing with support for in-situ data analysis via ScalaJack, a framework with customizable instrumentation and pluggable extension capabilities for problem-directed instrumentation and in-situ data analysis. We further eliminate cross-cutting concerns by refactoring code for aspect orientation, and evaluate these capabilities in case studies within and beyond the scope of tracing.

Srinath Krishna Ananthakrishnan, Frank Mueller
Performance Measurement and Analysis of Transactional Memory and Speculative Execution on IBM Blue Gene/Q

The core count of modern processors is steadily increasing, forcing programmers to use more concurrent threads or tasks to effectively use the available hardware. This in turn makes it increasingly challenging to achieve correct and efficient thread synchronization. To support the programmer in this task, IBM introduced hardware transactional memory (TM) and speculative execution (SE) in their Blue Gene/Q system, whose 16-core processor can run 64 simultaneous hardware threads in SMT mode. TM and SE allow for parallelization when race conditions may happen; however, upon their detection the respective parts of the execution are rolled back and re-executed serially. This incurs some overhead, and therefore their use must be well justified. In this paper, we describe extensions to the community instrumentation and measurement infrastructure Score-P that allow developers to instrument, measure, and analyze applications. To our knowledge, this is the first integrated performance tool framework for analyzing TM/SE programs. We demonstrate its usefulness and effectiveness by describing experiments with benchmarks and a real-world application.

Jie Jiang, Peter Philippen, Michael Knobloch, Bernd Mohr
c-Eclipse: An Open-Source Management Framework for Cloud Applications

Cloud application portability and optimal resource allocation are of great importance in the realm of Cloud infrastructure provisioning. c-Eclipse is an open-source Cloud Application Management Framework through which users are able to define the description, deployment and management phases of their Cloud applications in a clean and intuitive graphical manner. It is built on top of the well-established Eclipse platform and it adheres to two highly desirable features of Cloud applications: portability and elasticity. In particular, c-Eclipse implements the open, non-proprietary OASIS TOSCA specification for describing the provision, deployment and re-contextualization of applications across different Cloud infrastructures, thereby ensuring application portability. Furthermore, c-Eclipse enables Cloud users to specify elasticity policies that describe how the deployed virtualized resources must be elastically adapted at runtime to match the needs of a dynamic application-workload. In this paper, we introduce the architecture and implementation of c-Eclipse, and describe its key characteristics via a use-case scenario that involves a user creating a description of a 3-tier Cloud application, enriching it with appropriate elasticity policies, submitting it for deployment to two different Cloud providers and, finally, monitoring its execution.

Chrystalla Sofokleous, Nicholas Loulloudes, Demetris Trihinas, George Pallis, Marios D. Dikaiakos
Modeling and Simulation of a Dynamic Task-Based Runtime System for Heterogeneous Multi-core Architectures

Multi-core architectures comprising several GPUs have become mainstream in the field of High-Performance Computing. However, obtaining the maximum performance of such heterogeneous machines is challenging, as it requires carefully offloading computations and managing data movements between the different processing units. The most promising and successful approaches so far rely on task-based runtimes that abstract the machine and rely on opportunistic scheduling algorithms. As a consequence, the problem gets shifted to choosing the task granularity, task graph structure, and optimizing the scheduling strategies. Trying different combinations of these alternatives is itself a challenge. Indeed, getting accurate measurements requires reserving the target system for the whole duration of the experiments. Furthermore, observations are limited to the few available systems at hand and may be difficult to generalize. In this article, we show how we crafted a coarse-grain hybrid simulation/emulation of StarPU, a dynamic runtime for hybrid architectures, on top of SimGrid, a versatile simulator for distributed systems. This approach makes it possible to obtain performance predictions accurate to within a few percent on classical dense linear algebra kernels in a matter of seconds, which allows both runtime and application designers to quickly decide which optimization to enable or whether it is worth investing in higher-end GPUs or not.

Luka Stanisic, Samuel Thibault, Arnaud Legrand, Brice Videau, Jean-François Méhaut

Performance Prediction and Evaluation

Modeling the Impact of Reduced Memory Bandwidth on HPC Applications

To deliver the energy efficiency and raw compute throughput necessary to realize exascale systems, projected designs call for massive numbers of (simple) cores per processor. An unfortunate consequence of such designs is that the memory bandwidth per core will be significantly reduced, which can significantly degrade the performance of many memory-intensive HPC workloads. To identify the code regions that are most impacted, and to guide the development of mitigating solutions, system designers and application developers alike would benefit immensely from a systematic framework that allows them to identify the types of computations that are sensitive to reduced memory bandwidth and to precisely locate the regions in their code that exhibit this sensitivity. This paper introduces a framework for identifying the properties in computations that are associated with memory bandwidth sensitivity, extracting those same properties from HPC applications, and associating bandwidth sensitivity with specific structures in the application source code. We apply our framework to a number of large scale HPC applications, observing that the bandwidth sensitivity model shows an absolute mean error that averages less than 5%.

Ananta Tiwari, Anthony Gamst, Michael A. Laurenzano, Martin Schulz, Laura Carrington
ParaShares: Finding the Important Basic Blocks in Multithreaded Programs

Understanding and optimizing multithreaded execution is a significant challenge. Numerous research and industrial tools debug parallel performance by combing through program source or thread traces for pathologies including communication overheads, data dependencies, and load imbalances. This work takes a new approach: it ignores any underlying pathologies, and focuses instead on pinpointing the exact locations in source code that consume the largest share of execution. Our new metric, ParaShares, scores and ranks all basic blocks in a program based on their share of parallel execution. For the eight benchmarks examined in this paper, ParaShare rankings point to just a few important blocks per application. The paper demonstrates two uses of this information, exploring how the important blocks vary across thread counts and input sizes, and making modest source code changes (fewer than 10 lines of code) that result in 14-92% savings in parallel program runtime.

Melanie Kambadur, Kui Tang, Martha A. Kim
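
As a rough illustration of the kind of metric involved (the data layout and exact weighting here are hypothetical, chosen only to be consistent with the abstract's goal), one can weight each dynamic basic-block execution inversely by the number of threads doing useful work at that instant, so blocks executed during serialized phases rank higher:

```python
from collections import Counter

# (block_id, threads_doing_useful_work_at_that_moment) -- made-up samples
executions = [("bb_lock", 1), ("bb_lock", 1), ("bb_loop", 8), ("bb_loop", 8)]

scores = Counter()
for block, useful_threads in executions:
    scores[block] += 1.0 / useful_threads   # serialized work weighs more

total = sum(scores.values())
for block, s in scores.most_common():
    print(block, f"{100 * s / total:.0f}% share")   # bb_lock dominates
```
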
Multi-Objective Auto-Tuning with Insieme: Optimization and Trade-Off Analysis for Time, Energy and Resource Usage

The increasing complexity of modern multi- and many-core hardware design makes performance tuning of parallel applications a difficult task. In the past, auto-tuners have been successfully applied to minimize execution time. However, besides execution time, additional optimization goals have recently arisen, such as energy consumption or computing costs. Therefore, more sophisticated methods capable of exploiting and identifying the trade-offs among these goals are required. In this work we present and discuss results of applying a multi-objective search-based auto-tuner to optimize for three conflicting criteria: execution time, energy consumption, and resource usage. We examine a method, called RS-GDE3, to tune HPC codes using the Insieme parallelizing and optimizing compiler. Our results demonstrate that RS-GDE3 offers solutions of higher quality than those provided by a hierarchical and a random search, at a fraction of the required time (5%) or energy (8%). A comparison to a state-of-the-art multi-objective optimizer (NSGA-II) shows that RS-GDE3 computes solutions of higher quality. Finally, based on the trade-off solutions found by RS-GDE3, we provide a detailed analysis and several hints on how to improve the design of multi-objective auto-tuners and code optimization.

Philipp Gschwandtner, Juan J. Durillo, Thomas Fahringer
Performance Prediction and Evaluation of Parallel Applications in KVM, Xen, and VMware

Cloud computing platforms are considerably attractive for parallel applications that perform large-scale, computationally intensive tasks. These platforms can provide elastic computing resources to parallel software owing to system virtualization technology. Almost every cloud service provider operates on a pay-per-use basis, and therefore it is important to estimate the performance of parallel applications before deploying them. However, a comprehensive approach to predicting the performance of parallel applications in such environments is still an open research topic. In this paper, we provide a theoretical performance model that can predict the performance of parallel applications under different virtual machine scheduling policies, and we evaluate the model in representative hypervisors including KVM, Xen, and VMware. Through this analysis and evaluation, we show that our performance prediction model is accurate and reliable.

Cheol-Ho Hong, Beom-Joon Kim, Young-Pil Kim, Hyunchan Park, Chuck Yoo
DReAM: Per-Task DRAM Energy Metering in Multicore Systems

Interaction across applications in DRAM memory impacts its energy consumption. This paper makes the case for accurate per-task DRAM energy metering in multicores, which opens new paths to energy/performance optimizations, such as per-task energy-aware task scheduling and energy-aware billing in datacenters. In particular, the contributions of this paper are (i) an ideal per-task energy metering model for DRAM memories; (ii) DReAM, an accurate, yet low cost, implementation of the ideal model (less than 5% accuracy error when 16 tasks share memory); and (iii) a comparison with standard methods (even distribution and access-count based) proving that DReAM is more accurate than these other methods.

Qixiao Liu, Miquel Moreto, Jaume Abella, Francisco J. Cazorla, Mateo Valero
Characterizing the Performance-Energy Tradeoff of Small ARM Cores in HPC Computation

Deploying large numbers of small, low-power cores has been gaining traction recently as a system design strategy in high performance computing (HPC). The ARM platform that dominates the embedded and mobile computing segments is now being considered as an alternative to high-end x86 processors that largely dominate HPC because peak performance per watt may be substantially improved using off-the-shelf commodity processors.

In this work we methodically characterize the performance and energy of HPC computations drawn from a number of problem domains on current ARM and x86 processors. Unsurprisingly, we find that the performance, energy, and energy-delay product of applications running on these platforms vary significantly across problem types and inputs. Using static program analysis we further show that this variation can be explained largely in terms of the capabilities of two processor subsystems: single instruction multiple data (SIMD)/floating point and the cache/memory hierarchy; and that static analysis of this kind is sufficient to predict which platform is best for a particular application/input pair. In the context of these findings, we evaluate how some of the key architectural changes being made for upcoming 64-bit ARM platforms may impact HPC application performance.

Michael A. Laurenzano, Ananta Tiwari, Adam Jundt, Joshua Peraza, William A. Ward Jr., Roy Campbell, Laura Carrington

Scheduling and Load Balancing

On Interactions among Scheduling Policies: Finding Efficient Queue Setup Using High-Resolution Simulations

Many studies in the past two decades have focused on the problem of efficient job scheduling in HPC and Grid-like systems. While many new scheduling algorithms have been proposed for systems with specific requirements, mainstream resource management systems and schedulers still use only a limited set of scheduling policies. Production systems need to balance various policies that are set in place to satisfy both the resource providers and users (or virtual organizations) in the system. While many works address these separate policies, e.g., fairshare for fair resource allocation, only a few works try to address the interactions between these separate solutions. In this paper we describe how to approach these interactions when developing site-specific policies. Notably, we describe how (priority) queues interact with scheduling algorithms, fairshare, and with anti-starvation mechanisms. Moreover, we present a case study describing how an advanced simulation tool was used to find a new configuration for an actual resource manager deployed in the Czech National Grid, significantly increasing its performance.

Dalibor Klusáček, Šimon Tóth
ProPS: A Progressively Pessimistic Scheduler for Software Transactional Memory

Software Transactional Memory (STM) is one promising abstraction to simplify the task of writing highly parallel applications. Nonetheless, in workloads lacking enough parallelism, STM’s optimistic approach to concurrency control can severely degrade performance, as transactions abort and restart often.

In this paper, we describe a new scheduling-based solution to improve STM’s performance in high-contention scenarios. Our Progressively Pessimistic Scheduler (ProPS) uses a fine-grained scheduling mechanism that controls the amount of concurrency in the system gradually as transactions abort and commit with success.

Experimental results with the STMBench7 benchmark and the STAMP benchmark suite showed that current coarse-grained, conservative transaction schedulers are not suitable for workloads with long transactions, whereas ProPS is up to 40% faster than all other scheduling alternatives.

Hugo Rito, João Cachopo
A Queueing Theory Approach to Pareto Optimal Bags-of-Tasks Scheduling on Clouds

Cloud hosting services offer computing resources which can scale along with the needs of users. When access to data is limited by the network capacity, this scalability also becomes limited. To investigate the impact of this limitation we focus on bags-of-tasks where task data is stored outside the cloud and has to be transferred across the network before task execution can commence. The existing bags-of-tasks estimation tools are not able to provide accurate estimates in such a case. We introduce a queuing-network inspired model which successfully models the limited network resources. Based on the Mean-Value Analysis of this model we derive an efficient procedure that results in an estimate of the makespan and the execution costs for a given configuration of cloud virtual machines. We compare the calculated Pareto set with measurements performed in a number of experiments for real-world bags-of-tasks and validate the proposed model and the accuracy of the estimated configurations.

Cosmin Dumitru, Ana-Maria Oprescu, Miroslav Živković, Rob van der Mei, Paola Grosso, Cees de Laat
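
For readers unfamiliar with Mean-Value Analysis, the following is a minimal exact-MVA recursion for a closed queueing network, offered as a generic textbook sketch rather than the paper's actual model or parameters:

```python
def mva(demands, n_jobs):
    """Exact MVA for a closed network of queueing stations.

    demands[k] is the mean service demand at station k (think: a VM's CPU
    and the shared network link); n_jobs tasks circulate in the system.
    Returns system throughput and per-station residence times.
    """
    queue = [0.0] * len(demands)                   # mean queue lengths
    for n in range(1, n_jobs + 1):
        resid = [d * (1 + q) for d, q in zip(demands, queue)]
        thru = n / sum(resid)                      # throughput with n jobs
        queue = [thru * r for r in resid]
    return thru, resid

thru, _ = mva([1.0, 0.5], 8)                       # two illustrative stations
print("estimated makespan for 100 tasks:", 100 / thru)
```
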
SPAGHETtI: Scheduling/Placement Approach for Task-Graphs on HETerogeneous archItecture

We propose a new algorithm, called SPAGHETtI, for static scheduling of tasks on unbounded heterogeneous resources, where resources belong to different architectures (e.g. CPU or GPU). We show that this algorithm is optimal in complexity O(|E||A|^2 + |V||A|), where |E| is the number of edges, |V| the number of vertices of the scheduled DAG, and |A| the number of architectures – usually a small value – and that it is able to compute the optimal makespan. Moreover, the number of resources to be used for executing the schedule is given by a linear time algorithm. When the resources are bounded we provide a method to reduce the number of necessary resources down to the bound, providing a set of compromises between the makespan and the size of the infrastructure.

Denis Barthou, Emmanuel Jeannot
Energy-Aware Multi-Organization Scheduling Problem

Scheduling algorithms for shared platforms such as grids and clouds grant users of different organizations access to powerful resources and may improve machine utilization; however, this can also increase the operational costs of less-loaded organizations.

We consider energy as a resource, where the objective is to optimize the total energy consumption without increasing the energy spent by a selfish organization. We model the problem as an energy-aware variant of the Multi-Organization Scheduling Problem that we call MOSP-energy.

We show that the clairvoyant problem with variable speed processors and jobs with release dates and deadlines is NP-hard, and also that being selfish can cause solutions at most m^(α − 1) away from the optimal, where m is the number of machines and α > 1 is a constant. Finally, we present efficient heuristics for scenarios with all jobs ready from the beginning.

Johanne Cohen, Daniel Cordeiro, Pedro Luis F. Raphael
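
For context, the m^(α − 1) gap is tied to the standard speed-scaling power model used throughout this literature (the same setting as the Yao et al. work cited in the next abstract); the paper's exact formulation may differ, but the model is typically stated as:

```latex
% Standard speed-scaling model: a processor running at speed $s$ dissipates
\[
  P(s) = s^{\alpha}, \qquad \alpha > 1,
\]
% so a job of $w$ operations run at constant speed $s$ finishes in time $w/s$
% and consumes
\[
  E = \frac{w}{s}\, P(s) = w\, s^{\alpha - 1},
\]
% which is why slowing processors down saves energy, and why bounds in this
% setting naturally carry the exponent $\alpha - 1$.
```
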
Energy Efficient Scheduling of MapReduce Jobs

MapReduce has emerged as a prominent programming model for data-intensive computation. In this work, we study power-aware MapReduce scheduling in the speed scaling setting first introduced by Yao et al. [FOCS 1995]. We focus on the minimization of the total weighted completion time of a set of MapReduce jobs under a given budget of energy. Using a linear programming relaxation of our problem, we derive a polynomial time constant-factor approximation algorithm. We also propose a convex programming formulation that we combine with standard list scheduling policies, and we evaluate their performance using simulations.

Evripidis Bampis, Vincent Chau, Dimitrios Letsios, Giorgio Lucarelli, Ioannis Milis, Georgios Zois

High Performance Architectures and Compilers

Automated Transformation of GPU-Specific OpenCL Kernels Targeting Performance Portability on Multi-Core/Many-Core CPUs

When adapting GPU-specific OpenCL kernels to run on multi-core/many-core CPUs, coarsening the thread granularity is necessary and thus extensively used. However, locality concerns exposed in GPU-specific OpenCL code are usually inherited without analysis, which may have side effects on CPU performance. When executing GPU-specific kernels on CPUs, local-memory arrays no longer match well with the hardware, and the associated synchronizations are costly. To solve this dilemma, we actively analyze the memory access patterns using array-access descriptors derived from GPU-specific kernels, which can thus be adapted for CPUs by removing all the unwanted local-memory arrays together with the obsolete barrier statements. Experiments show that the automated transformation can satisfactorily improve OpenCL kernel performance on the Sandy Bridge CPU and Intel’s Many-Integrated-Core coprocessor.

Dafei Huang, Mei Wen, Changqing Xun, Dong Chen, Xing Cai, Yuran Qiao, Nan Wu, Chunyuan Zhang
Switchable Scheduling for Runtime Adaptation of Optimization

Parallel applications used to be executed alone, until their termination, on partitions of supercomputers: a very static environment for very static applications. The recent shift to multicore architectures for desktop and embedded systems, as well as the emergence of cloud computing, is raising the problem of the impact of the execution context on performance. The number of criteria to take into account for that purpose is significant: architecture, system, workload, dynamic parameters, etc. Finding the best optimization for every context at compile time is clearly out of reach. Dynamic optimization is the natural solution, but it is often costly in execution time and may offset the optimization it is enabling. In this paper, we present a static-dynamic compiler optimization technique that generates loop-based programs with dynamic auto-tuning capabilities and very low overhead. Our strategy introduces switchable scheduling, a family of program transformations that allows switching between optimized versions while always performing useful computation. We present both the technique to generate self-adaptive programs based on switchable scheduling and experimental evidence of their ability to sustain high performance in a dynamic environment.

Lénaïc Bagnères, Cédric Bastoul
A New GCC Plugin-Based Compiler Pass to Add Support for Thread-Level Speculation into OpenMP

In this paper we propose a compile-time system that adds support for Thread-Level Speculation (TLS) into OpenMP. Our solution augments the original user code with calls to a TLS library that handles the speculative parallel execution of a given loop, with the help of a new OpenMP speculative clause for variable usage classification. To support it, we have developed a plugin-based compiler pass for GCC that augments the code of the loop. With this approach, we only need one additional code line to speculatively parallelize the code, compared with the tens or hundreds of changes needed (depending on the number of accesses to speculative variables) to manually apply the required transformations. Moreover, the plugin leads to faster performance than the manual parallelization.

Sergio Aldea, Alvaro Estebanez, Diego R. Llanos, Arturo Gonzalez-Escribano

Parallel and Distributed Data Management

Improving Read Performance with Online Access Pattern Analysis and Prefetching

Among the major challenges of transitioning to exascale in HPC is the ubiquitous I/O bottleneck. For analysis and visualization applications in particular, this bottleneck is exacerbated by the write-once-read-many property of most scientific datasets combined with typically complex access patterns. One promising way to alleviate this problem is to recognize the application’s access patterns and utilize them to prefetch data, thereby overlapping computation and I/O. However, current research methods for analyzing access patterns are either offline-only or lack support for complex access patterns, such as high-dimensional strided or composition-based unstructured access patterns. Therefore, we propose an online analyzer capable of detecting both simple and complex access patterns with low computational and memory overhead and high accuracy. By combining our pattern detection with prefetching, we consistently observe run-time reductions, up to 26%, across 18 configurations of PIOBench and 4 configurations of a micro-benchmark with both structured and unstructured access patterns.

Houjun Tang, Xiaocheng Zou, John Jenkins, David A. Boyuka II, Stephen Ranshous, Dries Kimpe, Scott Klasky, Nagiza F. Samatova
Robust and Efficient Large-Large Table Outer Joins on Distributed Infrastructures

Outer joins are ubiquitous in many workloads but are sensitive to load-balancing problems. Current approaches mitigate such problems caused by data skew by using (partial) replication. However, contemporary replication-based approaches (1) introduce overhead, since they usually result in redundant data movement, (2) are sensitive to parameter tuning and to the degree of data skew, and (3) typically require that one join side is small. In this paper, we propose a novel parallel algorithm, Redistribution and Efficient Query with Counters (REQC), aimed at robustness in terms of the size of the join sides, variation in skew, and parameter tuning. Experimental results demonstrate that our algorithm is faster, more robust, and less demanding in terms of network bandwidth, compared to the state-of-the-art.

Long Cheng, Spyros Kotoulas, Tomas E Ward, Georgios Theodoropoulos
Top-k Item Identification on Dynamic and Distributed Datasets

The problem of identifying the most frequent items across multiple datasets has received considerable attention over the last few years. When storage is a scarce resource, the topic is already a challenge; yet, its complexity may be further exacerbated not only by the many independent data sources, but also by the dynamism of the data, i.e., the fact that new items may appear and old ones disappear at any time. In this work, we provide a novel approach to the problem by using an existing gossip-based algorithm for identifying the k most frequent items over a distributed collection of datasets, in ways that deal with the dynamic nature of the data. The algorithm has been thoroughly analyzed through trace-based simulations and compared to state-of-the-art decentralized solutions, showing better precision at reduced communication overhead.

Alessio Guerrieri, Alberto Montresor, Yannis Velegrakis
Applying Selectively Parallel I/O Compression to Parallel Storage Systems

This paper presents a new I/O technique called Selectively Parallel I/O Compression (SPIOC) for providing high-speed storage and access to data in QoS enabled parallel storage systems. SPIOC reduces the time of I/O operations by applying transparent compression between the computing and the storage systems. SPIOC can predict whether to compress or not at runtime, allowing parallel or sequential compression techniques, guaranteeing QoS and allowing partial and full reading by decompressing the minimum part of the file. SPIOC maximises the measured efficiency of data movement by applying run-time customising compression before storing data in the Papio storage system.

Rosa Filgueira, Malcolm Atkinson, Yusuke Tanimura, Isao Kojima
Ultra-Fast Load Balancing of Distributed Key-Value Stores through Network-Assisted Lookups

Many systems rely on distributed caches with thousands of nodes to improve response times and off-load underlying systems. Large-scale caching presents challenges in terms of resource utilization, load balancing, robustness and flexibility of deployment. In this paper, we propose a novel distributed caching method based on dynamic IP address assignment. Keys are mapped to a large IP address space statically and each node is dynamically assigned multiple IP addresses. As a result, we have a system with minimal need for central coordination, while eliminating the single point of failure in competitive solutions. We evaluate our system in our datacenter and show that our approach localizes the effect of load-balancing to only loaded cache servers, while leaving cache clients unaffected and also providing for finely-granular rebalancing.

Davide De Cesaris, Kostas Katrinis, Spyros Kotoulas, Antonio Corradi
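
The core trick is that the key-to-address mapping is static while the address-to-node assignment is dynamic. A hypothetical sketch of the static half (the address block and hash choices below are illustrative, not the paper's):

```python
import hashlib
import ipaddress

BASE = int(ipaddress.IPv4Address("10.0.0.0"))   # private block for the cache
SPACE = 2 ** 24                                 # ~16M addresses for the key space

def key_to_ip(key: str) -> str:
    """Statically map a key to one address in the large IP space."""
    h = int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")
    return str(ipaddress.IPv4Address(BASE + h % SPACE))

# Clients connect directly to this address. Rebalancing a hot node means
# moving some of its many assigned addresses to other nodes, so clients
# never need to learn a new mapping.
print(key_to_ip("user:42:profile"))
```
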

Grid, Cluster and Cloud Computing

Virtual Machine Consolidation in Cloud Data Centers Using ACO Metaheuristic

In this paper, we propose the AVVMC VM consolidation scheme that focuses on balanced resource utilization of servers across different computing resources (CPU, memory, and network I/O) with the goal of minimizing power consumption and resource wastage. Since the VM consolidation problem is strictly NP-hard and computationally infeasible for large data centers, we propose adaptation and integration of the Ant Colony Optimization (ACO) metaheuristic with balanced usage of computing resources based on vector algebra. Our simulation results show that AVVMC outperforms existing methods and achieves improvement in both energy consumption and resource wastage reduction.

Md Hasanul Ferdaus, Manzur Murshed, Rodrigo N. Calheiros, Rajkumar Buyya
Workflow Scheduling on Federated Clouds

Federated Clouds, or the orchestration of multiple Cloud services for fulfilling applications’ requirements, is receiving increasing attention. Despite their many advantages, federated Clouds also present some downsides since different services may reside in different geographically located areas. This paper focuses on evaluating the advantages and disadvantages, from the point of view of performance and financial costs, of using a federation of Clouds for executing scientific workflows. It evaluates a wide range of different workflow types with different requirements in terms of computation and communication (produced and consumed data), and discusses which kind of workflow applications can benefit from a Cloud federation and how.

Juan J. Durillo, Radu Prodan
Locality-Aware Cooperation for VM Scheduling in Distributed Clouds

The promotion of distributed Cloud Computing infrastructures as the next platform to deliver the Utility Computing paradigm leads to new virtual machine (VM) scheduling algorithms leveraging peer-to-peer approaches. Although these proposals considerably improve scalability, leading to the management of hundreds of thousands of VMs over thousands of physical machines (PMs), they do not consider the network overhead introduced by multi-site infrastructures. This overhead can have a dramatic impact on performance if there is no mechanism favoring intra-site vs. inter-site manipulations.

This paper introduces a new building block designed on top of a network with Vivaldi coordinates maximizing the locality criterion (i.e., efficient collaborations between PMs). We combined such a mechanism with DVMS, a large-scale virtual machine scheduler, and showed its benefit by discussing several experiments performed on four distinct sites of the Grid’5000 testbed. With our proposal and without changing the scheduling decision algorithm, the number of inter-site operations has been reduced by 72%. This result provides a glimpse of the promising future of using locality properties to improve the performance of massive distributed Cloud platforms.

Jonathan Pastor, Marin Bertier, Frédéric Desprez, Adrien Lebre, Flavien Quesnel, Cédric Tedeschi
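
A small sketch of the locality criterion, assuming each PM already maintains Vivaldi coordinates (the coordinates and node names below are made up for illustration):

```python
import math

# Vivaldi places nodes so that coordinate distance approximates network RTT
coords = {
    "pm-a1": (0.0, 0.1),   # site A
    "pm-a2": (0.2, 0.0),   # site A
    "pm-b1": (9.0, 8.5),   # site B, far away in latency space
}

def closest_peers(me: str, k: int = 1):
    """Pick the k nearest cooperation partners without any active probing."""
    return sorted((p for p in coords if p != me),
                  key=lambda p: math.dist(coords[me], coords[p]))[:k]

print(closest_peers("pm-a1"))   # -> ['pm-a2']: intra-site cooperation wins
```
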
Can Inter-VM Shmem Benefit MPI Applications on SR-IOV Based Virtualized Infiniband Clusters?

Single Root I/O Virtualization (SR-IOV) technology has been introduced for high-performance interconnects such as InfiniBand. Recent studies mainly focus on the performance characteristics of high-performance communication middleware (e.g. MPI) and applications on SR-IOV enabled HPC clusters. However, current SR-IOV based MPI applications do not take advantage of locality-aware communication in intra-host inter-VM environments. Although Inter-VM Shared Memory (IVShmem) has been proven to support efficient locality-aware communication, the performance benefits of IVShmem for MPI libraries on virtualized environments are yet to be explored. In this paper, we present a comprehensive performance evaluation of IVShmem backed MPI using micro-benchmarks and HPC applications. The performance evaluations show that, through IVShmem, the performance of MPI point-to-point and collective operations can be improved by up to 193% and 91%, respectively. The application performance can be improved by up to 96%, compared to SR-IOV. The results further show that IVShmem introduces only minor overhead compared to a native environment.

Jie Zhang, Xiaoyi Lu, Jithin Jose, Rong Shi, Dhabaleswar K. (DK) Panda

Green High Performance Computing

Power-Aware L1 and L2 Caches for GPGPUs

General Purpose Graphics Processing Units (GPGPUs) employ several levels of memory to execute hundreds of threads concurrently. L1 and L2 caches are critical to the performance of GPGPUs, but they are extremely power hungry due to the large number of cores they need to serve. This paper focuses on the power consumption of L1 data caches and the L2 cache in GPGPUs and proposes two optimization techniques: the first optimization technique places idle cache blocks into a drowsy state to reduce leakage power. Our evaluations show that cache blocks are idle for long intervals, and putting them into drowsy mode immediately after each access reduces leakage power dramatically with negligible impact on performance. The second optimization technique reduces the dynamic power of caches. In GPGPU applications, many warps have inactive threads due to branch divergence. Existing GPGPU architectures access cache blocks for both active and inactive threads, wasting cache power. We use the active mask of GPGPUs and access only the portion of a cache block that is required by active threads. By dynamically disabling unnecessary sections of cache blocks, we are able to reduce the dynamic power of caches significantly.

Ehsan Atoofian, Ali Manzak
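
A back-of-the-envelope model of the two savings, with made-up constants purely to make the mechanisms concrete (the paper reports measured results, not these numbers):

```python
WARP_SIZE = 32
DROWSY_LEAKAGE_FRACTION = 0.1   # assumed: a drowsy block leaks ~10% of full power

def leakage(cycles_active, cycles_idle, p_full=1.0):
    # Optimization 1: a block enters the drowsy state right after each access,
    # so it pays full leakage only while actively serving requests.
    return p_full * (cycles_active + DROWSY_LEAKAGE_FRACTION * cycles_idle)

def dynamic(accesses, active_threads, e_full_block=1.0):
    # Optimization 2: the active mask gates the access so only the sections
    # of the cache block needed by active threads are read.
    return accesses * e_full_block * active_threads / WARP_SIZE

# A divergent warp with 8 of 32 threads active pays 25% of the dynamic energy.
print(dynamic(accesses=1000, active_threads=8))
```
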
Power Consumption Due to Data Movement in Distributed Programming Models

The amount of energy consumed due to data movement poses a serious challenge when implementing and using distributed programming models. Message-passing models like MPI provide the user with explicit interfaces to initiate data transfers among distributed processes. In this work, we establish the notion that, from a programmer’s standpoint, design decisions like the size of the data payload to be transferred and the number of explicit MPI calls used to service such transfers have a direct impact on the power signatures of communication kernels. Upon a closer look, we additionally observe that the choice of the transport layer (along with the associated interconnect) and the design of the data transfer protocol both contribute to these signatures. This paper presents a fine-grained study of the impact of data movement on power and energy consumption in distributed programming models. We hope that the results discussed in this work will motivate application and system programmers to include energy consumption as one of the important design factors when targeting HPC systems.

Siddhartha Jana, Oscar Hernandez, Stephen Poole, Barbara Chapman

Distributed Systems and Algorithms

Spanning Tree or Gossip for Aggregation: A Comparative Study

Distributed aggregation queries like average and sum can be implemented in several different paradigms, including gossip and hierarchical approaches. In the literature, these two paradigms are routinely associated with stereotypes such as “trees are fragile and complicated” and “gossip is slow and expensive”. However, a closer look reveals that these statements are not backed up by thorough studies. A fair and informative comparison is clearly needed. However, it is a very hard task, because the performance of protocols from the two paradigms depends on different subtleties of the environment and the implementation of the protocols. We tackle this problem by carefully designing the comparison study. We use state-of-the-art algorithms and propose the problem of monitoring the network size in the presence of churn as the ideal problem for comparing very different paradigms for global aggregation. Our experiments help us identify the most important factors that differentiate between gossip and spanning tree aggregation: the time needed to compute a truly global output, the properties of the underlying topology, and the sensitivity to dynamism. We demonstrate the effect of these factors in different practically interesting topologies and scenarios. Our results help in choosing the right protocol given knowledge of the topology and dynamism patterns.

Lehel Nyers, Márk Jelasity
Shades: Expediting Kademlia’s Lookup Process

Kademlia is considered to be one of the most effective key based routing protocols. It is nowadays implemented in many file sharing peer-to-peer networks such as BitTorrent, KAD, and Gnutella.

This paper introduces Shades, a combined routing/caching scheme that significantly shortens the average lookup process in Kademlia and improves its load handling. The paper also includes an extensive performance study demonstrating the benefits of Shades and comparing it to other suggested alternatives using both synthetic workloads and traces from YouTube and Wikipedia.

Gil Einziger, Roy Friedman, Yoav Kantor
Analysis and Comparison of Truly Distributed Solvers for Linear Least Squares Problems on Wireless Sensor Networks

The solution of linear least squares problems across large loosely connected distributed networks (such as wireless sensor networks) requires distributed algorithms which ideally need very little or no coordination between the nodes. We first provide an extensive overview of distributed least squares solvers appearing in the literature and classify them according to their communication patterns. We are particularly interested in truly distributed algorithms which do not require a fusion centre, cluster heads or any multi-hop communication. Beyond existing methods, we propose the novel least squares solver PSDLS, which utilises a recently developed distributed QR factorisation algorithm. All communication between nodes is exclusively performed within the push-sum algorithm for distributed aggregation.

We analytically compare the communication cost of PSDLS and the existing truly distributed algorithms. In all these algorithms, the communication cost of reaching a predefined accuracy depends on many factors, including network topology, problem size, and settings of algorithm-specific parameters. We illustrate with simulation experiments that our novel PSDLS solver requires significantly fewer messages per node than the previously existing methods to reach a predefined solution accuracy.

Karl E. Prikopa, Hana Straková, Wilfried N. Gansterer

Parallel and Distributed Programming

High-Performance Computer Algebra: A Hecke Algebra Case Study

We describe the first ever parallelisation of an algebraic computation at modern HPC scale. Our case study poses challenges typical of the domain: it is a multi-phase application with dynamic task creation and irregular parallelism over complex control and data structures.

Our starting point is a sequential algorithm for finding invariant bilinear forms in the representation theory of Hecke algebras, implemented in the GAP computational group theory system. After optimising the sequential code we develop a parallel algorithm that exploits the new skeleton-based SGP2 framework to parallelise the three most computationally-intensive phases. To this end we develop a new domain-specific skeleton, parBufferTryReduce. We report good parallel performance both on a commodity cluster and on a national HPC, delivering speedups up to 548 over the optimised sequential implementation on 1024 cores.

Patrick Maier, Daria Livesey, Hans-Wolfgang Loidl, Phil Trinder
Generic Deterministic Random Number Generation in Dynamic-Multithreaded Platforms

On dynamic multithreaded platforms with on-line scheduling such as work-stealing, randomized computations raise the issue of reproducibility. Compliant with de facto standard sequential Deterministic Random Number Generators (DRNGs), noted R, we propose a parallel DRNG implementation for finite computations that provides deterministic parallel execution. It uses the stateless sub-stream approach, enabling the use of efficient DRNGs such as Mersenne Twister or Linear Congruential generators. We demonstrate that if R provides fast jump-ahead in the random sequence, the re-seeding overhead is small, polylog in expectation, independently of the parallel computation’s depth. Experiments benchmark the performance of randomized algorithms employing our solution against the stateful DRNG DotMix, tailored to the Cilk Plus dynamic multithreading runtime. The overhead of our implementation ParDRNG<R> compares favorably to the linear overhead of DotMix re-seedings.

Stefano Mor, Jean-Louis Roch, Nicolas Maillard
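
The following sketches the stateless sub-stream idea using NumPy's counter-based Philox generator, which offers the fast jump-ahead the approach requires; it illustrates the concept only, not the authors' ParDRNG interface or the DotMix comparison:

```python
import numpy as np

MASTER_SEED = 2014

def task_rng(task_id: int) -> np.random.Generator:
    """Deterministic sub-stream for a task, independent of which worker runs it.

    .jumped(k) skips ahead k * 2**128 draws in O(1), so "re-seeding" a task
    is a cheap jump into one master sequence rather than carrying mutable
    generator state across the dynamic schedule.
    """
    return np.random.Generator(np.random.Philox(MASTER_SEED).jumped(task_id))

# The same logical task sees identical randomness under any schedule:
assert task_rng(7).random() == task_rng(7).random()
print(task_rng(7).random(), task_rng(8).random())
```
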
Implementation and Performance Analysis of SkelGIS for Network Mesh-Based Simulations

Implicit parallelism is an active domain of computer science that aims to hide the intricate details of parallelization from the end-user. Some solutions are specific to a precise domain while others are more generic; however, the purpose is always to find the right level of abstraction to ease high-performance parallel programming. We present SkelGIS, a header-only implicit-parallelism C++ library for mesh-based scientific simulations. This paper details the implementation of SkelGIS for the specific case of network simulations, where the space domain can be represented as a directed acyclic graph (DAG). This implementation is based on a modified, optimized and parallelized version of the Compressed Sparse Row format, which is completely described in this paper. Finally, experiments on different kinds of clusters and different sizes of DAGs are evaluated.

Hélène Coullon, Sébastien Limet
GoFFish: A Sub-graph Centric Framework for Large-Scale Graph Analytics

Vertex centric models for large scale graph processing are gaining traction due to their simple distributed programming abstraction. However, pure vertex centric algorithms under-perform due to large communication overheads and slow iterative convergence. We introduce GoFFish, a scalable sub-graph centric framework co-designed with a distributed persistent graph storage for large scale graph analytics on commodity clusters, offering the added natural flexibility of shared memory sub-graph computation. We map Connected Components, SSSP and PageRank algorithms to this model and empirically analyze them for several real world graphs, demonstrating orders of magnitude improvements, in some cases, compared to Apache Giraph’s vertex centric framework.

Yogesh Simmhan, Alok Kumbhare, Charith Wickramaarachchi, Soonil Nagarkar, Santosh Ravi, Cauligi Raghavendra, Viktor Prasanna
Resolving Semantic Conflicts in Word Based Software Transactional Memory

In this paper we describe a technique for addressing semantic conflicts within word based Software Transactional Memory. A semantic conflict is considered to be some application condition which causes transactions to explicitly abort. Session locking and a companion Contention Management Policy are described which support the parallel exploration of multiple transaction schedules at run time, to resolve semantic conflicts. Performance figures are provided to demonstrate the effectiveness of our technique when semantic conflicts are introduced into established benchmarks.

Craig Sharp, William Blewitt, Graham Morgan
Automatic Tuning of the Parallelism Degree in Hardware Transactional Memory

Transactional Memory (TM) is an emerging paradigm that promises to ease the development of parallel applications. Due to its inherently speculative nature, however, TM can suffer performance degradation in the presence of conflict-intensive workloads. A key technique to tackle this issue consists in dynamically regulating the number of concurrent threads, which allows for selecting the concurrency level that best fits the intrinsic parallelism of specific applications. In this area, several self-tuning approaches have been proposed for software-based implementations of TM (STM). In this paper we investigate the effectiveness of these techniques when applied to Hardware TM (HTM), a theme that is particularly relevant and timely given the recent integration of hardware support for TM in the next generation of mainstream Intel processors. Our study, conducted on Intel’s implementation of HTM, identifies several issues with employing techniques originally conceived for STM. Motivated by these findings, we propose an innovative machine learning based technique explicitly designed to take into account the peculiarities of HTM systems, and demonstrate its advantages, in terms of higher accuracy and shorter learning times, using the STAMP benchmark suite.

Diego Rughetti, Paolo Romano, Francesco Quaglia, Bruno Ciciani

Parallel Numerical Algorithms

A Distributed CPU-GPU Sparse Direct Solver

This paper presents the first hybrid MPI+OpenMP+CUDA implementation of a distributed memory right-looking unsymmetric sparse direct solver (i.e., sparse LU factorization) that uses static pivoting. While BLAS calls can account for more than 40% of the overall factorization time, the difficulty is that small problem sizes dominate the workload, making efficient GPU utilization challenging. This fact motivates our approach, which is to find ways to aggregate collections of small BLAS operations into larger ones; to schedule operations to achieve load balance and hide long-latency operations, such as PCIe transfer; and to exploit simultaneously all of a node’s available CPU cores and GPUs.

Piyush Sao, Richard Vuduc, Xiaoye Sherry Li
Parallel Computation of Echelon Forms

We propose efficient parallel algorithms and implementations on shared memory architectures of LU factorization over a finite field. Compared to the corresponding numerical routines, we have identified three main specificities of linear algebra over finite fields. First, the arithmetic complexity can be dominated by modular reductions. Therefore, it is mandatory to delay these reductions as much as possible while mixing fine-grain parallelizations of tiled iterative and recursive algorithms. Second, fast linear algebra variants, e.g., using the Strassen-Winograd algorithm, never suffer from instability and can thus be widely used in cascade with the classical algorithms. Here, trade-offs are to be made between block sizes well suited to those fast variants and block sizes suited to load and communication balancing. Third, many applications over finite fields require the rank profile of the matrix (quite often rank deficient) rather than the solution to a linear system. It is thus important to design parallel algorithms that preserve and compute this rank profile. Moreover, as the rank profile is only discovered during the algorithm, the block size has to be dynamic. We propose and compare several block decompositions: tile iterative with left-looking, right-looking and Crout variants, slab and tile recursive. Experiments demonstrate that the tile recursive variant performs better and matches the performance of reference numerical software when no rank deficiency occurs. Furthermore, even in the most heterogeneous case, namely when all pivot blocks are rank deficient, we show that it is possible to maintain a high efficiency.

Jean-Guillaume Dumas, Thierry Gautier, Clément Pernet, Ziad Sultan
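
The first point, delaying modular reductions, is easy to illustrate with a generic sketch over Z_p (not the paper's actual kernels; the prime and data are arbitrary):

```python
p = 131071                 # a word-sized prime (illustrative)

def dot_eager(a, b):
    s = 0
    for x, y in zip(a, b):
        s = (s + x * y) % p          # one modular reduction per term
    return s

def dot_delayed(a, b):
    # Accumulate without reducing; this is safe as long as the accumulator
    # cannot overflow, which bounds how many terms may be accumulated
    # between reductions in fixed-width implementations.
    return sum(x * y for x, y in zip(a, b)) % p   # single final reduction

a, b = [3, 5, 7, 11], [2, 4, 6, 8]
assert dot_eager(a, b) == dot_delayed(a, b)
```
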
Time-Domain BEM for the Wave Equation: Optimization and Hybrid Parallelization

The problem of time-domain BEM for the wave equation in acoustics and electromagnetism can be expressed as a sparse linear system composed of multiple interaction/convolution matrices. It can be solved using sparse matrix-vector products, which are inefficient at achieving a high Flop-rate. In this paper we present a novel approach based on re-ordering the interaction matrices in slices. We end up with a custom multi-vector/vector product operation and compute it using SIMD intrinsic functions. We take advantage of the new order of the computation to parallelize in shared and distributed memory. We demonstrate the performance of our system by studying the sequential Flop-rate and the parallel scalability, and provide results based on an industrial test-case with up to 32 nodes.

Berenger Bramas, Olivier Coulaud, Guillaume Sylvand
Structured Orthogonal Inversion of Block p-Cyclic Matrices on Multicores with GPU Accelerators

We present a block structured orthogonal factorization (BSOF) algorithm and its parallelization for computing the inversion of block p-cyclic matrices. We aim at high performance on multicores with GPU accelerators. We provide a quantitative performance model for optimal host-device load balance, and validate the model through numerical tests. Benchmarking results show that the parallel BSOF based inversion algorithm attains up to 90% of DGEMM performance on hybrid CPU+GPU systems.

Sergiy Gogolenko, Zhaojun Bai, Richard Scalettar

Multicore and Manycore Programming

High-Throughput Maps on Message-Passing Manycore Architectures: Partitioning versus Replication

The advent of manycore architectures raises new scalability challenges for concurrent applications. Implementing scalable data structures is one of them. Several manycore architectures provide hardware message passing as a means to efficiently exchange data between cores. In this paper, we study the implementation of high-throughput concurrent maps in message-passing manycores. Partitioning and replication are the two approaches to achieve high throughput in a message-passing system. Our paper presents and compares different strongly-consistent map algorithms based on partitioning and replication. To assess the performance of these algorithms independently of architecture-specific features, we propose a communication model of message-passing manycores to express the throughput of each algorithm. The model is validated through experiments on a 36-core TILE-Gx8036 processor. Evaluations show that replication outperforms partitioning only in a narrow domain.

Omid Shahmirzadi, Thomas Ropars, André Schiper
A Fast Sparse Block Circulant Matrix Vector Product

In the context of computed tomography (CT), iterative image reconstruction techniques are gaining attention because high-quality images are becoming computationally feasible. They involve the solution of large systems of equations, whose cost is dominated by the sparse matrix vector product (SpMV). Our work considers the case of the sparse matrices being block circulant, which arises when taking advantage of the rotational symmetry in the tomographic system. Besides the straightforward storage saving, we exploit the circulant structure to rewrite the poor-performance SpMVs into a high-performance product between sparse and dense matrices. This paper describes the implementations developed for multi-core CPUs and GPUs, and presents experimental results with typical CT matrices. The presented approach is up to ten times faster than without exploiting the circulant structure.

Eloy Romero, Andrés Tomás, Antonio Soriano, Ignacio Blanquer
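
A sketch of the reformulation in NumPy/SciPy, with random data (block (i, j) of a block circulant matrix equals B[(j - i) mod R], as arises from R-fold rotational symmetry):

```python
import numpy as np
from scipy import sparse

R, m, n = 4, 50, 60                       # R-fold symmetry, m-by-n blocks
B = [sparse.random(m, n, density=0.05, format="csr") for _ in range(R)]
X = np.random.rand(n, R)                  # column j holds slice x_j of the vector

# y_i = sum_k B_k @ x_{(i+k) mod R}: R high-throughput sparse-by-dense
# products (SpMM) replace R*R poor-performance SpMVs.
Y = np.zeros((m, R))
for k in range(R):
    Z = B[k] @ X                          # one SpMM per distinct block
    for i in range(R):
        Y[:, i] += Z[:, (i + k) % R]      # gather the rotated contributions

# Check against the explicitly assembled block circulant matrix.
A = sparse.bmat([[B[(j - i) % R] for j in range(R)] for i in range(R)])
assert np.allclose(Y.T.ravel(), A @ X.T.ravel())
```
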
Scheduling Data Flow Program in XKaapi: A New Affinity Based Algorithm for Heterogeneous Architectures

Efficient implementations of parallel applications on heterogeneous hybrid architectures require a careful balance between computations and communications with accelerator devices. Even if most of the communication time can be overlapped by computations, it is essential to reduce the total volume of communicated data. The literature therefore abounds with ad hoc methods to reach that balance, but these are architecture and application dependent. We propose here a generic mechanism to automatically optimize the scheduling between CPUs and GPUs, and compare two strategies within this mechanism: the classical Heterogeneous Earliest Finish Time (HEFT) algorithm and our new, parametrized, Distributed Affinity Dual Approximation algorithm (DADA), which consists in grouping the tasks by affinity before running a fast dual approximation. We ran experiments on a heterogeneous parallel machine with twelve CPU cores and eight NVIDIA Fermi GPUs. Three standard dense linear algebra kernels from the PLASMA library have been ported on top of the XKaapi runtime system. We report their performances. It results that HEFT and DADA perform well for various experimental conditions, but that DADA performs better for larger systems and numbers of GPUs, and, in most cases, generates much lower data transfers than HEFT to achieve the same performance.

Raphaël Bleuse, Thierry Gautier, João V. F. Lima, Grégory Mounié, Denis Trystram
Delegation Locking Libraries for Improved Performance of Multithreaded Programs

While standard locking libraries are common and easy to use, delegation algorithms that offload work to a single thread can achieve better performance in multithreaded applications, but are hard to use without adequate library support. This paper presents an interface for delegation locks together with libraries for C and C++ that make it easy to use queue delegation locking, a versatile high-performance delegation algorithm. We show examples of using these libraries, discuss the porting effort needed to take full advantage of delegation locking in applications designed with standard locking in mind, and the improved performance that this achieves.

David Klaftenegger, Konstantinos Sagonas, Kjell Winblad
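
A minimal Python rendition of the delegation idea, using a dedicated server thread for simplicity (in queue delegation locking proper, the current lock holder plays this role and delegating threads may continue without waiting):

```python
import queue
import threading

class DelegationLock:
    """Threads delegate closures; a single thread applies them to the shared
    state, so the protected data stays hot in one core's cache."""
    def __init__(self, state):
        self.state = state
        self.ops = queue.Queue()
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        while True:
            op, done = self.ops.get()
            op(self.state)               # only this thread touches the state
            done.set()

    def delegate(self, op):
        done = threading.Event()
        self.ops.put((op, done))
        done.wait()                      # QD locking can skip this wait

shared = DelegationLock({"n": 0})
def increment(s): s["n"] += 1
workers = [threading.Thread(target=shared.delegate, args=(increment,))
           for _ in range(8)]
for w in workers: w.start()
for w in workers: w.join()
print(shared.state["n"])                 # 8
```
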
A Generic Strategy for Multi-stage Stencils

Stencil computations on regular grids are widely used in scientific simulations. Optimization techniques for such stencil computations typically exploit temporal locality across time steps. More complex stencil applications, like those in meteorology and seismic simulations, cannot easily take advantage of these techniques, since the number of physical fields and computation stages to consider at each time step flushes all data present in the cache before the next time step begins. In this paper we present a technique for improving the performance of such computations, based only on spatial tiling, which is implemented as a generic algorithm.

More specifically, we investigate how to take advantage of producer-consumer relations of stencil loops, in a single time step, to improve memory hierarchy utilization. This approach makes it possible to balance computation and communication to improve resource usage. We implement our methods using generic programming constructs of C++, which we compare with hand-tuned implementations of the stencils. The results show that this technique can improve both single-threaded and multi-threaded performance to closely match that of hand-tuned implementations, with the convenience of a high-level specification.

Mauro Bianco, Benjamin Cumming
Evaluation of OpenMP Task Scheduling Algorithms for Large NUMA Architectures

The current generation of high-performance computing platforms tends to feature a large number of cores, so applications have to expose fine-grained parallelism to run efficiently. Since version 3.0, the OpenMP standard provides a way to express such parallelism through tasks. Because the task scheduling strategy is implementation defined, each runtime can exhibit different behavior and efficiency. Nevertheless, the hierarchical character of current parallel computing systems is rarely taken into account, which can lead to a loss of performance on large multicore NUMA systems. This paper studies multiple task scheduling algorithms with a configurable scheduler. It relies on a topology-aware, tree-based representation of the computing platform to orchestrate the execution and load balancing of OpenMP tasks. High-end users can select the task-list granularity according to the tree structure and choose the most convenient work-stealing strategy. One of these strategies takes data locality into account with the help of the hierarchical view. It performs well on unbalanced codes from the BOTS benchmarks, in comparison to the Intel and GNU OpenMP runtimes on 16-core and 128-core systems.
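For reference, the kind of fine-grained, irregular task parallelism such schedulers must balance looks like the classic recursive pattern below (our minimal example in the BOTS style; the paper's configurable scheduler itself is not shown):

    // Classic recursive task parallelism in the style of the BOTS benchmarks
    // (our minimal example; the paper's scheduler is not reproduced here).
    #include <cstdio>

    static long fib(long n) {
        if (n < 2) return n;
        long a, b;
        #pragma omp task shared(a)
        a = fib(n - 1);
        #pragma omp task shared(b)
        b = fib(n - 2);
        #pragma omp taskwait
        return a + b;
    }

    int main() {
        long r = 0;
        #pragma omp parallel
        #pragma omp single
        r = fib(30);                 // spawns an unbalanced tree of tasks
        std::printf("fib(30) = %ld\n", r);
    }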

Jérôme Clet-Ortega, Patrick Carribault, Marc Pérache

Theory and Algorithms for Parallel Computation

Power-Aware Replica Placement in Tree Networks with Multiple Servers per Client

In this paper, we revisit the well-studied problem of replica placement in tree networks. Rather than minimizing the number of servers needed to serve all client requests, we aim at minimizing the total power consumed by these servers. In addition, we use the most general (and powerful) server assignment policy, where the requests of a client can be served by multiple servers located in the (unique) path from this client to the root of the tree. We consider multi-modal servers that can operate at a set of discrete speeds, using the dynamic voltage and frequency scaling (DVFS) technique. The optimization problem is to determine an optimal location of the servers in the tree, as well as the speed at which each server is operated. A major result is the NP-completeness of this problem, to be contrasted with the minimization of the number of servers, which has polynomial complexity. Another important contribution is the formulation of a Mixed Integer Linear Program (MILP) for the problem, together with the design of several polynomial-time heuristics. We assess the efficiency of these heuristics by simulation. For mid-size instances (up to 30 nodes in the tree), we evaluate their absolute performance by comparison with the optimal solution (obtained via the MILP). The most efficient heuristics provide satisfactory results, within 20% of the optimal solution.

Guillaume Aupy, Anne Benoit, Matthieu Journault, Yves Robert
On Constructing DAG-Schedules with Large AREAs

The Area of a schedule Σ for a dag measures the rate at which Σ renders the dag's nodes eligible for execution. Specifically, AREA(Σ) is the average number of nodes that are eligible for execution as Σ executes the dag node by node. Extensive simulations suggest that, for many distributions of processor availability and power, schedules having larger Areas execute dags faster on platforms that are dynamically heterogeneous: their processors change power and availability status in unpredictable ways and at unpredictable times. While Area-maximal schedules exist for every dag, efficient generators of such schedules are known only for well-structured dags. We prove that the general problem of crafting Area-maximal schedules is NP-complete, hence likely computationally intractable. This situation motivates the development of heuristics for producing dag-schedules that have large Areas. We build on the Sidney decomposition of a dag to develop a polynomial-time heuristic, Sidney, whose schedules have quite large Areas. (1) Simulations on dags having random structure indicate that Sidney's schedules have Areas: (a) at least 85% of maximal; (b) at least 1.25 times larger than those produced by previous heuristics. (2) Simulations on dags having the structure of random LEGO® dags indicate that Sidney's schedules have Areas that are at least 1.5 times larger than those produced by previous heuristics. The "85%" result emerges from an LP-based formulation of the Area-maximization problem. (3) Our results on random dags are roughly matched by a second heuristic that emerges directly from the LP formulation.
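To make the definition concrete, here is a minimal sketch (ours; the adjacency-list dag representation and names are assumptions) that computes AREA(Σ) for a schedule given as a topological order:

    // Computing AREA(sigma) exactly as defined above. preds[v] lists the
    // predecessors of node v; sigma is a topological order of the dag.
    #include <cstddef>
    #include <vector>

    double area(const std::vector<std::vector<int>>& preds,
                const std::vector<int>& sigma) {
        const std::size_t n = sigma.size();
        std::vector<bool> done(n, false);
        double total = 0.0;
        for (std::size_t step = 0; step < n; ++step) {
            done[sigma[step]] = true;          // execute the next node
            std::size_t eligible = 0;          // nodes whose predecessors are done
            for (std::size_t v = 0; v < n; ++v) {
                if (done[v]) continue;
                bool ready = true;
                for (int p : preds[v])
                    if (!done[p]) { ready = false; break; }
                if (ready) ++eligible;
            }
            total += eligible;
        }
        return total / n;                      // average eligibility = AREA
    }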

Scott T. Roche, Arnold L. Rosenberg, Rajmohan Rajaraman

High Performance Networks and Communication

Software Defined Multicasting for MPI Collective Operation Offloading with the NetFPGA

Collective operations play a key role in the performance of many high performance computing applications and are central to the widely used Message Passing Interface (MPI) programming model. In this paper we explore the use of programmable networking devices to accelerate collective operations by offloading functionality to the underlying network. In our work we utilize a networked FPGA in conjunction with commercial OpenFlow switches supporting multicast. The union of hardware-configurable network interfaces with Software Defined Networking (SDN) provides a significant opportunity to improve the performance of MPI applications that rely heavily on collective operations. The programmable interfaces implement collective operations in hardware using OpenFlow-supported multicast. In our 8-node cluster, we observed up to a 12% reduction in MPI_Allreduce latency with dynamic schemes employing SDN, and up to a 22% reduction with static topologies. The results suggest further benefits if our approach is deployed in larger settings with low-latency switches.

Omer Arap, Geoffrey Brown, Bryce Himebaugh, Martin Swany
MapReduce over Lustre: Can RDMA-Based Approach Benefit?

MapReduce is increasingly being deployed on High Performance Computing (HPC) clusters. Different studies reveal that, by leveraging the benefits of high-performance interconnects like InfiniBand in these clusters, faster MapReduce job execution can be obtained through additional performance-enhancing features. Although RDMA-enhanced MapReduce has been proven to provide faster solutions over the Hadoop distributed file system, its efficiency over the parallel file systems used in HPC clusters is yet to be explored. In this paper, we present a complete methodology for evaluating MapReduce over the Lustre file system, providing insights into the interactions of different system components in HPC clusters. Our performance evaluation shows that RDMA-enhanced MapReduce can achieve significant benefits in terms of execution time (49% in a 128-node HPC cluster) and resource utilization, compared to the default architecture. To the best of our knowledge, this is the first attempt to evaluate RDMA-enhanced MapReduce over the Lustre file system on HPC clusters.

Md. Wasi-ur Rahman, Xiaoyi Lu, Nusrat Sharmin Islam, Raghunath Rajachandrasekar, Dhabaleswar K. (DK) Panda

High Performance and Scientific Applications

Random Fields Generation on the GPU with the Spectral Turning Bands Method

Random field (RF) generation algorithms are of paramount importance for many scientific domains, such as astrophysics, geostatistics, computer graphics, and many others. Examples include the generation of initial conditions for cosmological simulations and the driving of hydrodynamical turbulence; in the latter, a new random field is needed at every time step. Current approaches commonly make use of the 3D FFT (Fast Fourier Transform) and require the whole generated field to be stored in memory. Moreover, they are limited to regular rectilinear meshes and need an extra processing step to support non-regular meshes.

In this paper, we introduce TBARF (Turning BAnd Random Fields), an RF generation algorithm based on the turning bands method that is optimized for massively parallel hardware such as GPUs. Our algorithm replaces the 3D FFT with a lower-order, one-dimensional FFT followed by a projection step, and is further optimized with loop unrolling and blocking. We show that TBARF can easily generate RFs on non-regular (non-uniform) meshes and can handle mesh sizes larger than the available GPU memory by using a streaming, out-of-core approach. TBARF is 2 to 5 times faster than traditional methods when generating RFs with more than 16M cells, and it has been successfully applied to two real-world scenarios: planetary nebulae and cosmological simulations.
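As a rough, hypothetical illustration of the projection step at the heart of turning-bands generation (our sketch, not TBARF's GPU kernels; the cosine 1D process is a stand-in for a proper spectral simulation):

    // Hypothetical turning-bands-flavoured sketch (ours): precompute random
    // lines, each carrying a simple 1D cosine process; a point's value is the
    // normalized sum of the processes evaluated at its projection onto each line.
    #include <cmath>
    #include <random>
    #include <vector>

    struct Vec3 { double x, y, z; };
    struct Band { Vec3 dir; double freq, phase; };

    std::vector<Band> makeBands(int n, std::mt19937& rng) {
        const double pi = std::acos(-1.0);
        std::uniform_real_distribution<double> u(0.0, 1.0);
        std::vector<Band> bands(n);
        for (auto& b : bands) {
            double theta = 2 * pi * u(rng), z = 2 * u(rng) - 1;
            double r = std::sqrt(1 - z * z);
            b.dir = {r * std::cos(theta), r * std::sin(theta), z}; // unit vector
            b.freq = 1.0 + 10.0 * u(rng);     // toy spectrum, not a real model
            b.phase = 2 * pi * u(rng);
        }
        return bands;
    }

    double fieldValue(const Vec3& p, const std::vector<Band>& bands) {
        double sum = 0.0;
        for (const auto& b : bands) {
            double t = p.x * b.dir.x + p.y * b.dir.y + p.z * b.dir.z; // projection
            sum += std::cos(b.freq * t + b.phase);                    // 1D process
        }
        return sum / std::sqrt(double(bands.size()));                 // normalize
    }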

Lars Hunger, Biagio Cosenza, Stefan Kimeswenger, Thomas Fahringer
Fast Set Intersection through Run-Time Bitmap Construction over PForDelta-Compressed Indexes

Set intersection is a fundamental operation for evaluating conjunctive queries in the context of scientific data analysis. The state-of-the-art approach in performing set intersection, compressed bitmap indexing, achieves high computational efficiency because of cheap bitwise operations; however, overall efficiency is often nullified by the HPC I/O bottleneck, because compressed bitmap indexes typically exhibit a heavy storage footprint. Conversely, the recently-presented PForDelta-compressed index has been demonstrated to be storage-lightweight, but has limited performance for set intersection. Thus, a more effective set intersection approach should be efficient in both computation and I/O.

Therefore, we propose a fast set intersection approach that couples the storage-lightweight PForDelta indexing format with computationally efficient bitmaps through a specialized on-the-fly conversion. The resultant challenge is to ensure this conversion process is fast enough to maintain the performance gains from both PForDelta and the bitmaps. To this end, we contribute two key enhancements to PForDelta, BitRun and BitExp, which improve bitmap conversion through bulk bit-setting and a more streamlined PForDelta decoding process, respectively. Our experimental results show that our integrated PForDelta-bitmap method speeds up conjunctive queries by up to 7.7x versus the state-of-the-art approach, while using indexes that require 15%-60% less storage in most cases.
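A minimal sketch of the convert-then-intersect idea (ours; BitRun's actual bulk bit-setting over PForDelta-decoded runs is more elaborate):

    // Sketch (ours): decode sorted row IDs into a word-aligned bitmap, then
    // intersect two bitmaps with bitwise AND.
    #include <cstdint>
    #include <vector>

    std::vector<std::uint64_t> toBitmap(const std::vector<std::uint32_t>& ids,
                                        std::uint32_t universe) {
        std::vector<std::uint64_t> bm((universe + 63) / 64, 0);
        for (std::uint32_t id : ids)
            bm[id >> 6] |= 1ULL << (id & 63);  // one bit per decoded ID
        return bm;
    }

    void intersectInPlace(std::vector<std::uint64_t>& a,
                          const std::vector<std::uint64_t>& b) {
        for (std::size_t i = 0; i < a.size(); ++i)
            a[i] &= b[i];                      // one cheap AND per 64 elements
    }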

Xiaocheng Zou, Sriram Lakshminarasimhan, David A. Boyuka II, Stephen Ranshous, Houjun Tang, Scott Klasky, Nagiza F. Samatova
Hybrid CPU/GPU Acceleration of Detection of 2-SNP Epistatic Interactions in GWAS

High-throughput genotyping technologies allow the collection of up to a few million genetic markers (such as SNPs) from an individual within a few minutes. Detecting epistasis, such as 2-SNP interactions, in Genome-Wide Association Studies is an important but time-consuming operation, since statistical computations have to be performed for each pair of measured markers. In this work we present EpistSearch, a parallelized tool that, following the log-linear model approach, uses a novel filter to determine the interactions between all SNP pairs. Our tool is parallelized using a hybrid combination of Pthreads and CUDA in order to take advantage of CPU/GPU architectures. Experimental results with simulated and real datasets show that EpistSearch outperforms previous approaches, whether they use GPUs or only CPU cores. For instance, an exhaustive analysis of a real-world dataset with 500,000 SNPs and 5,000 individuals requires less than 42 minutes on a machine with 6 CPU cores and a GTX Titan GPU.

Jorge González-Domínguez, Bertil Schmidt, Jan Christian Kässens, Lars Wienbrandt
IFM: A Scalable High Resolution Flood Modeling Framework

Accurate and timely flood forecasts are essential for the effective management of flood disasters, which have become increasingly frequent over the last decade. Obtaining such forecasts requires high-resolution integrated weather and flood models whose computational costs are optimized to provide sufficient lead time. Existing overland flood modeling software packages do not readily scale to topography grids of large size and only permit coarse-resolution modeling of large regions. In this paper, we present a highly scalable, integrated flood forecasting system called IFM that runs on both shared and distributed memory architectures, effectively allowing the computation of domains with billions of cells. In order to optimize IFM for large areas, we focus on the computationally expensive overland routing engine. We describe a parallelization scheme and novel strategies to partition irregular domains that minimize load imbalance in the presence of memory constraints, resulting in a 40% reduction in time compared to the best uniform partitioning. We demonstrate the scalability of the proposed approach for up to 8192 processors on large-scale real-world domains. Our model can provide a 48-hour flood forecast on a watershed of 656 million cells in under 5 minutes.

Swati Singhal, Sandhya Aneja, Frank Liu, Lucas Villa Real, Thomas George
High Performance Pseudo-analytical Simulation of Multi-Object Adaptive Optics over Multi-GPU Systems

Multi-object adaptive optics (MOAO) is a novel adaptive optics (AO) technique dedicated to the special case of wide-field multi-object spectrographs (MOS). It applies dedicated wavefront corrections to numerous independent tiny patches spread over a large field of view (FOV). Each deformable mirror (DM) is controlled individually using a tomographic reconstruction of the phase based on measurements from a number of wavefront sensors (WFS) pointing at natural and artificial guide stars in the field. The output of this study supports the design of a new instrument called MOSAIC, a multi-object spectrograph proposed for the European Extremely Large Telescope (E-ELT). We have developed a novel hybrid pseudo-analytical simulation scheme that allows us to simulate the tomographic problem accurately and in detail. The main challenge resides in the computation of the tomographic reconstructor, which involves the pseudo-inversion of a large dense symmetric matrix. The pseudo-inverse is computed using an eigenvalue decomposition, based on the divide-and-conquer algorithm, on multicore systems with multiple GPUs. Thanks to a new symmetric matrix-vector product (SYMV) multi-GPU kernel, our overall implementation achieves significant speedups over standard numerical libraries on multicore systems, such as Intel MKL, and up to 60% speedups over the standard MAGMA implementation on 8 Kepler K20c GPUs. At 40,000 unknowns, this is, to our knowledge, the largest-scale tomographic AO matrix solver computed to date, and it opens new research directions for extreme-scale AO simulations.
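The pseudo-inversion step can be illustrated with a small sketch using the Eigen library (our single-CPU illustration; the paper's divide-and-conquer multi-GPU solver is far more involved):

    // Pseudo-inverse of a symmetric matrix via eigendecomposition (ours).
    #include <cmath>
    #include <Eigen/Dense>

    Eigen::MatrixXd pseudoInverse(const Eigen::MatrixXd& A, double tol = 1e-12) {
        Eigen::SelfAdjointEigenSolver<Eigen::MatrixXd> es(A);  // A is symmetric
        Eigen::VectorXd inv = es.eigenvalues();
        for (int i = 0; i < inv.size(); ++i)                   // invert spectrum,
            inv[i] = std::abs(inv[i]) > tol ? 1.0 / inv[i] : 0.0; // drop tiny modes
        return es.eigenvectors() * inv.asDiagonal() * es.eigenvectors().transpose();
    }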

Ahmad Abdelfattah, Eric Gendron, Damien Gratadour, David Keyes, Hatem Ltaief, Arnaud Sevin, Fabrice Vidal
Parallel Dual Tree Traversal on Multi-core and Many-core Architectures for Astrophysical N-body Simulations

In astrophysical N-body simulations, Dehnen's algorithm, implemented in the serial falcON code and based on a dual tree traversal, is faster than serial Barnes-Hut tree-codes, but outperformed by parallel CPU and GPU tree-codes. In this paper, we present a parallel dual tree traversal, implemented in the pfalcON code, targeting multicore CPUs and many-core architectures (Xeon Phi). We focus here on both performance and portability, while preserving Dehnen's original algorithm. We first use task parallelism, with either OpenMP or Intel TBB, for the dual tree traversal. We then rely on the SPMD (single-program, multiple-data) model for the SIMD vectorization of the near-field part, thanks to the Intel SPMD Program Compiler. We compare the pfalcON performance to related work, and finally obtain performance results that match one of the best current tree-code implementations on GPU.
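A generic dual tree traversal skeleton with task parallelism might look as follows (our sketch; the cell type, acceptance test, and interaction kernels are placeholder stand-ins, and the function is assumed to be called from within an OpenMP parallel/single region):

    // Generic dual tree traversal skeleton (ours, purely illustrative).
    struct Cell { Cell* child[2]; bool leaf; };

    static bool wellSeparated(const Cell*, const Cell*) { return false; } // stub
    static void interactFar(Cell*, Cell*)  {}  // far-field approximation (stub)
    static void interactNear(Cell*, Cell*) {}  // direct near-field sums (stub)

    static void dualTraversal(Cell* a, Cell* b) {
        if (wellSeparated(a, b)) { interactFar(a, b); return; }
        if (a->leaf && b->leaf)  { interactNear(a, b); return; }
        Cell* s = !a->leaf ? a : b;        // split a non-leaf cell
        Cell* o = (s == a) ? b : a;
        for (int i = 0; i < 2; ++i) {
            #pragma omp task               // subtraversals become tasks
            dualTraversal(s->child[i], o);
        }
        #pragma omp taskwait
    }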

Benoit Lange, Pierre Fortin

GPU and Accelerator Computing

Customizing Driving Directions with GPUs

Computing driving directions interactively on continental road networks requires preprocessing. This step can be costly, limiting our ability to incorporate new optimization functions, including traffic information or personal preferences. We show how the performance of the state-of-the-art customizable route planning (CRP) framework can be boosted by GPUs, even though it has a highly irregular structure. Our experimental study reveals that our method is an order of magnitude faster than a highly optimized parallel CPU implementation, enabling interactive personalized driving directions at continental scale.

Daniel Delling, Moritz Kobitzsch, Renato F. Werneck
GPU Accelerated Range Trees with Applications

Range searching is a primal problem in computational geometry with applications to database systems, mobile computing, geographical information systems, and the like. Defined simply, the problem is to preprocess a given set of points in a d-dimensional space so that the points that lie inside an orthogonal query rectangle can be efficiently reported.

Many practical applications of range trees require one to process a massive amount of points and a massive number of queries. In this context, we propose an efficient parallel implementation of range trees on manycore architectures such as GPUs. We extend our implementation to query processing. While queries can be batched together to exploit inter-query parallelism, we also utilize intra-query parallelism. This inter- and intra-query parallelism greatly reduces the per query latency thereby increasing the throughput. On an input of 1 M points in a 2-dimensional space, our implementation on a single Nvidia GTX 580 GPU for constructing a range tree shows an improvement of 12X over a 12-threaded CPU implementation. We also achieve an average throughput of 10 M queries per second for answering 4 M queries on a range tree containing 1 M points on a Nvidia GTX 580 GPU. We extend our implementation to an application where we seek to report the set of maximal points in a given orthogonal query rectangle.

Manoj Kumar Maramreddy, Kishore Kothapalli
Scalable On-Board Multi-GPU Simulation of Long-Range Molecular Dynamics

Molecular dynamics simulations allow us to study the behavior of complex biomolecular systems by modeling the pairwise interaction forces between all atoms. Molecular systems are subject to slowly decaying electrostatic potentials, which turn molecular dynamics into an n-body problem. In this paper, we present a parallel and scalable solution to compute long-range molecular forces, based on the multilevel summation method (MSM). We first demonstrate an optimization of MSM that replaces 3D convolutions with FFTs, and we achieve a single-GPU performance comparable to the particle mesh Ewald (PME) method, the de facto standard for long-range molecular force computation. But most importantly, we propose a distributed MSM that avoids the scalability difficulties of PME. Our distributed solution is based on a spatial partitioning of the MSM multilevel grid, together with massively parallel algorithms for interface update and synchronization. We demonstrate the scalability of our approach on an on-board multi-GPU platform.

Marcos Novalbos, Jaime González, Miguel A. Otaduy, Roberto Martinez-Benito, Alberto Sanchez
Resolution of Linear Algebra for the Discrete Logarithm Problem Using GPU and Multi-core Architectures

In cryptanalysis, solving the discrete logarithm problem (DLP) is key to assessing the security of many public-key cryptosystems. The index-calculus methods, which attack the DLP in multiplicative subgroups of finite fields, require solving large sparse systems of linear equations modulo large primes. This article deals with how we can run this computation on GPU- and multi-core-based clusters featuring InfiniBand networking. More specifically, we present the sparse linear algebra algorithms that are proposed in the literature, in particular the block Wiedemann algorithm. We discuss the parallelization of the central matrix-vector product operation from both algorithmic and practical points of view, and illustrate how our approach has contributed to the recent record-sized DLP computation in GF(2^809).
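The central operation can be sketched as a sparse matrix-vector product with modular arithmetic (our CSR-based illustration; it relies on the GCC/Clang unsigned __int128 extension and is unrelated to the authors' GPU kernels):

    // Sparse matrix-vector product modulo a prime p, in CSR format (ours).
    #include <cstdint>
    #include <vector>

    std::vector<std::uint64_t> spmvMod(const std::vector<std::size_t>& rowPtr,
                                       const std::vector<std::size_t>& colIdx,
                                       const std::vector<std::uint64_t>& val,
                                       const std::vector<std::uint64_t>& x,
                                       std::uint64_t p) {
        std::vector<std::uint64_t> y(rowPtr.size() - 1, 0);
        for (std::size_t r = 0; r + 1 < rowPtr.size(); ++r) {
            unsigned __int128 acc = 0;     // wide accumulator before reduction
            for (std::size_t k = rowPtr[r]; k < rowPtr[r + 1]; ++k)
                acc += (unsigned __int128)val[k] * (x[colIdx[k]] % p);
            y[r] = (std::uint64_t)(acc % p);
        }
        return y;
    }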

Hamza Jeljeli
Toward OpenCL Automatic Multi-Device Support

To fully tap into the potential of today's heterogeneous machines, offloading parts of an application onto accelerators is no longer sufficient. The real challenge is to build systems where the application is permanently spread across the entire machine, that is, where parallel tasks are dynamically scheduled over the full set of available processing units. In this paper we present SOCL, an OpenCL implementation that improves and simplifies the programming experience on heterogeneous architectures. SOCL enables applications to dynamically dispatch computation kernels over processing devices so as to maximize their utilization. OpenCL applications can incrementally adopt light extensions to automatically schedule kernels in a controlled manner on multi-device architectures. We demonstrate the relevance of our approach by experimenting with several OpenCL applications on a range of heterogeneous architectures, and show that performance portability is enhanced by using the SOCL extensions.

Sylvain Henry, Alexandre Denis, Denis Barthou, Marie-Christine Counilh, Raymond Namyst
Concurrent Kernel Execution on Xeon Phi within Parallel Heterogeneous Workloads

Computations with a sufficient amount of parallelism and a sufficiently large workload may take advantage of many-core coprocessors, whereas small-scale workloads usually suffer from poor utilization of the coprocessor resources. For parallel applications with many small computational kernels, concurrent processing on a shared coprocessor may be a viable solution. We evaluate the Xeon Phi offload models Intel LEO and OpenMP 4 within multi-threaded and multi-process host applications with concurrent coprocessor offloading. Limitations of OpenMP 4 regarding data persistence across function calls, e.g. when used within libraries, can slow down the application. We propose an offload-proxy approach for OpenMP 4 to recover the performance in these cases. For concurrent kernel execution, we demonstrate the performance of the different offload models and of our offload-proxy using synthetic kernels and a parallel hybrid CPU/Xeon Phi molecular simulation application.
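For context, a minimal OpenMP 4 offload region with data persistence across two kernels looks as follows (our example; the offload-proxy itself is not shown):

    // Minimal OpenMP 4 offload example (ours): the target data region keeps
    // the array resident on the coprocessor across two kernel offloads,
    // avoiding a transfer between them.
    #include <cstdio>

    int main() {
        const int n = 1 << 20;
        static float a[1 << 20];
        #pragma omp target data map(tofrom: a[0:n])
        {
            #pragma omp target teams distribute parallel for
            for (int i = 0; i < n; ++i) a[i] = 0.5f * i;
            #pragma omp target teams distribute parallel for
            for (int i = 0; i < n; ++i) a[i] += 1.0f;   // reuses resident data
        }
        std::printf("a[10] = %f\n", a[10]);
    }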

Florian Wende, Thomas Steinke, Frank Cordes
Writing Self-adaptive Codes for Heterogeneous Systems

Heterogeneous systems are becoming increasingly common. Relatedly, the popularity of OpenCL is growing, as it provides a unified means to program a wide variety of devices, including GPUs and multicore CPUs. More recently, the Heterogeneous Programming Library (HPL) has targeted the same variety of systems as OpenCL, intending to improve their programmability. The main drawback of such unified approaches is the lack of performance portability, as codes written using OpenCL or HPL may obtain good performance on a given device but poor performance on a different one. HPL allows generating different versions of kernels at run time by combining C++ and the HPL embedded language. This paper explores the development of self-adaptive kernels that exploit this characteristic, so that their code depends on configuration parameters that are tuned by a genetic algorithm through an iterative optimization process. The results show that these self-adaptive kernels are faster than those generated by hand following heuristics.

Jorge F. Fabeiro, Diego Andrade, Basilio B. Fraguela, Ramón Doallo
A Pattern-Based Comparison of OpenACC and OpenMP for Accelerator Computing

Nowadays, HPC systems frequently emerge as clusters of commodity processors with attached accelerators. Moving from tedious low-level accelerator programming to increased development productivity, the directive-based programming models OpenACC and OpenMP are promising candidates. While OpenACC was completed about two years ago, OpenMP only recently added support for accelerator programming. To assist developers in deciding which approach to take, we compare both models with respect to their programmability. Besides investigating their expressiveness by putting their constructs side by side, we focus on evaluating their power based on structured parallel programming patterns (aka algorithmic skeletons). These patterns describe the basic entities of parallel algorithms, of which we cover map, stencil, reduction, fork-join, superscalar sequence, nesting, and geometric decomposition. The architectural targets of this work are NVIDIA-type accelerators (GPUs) and specialties of Intel-type accelerators (Xeon Phis). Additionally, we assess the prospects of OpenACC and OpenMP concerning future developments in software and hardware design.
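As an example of one of the covered patterns, a reduction expressed with OpenMP accelerator directives (our snippet, not taken from the paper's evaluation suite):

    // The reduction pattern with OpenMP accelerator directives (ours).
    #include <cstdio>

    int main() {
        const int n = 1 << 20;
        static float x[1 << 20];
        for (int i = 0; i < n; ++i) x[i] = 1.0f;
        float sum = 0.0f;
        #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: sum) reduction(+: sum)
        for (int i = 0; i < n; ++i) sum += x[i];
        std::printf("sum = %f\n", sum);      // expect 1048576
    }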

Sandra Wienke, Christian Terboven, James C. Beyer, Matthias S. Müller
Backmatter
Metadata
Title
Euro-Par 2014 Parallel Processing
Edited by
Fernando Silva
Inês Dutra
Vítor Santos Costa
Copyright Year
2014
Publisher
Springer International Publishing
Electronic ISBN
978-3-319-09873-9
Print ISBN
978-3-319-09872-2
DOI
https://doi.org/10.1007/978-3-319-09873-9