Skip to main content

2017 | Buch

Euro-Par 2017: Parallel Processing

23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28 – September 1, 2017, Proceedings

insite
SUCHEN

Über dieses Buch

This book constitutes the proceedings of the 23rd International Conference on Parallel and Distributed Computing, Euro-Par 2017, held in Santiago de Compostela, Spain, in August/September 2017. The 50 revised full papers presented together with 2 abstract of invited talks and 1 invited paper were carefully reviewed and selected from 176 submissions. The papers are organized in the following topical sections: support tools and environments; performance and power modeling, prediction and evaluation; scheduling and load balancing; high performance architectures and compilers; parallel and distributed data management and analytics; cluster and cloud computing; distributed systems and algorithms; parallel and distributed programming, interfaces and languages; multicore and manycore parallelism; theory and algorithms for parallel computation and networking; prallel numerical methods and applications; and accelerator computing.

Inhaltsverzeichnis

Frontmatter

Invited Paper

Frontmatter
Computing Just What You Need: Online Data Analysis and Reduction at Extreme Scales

A growing disparity between supercomputer computation speeds and I/O rates makes it increasingly infeasible for applications to save all results for offline analysis. Instead, applications must analyze and reduce data online so as to output only those results needed to answer target scientific question(s). This change in focus complicates application and experiment design and introduces algorithmic, implementation, and programming model challenges that are unfamiliar to many scientists and that have major implications for the design of various elements of supercomputer systems. We review these challenges and describe methods and tools that we are developing to enable experimental exploration of algorithmic, software, and system design alternatives.

Ian Foster, Mark Ainsworth, Bryce Allen, Julie Bessac, Franck Cappello, Jong Youl Choi, Emil Constantinescu, Philip E. Davis, Sheng Di, Wendy Di, Hanqi Guo, Scott Klasky, Kerstin Kleese Van Dam, Tahsin Kurc, Qing Liu, Abid Malik, Kshitij Mehta, Klaus Mueller, Todd Munson, George Ostouchov, Manish Parashar, Tom Peterka, Line Pouchard, Dingwen Tao, Ozan Tugluk, Stefan Wild, Matthew Wolf, Justin M. Wozniak, Wei Xu, Shinjae Yoo

Support Tools and Environments

Frontmatter
Scaling Energy Adaptive Applications for Sustainable Profitability

Energy efficiency in data centres is addressed through workload management usually to reduce the operational costs and as a by-product, the environmental footprint. This includes to minimise total power consumption or to minimise the power issued from non-renewable energy sources. Hence, the performance requirements of the client’s applications are either totally overlooked or strictly enforced.To encourage profitable sustainability in data centres, we consider the total financial gain as a trade-off between energy efficiency and client satisfaction. We propose Carver to orchestrate energy-adaptive applications, according to performance and environmental preferences and given forecasts of the renewable energy production. We validated Carver by simulating a testbed powered by the grid and a photovoltaic array and running the Web service HP LIFE.

Fabien Hermenier, Giuliani Giovanni, Andre Milani, Sophie Demassey
Off-Road Performance Modeling – How to Deal with Segmented Data

Besides correctness, scalability is one of the top priorities of parallel programmers. With manual analytical performance modeling often being too laborious, developers increasingly resort to empirical performance modeling as a viable alternative, which learns performance models from a limited amount of performance measurements. Although powerful automatic techniques exist for this purpose, they usually struggle with the situation where performance data representing two or more different phenomena are conflated into a single performance model. This not only generates an inaccurate model for the given data, but can also either fail to point out existing scalability issues or create the appearance of such issues when none are present. In this paper, we present an algorithm to detect segmentation in a sequence of performance measurements and estimate the point where the behavior changes. Our method correctly identified segmentation in more than 80% of 5.2 million synthetic tests and confirmed expected segmentation in three application case studies.

M. Kashif Ilyas, Alexandru Calotoiu, Felix Wolf
Online Dynamic Monitoring of MPI Communications

As the complexity and diversity of computer hardware and the elaborateness of network technologies have made the implementation of portable and efficient algorithms more challenging, the need to understand application communication patterns has become increasingly relevant. This paper presents details of the design and evaluation of a communication-monitoring infrastructure developed in the Open MPI software stack that can expose a dynamically configurable level of detail concerning application communication patterns.

George Bosilca, Clément Foyer, Emmanuel Jeannot, Guillaume Mercier, Guillaume Papauré

Performance and Power Modeling, Prediction and Evaluation

Frontmatter
Micro-benchmarking MPI Neighborhood Collective Operations

In this article, performance expectations for MPI neighborhood collective operations are formulated as self-consistent performance guidelines. A microbenchmark and an experimental methodology are presented to assess these guidelines. Measurement results from a large, InfiniBand-based cluster, the Vienna Scientific Cluster (VSC), as well as from a small commodity cluster computer are shown and discussed to illustrate the methodology and to gain first insights into the performance of current MPI implementations. Results show that the examined libraries seem to be sensitive to the order in which topological neighbors are specified, and that in some cases Cartesian topologies can be outperformed by simulating them with distributed graph topologies.

Felix Donatus Lübbe
Performance Characterization of De Novo Genome Assembly on Leading Parallel Systems

De novo genome assembly is one of the most important and challenging computational problems in modern genomics; further, it shares algorithms and communication patterns important to other graph analytic and irregular applications. Unlike simulations, it has no floating point arithmetic and is dominated by small memory transactions within and between computing nodes. In this work, we focus on the highly scalable HipMer assembler and identify the dominant algorithms and communication patterns, also using microbenchmarks to capture the workload. We evaluate HipMer on a variety of platforms from the latest HPC systems to ethernet clusters. HipMer performs well on all single node systems, including the Xeon Phi manycore architecture. Given large enough problems, it also demonstrates excellent scaling across nodes in an HPC system, but requires a high speed network with low overhead and high injection rates. Our results shed light on the architectural features that are most important for achieving good parallel efficiency on this and related problems.

Marquita Ellis, Evangelos Georganas, Rob Egan, Steven Hofmeyr, Aydın Buluç, Brandon Cook, Leonid Oliker, Katherine Yelick
NVIDIA Jetson Platform Characterization

This study characterizes the NVIDIA Jetson TK1 and TX1 Platforms, both built on a NVIDIA Tegra System on Chip and combining a quad-core ARM CPU and an NVIDIA GPU. Their heterogeneous nature, as well as their wide operating frequency range, make it hard for application developers to reason about performance and determine which optimizations are worth pursuing. This paper attempts to inform developers’ choices by characterizing the platforms’ performance using Roofline models obtained through an empirical measurement-based approach as well as through a case study of a heterogeneous application (matrix multiplication). Our results highlight a difference of more than an order of magnitude in compute performance between the CPU and GPU on both platforms. Given that the CPU and GPU share the same memory bus, their Roofline models’ balance points are also more than an order of magnitude apart. We also explore the impact of frequency scaling: build CPU and GPU Roofline profiles and characterize both platforms’ balance point variation, power consumption, and performance per watt as frequency is scaled.The characterization we provide can be used in two main ways. First, given an application, it can inform the choice and number of processing elements to use (i.e., CPU/GPU and number of cores) as well as the optimizations likely to lead to high performance gains. Secondly, this characterization indicates that developers can use frequency scaling to tune the Jetson Platform to suit the requirements of their applications. Third, given a required power/performance budget, application developers can identify the appropriate parameters to use to tune the Jetson platforms to their specific workload requirements. We expect that this optimization approach can lead to overall gains in performance and/or power efficiency without requiring application changes.

Hassan Halawa, Hazem A. Abdelhafez, Andrew Boktor, Matei Ripeanu
Following the Blind Seer – Creating Better Performance Models Using Less Information

Offering insights into the behavior of applications at higher scale, performance models are useful for finding performance bugs and tuning the system. Extra-P, a tool for automated performance modeling, uses statistical methods to automatically generate, from a small number of performance measurements, models that can be used to predict performance where no measurements are available. However, the current version requires the manual pre-configuration of a search space, which might turn out to be unsuitable for the problem at hand. Furthermore, noise in the data often leads to models that indicate a worse behavior than there actually is. In this paper, we propose a new model-generation algorithm that solves both of the above problems: The search space is built and automatically refined on demand, and a scale-independent error metric tells both when to stop the refinement process and whether a model reflects faithfully enough the behavior the data exhibits. This makes Extra-P easier to use, while also allowing it to produce more accurate results. Using data from previous case studies, we show that the mean relative prediction error decreases from 46% to 13%.

Patrick Reisert, Alexandru Calotoiu, Sergei Shudler, Felix Wolf
An Accurate Simulator of Cache-Line Conflicts to Exploit the Underlying Cache Performance

This paper describes a cache-line conflict profiling method that advances the state of the art performance tuning workflow by accurately highlighting the sources of conflicts. The basic idea behind this is the use of cache simulators as a diagnosis tool for cache-line conflicts. We also propose a mechanism that enables to identify where line conflict misses are incurred and the reasons why the conflicts occur. We evaluate our conflict simulator using some of the benchmark codes used in the HPC field. From the results, we confirm that our simulator can accurately model the cache behaviors that cause line conflicts and reveal the sources of them during the execution. Finally, we demonstrate that optimizations assisted by our mechanism contribute to improving performance for both of serial and parallel executions.

Yukinori Sato, Toshio Endo
Shutdown Policies with Power Capping for Large Scale Computing Systems

Large scale distributed systems are expected to consume huge amounts of energy. To solve this issue, shutdown policies constitute an appealing approach able to dynamically adapt the resource set to the actual workload. However, multiple constraints have to be taken into account for such policies to be applied on real infrastructures, in particular the time and energy cost of shutting down and waking up nodes, and power capping to avoid disruption of the system. In this paper, we propose models translating these various constraints into different shutdown policies that can be combined. Our models are validated through simulations on real workload traces and power measurements on real testbeds.

Anne Benoit, Laurent Lefèvre, Anne-Cécile Orgerie, Issam Raïs

Scheduling and Load Balancing

Frontmatter
Partitioning Strategy Selection for In-Memory Graph Pattern Matching on Multiprocessor Systems

Pattern matching on large graphs is the foundation for a variety of application domains. The continuously increasing size of the underlying graphs requires highly parallel in-memory graph processing engines that need to consider non-uniform memory access (NUMA) and concurrency issues to scale up on modern multiprocessor systems. To tackle these aspects, a fine-grained graph partitioning becomes increasingly important. Hence, we present a classification of graph partitioning strategies and evaluate representative algorithms on medium and large-scale NUMA systems in this paper. As a scalable pattern matching processing infrastructure, we leverage a data-oriented architecture that preserves data locality and minimizes concurrency-related bottlenecks on NUMA systems. Our in-depth evaluation reveals that the optimal partitioning strategy depends on a variety of factors and consequently, we derive a set of indicators for selecting the optimal partitioning strategy suitable for a given graph and workload.

Alexander Krause, Thomas Kissinger, Dirk Habich, Hannes Voigt, Wolfgang Lehner
Efficient Dynamic Pinning of Parallelized Applications by Reinforcement Learning with Applications

This paper describes a dynamic framework for mapping the threads of parallel applications to the computation cores of parallel systems. We propose a feedback-based mechanism where the performance of each thread is collected and used to drive the reinforcement-learning policy of assigning affinities of threads to CPU cores. The proposed framework is flexible enough to address different optimization criteria, such as maximum processing speed and minimum speed variance among threads. We evaluate the framework on the Ant Colony optimization parallel benchmark from the heuristic optimization application domain, and demonstrate that we can achieve an improvement of 12% in the execution time compared to the default operating system scheduling/mapping of threads under varying availability of resources (e.g. when multiple applications are running on the same system).

Georgios C. Chasparis, Michael Rossbory, Vladimir Janjic
Accelerating by Idling: How Speculative Delays Improve Performance of Message-Oriented Systems

We propose a technique called speculative lagging, which improves performance by dynamically adding periods of idle execution into the message-oriented system. The speculation is guided by a statistical model, which predicts context switches that benefit from delays. We analytically derive the expected speedup, which, for a fixed confidence, allows identifying lagging opportunities in O(1) time, without a performance overhead. We describe the corresponding speculation algorithm and use it to extend an existing scheduler. Comparison with other actor frameworks on standard benchmarks shows improvements of up to 2.1$$\times $$.

Aleksandar Prokopec
Using Simulation to Evaluate and Tune the Performance of Dynamic Load Balancing of an Over-Decomposed Geophysics Application

Finite difference methods are commonplace in scientific computing. Despite their apparent regularity, they often exhibit load imbalance that damages their efficiency. We characterize the spatial and temporal load imbalance of Ondes3D, a seismic wave propagation simulator. We reveal that this imbalance originates from the nature of the input data and from low-level CPU optimizations. Such dynamic imbalance should therefore be quite common and is intractable by any static approach or classical code reorganization. An effective solution, with few code modifications, combines domain over-decomposition and dynamic load balancing (e.g., with AMPI), migrating data and computation at the granularity of an MPI rank. It generally requires a careful tuning of the over-decomposition level, the load balancing heuristic and frequency. These choices are quite dependent on application and platform characteristics. In this paper, we propose a methodology that leverages the capabilities of the SimGrid framework to conduct such study at low experimental cost. It combines emulation, simulation, and application modeling that requires minimal code modification and yet manages to capture both spatial and temporal load imbalance, faithfully predicting its overall performance. We compare simulation and real executions results and show how our strategy can be used to determine the best load balancing configuration for a given application/hardware configuration.

Rafael Keller Tesser, Lucas Mello Schnorr, Arnaud Legrand, Fabrice Dupros, Philippe Olivier Alexandre Navaux
Optimizing Egalitarian Performance in the Side-Effects Model of Colocation for Data Center Resource Management

In data centers, up to dozens of tasks are colocated on a single physical machine. Machines are used more efficiently, but tasks’ performance deteriorates, as colocated tasks compete for shared resources. As tasks are heterogeneous, the resulting performance dependencies are complex. In our previous work [18] we proposed a new combinatorial optimization model that uses two parameters of a task—its size and its type—to characterize how a task influences the performance of other tasks allocated to the same machine.In this paper, we study the egalitarian optimization goal: maximizing the worst-off performance. This problem generalizes the classic makespan minimization on multiple processors ($$P||C_{\max }$$). We prove that polynomially-solvable variants of $$P||C_{\max }$$ are NP-hard and hard to approximate when the number of types is not constant. For a constant number of types, we propose a PTAS, a fast approximation algorithm, and a series of heuristics. We simulate the algorithms on instances derived from a trace of one of Google clusters. Algorithms aware of jobs’ types lead to better performance compared to algorithms solving $$P||C_{\max }$$.The notion of type enables us to model degeneration of performance caused by colocation using standard combinatorial optimization methods. Types add a layer of additional complexity. However, our results—approximation algorithms and good average-case performance—show that types can be handled efficiently.

Fanny Pascual, Krzysztof Rzadca
Generic Algorithms for Scheduling Applications on Hybrid Multi-core Machines

We study the problem of executing an application represented by a precedence task graph on a multi-core machine composed of standard computing cores and accelerators. Contrary to most existing approaches, we distinguish the allocation and the scheduling phases and we mainly focus on the allocation part of the problem: choose the most appropriate type of computing unit for each task. We address both off-line and on-line settings. In the first case, we establish strong lower bounds on the worst-case performance of a known approach based on Linear Programming for solving the allocation problem. Then, we refine the scheduling phase and we replace the greedy list scheduling policy used in this approach by a better ordering of the tasks. Although this modification leads to the same approximability guarantees, it performs much better in practice. In the on-line case, we assume that the tasks arrive in any, not known in advance, order which respects the precedence relations and the scheduler has to take irrevocable decisions about their allocation and execution. In this setting, we propose the first online scheduling algorithm which takes into account precedences. Our algorithm is based on adequate rules for selecting the type of processor where to allocate the tasks and it achieves a constant factor approximation guarantee if the ratio of the number of CPUs over the number of GPUs is bounded. Finally, all the previous algorithms have been experimented on a large number of simulations built on actual libraries. These simulations assess the good practical behavior of the algorithms with respect to the state-of-the-art solutions whenever these exist or baseline algorithms.

Marcos Amaris, Giorgio Lucarelli, Clément Mommessin, Denis Trystram
Low-Cost Approximation Algorithms for Scheduling Independent Tasks on Hybrid Platforms

Hybrid platforms embedding accelerators such as GPUs or Xeon Phis are increasingly used in computing. When scheduling tasks on such platforms, one has to take into account that a task execution time depends on the type of core used to execute it. We focus on the problem of minimizing the total completion time (or makespan) when scheduling independent tasks on two processor types, also known as the $$(Pm,Pk)||C_{\max }$$ problem. We propose BalancedEstimate and BalancedMakespan, two novel 2-approximation algorithms with low complexity. Their approximation ratio is both on par with the best approximation algorithms using dual approximation techniques (which are, thus, of high complexity) and significantly smaller than the approximation ratio of existing low-cost approximation algorithms. We compared both algorithms by simulations to existing strategies in different scenarios. These simulations showed that their performance is among the best ones in all cases.

Louis-Claude Canon, Loris Marchal, Frédéric Vivien

High Performance Architectures and Compilers

Frontmatter
Runtime-Assisted Shared Cache Insertion Policies Based on Re-reference Intervals

Processor speed is improving at a faster rate than the speed of main memory, which makes memory accesses increasingly expensive. One way to solve this problem is to reduce miss ratio of the processor’s last level cache by improving its replacement policy. We approach the problem by co-designing the runtime system and hardware and exploiting the semantics of the applications written in data-flow task-based programming models to provide hardware with information about the task types and task data-dependencies. We propose the Task-Type aware Insertion Policy, TTIP, which uses the runtime system to dynamically determine the best probability per task type for bimodal insertion in the recency stack and the static Dependency-Type aware Insertion Policy, DTIP, that inserts cache lines in the optimal position taking into account the dependency types of the current task. TTIP and DTIP perform similarly or better than state-of-the-art replacement policies, while requiring less hardware.

Vladimir Dimić, Miquel Moretó, Marc Casas, Mateo Valero
Rewriting System for Profile-Guided Data Layout Transformations on Binaries

Careful data layout design is crucial for achieving high performance. However exploring data layouts is time-consuming and error-prone, and assessing the impact of a layout transformation on performance is difficult without performing it. We propose to guide application programmers through data layout restructuring by providing a comprehensive multidimensional description of the initial layout, built from trace analysis, and then by giving a performance evaluation of the transformations tested and an expression of each transformed layout. The programmer can limit the exploration to layouts matching some patterns. We apply this method to two multithreaded applications. The performance prediction of multiple transformations matches within 5% the performance of hand-transformed layout code.

Christopher Haine, Olivier Aumage, Denis Barthou
Hardware Support for Scratchpad Memory Transactions on GPU Architectures

Graphics Processing Units (GPUs) have become the accelerator of choice for data-parallel applications, enabling the execution of thousands of threads in a Single Instruction - Multiple Thread (SIMT) fashion. Using OpenCL terminology, GPUs offer a global memory space shared by all the threads in the GPU, as well as a low-latency local memory space shared by a subset of the threads. The latter is used as a scratchpad to improve the performance of the applications.We propose GPU-LocalTM, a hardware transactional memory (TM), as an alternative to data locking mechanisms in local memory. GPU-LocalTM allocates transactional metadata in the existing memory resources, minimizing the storage requirements for TM support. In addition, it ensures forward progress through an automatic serialization mechanism. In our experiments, GPU-LocalTM provides up to 100X speedup over serialized execution.

Alejandro Villegas, Rafael Asenjo, Angeles Navarro, Oscar Plata, Rafael Ubal, David Kaeli

Parallel and Distributed Data Management and Analytics

Frontmatter
Execution of Recursive Queries in Apache Spark

MapReduce environments offer great scalability by restricting the programming model to only map and reduce operators. This abstraction simplifies many difficult problems occuring in generic distributed computations like fault tolerance and synchronization, hiding them from the programmer. There are, however, algorithms that cannot be easily or efficiently expressed in MapReduce, such as recursive functions. In this paper we extend the Apache Spark runtime so that it can support recursive queries. We also introduce a new parallel and more lightweight scheduling mechanism, ideal for scheduling a very large set of tiny tasks. We implemented the aformentioned scheduler and found that it simplifies the code for recursive computation and can perform up to 2.1$$\times $$ faster than the default Spark scheduler.

Pavlos Katsogridakis, Sofia Papagiannaki, Polyvios Pratikakis
Replica-Aware Partitioning Design in Parallel Database Systems

In parallel database systems, data is partitioned and replicated across multiple independent nodes to improve system performance and increase robustness. In current practice of database partitioning design, all replicas are uniformly partitioned, however, different statements may prefer contradictory partitioning plans, so a single plan cannot achieve the overall optimal performance for the workload.In this paper, we propose a novel approach of replica-aware data partitioning design to address the contradictions. According to the access graph of SQL statements, we use the k-medoids algorithm to classify workload into statement clusters, then we use the branch-and-bound algorithm to search for the optimal partitioning plan for each cluster. Finally, we organize replicas with these plans, and route statements to their preferred replicas. We use TPC-E, TPC-H and National College and University Enrollment System (NACUES) to evaluate our approach. The evaluation results demonstrate that our approach improves system performance by up to 4x over the current practice of partitioning design.

Liming Dong, Weidong Liu, Renchuan Li, Tiejun Zhang, Weiguo Zhao

Cluster and Cloud Computing

Frontmatter
A Simplified Model for Simulating the Execution of a Workflow in Cloud

Although simulators provide approximate, faster and easier simulation of an application execution in Clouds, still many researchers argue that these results cannot be always generalized for complex application types, which consist of many dependencies among tasks and various scheduling possibilities, such as workflows. DynamicCloudSim, the extension of the well known CloudSim simulator, offers users the capability to simulate the Cloud heterogeneity by introducing noisiness in dozens parameters. Still, it is difficult, or sometimes even impossible to determine appropriate values for all these parameters because they are usually Cloud or application-dependent. In this paper, we propose a new model that simplifies the simulation setup for a workflow and reduces the bias between the behavior of simulated and real Cloud environments based on one parameter only, the Cloud noisiness. It represents the noise produced by the Cloud’s interference including the application’s (in our case a workflow) noisiness too. Another novelty in our model is that it does not use a normal distribution naively to create noised values, but shifts the mean value of the task execution time by the cloud noisiness and uses its deviation as a standard deviation. Besides our model reduces the complexity of DynamicCloudSim’s heterogeneity model, evaluation conducted in Amazon EC2 shows that it is also more accurate, with better trueness (closeness to the real mean values) of up to $$9.2 \%$$ and precision (closeness to the real deviation) of up to 8.39 times.

Roland Mathá, Sasko Ristov, Radu Prodan
Dealing with Performance Unpredictability in an Asymmetric Multicore Processor Cloud

In a Cloud computing data center and especially in a IaaS (Infrastructure as a Service), performance predictability is one of the most important challenges. For a given allocated virtual machine (VM) in one IaaS, a client expects his application to perform identically whatever is the hosting physical server or its resource management strategy. However, performance predictability is very difficult to enforce in a heterogeneous hardware environment where machines do not have identical performance characteristics, and even more difficult when machines are internally heterogeneous as for Asymmetric Multicore Processor machines. In this paper, we introduce a VM scheduler extension which takes into account hardware performance heterogeneity of Asymmetric Multicore Processor machines in the cloud. Based on our analysis of the problem, we designed and implemented two solutions: the first weights CPU allocations according to core performance, while the second adapts CPU allocations to reach a given instruction execution rate (Ips) regardless the core types. We demonstrate that such scheduler extensions can enforce predictability with a negligible overhead on application performance.

Boris Teabe, Patrick Lavoisier Wapet, Alain Tchana, Daniel Hagimont
Deadline-Aware Deployment for Time Critical Applications in Clouds

Time critical applications are appealing to deploy in clouds due to the elasticity of cloud resources and their on-demand nature. However, support for deploying application components with strict deadlines on their deployment is lacking in current cloud providers. This is particularly important for adaptive applications that must automatically and seamlessly scale, migrate, or recover swiftly from failures. A common deployment procedure is to transmit application packages from the application provider to the cloud, and install the application there. Thus, users need to manually deploy their applications into clouds step by step with no guarantee regarding deadlines. In this work, we propose a Deadline-aware Deployment System (DDS) for time critical applications in clouds. DDS enables users to automatically deploy applications into clouds. We design bandwidth-aware EDF scheduling algorithms in DDS that minimize the number of deployments that miss their deadlines and maximize the utilization of network bandwidth. In the evaluation, we show that DDS leverages network bandwidth sufficiently, and significantly reduces the number of missed deadlines during deployment.

Yang Hu, Junchao Wang, Huan Zhou, Paul Martin, Arie Taal, Cees de Laat, Zhiming Zhao
More Sharing, More Benefits? A Study of Library Sharing in Container-Based Infrastructures

Container-based infrastructures have surged in popularity, offering advantages in agility and scaling, while also presenting new challenges in resource utilization due to unnecessary library duplication. In this paper, we consider sharing libraries across containers, and study the impact of such a strategy on overall resource requirements, scheduling, and utilization. Our analysis and simulations suggest significant benefits arising from library sharing. Furthermore, a small fraction of libraries shared between any two containers, on average, is enough to reap most of the benefits, and even naïve schedulers, such as a First Fit scheduler, succeed at doing so. We also propose a score maximization, mixed-integer linear-programming scheduler for handling bulk request arrivals (such as large jobs composed of many smaller tasks), which compares favorably against state-of-the-art schedulers in these scenarios.

José Bravo Ferreira, Marco Cello, Jesús Omana Iglesias
An Efficient Communication Aware Heuristic for Multiple Cloud Application Placement

To deploy a distributed application on the cloud, cost, resource and communication constraints have to be considered to select the most suitable Virtual Machines (VMs), from private and public cloud providers. This process becomes very complex in large scale scenarios and, as this problem is NP-Hard, its automation must take scalability into consideration. In this work, we propose a heuristic able to calculate initial placements for distributed component-based applications on possibly multiple clouds with the objective of minimizing VM renting costs while satisfying applications’ resource and communication constraints. We evaluate the heuristic performance and determine its limitations by comparing it to other placement approaches, namely exact algorithms and meta-heuristics. We show that the proposed heuristic is able to compute a good solution much faster than them.

Pedro Silva, Christian Perez
Energy-Driven Straggler Mitigation in MapReduce

Energy consumption is an important concern for large-scale data-centers, which results in huge monetary cost for data-center operators. Due to the hardware heterogeneity and contentions between concurrent workloads, straggler mitigation is important to many Big Data applications running in large-scale data-centers and the speculative execution technique is widely-used to handle stragglers. Although a large number of studies have been proposed to improve the performance of Big Data applications using speculative execution, few of them have studied the energy efficiency of their solutions. In this paper, we propose two techniques to improve the energy efficiency of speculative executions while ensuring comparable performance. Specifically, we propose a hierarchical straggler detection mechanism which can greatly reduce the number of killed speculative copies and hence save the energy consumption. We also propose an energy-aware speculative copy allocation method which considers the trade-off between performance and energy when allocating speculative copies. We implement both techniques into Hadoop and evaluate them using representative MapReduce benchmarks. Results show that our solution can reduce the energy waste on killed speculative copies by up to 100% and improve the energy efficiency by 20% compared to state-of-the-art mechanisms.

Tien-Dat Phan, Shadi Ibrahim, Amelie Chi Zhou, Guillaume Aupy, Gabriel Antoniu
Leveraging Cloud Heterogeneity for Cost-Efficient Execution of Parallel Applications

Public cloud providers offer a wide range of instance types, with different processing and interconnection speeds, as well as varying prices. Furthermore, the tasks of many parallel applications show different computational demands due to load imbalance. These differences can be exploited for improving the cost efficiency of parallel applications in many cloud environments by matching application requirements to instance types. In this paper, we introduce the concept of heterogeneous cloud systems consisting of different instance types to leverage the different computational demands of large parallel applications for improved cost efficiency. We present a mechanism that automatically suggests a suitable combination of instances based on a characterization of the application and the instance types. With such a heterogeneous cloud, we are able to improve cost efficiency significantly for a variety of MPI-based applications, while maintaining a similar performance.

Eduardo Roloff, Matthias Diener, Emmanuell Diaz Carreño, Luciano Paschoal Gaspary, Philippe O. A. Navaux

Distributed Systems and Algorithms

Frontmatter
A Consensus-Based Fault-Tolerant Event Logger for High Performance Applications

Most message logging protocols rely on a centralized event logger to store information (i.e., the determinants) to allow the recovery of an application process. This centralized approach, besides suffering from the single point of failure problem, represents a bottleneck for the efficiency of message logging protocols. In this work, we present a fault-tolerant distributed event logger based on consensus that outperforms the centralized approach. We implemented the event logger of MPI determinants using the Paxos algorithm. Our event logger inherits the Paxos properties: safety is guaranteed even if the system is asynchronous and liveness is guaranteed despite processes failures. Experimental results are reported for the performance of the distributed event logger based both on classic Paxos and parallel Paxos applied to AMG (Algebraic MultiGrid) and NAS Parallel Benchmark applications.

Edson Tavares de Camargo, Elias P. Duarte Jr., Fernando Pedone
Families of Graph Algorithms: SSSP Case Study

Single-Source Shortest Paths (SSSP) is a well-studied graph problem. Examples of SSSP algorithms include the original Dijkstra’s algorithm and the parallel $$\varDelta $$-stepping and KLA-SSSP algorithms. In this paper, we use a novel Abstract Graph Machine (AGM) model to show that all these algorithms share a common logic and differ from one another by the order in which they perform work. We use the AGM model to thoroughly analyze the family of algorithms that arises from the common logic. We start with the basic algorithm without any ordering (Chaotic), and then we derive the existing and new algorithms by methodically exploring semantic and spatial ordering of work. Our experimental results show that new derived algorithms show better performance than the existing distributed memory parallel algorithms, especially at higher scales.

Thejaka Amila Kanewala, Marcin Zalewski, Andrew Lumsdaine
SEMem: Deployment of MPI-Based In-Memory Storage for Hadoop on Supercomputers

This paper reports our experiments to compare various deployment strategies of memcached-like in-memory storage for Hadoop on supercomputers, where each node often does not have a local disk but shares a slow central disk. For the experiments, we developed our own memcached-like file system, named SEMem, for Hadoop. Since SEMem was designed for supercomputers, it uses MPI for communication. SEMem is configurable to adopt various deployment strategies and our experiments revealed that a good deployment strategy was allocating some nodes that work only for in-memory storage but do not directly perform map-reduce computation.

Thanh-Chung Dao, Shigeru Chiba

Parallel and Distributed Programming, Interfaces, and Languages

Frontmatter
Supporting the Xeon Phi Coprocessor in a Heterogeneous Programming Model

Supercomputers are becoming more heterogeneous. They are composed by several machines with different computation capabilities and different kinds and families of accelerators, such as GPUs or Intel Xeon Phi coprocessors. Programming these machines is a hard task, that requires a deep study of the architectural details, in order to exploit efficiently each computational unit.In this paper, we present an extension of a GPU-CPU heterogeneous programming model, to include support for Intel Xeon Phi coprocessors. This contribution extends the previous model and its implementation, by taking advantage of both the GPU communication model and the CPU execution model of the original approach, to derive a new approach for the Xeon Phi. Our experimental results show that using our approach, the programming effort needed for changing the kind of target devices is highly reduced for several study cases. For example, using our model to program a Mandelbrot benchmark, the 97% of the application code is reused between a GPU implementation and a Xeon Phi implementation.

Ana Moreton-Fernandez, Eduardo Rodriguez-Gutiez, Arturo Gonzalez-Escribano, Diego R. Llanos
GLT: A Unified API for Lightweight Thread Libraries

In recent years, several lightweight thread (LWT) libraries have emerged to tackle exascale challenges. These offer programming models (PMs) based on user-level threads and incorporate their own lightweight mechanisms. However, each library proposes its own PM, exposing different semantics and hindering portability.To address this drawback, we have designed Generic Lightweight Thread (GLT), an application programming interface that frames the functionality of the most popular LWT libraries for high-performance computing under a single PM. We implement GLT on top of Argobots, MassiveThreads, and Qthreads. We provide GLT as a dynamic library, as well as in the form of a static version based on macro preprocessing resolution to reduce overhead. This paper discusses the GLT PM and demonstrates its minimal performance impact.

Adrián Castelló, Sangmin Seo, Rafael Mayo, Pavan Balaji, Enrique S. Quintana-Ortí, Antonio J. Peña
PASCAL: A Parallel Algorithmic SCALable Framework for N-body Problems

We propose PASCAL, a parallel unified algorithmic framework for generalized N-body problems. PASCAL utilizes tree data structures and user-controlled pruning or approximations to reduce the asymptotic runtime complexity from being linear in the number of data points to be logarithmic. In PASCAL, the domain scientists express their N-body problem in terms of application-specific operations, and PASCAL generates the pruning and approximation conditions automatically from this high-level specification. In order to evaluate PASCAL, we generate solutions for six problems: k-nearest neighbors, range search, Euclidean minimum spanning tree, kernel density estimation, expectation maximization (EM), and Hausdorff distance chosen from various domains.We show that applying domain-specific optimizations and parallelizations to the algorithms generated by PASCAL achieves $$10\times $$ to $$230\times $$ speedup compared to state-of-the-art libraries on a dual-socket Intel Xeon processor with 16 cores on real world datasets. We also obtain a novel out-of-the-box asymptotically optimal algorithm for Hausdorff distance calculation and an improved algorithm for EM. This shows the impact and potential of PASCAL in rapidly extending to a larger class of problems that are yet to be explored.

Laleh Aghababaie Beni, Aparna Chandramowlishwaran
GASPI/GPI In-memory Checkpointing Library

Fault tolerance becomes an important feature at large computer systems where the mean time between failure decreases. Checkpointing is a method often used to provide resilience. We present an in-memory checkpointing library based on a PGAS API implemented with GASPI/GPI. It offers a substantial benefit when recovering from failure and leverages existing fault tolerance features of GASPI/GPI. The overhead of the library is negligible when testing it with a simple stencil code and a real life seismic imaging method.

Valeria Bartsch, Rui Machado, Dirk Merten, Mirko Rahn, Franz-Josef Pfreundt

Multicore and Manycore Parallelism

Frontmatter
Optimized Batched Linear Algebra for Modern Architectures

Solving large numbers of small linear algebra problems simultaneously is becoming increasingly important in many application areas. Whilst many researchers have investigated the design of efficient batch linear algebra kernels for GPU architectures, the common approach for many/multi-core CPUs is to use one core per subproblem in the batch. When solving batches of very small matrices, $$2\times 2$$ for example, this design exhibits two main issues: it fails to fully utilize the vector units and the cache of modern architectures, since the matrices are too small. Our approach to resolve this is as follows: given a batch of small matrices spread throughout the primary memory, we first reorganize the elements of the matrices into a contiguous array, using a block interleaved memory format, which allows us to process the small independent problems as a single large matrix problem and enables cross-matrix vectorization. The large problem is solved using blocking strategies that attempt to optimize the use of the cache. The solution is then converted back to the original storage format. To explain our approach we focus on two BLAS routines: general matrix-matrix multiplication (GEMM) and the triangular solve (TRSM). We extend this idea to LAPACK routines using the Cholesky factorization and solve (POSV). Our focus is primarily on very small matrices ranging in size from $$2 \times 2$$ to $$32 \times 32$$. Compared to both MKL and OpenMP implementations, our approach can be up to 4 times faster for GEMM, up to 14 times faster for TRSM, and up to 40 times faster for POSV on the new Intel Xeon Phi processor, code-named Knights Landing (KNL). Furthermore, we discuss strategies to avoid data movement between sockets when using our interleaved approach on a NUMA node.

Jack Dongarra, Sven Hammarling, Nicholas J. Higham, Samuel D. Relton, Mawussi Zounon
New Efficient General Sparse Matrix Formats for Parallel SpMV Operations

The Sparse Matrix-Vector Multiplication (SpMV) is an important building block in High Performance Computing. Performance improvements for the SpMV are often gained by the development of new optimized sparse matrix formats either by utilizing special sparsity patterns of a matrix or by taking bottlenecks of a hardware architecture into account. In this work a requirements analysis is done for sparse matrix formats with an emphasis on the parallel SpMV for large general sparse matrices. Based on these requirements, three new sparse matrix formats were developed, each combining several optimization techniques and addressing different optimization goals/hardware architectures. The CSR5 Bit Compressed (CSR5BC) format is an extension to the existing CSR5 format and optimized for GPUs. The other two formats, Hybrid Compressed Slice Storage (HCSS) and Local Group Compressed Sparse Row (LGCSR), are new formats optimized for multi-core/-processor architectures including the Xeon Phi Knights Landing. Results show that all three storage formats deliver good parallel SpMV performance on their target architectures over a large set of test matrices compared to other well performing formats in vendor and research libraries.

Jan Philipp Ecker, Rudolf Berrendorf, Florian Mannuss
Lazy Parallel Kronecker Algebra-Operations on Heterogeneous Multicores

Kronecker algebra is a matrix calculus which allows the generation of thread interleavings from the source-code of a program. Thread interleavings have been shown effective for proving the absence of deadlocks. Because the number of interleavings grows exponentially in the number of threads, deadlock analysis is still a challenging problem.To make the computation of thread interleavings tractable, we propose a lazy, parallel evaluation method for Kronecker algebra. Our method incorporates the constraints induced by synchronization constructs. To reduce problem size, only interleavings legal under the locking behavior of a program are considered. We leverage the data-parallelism of Kronecker sum- and product-operations for multicores and GPUs. Proposed algebraic transformations further improve performance. For one synthetic and two real-world benchmarks, our GPU implementation is up to 5453$$\times $$ faster than our multi-threaded version. Lazy evaluation significantly reduces memory consumption compared to both the sequential and the multicore versions of the SPIN model-checker.

Wasuwee Sodsong, Robert Mittermayr, Yoojin Park, Bernd Burgstaller, Johann Blieberger
Performance Evaluation of Computation and Communication Kernels of the Fast Multipole Method on Intel Manycore Architecture

Manycore optimizations are essential for achieving performance worthy of anticipated exascale systems. Utilization of manycore chips is inevitable to attain the desired floating point performance of these energy-austere systems. In this work, we revisit ExaFMM, the open source Fast Multiple Method (FMM) library, in light of highly tuned shared-memory parallelization and detailed performance analysis on the new highly parallel Intel manycore architecture, Knights Landing (KNL). We assess scalability and performance gain using task-based parallelism of the FMM tree traversal. We also provide an in-depth analysis of the most computationally intensive part of the traversal kernel (i.e., the particle-to-particle (P2P) kernel), by comparing its performance across KNL and Broadwell architectures. We quantify different configurations that exploit the on-chip 512-bit vector units within different task-based threading paradigms. MPI communication-reducing and NUMA-aware approaches for the FMM’s global tree data exchange are examined with different cluster modes of KNL. By applying several algorithm- and architecture-aware optimizations for FMM, we show that the N-Body kernel on 256 threads of KNL achieves on average 2.8$$\times $$ speedup compared to the non-vectorized version, whereas on 56 threads of Broadwell, it achieves on average 2.9$$\times $$ speedup. In addition, the tree traversal kernel on KNL scales monotonically up to 256 threads with task-based programming models. The MPI-based communication-reducing algorithms show expected improvements of the data locality across the KNL on-chip network.

Mustafa Abduljabbar, Mohammed Al Farhan, Rio Yokota, David Keyes
Efficient Non-blocking Radix Trees

Radix trees belong to the class of trie data structures, used for storing both sets and dictionaries in a way optimized for space and lookup. In this work, we present an efficient non-blocking implementation of radix tree data structure that can be configured for arbitrary alphabet sizes. Our algorithm implements a linearizable set with contains, insert and remove operations and uses single word compare-and-swap (CAS) instruction for synchronization. We extend the idea of marking the child edges instead of nodes to improve the parallel performance of the data structure. Experimental evaluation indicates that our implementation out-performs other known lock-free implementations of trie and binary search tree data structures using CAS by more than 100% under heavy contention.

Varun Velamuri
A Concurrency-Optimal Binary Search Tree

The paper presents the first concurrency-optimal implementation of a binary search tree (BST). The implementation, based on a standard sequential implementation of a partially-external tree, ensures that every schedule, i.e., interleaving of steps of the sequential code, is accepted unless linearizability is violated. To ensure this property, we use a novel read-write locking protocol that protects tree edges in addition to its nodes.Our implementation performs comparably to the state-of-the-art BSTs and even outperforms them on few workloads, which suggests that optimizing the set of accepted schedules of the sequential code can be an adequate design principle for efficient concurrent data structures.

Vitaly Aksenov, Vincent Gramoli, Petr Kuznetsov, Anna Malova, Srivatsan Ravi
Scalable Fine-Grained Metric-Based Remeshing Algorithm for Manycore/NUMA Architectures

In this paper, we present a fine-grained multi-stage metric-based triangular remeshing algorithm on manycore and NUMA architectures. It is motivated by the dynamically evolving data dependencies and workload of such irregular algorithms, often resulting in poor performance and data locality at high number of cores. In this context, we devise a multi-stage algorithm in which a task graph is built for each kernel. Parallelism is then extracted through fine-grained independent set, maximal cardinality matching and graph coloring heuristics. In addition to index ranges precalculation, a dual-step atomic-based synchronization scheme is used for nodal data updates. Despite its intractable latency-boundness, a good overall scalability is achieved on a NUMA dual-socket Intel Haswell and a dual-memory Intel KNL computing nodes (64 cores). The relevance of our synchronization scheme is highlighted through a comparison with the state-of-the-art.

Hoby Rakotoarivelo, Franck Ledoux, Franck Pommereau, Nicolas Le-Goff
Performance Evaluation of Thread-Level Speculation in Off-the-Shelf Hardware Transactional Memories

Thread-Level Speculation (TLS) is a hardware/software technique that enables the execution of multiple loop iterations in parallel, even in the presence of some loop-carried dependences. TLS requires hardware mechanisms to support conflict detection, speculative storage, in-order commit of transactions, and transaction roll-back. There is no off-the-shelf processor that provides direct support for TLS. Speculative execution is supported, however, in the form of Hardware Transactional Memory (HTM)—available in recent processors such as the Intel Core and the IBM POWER8. Earlier work has demonstrated that, in the absence of specific TLS support in commodity processors, HTM support can be used to implement TLS. This paper presents a careful evaluation of the implementation of TLS on the HTM extensions available in such machines. This evaluation provides evidence to support several important claims about the performance of TLS over HTM in the Intel Core and the IBM POWER8 architectures. Experimental results reveal that by implementing TLS on top of HTM, speed-ups of up to 3.8$$\times $$ can be obtained for some loops.

Juan Salamanca, José Nelson Amaral, Guido Araujo

Theory and Algorithms for Parallel Computation and Networking

Frontmatter
Addressing Volume and Latency Overheads in 1D-parallel Sparse Matrix-Vector Multiplication

The scalability of sparse matrix-vector multiplication (SpMV) on distributed memory systems depends on multiple factors that involve different communication cost metrics. The irregular sparsity pattern of the coefficient matrix manifests itself as high bandwidth (total and/or maximum volume) and/or high latency (total and/or maximum message count) overhead. In this work, we propose a hypergraph partitioning model which combines two earlier models for one-dimensional partitioning, one addressing total and maximum volume, and the other one addressing total volume and total message count. Our model relies on the recursive bipartitioning paradigm and simultaneously addresses three cost metrics in a single partitioning phase in order to reduce volume and latency overheads. We demonstrate the validity of our model on a large dataset that contains more than 300 matrices. The results indicate that compared to the earlier models, our model significantly improves the scalability of SpMV.

Seher Acer, Oguz Selvitopi, Cevdet Aykanat
Improving the Network of Search Engine Services Through Application-Driven Routing

We studied a search engine service in order to evaluate the impact of the traffic pattern on network performance. This paper focuses on how the routing algorithm can improve the query latency of a search engine. The architecture of the service includes three main components: Front Service, Cache Service and Index Service. This service receives queries from users, and after a process of seeking in a cluster, a set of results are returned to users. This workload produces unbalanced traffic throughout the network. As a result, this behavior impacts the network performance in terms of latency and throughput and increases the user timeout. This paper proposes an application-driven routing policy based on the application architecture which merges a set of criteria and prioritizes the Cache Service messages. We evaluated the performance using real traces and simulation techniques. The experiment results show a reduction of network latency and throughput when we apply the application-driven routing policy.

Joe Carrión, Daniel Franco, Veronica Gil-Costa, Mauricio Marin, Emilio Luque

Parallel Numerical Methods and Applications

Frontmatter
Accelerating the Tucker Decomposition with Compressed Sparse Tensors

The Tucker decomposition is a higher-order analogue of the singular value decomposition and is a popular method of performing analysis on multi-way data (tensors). Computing the Tucker decomposition of a sparse tensor is demanding in terms of both memory and computational resources. The primary kernel of the factorization is a chain of tensor-matrix multiplications (TTMc). State-of-the-art algorithms accelerate the underlying computations by trading off memory to memoize the intermediate results of TTMc in order to reuse them across iterations. We present an algorithm based on a compressed data structure for sparse tensors and show that many computational redundancies during TTMc can be identified and pruned without the memory overheads of memoization. In addition, our algorithm can further reduce the number of operations by exploiting an additional amount of user-specified memory. We evaluate our algorithm on a collection of real-world and synthetic datasets and demonstrate up to $$20.7{\times }$$ speedup while using $$28.5{\times }$$ less memory than the state-of-the-art parallel algorithm.

Shaden Smith, George Karypis
Shared Memory Pipelined Parareal

For the parallel-in-time integration method Parareal, pipelining can be used to hide some of the cost of the serial correction step and improve its efficiency. The paper introduces a basic OpenMP implementation of pipelined Parareal and compares it to a standard MPI-based variant. Both versions yield almost identical runtimes, but, depending on the compiler, the OpenMP variant consumes about 7% less energy and has a significantly smaller memory footprint. However, its higher implementation complexity might make it difficult to use in legacy codes and in combination with spatial parallelisation.

Daniel Ruprecht
Nonintrusive AMR Asynchrony for Communication Optimization

Adaptive Mesh Refinement (AMR) is a well known method for efficiently solving partial differential equations. A straightforward AMR algorithm typically exhibits many synchronization points even during a single time step, where costly communication often degrades the performance. This problem will be even more pronounced on future supercomputers containing billion way parallelism, which will raise the communication cost further. Re-designing AMR algorithms to avoid synchronization is not a viable solution due to the large code size and complex control structures. We present a nonintrusive asynchronous approach to hiding the effects of communication in an AMR application. Specifically, our approach reasons about data dependencies automatically using domain knowledge about AMR applications, allowing asynchrony to be discovered with only a modest amount of code modification. Using this approach, we optimize the synchronous AMR algorithm in the BoxLib software framework without severely affecting the productivity of the application programmer. We observe around 27–31% performance improvement for an advection solver on the Hazel Hen supercomputer using 12288 cores.

Muhammad Nufail Farooqi, Didem Unat, Tan Nguyen, Weiqun Zhang, Ann Almgren, John Shalf

Accelerator Computing

Frontmatter
Balanced CSR Sparse Matrix-Vector Product on Graphics Processors

We propose a novel parallel approach to compute the sparse matrix-vector product (SpMV) on graphics processing units (GPUs), optimized for matrices with an irregular row distribution of the non-zero entries. Our algorithm relies on the standard CSR format to store the sparse matrix, requires an inexpensive pre-processing step, and consumes only a minor amount of additional memory compared with significantly more expensive GPU-specific sparse matrix layouts. In addition, we propose a simple heuristic to determine whether our method or the standard CSR SpMV algorithm should be used for a specific matrix. As a result, our proposal, combined with the standard CSR SpMV, can be adopted as the default choice for the implementation of SpMV in general-purpose sparse linear algebra libraries for GPUs.

Goran Flegar, Enrique S. Quintana-Ortí
To Distribute or Not to Distribute: The Question of Load Balancing for Performance or Energy

Heterogeneous systems are nowadays a common choice in the path to Exascale. Through the use of accelerators they offer outstanding energy efficiency. The programming of these devices employs the host-device model, which is suboptimal as CPU remains idle during kernel executions, but still consumes energy. Making the CPU contribute computing effort might improve the performance and energy consumption of the system. This paper analyses the advantages of this approach and sets the limits of when its beneficial. The claims are supported by a set of models that determine how to share a single data-parallel task between the CPU and the accelerator for optimum performance, energy consumption or efficiency. Interestingly, the models show that optimising performance does not always mean optimum energy or efficiency as well. The paper experimentally validates the models, which represent an invaluable tool for programmers when faced with the dilemma of whether to distribute their workload in these systems.

Esteban Stafford, Borja Pérez, Jose Luis Bosque, Ramón Beivide, Mateo Valero
Backmatter
Metadaten
Titel
Euro-Par 2017: Parallel Processing
herausgegeben von
Francisco F. Rivera
Tomás F. Pena
José C. Cabaleiro
Copyright-Jahr
2017
Electronic ISBN
978-3-319-64203-1
Print ISBN
978-3-319-64202-4
DOI
https://doi.org/10.1007/978-3-319-64203-1