Euro-Par 2018: Parallel Processing
24th International Conference on Parallel and Distributed Computing, Turin, Italy, August 27 - 31, 2018, Proceedings
- 2018
- Book
- Editors
- Marco Aldinucci
- Luca Padovani
- Massimo Torquati
- Book Series
- Lecture Notes in Computer Science
- Publisher
- Springer International Publishing
About this book
This book constitutes the proceedings of the 24th International Conference on Parallel and Distributed Computing, Euro-Par 2018, held in Turin, Italy, in August 2018. The 57 full papers presented in this volume were carefully reviewed and selected from 194 submissions. They were organized in topical sections named: support tools and environments; performance and power modeling, prediction and evaluation; scheduling and load balancing; high performance architectures and compilers; parallel and distributed data management and analytics; cluster and cloud computing; distributed systems and algorithms; parallel and distributed programming, interfaces, and languages; multicore and manycore methods and tools; theory and algorithms for parallel computation and networking; parallel numerical methods and applications; and accelerator computing for advanced applications.
Table of Contents
Frontmatter

Support Tools and Environments

Frontmatter
Automatic Detection of Synchronization Errors in Codes that Target the Open Community Runtime
Jiri Dokulil, Jana Katreniakova

Abstract: The complexity of writing and debugging parallel programs makes tools that can support this effort very important. In the case of the Open Community Runtime, one major problem is ensuring that the program manages runtime objects correctly. For example, when one task uses an object and another task is responsible for deleting it, the two tasks need to be synchronized so that the object is destroyed only once it is no longer being used. In this paper, we present a tool that observes program execution and analyzes it in order to find cases where the required synchronization is missing.
A Methodology for Performance Analysis of Applications Using Multi-layer I/O
Ronny Tschüter, Christian Herold, Bert Wesarg, Matthias Weber

Abstract: Efficient usage of file systems poses a major challenge for highly scalable parallel applications. The performance of even the most sophisticated I/O subsystems lags behind the compute capabilities of current processors. To improve the utilization of I/O subsystems, several libraries, such as HDF5, facilitate the implementation of parallel I/O operations. These libraries abstract from low-level I/O interfaces (for instance, POSIX I/O) and may internally interact with additional I/O libraries. While improving usability, I/O libraries also add complexity and impede the analysis and optimization of application I/O performance. In this work, we present a methodology to investigate application I/O behavior in detail. In contrast to current methods, our approach explicitly captures interactions between multiple I/O libraries. This allows us to identify inefficiencies at individual layers of the I/O stack as well as to detect possible conflicts in the interplay between layers. We implement our methodology in an established performance monitoring infrastructure and demonstrate its effectiveness with an I/O analysis study of a cloud model simulation code. In summary, this work provides the foundation for application I/O tuning by exposing inefficiency patterns in the usage of I/O routines.
Runtime Determinacy Race Detection for OpenMP Tasks
Hassan Salehe Matar, Didem Unat

Abstract: One potential problem when writing parallel programs with OpenMP is introducing determinacy races, where for a given input the program may unexpectedly produce different final outputs at different runs. Such startling behavior can result from incorrect ordering of OpenMP tasks. We present a method to detect determinacy races in OpenMP tasks at runtime. Based on OpenMP program semantics, our proposed solution models an OpenMP program as a collection of tasks with inferred dependencies among them, where a task is implicitly created with a parallel region construct or explicitly created with a task construct. We define a happens-before relation among tasks based on such dependencies to determine an execution order when detecting determinacy races. Based on this formalization, we developed a tool, TaskSanitizer, which detects and reports concurrent memory accesses whose tasks do not have common dependencies. TaskSanitizer works at runtime, has found bugs in micro-benchmarks, and is efficient enough to be utilized in a working environment.
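The happens-before idea described in this abstract can be illustrated with a minimal, self-contained sketch (all names and data here are hypothetical; this is not TaskSanitizer's actual implementation): model tasks and their inferred dependencies as a graph, and report two accesses to the same location as a determinacy race when at least one access is a write and neither task is reachable from the other.

```python
from collections import defaultdict

class RaceChecker:
    """Toy determinacy-race checker: tasks, dependency edges, memory accesses."""

    def __init__(self):
        self.succs = defaultdict(set)   # task -> tasks that must run after it
        self.accesses = []              # (task, addr, is_write)

    def add_dependency(self, before, after):
        self.succs[before].add(after)

    def record_access(self, task, addr, is_write):
        self.accesses.append((task, addr, is_write))

    def happens_before(self, a, b):
        # Depth-first reachability over the dependency graph.
        stack, seen = [a], set()
        while stack:
            t = stack.pop()
            if t == b:
                return True
            if t in seen:
                continue
            seen.add(t)
            stack.extend(self.succs[t])
        return False

    def races(self):
        found = []
        for i, (t1, a1, w1) in enumerate(self.accesses):
            for t2, a2, w2 in self.accesses[i + 1:]:
                if t1 != t2 and a1 == a2 and (w1 or w2):
                    # Race: conflicting accesses with no ordering either way.
                    if not (self.happens_before(t1, t2)
                            or self.happens_before(t2, t1)):
                        found.append((t1, t2, a1))
        return found

rc = RaceChecker()
rc.add_dependency("t1", "t2")        # t2 depends on t1
rc.record_access("t1", 0x10, True)
rc.record_access("t2", 0x10, True)   # ordered after t1's write: no race
rc.record_access("t3", 0x10, False)  # unordered read vs. both writes: races
print(rc.races())
```

A production tool would infer the dependency edges from OpenMP constructs and use compact per-task metadata instead of pairwise comparisons, but the reachability test above is the core of the happens-before check.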
Estimating the Impact of External Interference on Application Performance
Aamer Shah, Matthias Müller, Felix Wolf

Abstract: The wall-clock execution time of applications on HPC clusters is commonly subject to run-to-run variation, often caused by external interference from concurrently running jobs. Because of the irregularity of this interference from the perspective of the affected job, performance analysts do not consider it an intrinsic part of application execution, which is why they wish to factor it out when measuring execution time. However, if chances are high enough that at least one interference event strikes while the job is running, merely repeating runs several times and picking the fastest run does not guarantee a measurement free of external influence. In this paper, we present a novel approach to estimate the impact of sporadic and high-impact interference on bulk-synchronous MPI applications. An evaluation with several realistic benchmarks shows that the impact of interference can be estimated from even a single run.
GT-Race: Graph Traversal Based Data Race Detection for Asynchronous Many-Task Parallelism
Lechen Yu, Vivek Sarkar

Abstract: Asynchronous Many-Task (AMT) parallelism is growing in popularity because of its promise to support future platforms with new heterogeneity and resiliency requirements. It supports the construction of parallel programs with fine-grained tasks, thereby enabling portability across a wide range of platforms. However, applications written for AMT parallelism remain vulnerable to data races, and existing data race detection tools are unsuitable for AMT programs because they either incur intractably large overheads or are limited to restricted task structures such as fork-join parallelism. In this paper, we propose GT-Race, a new graph-traversal based data race detector for AMT parallelism. It leverages the computation graph data structure, which encodes the general happens-before structures in AMT programs. After introducing a baseline algorithm for data race detection, we propose key optimizations to reduce its time and space complexity, including an epoch adjacency list to compress the computation graph representation, a reachability cache combined with depth filtering to reduce the number of unnecessary traversals, and bounded race detection to limit the range of data that is monitored. The impact of these optimizations is demonstrated on nine benchmark programs written for the Open Community Runtime (OCR), an open source AMT runtime that supports point-to-point synchronization and disjoint data blocks.
Performance and Power Modeling, Prediction and Evaluation

Frontmatter
Reducing GPU Register File Energy
Vishwesh Jatala, Jayvant Anantpur, Amey Karkare

Abstract: Graphics Processing Units (GPUs) maintain a large register file to increase thread-level parallelism (TLP). To increase TLP further, recent GPUs have increased the number of on-chip registers in every generation. However, as the register file grows, so does its leakage power. Moreover, with technology advances, the leakage power component has grown and become an important consideration for the manufacturing process. The leakage power of a register file can be reduced by turning infrequently used registers into a low-power (drowsy or off) state after accessing them. A major challenge in doing so is the lack of runtime register access information. To address this, we propose a system that employs a compiler analysis to determine the power state of registers, i.e., which registers can be switched off or placed in a drowsy state at each program point, and encodes this information in program instructions. Further, it uses a runtime optimization that increases the accuracy of the registers' power states. We implemented the proposed ideas in the GPGPU-Sim simulator and evaluated them on 21 kernels from several benchmark suites. Compared to a baseline without any power optimizations, our system reduces register leakage energy by 69.04% on average, with a negligible overhead in simulation cycles (0.53%).
-
Taxonomist: Application Detection Through Rich Monitoring Data
Emre Ates, Ozan Tuncer, Ata Turk, Vitus J. Leung, Jim Brandt, Manuel Egele, Ayse K. Coskun

Abstract: Modern supercomputers are shared among thousands of users running a variety of applications. Knowing which applications are running in the system can bring substantial benefits: knowledge of applications that intensively use shared resources can aid scheduling; unwanted applications such as cryptocurrency mining or password cracking can be blocked; and system architects can make design decisions based on system usage. However, identifying applications on supercomputers is challenging because applications are executed using esoteric scripts along with binaries that are compiled and named by users. This paper introduces a novel technique to identify applications running on supercomputers. Our technique, Taxonomist, is based on the empirical evidence that applications have distinct, characteristic resource utilization patterns. Taxonomist uses machine learning to classify known applications and also detect unknown applications. We test our technique with a variety of benchmarks and cryptocurrency miners, and also with applications that users of a production supercomputer ran during a 6-month period. We show that our technique achieves nearly perfect classification for this challenging data set.
Diagnosing Highly-Parallel OpenMP Programs with Aggregated Grain Graphs
Nico Reissmann, Ananya Muddukrishna

Abstract: Grain graphs simplify OpenMP performance analysis by visualizing performance problems from a fork-join perspective that is familiar to programmers. However, when programmers decide to expose a high amount of parallelism by creating thousands of task and parallel for-loop chunk instances, the resulting grain graph becomes large and tedious to understand. We present an aggregation method that hierarchically groups related nodes together to reduce grain graphs of any size to a single node. This aggregated graph is then navigated by progressively uncovering groups and following visual clues that guide programmers towards problems while hiding non-problematic regions. Our approach enhances productivity by enabling programmers to understand problems in highly parallel OpenMP programs with less effort than before.
Characterization of Smartphone Governor Strategies
Sarbartha Banerjee, Lizy Kurian John

Abstract: The voltage and frequency of the various components of a smartphone processor, such as the CPU cores, graphics, multimedia, and display units, can be independently controlled by their own dynamic voltage and frequency scaling (DVFS) governors to fit the requirements of the workload. The dynamic changes of voltage and frequency performed by governors are targeted either towards achieving optimal performance with minimum energy consumption or towards choosing a mode that requires minimal supervision of the workload and minimal changes of DVFS modes (since mode changes are accompanied by switching overheads). This paper explores the behaviour of different governors on a wide variety of workloads and identifies the best strategy for different scenarios, exemplifying the need for workload characterization. We also analyze the performance and power efficiency of workloads in a system having a common power source and study their behavior when multiple such blocks operate together, pushing the power source to its limit. Our results show that choosing the correct CPU governor alone is not sufficient; tuning the DVFS of the different resources is necessary to achieve the best performance with minimum energy expenditure. We observe that the powersave governor does not always give the best energy efficiency: it was found to be sub-optimal for CPU-intensive workloads due to increased execution time. Moreover, the race-to-idle strategy was found to be optimal for workloads in which one component is utilized for the majority of the time. These results demonstrate the necessity of characterizing workloads and tuning the DVFS while distributing power between the various components based on the workload's characteristics.
HPC Benchmarking: Scaling Right and Looking Beyond the Average
Milan Radulovic, Kazi Asifuzzaman, Paul Carpenter, Petar Radojković, Eduard Ayguadé

Abstract: Designing a balanced HPC system requires an understanding of the dominant performance bottlenecks. There is as yet no well-established methodology for a unified evaluation of HPC systems and workloads that quantifies the main performance bottlenecks. In this paper, we execute seven production HPC applications on a production HPC platform and analyse the key performance bottlenecks: FLOPS performance and memory bandwidth congestion, and the implications for scaling out. We show that the results depend significantly on the number of execution processes and the granularity of measurements. We therefore advocate for guidance in the application suites on selecting a representative scale for the experiments. We also propose that FLOPS performance and memory bandwidth be represented in terms of the proportions of time with low, moderate, and severe utilization. We show that this gives much more precise and actionable evidence than the average.
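The proposed representation can be illustrated with a small sketch (the band thresholds and sample data below are hypothetical, not taken from the paper): instead of reporting a single mean, summarize a utilization time series as the fraction of samples falling into low, moderate, and severe bands relative to peak capacity.

```python
def utilization_profile(samples, capacity, low=0.4, severe=0.8):
    """Fractions of time spent in low / moderate / severe utilization bands,
    relative to peak capacity. Thresholds are illustrative choices."""
    n = len(samples)
    util = [s / capacity for s in samples]
    return {
        "low":      sum(u < low for u in util) / n,
        "moderate": sum(low <= u < severe for u in util) / n,
        "severe":   sum(u >= severe for u in util) / n,
    }

# Hypothetical memory-bandwidth samples (GB/s) on a node with 100 GB/s peak:
bw = [10, 15, 95, 92, 50, 12, 90, 94]
print(utilization_profile(bw, capacity=100))
# The mean of these samples (57.25 GB/s) would hide that the memory
# system is saturated for half of the measured intervals.
```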
Combined Vertical and Horizontal Autoscaling Through Model Predictive Control
Emilio Incerto, Mirco Tribastone, Catia Trubiani

Abstract: Meeting performance targets of co-located distributed applications in virtualized environments is a challenging goal. In this context, vertical and horizontal scaling are promising techniques; the former varies the resource sharing on each individual machine, whereas the latter deals with choosing the number of virtual machines employed. State-of-the-art approaches mainly apply vertical and horizontal scaling in an isolated fashion, in particular assuming a fixed and symmetric load balancing across replicas. Unfortunately, this may prove unsatisfactory when replicas execute in environments with different computational resources. To overcome this limitation, we propose a novel combined runtime technique to determine the resource-sharing quota and the horizontal load-balancing policy in order to fulfill performance goals, such as response time and throughput, of co-located applications. Starting from a performance model as a multi-class queuing network, we formulate a model-predictive control problem which can be efficiently solved by linear programming. A validation performed on a shared virtualized environment hosting two real applications shows that only a combined vertical and horizontal load-balancing adaptation can efficiently achieve the desired performance targets in the presence of heterogeneous computational resources.
Scheduling and Load Balancing

Frontmatter
Early Termination of Failed HPC Jobs Through Machine and Deep Learning
Michał Zasadziński, Victor Muntés-Mulero, Marc Solé, David Carrera, Thomas Ludwig

Abstract: Failed jobs in a supercomputer not only waste CPU time and energy but also decrease the work efficiency of users. Mining data collected during the operation of data centers helps to find patterns explaining failures and can be used to predict them. Automating system reactions when software failures are predicted, e.g., early termination of jobs, not only increases availability and reduces operating cost but also frees administrators' and users' time. In this paper, we explore a unique dataset containing the topology, operation metrics, and job scheduler history from the petascale Mistral supercomputer. We extract the most relevant system features deciding the final state of a job through decision trees. Then, we successfully train a neural network to predict job evolution based on power time series of nodes. Finally, we evaluate the effect on CPU-time saving for static and dynamic job termination policies.
Peacock: Probe-Based Scheduling of Jobs by Rotating Between Elastic Queues
Mansour Khelghatdoust, Vincent Gramoli

Abstract: In this paper, we propose Peacock, a new distributed probe-based scheduler that handles heterogeneous workloads in data analytics frameworks with low latency. Peacock mitigates the Head-of-Line blocking problem, i.e., shorter tasks being enqueued behind longer tasks, better than the state-of-the-art. To this end, we introduce a novel probe rotation technique: workers form a ring overlay network and rotate probes using elastic queues, augmented by a novel probe reordering algorithm executed by the workers. We evaluate the performance of Peacock against two state-of-the-art probe-based solutions through both trace-driven simulation and distributed experiments in Spark under various loads and cluster sizes. Our large-scale performance results indicate that Peacock outperforms the state-of-the-art across all cluster sizes and loads. Our distributed experiments confirm our simulation results.
Online Scheduling of Task Graphs on Hybrid Platforms
Louis-Claude Canon, Loris Marchal, Bertrand Simon, Frédéric Vivien

Abstract: Modern computing platforms commonly include accelerators. We target the problem of scheduling applications modeled as task graphs on hybrid platforms made of two types of resources, such as CPUs and GPUs. We consider that task graphs are uncovered dynamically and that the scheduler has information only on the available tasks, i.e., tasks whose predecessors have all been completed. Each task can be processed by either a CPU or a GPU, and the corresponding processing times are known. Our study extends a previous \(4\sqrt{m/k}\)-competitive online algorithm [2], where m is the number of CPUs and k the number of GPUs (\(m\ge k\)). We prove that no online algorithm can have a competitive ratio smaller than \(\sqrt{m/k}\). We also study how adding flexibility on task processing, such as task migration or spoliation, or increasing the knowledge of the scheduler by providing it with information on the task graph, influences this lower bound. We provide a \((2\sqrt{m/k}+1)\)-competitive algorithm as well as a tunable combination of a system-oriented heuristic and a competitive algorithm; this combination performs well in practice and has a competitive ratio in \(\varTheta (\sqrt{m/k})\). Finally, simulations on different sets of task graphs illustrate how the instance properties impact the performance of the studied algorithms, and show that our proposed tunable algorithm performs best among the online algorithms in almost all cases, with performance even close to that of an offline algorithm.
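One intuition behind the \(\sqrt{m/k}\) threshold in this line of work can be sketched as a toy allocation rule (illustrative only; this is not the paper's algorithm, and the function name and parameters are ours): send a task to the GPU side only when its CPU-to-GPU processing-time ratio exceeds the threshold, so that scarce GPUs are reserved for the tasks that benefit most.

```python
from math import sqrt

def assign(p_cpu, p_gpu, m, k):
    """Toy resource-selection rule in the spirit of sqrt(m/k)-competitive
    online algorithms for hybrid platforms: m CPUs, k GPUs (m >= k).
    p_cpu and p_gpu are the task's known processing times on each side."""
    return "GPU" if p_cpu / p_gpu >= sqrt(m / k) else "CPU"

# Platform with m = 16 CPUs and k = 4 GPUs: the threshold is sqrt(16/4) = 2.
print(assign(p_cpu=30.0, p_gpu=10.0, m=16, k=4))  # 3x GPU speedup
print(assign(p_cpu=12.0, p_gpu=10.0, m=16, k=4))  # modest GPU speedup
```

The rule trades off the relative speedup of a task against the relative scarcity of GPUs; the competitive-ratio analysis in the paper quantifies how far such online decisions can be from an offline optimum.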
Interference-Aware Scheduling Using Geometric Constraints
Raphaël Bleuse, Konstantinos Dogeas, Giorgio Lucarelli, Grégory Mounié, Denis Trystram

Abstract: Large-scale parallel and distributed platforms produce a continuously increasing amount of data that has to be stored, exchanged, and used by various jobs allocated on different nodes of the platform. The management of this huge communication demand is crucial for the performance of the system. Meanwhile, we have to deal with more interference as the trend is towards a single all-purpose interconnection network. In this paper, we consider two different types of communications: the flows induced by data exchanges during computations and the flows related to Input/Output operations. We propose a general model for interference-aware scheduling, where explicit communications are replaced by external topological constraints. Specifically, we limit the interference of both communication types by adding geometric constraints on the allocation of jobs onto machines. The proposed constraints implicitly reduce data movements by restricting the set of possible allocations for each job. We present this methodology through the case study of simple network topologies, namely the line and the ring. We propose theoretical lower and upper bounds under different assumptions with respect to the platform and job characteristics. The obtained results illustrate well the difficulty of the problem, even on simple topologies.
Resource-Efficient Execution of Conditional Parallel Real-Time Tasks
Sanjoy Baruah

Abstract: Under the federated paradigm of multiprocessor scheduling, a set of processors is reserved for the exclusive use of each task. We consider the federated scheduling of parallel real-time tasks containing conditional (if-then-else) constructs, in which different executions of the task may result in workloads of substantially different magnitude and character (e.g., degree of parallelism and critical path length). If the task is hard real-time, then processors must be reserved for it under worst-case assumptions. However, it may be the case that most invocations of the task have computational demand far below the worst-case characterization and could be scheduled correctly upon far fewer processors than were assigned to it based on that characterization. Provided we can safely determine during run-time whether the worst-case characterization is likely to be realized during some execution, and all the processors are therefore going to be needed, the unneeded processors could the rest of the time be idled in a low-energy "sleep" mode or used for executing non-real-time work in the background. In this paper, we propose an algorithm for scheduling parallel conditional tasks that permits us to do so.
- Copyright Year
- 2018
- Electronic ISBN
- 978-3-319-96983-1
- Print ISBN
- 978-3-319-96982-4
- DOI
- https://doi.org/10.1007/978-3-319-96983-1