2018 | Book

Euro-Par 2017: Parallel Processing Workshops

Euro-Par 2017 International Workshops, Santiago de Compostela, Spain, August 28-29, 2017, Revised Selected Papers

Edited by: Dora B. Heras, Luc Bougé, Gabriele Mencagli, Emmanuel Jeannot, Dr. Rizos Sakellariou, Rosa M. Badia, Jorge G. Barbosa, Laura Ricci, Stephen L. Scott, Stefan Lankes, Josef Weidendorfer

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the proceedings of the workshops of the 23rd International Conference on Parallel and Distributed Computing, Euro-Par 2017, held in Santiago de Compostela, Spain, in August 2017.

The 59 full papers presented were carefully reviewed and selected from 119 submissions.

Euro-Par is an annual, international conference in Europe, covering all aspects of parallel and distributed processing. These range from theory to practice, from the smallest to the largest parallel and distributed systems and infrastructures, from fundamental computational problems to full-fledged applications, from architecture, compiler, language and interface design and implementation to tools, support infrastructures, and application performance aspects.


Erratum to: Euro-Par 2017: Parallel Processing Workshops
Dora B. Heras, Luc Bougé, Gabriele Mencagli, Emmanuel Jeannot, Rizos Sakellariou, Rosa M. Badia, Jorge G. Barbosa, Laura Ricci, Stephen L. Scott, Stefan Lankes, Josef Weidendorfer

Auto-DASP – Workshop on Autonomic Solutions for Parallel and Distributed Data Stream Processing

Moderated Resource Elasticity for Stream Processing Applications

In stream processing, elasticity is often realized by adapting the system scale and topology according to the volume of input data. However, this volume often fluctuates with a high degree of noise, which can trigger a large number of scaling operations. Since these scaling operations introduce additional overhead and cost, systems employing such approaches are at risk of spending a significant amount of time scaling up and down, nullifying the positive effects of scalability. To overcome this, we propose an approach for moderating the scaling behavior of stream processing applications by reducing the number of scaling operations, while still providing quick responses to changes in input data volume. Contrary to existing approaches, instead of using linear smoothing techniques, we show how to employ non-linear filtering techniques from the field of signal processing to pre-process the raw volume measurements, mitigating superfluous scaling operations, and effectively reducing the number of such operations by up to 94%.

Michael Borkowski, Christoph Hochreiner, Stefan Schulte
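
The idea in the abstract above, filtering noisy volume measurements before deriving scaling decisions, can be illustrated with a minimal sketch. It assumes a sliding-window median as the non-linear filter and a fixed per-instance capacity; the abstract names neither, so the filter choice, the function names, the capacity value and the sample data are all hypothetical and not the authors' method.

```python
import math
from statistics import median

def required_instances(volume, capacity_per_instance=100):
    """Instances needed to handle a given input volume (assumed capacity model)."""
    return max(1, math.ceil(volume / capacity_per_instance))

def count_scaling_operations(volumes, capacity_per_instance=100):
    """Count how often the required instance count changes over the series."""
    operations = 0
    current = required_instances(volumes[0], capacity_per_instance)
    for volume in volumes[1:]:
        needed = required_instances(volume, capacity_per_instance)
        if needed != current:
            operations += 1
            current = needed
    return operations

def median_filter(volumes, window=3):
    """Causal sliding-window median, one simple example of a non-linear filter."""
    return [median(volumes[max(0, i - window + 1): i + 1])
            for i in range(len(volumes))]

# Hypothetical measurements: short spikes followed by a sustained rise.
raw = [90, 95, 210, 92, 96, 94, 230, 91, 300, 310, 305, 95, 320, 315]
smoothed = median_filter(raw, window=3)

raw_ops = count_scaling_operations(raw)
smoothed_ops = count_scaling_operations(smoothed)
print(raw_ops, smoothed_ops)
```

On this made-up series the isolated spikes no longer trigger scaling, while the sustained rise still does, at the cost of a delay of about one window.
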
Container-Based Support for Autonomic Data Stream Processing Through the Fog

We present a container-based architecture for supporting autonomic data stream processing applications on fog computing infrastructures. Our architecture runs applications as Docker containers, and it exploits the native features of Docker to dynamically scale up/down the resources of a fog node assigned to the applications running on it. Preliminary results demonstrate that Docker containers are appropriate for building migratable autonomic solutions on fog infrastructures.

Antonio Brogi, Gabriele Mencagli, Davide Neri, Jacopo Soldani, Massimo Torquati
NOA-AID: Network Overlays for Adaptive Information Aggregation, Indexing and Discovery at the Edge

This paper presents NOA-AID, a network architecture targeting highly distributed systems composed of a large set of distributed stream processing devices, aimed at adaptive information indexing, aggregation and discovery in streams of data. The architecture is organized in two layers. The upper layer is aimed at supporting the information discovery process by providing a distributed index structure. The lower layer is mainly devoted to resource aggregation based on epidemic protocols targeting highly distributed and dynamic scenarios, well suited to stream-oriented scenarios. We present a theoretical study of the costs of information management operations, together with an empirical validation of these findings. Finally, we present an experimental evaluation of the ability of our solution to be effective and efficient in retrieving meaningful information in streams in a highly dynamic and distributed scenario.

Patrizio Dazzi, Matteo Mordacchini
Nornir: A Customizable Framework for Autonomic and Power-Aware Applications

A desirable characteristic of modern parallel applications is the ability to dynamically select the amount of resources to be used to meet requirements on performance or power consumption. In many cases, providing explicit guarantees on performance is of paramount importance. In streaming applications, this is related to the concept of elasticity, i.e. being able to allocate the proper amount of resources to match the current demand as closely as possible. Similarly, in other scenarios, it may be useful to limit the maximum power consumption of an application so as not to exceed the power budget. In this paper we propose Nornir, a customizable C++ framework for autonomic and power-aware parallel applications on shared memory multicore machines. Nornir can be used by autonomic strategy designers to implement new algorithms and by application users to enforce requirements on applications.

Daniele De Sensi, Tiziano De Matteis, Marco Danelutto
Supporting Advanced Patterns in GrPPI, a Generic Parallel Pattern Interface

The emergence of generic interfaces, encapsulating algorithmic aspects in pattern-based constructions, has greatly eased the development of data-intensive and stream-processing applications. In this paper, we complement the basic patterns supported by GrPPI, a state-of-the-art C++ General and Reusable Parallel Pattern Interface, with the advanced parallel patterns Pool, Windowed-Farm, and Stream-Iterator. This collection of advanced patterns is oriented to domain-specific applications, ranging from the evolutionary to the real-time computing areas, where compositions of basic patterns are not capable of fully mimicking the algorithmic behavior of their original sequential codes. The experimental evaluation of the advanced patterns on a set of domain-specific use-cases, using different back-ends (C++ Threads, OpenMP and Intel TBB) and pattern-specific parameters, reports remarkable performance gains. We also demonstrate the benefits of the GrPPI pattern interface from the usability and flexibility points of view.

David del Rio Astorga, Manuel F. Dolz, Javier Fernández, J. Daniel García
A Topology and Traffic Aware Two-Level Scheduler for Stream Processing Systems in a Heterogeneous Cluster

To efficiently handle a large volume of data, scheduling algorithms in stream processing systems need to minimise the data movement between communicating tasks to improve system throughput. However, finding an optimal scheduling algorithm for these systems is NP-hard. In this paper, we propose a heuristic scheduling algorithm for a heterogeneous cluster—T3-Scheduler—that can efficiently identify the communicating tasks and assign them to the same node, up to a specified level of utilisation for that node. Using three common micro-benchmarks and an evaluation using a real-world application, we demonstrate that T3-Scheduler outperforms current state-of-the-art scheduling algorithms, such as Aniello et al.’s popular ‘Online scheduler’, improving throughput by 20–72% for micro-benchmarks and 60% for the real-world application.

Leila Eskandari, Jason Mair, Zhiyi Huang, David Eyers
Stateful Load Balancing for Parallel Stream Processing

Timely processing of streams in parallel requires dynamic load balancing to diminish skewness of data. In this paper we study this problem for stateful operators with key grouping, for which the process of load balancing involves a lot of state movement. Consequently, load balancing is a bi-objective optimization problem, namely Minimum-Cost-Load-Balance (MCLB). We address MCLB with two approximate algorithms by a certain relaxation of the objectives: (1) a greedy algorithm, ELB, performs load balancing eagerly but relaxes the objective of load imbalance to a range; and (2) a periodic algorithm, CLB, aims at reducing load imbalance via a greedy procedure of minimizing the covariance of substreams but ignores the objective of state movement by amortizing its overhead over a relatively long period. We evaluate our approaches with both synthetic and real data. The results show that they can adapt effectively to load variations and improve latency efficiently compared to existing solutions, which ignore the overhead of state movement in stateful load balancing.

Qingsong Guo, Yongluan Zhou
Towards Memory-Optimal Schedules for SDF

The Synchronous Data Flow (SDF) programming model is an established programming paradigm for stream processing applications. SDF programs are expressed by actors and streams that establish communication among actors. Streams are implemented as FIFO buffers, and the size of the FIFO buffers depends on the steady-state schedule. Finding a steady-state schedule that minimizes the sizes of FIFO buffers is of great importance to minimize the memory consumption. The state-of-the-art provides ad-hoc heuristics only, so finding memory-optimal steady-state schedules is still an open challenge. In this work, we study three objective functions capturing the memory utilization of three different implementations of the FIFO buffers. We show that one objective is NP-hard to optimize, while the other two can be solved optimally in polynomial time. The algorithm for computing these optimal schedules is implementable as an online algorithm. We show the effectiveness of our new algorithm by comparing it with the state-of-the-art heuristics. Our experiments show that for large synthetic instances, our algorithm generates schedules that use up to 8 times less memory.

Mitchell Jones, Julián Mestre, Bernhard Scholz
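
The statement above that FIFO buffer sizes depend on the steady-state schedule can be made concrete with a small simulation. This is only a sketch under assumed conventions (a hypothetical two-actor graph, a schedule written as a string of actor names, buffers starting empty); it measures peak buffer occupancy for a given schedule and does not implement the scheduling algorithm from the paper.

```python
def peak_buffer_sizes(edges, schedule):
    """Simulate one steady-state iteration and return peak tokens per FIFO.

    edges maps (producer, consumer) to (tokens produced per firing,
    tokens consumed per firing). schedule is a sequence of actor names.
    """
    tokens = {edge: 0 for edge in edges}
    peak = {edge: 0 for edge in edges}
    for actor in schedule:
        # A firing first consumes from every input FIFO of the actor...
        for (src, dst), (_, consumed) in edges.items():
            if dst == actor:
                if tokens[(src, dst)] < consumed:
                    raise ValueError(f"{actor} fired without enough input tokens")
                tokens[(src, dst)] -= consumed
        # ...and then produces onto every output FIFO.
        for (src, dst), (produced, _) in edges.items():
            if src == actor:
                tokens[(src, dst)] += produced
                peak[(src, dst)] = max(peak[(src, dst)], tokens[(src, dst)])
    if any(tokens.values()):
        raise ValueError("schedule is not a steady-state schedule")
    return peak

# Hypothetical graph: A produces 2 tokens per firing, B consumes 3 per firing,
# so one steady-state iteration fires A three times and B twice.
edges = {("A", "B"): (2, 3)}
batched = peak_buffer_sizes(edges, "AAABB")
interleaved = peak_buffer_sizes(edges, "AABAB")
print(batched, interleaved)
```

Both schedules fire each actor the same number of times, yet the interleaved one needs a buffer of 4 tokens instead of 6, which is the kind of difference a memory-aware scheduler exploits.
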
Towards Hierarchical Autonomous Control for Elastic Data Stream Processing in the Fog

In the Big Data era, Data Stream Processing (DSP) applications should be capable of seamlessly processing huge amounts of data. Hence, they need to dynamically scale their execution on multiple computing nodes so as to adjust to unpredictable data source rates. In this paper, we present a hierarchical and distributed architecture for the autonomous control of elastic DSP applications. It revolves around a two-layered approach. At the lower level, distributed components issue requests for adapting the deployment of DSP operations so as to adjust to changing workload conditions. At the higher level, a per-application centralized component works on a broader time scale; it oversees the application behavior and grants reconfigurations to control the application performance while limiting the negative effect of their enactment, i.e., application downtime. We have implemented the proposed solution in our distributed Storm prototype and evaluated its behavior adopting simple policies. The experimental results are promising and show that, even with simple policies, it is possible to limit the number of reconfigurations while at the same time guaranteeing an adequate level of application performance.

Valeria Cardellini, Francesco Lo Presti, Matteo Nardelli, Gabriele Russo Russo
PiCo: A Novel Approach to Stream Data Analytics

In this paper, we present a new C++ API with a fluent interface called PiCo (Pipeline Composition). PiCo’s programming model aims at making data analytics applications easier to program while preserving or enhancing their performance. This is attained through three key design choices: (1) unifying batch and stream data access models, (2) decoupling processing from data layout, and (3) exploiting a stream-oriented, scalable, efficient C++11 runtime system. PiCo proposes a programming model based on pipelines and operators that are polymorphic with respect to data types in the sense that it is possible to re-use the same algorithms and pipelines on different data models (e.g., streams, lists, sets, etc.). Preliminary results show that PiCo can attain better performance in terms of execution times and hugely improve memory utilization when compared to Spark and Flink in both batch and stream processing.

Claudia Misale, Maurizio Drocco, Guy Tremblay, Marco Aldinucci
Viper: Communication-Layer Determinism and Scaling in Low-Latency Stream Processing

Stream Processing Engines (SPEs) process continuous streams of data and produce up-to-date results in a real-time fashion, typically through one-at-a-time tuple analysis. When looking into the vital SPE processing properties required from applications, determinism has a strong position besides scalability in throughput and low processing latency. SPEs scale in throughput and latency by relying on shared-nothing parallelism, deploying multiple copies of each operator to which tuples are distributed based on the semantics of the operator. The coordination of the asynchronous analysis of parallel operators required to enforce determinism is then carried out by additional dedicated sorting operators. In this work we shift such costly coordination to the communication layer of the SPE. Specifically, we extend earlier work on shared-memory implementations of deterministic operators and provide a communication module (Viper) which can be integrated in the SPE communication layer. Using Apache Storm and the Linear Road benchmark, we show the benefits that can be achieved by our approach in terms of throughput and energy efficiency of SPEs implementing one-at-a-time analysis.

Ivan Walulya, Yiannis Nikolakopoulos, Vincenzo Gulisano, Marina Papatriantafilou, Philippas Tsigas
Scalability and State: A Critical Assessment of Throughput Obtainable on Big Data Streaming Frameworks for Applications With and Without State Information

Emerging Big Data streaming applications are facing unbounded (infinite) data sets at a scale of millions of events per second. The information captured in a single event, e.g., GPS position information of mobile phone users, loses value (perishes) over time and requires sub-second latency responses. Conventional Cloud-based batch-processing platforms are inadequate to meet these constraints. Existing streaming engines exhibit low throughput and are thus equally ill-suited for emerging Big Data streaming applications. To validate this claim, we evaluated the Yahoo streaming benchmark and our own real-time trend detector on three state-of-the-art streaming engines: Apache Storm, Apache Flink and Spark Streaming. We adapted the Kieker dynamic profiling framework to gather accurate profiling information on the throughput and CPU utilization exhibited by the two benchmarks on the Google Compute Engine. To estimate the performance overhead incurred by current streaming engines, we re-implemented our Java-based trend detector as a multi-threaded, shared-memory application in . The achieved throughput of 3.2 million events per second on a stand-alone 2 CPU (44 cores) Intel Xeon E5-2699 v4 server is 44 times higher than the maximum throughput achieved with the Apache Storm version of the trend detector deployed on 30 virtual machines (nodes) in the Cloud. Our experiment suggests vertical scaling as a viable alternative to horizontal scaling, especially if shared state has to be maintained in a streaming application. For reproducibility, we have open-sourced our framework configurations on GitHub [1].

Shinhyung Yang, Yonguk Jeong, ChangWan Hong, Hyunje Jun, Bernd Burgstaller

COLOC – Workshop on Data Locality

Netloc: A Tool for Topology-Aware Process Mapping

Interconnection networks in parallel platforms can be made of thousands of nodes and hundreds of switches. The communication cost between tasks of a parallel application varies significantly with their actual location in such platforms. Topology-aware process mapping consists in matching the application communication pattern with the network topology to reduce the communication cost by placing related tasks close together on the hardware. We show that our Netloc tool for gathering network topology in a generic way can be combined with the state-of-the-art Scotch partitioner for computing topology-aware MPI process placement. Our experiments with a stencil application on a fat-tree machine show that we are able to significantly improve the runtime in the vast majority of cases.

Cyril Bordage, Clément Foyer, Brice Goglin
Runtime Support for Distributed Dynamic Locality

Single node hardware design is shifting to a heterogeneous nature and many of today’s largest HPC systems are clusters that combine heterogeneous compute device architectures. The need for new programming abstractions in the advancements to the Exascale era has been widely recognized and variants of the Partitioned Global Address Space (PGAS) programming model are discussed as a promising approach in this respect. In this work, we present a graph-based approach to provide runtime support for dynamic, distributed hardware locality, specifically considering heterogeneous systems and asymmetric, deep memory hierarchies. Our reference implementation dyloc leverages hwloc to provide high-level operations on logical hardware topology based on user-specified predicates such as filter- and group transformations and locality-aware partitioning. To facilitate integration in existing applications, we discuss adapters to maintain compatibility with the established hwloc API.

Tobias Fuchs, Karl Fürlinger
Large-Scale Experiment for Topology-Aware Resource Management

A Resource and Job Management System (RJMS) is a crucial system software part of the HPC stack. It is responsible for efficiently delivering computing power to applications in supercomputing environments, and its main intelligence relies on resource selection techniques to find the most suitable resources on which to schedule the users’ jobs. In [8], we introduced a new topology-aware resource selection algorithm to determine the best choice among the available nodes of the platform based on their position in the network and on application behaviour (expressed as a communication matrix). We integrated this algorithm as a plugin in Slurm and validated it with several optimization schemes by making comparisons with the default Slurm algorithm. This paper presents further experiments with regard to this selection process.

Yiannis Georgiou, Guillaume Mercier, Adèle Villiermet

Euro-EDUPAR – European Workshop on Parallel and Distributed Computing Education for Undergraduate Students

SCoPE@Scuola: (In)-formative Paths on Topics Related with High Performance, Parallel and Distributed Computing

The SCoPE@Scuola initiative was born with the aim of inspiring curiosity in high school students about High Performance Computing (HPC) and Parallel and Distributed Computing (PDC). The HPC/PDC world could be an interesting subject for students because it is a necessary tool for solving challenging problems in science and technology and it provides a context where plenty of the knowledge acquired at school can find a real application. In fact, the themes related to HPC/PDC involve a large range of knowledge and skills: from mathematical modelling of problems to algorithm design, from software implementation to design and management of complex computer systems. The initiative, begun at the end of 2014, involved several schools in the Naples (Italy) district, and has also been used for work-based learning activities and projects aimed at preventing student dropouts. The results collected over the last years make us hopeful that such an initiative could be useful both to increase students’ awareness of the real-world utility of the knowledge acquired at school and to help them in their future educational and/or working choices.

Giovanni Battista Barone, Vania Boccia, Davide Bottalico, Luisa Carracciuolo
A Set of Patterns for Concurrent and Parallel Programming Teaching

The use of key parallel-programming patterns has proved to be extremely helpful for mastering difficult concurrent and parallel programming concepts and the associated syntactical constructs. The method suggested here consists of a substantial change from more traditional teaching and learning approaches to programming. According to our approach, students are first introduced to concurrency problems through a selected set of preliminary program code patterns. Each pattern also has a series of tests with selected samples to enable students to discover the most common cases that cause problems and then the solutions to be applied. In addition, this paper presents the results obtained from an informal assessment carried out by the students of a course on concurrent and real-time programming that belongs to the computer engineering (CE) degree. The results show that students now feel more actively involved in lectures and practical lessons, and thus make better use of their time and gain a better understanding of concurrency topics that would not have been considered possible before the proposed method was implemented at our University.

Manuel I. Capel, Antonio J. Tomeu, Alberto G. Salguero
Integrating Parallel Computing in Introductory Programming Classes: An Experience and Lessons Learned

Parallel and distributed computing (PDC) has become ubiquitous to the extent that even common users depend on parallel programming. This points to the need for every programmer to understand how parallelism and distributed programming affect problem solving; teaching only traditional sequential programming is no longer sufficient. To address the rapidly widening gap between emerging highly-parallel computer architectures and the sequential programming approach taught in traditional CS/CE courses, the Computer Science Department at Tennessee Technological University has integrated PDC into their introductory programming course sequence. This paper presents our implementation efforts, experience and lessons learned, as well as preliminary evaluation results.

Sheikh Ghafoor, David W. Brown, Mike Rogers
Revisiting Flynn’s Classification: The Portfolio Approach

Today, we are reaching the limits of Moore’s law: the progress of parallel components is no longer exponential, as it was continuously during the last decades. This is something of a paradox, since computing platforms are ever more powerful. It simply tells us that the efficiency of parallel programs is becoming less obvious. If we want to continue to solve hard computational problems, the only way is to change the way problems are solved. In this work, we propose to investigate how algorithm portfolios may be a direction for solving hard and large problems. It is also an occasion for us to revisit the well-known Flynn’s classification and clarify the MISD (Multiple Instructions Single Data) class, which was never really well understood.

Yanik Ngoko, Denis Trystram
Experience with Teaching PDC Topics into Babeş-Bolyai University’s CS Courses

In this paper, we present an analysis of the outcomes of teaching Parallel and Distributed Computing within the Faculty of Mathematics and Computer Science of Babeş-Bolyai University of Cluj-Napoca. The analysis considers the level of interest of students in different topics as a determining factor in achieving the learning outcomes. Our experiences have been greatly influenced by the specific context, defined by the fact that the majority of the students already work at a software company, either as interns in an internship program or as employees. The level of interest of students in a specific topic is also determined by the development of the IT industry in the region. The learning activity is in general influenced by this specific context, and a new, highly demanding topic such as Parallel and Distributed Computing is influenced even more when it is taught at the undergraduate level. This analysis further leads to a more general analysis of the appropriateness of introducing PDC topics, or other relatively advanced topics, to all undergraduate students in CS, or of considering newly defined educational degrees.

Virginia Niculescu, Darius Bufnea
Cellular ANTomata: A Tool for Early PDC Education

The thesis of this essay is that the Cellular ANTomaton (CAnt) computational model—obtained by deploying a team of mobile finite-state machines (the model’s “Ants”) upon a cellular automaton (CA)—can be a highly effective platform for introducing early undergraduate students to a broad range of concepts relating to parallel and distributed computing (PDC). CAnts permit many sophisticated PDC concepts to be taught within a unified, perspicuous model and then experimented with using the many easily accessed systems for simulating CAs and CAnts. Space restrictions limit us to supporting the thesis via only three important PDC concepts: synchronization, (algorithmic) scalability, and leader election (symmetry breaking). Having a single versatile pedagogical platform facilitates the goal of endowing all undergraduate students with a level of computational literacy adequate for success in an era characterized increasingly by ubiquitous parallel and/or distributed computing devices.

Arnold L. Rosenberg
Teaching Software Transactional Memory in Concurrency Courses with Clojure and Java

In the field of concurrency and parallelism, it is known that the use of lock-based synchronization mechanisms limits the programming efficiency of concurrent applications and reveals problems in thread synchronization. Software Transactional Memory (STM) is a consolidated concurrency control mechanism that may be considered as an alternative to lock-based constructs for programming critical software, although STM is still not fully accepted as a programming model in industry. It is our opinion that STM programming must be emphasized more in undergraduate courses on concurrency and parallelism. In this paper we propose an academic experience regarding the introduction of STM programming in concurrency courses by using the Clojure language as the common vehicle for teaching Concurrent Programming. Java, the most popular and widespread programming language for teaching concurrency, becomes a second language in our course, and thus our students can take advantage of the Clojure API, which is defined in Java, in order to simplify the development of programming, lectures and assignments.

Antonio J. Tomeu, Alberto G. Salguero, Manuel I. Capel

F2C-DP – Workshop on Fog-to-Cloud Distributed Processing

Benefits of a Coordinated Fog-to-Cloud Resources Management Strategy on a Smart City Scenario

The advent of fog computing devices as a computing paradigm enriching traditional cloud computing applications paves the way to deploying innovative services that are typically not completely appropriate for, or well supported by, cloud computing technology. For example, fog computing is highly suitable for services with strict constraints on delay, such as dependable services in the e-health arena or tracking strategies in manufacturing processes. Recently, some initiatives have focused on putting together fog and cloud computing to make the best of both, such as the reference architecture by the OpenFog consortium or the Fog-to-Cloud (F2C) concept. However, such a scenario requires a novel management strategy that addresses the foreseen specific demands. In this paper, we argue for the benefits of an F2C architecture for a particular application to be deployed in a smart city or smart environment scenario.

Andrea Bartolí, Francisco Hernández, Laura Val, Jose Gorchs, Xavi Masip-Bruin, Eva Marín-Tordera, Jordi Garcia, Ana Juan, Admela Jukan
Fog and Cloud in the Transportation, Marine and eHealth Domains

Amazing things have been achieved in a wide range of application domains by exploiting a multitude of small connected devices, defined as the Internet of Things. Managing these devices and their resources is a task for the underlying Fog technology that enables the building of smart and efficient applications. Currently, the Fog is not implemented to the extent that we can submit application requirements to a Fog provider, select returned resources and deploy an application on them. A widely adopted workaround is to deploy Cloud applications that exploit the functionality of IoT and Fog devices. Although Clouds provide virtually unlimited computation power, they could present a bottleneck and unnecessary communication overhead when a huge number of devices needs to be controlled, read or written to. Therefore, it is reasonable to formulate use cases that will exploit the Edge and Fog functionality and define a set of basic requirements for Fog providers.

Matija Cankar, Eneko Olivares Gorriti, Matevž Markovič, Flavio Fuart
Scalable Linux Container Provisioning in Fog and Edge Computing Platforms

The tremendous increase in the number of mobile devices and the proliferation of all kinds of new types of sensors are creating new value opportunities by analyzing, developing insights from, and actuating upon large volumes of data streams generated at the edge of the network. While the general-purpose processing required to unleash this value is abundant in Cloud datacenters, bringing raw IoT data streams to the Cloud poses critical challenges, including: (i) regulatory constraints related to data sensitivity, (ii) significant bandwidth costs and (iii) latency barriers inhibiting near-real-time applications. Edge Computing aspires to extend the traditional cloud model to the “edge of the network”, to deliver low latency, bandwidth efficiency and controlled privacy. For all the commonalities between the two models, transitioning the provisioning and orchestration of a distributed analytics platform from Cloud to Edge is not trivial. The two models present totally different cost structures such as price of bandwidth, data communication latency, power density and availability. In this paper, we address the challenge associated with transitioning scalable provisioning from Cloud to distributed Edge platforms. We identify current scalability challenges in Linux container provisioning at the Edge; we propose a novel peer-to-peer model that addresses them; we present a prototype of this model designed for and tested on real Edge testbeds, and we report a scalability evaluation on a scale-out virtualized platform. Our results demonstrate significant savings in terms of provisioning latency and bandwidth utilization.

Michele Gazzetti, Andrea Reale, Kostas Katrinis, Antonio Corradi
A Hash-Based Naming Strategy for the Fog-to-Cloud Computing Paradigm

The growth of the population of Internet-connected devices has fuelled the emergence of new distributed computing paradigms; one of these is so-called Fog-to-Cloud (F2C) computing, where resources (compute, storage, data) are distributed in a hierarchical fashion between the edge and the core of the network. This new paradigm has brought new research challenges, such as the need for a novel framework intended to control and, more generally, facilitate the interaction among the heterogeneous devices making up the environment at the edge of the network and the available resources in the cloud. A key feature that this framework should provide is the capability of uniquely and unequivocally identifying the connected devices. In this paper, a hash-based naming strategy suitable for use in the F2C environment is presented. The proposed naming method is based on three main components: certification, hashing and identification. This research is ongoing work; thus, the steps to follow from the moment a device connects to the F2C network until it receives a name are described, and the major challenges that must be solved are analyzed.

Alejandro Gómez-Cárdenas, Xavi Masip-Bruin, Eva Marín-Tordera, Sarang Kahvazadeh, Jordi Garcia
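
As a rough illustration of the hashing component mentioned in the abstract above, the sketch below derives a fixed-length device name from a set of device attributes. The attribute names, the choice of SHA-256, the canonical encoding and the name length are all assumptions made for illustration; the abstract does not specify them, and the certification and identification steps are not modelled.

```python
import hashlib

def device_name(attributes, length=16):
    """Derive a deterministic name from device attributes (hypothetical scheme).

    Attributes are serialized in sorted key order so that the same device
    description always yields the same name, whatever the insertion order.
    """
    canonical = "|".join(f"{key}={attributes[key]}" for key in sorted(attributes))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return digest[:length]

# Hypothetical device descriptions.
sensor = {"vendor": "acme", "model": "t-100", "serial": "0001"}
same_sensor = {"serial": "0001", "model": "t-100", "vendor": "acme"}
other_sensor = {"vendor": "acme", "model": "t-100", "serial": "0002"}

print(device_name(sensor))
print(device_name(other_sensor))
```

Truncating the digest shortens the name but raises the collision probability, so a real scheme would need to choose the length against the expected device population.
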
An Architecture for Programming Distributed Applications on Fog to Cloud Systems

This paper presents a framework to develop and execute applications in distributed and highly dynamic computing systems composed of cloud resources and fog devices such as mobile phones, cloudlets, and micro-clouds. The work builds on the COMPSs programming framework, which includes a programming model and a runtime already validated in HPC and cloud environments for the transparent execution of parallel applications. As part of the proposed contribution, COMPSs has been enhanced to support the execution of applications on mobile platforms that offer GPUs and CPUs. The scheduling component of COMPSs is under design to be able to offload the computation to other fog devices at the same level of the hierarchy and to cloud resources when more computational power is required. The framework has been tested by executing a sample application on a mobile phone, offloading tasks to a laptop and a private cloud.

Francesc Lordan, Daniele Lezzi, Jorge Ejarque, Rosa M. Badia
Making Use of a Smart Fog Hub to Develop New Services in Airports

The EC H2020 mF2C Project aims at developing a software framework that enables the orchestration of resources and communication at Fog level, as an extension of Cloud Computing and interacting with the IoT. In order to show the project's functionalities and added value, three real-world use cases have been chosen. This paper introduces mF2C Use Case 3: Smart Fog Hub Service (SFHS), in the context of an airport, with the objectives of proving that great potential value and different business opportunities can be created in physical environments with a great concentration of smart objects, showcasing the wide range of scenarios on which mF2C can have an impact, validating the project in industrial events, and attracting massive interest from relevant stakeholders.

Antonio Salis, Glauco Mancini

HeteroPar – Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms

Approximation Algorithm for Scheduling a Chain of Tasks on Heterogeneous Systems

This paper presents an efficient approximation algorithm to solve the task scheduling problem on heterogeneous platforms for the particular case of a linear chain of tasks. The objective is to minimize both the total execution time (makespan) and the total energy consumed by the system. For this purpose, we introduce a constraint on the energy consumption during execution. Our goal is to provide an algorithm with a performance guarantee. Two algorithms are proposed; the first provides an optimal solution for preemptive scheduling. This solution is then used in the second algorithm to provide an approximate solution for non-preemptive scheduling. Numerical evaluations demonstrate that the proposed algorithm achieves close-to-optimal performance compared to the exact solution obtained by CPLEX for small instances. For large instances, CPLEX struggles to provide a feasible solution, whereas our approach takes less than a second to produce a solution for an instance of 10000 tasks.

Massinissa Ait Aba, Lilia Zaourar, Alix Munier
Software-Distributed Shared Memory over Heterogeneous Micro-server Architecture

Nowadays, the design of computing architectures targets not only computing performance but also energy savings. Low-power computing units, such as ARM and FPGA-based nodes, are now being integrated together with high-end processors and GPGPU accelerators into computing clusters. One example is the micro-server architecture, which consists of a backbone onto which computing nodes can be plugged. These nodes can host high-end and low-end CPUs, GPUs, FPGAs and multi-purpose accelerators such as manycores, building up a truly heterogeneous platform. In this context, there is no hardware to federate memories, and the programmability of such architectures relies on the developer's experience in managing data location and task communications. The purpose of this paper is to evaluate the possibility of bringing back the convenient shared-memory programming model by deploying a software-distributed shared memory among heterogeneous computing nodes. We describe how we have built such a system over a message-passing runtime. Experiments have been conducted using a parallel image processing application over a homogeneous cluster and a heterogeneous micro-server.

Loïc Cudennec
A High-Throughput Kalman Filter for Modern SIMD Architectures

The Kalman filter is a critical component of the reconstruction process of subatomic particle collisions in high-energy physics detectors. At the LHCb detector in the Large Hadron Collider, this reconstruction must be performed at an average rate of 30 million times per second. As a consequence of the ever-increasing collision rate and upcoming detector upgrades, the data rate that needs to be processed in real time is expected to increase by a factor of 40 in the next five years. In order to keep pace, processing and filtering software must take advantage of the latest developments in hardware technology. In this paper we present a cross-architecture SIMD parallel algorithm and implementation of a low-rank Kalman filter. We integrate our implementation in production code and validate the numerical results in the context of physics reconstruction. We also compare its throughput across modern multi- and many-core architectures. Using our Kalman filter implementation, we are able to achieve a sustained throughput of 75 million particle hit reconstructions per second on an Intel Xeon Phi Knights Landing platform, a factor of 6.81 over the current production implementation running on a two-socket Haswell system. Additionally, we show that under the constraints of our Kalman filter formulation we efficiently use the available hardware resources. Our implementation will allow us to better sustain the required throughput of the detector in the coming years and scale to future hardware architectures. Additionally, our work enables the evaluation of other computing platforms for future hardware upgrades.

Daniel Hugo Cámpora Pérez, Omar Awile, Cédric Potterat
Resource Contention Aware Execution of Multiprocessor Tasks on Heterogeneous Platforms

In high performance computing (HPC), the tasks of complex applications have to be assigned to the compute nodes of heterogeneous HPC platforms in such a way that the total execution time is minimized. Common approaches, such as task scheduling methods, usually base their decisions on task runtimes that are predicted by cost models. A high accuracy and reliability of these models is crucial for achieving low execution times for all tasks. The individual runtimes of concurrently executed tasks are often affected by contention for hardware resources, such as communication networks, the main memory, or hard disks. However, existing cost models usually ignore the effects of resource contention, thus leading to large deviations between predicted and measured runtimes. In this article, we present a resource contention aware cost model for the execution of multiprocessor tasks on heterogeneous platforms. The integration of the proposed model into two task scheduling methods is described. The cost model is validated in isolation as well as within the utilized scheduling methods. Performance results with different benchmark tasks and with tasks of a complex simulation application are shown to demonstrate the performance improvements achieved by taking the effects of resource contention into account.

Robert Dietze, Michael Hofmann, Gudula Rünger
Hybrid CPU-GPU Simulation of Hierarchical Adaptive Random Boolean Networks

Random Boolean networks (RBNs) as models of gene regulatory networks are widely studied by means of computer simulation to explore interconnections between their topology, regimes of functioning and patterns of information processing. Direct simulation of random Boolean networks is known to be computationally hard because attractor lengths grow exponentially with network size. In this paper, we propose a hybrid CPU-GPU algorithm for parallel simulation of hierarchical adaptive RBNs. The rules of evolution of this type of RBN make it possible to parallelize calculations both for different subnetworks and for different nodes while updating their states. In the experimental part of the study, we explore the efficiency of OpenMP and CPU-GPU algorithms for different sizes of networks and configurations of hierarchy. The results show that the hybrid algorithm performs better for a smaller number of subnetworks, while the OpenMP version may be preferable for a limited number of nodes in each subnetwork.

Kirill Kuvshinov, Klavdiya Bochenina, Piotr J. Górski, Janusz A. Hołyst
Benchmarking Heterogeneous Cloud Functions

Cloud Functions, often called Function-as-a-Service (FaaS), pioneered by AWS Lambda, are an increasingly popular method of running distributed applications. As in other cloud offerings, cloud functions are heterogeneous, due to different underlying hardware, runtime systems, as well as resource management and billing models. In this paper, we focus on performance evaluation of cloud functions, taking into account heterogeneity aspects. We developed a cloud function benchmarking framework, consisting of one suite based on Serverless Framework, and one based on HyperFlow. We deployed two CPU-intensive benchmarks, Mersenne Twister and Linpack, and evaluated all the major cloud function providers: AWS Lambda, Azure Functions, Google Cloud Functions and IBM OpenWhisk. We make our results available online and update them continuously. We report on the initial results of the performance evaluation and discuss the insights gained into the resource allocation policies.

Maciej Malawski, Kamil Figiela, Adam Gajek, Adam Zima
Impact of Compiler Phase Ordering When Targeting GPUs

Research in compiler pass phase ordering (i.e., selection of compiler analysis/transformation passes and their order of execution) has been mostly performed in the context of CPUs and, in a small number of cases, FPGAs. In this paper we present experiments regarding compiler pass phase ordering specialization of OpenCL kernels targeting NVIDIA GPUs using Clang/LLVM 3.9 and the libclc OpenCL library. More specifically, we analyze the impact of using specialized compiler phase orders on the performance of 15 PolyBench/GPU OpenCL benchmarks. In addition, we analyze the final NVIDIA PTX assembly code generated by the different compilation flows in order to identify the main reasons for the cases with significant performance improvements. Using specialized compiler phase orders, we were able to achieve performance improvements over the CUDA version and OpenCL compiled with the NVIDIA driver. Compared to CUDA, we were able to achieve geometric mean improvements of 1.54× (up to 5.48×). Compared to the OpenCL driver version, we were able to achieve geometric mean improvements of 1.65× (up to 5.70×).

Ricardo Nobre, Luís Reis, João M. P. Cardoso
Evaluating Scientific Workflow Execution on an Asymmetric Multicore Processor

Asymmetric multicore architectures that integrate different types of cores are emerging as a potential solution for good performance and power efficiency. Although scheduling can be improved by utilizing an appropriate set of cores for the execution of the different jobs, determining frequency configurations is also crucial to achieve both good performance and energy efficiency. This challenge may be more profound with scientific workflow applications that consist of jobs with data dependency constraints. The paper focuses on deploying and evaluating the Montage scientific workflow on an asymmetric multicore platform with the aim to explore CPU frequency configurations with different trade-offs between execution time and energy efficiency. The proposed approach provides good estimates of workflow execution time and energy consumption for different frequency configurations with an average error of less than 8.63% for time and less than 9.69% for energy compared to actual values.

Ilia Pietri, Sicong Zhuang, Marc Casas, Miquel Moretó, Rizos Sakellariou
Operational Concepts of GPU Systems in HPC Centers: TCO and Productivity

Nowadays, numerous supercomputers comprise GPUs due to promising high performance and memory bandwidth at low power consumption. With GPUs attached to a host system, applications could improve their runtime by utilizing both devices. However, this comes at a cost of increased development effort and system power consumption. In this paper, we compare the total cost of ownership (TCO) and productivity of different operational concepts of GPU systems in HPC centers covering various (heterogeneous) program execution models and CPU-GPU setups. Our investigations include runtime, power consumption, development effort and hardware purchase costs and are exemplified with two application case studies.

Fabian P. Schneider, Sandra Wienke, Matthias S. Müller
Large Scale Graph Processing in a Distributed Environment

Large graphs are widely used in real-world graph analytics. The memory available in a single machine is usually inadequate to process these graphs. A good solution is to use a distributed environment. Typical programming styles used in existing distributed environment frameworks are different from imperative programming and difficult for programmers to adapt to. Moreover, some graph algorithms with a high degree of parallelism ideally run on an accelerator cluster. The error-prone, lower-level programming methods (memory and thread management) available for such systems deter programmers from using such architectures. Existing frameworks do not deal with accelerator clusters. We propose a framework which addresses the previously stated deficiencies. Our framework automatically generates implementations of graph algorithms for distributed environments from intuitive shared-memory-based code written in a high-level Domain Specific Language (DSL), Falcon. The framework analyses the intermediate representation, applies a set of optimizations and then generates Giraph code for a CPU cluster and MPI+OpenCL code for a GPU cluster. Experimental evaluations show the efficiency and scalability of our framework.

Nitesh Upadhyay, Parita Patel, Unnikrishnan Cheramangalath, Y. N. Srikant

LSDVE – Workshop on Large Scale Distributed Virtual Environments

Appraising SPARK on Large-Scale Social Media Analysis

Software systems for social media analysis provide algorithms and tools for extracting useful knowledge from user-generated social media data. ParSoDA (Parallel Social Data Analytics) is a Java library for developing parallel data analysis applications based on the extraction of useful knowledge from social media data. This library aims at reducing the programming skills necessary to implement scalable social data analysis applications. This work describes how the ParSoDA library has been extended to execute applications on Apache Spark. Using a cluster of 12 workers, the Spark version of the library reduces the execution time of two case study applications exploiting social media data by up to 42%, compared to the Hadoop version of the library.

Loris Belcastro, Fabrizio Marozzo, Domenico Talia, Paolo Trunfio
A Spatial Analysis of Multiplayer Online Battle Arena Mobility Traces

A careful analysis and a deep understanding of real mobility traces are of paramount importance when it comes to designing mobility models that aim to accurately reproduce avatar movements in virtual environments. In this paper we focus on the analysis of a specific kind of virtual environment, namely the Multiplayer Online Battle Arena (MOBA), which is an extremely popular online game genre. We performed a spatial analysis of about one hundred games of a popular MOBA, roughly corresponding to 4000 minutes of movements. The analysis revealed interesting patterns in terms of AoI observation and the utilization of the map by the avatars. These results are effective building blocks toward the creation of realistic mobility models targeting MOBA environments.

Emanuele Carlini, Alessandro Lulli
Long Transaction Chains and the Bitcoin Heartbeat

Over the past few years a persistent growth in the number of daily Bitcoin transactions has been observed. This trend, however, is known to be influenced by a number of phenomena that generate long transaction chains that are not related to real purchases (e.g. wallet shuffling and coin mixing). For a transaction chain, we define the transaction chain frequency as the number of transactions in the chain divided by the time interval of the chain. In this paper, we first analyze to what extent Bitcoin transactions are involved in high-frequency transaction chains, in the short and in the long term. Based on this analysis, we then argue that a large fraction of transactions do not refer to explicit human activity, namely to transactions between users that trade goods or services. Finally, we show that most of the transactions are involved in chains whose frequency is roughly stable over time, which we call the Bitcoin Heartbeat.
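
The transaction chain frequency defined above can be illustrated with a minimal Python sketch. The function name `chain_frequency` and the choice to measure the chain's time interval from its earliest to its latest transaction timestamp are assumptions made for illustration; the abstract does not specify how the interval is measured.

```python
from datetime import datetime, timedelta


def chain_frequency(timestamps):
    """Return transactions per second for one transaction chain.

    Assumption (not stated in the abstract): the chain's time interval
    is the span between its earliest and latest transaction timestamps.
    """
    if len(timestamps) < 2:
        raise ValueError("need at least two transactions to define an interval")
    interval = (max(timestamps) - min(timestamps)).total_seconds()
    if interval <= 0:
        raise ValueError("all transactions share the same timestamp")
    # Number of transactions in the chain divided by the chain's time interval.
    return len(timestamps) / interval


# Hypothetical chain: four transactions spaced 20 seconds apart (60 s span).
start = datetime(2017, 1, 1, 12, 0, 0)
example_chain = [start + timedelta(seconds=20 * i) for i in range(4)]
example_frequency = chain_frequency(example_chain)
```

Under this assumption, the hypothetical chain has a frequency of 4 transactions / 60 seconds.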

Giuseppe Di Battista, Valentino Di Donato, Maurizio Pizzonia
Dynamic Community Analysis in Decentralized Online Social Networks

Community structure is one of the most studied features of Online Social Networks (OSNs). Community detection offers several advantages for both centralized and decentralized social networks. Decentralized Online Social Networks (DOSNs) have been proposed to provide more control over private data. One of the main challenges in DOSNs concerns the availability of social data, and communities can be exploited to provide a more efficient solution to the data availability problem. The detection of communities and the management of their evolution represent a hard process, especially in highly dynamic social networks, such as DOSNs, where the online/offline status of users changes very frequently. In this paper, we focus our attention on a preliminary analysis of dynamic community detection in DOSNs by studying a real Facebook dataset to evaluate how frequently the communities change over time and which events are more frequent. The results show that the social graph is highly unstable and that distributed solutions to manage the dynamism are needed.

Barbara Guidi, Andrea Michienzi, Giulio Rossetti
Multi-objective Service Oriented Network Provisioning in Ultra-Scale Systems

The paradigm of ultra-scale computing has been recently pushed forward by the current trends in distributed computing. This novel architecture concept is focused towards a federation of multiple geographically distributed heterogeneous systems under a single system image, thus allowing efficient deployment and management of very complex architectures applications. To enable sustainable ultra-scale computing, multiple major challenges have to be tackled, such as improved data distribution, increased systems scalability, enhanced fault tolerance, elastic resource management and low-latency communication. Regrettably, the current research initiatives in the area of ultra-scale computing are at a very early stage and are predominantly concentrated on the management of computational and storage resources, thus leaving the networking aspects unexplored. In this paper we introduce a promising new paradigm for cluster-based multi-objective service-oriented network provisioning for ultra-scale computing environments by unifying the management of the local communication resources and the external inter-domain network services under a single point of view. We explore the potential for representing the local network resources within a single distributed or parallel system and combining them with the external communication services.

Dragi Kimovski, Sashko Ristov, Roland Mathá, Radu Prodan

Resilience – Workshop on Resiliency in High Performance Computing with Clouds, Grids, and Clusters

Understanding and Improving the Trust in Results of Numerical Simulations and Scientific Data Analytics

With the ever-increasing execution scale of parallel scientific simulations, potential unnoticed corruptions of scientific data during simulation make users more suspicious about the correctness of floating-point calculations than ever before. In this paper, we analyze the issue of trust in the results of numerical simulations and scientific data analytics. We first classify the corruptions into two categories, nonsystematic corruption and systematic corruption, and also discuss their origins. Then, we provide a formal definition of trust in simulation and analytical results across multiple areas. We also discuss what kind of result accuracy would be expected from the user’s perspective and how to build trust using existing techniques. We finally identify the current gap and discuss two potential research directions based on existing techniques. We believe that this paper will be of interest to researchers who are working on the detection of potential unnoticed corruptions in scientific simulation and data analytics, in that it not only provides a clear definition and classification of corruption and an in-depth survey of corruption sources, but also discusses potential research directions/topics based on existing detection techniques.

Franck Cappello, Rinku Gupta, Sheng Di, Emil Constantinescu, Thomas Peterka, Stefan M. Wild
Pattern-Based Modeling of High-Performance Computing Resilience

With the growing scale and complexity of high-performance computing (HPC) systems, resilience solutions that ensure continuity of service despite frequent errors and component failures must be methodically designed to balance the reliability requirements with the overheads to performance and power. Design patterns enable a structured approach to the development of resilience solutions, providing hardware and software designers with the building block elements for the rapid development of novel solutions and for adapting existing technologies for emerging, extreme-scale HPC environments. In this paper, we develop analytical models that enable designers to evaluate the reliability and performance characteristics of the design patterns. These models are particularly useful in building a unified framework that analyzes and compares various resilience solutions built using a combination of patterns.

Saurabh Hukerikar, Christian Engelmann
On the Resilience of Conjugate Gradient and Multigrid Methods to Node Failures

In this paper, we examine the inherent resilience of multigrid (MG) and conjugate gradient (CG) methods in the search for algorithm-based approaches to deal with node failures in large parallel HPC systems. In previous work, silent data corruption has been modeled as the perturbation of values in the work arrays of a MG solver. It was concluded that MG recovers fast from errors of this type. We explore how fast MG and CG methods recover from the loss of a contiguous section of their working memory, modeling a node failure. Since MG and CG methods differ in their convergence rates, we propose a methodology to compare their resilience: Time is represented as a fraction of the iterations required to reach a certain target precision, and failures are introduced when the residual norm reaches a certain threshold. We use the two solvers on a linear system that represents a model elliptic partial differential equation, and we experimentally evaluate the overhead caused by the introduced faults. Additionally, we observe the behavior of the conjugate gradient solver under node failures for additional test problems. Approximating the lost values of the solution using interpolation reduces the overhead for MG, but the effect on the CG solver is minimal. We conclude that the methods also have the inherent ability to recover from node failures. However, we illustrate that the relative overhead caused by node failures is significant.

Carlos Pachajoa, Wilfried N. Gansterer
It’s Not the Heat, It’s the Humidity: Scheduling Resilience Activity at Scale

Maintaining the performance of high-performance computing (HPC) applications with the expected increase in failures is a major challenge for next-generation extreme-scale systems. With increasing scale, resilience activities (e.g. checkpointing) are expected to become more diverse, less tightly synchronized, and more computationally intensive. Few existing studies, however, have examined how decisions about scheduling resilience activities impact application performance. In this work, we examine the relationship between the duration and frequency of resilience activities and application performance. Our study reveals several key findings: (i) the aggregate amount of time consumed by resilience activities is not an effective metric for predicting application performance; (ii) the duration of the interruptions due to resilience activities has the greatest influence on application performance; shorter, but more frequent, interruptions are correlated with better application performance; and (iii) the differential impact of resilience activities across applications is related to the applications’ inter-collective frequencies; the performance of applications that perform infrequent collective operations scales better in the presence of resilience activities than the performance of applications that perform more frequent collective operations. This initial study demonstrates the importance of considering how resilience activities are scheduled. We provide critical analysis and direct guidance on how the resilience challenges of future systems can be met while minimizing the impact on application performance.

Patrick M. Widener, Kurt B. Ferreira, Scott Levy

ROME – Workshop on Runtime and Operating Systems for the Many-core Era

Data Partitioning Strategies for Stencil Computations on NUMA Systems

Many scientific problems rely on the efficient execution of stencil computations, which are usually memory-bound. In this paper, stencils on two-dimensional data are executed on NUMA architectures. Each node of a NUMA system processes a distinct partition of the input data independent from other nodes. However, processors may need access to the memory of other nodes at the edges of the partitions. This paper demonstrates two techniques based on machine learning for identifying partitioning strategies that reduce the occurrence of remote memory access. One approach is generally applicable and is based on an uninformed search. The second approach caps the search space by employing geometric decomposition. The partitioning strategies obtained with these techniques are analyzed theoretically. Finally, an evaluation on a real NUMA machine is conducted, which demonstrates that the expected reduction of the remote memory accesses can be achieved.

Frank Feinbube, Max Plauth, Marius Knaust, Andreas Polze
Delivering Fairness on Asymmetric Multicore Systems via Contention-Aware Scheduling

Asymmetric single-ISA multicore processors (AMPs), which integrate high-performance big cores and low-power small cores, were shown to deliver better energy efficiency than symmetric multicores for diverse workloads. Previous work has highlighted that this potential of AMP systems can be realized with help from the OS scheduler. Notably, delivering fairness on AMPs still constitutes an important challenge, as it requires the scheduler to accurately track the progress of each thread as it runs on the various core types throughout the execution. In turn, this progress depends on the speedup that an application derives on a big core relative to a small one. While existing fairness-aware schedulers take application relative speedup into consideration when tracking progress, they do not account for the performance degradation that may occur naturally due to contention on shared resources among cores, such as the last-level cache or the memory bus. In this paper, we propose CAMPS, a contention-aware fair scheduler for AMPs. Our experimental evaluation, which employs real asymmetric hardware and scheduler implementations in the Linux kernel, demonstrates that CAMPS improves fairness by 10.6% on average with respect to a state-of-the-art fairness-aware scheme, while delivering higher throughput.

Adrian Garcia-Garcia, Juan Carlos Saez, Manuel Prieto-Matias
Powernightmares: The Challenge of Efficiently Using Sleep States on Multi-core Systems

Sleep states are an important and well-understood feature of modern server and desktop CPUs that enable significant power savings during idle and partial load scenarios. Making proper decisions about how to use this feature remains a major challenge for operating systems since it requires a trade-off between potential energy-savings and performance penalties for long and short phases of inactivity, respectively. In this paper we analyze the default behavior of the Linux kernel in this regard and identify weaknesses of certain default assumptions. We derive pathological patterns that trigger these weaknesses and lead to ‘Powernightmares’ during which power-saving sleep states are used insufficiently. Our analysis of a workstation and a large supercomputer reveals that these scenarios are relevant on real-life systems in default configuration. We present a methodology to analyze these effects in detail despite their inherent nature of being hardly observable. Finally, we present a concept to mitigate these problems and reclaim lost power saving opportunities.

Thomas Ilsche, Marcus Hähnel, Robert Schöne, Mario Bielert, Daniel Hackenberg
Help Your Busy Neighbors: Dynamic Multicasts over Static Topologies

Acknowledged multicasts, e.g. for software-based TLB invalidation, are a performance critical aspect of runtime environments for many-core processors. Their latency and peak throughput highly depend on the topology used to propagate the events and to collect the acknowledgements. Based on the assumption of an inevitable interrupt latency, previous work focused on very simple flat topologies. However, the emergence of simultaneous multi-threading with locally shared caches enables interrupt-free multicasts. Therefore, this paper explores and re-evaluates the design space for dynamic multicast groups based on combining shared memory with active messages and helping mechanisms. We expect this new approach to considerably improve the scalability of acknowledged multicasts on many-core processors.

Robert Kuban, Randolf Rotta, Jörg Nolte

UCHPC – Workshop on Unconventional High Performance Computing

Accelerating the 3-D FFT Using a Heterogeneous FPGA Architecture

Future Exascale architectures will likely make extensive use of computing accelerators such as Field Programmable Gate Arrays (FPGAs) given that these accelerators are very power efficient. Oftentimes, these FPGAs are located at the network interface card (NIC) and switch level in order to accelerate network operations, incorporate contention avoiding routing schemes, and perform computations directly on the NIC and bypass the arithmetic logic unit (ALU) of the CPU. This work explores just such a heterogeneous FPGA architecture in the context of two kernels that are driving applications in leadership machines: the 3-D Fast Fourier Transform (3-D FFT) and Asynchronous Multi-Tasking (AMT). The machine explored here is a DataVortex system which consists of conventional processors but with programmable logic incorporated in the memory architecture. The programmable logic controls the network and is incorporated both in the network interface cards and the network switches and implements a contention avoiding network routing. Both the 3-D FFT and AMT kernels show compelling performance for deployment to FFT driven applications in both molecular dynamics and density functional theory.

Matthew Anderson, Maciej Brodowicz, Martin Swany, Thomas Sterling
Evaluation of a Floating-Point Intensive Kernel on FPGA: A Case Study of Geodesic Distance Kernel

Heterogeneous platforms provide a promising solution for high-performance and energy-efficient computing applications. This paper presents our research on the use of a heterogeneous platform for a floating-point intensive kernel. We first introduce the floating-point intensive kernel, which comes from a geographical information system. Then we analyze the FPGA designs generated by the Intel FPGA SDK for OpenCL, and evaluate the kernel performance and the floating-point error rate of the FPGA designs. Finally, we compare the performance and energy efficiency of the kernel implementations on the Arria 10 FPGA, Intel’s Xeon Phi Knights Landing CPU, and NVIDIA’s Kepler GPU. Our evaluation shows that the energy efficiency of the single-precision kernel on the FPGA is 1.35X better than on the CPU and the GPU, while the energy efficiency of the double-precision kernel on the FPGA is 1.36X and 1.72X lower than on the CPU and GPU, respectively.

Zheming Jin, Hal Finkel, Kazutomo Yoshii, Franck Cappello
Shallow Water Waves on a Deep Technology Stack: Accelerating a Finite Volume Tsunami Model Using Reconfigurable Hardware in Invasive Computing

Reconfigurable architectures are commonly used in the embedded systems domain to speed up compute-intensive tasks. They combine a reconfigurable fabric with a general-purpose microprocessor to accelerate compute-intensive tasks on the fabric while the general-purpose CPU is used for the rest of the workload. Through the use of invasive computing, we aim to show the feasibility of this technology for HPC scenarios. We demonstrate this by accelerating a proxy application for the simulation of shallow water waves using the i-Core, a reconfigurable processor that is part of the invasive computing multiprocessor system-on-chip. Using a floating-point custom instruction, the entire computation of numerical fluxes occurring in the application’s finite volume scheme is performed by hardware accelerators.

Alexander Pöppl, Marvin Damschen, Florian Schmaus, Andreas Fried, Manuel Mohr, Matthias Blankertz, Lars Bauer, Jörg Henkel, Wolfgang Schröder-Preikschat, Michael Bader
Linking Application Description with Efficient SIMD Code Generation for Low-Precision Signed-Integer GEMM

The need to implement demanding numerical algorithms within a constrained power budget has led to a renewed interest in low-precision number formats. Exploring the degrees of freedom provided both by better support for low-precision number formats on computer architectures and by the respective application domain remains a most demanding task, though. In this example, we upgrade the machine learning framework Theano and the Eigen linear algebra library to support matrix multiplication of formats between 32 and 1 bit by packing multiple values in a 32-bit vector. This approach keeps all the optimizations Eigen applies to the overall matrix operation, while maximizing the performance enabled by SIMD units on modern embedded CPUs. With respect to 32-bit formats, we achieve a speedup between 0.45 and 21.17 on an ARM Cortex-A15.

Günther Schindler, Manfred Mücke, Holger Fröning
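
The packing idea mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' Theano or Eigen code: the function names `pack`, `unpack`, and `packed_dot`, the low-slot-first layout, and the scalar unpack-then-multiply loop are all assumptions chosen for illustration. A real implementation would operate on the packed words with SIMD instructions rather than unpacking in a Python loop.

```python
# Illustrative sketch only: packs signed low-precision integers into
# 32-bit words and computes a dot product on the packed data.
# All names and the slot layout are hypothetical, not from the paper.

WORD_BITS = 32


def pack(values, bits):
    """Pack signed `bits`-wide integers into 32-bit words, lowest slot first."""
    per_word = WORD_BITS // bits
    mask = (1 << bits) - 1
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    words = []
    for start in range(0, len(values), per_word):
        word = 0
        for slot, v in enumerate(values[start:start + per_word]):
            if not lo <= v <= hi:
                raise ValueError(f"{v} does not fit in {bits} signed bits")
            # Two's-complement encode the value and place it in its slot.
            word |= (v & mask) << (slot * bits)
        words.append(word)
    return words


def unpack(words, bits, count):
    """Recover the first `count` signed values from packed 32-bit words."""
    per_word = WORD_BITS // bits
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    out = []
    for word in words:
        for slot in range(per_word):
            raw = (word >> (slot * bits)) & mask
            out.append((raw ^ sign) - sign)  # sign-extend the raw field
    return out[:count]


def packed_dot(a_words, b_words, bits, count):
    """Dot product of two packed vectors (scalar reference version)."""
    a = unpack(a_words, bits, count)
    b = unpack(b_words, bits, count)
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    a = pack([1, -2, 3, -4], bits=4)
    b = pack([4, 3, -2, -1], bits=4)
    print(packed_dot(a, b, bits=4, count=4))
```

With 4-bit values, eight elements share one 32-bit word, so a matrix row occupies one eighth of the storage of its 32-bit counterpart. That reduction in memory traffic is the kind of saving a packed format aims for, while the speedup range quoted in the abstract (0.45 to 21.17) indicates that packing does not pay off for every format.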

Complementary Papers

A Formula-Driven Scalable Benchmark Model for ABM, Applied to FLAME GPU

Agent Based Modelling (ABM) systems have become a popular technique for describing complex and dynamic systems. ABM is the simulation of intelligent agents and how these agents communicate with each other within the model. The growing number of agent-based applications in the simulation and AI fields has led to an increase in the number of studies that focus on evaluating the modelling capabilities of these applications. Observing system performance and how applications behave as population size increases is the main benchmarking factor in most of these studies. System scalability is not the only issue that may affect overall performance, however; other issues must also be addressed to create a standard benchmark model that meets all ABM criteria. This paper presents a new benchmark model and benchmarks the performance characteristics of the FLAME GPU simulator as an example of a parallel framework for ABM. The aim of this model is to provide parameters to easily measure the following elements: system scalability, system homogeneity, and the ability to handle increases in the level of agent communications and model complexity. Results show that FLAME GPU demonstrates near linear scalability when increasing population size and when reducing homogeneity. The benchmark also shows a negative correlation between increasing the communication complexity between agents and execution time. The results create a baseline for improving the performance of FLAME GPU and allow the simulator to be contrasted with other multi-agent simulators.

Eidah Alzahrani, Paul Richmond, Anthony J. H. Simons
PhotoNoCs: Design Simulation Tool for Silicon Integrated Photonics Towards Exascale Systems

The need to greatly increase the number of compute nodes to design exascale systems raises numerous challenges that must be solved to obtain an efficient system in terms of cost, energy consumption and performance. Data movement is a critical barrier toward realizing exascale computing systems, and therefore the interconnection network is a key component of these systems. Among the different technologies that could contribute to an efficient interconnect, photonics is perhaps the most disruptive, due to its capabilities to generate, transmit, and receive high bandwidth signals with superior power efficiencies and inherent immunity to degradation. However, photonic interconnects lack practical buffering, which makes these networks circuit-switched in essence. Therefore, new network architectures are required, both to satisfy the requirements of data transfers between nodes and between the multiple computing resources of each multicore node. This paper presents PhotoNoCs, a tool that helps the computer architect design and test new approaches to photonic interconnection systems at different levels: on-chip networks for multicore architectures and off-chip networks for the whole supercomputer.

Juan-Jose Crespo, Francisco J. Alfaro-Cortés, José L. Sánchez
On the Effects of Data-Aware Allocation on Fully Distributed Storage Systems for Exascale

The convergence between computing- and data-centric workloads and platforms is imposing new challenges on how to best use the resources of modern computing systems. In this paper we show the need to enhance system schedulers so that they differentiate between compute- and data-oriented applications and minimise interference between storage and application traffic. This interference can be especially harmful in systems featuring fully distributed storage together with unified interconnects, such as our custom-made architecture ExaNeSt. We analyse several data-aware allocation strategies and find that such strategies are essential to maintain performance in distributed storage systems.

Jose A. Pascual, Caroline Concatto, Joshua Lant, Javier Navaridas
Efficient Implementation of Data Objects in the OSD+-Based Fusion Parallel File System

OSD+s are enhanced object-based storage devices (OSDs) able to deal with both data and metadata operations via data and directory objects, respectively. So far, we have focused on designing and implementing efficient directory objects in OSD+s. This paper, however, presents our work on also supporting data objects, and describes how the coexistence of both kinds of objects in OSD+s is exploited to implement data objects efficiently and to speed up some common file operations. We compare our OSD+-based Fusion Parallel File System (FPFS) with Lustre and OrangeFS. Results show that FPFS provides performance up to 37× better than Lustre, and up to 95× better than OrangeFS, for metadata workloads. FPFS also provides 34% more bandwidth than OrangeFS for data workloads, and competes with Lustre for data writes. Results also show serious scalability problems in Lustre and OrangeFS.

Juan Piernas, Pilar González-Férez