
2019 | Book

Euro-Par 2018: Parallel Processing Workshops

Euro-Par 2018 International Workshops, Turin, Italy, August 27-28, 2018, Revised Selected Papers

Edited by: Dr. Gabriele Mencagli, Dora B. Heras, Valeria Cardellini, Prof. Emiliano Casalicchio, Emmanuel Jeannot, Felix Wolf, Antonio Salis, Claudio Schifanella, Ravi Reddy Manumachu, Laura Ricci, Marco Beccuti, Laura Antonelli, José Daniel Garcia Sanchez, Stephen L. Scott

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this Book

This book constitutes revised selected papers from the workshops held at the 24th International Conference on Parallel and Distributed Computing, Euro-Par 2018, which took place in Turin, Italy, in August 2018.

The 64 full papers presented in this volume were carefully reviewed and selected from 109 submissions.

Euro-Par is an annual, international conference in Europe, covering all aspects of parallel and distributed processing. These range from theory to practice, from small to the largest parallel and distributed systems and infrastructures, from fundamental computational problems to full-fledged applications, from architecture, compiler, language and interface design and implementation to tools, support infrastructures, and application performance aspects.

Table of Contents

Frontmatter

Auto-DaSP - Workshop on Autonomic Solutions for Parallel and Distributed Data Stream Processing

Frontmatter
TPICDS: A Two-Phase Parallel Approach for Incremental Clustering of Data Streams

Parallel and distributed solutions are essential for clustering data streams due to the large volumes of data involved. This paper first examines a direct adaptation of a recently developed prototype-based algorithm into three existing parallel frameworks. Based on an evaluation of their performance, the paper then presents a customised pipeline framework that combines incremental and two-phase learning into a balanced approach that dynamically allocates the available processing resources. This new framework is evaluated on a collection of synthetic datasets. The experimental results reveal that the framework not only produces correct final clusters but also significantly improves clustering efficiency.

Ammar Al Abd Alazeez, Sabah Jassim, Hongbo Du
Cost of Fault-Tolerance on Data Stream Processing

Data streaming engines process data on the fly, in contrast to databases, which first store the data and then process it. In order to process the increasing amount of data produced every day, data streaming engines run on top of distributed systems, where failures are likely to happen. Current distributed data streaming engines like Apache Flink provide fault tolerance. In this paper, we evaluate the performance impact of Flink's fault-tolerance mechanisms on a distributed system, both during regular operation (when there are no failures) and when failures occur. We use Intel HiBench to conduct the evaluation.

Valerio Vianello, Marta Patiño-Martínez, Ainhoa Azqueta-Alzúaz, Ricardo Jimenez-Péris
Autonomic and Latency-Aware Degree of Parallelism Management in SPar

Stream processing applications have become a representative workload in current computing systems. A significant part of these applications demands parallelism to increase performance. However, programmers often face a trade-off between coding productivity and performance when introducing parallelism. SPar was created to balance this trade-off for application programmers by using the C++11 attribute annotation mechanism. In SPar and other programming frameworks for stream processing applications, manually defining the number of replicas to be used for the stream operators is a challenge. In addition, several stream processing applications require low latency, yet explicit latency requirements are poorly considered in state-of-the-art parallel programming frameworks. Since there is a direct relationship between the number of replicas and the latency of the application, in this work we propose an autonomic and adaptive strategy for choosing the proper number of replicas in SPar to address latency constraints. We experimentally evaluated the implemented strategy on a real-world application, showing that our adaptive strategy can provide higher abstraction levels while automatically managing the latency.

Adriano Vogel, Dalvan Griebler, Daniele De Sensi, Marco Danelutto, Luiz Gustavo Fernandes
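
Readers unfamiliar with SPar may find a concrete annotation helpful. The following is a minimal sketch in the C++11 attribute style described above; the attribute names follow the published SPar syntax, while the stage body and the replica count are illustrative (the paper's contribution is precisely to tune that count automatically). Standard compilers ignore unknown attribute namespaces, so the snippet also compiles as plain sequential C++.

```cpp
#include <iostream>

// Placeholder per-item work; in a real application this is the costly filter.
int process(int frame) { return frame * 2; }

int main() {
    int frame = 0;
    [[spar::ToStream]] while (frame < 1000) {
        ++frame;
        int result = 0;
        // Replicated stage: SPar generates a farm with N worker replicas.
        // The adaptive strategy proposed in the paper manages N at run time
        // instead of fixing it here.
        [[spar::Stage, spar::Input(frame), spar::Output(result),
          spar::Replicate(4)]]
        { result = process(frame); }
        [[spar::Stage, spar::Input(result)]]
        { std::cout << result << '\n'; }
    }
}
```
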
Consistency of the Fittest: Towards Dynamic Staleness Control for Edge Data Analytics

A critical challenge for data stream processing at the edge of the network is the consistency of the machine learning models in distributed worker nodes. Especially in the case of non-stationary streams, which exhibit a high degree of data set shift, mismanagement of models poses the risk of suboptimal accuracy due to staleness and ignored data. In this work, we analyze the model consistency challenges of a distributed online machine learning scenario and present preliminary solutions for synchronizing model updates. Additionally, we propose metrics for measuring the level and speed of data set shift.

Atakan Aral, Ivona Brandic
A Multi-level Elasticity Framework for Distributed Data Stream Processing

Data Stream Processing (DSP) applications should be able to efficiently process high-velocity continuous data streams by elastically scaling the parallelism degree of their operators, so as to deal with high variability in the workload. Moreover, to use computing resources efficiently, modern DSP frameworks should seamlessly support infrastructure elasticity, which lets them exploit resources available on demand in geo-distributed Cloud and Fog systems. In this paper we propose E2DF, a framework to autonomously control the multi-level elasticity of DSP applications and the underlying computing infrastructure. E2DF revolves around a hierarchical approach, with two control layers that work at different granularities and time scales. At the lower level, fully decentralized Operator and Region managers control the reconfiguration of distributed DSP operators and resources. At the higher level, centralized managers oversee the overall application and infrastructure adaptation. We have integrated the proposed solution into Apache Storm, relying on a previous extension we developed, and conducted an experimental evaluation. It shows that, even with simple control policies, E2DF can improve resource utilization without degrading application performance.

Matteo Nardelli, Gabriele Russo Russo, Valeria Cardellini, Francesco Lo Presti

CBDP - Workshop on Container-Based Systems for Big Data, Distributed and Parallel Computing

Frontmatter
A Resource Allocation Framework with Qualitative and Quantitative SLA Classes

This paper presents a new resource allocation framework based on SLA (Service Level Agreement) classes for cloud computing environments. Our framework is proposed in the context of containers, with two qualitative and two quantitative SLA classes to meet the needs of users. The two qualitative classes represent the satisfaction-time criterion and the reputation criterion. The two quantitative classes represent the criterion on the number of resources that must be allocated to execute a container and the redundancy (number of replicas) criterion. The novelty of our work lies in the ability to dynamically adapt the scheduling and resource allocation of containers according to the different qualitative and quantitative SLA classes and the activity peaks of the nodes in the cloud. This dynamic adaptation gives our framework the flexibility to efficiently schedule all submitted containers globally and to manage resource allocation on the fly. The key idea is to make the specification of resource demands less rigid and to let the system decide on the precise number of resources to allocate to a container. Our framework is implemented in C++ and evaluated using Docker containers on the Grid'5000 testbed. Experimental results show that our framework behaves as expected in our scenario and provides good performance regarding the balance between objectives.

Tarek Menouer, Christophe Cérin, Walid Saad, Xuanhua Shi
Automated Multi-Swarm Networking with Open Baton NFV MANO Framework

Container-based Network Functions Virtualization (NFV) and multi-site/multi-cluster service orchestration are critical topics in the field of ICT infrastructure, and academia, industry, and open-source projects are actively working on the technology. In line with these trends, Open Baton, an implementation of the ETSI NFV MANO Reference Architecture, has started efforts to orchestrate network services over multiple Docker Swarm clusters. To achieve this, Open Baton requires an additional feature to configure overlay networking across multiple Swarm clusters, since Docker Swarm does not support multi-cluster services. In this paper, we discuss our design and implementation of the Multi-Swarm Networking Helper in Open Baton, which configures an L2 overlay network over multiple Docker Swarm clusters by leveraging a third-party Docker networking driver.

Jun-Sik Shin, Mathias Santos de Brito, Thomas Magedanz, JongWon Kim
The Impact of the Storage Tier: A Baseline Performance Analysis of Containerized DBMS

Containers have emerged as a cloud resource offering. While the advantages of containers, such as easing application deployment, orchestration, and adaptation, work well for stateless applications, the feasibility of containerizing stateful applications, such as database management systems (DBMSs), remains unclear due to potential performance overhead. The myriad of container operation models and storage backends further raises the complexity of operating a containerized DBMS. Here, we present an extensible evaluation methodology to identify the performance overhead of a containerized DBMS by combining three operational models and two storage backends. For each combination, a memory-bound and a disk-bound workload is applied. The results show a clear performance overhead for containerized DBMSs on top of virtual machines (VMs) compared to physical resources. Further, a containerized DBMS on top of VMs with different storage backends incurs a tolerable performance overhead. Building upon these baseline results, we derive a set of open evaluation challenges for containerized DBMSs.

Daniel Seybold, Christopher B. Hauser, Georg Eisenhart, Simon Volpert, Jörg Domaschka
Towards Vertically Scalable Spark Applications

The dynamic provisioning of virtual machines (VMs) supported by many cloud computing infrastructures eases the scalability of software applications. Unfortunately, VMs are relatively slow to boot, and public cloud providers do not allow users to vary their resources (vertical scalability) dynamically. To tackle both problems, a few years ago we presented a solution that combines the management of VMs with the use of containers, specifically targeted at the efficient runtime management of the resources provisioned to Web applications. This paper borrows from that solution and addresses the problem of provisioning resources to big data (Spark) applications at runtime. Spark does not allow runtime scaling of the resources associated with its executors; resources must be provisioned statically. To tackle this problem, the paper describes a container-based version of Spark that supports the dynamic resizing of the memory and CPU cores associated with the different executors. The evaluation demonstrates the feasibility of the approach and identifies the trade-offs involved.

Luciano Baresi, Giovanni Quattrocchi

COLOC - Workshop on Data Locality

Frontmatter
Progress Thread Placement for Overlapping MPI Non-blocking Collectives Using Simultaneous Multi-threading

Non-blocking collectives have been proposed to allow communications to be overlapped with computation, in order to amortize the cost of MPI collective operations. To obtain a good overlap ratio, communications and computation have to run in parallel, and different hardware and software techniques exist to achieve this. Dedicating some cores to run progress threads is one of them. However, some CPUs provide Simultaneous Multi-Threading, the ability of a core to run multiple hardware threads simultaneously, sharing the same arithmetic units. Our idea is to use these hardware threads to run progress threads, avoiding the allocation of dedicated cores. We ran benchmarks on Haswell processors, using their Hyper-Threading capability, and obtained good results for both performance and overlap, but only when MPI processes use inter-node communications. We also show that enabling Simultaneous Multi-Threading for intra-node communications leads to poor performance due to cache effects.

Alexandre Denis, Julien Jaeger, Hugo Taboada
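
For readers who have not used non-blocking collectives, the overlap pattern under discussion looks as follows. This is a minimal sketch using the standard MPI-3 API; the computation is a placeholder, and whether the reduction actually progresses during compute() depends on the MPI library's progress engine, which is exactly the issue the progress threads address.

```cpp
#include <mpi.h>
#include <vector>

// Independent work used to hide the collective's latency.
void compute() { /* application kernel that does not touch the buffers */ }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    std::vector<double> in(1 << 20, 1.0), out(1 << 20);
    MPI_Request req;
    MPI_Iallreduce(in.data(), out.data(), static_cast<int>(in.size()),
                   MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);
    compute();                          // overlapped with the reduction
    MPI_Wait(&req, MPI_STATUS_IGNORE);  // the collective completes here
    MPI_Finalize();
}
```
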
A Methodology for Handling Data Movements by Anticipation: Position Paper

The enhanced capabilities of large-scale parallel and distributed platforms produce a continuously increasing amount of data, which has to be stored, exchanged, and used by various tasks allocated on different nodes of the system. The management of such a huge communication demand is crucial for reaching the best possible performance of the system. Meanwhile, we have to deal with more interference, as the trend is to use a single all-purpose interconnection network, whatever the interconnect (tree-based hierarchies or topology-based heterarchies). There are two different types of communications: the flows induced by data exchanges during the computations, and the flows related to Input/Output operations. We propose in this paper a general model for interference-aware scheduling, where explicit communications are replaced by external topological constraints. Specifically, the interference between both communication types is reduced by adding geometric constraints on the allocation of tasks onto machines. The proposed constraints implicitly reduce data movements by restricting the set of possible allocations for each task. This methodology proved efficient in a recent study for a restricted interconnection network (a line/ring of processors, which is intermediate between a tree and higher-dimensional grids/tori). The results obtained there illustrated the difficulty of the problem even on simple topologies, but also provided a pragmatic greedy solution, which simulations showed to be efficient. We are currently extending this solution to more complex topologies. This is a position paper describing the methodology; it does not focus on the solving part.

Raphaël Bleuse, Giorgio Lucarelli, Denis Trystram
Scalable Work-Stealing Load-Balancer for HPC Distributed Memory Systems

Work-stealing schedulers are common in shared memory environments. At large scale on distributed memory, however, their use has been limited to specific ad-hoc implementations, preventing broader adoption. In this paper we introduce a new scalable work-stealing algorithm for distributed memory systems, as well as our implementation as the TITUS_DLB library. It is based on Kleinberg's small-world graph, which allows control of the communication patterns and associated runtime overheads while providing efficient heuristics for victim selection and result routing. To validate our approach, we present the DLB_Bench benchmark, which emulates arbitrary workload distribution and imbalance characteristics. Finally, we compare TITUS_DLB to the ad-hoc solution developed for the YALES2 computational fluid dynamics and combustion solver, achieving up to a 54% performance gain over thousands of cores.

Clement Fontenaille, Eric Petit, Pablo de Oliveira Castro, Seijilo Uemura, Devan Sohier, Piotr Lesnicki, Ghislain Lartigue, Vincent Moureau
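
As a flavor of how a small-world topology can drive victim selection, the sketch below picks a steal target on a ring of ranks with probability decaying as 1/d in the distance d, as in Kleinberg's model. This is an illustration of the general idea only; TITUS_DLB's actual selection and routing heuristics are more elaborate.

```cpp
#include <random>
#include <vector>

// Pick a victim rank at ring distance d with P(d) proportional to 1/d.
int pick_victim(int self, int nranks, std::mt19937& rng) {
    std::vector<double> weights(nranks - 1);
    for (int d = 1; d < nranks; ++d) weights[d - 1] = 1.0 / d;
    std::discrete_distribution<int> dist(weights.begin(), weights.end());
    int d = dist(rng) + 1;
    // Steal clockwise or counter-clockwise with equal probability.
    int sign = std::bernoulli_distribution(0.5)(rng) ? 1 : -1;
    return ((self + sign * d) % nranks + nranks) % nranks;
}
```
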
NUMAPROF, A NUMA Memory Profiler

The number of cores in HPC systems and servers has increased considerably over the last few years. In order to also increase the available memory bandwidth and capacity, most systems have become NUMA (Non-Uniform Memory Access), meaning each processor has its own memory and can share it with the others. Although access to remote memory is transparent for the developer, it comes with lower bandwidth and higher latency, and it might heavily impact the performance of the application if it happens too often. Handling this memory locality in multi-threaded applications is a challenging task. In order to help the developer, we developed NUMAPROF, a memory profiling tool pinpointing local and remote memory accesses onto the source code, with the same approach as MALT, a memory allocation profiling tool. The paper offers a full review of the capabilities of NUMAPROF on mainstream HPC workloads. In addition to the dedicated interface, the tool also provides hints about unpinned memory accesses (unpinned thread or unpinned page), which can help the developer find portions of code that do not safely handle NUMA binding. The tool also provides dedicated metrics to track accesses to the MCDRAM of the Intel Xeon Phi codenamed Knights Landing. To operate, the tool instruments the application using Pin, a binary instrumentation framework from Intel. NUMAPROF also has the particularity of using the OS memory mapping without relying on hardware counters or OS simulation. This permits understanding what really happens on the system without requiring dedicated hardware support.

Sébastien Valat, Othman Bouizi
ASPEN: An Efficient Algorithm for Data Redistribution Between Producer and Consumer Grids

HPC applications and libraries frequently move parallel data from one distribution scheme to another for performance reasons. Recently, a resurgence of interest in this data redistribution problem has emerged due to the need to relocate data distributed across one Producer grid onto a different distribution scheme across a Consumer grid. In this paper, we study efficient algorithms for performing redistribution and show how the best methods from the literature still depend on the number of processors in both grids. We describe a new algorithm, ASPEN, that exploits more of the cyclic patterns and relations in the distribution, does not depend on the total number of processors, and is thus well suited for use in workflow management systems. We describe a preliminary implementation of the algorithm within such a workflow system and show performance results that indicate a significant performance benefit in data redistribution generation.

Clément Foyer, Adrian Tate, Simon McIntosh-Smith
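
To make the problem concrete: under a block-cyclic(b, P) layout, element i lives on rank (i / b) mod P, and a Producer-to-Consumer redistribution pattern repeats with period lcm(bp·P, bc·Q). Cyclic-pattern algorithms such as ASPEN exploit this periodicity instead of enumerating every element; the helpers below are only a baseline sketch of the index mapping, not the paper's algorithm.

```cpp
#include <cstdint>
#include <numeric>  // std::lcm (C++17)

// Owner of element i under a block-cyclic distribution (block size b, P ranks).
int64_t owner(int64_t i, int64_t b, int64_t P) { return (i / b) % P; }

// Length after which the producer/consumer ownership pattern repeats.
int64_t pattern_period(int64_t bp, int64_t P, int64_t bc, int64_t Q) {
    return std::lcm(bp * P, bc * Q);
}
```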

Euro-EDUPAR - Workshop on Parallel and Distributed Computing Education for Undergraduate Students

Frontmatter
Getting Started with CAPI SNAP: Hardware Development for Software Engineers

To ease the development of FPGA-based accelerator function units for software engineers, the OpenPOWER Accelerator Work Group has recently introduced the CAPI Storage, Network, and Analytics Programming (SNAP) framework. However, we found that software engineers are still overwhelmed by many aspects of this novel hardware development framework. This paper provides background and instructions for mastering the first steps of hardware development using the CAPI SNAP framework. The insights reported here are based on the experiences of software engineering students with little to no prior knowledge of hardware development.

Lukas Wenzel, Robert Schmid, Balthasar Martin, Max Plauth, Felix Eberhardt, Andreas Polze
Studying the Structure of Parallel Algorithms as a Key Element of High-Performance Computing Education

Since the computing world has become fully parallel, every software developer today should be familiar with the notion of “parallel algorithm structure.” Where in recent years students studied only a basic introduction to algorithms, today parallel algorithm structure must become a vital part of computer science education. In this work we present two years of experience teaching a “Supercomputer Modeling and Technologies” course and running practical assignments at the Computational Mathematics and Cybernetics faculty of Lomonosov Moscow State University, aimed at teaching students a methodology for analyzing parallel algorithm properties.

Vladimir Voevodin, Alexander Antonov, Nina Popova
From Mathematical Model to Parallel Execution to Performance Improvement: Introducing Students to a Workflow for Scientific Computing

Current courses in parallel and distributed computing (PDC) often focus on programming models and techniques. However, PDC is embedded in a scientific workflow that requires more than programming skills: the workflow spans from mathematical modeling to programming, data interpretation, and performance analysis. The last task in particular is insufficiently covered in educational courses. Often, scientists from different fields of knowledge, each with individual expertise, collaborate to perform these tasks. In this work, we present the general design and implementation of an exercise within the course “Supercomputers and their programming” at Technische Universität Dresden, Faculty of Computer Science. In the exercise, the students pass through a complete workflow for scientific computing and gain or improve their knowledge of: (i) mathematical modeling of systems, (ii) transferring the mathematical model to a (parallel) program, (iii) visualization and interpretation of experiment results, and (iv) performance analysis and improvement. The exercise aims precisely at bridging the gap between the individual tasks of a scientific workflow and equipping students with broad knowledge.

Franziska Kasielke, Ronny Tschüter
Integrating Parallel Computing in the Curriculum of the University Politehnica of Bucharest

The continuous shift of hardware computing architectures from single- to many-core processors, as well as the blurring of the hardware-software interface, has made the introduction of parallel and distributed computing topics into the undergraduate curriculum an essential requirement for any quality computer science program. The University Politehnica of Bucharest offers a unique approach, employing a heterogeneous hardware and software teaching and computing infrastructure, to its over 450 students enrolled in undergraduate studies of Computer Science and Electrical Engineering. In this study we present two of the most important lectures covering PDC topics at UPB.

Mihai Carabaş, Adriana Drăghici, Grigore Lupescu, Cosmin-Gabriel Samoilă, Emil-Ioan Sluşanschi

F2C-DP - Workshop on Fog-to-Cloud Distributed Processing

Frontmatter
Benefits of a Fog-to-Cloud Approach in Proximity Marketing

The EC H2020 mF2C Project is working on the development of a software framework that enables the orchestration of resources and communication at the fog level, as an extension of cloud computing that interacts with the IoT. To showcase the project's functionalities and added value, three real-world use cases have been chosen. This paper introduces one of them, the Smart Fog Hub Service (SFHS), set in the context of an airport, with the objective of proving that the fog-to-cloud approach brings relevant benefits in terms of performance and optimization of resource usage, thus giving objective evidence of the impact of the mF2C framework.

Antonio Salis, Glauco Mancini, Roberto Bulla, Paolo Cocco, Daniele Lezzi, Francesc Lordan
Multi-tenant Pub/Sub Processing for Real-Time Data Streams

Devices and sensors generate streams of data across a diversity of locations and protocols. That data usually reaches a central platform that is used to store and process the streams. Processing can be done in real time, with transformations and enrichment happening on the fly, but it can also happen after data is stored and organized in repositories. In the former case, stream processing technologies are required to operate on the data; in the latter, batch analytics and queries are of common use. This paper introduces a runtime to dynamically construct data stream processing topologies based on user-supplied code. These dynamic topologies are built on the fly using a data subscription model defined by the applications that consume data. Each user-defined processing unit is called a Service Object. Every Service Object consumes input data streams and may produce output streams that others can consume. The subscription-based programming model enables multiple users to deploy their own data-processing services. The runtime performs the dynamic forwarding of data and the execution of Service Objects from different users. Data streams can originate in real-world devices or be the outputs of Service Objects. The runtime leverages Apache Storm for parallel data processing, which, combined with dynamic user-code injection, provides multi-tenant stream processing topologies. In this work we describe the runtime, its features, and implementation details, and include a performance evaluation of some of its core components.

Álvaro Villalba, David Carrera
A Review of Mobility Prediction Models Applied in Cloud/Fog Environments

Cloud and Fog Computing are two emerging technologies that have been used in various fields of application. On the one hand, Cloud Computing suffers from high latency, which is especially problematic when an application requires a rapid response at the network edge. On the other hand, Fog Computing distributes computational data processing tasks to the edge network to reduce latency, but it still faces challenges, especially in supporting mobile users. This work presents a review of works in Cloud/Fog Computing that use mobility prediction techniques to deal with the user mobility problem. Additionally, we discuss the potential of applying these techniques in Cloud/Fog environments.

David H. S. Lima, Andre L. L. Aquino, Marilia Curado
An Architecture for Resource Management in a Fog-to-Cloud Framework

Fog-to-cloud (F2C) platforms provide an excellent framework for efficient resource management in the context of smart cities. In such a scenario, a vast number of heterogeneous resources, including computing devices and IoT sensors, are coordinated to provide the best facilities. One of the most critical and challenging tasks in this framework is appropriately managing the set of resources available in the smart city: many devices with different features must be efficiently classified, organized, and selected to fulfill the requirements during service execution. In this paper, we present the design of an architecture for resource management as a core module of an F2C system. In this architecture, we classify both the system resources and the services and, based on the users' preferences and sharing policies, we discuss the process of resource selection according to a predefined cost model. The cost model can consider any cost dimension, such as performance, energy consumption, or any eventual business model associated with the F2C system.

Souvik Sengupta, Jordi Garcia, Xavi Masip-Bruin
Enhancing Service Management Systems with Machine Learning in Fog-to-Cloud Networks

With fog-to-cloud hybrid computing systems emerging as a promising networking architecture, particularly interesting for IoT scenarios, there is increasing interest in exploring and developing new technologies and solutions to achieve high performance in these systems. One such solution is the implementation of machine learning algorithms. Even without a defined and standardized way of using machine learning in fog-to-cloud systems, it is clear that the autonomous decision-making capabilities of machine learning would enrich both fog computing and cloud computing network nodes. In this paper, we propose a service management system specially designed to work in fog-to-cloud architectures, followed by a proposal on how to implement it with different machine learning solutions. We first give a global overview of the service management system's functionality, with the current design of each of its integral components, and finally we show the first results obtained with a machine learning algorithm for the component in charge of traffic prediction.

Jasenka Dizdarević, Francisco Carpio, Mounir Bensalem, Admela Jukan
A Knowledge-Based IoT Security Checker

The widespread diffusion of ubiquitous and smart devices is radically changing the environment surrounding the users and has led to the definition of a new ecosystem called the Internet of Things (IoT). Users are connected anywhere, anytime, and can continuously monitor and interact with the external environment. While devices are becoming more and more powerful and efficient (e.g., using protocols like ZigBee, LTE, 5G), their security is still in its infancy. Such devices, as well as the edge network providing connectivity, become the target of security attacks without their owners being aware of the risks they are exposed to. In this paper we present the IoT Security Checker, a solution for IoT security assessment that copes with the most relevant IoT security issues. We also provide some preliminary analysis showing how the IoT Security Checker can be used to verify the security of an IoT system.

Marco Anisetti, Rasool Asal, Claudio Agostino Ardagna, Lorenzo Comi, Ernesto Damiani, Filippo Gaudenzi
MAD-C: Multi-stage Approximate Distributed Cluster-Combining for Obstacle Detection and Localization

Efficient distributed multi-sensor monitoring is a key feature of upcoming digitalized infrastructures. We address the problem of obstacle detection, having as input multiple point clouds from a set of laser-based distance sensors; the latter generate high-rate data and can rapidly exhaust baseline analysis methods that gather and cluster all the data. We propose MAD-C, a distributed approximate method: it can build on any appropriate clustering to process disjoint subsets of the data in a distributed fashion; MAD-C then distills each resulting cluster into a data summary. The summaries, computable in a continuous way, in constant time and space, are combined, in an order-insensitive, concurrent fashion, to produce approximate volumetric representations of the objects. MAD-C leads to (i) communication savings proportional to the number of points, (ii) a multiplicative decrease in the dominating component of the processing complexity and, at the same time, (iii) high accuracy (with a Rand index > 0.95) in comparison to its baseline counterpart. We also propose MAD-C-ext, which builds on MAD-C's output by further combining the original data points, improving the outcome granularity with the same asymptotic processing savings as MAD-C.

Amir Keramatian, Vincenzo Gulisano, Marina Papatriantafilou, Philippas Tsigas, Yiannis Nikolakopoulos
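
The key property of the data summaries is that combining them is associative and commutative, so partial results can be merged concurrently in any order. The sketch below illustrates one possible constant-size summary (an axis-aligned bounding box plus a point count); the fields are an assumption for illustration, not the paper's exact representation.

```cpp
#include <algorithm>
#include <array>

struct Summary {
    std::array<float, 3> lo{ 1e30f,  1e30f,  1e30f};
    std::array<float, 3> hi{-1e30f, -1e30f, -1e30f};
    long count = 0;

    void add(const std::array<float, 3>& p) {   // O(1) per point
        for (int k = 0; k < 3; ++k) {
            lo[k] = std::min(lo[k], p[k]);
            hi[k] = std::max(hi[k], p[k]);
        }
        ++count;
    }
    // Order-insensitive combine: merge(a, b) == merge(b, a).
    void merge(const Summary& o) {
        for (int k = 0; k < 3; ++k) {
            lo[k] = std::min(lo[k], o.lo[k]);
            hi[k] = std::max(hi[k], o.hi[k]);
        }
        count += o.count;
    }
};
```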

FPDAPP - Workshop on Future Perspective of Decentralised Applications

Frontmatter
A Suite of Tools for the Forensic Analysis of Bitcoin Transactions: Preliminary Report

Crypto-currencies are nowadays widely known and used by more and more users, principally as a means of investment and payment, outside the restricted circle of technologists and computer scientists. However, like fiat money, they can also be used for illegal activities, exploiting their pseudo-anonymity and the ease and speed with which capital can be moved. The aim of the suite of tools we propose in this paper is to better analyse and understand money flows in the Bitcoin blockchain, e.g., by clustering addresses, scraping them from the Web, identifying mixing services, and visualising all such information for forensic scientists.

Stefano Bistarelli, Ivan Mercanti, Francesco Santini
On and Off-Blockchain Enforcement of Smart Contracts

Emerging blockchain technology is a promising platform for implementing smart contracts. But there is a large class of applications where blockchain is inadequate due to performance, scalability, and consistency requirements, and also due to language expressiveness and cost issues that are hard to solve. In this paper we explain that in some situations a centralised approach that does not rely on blockchain is a better alternative due to its simplicity, scalability, and performance. We suggest that in applications where decentralisation and transparency are essential, developers can advantageously combine the two approaches into hybrid solutions, where some operations are enforced by enforcers deployed on blockchains and the rest by enforcers deployed on trusted third parties.

Carlos Molina-Jimenez, Ellis Solaiman, Ioannis Sfyrakis, Irene Ng, Jon Crowcroft
MaRSChain: Framework for a Fair Manuscript Review System Based on Permissioned Blockchain

Current manuscript review systems (for conferences and journals) rely on centralized services (like EasyChair, iChair, HotCRP, or EDAS), which manage the whole process, from manuscript submission to notification of the results. As these review systems are centralized, trust rests on a single entity, and the fairness of the system hinges on the honesty of the central controlling authority. This dependency can be avoided by decentralizing the source of trust. Bitcoin has shown the power of decentralization and of a shared database through blockchain technology, which is currently being studied for its immense impact on FinTech. We leverage blockchain to address the above concern and present a decentralized manuscript review system, MaRSChain, that provides trust and fairness. As a proof of concept, we develop a prototype of the MaRSChain system on top of the Hyperledger Fabric platform. To the best of our knowledge, this is the first decentralized manuscript review system based on blockchain.

Nitesh Emmadi, Lakshmi Padmaja Maddali, Sumanta Sarkar
Tamper-Proof Volume Tracking in Supply Chains with Smart Contracts

Complex supply chains involve many different stakeholders such as producers, traders, manufacturers, and consumers. These entities comprise companies and other stakeholders spanning different countries or continents. Depending on the goods involved, their origin and responsible harvesting are essential. Due to their high complexity, these systems enable the introduction of resources with forged identities, tricking the participants of the supply chain into believing that they acquire goods with specific properties, e.g., environmentally friendly wood or resources that are not the result of child labor. We derive requirements from the global trade in timber and timber-based products, in which the origin of a large portion of certified wood cannot be verified. A set of smart contracts deployed on the Ethereum platform allows for a transparent supply chain with validated sources. The platform enables the tracking of variations of the original good, tracing not only the raw material but also the resulting products. The proposed solution introduces a novel exchange contract and ensures a correct overall volume of assets managed in the supply chain.

Ulrich Gallersdörfer, Florian Matthes
A Blockchain Based System to Ensure Transparency and Reliability in Food Supply Chain

We propose a blockchain-oriented platform to securely store origin and provenance data for food. By exploiting the distributed and immutable nature of the blockchain, the proposed system ensures supply chain transparency, with a view to supporting local regions by promoting smart food tourism and boosting the local economy. Thanks to decentralized application platforms that enable the development of smart contracts, we define and implement a system that works inside the blockchain and guarantees transparency and reliability to all actors of the food supply chain. Food, in fact, is the most direct way to get in touch with a place: tourist activities related to wine and food consumption and sale influence the choice of a destination and may encourage the purchase of typical food even once tourists are back home in their country of origin. Tourist destinations must therefore be equipped with innovative tools that, in a context of Smart Tourism, guarantee the originality of the products and their traceability.

Gavina Baralla, Simona Ibba, Michele Marchesi, Roberto Tonelli, Sebastiano Missineo
Selecting Effective Blockchain Solutions

Distributed ledger technologies (DLT) are becoming increasingly popular and are seen as a panacea for a wide range of applications. However, it is clear that many organisations, and even engineers, are selecting DLT solutions without fully understanding their power or limitations. Those who assess that blockchain is the best solution are given little guidance on the vast array of blockchain types (permissioned, permissionless, or federated), on which consensus algorithm to use, and on a range of other considerations. This paper aims to address this gap.

Carsten Maple, Jack Jackson

HeteroPar - Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms

Frontmatter
Evaluation Through Realistic Simulations of File Replication Strategies for Large Heterogeneous Distributed Systems

File replication is widely used to reduce file transfer times and improve data availability in large distributed systems. Replication techniques are often evaluated through simulations; however, most simulation platform models are oversimplified, which calls into question the applicability of the findings to real systems. In this paper, we investigate how platform models influence the performance of file replication strategies on large heterogeneous distributed systems, based on common existing techniques such as prestaging and dynamic replication. The novelty of our study resides in our evaluation using a realistic simulator. We consider two platform models: a simple hierarchical model and a detailed model built from execution traces. Our results show that conclusions depend on the modeling of the platform and its capacity to capture the characteristics of the targeted production infrastructure. We also derive recommendations for the implementation of an optimized data management strategy in a scientific gateway for medical image analysis.

Anchen Chai, Sorina Camarasu-Pop, Tristan Glatard, Hugues Benoit-Cattin, Frédéric Suter
Modeling and Optimizing Data Transfer in GPU-Accelerated Optical Coherence Tomography

Signal processing has become a bottleneck for using optical coherence tomography (OCT) in medical and industrial applications. Recently, GPUs have gained importance as compute devices for reaching a video frame rate of 25 frames/s. We therefore develop a CUDA implementation of an OCT signal processing chain, focusing on reformulating the signal processing algorithms in terms of high-performance libraries like CUBLAS and CUFFT. Additionally, we use NVIDIA's stream concept to overlap computations and data transfers. Performance results are presented for two Pascal GPUs and validated with a derived performance model, which estimates the overall execution time of the OCT signal processing chain, including compute and transfer times.

Tobias Schrödter, David Pallasch, Sandra Wienke, Robert Schmitt, Matthias S. Müller
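
The transfer/compute overlap mentioned above follows the standard CUDA double-buffering pattern sketched below: chunks alternate between two streams so that the copy queued on one stream overlaps the FFT running on the other. The cuFFT calls are real API, but the chunking and buffer layout are illustrative, not the paper's pipeline; host buffers must be pinned (cudaMallocHost) for the copies to be truly asynchronous.

```cpp
#include <cuda_runtime.h>
#include <cufft.h>

void process_chunks(cufftComplex* h_in,   // pinned host input
                    cufftComplex* d_buf,  // device buffer for two chunks
                    cufftHandle plan[2], cudaStream_t stream[2],
                    int chunks, int n) {
    for (int c = 0; c < chunks; ++c) {
        int s = c % 2;  // alternate between the two streams
        cufftSetStream(plan[s], stream[s]);
        cudaMemcpyAsync(d_buf + s * n, h_in + c * n,
                        n * sizeof(cufftComplex),
                        cudaMemcpyHostToDevice, stream[s]);
        // This FFT overlaps with the copy queued on the other stream.
        cufftExecC2C(plan[s], d_buf + s * n, d_buf + s * n, CUFFT_FORWARD);
    }
    cudaDeviceSynchronize();
}
```
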
A Modular Precision Format for Decoupling Arithmetic Format and Storage Format

In this work, we propose to decouple the arithmetic format from the storage format in numerical algorithms. We complement this idea with a modular precision storage layout that allows runtime precision adaptation such that a value can be accessed faster if lower accuracy is acceptable. Combined with precision-aware numerical algorithms that use full precision in all arithmetic computations, this strategy can result in runtime savings without impacting the memory footprint or the accuracy of the final result. In an experimental analysis using the adaptive precision Jacobi method we assess the benefits of the modular precision format on a recent high-end GPU architecture.

Thomas Grützmacher, Hartwig Anzt
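
The decoupling idea can be pictured with a head/tail split of an IEEE-754 double: the head segment (sign, exponent, and the top mantissa bits) is enough for a truncated read at half the memory traffic, while the full value is reconstructed when accuracy matters. The concrete two-segment layout below is an assumption for illustration; arithmetic would always use the full double, as the paper prescribes.

```cpp
#include <cstdint>
#include <cstring>

struct Split { uint32_t head, tail; };  // two separately storable segments

Split store(double v) {
    uint64_t bits; std::memcpy(&bits, &v, sizeof bits);
    return { static_cast<uint32_t>(bits >> 32),    // sign+exp+top mantissa
             static_cast<uint32_t>(bits) };        // low mantissa bits
}

double load_full(const Split& s) {  // exact reconstruction
    uint64_t bits = (static_cast<uint64_t>(s.head) << 32) | s.tail;
    double v; std::memcpy(&v, &bits, sizeof v);
    return v;
}

double load_head(const Split& s) {  // faster, lower-accuracy read
    uint64_t bits = static_cast<uint64_t>(s.head) << 32;
    double v; std::memcpy(&v, &bits, sizeof v);
    return v;
}
```
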
Benchmarking the NVIDIA V100 GPU and Tensor Cores

The V100 GPU is the newest server-grade GPU produced by NVIDIA and introduces a number of new hardware and API features. This paper details the results of benchmarking the V100 GPU and demonstrates that it is a significant generational improvement, increasing memory bandwidth, cache bandwidth, and reducing latency. A major new addition is the Tensor core units, which have been marketed as deep learning acceleration features that enable the computation of a 4×4×4 half precision matrix-multiply-accumulate operation in a single clock cycle. This paper confirms that the Tensor cores offer considerable performance gains for half precision general matrix multiplication; however, programming them requires fine control of the memory hierarchy that is typically unnecessary for other applications.

Matt Martineau, Patrick Atkinson, Simon McIntosh-Smith
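
For orientation, the easiest route to the Tensor cores from host code is cuBLAS rather than the low-level WMMA intrinsics. The call below uses the real cublasGemmEx signature (CUDA 9/10 era, matching the V100): FP16 inputs, FP32 accumulation, and the Tensor-op algorithm selector; sizes and leading dimensions are illustrative.

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// C = A * B with FP16 inputs and FP32 accumulation on the Tensor cores.
void gemm_fp16(cublasHandle_t h, const __half* A, const __half* B,
               float* C, int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(h, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha, A, CUDA_R_16F, m,   // lda = m (column-major)
                         B, CUDA_R_16F, k,   // ldb = k
                 &beta,  C, CUDA_R_32F, m,   // ldc = m
                 CUDA_R_32F,                 // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```
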
SiL: An Approach for Adjusting Applications to Heterogeneous Systems Under Perturbations

Scientific applications consist of large and computationally intensive loops. Dynamic loop scheduling (DLS) techniques are used to load balance the execution of such applications. Load imbalance can be caused by variations in loop iteration execution times due to problem, algorithmic, or systemic characteristics (collectively, perturbations). The following question motivates this work: “Given an application, a high-performance computing (HPC) system, and their characteristics and interplay, which DLS technique will achieve improved performance under unpredictable perturbations?” Existing work only considers perturbations caused by variations in the computational speeds delivered by the HPC system. However, perturbations in available network bandwidth or latency are inevitable on production HPC systems. Simulator in the loop (SiL) is introduced here as a new control-theoretic inspired approach to dynamically select DLS techniques that improve the performance of applications on heterogeneous HPC systems under perturbations. The present work examines the performance of six applications on a heterogeneous system under all of the above system perturbations. The SiL proof of concept is evaluated using simulation. The performance results confirm the initial hypothesis that no single DLS technique can deliver the best performance in all scenarios, whereas SiL-based DLS selection achieved improved application performance in most experiments.

Ali Mohammed, Florina M. Ciorba
Merging the Publish-Subscribe Pattern with the Shared Memory Paradigm

Heterogeneous distributed architectures require high-level abstractions to ease programmability and efficiently manage resources. Both the publish-subscribe and the shared memory models offer such an abstraction, but they are intended for different application contexts. In this paper we propose to merge these two models into a new one, which benefits from the rigorous cache coherence management of the shared memory model and from the publish-subscribe model's ability to cope with dynamic, large-scale environments. The publish-subscribe mechanisms have been implemented within a distributed shared memory system and tested using a heterogeneous micro-server.

Loïc Cudennec
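
The intuition behind the merged model can be conveyed by a toy shared cell whose writers publish updates and whose readers subscribe to notifications, which is roughly how coherence traffic and publish-subscribe events align. All names below are assumptions for illustration; the paper's system is distributed, not a single-process mutex-guarded object.

```cpp
#include <functional>
#include <mutex>
#include <vector>

template <typename T>
class SharedCell {
    T value_{};
    std::mutex m_;
    std::vector<std::function<void(const T&)>> subs_;
public:
    // Register a reader callback, analogous to a coherence sharer.
    void subscribe(std::function<void(const T&)> cb) {
        std::lock_guard<std::mutex> g(m_);
        subs_.push_back(std::move(cb));
    }
    // Write + notification, analogous to a coherence invalidation/update.
    void publish(const T& v) {
        std::lock_guard<std::mutex> g(m_);
        value_ = v;
        for (auto& cb : subs_) cb(value_);
    }
    T read() {
        std::lock_guard<std::mutex> g(m_);
        return value_;
    }
};
```
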
Towards Application-Centric Parallel Memories

Many applications running on parallel processors and accelerators are bandwidth bound. In this work, we explore the benefits of parallel (scratch-pad) memories to further accelerate such applications. To this end, we propose a comprehensive approach to designing and implementing application-centric parallel memories based on the polymorphic memory model PolyMem. Our approach enables the acceleration of a memory-bound region of an application by (1) analyzing the memory accesses to extract parallel accesses, (2) configuring PolyMem to deliver maximum speed-up for the detected accesses, and (3) building an actual FPGA-based parallel-memory accelerator for this region, with predictable performance. We validate our approach on 10 instances of Sparse-STREAM (a STREAM benchmark adaptation with sparse memory accesses), for which we design and benchmark the corresponding parallel-memory accelerators in hardware. Our results demonstrate that building parallel-memory accelerators is feasible and leads to performance gains, but their efficient integration in heterogeneous platforms remains a challenge.

Giulio Stramondo, Cătălin Bogdan Ciobanu, Ana Lucia Varbanescu, Cees de Laat
Fast Heuristic-Based GPU Compiler Sequence Specialization

Iterative compilation focused on specialized phase orders (i.e., custom selections of compiler passes and orderings for each program or function) can significantly improve the performance of compiled code. However, phase ordering specialization typically needs to deal with a large solution space. A previous approach, evaluated by targeting an x86 CPU, mitigates this issue by first using a training phase on reference codes to produce a small set of high-quality reusable phase orders, which are then used to compile new codes, without any code analysis. In this paper, we evaluate the viability of using this approach to optimize the GPU execution performance of OpenCL kernels. In addition, we propose and evaluate a heuristic to further reduce the number of evaluated phase orders, by comparing the speedups of the resulting binaries with those of the training phase for each phase order; this information is used to predict which untested phase order is most likely to produce good results (e.g., the highest speedup). We performed our measurements using the PolyBench/GPU OpenCL benchmark suite on an NVIDIA Pascal GPU. Without the heuristic, we achieve a geomean execution speedup of 1.64×, using cross-validation, with 5 non-standard phase orders. With the heuristic, we achieve the same speedup with only 3 non-standard phase orders. This is close to the geomean speedup achieved in our iterative compilation experiments exploring thousands of phase orders. Given the significant reduction in exploration time and other advantages of this approach, we believe it is suitable for a wide range of compiler users concerned with performance.

Ricardo Nobre, Luís Reis, João M. P. Cardoso
Accelerating Online Change-Point Detection Algorithm Using 10 GbE FPGA NIC

In statistical analysis and data mining, change-point detection, which identifies the times when the probability distribution of a time series changes, has been used for various purposes, such as anomaly detection on network traffic and transaction data. However, the computational cost of a conventional AR (auto-regression) model based approach is too high to be feasible online. In this paper, an AR model based online change-point detection algorithm, called ChangeFinder, is implemented on an FPGA (Field Programmable Gate Array) based NIC (Network Interface Card). The proposed system computes the change-point score from time series data received over 10 GbE (10 Gbit Ethernet); more specifically, it computes the score at the 10 GbE NIC ahead of host applications. This paper aims to reduce the host workload and improve change-point detection performance by offloading the ChangeFinder algorithm from the host to the NIC. In our evaluation, change-point detection in the FPGA NIC is compared, in terms of throughput, with a baseline software implementation and with versions enhanced by two network optimization techniques, DPDK and Netfilter. The result demonstrates a 16.8x improvement in change-point detection throughput compared to the baseline software implementation, reaching 83.4% of the 10 GbE line rate.

Takuma Iwata, Kohei Nakamura, Yuta Tokusashi, Hiroki Matsutani
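
To give a flavor of the AR-based scoring being offloaded, the sketch below implements a heavily simplified, sequentially discounting AR(1) scorer: the score is the squared prediction error normalized by a running variance, with all moments updated under a discounting factor r. The actual ChangeFinder layers two such stages with smoothing in between, so treat this purely as an illustration.

```cpp
struct OnlineAR1 {
    double r;                  // discounting factor, e.g. 0.02
    double mean = 0.0, var = 1.0, cov = 0.0, prev = 0.0;
    bool warm = false;

    double score(double x) {
        // AR(1) prediction from the previous sample.
        double pred = warm ? mean + (cov / var) * (prev - mean) : x;
        double err  = x - pred;
        // Sequentially discounted moment updates.
        mean = (1 - r) * mean + r * x;
        var  = (1 - r) * var  + r * (x - mean) * (x - mean);
        cov  = (1 - r) * cov  + r * (x - mean) * (prev - mean);
        prev = x;
        warm = true;
        return (err * err) / var;  // large value => candidate change-point
    }
};
```
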
OS-ELM-FPGA: An FPGA-Based Online Sequential Unsupervised Anomaly Detector

Autoencoder, a neural-network based dimensionality reduction algorithm, has demonstrated its effectiveness in anomaly detection: it can detect whether an input sample is normal or abnormal after training only on normal data. In general, Autoencoders are built on backpropagation-based neural networks (BP-NNs). When BP-NNs are implemented in edge devices, they are typically specialized only for prediction, with weight matrices precomputed offline due to the high computational cost; however, such devices cannot immediately adapt to time-series trend changes in the input data. In this paper, we propose an FPGA-based unsupervised anomaly detector, called OS-ELM-FPGA, that combines an Autoencoder with the online sequential learning algorithm OS-ELM. Based on our theoretical analysis of the algorithm, the proposed OS-ELM-FPGA completely eliminates matrix pseudoinversions while improving the learning throughput. Simulation results using open-source datasets show that OS-ELM-FPGA achieves favorable anomaly detection accuracy compared to CPU and GPU implementations of BP-NNs. The learning throughput of OS-ELM-FPGA is 3.47x to 27.99x and 5.22x to 78.06x higher than that of CPU and GPU implementations of OS-ELM, respectively, and 3.62x to 36.15x and 1.53x to 43.44x higher than that of CPU and GPU implementations of BP-NNs.

Mineto Tsukada, Masaaki Kondo, Hiroki Matsutani
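
For context, the standard OS-ELM recursion (Liang et al.) updates the output weights β and an auxiliary matrix P for each new data chunk with hidden-layer output H and targets T:

```latex
P_{k+1} = P_k - P_k H_{k+1}^{\top}
          \left( I + H_{k+1} P_k H_{k+1}^{\top} \right)^{-1} H_{k+1} P_k,
\qquad
\beta_{k+1} = \beta_k + P_{k+1} H_{k+1}^{\top}
              \left( T_{k+1} - H_{k+1} \beta_k \right)
```

With a batch size of one sample, H_{k+1} is a single row, so the inverted matrix is 1×1 and the update needs only a scalar division; this is plausibly the kind of simplification behind the paper's claim of eliminating matrix pseudoinversions in hardware.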

LSDVE - Workshop on Large Scale Distributed Virtual Environments

Frontmatter
The Drivers Behind Blockchain Adoption: The Rationality of Irrational Choices

There has been a huge increase in interest in blockchain technology, yet little is known about the drivers behind its adoption. In this paper we identify and analyze these drivers, using three real-world, representative scenarios. Our analysis confirms that blockchain is not an appropriate technology for some scenarios from a purely technical point of view, so the choice of blockchain in such scenarios may seem irrational. However, our analysis reveals that non-technical drivers are at play, such as philosophical beliefs, network effects, and economic incentives, and these non-technical drivers may explain the rationality behind the choice to adopt blockchain.

Tommy Koens, Erik Poll
Field Experiment on the Performance of an Android-Based Opportunistic Network

Android smartphones are ubiquitously available, mobile, and offer sophisticated communication capabilities. With Opportunistic Networks, we can use the wireless connectivity of smartphones and other smart devices to relay messages in a store-carry-forward fashion from one node to another, enabling novel data-oriented applications. Such networks can serve high-bandwidth local data transfers, cases with low or no connectivity, such as developing countries or remote areas, and cases where communication should not leave any traces. In recent years, we developed an Android application for Opportunistic Networking, named opptain, that can be deployed on off-the-shelf unrooted smartphones and smart devices, making it possible to harness this idea by simply installing an app. As the quality of such networks is essential, we implemented a test framework for Android-based opportunistic networks to run tests and aggregate results automatically. In this paper, we present the evaluation results of a field experiment we conducted with the opptain application, in which we used 26 devices to evaluate the outcome of typical use cases. The tests show that the expected quality is reached and that performance is robust across various applications. Overall, opptain, the testing environment, and the results themselves are promising: even in an office scenario, where interference is more common than in other settings, we achieved encouraging results.

Andre Ippisch, Philipp Brühn, Kalman Graffi
Distributed Computation of Mobility Patterns in a Smart City Environment

This paper copes with the issue of extracting mobility patterns in an urban computing scenario. The computation is parallelized by partitioning the territory into a number of regions. In each region, a computing node collects data from a set of local sensors, analyzes the data, and coordinates with neighboring regions to extract the mobility patterns. We propose and analyze a “local” synchronization approach, where the computation for a specific region is performed using the information received from a subset of neighboring regions. Compared with the usual approach, where the computation proceeds after collecting the results from all the regions, our approach offers notable benefits: reduced computation time, real-time model extraction, and better support for local decisions. The paper describes the local synchronization model by means of a Petri net and analyzes its performance in terms of the system's ability to keep pace with the data collected by the sensors. The analysis is based on a real-world dataset tracing the movements of taxis in the urban area of Beijing.

Eugenio Cesario, Franco Cicirelli, Carlo Mastroianni
Exploiting Community Detection to Recommend Privacy Policies in Decentralized Online Social Networks

The usage of Online Social Networks (OSNs) has become a daily activity for billions of people who share their content and personal information with other users. Regardless of the platform used to provide the OSN services, this content sharing can expose users to a number of privacy risks if proper privacy-preserving mechanisms are not provided. Indeed, users must be able to define their own privacy policies, which the OSN exploits to regulate access to the shared content. To reduce such privacy risks, we propose a Privacy Policies Recommender System (PPRS) that assists users in defining their own privacy policies. Besides suggesting the most appropriate privacy policies to end users, the proposed system is able to exploit a certain set of properties (or attributes) of the users to define permissions on the shared content. Evaluation results based on a real OSN dataset show that our approach classifies users with higher accuracy, recommending specific privacy policies for different communities of a user's friends.

Andrea De Salve, Barbara Guidi, Andrea Michienzi
ComeHere: Exploiting Ethereum for Secure Sharing of Health-Care Data

The problem of protecting sensitive data, like medical records, while enabling access only to authorized entities is currently a challenge. Current solutions often require trusting some centralized entity in charge of managing the data. The disruptive technology of blockchains offers the possibility to change this scenario and give users control over their personal data. In this paper we propose ComeHere, a system able to store medical records and to exploit blockchain technology to control and track access-right transfers on the blockchain. The paper shows the current status of the project, presents a preliminary proof-of-concept implementation, and discusses future improvements of the system as well as some critical issues that are still open.

Matteo Franceschi, Davide Morelli, David Plans, Alan Brown, John Collomosse, Louise Coutts, Laura Ricci

Med-HPC - Workshop on Advances in High-Performance Bioinformatics, Systems Biology

Frontmatter
BaaS - Bioinformatics as a Service

Genomics and related technologies, collectively known as omics, have transformed life sciences research. These technologies produce mountains of data that need to be managed and analysed. Rapid developments in next-generation sequencing technologies have helped genomics become mainstream, but the compute support systems meant to enable genomics have lagged behind. As genomics makes inroads into personalised health care and clinical settings, it is paramount that a robust compute infrastructure be designed to meet the growing needs of the field. Infrastructure design for omics datasets is an active area of research, and a critical one if omics is to be adopted in industrial healthcare and clinical settings. In this paper, we propose a blueprint for an as-a-service compute infrastructure for fast and scalable processing of omics datasets. We explain our approach with the help of a well-known bioinformatics workflow and a compute environment that can be tailored to achieve portability, reproducibility, and scalability using modern High Performance Computing systems.

Ritesh Krishna, Vadim Elisseev, Samuel Antao
Disaggregating Non-Volatile Memory for Throughput-Oriented Genomics Workloads

Massive exploitation of next-generation sequencing technologies requires dealing with both huge amounts of data and complex bioinformatics pipelines. Computing architectures have evolved to deal with these problems, enabling approaches that were unfeasible years ago: accelerators and Non-Volatile Memories (NVM) are becoming widely used to enhance the most demanding workloads. However, bioinformatics workloads are usually part of bigger pipelines with different and dynamic needs in terms of resources. The introduction of Software Defined Infrastructures (SDI) for data centers provides a basis for dramatically increasing the efficiency of infrastructure management. SDI enables new ways to structure hardware resources through disaggregation, and provides new hardware composability and sharing mechanisms to deploy workloads in more flexible ways. In this paper we study a state-of-the-art genomics application, SMUFIN, aiming to address the challenges of future HPC facilities.

Aaron Call, Jordà Polo, David Carrera, Francesc Guim, Sujoy Sen
GPU Accelerated Analysis of Treg-Teff Cross Regulation in Relapsing-Remitting Multiple Sclerosis

The computational analysis of complex biological systems can be hindered by two main factors. First, modeling the system so that it can be easily understood and analyzed by non-expert users is not always possible, especially when dealing with systems of Ordinary Differential Equations. Second, when the system is composed of hundreds or thousands of reactions and chemical species, classic CPU-based simulators may not be able to derive the behavior of the system efficiently. To overcome these limitations, in this paper we propose a novel approach that combines the descriptive power of Stochastic Symmetric Nets, a Petri Net formalism that allows the modeler to describe the system in a parametric and compact manner, with LASSIE, a GPU-powered deterministic simulator that offloads onto the GPU the calculations required to execute many simulations, following both fine-grained and coarse-grained parallelization strategies. This pipeline has been applied to carry out a parameter sweep analysis of a relapsing-remitting multiple sclerosis model, aimed at understanding the role of possible malfunctions in the cross-balancing mechanisms that regulate peripheral tolerance of self-reactive T lymphocytes. In our experiments, LASSIE achieves around a 97× speed-up with respect to the sequential execution of the same number of simulations.

Marco Beccuti, Paolo Cazzaniga, Marzio Pennisi, Daniela Besozzi, Marco S. Nobile, Simone Pernice, Giulia Russo, Andrea Tangherloni, Francesco Pappalardo
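The coarse-grained strategy mentioned above maps naturally onto CUDA: each GPU thread integrates one parameter set end-to-end. The sketch below illustrates this pattern on a toy ODE dy/dt = -k·y with explicit Euler; it is not LASSIE's actual kernel, which also parallelizes within a single simulation (fine-grained):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Coarse-grained parameter sweep: thread i integrates parameter set k[i].
__global__ void sweep(const double* k, double* y, int n,
                      double y0, double dt, int steps) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double yi = y0;
    for (int s = 0; s < steps; ++s)
        yi += dt * (-k[i] * yi);          // explicit Euler step of dy/dt = -k*y
    y[i] = yi;
}

int main() {
    const int n = 1 << 16;                // 65536 parameter sets in one launch
    double *k, *y;
    cudaMallocManaged(&k, n * sizeof(double));
    cudaMallocManaged(&y, n * sizeof(double));
    for (int i = 0; i < n; ++i) k[i] = 0.1 + 1e-5 * i;
    sweep<<<(n + 255) / 256, 256>>>(k, y, n, 1.0, 1e-3, 10000);
    cudaDeviceSynchronize();
    std::printf("y[0]=%f  y[n-1]=%f\n", y[0], y[n - 1]);
    cudaFree(k); cudaFree(y);
}
```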
Cross-Environment Comparison of a Bioinformatics Pipeline: Perspectives for Hybrid Computations

In this work a previously published bioinformatics pipeline was reimplemented across various computational platforms, and the performance of its steps was evaluated. The tested environments were: (I) a dedicated bioinformatics-specific server; (II) a low-power single node; (III) an HPC single node; (IV) a virtual machine. The pipeline was tested on the analysis of a single patient to assess single-use performance, using the same pipeline configuration throughout in order to perform a meaningful comparison and to search for the optimal environment/hybrid system configuration for biomedical analysis. Performance was evaluated in terms of execution wall time, memory usage and energy consumption per patient. Our results show that, albeit slower, low-power single nodes are comparable to the other environments for most of the steps, but with an energy consumption two to four times lower. These results indicate that such environments are viable candidates for bioinformatics clusters where long-term efficiency is a factor.

Nico Curti, Enrico Giampieri, Andrea Ferraro, Cristina Vistoli, Elisabetta Ronchieri, Daniele Cesini, Barbara Martelli, Cristina Duma Doina, Gastone Castellani
High Performance Computing for Haplotyping: Models and Platforms

The reconstruction of the haplotype pair for each chromosome is a hot topic in Bioinformatics and Genome Analysis. In Haplotype Assembly (HA), all heterozygous Single Nucleotide Polymorphisms (SNPs) have to be assigned to exactly one of the two chromosomes. In this work, we outline the state of the art in HA approaches and present an in-depth analysis of the computational performance of GenHap, a recent method based on Genetic Algorithms. GenHap was designed to tackle the computational complexity of the HA problem by means of a divide-et-impera strategy that effectively leverages multi-core architectures. In order to evaluate GenHap's performance, we generated different instances of synthetic (yet realistic) data exploiting empirical error models of four different sequencing platforms (namely, Illumina NovaSeq, Roche/454, PacBio RS II and Oxford Nanopore Technologies MinION). Our results show that the processing time generally decreases as the read length increases, since longer reads entail a smaller number of sub-problems to be distributed over multiple cores.

Andrea Tangherloni, Leonardo Rundo, Simone Spolaor, Marco S. Nobile, Ivan Merelli, Daniela Besozzi, Giancarlo Mauri, Paolo Cazzaniga, Pietro Liò
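The divide-et-impera layout can be sketched generically: split the SNP positions into overlapping blocks and solve each block as an independent task on its own core. The block solver below is a placeholder, not GenHap's genetic algorithm; only the partitioning and the task fan-out are the point:

```cpp
#include <algorithm>
#include <future>
#include <iostream>
#include <vector>

// Hedged sketch: partition m SNP positions into overlapping blocks and solve
// each block concurrently; solveBlock stands in for per-block optimization.
struct Block { int begin, end; };

std::vector<Block> partition(int m, int blockSize, int overlap) {
    std::vector<Block> blocks;
    for (int b = 0; b < m; b += blockSize - overlap)
        blocks.push_back({b, std::min(b + blockSize, m)});
    return blocks;
}

int solveBlock(Block blk) {           // placeholder: would return a partial haplotype
    return blk.end - blk.begin;
}

int main() {
    auto blocks = partition(/*m=*/1000, /*blockSize=*/120, /*overlap=*/20);
    std::vector<std::future<int>> results;
    for (auto blk : blocks)           // one asynchronous task per sub-problem
        results.push_back(std::async(std::launch::async, solveBlock, blk));
    for (auto& r : results) std::cout << r.get() << ' ';
    std::cout << '\n';
}
```

Fewer, longer reads mean fewer blocks and thus fewer tasks, which is consistent with the runtime trend the abstract reports.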

PCDLifeS - Workshop on Parallel and Distributed Computing for Life Sciences: Algorithms, Methodologies and Tools

Frontmatter
Effect of Spatial Decomposition on the Efficiency of k Nearest Neighbors Search in Spatial Interpolation

Spatial interpolation is commonly used in geometric modeling for life science applications. In large-scale spatial interpolation, a local set of data points must be found for each interpolated point using the k Nearest Neighbors (kNN) search procedure. To improve the computational efficiency of kNN search, spatial decomposition structures such as grids and trees are employed to quickly locate the nearest neighbors. Among these spatial decomposition structures, the uniform grid is the simplest one, and the size of its cells can strongly affect the efficiency of kNN search. In this paper, we evaluate the effect of the uniform grid's cell size on the efficiency of kNN search. Our objective is to find the relatively optimal cell size by considering the distribution of the scattered points (i.e., the data points and the interpolated points). We employ the Standard Deviation of the points' coordinates to measure the spatial distribution of the scattered points. For irregularly distributed scattered points, we perform several series of kNN search procedures in two dimensions. Benchmark results indicate that, in two dimensions, as the Standard Deviation of the points' coordinates increases, the relatively optimal cell size decreases and eventually converges. We also fit the relationships between the Standard Deviation of the scattered points' coordinates and the relatively optimal cell size. The fitted relationships can be applied to determine the relatively optimal grid cell in kNN search and, in turn, to improve the computational efficiency of spatial interpolation.

Naijie Fan, Gang Mei, Zengyu Ding, Salvatore Cuomo, Nengxiong Xu
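To see why the cell size matters, consider a minimal 2-D uniform-grid kNN search (illustrative, not the paper's code): points are bucketed by cell index, and a query scans growing rings of cells around its own cell. Large cells mean each ring holds many candidates to sort; small cells mean many rings must be visited. An exact implementation would scan one extra ring after k candidates are found; this sketch omits that for brevity:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <iostream>
#include <map>
#include <vector>

struct Pt { double x, y; };

struct Grid {
    double cell;                                        // cell edge length
    std::map<std::pair<int, int>, std::vector<Pt>> cells;

    std::pair<int, int> cellOf(Pt p) const {
        return {(int)std::floor(p.x / cell), (int)std::floor(p.y / cell)};
    }
    void insert(Pt p) { cells[cellOf(p)].push_back(p); }

    std::vector<Pt> knn(Pt q, std::size_t k) const {
        auto [ci, cj] = cellOf(q);
        std::vector<Pt> cand;
        for (int r = 0; cand.size() < k && r < 64; ++r)  // ring-by-ring expansion
            for (int i = ci - r; i <= ci + r; ++i)
                for (int j = cj - r; j <= cj + r; ++j)
                    if (std::max(std::abs(i - ci), std::abs(j - cj)) == r)
                        if (auto it = cells.find({i, j}); it != cells.end())
                            cand.insert(cand.end(), it->second.begin(), it->second.end());
        auto d2 = [&](Pt p) { return (p.x-q.x)*(p.x-q.x) + (p.y-q.y)*(p.y-q.y); };
        std::sort(cand.begin(), cand.end(),
                  [&](Pt a, Pt b) { return d2(a) < d2(b); });
        if (cand.size() > k) cand.resize(k);
        return cand;
    }
};

int main() {
    Grid g{0.5, {}};                                     // cell size under study
    for (int i = 0; i < 1000; ++i)
        g.insert({std::sin(i * 0.1) * 10, std::cos(i * 0.17) * 10});
    for (Pt p : g.knn({0.0, 0.0}, 5))
        std::cout << p.x << " " << p.y << "\n";
}
```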
Understanding Chromatin Structure: Efficient Computational Implementation of Polymer Physics Models

In recent years, the development of novel technologies such as Hi-C and GAM has made it possible to investigate the spatial structure of chromatin in the cell nucleus with a constantly increasing level of accuracy. Polymer physics models have been developed and improved to better interpret the wealth of complex information coming from the experimental data, providing highly accurate insights into chromatin architecture and into the mechanisms regulating genome folding. To investigate the capability of the models to explain the experiments and to test their agreement with the data, massive parallel simulations are needed and efficient algorithms are fundamental. In this work, we consider general computational Molecular Dynamics (MD) techniques commonly used to implement such models, with a special focus on the Strings & Binders Switch polymer model. By combining this model with machine learning approaches, it is possible to give an accurate description of real genomic loci. In addition, it is also possible to make predictions about the impact of structural variants of the genomic sequence, which are known to be linked to severe congenital diseases.

Simona Bianco, Carlo Annunziatella, Andrea Esposito, Luca Fiorillo, Mattia Conte, Raffaele Campanile, Andrea M. Chiariello
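The MD core behind such polymer models is a standard integrator over a bead chain. The following sketch runs velocity-Verlet on a chain with harmonic bonds; parameters are illustrative, and real Strings & Binders simulations additionally include binders, excluded volume and a thermostat:

```cpp
#include <array>
#include <cmath>
#include <cstdio>
#include <vector>

using V3 = std::array<double, 3>;

int main() {
    const int N = 100;
    const double k = 10.0, r0 = 1.0, dt = 1e-3, m = 1.0;
    std::vector<V3> x(N), v(N, {0, 0, 0}), f(N), fOld(N);
    for (int i = 0; i < N; ++i) x[i] = {i * r0 * 1.05, 0.0, 0.0};  // slightly stretched chain

    auto computeForces = [&](std::vector<V3>& F) {
        for (auto& Fi : F) Fi = {0, 0, 0};
        for (int i = 0; i + 1 < N; ++i) {                 // harmonic bonds i -- i+1
            V3 d; double r2 = 0;
            for (int c = 0; c < 3; ++c) { d[c] = x[i+1][c] - x[i][c]; r2 += d[c]*d[c]; }
            double r = std::sqrt(r2), s = k * (r - r0) / r;  // restoring force along bond
            for (int c = 0; c < 3; ++c) { F[i][c] += s*d[c]; F[i+1][c] -= s*d[c]; }
        }
    };

    computeForces(f);
    for (int step = 0; step < 1000; ++step) {             // velocity-Verlet loop
        for (int i = 0; i < N; ++i)
            for (int c = 0; c < 3; ++c)
                x[i][c] += dt*v[i][c] + 0.5*dt*dt*f[i][c]/m;
        fOld = f;
        computeForces(f);
        for (int i = 0; i < N; ++i)
            for (int c = 0; c < 3; ++c)
                v[i][c] += 0.5*dt*(f[i][c] + fOld[i][c])/m;
    }
    std::printf("end-to-end distance: %f\n", x[N-1][0] - x[0][0]);
}
```

In production runs this inner loop is what gets parallelized (domains over beads, neighbor lists for non-bonded terms), which is where the efficiency concerns of the paper arise.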
Towards Heterogeneous Network Alignment: Design and Implementation of a Large-Scale Data Processing Framework

The importance of using networks to model and analyse biological data and the interplay of bio-molecules is widely recognised. Consequently, many algorithms for the analysis and comparison of networks (such as alignment algorithms) have been developed in the past. Recently, many different approaches have tried to integrate into a single model the interplay of different molecules, such as genes, transcription factors and microRNAs. A possible formalism to model such a scenario comes from node-coloured networks (or heterogeneous networks) implemented as node/edge-coloured graphs. Consequently, the need arises for alignment algorithms able to analyse heterogeneous networks. To the best of our knowledge, none of the existing algorithms is able to mine heterogeneous networks. We propose a two-step alignment strategy that receives as input two heterogeneous networks (node-coloured graphs) and a similarity function among the nodes of the two networks, extending the previous formulations. We first build a single alignment graph; we then mine this graph, extracting relevant subgraphs. Despite the simplicity of this approach, the analysis of such networks relies on graph and subgraph isomorphism, and the size of the data is still growing. Therefore, a high-performance data analytics framework is needed. Here we present HetNetAligner, a framework built on top of Apache Spark. We also implemented our algorithm and tested it on selected heterogeneous biological networks. Preliminary results confirm that our method may extract relevant knowledge from biological data while reducing the computational time.

Marianna Milano, Pierangelo Veltri, Mario Cannataro, Pietro H. Guzzi
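The first step (alignment-graph construction) can be shown in miniature: a pair (u, v) of nodes, one per network, becomes an alignment node when the colours match and the similarity passes a threshold. In the paper this runs distributed on Apache Spark; the single-node C++ below is a hedged illustration with made-up data:

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Net {
    std::map<std::string, std::string> colour;               // node -> colour
    std::vector<std::pair<std::string, std::string>> edges;  // (unused in this step)
};

int main() {
    Net a, b;
    a.colour = {{"g1", "gene"}, {"m1", "miRNA"}};
    b.colour = {{"g2", "gene"}, {"m2", "miRNA"}};
    // Illustrative similarity function given as a lookup table.
    std::map<std::pair<std::string, std::string>, double> sim =
        {{{"g1", "g2"}, 0.9}, {{"m1", "m2"}, 0.8}, {{"g1", "m2"}, 0.7}};

    std::vector<std::pair<std::string, std::string>> alignNodes;
    for (auto& [u, cu] : a.colour)
        for (auto& [v, cv] : b.colour)
            if (cu == cv && sim[{u, v}] >= 0.75)   // colour match + similarity test
                alignNodes.push_back({u, v});
    for (auto& [u, v] : alignNodes)
        std::cout << "(" << u << "," << v << ")\n"; // prints (g1,g2) and (m1,m2)
}
```

Edges of the alignment graph would then connect pairs whose endpoints are compatibly adjacent in the two input networks, and the second step mines dense subgraphs of that graph.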
A Parallel Cellular Automaton Model for Adenocarcinomas in Situ with Java: Study of One Case

Adenocarcinomas are tumors that originate in the lining epithelium of the ducts that form the endocrine glands of the human body. Infiltrating breast carcinoma is one of the most frequent neoplasms among the female population; early detection of the disease is therefore fundamental and, for this reason, a profound knowledge of the biology of the tumor at this phase is essential. Among the distinct tools that contribute to this knowledge, computational simulation is used more frequently every day. The availability of fast and efficient computations that allow the simulation of tumor dynamics in situ, under a wide range of different parameters, is an important research topic. Based on cellular automata, this paper proposes a generic simulation model for adenocarcinomas in situ (CIS). We apply it to breast ductal adenocarcinoma in situ (DCIS), modeling our cells with the genomic load with which the tumor is currently known to start, and proposing a numerical coding method for the genome that allows efficient computational management. We propose a parallelization scheme using data parallelism, and we show the acceleration achieved on multiple nodes of our processor cluster.

Antonio J. Tomeu-Hardasmal, Alberto G. Salguero-Hidalgo, Manuel I. Capel
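The data-parallel pattern at the heart of such models is a double-buffered cellular automaton step with the grid partitioned among threads. The paper's implementation is in Java; the sketch below shows the same pattern in C++ with OpenMP, with a toy proliferation rule standing in for the tumor-growth rules:

```cpp
#include <cstdio>
#include <utility>
#include <vector>
#include <omp.h>

// Placeholder update rule: a cell becomes/stays occupied if it is occupied
// or has at least one occupied von Neumann neighbour (toy growth rule).
int updateCell(const std::vector<std::vector<int>>& g, int i, int j) {
    int alive = g[i-1][j] + g[i+1][j] + g[i][j-1] + g[i][j+1];
    return (g[i][j] == 1 || alive >= 1) ? 1 : 0;
}

int main() {
    const int n = 512, steps = 100;
    std::vector<std::vector<int>> cur(n, std::vector<int>(n, 0)), nxt = cur;
    cur[n/2][n/2] = 1;                      // single transformed cell in the duct
    for (int s = 0; s < steps; ++s) {
        #pragma omp parallel for            // each thread updates a band of rows
        for (int i = 1; i < n - 1; ++i)
            for (int j = 1; j < n - 1; ++j)
                nxt[i][j] = updateCell(cur, i, j);
        std::swap(cur, nxt);                // double buffering keeps the step synchronous
    }
    long total = 0;
    for (auto& row : cur) for (int c : row) total += c;
    std::printf("tumor cells after %d steps: %ld\n", steps, total);
}
```

Scaling this to multiple cluster nodes, as the paper does, additionally requires exchanging halo rows between neighbouring partitions after each step.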
Performance Evaluation for a PETSc Parallel-in-Time Solver Based on the MGRIT Algorithm

We describe herein the performance evaluation of a modular implementation of the MGRIT (MultiGrid-In-Time) algorithm within the context of PETSc (the Portable, Extensible Toolkit for Scientific Computation). Our aim is to give PETSc users the opportunity to test the MGRIT parallel-in-time approach as an alternative to the Time Stepping integrator (TS) when solving problems arising from the discretization of linear evolutionary models. To this end, we analyzed the performance parameters of the algorithm in order to highlight the relationship between the configuration factors and the problem characteristics, intentionally setting aside accuracy issues and spatial parallelism.

Valeria Mele, Diego Romano, Emil M. Constantinescu, Luisa Carracciuolo, Luisa D’Amore

RePara - Workshop on Reengineering for Parallelism in Heterogeneous Parallel Platforms

Frontmatter
Programmable HSA Accelerators for Zynq UltraScale+ MPSoC Systems

Modern algorithms for virtual reality, machine learning or big data find their way into more and more application fields and result in stricter performance-per-watt requirements. This challenges traditional homogeneous computing concepts and drives the development of new, heterogeneous architectures. One idea for attaining a balance of high data throughput and flexibility is GPU-like soft-core processors combined with general-purpose CPUs as hosts. However, the approaches proposed in recent years are still not sufficient regarding their integration into a shared hardware environment and a unified software stack. The approach of the HSA Foundation provides a complete communication definition for heterogeneous systems but lacks FPGA accelerator support. Our work presents a methodology for making soft-core processors HSA-compliant within MPSoC systems. This enables high-level software programming and therefore eases the accessibility of soft-core FPGA accelerators. Furthermore, the integration effort is kept low by fully utilizing the HSA Foundation's standards and toolchains.

Wolfgang Bauer, Philipp Holzinger, Marc Reichenbach, Steffen Vaas, Paul Hartke, Dietmar Fey
Service Level Objectives via C++11 Attributes

In recent years, increasing attention has been given to the possibility of guaranteeing Service Level Objectives (SLOs) to users about their applications, regarding either performance or power consumption. SLOs can be enforced for parallel applications, since these provide many control knobs (e.g., the number of threads to use, the clock frequency of the cores, etc.) for tuning the performance and power consumption of the application. Differently from most existing approaches, we target sequential stream processing applications by proposing a solution based on C++ annotations. The user specifies which parts of the code to parallelize and what type of requirements should be enforced on those parts of the code. Our solution first automatically parallelizes the annotated code and then applies self-adaptation approaches at run time to enforce the user-expressed objectives. We ran experiments on different real-world applications, showing the simplicity and effectiveness of our solution.

Dalvan Griebler, Daniele De Sensi, Adriano Vogel, Marco Danelutto, Luiz Gustavo Fernandes
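A schematic example of SPar-style C++11 attribute annotations follows. The stream/stage attribute names follow the SPar literature; the paper's SLO-requirement attributes are not reproduced here, so treat the sketch as illustrative only. Compiled with plain g++ the unknown attributes are ignored (the program runs sequentially); the SPar compiler uses them to generate the parallel pipeline:

```cpp
#include <string>
#include <vector>

static std::string compress(const std::string& s) { return s; }   // stub stage body

int main() {
    std::vector<std::string> input = {"a", "b", "c"};
    std::size_t idx = 0;
    std::string in, res;
    [[spar::ToStream, spar::Input(in)]]
    while (idx < input.size()) {
        in = input[idx++];
        [[spar::Stage, spar::Input(in), spar::Output(res), spar::Replicate(4)]]
        { res = compress(in); }      // replicated (parallelizable) stage
        [[spar::Stage, spar::Input(res)]]
        { /* write(res) to the output sink */ }
    }
    // A hypothetical SLO requirement (e.g. a target throughput or power cap)
    // would be attached to the annotated region in a similar attribute form.
}
```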
InKS, a Programming Model to Decouple Performance from Algorithm in HPC Codes

Existing programming models tend to tightly interleave algorithm and optimization in HPC simulation codes. This requires scientists to become experts in both the simulated domain and the optimization process, and makes the code difficult to maintain and port to new architectures. This paper proposes the InKS programming model, which decouples these two concerns with distinct languages for each. The simulation algorithm is expressed in the InKSpia language with no concern for machine-specific optimizations. Optimizations are expressed using both a family of dedicated optimization DSLs (InKSo) and plain C++. InKSo relies on the InKSpia source to assist developers with common optimizations, while C++ is used for less common ones. Our evaluation demonstrates the soundness of the approach by using it on synthetic benchmarks and the Vlasov-Poisson equation. It shows that InKS offers separation of concerns at no performance cost.

Ksander Ejjaaouani, Olivier Aumage, Julien Bigot, Michel Mehrenberger, Hitoshi Murai, Masahiro Nakao, Mitsuhisa Sato
Refactoring Loops with Nested IFs for SIMD Extensions Without Masked Instructions

Most CPUs in heterogeneous systems are now equipped with SIMD (Single Instruction Multiple Data) extensions that operate on short vectors in parallel to enable high performance. Refactoring programs for such systems relies on vectorization, i.e., transforming the code into a form that uses SIMD instructions. We improve the state of the art in refactoring loops with nested IF-statements, which are notoriously difficult to vectorize. For IF-statements whose conditions are independent of the loop variable, we improve the classical loop unswitching method so that it can tackle nested IFs. For IF-statements whose conditions change with loop iterations, we develop a novel IF-select transformation method: (1) it can work with arbitrarily nested IFs, and (2) while previous methods rely on either masked instructions or hardware support for predicated execution, our method works for SIMD extensions without such operations (as found, e.g., in IBM Power8 and ARM Cortex-A8). Our experimental evaluation on the SPEC CPU2006 benchmark suite is conducted on an SW26010 processor used in the Sunway TaihuLight supercomputer (#2 in the TOP500 list); it demonstrates the performance advantages of our approach over the vectorizer of the Open64 compiler.

Huihui Sun, Sergei Gorlatch, Rongcai Zhao
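The IF-select idea can be illustrated on a small nested IF (this is a sketch of the general technique, not the paper's compiler pass): both arms are made branch-free and the result is chosen by a select. On SIMD targets without masked instructions, the select itself can be lowered to plain arithmetic, e.g. c = m·x + (1 − m)·y with m ∈ {0, 1}:

```cpp
#include <cstdio>

static inline double sel(bool c, double a, double b) {
    double m = c ? 1.0 : 0.0;        // condition turned into a 0/1 mask value
    return m * a + (1.0 - m) * b;    // blend without hardware masking
}

void kernel(const double* a, const double* b, double* c, int n) {
    for (int i = 0; i < n; ++i) {
        // Original control flow:
        //   if (a[i] > 0) { if (b[i] > 0) c[i] = a[i]+b[i]; else c[i] = a[i]; }
        //   else            c[i] = 0;
        double inner = sel(b[i] > 0, a[i] + b[i], a[i]);  // inner IF removed
        c[i] = sel(a[i] > 0, inner, 0.0);                 // outer IF removed
    }
}

int main() {
    double a[4] = {1, -1, 2, 3}, b[4] = {1, 1, -1, 2}, c[4];
    kernel(a, b, c, 4);
    for (double v : c) std::printf("%g ", v);   // prints: 2 0 2 5
    std::printf("\n");
}
```

Once the loop body is branch-free, a vectorizer can emit straight-line SIMD code even on extensions that lack masked or predicated instructions; both arms are always computed, which is the price of the transformation.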

Resilience - Workshop on Resiliency in High Performance Computing with Clouds, Grids, and Clusters

Frontmatter
Do Moldable Applications Perform Better on Failure-Prone HPC Platforms?

This paper compares the performance of different approaches to tolerating failures using checkpoint/restart when executed on large-scale failure-prone platforms. We study (i) Rigid applications, which use a constant number of processors throughout execution; (ii) Moldable applications, which can use a different number of processors after each restart following a fail-stop error; and (iii) GridShaped applications, which are moldable applications restricted to rectangular processor grids (such as many dense linear algebra kernels). For each application type, we compute the optimal number of failures to tolerate before relinquishing the current allocation and waiting until new resources can be allocated, and we determine the optimal yield that can be achieved. We instantiate our performance model with a realistic application scenario and make it publicly available for further usage.

Valentin Le Fèvre, George Bosilca, Aurelien Bouteiller, Thomas Herault, Atsushi Hori, Yves Robert, Jack Dongarra
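As background for the trade-offs studied here, the classical first-order baseline for checkpoint/restart is the Young/Daly approximation (the paper's own model is considerably richer, accounting for reallocation and yield):

```latex
% With checkpoint cost C and platform MTBF \mu (and C \ll \mu), the optimal
% amount of work between two checkpoints is approximately
\[
  W_{\mathrm{opt}} \approx \sqrt{2\,C\,\mu},
\]
% so the expected checkpointing overhead per unit of work scales as
% \sqrt{2C/\mu}: cheaper checkpoints or rarer failures both widen the
% optimal checkpoint interval.
```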
FINJ: A Fault Injection Tool for HPC Systems

We present FINJ, a high-level fault injection tool for High-Performance Computing (HPC) systems, with a focus on the management of complex experiments. FINJ provides support for custom workloads and allows the generation of anomalous conditions through the use of fault-triggering executable programs. FINJ can also be integrated seamlessly with most other lower-level fault injection tools, allowing users to create and monitor a variety of highly complex and diverse fault conditions in HPC systems that would be difficult to recreate in practice. FINJ is suitable for experiments involving many, potentially interacting nodes, making it a very versatile design and evaluation tool.

Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sîrbu, Andrea Bartolini, Andrea Borghesi
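To make "fault-triggering executable programs" concrete, here is a generic example of the kind of program such a tool can launch and monitor (illustrative only; FINJ itself orchestrates such executables rather than providing this one): it induces memory pressure on a node for a fixed duration and then exits, returning the node to normal operation:

```cpp
#include <chrono>
#include <string>
#include <thread>
#include <vector>

int main(int argc, char** argv) {
    const std::size_t mib = argc > 1 ? std::stoul(argv[1]) : 1024;  // MiB to grab
    const int seconds     = argc > 2 ? std::stoi(argv[2]) : 30;     // anomaly duration
    std::vector<std::vector<char>> hog;
    for (std::size_t i = 0; i < mib; ++i)
        hog.emplace_back(1 << 20, 1);        // allocate and touch each MiB
    std::this_thread::sleep_for(std::chrono::seconds(seconds));
    return 0;                                 // anomaly ends; memory is released
}
```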
Performance Efficient Multiresilience Using Checkpoint Recovery in Iterative Algorithms

In this paper, we address the design challenge of building multiresilient iterative high-performance computing (HPC) applications. Multiresilience in HPC applications is the ability to tolerate and maintain forward progress in the presence of both soft errors and process failures. We address this challenge by proposing performance models that are useful for designing performance-efficient and resilient iterative applications. The models consider the interaction between soft-error and process-failure resilience solutions. We experimented with a linear solver application and two distinct kinds of soft-error detectors: one detector has high overhead and high accuracy, whereas the second has low overhead and low accuracy. We show how both can be leveraged to verify the integrity of checkpointed state used to recover from both soft errors and process failures. Our results show the performance efficiency and resilience benefits of employing the low-overhead detector at high frequency within the checkpoint interval, so that timely soft-error recovery can take place, resulting in less re-computed work.

Rizwan A. Ashraf, Christian Engelmann
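The schedule the abstract argues for can be sketched as follows (a hedged illustration with stub routines, not the paper's solver): the cheap detector runs every iteration for timely recovery, while the expensive detector is reserved for verifying state right before it is checkpointed, so that checkpoints stay clean:

```cpp
#include <cstdio>

bool cheapCheck()    { return true; }   // low overhead, may miss errors
bool thoroughCheck() { return true; }   // high overhead, high accuracy
void checkpoint()    { std::puts("checkpoint written"); }
void rollback()      { std::puts("rolled back to last checkpoint"); }
void iterate()       { /* one solver iteration */ }

int main() {
    const int checkpointInterval = 100;
    for (int it = 1; it <= 1000; ++it) {
        iterate();
        if (!cheapCheck()) { rollback(); continue; }   // frequent, timely test
        if (it % checkpointInterval == 0) {
            // Verify integrity before persisting state, so a silent error
            // cannot be baked into the checkpoint itself.
            if (thoroughCheck()) checkpoint(); else rollback();
        }
    }
}
```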
A Lightweight Approach to GPU Resilience

Resilience for HPC applications is typically implemented as a CPU-based rollback-recovery technique. In this context, long-running accelerator computations on GPUs pose a major challenge, as these devices usually do not offer any means of interruption. This paper proposes a solution to this problem: a novel approach that rewrites GPU kernels so that a soft interrupt of their execution becomes possible. Our approach is based on the Compute Unified Device Architecture (CUDA) by Nvidia and works by taking advantage of CUDA's execution model of partitioning threads into blocks. In essence, we rewrite the kernel so that each block determines whether it should continue execution or return control to the CPU. By doing so we are able to perform a premature interrupt of kernels.

Max Baird, Christian Fensch, Sven-Bodo Scholz, Artjoms Šinkarovs
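A minimal sketch of the block-level soft-interrupt idea follows; this is our reading of the general technique, not the authors' code. Each block polls a host-visible flag between work chunks and returns early when it is set; a persistent device-side cursor records progress so a later relaunch resumes where the kernel left off:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__device__ int nextChunk = 0;                 // persists across kernel launches

__global__ void interruptible(const volatile int* stopFlag,
                              float* data, int nChunks, int chunkSize) {
    __shared__ int chunk, stop;
    while (true) {
        if (threadIdx.x == 0) {
            stop = *stopFlag;                 // one coherent read per block
            chunk = stop ? 0 : atomicAdd(&nextChunk, 1);
        }
        __syncthreads();
        if (stop) return;                     // soft interrupt: yield to the host
        if (chunk >= nChunks) return;         // no work left
        data[chunk * chunkSize + threadIdx.x] += 1.0f;   // placeholder computation
        __syncthreads();                      // chunk must not be overwritten early
    }
}

int main() {
    const int nChunks = 1024, chunkSize = 256;
    float* data; int* stop;
    cudaMallocManaged(&data, nChunks * chunkSize * sizeof(float));
    cudaMallocManaged(&stop, sizeof(int));
    *stop = 0;
    interruptible<<<8, chunkSize>>>(stop, data, nChunks, chunkSize);
    // The host could set *stop = 1 here to request a premature return, then
    // relaunch the kernel later: nextChunk remembers the progress made.
    cudaDeviceSynchronize();
    std::printf("data[0]=%f\n", data[0]);
}
```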
Backmatter
Metadata
Copyright Year
2019
Electronic ISBN
978-3-030-10549-5
Print ISBN
978-3-030-10548-8
DOI
https://doi.org/10.1007/978-3-030-10549-5
