Skip to main content
Top

2017 | Book

Parallel Computing Technologies

14th International Conference, PaCT 2017, Nizhny Novgorod, Russia, September 4-8, 2017, Proceedings

insite
SEARCH

About this book

This book constitutes the proceedings of the 14th International Conference on Parallel Computing Technologies, PaCT 2017, held in Nizhny Novgorod, Russia, in September 2017. The 25 full papers and 24 short papers presented were carefully reviewed and selected from 93 submissions. The papers are organized in topical sections on mainstream parallel computing, parallel models and algorithms in numerical computation, cellular automata and discrete event systems, organization of parallel computation, parallel computing applications.

Table of Contents

Frontmatter

Mainstream Parallel Computing

Frontmatter
Experimenting with a Context-Aware Language

Contextual information plays an increasingly crucial role in concurrent applications in the times of mobility and pervasiveness of computing. Context-Oriented Programming languages explicitly treat this kind of information. They provide primitive constructs to adapt the behaviour of a program, depending on the evolution of its operational environment, which is affected by other programs hosted therein independently and unpredictably. We discuss these issues and the challenges they pose, reporting on our recent work on $$\text {ML}_\text {CoDa}$$MLCoDa, a language specifically designed for adaptation and equipped with a clear formal semantics and analysis tools. We will show how applications and context interactions can be better specified, analysed and controlled, with the help of some experiments done with a preliminary implementation of $$\text {ML}_\text {CoDa}$$MLCoDa.

Chiara Bodei, Pierpaolo Degano, Gian-Luigi Ferrari, Letterio Galletta
Generating Maximal Domino Patterns by Cellular Automata Agents

Considered is a 2D cellular automaton with moving agents. The objective is to find agents controlled by a Finite State Program (FSP) that can form domino patterns. The quality of a formed pattern is measured by the degree of order computed by counting matching $$3 \times 3$$3×3 patterns (templates). The class of domino patterns is defined by four templates. An agent reacts on its own color, the color in front, and whether it is blocked or not. It can change the color, move or not, and turn into any direction. Four FSP were evolved for multi-agent systems with 1, 2, 4 agents initially placed in the corners of the field. For a $$12 \times 12$$12×12 training field the aimed pattern could be formed with a 100% degree of order. The performance was also high with other field sizes. Livelocks are avoided by using three different variants of the evolved FSP. The degree of order usually fluctuates after reaching a certain threshold, but it can also be stable, and the agents may show the termination by running in a cycle, or by stopping their activity.

Rolf Hoffmann, Dominique Désérable
Automated Parallelization of a Simulation Method of Elastic Wave Propagation in Media with Complex 3D Geometry Surface on High-Performance Heterogeneous Clusters

The paper considers application of DVM and SAPFOR in order to automate mapping of 3D elastic waves simulation method on high-performance heterogeneous clusters. A distinctive feature of the proposed method is the use of a curved three-dimensional grid, which is consistent with the geometry of free surface. Usage of curved grids considerably complicates both manual and automated parallelization. Technique to map curved grid on a structured grid has been presented to solve this problem. The sequential program based on the finite difference method on a structured grid, has been parallelized using Fortran-DVMH language. Application of SAPFOR analysis tools simplified this parallelization process. Features of automated parallelization are described. Authors estimate efficiency and acceleration of the parallel program and compare performance of the DVMH based program with a program obtained after manual parallelization using MPI programming technology.

Nikita Kataev, Alexander Kolganov, Pavel Titov
Parallel Algorithm with Modulus Structure for Simulation of Seismic Wave Propagation in 3D Multiscale Multiphysics Media

This paper presents a problem-oriented approach, designed for the numerical simulation of seismic wave propagation in models containing geological formations with complex properties such as anisotropy, attenuation, and small-scale heterogeneities. Each of the named property requires a special treatment that increases the computational complexity of an algorithm in comparison with ideally elastic isotropic media. At the same time, such formations are typically relatively small, filling about 25% of the model, thus the local use of computationally expensive approaches can speed-up the simulation essentially. In this paper we discuss both mathematical and numerical aspects of the hybrid algorithm paying most attention to its parallel implementation. At the same time essential efforts are spent to couple different equations and, hence, different finite-difference stencils to describe properly the different nature of seismic wave propagation in different areas. The main issue in the coupling is to suppress numerical artifacts down to the acceptable level, usually a few tenth of the percent.

Victor Kostin, Vadim Lisitsa, Galina Reshetova, Vladimir Tcheverda
Performance Evaluation of Two Load Balancing Algorithms on a Hybrid Parallel Architecture

Accelerated Processing Units (APUs) are an emerging architecture that integrates, in a single silicon chip, the traditional CPU and the GPU. Due to its heterogeneous architecture, APUs impose new challenges to data parallel applications that want to take advantage of all the processing units available on the hardware to minimize its execution time. Some standards help in the task of writing parallel code for heterogeneous devices, but it is not easy to find the data division between CPU and GPU that will minimize the execution time. In this context, this work further extends and details load balancing algorithms designed to be used in a data parallel problem. Also, a sensitivity analysis of the parameters used in our models was performed. The results have shown that the algorithms are effective in their purpose of improving the performance of an application on an heterogeneous environment.

Tiago M. do Nascimento, Rodrigo W. dos Santos, Marcelo Lobosco
Accelerated Analysis of Biological Parameters Space Using GPUs

Mathematical modeling and computer simulation represent a valuable mean to integrate experimental research for the study of biological systems. However, many computational methods—e.g., sensitivity analysis—require the execution of a massive number of simulations to investigate the model behavior in physiological or perturbed conditions, which can be a computationally challenging task. This huge amount of simulations is necessary to collect data in the vast space of kinetic parameters. This paper provides the state-of-the-art of biochemical simulators relying on Graphics Processing Units (GPUs) in the context of Systems Biology. Moreover, we discuss two examples of integration of such simulators into computational methods for parameter sweep and sensitivity analysis, both implemented using the Python language.

Marco S. Nobile, Giancarlo Mauri

Parallel Models and Algorithms in Numerical Computation

Frontmatter
Fragmentation of IADE Method Using LuNA System

The fragmented programming system LuNA is based on the Fragmented Programming Technology. LuNA is a platform for building automatically tunable portable libraries of parallel numerical subroutines. This paper focuses on the parallel implementation of the IADE method for solving 1D partial differential equation (PDE) of parabolic type using LuNA programming system. A fragmented numerical algorithm of IADE method is designed in terms of the data-flow graph. A performance comparison of different algorithm’s implementations including LuNA and Message Passing Interface are given.

Norma Alias, Sergey Kireev
Performance Aspects of Collocated and Staggered Grids for Particle-in-Cell Plasma Simulation

We present a computational comparison of collocated and staggered uniform grids for particle-in-cell plasma simulation. Both types of grids are widely used, and numerical properties of the corresponding solvers are well-studied. However, for large-scale simulations performance is also an important factor, which is the focus of this paper. We start with a baseline implementation, apply widely-used techniques for performance optimization and measure their efficacy for both grids on a high-end Xeon CPU and a second-generation Xeon Phi processor. For the optimized version the collocated grid outperforms the staggered one by about 1.5 x on both Xeon and Xeon Phi. The speedup on the Xeon Phi processor compared to Xeon is about 1.9 x.

Sergey Bastrakov, Igor Surmin, Evgeny Efimenko, Arkady Gonoskov, Iosif Meyerov
Technological Aspects of the Hybrid Parallelization with OpenMP and MPI

In this paper we present practical parallelization techniques for different explicit and implicit numerical algorithms. These algorithms are considered on the base of the analysis of characteristics of modern computer systems and the nature of modeled physical processes. Limits of applicability of methods and parallelization techniques are determined in terms of practical implementation. Finally, the unified parallelization approach for OpenMP and MPI for solving a CFD problem in a regular domain is presented and discussed.

Oleg Bessonov
Application of Graph Models to the Parallel Algorithms Design for the Motion Simulation of Tethered Satellite Systems

Tethered satellite systems (TSS) are characterized by ununiform distribution of mass characteristics of the system and the environment parameters in space, which necessitates the use of mathematical models with distributed parameters. Simulation of such systems is performed with the use of partial differential equations with complex boundary conditions. The complexity of the boundary conditions is caused by the presence of the end-bodies that perform spatial fluctuations, and by the variable length of the tether. As a result computer simulation of TSS motion takes a long time. This paper presents a parallel algorithm for motion simulation of the TSS and representation of this algorithm in the form of a graph model in graph-symbolic programming technology. The main characteristics of the proposed algorithm and the advantages of using graph models of algorithms for modeling the motion of the TSS are discussed.

A. N. Kovartsev, V. V. Zhidchenko
The DiamondTetris Algorithm for Maximum Performance Vectorized Stencil Computation

An algorithm from the LRnLA family, DiamondTetris, for stencil computation is constructed. It is aimed for Many-Integrated-Core processors of the Xeon Phi family. The algorithm and its implementation is described for the wave equation based simulation. Its strong points are locality, efficient use of memory hierarchy, and, most importantly, seamless vectorization. Specifically, only 1 vector rearrange operation is necessary per cell value update. The performance is estimated with the roofline model. The algorithm is implemented in code and tested on Xeon and Xeon Phi machines.

Vadim Levchenko, Anastasia Perepelkina
A Parallel Locally-Adaptive 3D Model on Cartesian Nested-Type Grids

The paper addresses the 3D extension of the Cartesian multilevel nested-type grid methodology and its software implementation in an application library written in C++ object-oriented language with the application program interface OpenMP for parallelizing calculations on shared memory. The library accounts for the specifics of multithread calculations of 3D problems on Cartesian grids, which makes it possible to substantially minimize the loaded memory via non-storing the grid information. The loop order over cells is represented by a special list that remarkably simplifies parallel realization with the OpenMP directives. Test results show high effectiveness of dynamical local adaptation of Cartesian grids, and increasing of this effectiveness while the number of adaptation levels becomes larger.

Igor Menshov, Viktor Sheverdin
Auto-Vectorization of Loops on Intel 64 and Intel Xeon Phi: Analysis and Evaluation

This paper evaluates auto-vectorizing capabilities of modern optimizing compilers Intel C/C++, GCC C/C++, LLVM/Clang and PGI C/C++ on Intel 64 and Intel Xeon Phi architectures. We use the Extended Test Suite for Vectorizing Compilers consisting of 151 loops. In this work, we estimate speedup by running the loops in scalar and vector modes for different data types and determine loop classes which the compilers used in the study fail to vectorize. We use the dual CPU system (NUMA, 2 x Intel Xeon E5-2620v4, Intel Broadwell microarchitecture) with the Intel Xeon Phi 3120A co-processor for our experiments.

Olga V. Moldovanova, Mikhail G. Kurnosov
Parallel Algorithms for an Implicit CFD Solver on Tree-Based Grids

Parallel implementation of the implicit LU-SGS solver is considered. It leads to the graph coloring problem. A novel recursive graph coloring algorithm has been proposed that requires only three colors on 2:1 balanced quadtree-based meshes. The algorithm has been shown to allow simple parallel implementations, including GPU architectures, and is fully coherent with local grid coarsing/refining procedures resulting in highly effective co-execution with local grid adaptation.

Pavel Pavlukhin, Igor Menshov
Software Implementation of Mathematical Model of Thermodynamic Processes in a Steam Turbine on High-Performance System

The aim of this paper is the development of the mathematical model of thermal processes in steam turbine based on the modern information technologies and computational methods, with help of which the accuracy of calculations of thermal modes. The practical significance of the paper are: the model of thermal processes in steam turbine is proposed and implemented, the information about the temperature modes of the steam turbine is derived, limits and prospects of the proposed mathematical model is defined. The thermal processes in the turbine are characterized by a strong non-uniformity of the heat flow, which has significantly influence to the reliability and efficiency of the facility. As a rule, it the influence of these parameters on the geometry is not considered in the designing of the system that results in premature wear of the machine. The developed model takes into account the complex geometry of the steam turbine, does not require the significant changes in the processing of the design features and can be used to calculate the thermal processes other construction such as turbines. Software solution was developed for two-dimensional simulation of thermal processes in steam turbine that takes into account the occupancy control volumes.

Aleksandr Sukhinov, Aleksandr Chistyakov, Alla Nikitina, Irina Yakovenko, Vladimir Parshukov, Nikolay Efimov, Vadim Kopitsa, Dmitriy Stepovoy
Predictive Modeling of Suffocation in Shallow Waters on a Multiprocessor Computer System

The model of the algal bloom, causing suffocations in shallow waters takes into account the follows: the transport of water environment; microturbulent diffusion; gravitational sedimentation of pollutants and plankton; nonlinear interaction of plankton populations; biogenic, temperature and oxygen regimes; influence of salinity. The computational accuracy is significantly increased and computational time is decreased at using schemes of high order of accuracy for discretization of the model. The practical significance is the software implementation of the proposed model, the limits and prospects of it practical use are defined. Experimental software was developed based on multiprocessor computer system and intended for mathematical modeling of possible progress scenarios of shallow waters ecosystems on the example of the Azov Sea in the case of suffocation. We used decomposition methods of grid domains in parallel implementation for computationally laborious convection-diffusion problems, taking into account the architecture and parameters of multiprocessor computer system. The advantage of the developed software is also the use of hydrodynamical model including the motion equations in the three coordinate directions.

Aleksandr Sukhinov, Alla Nikitina, Aleksandr Chistyakov, Vladimir Sumbaev, Maksim Abramov, Alena Semenyakina

Cellular Automata and Discrete Event Systems

Frontmatter
Finite and Infinite Computations and a Classification of Two-Dimensional Cellular Automata Using Infinite Computations

This paper proposes an application of the Infinite Unit Axiom and grossone, introduced by Yaroslav Sergeyev (see [19–23]), to the development and classification of two-dimensional cellular automata. This application establishes, by the application of grossone, a new and more precise nonarchimedean metric on the space of definition for two-dimensional cellular automata, whereby the accuracy of computations is increased. Using this new metric, open disks are defined and the number of points in each disk computed. The forward dynamics of a cellular automaton map are also studied by defined sets. It is also shown that using the Infinite Unit Axiom, the number of configurations that follow a given configuration, under the forward iterations of the cellular automaton map, can now be computed and hence a classification scheme developed based on this computation.

Louis D’Alotto
Multiple-Precision Residue-Based Arithmetic Library for Parallel CPU-GPU Architectures: Data Types and Features

In this paper a new software library for multiple-precision (integer and floating-point) and extended-range computations is considered. The library is targeted at heterogeneous CPU-GPU architectures. The use of residue number system (RNS), enabling effective parallelization of arithmetic operations, lies in the basis of library multiple-precision modules. The paper deals with the supported number formats and the library features. An algorithm for the selection of an RNS moduli set for a given precision of computations are also presented.

Konstantin Isupov, Alexander Kuvaev, Mikhail Popov, Anton Zaviyalov
Parallel Implementation of Cellular Automaton Model of the Carbon Corrosion Under the Influence of the Electrochemical Oxidation

In the paper we present a cellular automaton model of electrochemical oxidation of the carbon. A two-dimensional sample of the electro-conductive carbon black “Ketjenblack ES DJ 600” is simulated. In the model the sample consists of a ring-formed granules of carbon. The carbon granules under the influence of the electrochemical oxidation are destroyed through a few successive stages. The rates of these oxidation stages are chosen to fit the simulation result with the experiment. In result of a computer simulation of carbon electrochemical oxidation the portions of surface atoms and atoms with different degree of oxidation were calculated and compared with the experimental data. In addition, a parallel implementation of the cellular automaton simulating the carbon corrosion is developed and efficiency of the parallel code is analyzed.

A. E. Kireeva, K. K. Sabelfeld, N. V. Maltseva, E. N. Gribov
A Fine-Grained Parallel Particle Swarm Optimization on Many-core and Multi-core Architectures

Particle Swarm Optimization (PSO) is a stochastic metaheuristics yet very robust. Real-world optimizations require a high computational effort to converge to a viable solution. In general, parallel PSO implementations provide good performance, but this depends on the parallelization strategy as well as the number and/or characteristics of the exploited processors. In this paper, we propose a fine-grained paralellization strategy that focuses on the work done w.r.t. each of the problem dimensions and does it in parallel. Moreover, all particles act in parallel. This strategy is useful in computationally demanding optimization problems wherein the objective function has a very large number of dimensions. We map the computation onto three different parallel high-performance multiprocessor architectures, which are based on many and multi-core architectures. The performance of the proposed strategy is evaluated for four well-known benchmarks with high-dimension and different complexity. The obtained speedups are very promising.

Nadia Nedjah, Rogério de Moraes Calazan, Luiza de Macedo Mourelle
The Implementation of Cellular Automata Interference of Two Waves in LuNA Fragmented Programming System

In this paper, a parallel implementation of the cellular-automata interference algorithm for two waves using the fragmented programming technology and LuNA system based on it is proposed. The technology is based on a strategy of data flow control. Unlike existing systems and technologies, LuNA provides a unified technology for implementing parallel programs on a heterogeneous multicomputer. The LuNA program contains a description of data fragments, computational fragments, and information dependencies between them. In the work, the LuNA program was executed on a computational cluster with homogeneous nodes. The results of comparison of the LuNA and MPI implementations showed that the execution time of the LuNA program exceeded that of the MPI program. This is due to the peculiarities of algorithms used for the distribution, search and transfer of data and computation fragments between the nodes of a cluster. The complexity of writing the LuNA program is much lower than for the MPI program.

V. P. Markova, M. B. Ostapkevich
A New Class of the Smallest Four-State Partial FSSP Solutions for One-Dimensional Ring Cellular Automata

The synchronization in cellular automata has been known as the firing squad synchronization problem (FSSP) since its development, where the FSSP gives a finite-state protocol for synchronizing a large scale of cellular automata. A quest for smaller state FSSP solutions has been an interesting problem for a long time. Umeo, Kamikawa and Yunès [9] answered partially by introducing a concept of partial FSSP solutions and proposing a full list of the smallest four-state symmetric powers-of-2 FSSP protocols that can synchronize any one-dimensional (1D) ring cellular automata of length $$n=2^{k}$$n=2k for any positive integer $$k \ge 1$$k≥1. Afterwards, Ng [7] also added a list of asymmetric FSSP partial solutions, thus completing the four-state powers-of-2 FSSP partial solutions. The number four is the smallest one in the class of FSSP protocols proposed so far. A question remained is that “are there any other four-state partial solutions?”. In this paper, we answer to the question by proposing a new class of the smallest four-state FSSP protocols that can synchronize any 1D ring of length $$n=2^{k}-1$$n=2k-1 for any positive integer $$k \ge 2$$k≥2. We show that the class includes a rich variety of FSSP protocols that consists of 39 symmetric solutions and 132 asymmetric ones, ranging from minimum-time to linear-time in synchronization steps. In addition, we make an investigation into several interesting properties of these partial solutions such as swapping general states, a duality between them, inclusion of powers-of-2 solutions, reflected solutions and so on.

Hiroshi Umeo, Naoki Kamikawa
Properties of the Conservative Parallel Discrete Event Simulation Algorithm

We address question of synchronisation in parallel discrete event simulation (PDES) algorithms. We study synchronisation in conservative PDES model adding long-range connections between processing elements. We investigate how fraction of the random long-range connections in the synchronisation scheme influences the simulation time profile of PDES. We found that small fraction of random distant connections enhance synchronisation, namely, the width of the local virtual times remains constant with increasing number of processing elements. At the same time the conservative algorithm of PDES on small-world networks remains free from deadlocks. We compare our results with the case-study simulations.

Liliia Ziganurova, Lev Shchur

Organization of Parallel Computation

Frontmatter
Combining Parallelization with Overlaps and Optimization of Cache Memory Usage

This paper allows L. Lamport hyperplane method modified for improvement of the temporal data locality. Gauss-Seidel algorithm optimized by modified hyperplane method is faster than non-optimized in 2.5 times. This algorithm was paralleled by the technique of data placement with overlaps and we have got the speedup in 28 times on 16 processors in comparison with the non-optimized sequential algorithm.

S. G. Ammaev, L. R. Gervich, B. Y. Steinberg
Defining Order of Execution in Aspect Programming Language

A fragmented approach to parallel programming and its implementation in the Aspect programming language are considered. Approach to define order of execution of computation fragments in Aspect language is described and illustrated by the example of matrix LU decomposition task.

Sergey Arykov
Automated GPU Support in LuNA Fragmented Programming System

The paper is devoted to the problem of reduction of complexity of development of numerical parallel programs for distributed memory computers with hybrid (CPU+GPU) computing nodes. The basic idea is to employ a high-level representation of an application algorithm to allow its automated execution on multicomputers with hybrid nodes without a programmer having to do low-level programming. LuNA is a programming system for numerical algorithms, which implements the idea, but only for CPU. In the paper we propose a LuNA language extension, as well as necessary run-time algorithms to support GPU utilization. For that a user only has to provide a limited number of computational GPU procedures using CUDA, while the system will take care of such associated low-level problems, as jobs scheduling, CPU-GPU data transfer, network communications and others. The algorithms developed and implemented take advantage of concerning informational dependencies of an application and support automated tuning to available hardware configuration and application input data.

Belyaev Nikolay, Vladislav Perepelkin
Automation Development Framework of Scalable Scientific Web Applications Based on Subject Domain Knowledge

Currently high-performance computing technologies using computational capabilities for solving scientific, are actively improving. The purpose of our research is the development of toolkit for construction and execution of scientific service-oriented application in heterogeneous distributed computing environment (HDCE). These tools provide the access for subject domain experts to the high-capacity computing resource, using these resources without extensive knowledge of computing architecture and low-level software, and the parallel execution of the user application on the base of the service-oriented technology and multi-agent control. We describe an architecture and functional capabilities of automated toolkit for the service-oriented application creation based on applied programs package, and multi-agent control of this application parallel running in HDCE. We demonstrate an example of the creation of the web-application for parametric feedback synthesis of linear dynamic object by these tools. The offered technology allows simplifying service creation and provides new qualitative opportunities of controlling parallel high-performance computations.

Igor V. Bychkov, Gennady A. Oparin, Vera G. Bogdanova, Anton A. Pashinin, Sergey A. Gorsky
Stopwatch Automata-Based Model for Efficient Schedulability Analysis of Modular Computer Systems

In this paper we propose a stopwatch automata-based model of a modular computer system operation. This model provides an ability to perform schedulability analysis for a wide class of modular computer systems. It is formally proven that the model satisfies a set of correctness requirements. It is also proven that all the traces, generated by the model interpretation, are equivalent for schedulability analysis purposes. The traces equivalence allows to use any trace for analysis and therefore the proposed approach is much more efficient than Model Checking, especially for parallel systems with many simultaneous events. The software implementation of the proposed approach is also presented in the paper.

Alevtina Glonina, Anatoly Bahmurov
Parallelizing Inline Data Reduction Operations for Primary Storage Systems

Data reduction operations such as deduplication and compression are widely used to save storage capacity in primary storage system. These operations are compute-intensive. High performance storage devices like SSDs are widely used in most primary storage systems. Therefore, data reduction operations become a performance bottleneck in SSD-based primary storage systems.In this paper, we propose a parallel data reduction technique on data deduplication and compression utilizing both multi-core CPU and GPU in an integrated manner. First, we introduce bin-based data deduplication, a parallel technique on deduplication, where CPU-based parallelism is mainly applied whereas GPU is utilized as co-processor of CPU. Second, we also propose a parallel technique on compression, where main computation is done by GPU while CPU is responsible only for post-processing. Third, we propose a parallel technique handling both deduplication and compression in an integrated manner, where our technique controls when and how to use GPU. Experimental evaluation shows that our proposed techniques can achieve 15.0%, 88.3%, and 89.7% better throughput than the case where only CPU is applied for deduplication, compression, and integrated data reductions, respectively. Our proposed technique enables easy application of data reduction operations to SSD-based primary storage systems.

Jeonghyeon Ma, Chanik Park
Distributed Algorithm of Dynamic Multidimensional Data Mapping on Multidimensional Multicomputer in the LuNA Fragmented Programming System

The distributed algorithm Patch with local communications for dynamic data allocation of a distributed multicomputer in the course of an application LuNA fragmented program execution is presented. The objective of the Patch is to decrease the length and as result the volume of communications while the parallel program is executed. Communications include all the internode interactions for data processing, dynamic data allocation, search and balancing. The Patch takes into account the data dependencies and maximally tries to keep the data locality during all the internode interactions.

Victor E. Malyshkin, Georgy A. Schukin
Probabilistic Causal Message Ordering

Causal broadcast is a classical communication primitive that has been studied for more then three decades and several implementations have been proposed. The implementation of such a primitive has a non negligible cost either in terms of extra information messages have to carry or in time delays needed for the delivery of messages. It has been proved that messages need to carry a control information the size of which is linear with the size of the system. This problem has gained more interest due to new application domains such that collaborative applications are widely used and are becoming massive and social semantic web and linked-data the implementation of which needs causal ordering of messages. This paper proposes a probabilistic but efficient causal broadcast mechanism for large systems with changing membership that uses few integer timestamps.

Achour Mostéfaoui, Stéphane Weiss
An Experimental Study of Workflow Scheduling Algorithms for Heterogeneous Systems

The paper studies the efficiency of nine state-of-the-art algorithms for scheduling of workflow applications in heterogeneous computing systems (HCS). The comparison of algorithms is performed on the base of discrete-event simulation for a wide range of workflow and system configurations. The developed open source simulation framework based on SimGrid toolkit allowed us to perform a large number of experiments in a reasonable amount of time and to ensure reproducible results. The accuracy of the used network model helped to reveal drawbacks of simpler models commonly used for studying scheduling algorithms.

Alexey Nazarenko, Oleg Sukhoroslov
PGAS Approach to Implement Mapreduce Framework Based on UPC Language

Over the years from its introduction Mapreduce technology proved to be very effective parallel programming technique to process large volumes of data. One of the most prevalent implementations of Mapreduce is Hadoop framework and Google proprietary Mapreduce system.Out of other notable implementations one should mention recent PGAS (partitioned global address space) – based X10, UPC (Unified Parallel C) versions. These implementations present a new viewpoint when Mapreduce application developers can benefit from using global address space model while writing data parallel tasks. In this paper we introduce a novel UPC implementation of Mapreduce technology based on idea of using purely UPC based implementation of shared hashmap data structure as an intermediate key/value store. Shared hashmap is used in to perform exchange of key/values between parallel UPC threads during shuffle phase of Mapreduce framework. The framework also allows to express data parallel applications using simple sequential code.Additionally, we present a heuristic approach based on genetic algorithm that could efficiently perform load balancing optimization to distribute key/values among threads such that we minimize data movement operations and evenly distribute computational workload.Results of evaluation of Mapreduce on UPC framework based on WordCount benchmark application are presented and compared to Apache Hadoop implementation.

Shomanov Aday, Akhmed-Zaki Darkhan, Mansurova Madina
Islands-of-Cores Approach for Harnessing SMP/NUMA Architectures in Heterogeneous Stencil Computations

SMP/NUMA systems are powerful HPC platforms which could be applied for a wide range of real-life applications. These systems provide large capacity of shared memory, and allow using the shared-variable programming model to take advantages of shared memory for inter-process communications and synchronizations. However, as data can be physically dispersed over many nodes, the access to various data items may require significantly different times. In this paper, we face the challenge of harnessing the heterogeneous nature of SMP/NUMA communications for a complex scientific application which implements the Multidimensional Positive Definite Advection Transport Algorithm (MPDATA), consisting of a set of heterogeneous stencil computations.When using our method of MPDATA workload distribution, which was successfully applied for small-scale shared memory systems with several CPUs and/or accelerators, significant performance losses are noticeable for larger SMP/NUMA systems, such as SGI UV 2000 server used in this work. To overcome this shortcoming, we propose a new islands-of-cores approach. It exposes a correlation between computation and communication for heterogeneous stencils, and enables an efficient management of trade-off between computation and communication costs in accordance with the features of SMP/NUMA systems. In consequence, when using the maximum configuration with 112 cores of 14 Intel Xeon E5-4627v2 3.3 GHz processors, the proposed approach accelerates the previous method more then 10 times, achieving about 390 Gflop/s, or approximately 30% of the theoretical peak performance.

Lukasz Szustak, Roman Wyrzykowski, Ondřej Jakl
The Algorithm of Control Program Generation for Optimization of LuNA Program Execution

LuNA fragmented programming system is a high-level declarative system of parallel programming. Such systems have the problem of achieving on appropriate program execution performance in comparison with MPI. The reasons are a high degree of parallel program execution non-determinism and execution overhead. The paper presents an algorithm of control program generation for LuNA programs. That is a step towards automatic improvement of LuNA program execution performance. Performance tests presented show effectiveness of the proposed approach.

Anastasia A. Tkacheva
Cyclic Anticipation Scheduling in Grid VOs with Stakeholders Preferences

In this work, a job-flow scheduling approach for Grid virtual organizations (VOs) is proposed and studied. Users’ and resource providers’ preferences, VOs internal policies, resources geographical distribution along with local private utilization impose specific requirements for efficient scheduling according to different, usually contradictive, criteria. With increasing resources utilization level the available resources set and corresponding decision space are reduced. This further complicates the problem of efficient scheduling. In order to improve overall scheduling efficiency, we propose an anticipation scheduling approach based on a cyclic scheduling scheme. It generates a near optimal but infeasible scheduling solution and includes a special replication procedure for efficient and feasible resources allocation. Anticipation scheduling is compared with the general cycle scheduling scheme and conservative backfilling using such criteria as average jobs’ response time (start and finish times) as well as users’ and VO economic criteria (execution time and cost).

Victor Toporkov, Dmitry Yemelyanov, Anna Toporkova, Petr Potekhin

Parallel Computing Applications

Frontmatter
Comparison of Auction Methods for Job Scheduling with Absolute Priorities

The model of geographically distributed computing system with absolute priorities of jobs is described in the paper. Authors designed the decentralized scheduling algorithm using the auction methods. Two auction methods were researched and compared: the first-price sealed-bid auction and the English auction. The paper includes results of experimental comparison of researched auction methods.

Anton Baranov, Pavel Telegin, Artem Tikhomirov
Parallel Algorithm for Solving Constrained Global Optimization Problems

This work considers a parallel algorithm for solving multiextremal problems with non-convex constraints. The distinctive feature of this algorithm, which does not use penalty functions, is the separate consideration of each problem constraint. The search process can be conducted by reducing the original multidimensional problem to a number of related one-dimensional problems and solving this set of problems in parallel. An experimental assessment of parallel algorithm efficiency was conducted by finding the numeric solution to several hundred randomly generated multidimensional multiextremal problems with non-convex constraints.

Konstantin Barkalov, Ilya Lebedev
Parallelizing Metaheuristics for Optimal Design of Multiproduct Batch Plants on GPU

We propose a metaheuristics-based approach to the optimal design of multi-product batch plants, with a particular application example of chemical-engineering systems. Our hybrid approach combines two metaheuristics: Ant Colony Optimization (ACO) and Simulated Annealing (SA). We develop a sequential implementation of the proposed method and we parallelize it on Graphics Processing Units (GPU) using the CUDA programming environment. We experimentally demonstrate that the results of our hybrid metaheuristic approach (ACO+SA) are very near to the global optimal solutions, but they are produced much faster than using the deterministic Branch-and-Bound approach.

Andrey Borisenko, Sergei Gorlatch
The Optimization of Traffic Management for Cloud Application and Services in the Virtual Data Center

Nowadays one of the problems of optimization is the control of the traffic in cloud applications and services in the network environment of virtual data center. Taking into account the multitier architecture of modern data centers, we need to pay a special attention to this task. The advantage of modern infrastructure virtualization is the possibility to use software-defined networks and software-defined data storages. However, the existing optimization of algorithmic solutions does not take into account the specific features of the heterogeneous network traffic routing with multiple application types. The task of optimizing traffic distribution for cloud applications and services can be solved by using software-defined infrastructure of virtual data centers. We have developed a simulation model for the traffic in software-defined networks segments of virtual data centers involved in processing user requests to cloud application and services within a network environment. Our model enables to implement the traffic management algorithm of cloud applications and optimize the access to storage systems through the effective use of data transmission channels. During the experimental studies, we have found that the use of our algorithm enables to decrease the response time of cloud applications and services and, therefore, increase the productivity of user requests processing and reduce the number of refusals.

Irina Bolodurina, Denis Parfenov
Distributed Data Fusion for the Internet of Things

The ubiquitous Internet of Things is underpinned by the recent advancements in the wireless networking technology, which enabled connecting previously scattered devices into the global network. IoT engineers, however, are required to handle current limitations and find the right balance between data transferring range, throughput, and power consumption of wireless IoT devices. As a result, existing IoT systems, based on collecting data from a distributed network of edge devices, are limited by the amount of data they are able to transfer over the network. This means that some sort of data fusion mechanism has to be introduced, which would be responsible for filtering raw data before sending them further to a next node through the network. As a potential way of implementing such a mechanism, this paper proposes utilising Complex Event Processing and introduces a hierarchical distributed architecture for enabling data fusion at various levels.

Rustem Dautov, Salvatore Distefano
Scalable Computations of GeRa Code on the Base of Software Platform INMOST

The hydrogeological modeling code GeRa is based on INMOST software platform, which operates with distributed mesh data and allows to assemble and solve the system of linear equations. The set of groundwater flow models with filtration, transport, and chemical processes are considered. The comparison of parallel efficiency for different linear solvers in the INMOST framework is performed. The analysis of scalability of GeRa code on different computer platforms from multicore laptop to Lomonosov supercomputer is presented.

Igor Konshin, Ivan Kapyrin
Parallel Computing for Time-Consuming Multicriterial Optimization Problems

In the present paper, an efficient method for parallel solving the time-consuming multicriterial optimization problems, where the optimality criteria can be multiextremal, and the computation of the criteria values can require a large amount of computations, is proposed. The proposed scheme of parallel computations allows obtaining several efficient decisions of a multicriterial problem. During performing the computations, the maximum use of the search information is provided. The results of the numerical experiments have demonstrated such an approach to allow reducing the computational costs of solving the multicriterial optimization problems essentially – several tens and hundred times.

Victor Gergel, Evgeny Kozinov
A Functional Approach to Parallelizing Data Mining Algorithms in Java

We describe a new approach to parallelizing data mining algorithms. We use the representation of an algorithm as a sequence of functions and we use higher-order functions to express parallel execution. Our approach generalizes the popular MapReduce programming model by enabling not only data-parallel, but also task-parallel implementation and a combination of both. We implement our approach as an extension of the industrial-strength library Xelopes, and we illustrate it by developing a multi-threaded Java program for the 1R classification algorithm, with experiments on a multi-core processor.

Ivan Kholod, Andrey Shorov, Sergei Gorlatch
Parallel Calculation of Diameter Constrained Network Reliability

The problem of network reliability calculation in case of the diameter constraint is studied. The problem of computing this characteristic is known to be NP-hard. We introduce the parallel methods, which are based on the well-known factoring method and on the factoring method modification proposed by H. Cancela and L. Petingi. The analysis of the numerical experiments has allowed us to set some important parameters of the parallel algorithm for speeding up calculations.

Sergei N. Nesterov, Denis A. Migov
Congestion Game Scheduling Implementation for High-Throughput Virtual Drug Screening Using BOINC-Based Desktop Grid

Virtual drug screening is one of the most common applications of high-throughput computing. As virtual screening is time consuming, a problem of obtaining a diverse set of hits in a short time is very important. We propose a mathematical model based on game theory. Task scheduling for virtual drug screening in high-performance computational systems is considered as a congestion game between computing nodes to find the equilibrium solutions for best balancing between the number of interim hits and their chemical diversity. We present the developed scheduling algorithm implementation for Desktop Grid and Enterprise Desktop Grid, and perform comprehensive computational experiments to evaluate its performance. We compare the algorithm with two known heuristics used in practice and observe that game-based scheduling outperforms them by the hits discovery rate and chemical diversity at earlier steps.

Natalia Nikitina, Evgeny Ivashko, Andrei Tchernykh
Globalizer – A Parallel Software System for Solving Global Optimization Problems

In this paper, we describe the Globalizer software system for solving global optimization problems. The system implements an approach to solving the global optimization problems using the block multistage scheme of the dimension reduction, which combines the use of Peano curve type evolvents and the multistage reduction scheme. The scheme allows an efficient parallelization of the computations and increasing the number of processors employed in the parallel solving of the global optimization problems many times.

Alexander Sysoyev, Konstantin Barkalov, Vladislav Sovrasov, Ilya Lebedev, Victor Gergel
A Novel String Representation and Kernel Function for the Comparison of I/O Access Patterns

Parallel I/O access patterns act as fingerprints of a parallel program. In order to extract meaningful information from these patterns, they have to be represented appropriately. Due to the fact that string objects can be easily compared using Kernel Methods, a conversion to a weighted string representation is proposed in this paper, together with a novel string kernel function called Kast Spectrum Kernel. The similarity matrices, obtained after applying the mentioned kernel over a set of examples from a real application, were analyzed using Kernel Principal Component Analysis (Kernel PCA) and Hierarchical Clustering. The evaluation showed that 2 out of 4 I/O access pattern groups were completely identified, while the other 2 conformed a single cluster due to the intrinsic similarity of their members. The proposed strategy can be promisingly applied to other similarity problems involving tree-like structured data.

Raul Torres, Julian Kunkel, Manuel F. Dolz, Thomas Ludwig
Backmatter
Metadata
Title
Parallel Computing Technologies
Editor
Prof. Dr. Victor Malyshkin
Copyright Year
2017
Electronic ISBN
978-3-319-62932-2
Print ISBN
978-3-319-62931-5
DOI
https://doi.org/10.1007/978-3-319-62932-2

Premium Partner