main-content

## Über dieses Buch

The two-volume set LNCS 10777 and 10778 constitutes revised selected papers from the 12th International Conference on Parallel Processing and Applied Mathematics, PPAM 2017, held in Lublin, Poland, in September 2017.

The 49 regular papers presented in the proceedings were selected from 98 submissions. For the workshops and special sessions, that were held as integral parts of the PPAM 2017 conference, a total of 51 papers was accepted from 75 submissions.

The papers were organized in topical sections named as follows:

Part I: numerical algorithms and parallel scientific computing; particle methods in simulations; task-based paradigm of parallel computing; GPU computing; parallel non-numerical algorithms; performance evaluation of parallel algorithms and applications; environments and frameworks for parallel/distributed/cloud computing; applications of parallel computing; soft computing with applications; and special session on parallel matrix factorizations.

Part II: workshop on models, algorithms and methodologies for hybrid parallelism in new HPC systems; workshop power and energy aspects of computations (PEAC 2017); workshop on scheduling for parallel computing (SPC 2017); workshop on language-based parallel programming models (WLPP 2017); workshop on PGAS programming; minisymposium on HPC applications in physical sciences; minisymposium on high performance computing interval methods; workshop on complex collective systems.

## Inhaltsverzeichnis

### An Experience Report on (Auto-)tuning of Mesh-Based PDE Solvers on Shared Memory Systems

With the advent of manycore systems, shared memory parallelisation has gained importance in high performance computing. Once a code is decomposed into tasks or parallel regions, it becomes crucial to identify reasonable grain sizes, i.e. minimum problem sizes per task that make the algorithm expose a high concurrency at low overhead. Many papers do not detail what reasonable task sizes are, and consider their findings craftsmanship not worth discussion. We have implemented an autotuning algorithm, a machine learning approach, for a project developing a hyperbolic equation system solver. Autotuning here is important as the grid and task workload are multifaceted and change frequently during runtime. In this paper, we summarise our lessons learned. We infer tweaks and idioms for general autotuning algorithms and we clarify that such a approach does not free users completely from grain size awareness.

Dominic E. Charrier, Tobias Weinzierl

### Using GPGPU Accelerated Interpolation Algorithms for Marine Bathymetry Processing with On-Premises and Cloud Based Computational Resources

Data crowdsourcing is one of most remarkable results of pervasive and internet connected low-power devices making diverse and different “things” as a world wide distributed system. This paper is focused on a vertical application of GPGPU virtualization software exploitation targeted on high performance geographical data interpolation. We present an innovative implementation of the Inverse Distance Weight (IDW) interpolation algorithm leveraging on CUDA GPGPUs. We perform tests in both physical and virtualized environments in order to demonstrate the potential scalability in production. We present an use case related to high resolution bathymetry interpolation in a crowdsource data context.

Livia Marcellino, Raffaele Montella, Sokol Kosta, Ardelio Galletti, Diana Di Luccio, Vincenzo Santopietro, Mario Ruggieri, Marco Lapegna, Luisa D’Amore, Giuliano Laccetti

### Relaxing the Correctness Conditions on Concurrent Data Structures for Multicore CPUs. A Numerical Case Study

The rise of new multicore CPUs introduced new challenges in the process of design of concurrent data structures: in addition to traditional requirements like correctness, linearizability and progress, the scalability is of paramount importance. It is a common opinion that these two demands are partially in conflict each others, so that in these computational environments it is necessary to relax the requirements on the traditional features of the data structures. In this paper we introduce a relaxed approach for the management of heap based priority queues on multicore CPUs, with the aim to realize a tradeoff between efficiency and sequential correctness. The approach is based on a sharing of information among only a small number of cores, so that to improve performance without completely losing the features of the data structure. The results obtained on a numerical algorithm show significant benefits in terms of parallel efficiency.

Giuliano Laccetti, Marco Lapegna, Valeria Mele, Raffaele Montella

### Energy Analysis of a 4D Variational Data Assimilation Algorithm and Evaluation on ARM-Based HPC Systems

Driven by the emerging requirements of High Performance Computing (HPC) architectures, the main focus of this work is the interplay of computational and energetic aspects of a Four Dimensional Variational (4DVAR) Data Assimilation algorithm, based on Domain Decomposition (named DD-4DVAR). We report first results on the energy consumption of the DD-4DVAR algorithm on embedded processor and a mathematical analysis of the energy behavior of the algorithm by assuming the architectures characteristics as variable of the model. The main objective is to capture the essential operations of the algorithm exhibiting a direct relationship with the measured energy. The experimental evaluation is carried out on a set of mini-clusters made available by the Barcelona Supercomputing Center.

Rossella Arcucci, Davide Basciano, Alessandro Cilardo, Luisa D’Amore, Filippo Mantovani

### Performance Assessment of the Incremental Strong Constraints 4DVAR Algorithm in ROMS

We consider the Incremental Strong constraint 4D VARiational (IS4DVAR) algorithm for data assimilation implemented in ROMS with the aim to study its performance in terms of strong scaling scalability on computing architectures such as a cluster of CPUs. We consider realistic test cases with data collected in enclosed and semi enclosed seas, namely, Caspian sea, West Africa/Angola, as well as data collected into the California bay. The computing architecture we use is currently available at Imperial College London. The analysis allows us to highlight that the ROMS-IS4DVAR performance on emerging architectures depends on a deep relation among the problems size, the domain decomposition approach and the computing architecture characteristics.

Luisa D’Amore, Rossella Arcucci, Yi Li, Raffaele Montella, Andrew Moore, Luke Phillipson, Ralf Toumi

### Evaluation of HCM: A New Model to Predict the Execution Time of Regular Parallel Applications on a Heterogeneous Cluster

In a previous work we proposed a new model that predicts the execution time of a regular application on a heterogeneous parallel environment. The model considers that a heterogeneous cluster is composed by distinct types of processors, accelerators and networks. This work further details and proposes some modifications to the original model, as well as evaluate it on a heterogeneous cluster environment. The results have show that the worst error in the estimations of the parallel execution time was about 12.7%, and, in many cases, the estimated execution time is equal to or very close to the actual one.

Thiago Marques Soares, Rodrigo Weber dos Santos, Marcelo Lobosco

### Applicability of the Empirical Mode Decomposition for Power Traces of Large-Scale Applications

Current trends in HPC show that exascale systems will be power capped, prompting their users to determine the best combination of resources to satisfy a power budget. Hence, performance and energy models must interplay and aid users in this resource selection based on the desired application parameters. While existing performance models may predict application execution at a scale, current power models are inadequate for this propose due, in part, to the variability of instantaneous dynamic power and the need to handle large amount of power measurements at the runtime to populate the models. In this paper, the latter challenge is tackled by selecting certain power measurements and applying to them the empirical mode decomposition (EMD) technique, which itself already deals with instantaneous variability of power during the runtime. Specifically, it is proposed here to apply EMD to segments of a power trace to rapidly generate a quadratic model that describes overall time, power, and thus energy simultaneously. The proposed models have been applied to several realistic applications. The error across the proposed models and the measured energy consumption is within 5% for the smaller segments consisting of 2,000 trace samples and is about 2% for the segments of 6,000 samples.

Gary Lawson, Masha Sosonkina, Tal Ezer, Yuzhong Shen

### Efficiency Analysis of Intel, AMD and Nvidia 64-Bit Hardware for Memory-Bound Problems: A Case Study of Ab Initio Calculations with VASP

Nowadays, the wide spectrum of Intel Xeon processors is available. The new Zen CPU architecture developed by AMD has extended the number of options for x86_64 HPC hardware. Moreover, Nvidia has released a custom 64-bit Denver architecture based on the ARM instruction set. This large number of options makes the optimal CPU choice for perspective HPC systems not a straightforward procedure. Such a co-design procedure should follow the requests from the end-users community. Modern computational materials science studies are among the major consumers of HPC resources worldwide. The VASP code is perhaps the most popular tool for these research. In this work, we discuss the benchmark metric and results based on a VASP test model that give us the possibility to compare different hardware and to distinguish the best options with respect to energy-to-solution criterion.

### GPU Power Modeling of HPC Applications for the Simulation of Heterogeneous Clouds

Hardware accelerators have been widely used in the scientific community, as the gain in the performance of HPC applications is significant. Hardware accelerators have been used in cloud computing as well, though existing cloud simulation frameworks do not support modeling and simulation of such hardware. Models for the estimation of the power consumption of accelerators have been proposed by many researchers, but they require large number of inputs and computations, making them unsuitable for hyper scale simulations. In previous work, a generic model for the estimation of the power consumption of accelerators has been proposed, that can be combined with generic CPU power models suitable for integration in hyper scale simulation environments. This paper extends this work by providing models for the energy consumption of GPUs and CPU-GPU pairs, that are experimentally validated with the use of different GPU hardware models and GPU intensive applications. The relative error between the actual and the estimated energy consumption is low, thus the proposed models provide accurate estimations and can be efficiently integrated into cloud simulation frameworks.

Antonios T. Makaratzis, Malik M. Khan, Konstantinos M. Giannoutakis, Anne C. Elster, Dimitrios Tzovaras

### Bi-cluster Parallel Computing in Bioinformatics – Performance and Eco-Efficiency

The paper discusses the selected bi-clustering algorithms in terms of energy efficiency. We demonstrate the need for the power aware software development, elaborate bi-clustering methods and applications, and describe the experimental computational cluster with a custom built energy measurement instrumentation.

Paweł Foszner, Przemysław Skurowski

### Performance and Energy Analysis of Scientific Workloads Executing on LPSoCs

Low-power system-on-chip (LPSoC) processors provide an interesting alternative as building blocks for future HPC systems due to their high energy efficiency. However, understanding their performance-energy trade-offs and minimizing the energy-to-solution for an application running across the heterogeneous devices of an LPSoC remains a challenge. In this paper, we describe our methodology for developing an energy model which may be used to predict the energy usage of application code executing on an LPSoC system under different frequency settings. For this paper, we focus only on the CPU. Performance and energy measurements are presented for different types of workloads on the NVIDIA Tegra TK1 and Tegra TX1 systems at varying frequencies. From these results, we provide insights on how to develop a model to predict energy usage at different frequencies for general workloads.

Anish Varghese, Joshua Milthorpe, Alistair P. Rendell

### Energy Efficient Dynamic Load Balancing over MultiGPU Heterogeneous Systems

Current HPC technologies demand high amounts of power and energy to achieve good performances. In order to address the next milestone in peak power, powerful graphic processing units and manycore processors present in current HPC systems need to be programmed having energy efficiency in mind. As energy efficiency is a major issue in this area, the existing codes and libraries need to be adapted to improve the use of the available resources. Rewriting the code requires deep knowledge of programming and architectural details to achieve good efficiency, which is an ad-hoc solution for a concrete system. We present the Ull_Multiobjective_Framework, an interface that allows automatic dynamic balance of the workload for parallel iterative codes in heterogeneous environments. UllMF allows to include the overall energy consumption as a parameter during the balancing process. This tool hides all architectural measurement details, requires very low effort to the programmer and introduces a minimum overhead. The calibration library has been used to solve iterative problems over heterogeneous platforms. To validate it, we present an analysis of different Dynamic Programming problems over different hardware configurations of a MultiGPU heterogeneous system.

Alberto Cabrera, Alejandro Acosta, Francisco Almeida, Vicente Blanco

### Scheduling Data Gathering with Maximum Lateness Objective

In this paper, scheduling in a star data gathering network is studied. The worker nodes of the network produce datasets that have to be gathered by a single base station. The datasets may be released at different moments. Each dataset is assigned a due date by which it should arrive at the base station. The scheduling problem is to organize the communication in the network so that the maximum dataset lateness is minimized. As this problem is strongly NP-hard, we propose a heuristic algorithm for solving it. The performance of the algorithm is evaluated on the basis of computational experiments.

Joanna Berlińska

### Fair Scheduling in Grid VOs with Anticipation Heuristic

In this work, a job-flow scheduling approach for Grid virtual organizations (VOs) is proposed and studied. Users’ and resource providers’ preferences, VOs internal policies along with local private utilization impose specific requirements for scheduling according to different, usually contradictive, criteria. We study the problem of a fair job batch scheduling with a relatively limited resources supply. With increasing resources utilization level the available resources set and corresponding decision space are reduced. The main problem is a scarce set of job execution alternatives which eliminates scheduling optimization. In order to improve overall scheduling efficiency we propose a heuristic anticipation approach. It generates a reference, most likely infeasible, scheduling solution. A special replication procedure performs a feasible solution with a minimum distance to a reference alternative under given metrics.

Victor Toporkov, Dmitry Yemelyanov, Anna Toporkova

### A Security-Driven Approach to Online Job Scheduling in IaaS Cloud Computing Systems

The paper presents a general framework to study issues of multi-objective on-line scheduling in the Infrastructure as a Service model of Cloud Computing (CC) systems taking into account the aspects of the total work-flow execution cost while meeting the deadline and risk rate constraints. Our goal is providing fairness between concurrent job submissions by minimizing tardiness of individual applications and dynamically rescheduling them to the best suited resources. The system, via the scheduling algorithms, is responsible to guarantee the corresponding Quality of Service (QoS) and Service Level Agreement (SLA) for all accepted jobs.

Jakub Gąsior, Franciszek Seredyński, Andrei Tchernykh

### Dynamic Load Balancing Algorithm for Heterogeneous Clusters

Half of the ten fastest supercomputers in the world use multiprocessors and accelerators. This hybrid environment, also present in personal computers and clusters, imposes new challenges to the programmer that wants to use all the processing power available on the hardware. OpenCL, OpenACC and other standards can help in the task of writing parallel code for heterogeneous platforms. However, some issues are not eliminated by such standards. Since multiprocessors and accelerators are different architectures and for this reason present distinct performance, data parallel applications have to find a data division that distributes the same amount of work to all devices, i.e., they have to finish their work in approximately the same time. This work proposes a dynamic load balancing algorithm that can be used in small-scale heterogeneous environments. A simulator of the Human Immune System (HIS) was used to evaluate the proposed algorithm. The results have shown that the dynamic load balancing algorithm was very effective in its purpose.

Tiago Marques do Nascimento, Rodrigo Weber dos Santos, Marcelo Lobosco

### Multi-Objective Extremal Optimization in Processor Load Balancing for Distributed Programs

The paper presents a multi-objective load balancing algorithm based on Extremal Optimization in execution of distributed programs. The Extremal Optimization aims in defining task migration as a means for improving balance in loading executive processors with program tasks. In the proposed multi-objective approach three objectives relevant in processor load balancing for distributed applications are jointly optimized. These objectives include: balance in computational load of distributed processors, total volume of inter-processor communication between tasks and task migration metrics. In the proposed Extremal Optimization algorithms a special approach called Guided Search is applied in selection of a new partial solution to be improved. It is supported by some knowledge of the problem in terms of computational and communication loads influenced by task migration. The proposed algorithms are assessed by simulation experiments with distributed execution of program macro data flow graphs.

Ivanoe De Falco, Eryk Laskowski, Richard Olejnik, Umberto Scafuri, Ernesto Tarantino, Marek Tudruj

### Pardis: A Process Calculus for Parallel and Distributed Programming in Haskell

Parallel and distributed programming involve substantial amounts of boilerplate code for process management and data synchronisation. This leads to increased bug potential and often results in unintended non-deterministic program behaviour. Moreover, algorithmic details are mixed with technical details concerning parallelisation and distribution. Process calculi are formal models for parallel and distributed programming but often leave details open, causing a gap between formal model and implementation. We propose a fully deterministic process calculus for parallel and distributed programming and implement it as a domain specific language in Haskell to address these problems. We eliminate boilerplate code by abstracting from the exact notion of parallelisation and encapsulating it in the implementation of our process combinators. Furthermore, we achieve correctness guarantees regarding process composition at compile time through Haskell’s type system. Our result can be used as a high-level tool to implement parallel and distributed programs.

Christopher Blöcker, Ulrich Hoffmann

### Towards High-Performance Python

Python became the preferred language for teaching in academia, and it is one of the most popular programming languages for scientific computing. This wide popularity occurs despite the weak performance of the language. This weakness is the motivation that drives the efforts devoted by the Python community to improve the performance of the language. In this article, we are following these efforts while we focus on one specific promised solution that aims to provide high-performance for Python applications.

Ami Marowka

### Actor Model of a New Functional Language - Anemone

This paper describes actor system of a new functional language called Anemone and compares it with actor systems of Scala and Erlang. Implementation details of the actor system are described. Performance evaluation is provided on sequential and concurrent programs.

Paweł Batko, Marcin Kuta

### Almost Optimal Column-wise Prefix-sum Computation on the GPU

The row-wise and column-wise prefix-sum computation of a matrix has many applications in the area of image processing such as computation of the summed area table and the Euclidean distance map. It is known that the prefix-sums of a 1-dimensional array can be computed efficiently on the GPU. Hence, the row-wise prefix-sums of a matrix can also be computed efficiently on the GPU by executing this prefix-sum algorithm for every row in parallel. However, the same approach does not work well for computing the column-wise prefix-sums, because inefficient stride memory access to the global memory is performed. The main contribution of this paper is to present an almost optimal column-wise prefix-sum algorithm on the GPU. Since all elements in an input matrix must be read and the resulting prefix-sums must be written, computation of the column-wise prefix-sums cannot be faster than simple matrix duplication in the global memory of the GPU. Quite surprisingly, experimental results using NVIDIA TITAN X show that our column-wise prefix-sum algorithm runs only 2–6% slower than matrix duplication. Thus, our column-wise prefix-sum algorithm is almost optimal.

Hiroki Tokura, Toru Fujita, Koji Nakano, Yasuaki Ito

### A Combination of Intra- and Inter-place Work Stealing for the APGAS Library

Since today’s clusters consist of nodes with multicore processors, modern parallel applications should be able to deal with shared and distributed memory simultaneously. In this paper, we present a novel hybrid work stealing scheme for the APGAS library for Java, which is a branch of the X10 project. Our scheme extends the library’s runtime system, which traditionally performs intra-node work stealing with the Java Fork/Join framework. We add an inter-node work stealing scheme that is inspired by lifeline-based global load balancing. The extended functionality can be accessed from the APGAS library with new constructs. Most important, locality-flexible tasks can be submitted with asyncAny, and are then automatically scheduled over both nodes and cores. In experiments with up to 144 workers on up to 12 nodes, our system achieved near linear speedups for three benchmarks.

Jonas Posner, Claudia Fohry

### Benchmarking Molecular Dynamics with OpenCL on Many-Core Architectures

Molecular Dynamics (MD) is a widely used tool for simulations of particle systems with pair-wise interactions. Since large scale MD simulations are very demanding in computation time, parallelisation is an important factor. As in the current HPC environment different heterogeneous computing architectures are emerging, a benchmark tool for a representative number of these architectures is desirable. OpenCL as a platform-overarching standard provides the capabilities for such a benchmark. This paper describes the implementation of an OpenCL MD benchmark code and discusses the results achieved on different types of computing hardware.

Rene Halver, Wilhelm Homberg, Godehard Sutmann

### Efficient Language-Based Parallelization of Computational Problems Using Cilk Plus

The aim of this paper is to evaluate Cilk Plus as a language-based tool for simple and efficient parallelization of recursively defined computational problems and other problems that need both task and data parallelization techniques. We show that existing source codes can be easily transformed to programs that can utilize multiple cores and additionally offload some computations to coprocessors like Intel Xeon Phi. We also advise how to improve simplicity and performance of data parallel algorithms by tuning data structures to utilize vector extensions of modern processors. Numerical experiments show that in most cases our Cilk Plus versions of Adaptive Simpson’s Integration and Belman-Ford Algorithm for solving single-source shortest-path problems achieve better performance than corresponding OpenMP programs.

Przemysław Stpiczyński

### A Taxonomy of Task-Based Technologies for High-Performance Computing

Task-based programming models for shared memory – such as Cilk Plus and OpenMP 3 – are well established and documented. However, with the increase in heterogeneous, many-core and parallel systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, despite the fact that dozens of different task-based systems exist today and are actively used for parallel and high-performance computing, no comprehensive overview or classification of task-based technologies for HPC exists.In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms. We demonstrate the usefulness of our taxonomy by classifying state-of-the-art task-based environments in use today.

Peter Thoman, Khalid Hasanov, Kiril Dichev, Roman Iakymchuk, Xavier Aguilar, Philipp Gschwandtner, Pierre Lemarinier, Stefano Markidis, Herbert Jordan, Erwin Laure, Kostas Katrinis, Dimitrios S. Nikolopoulos, Thomas Fahringer

### Interoperability of GASPI and MPI in Large Scale Scientific Applications

One of the main hurdles of a broad distribution of PGAS approaches is the prevalence of MPI, which as a de-facto standard appears in the code basis of many applications. To take advantage of the PGAS APIs like GASPI without a major change in the code basis, interoperability between MPI and PGAS approaches needs to be ensured. In this article, we address this challenge by providing our study and preliminary performance results regarding interoperating GASPI and MPI on the performance crucial parts of the Ludwig and iPIC3D applications. In addition, we draw a strategy for better coupling of both APIs.

Dana Akhmetova, Luis Cebamanos, Roman Iakymchuk, Tiberiu Rotaru, Mirko Rahn, Stefano Markidis, Erwin Laure, Valeria Bartsch, Christian Simmendinger

### Evaluation of the Parallel Performance of the Java and PCJ on the Intel KNL Based Systems

In this paper, we present performance and scalability of the Java codes parallelized on the Intel KNL platform using Java and PCJ Library. The parallelization is performed using PGAS programming model with no modification to Java language nor Java Virtual Machine. The obtained results show good overall performance, especially for parallel applications. The microbenchmark results, compared to the C/MPI, show that PCJ communication efficiency should be improved.

Marek Nowicki, Łukasz Górski, Piotr Bała

### Fault-Tolerance Mechanisms for the Java Parallel Codes Implemented with the PCJ Library

Parallel programs run on multiple processor systems with hundreds and thousands of cores. Because of a large number of nodes, failure can happen quite often, sometimes within hours. This makes fault-tolerance a crucial concern for nowadays HPC solutions.PCJ library is one of the most successful Java solutions allowing to parallelize large scale computations up to thousands of cores. This paper describes a minimal overhead failure discovery and mitigation mechanisms introduced to the PCJ library.Our implementation provides a programmer with a basic functionality to detect a node failure. Java exception mechanism and local node monitoring are used to detect execution problems. The problems are presented to the programmer through easy to use extensions. In the result, the programmer does not have to deal with low-level programming.The detailed scenario how to recover from the failure has to be decided and implemented by the programmer.

Michał Szynkiewicz, Marek Nowicki

### Exploring Graph Analytics with the PCJ Toolbox

Graph analysis is an intrinsic tool embedded in the big data domain. The demand in processing of bigger and bigger graphs requires highly efficient and parallel applications. In this work we explore the possibility of employing the new PCJ library for distributed calculations in Java. We apply the toolbox to sparse matrix matrix multiplications and the k-means clustering problem. We benchmark the strong scaling performance against an equivalent C++/MPI implementation. Our benchmarks found comparable good scaling results for algorithms using mainly local point-to-point communications, and exposed the potential for logarithmic collective operations directly available in the PCJ library. Further more, we also experienced an improvement of development time to solution, as a result of the high level abstractions provided by Java and PCJ.

Roxana Istrate, Panagiotis Kl. Barkoutsos, Michele Dolfi, Peter W. J. Staar, Costas Bekas

### Big Data Analytics in Java with PCJ Library: Performance Comparison with Hadoop

The focus of this article is to present Big Data analytics using Java and PCJ library. The PCJ library is an award-winning library for development of parallel codes using PGAS programming paradigm. The PCJ can be used for easy implementation of the different algorithms, including ones used for Big Data processing. In this paper, we present performance results for standard benchmarks covering different types of applications from computational intensive, through traditional map-reduce up to communication intensive. The performance is compared to one achieved on the same hardware but using Hadoop. The PCJ implementation has been used with both local file system and HDFS. The code written with the PCJ can be developed much faster as it requires a smaller number of libraries used. Our results show that applications developed with the PCJ library are much faster compare to Hadoop implementation.

Marek Nowicki, Magdalena Ryczkowska, Łukasz Górski, Piotr Bala

### Performance Comparison of Graph BFS Implemented in MapReduce and PGAS Programming Models

Computations based on graphs are very common problems but complexity, increasing size of analyzed graphs and a huge amount of communication make this analysis a challenging task. In this paper, we present a comparison of two parallel BFS (Breath-First Search) implementations: MapReduce run on Hadoop infrastructure and in PGAS (Partitioned Global Address Space) model. The latter implementation has been developed with the help of the PCJ (Parallel Computations in Java) - a library for parallel and distributed computations in Java. Both implementations realize the level synchronous strategy - Hadoop algorithm assumes iterative MapReduce jobs, whereas PCJ uses explicit synchronization after each level. The scalability of both solutions is similar. However, the PCJ implementation is much faster (about 100 times) than the MapReduce Hadoop solution.

Magdalena Ryczkowska, Marek Nowicki

### Efficient Parallel Generation of Many-Nucleon Basis for Large-Scale Ab Initio Nuclear Structure Calculations

We address the problem of generating a many-nucleon basis for ab initio nuclear structure modeling, which quickly becomes a significant runtime bottleneck for large model spaces. We first analyze the original basis generation algorithm, which does not employ multi-threading parallel paradigm. Based on the analysis, we propose and empirically evaluate a new efficient scalable basis generation algorithm. We report a reduction of basis generation runtime by a factor of 42 on the Blue Waters supercomputer and by two orders of magnitude on our test-bed computer system with Broadwell CPUs.

Daniel Langr, Tomáš Dytrych, Tomáš Oberhuber, František Knapp

### Parallel Exact Diagonalization Approach to Large Molecular Nanomagnets Modelling

The exact diagonalization method is used to calculate the energy levels of ring-shaped molecular nanomagnets of different sizes and spin numbers. Two-level hybrid parallelization is used to increase the efficiency and obtain the optimally balanced workload. The results of the successful runs of our application on two Tier-0 supercomputers are presented with emphasis on the satisfactory speedup obtained by threading the diagonalization process.

Michał Antkowiak

### Application of Numerical Quantum Transfer-Matrix Approach in the Randomly Diluted Quantum Spin Chains

The description of the numerical method of simulation based on the quantum transfer-matrix (QTM) approach is presented for diluted spin $$S=1/2$$S=1/2 chains. Modification of the extrapolation technique has been used to obtain better accuracy of numerical results. The simulations have been performed using the $$S=1/2$$S=1/2 antiferromagnetic Heisenberg model with the transverse staggered field and a uniform magnetic field perpendicular to the staggered field applicable for the diluted compound (Yb$$_{1-x}$$1-xLu$$_x$$x)$$_4$$4As$$_3$$3. In the model calculations the fixed microscopic parameters established earlier for the pure system have been assumed and the random impurity distribution has been considered. The experimental field-dependent specific heat of the polydomain diluted (Yb$$_{1-x}$$1-xLu$$_x$$x)$$_4$$4As$$_3$$3 sample is compared with that calculated using the HPC resources and providing additional verification of both the QTM method and the physical model.

Ryszard Matysiak, Philipp Gegenwart, Akira Ochiai, Frank Steglich

### A New Method for Solving Nonlinear Interval and Fuzzy Equations

In this paper, a new concept called “interval extended zero” method which recently was used for solving interval and fuzzy linear equations is adapted to the solution of nonlinear interval and fuzzy equations. The known “test” example of quadratic fuzzy equation is used to perform the advantages of a new method. In this example, only the positive solution can be obtained using known methods, whereas generally a negative fuzzy root can exits too. The sources of this problem are clarified. It is shown that opposite to the known methods, a new approach makes it possible to get both the positive and negative solutions of quadratic fuzzy equation. Generally, the developed method can be applied for solving a wide range of nonlinear interval and fuzzy equations if some initial constraints on the bounds of solution are known.

Ludmila Dymova, Pavel Sevastjanov

### Role of Hull-Consistency in the HIBA_USNE Multithreaded Solver for Nonlinear Systems

This paper considers incorporating a hull-consistency enforcing procedure in an interval branch-and-prune method. Hull-consistency has been used with interval algorithms in several solvers, but its implementation in a multithreaded environment is non-trivial. We describe arising issues and discuss the ways to deal with them. Numerical results for some benchmark problems are presented and analyzed.

Bartłomiej Jacek Kubica

### Parallel Computing of Linear Systems with Linearly Dependent Intervals in MATLAB

We implemented several known algorithms for finding an interval enclosure of the solution set of a linear system with linearly dependent interval parameters. To do that we have chosen MATLAB environment with use of INTLAB and VERSOFT libraries. Because our implementation is tested on Toeplitz and symmetric matrices, among others, there is a problem with a sparsity. We introduce straightforward format for representing such matrices, which seems to be almost as effective as the standard matrix representation but with less memory demands. Moreover, we take an advantage of Parallel Computing Toolbox to enhance the performance of implemented methods and to get more insights on how the methods stands in a scope of a tightness-performance ratio. The contribution is a time-tightness performance comparison of such methods, memory efficient representation and an exploration of explicit parallelization impact.

### What Decision to Make in a Conflict Situation Under Interval Uncertainty: Efficient Algorithms for the Hurwicz Approach

In this paper, we show how to take interval uncertainty into account when solving conflict situations. Algorithms for conflict situations under interval uncertainty are known under the assumption that each side of the conflict maximizes its worst-case expected gain. However, it is known that a more general Hurwicz approach provides a more adequate description of decision making under uncertainty. In this approach, each side maximizes the convex combination of the worst-case and the best-case expected gains. In this paper, we describe how to resolve conflict situations under the general Hurwicz approach to interval uncertainty.

Bartłomiej Jacek Kubica, Andrzej Pownuk, Vladik Kreinovich

### Practical Need for Algebraic (Equality-Type) Solutions of Interval Equations and for Extended-Zero Solutions

One of the main problems in interval computations is solving systems of equations under interval uncertainty. Usually, interval computation packages consider united, tolerance, and control solutions. In this paper, we explain the practical need for algebraic (equality-type) solutions, when we look for solutions for which both sides are equal. In situations when such a solution is not possible, we provide a justification for extended-zero solutions, in which we ignore intervals of the type $$[-a,a]$$[-a,a].

Ludmila Dymova, Pavel Sevastjanov, Andrzej Pownuk, Vladik Kreinovich

### Application of Local Search with Perturbation Inspired by Cellular Automata for Heuristic Optimization of Sensor Network Coverage Problem

A cellular automata inspired approach to the problem of effective energy management in a sensor network is presented. The network consisting of a set of sensors is disseminated over an area where a number of points of interest (POI) is localized. The aim is to maximize the time when a sufficient number of POIs is monitored by active sensors. A schedule of sensor activity over time is a solution of this problem. A new heuristic algorithm inspired by a cellular automata engine is proposed. It searches for such schedules maximizing the lifetime of the sensor network. We also present a set of test cases for experimental evaluation of our approach. The proposed algorithm is experimentally tested using these test cases and the obtained results are statistically verified to prove significant contribution of the algorithm components.

Krzysztof Trojanowski, Artur Mikitiuk, Krzysztof J. M. Napiorkowski

### A Fuzzy Logic Inspired Cellular Automata Based Model for Simulating Crowd Evacuation Processes

This work investigates the incorporation of fuzzy logic principles in a cellular automata (CA) based model that simulates crowd dynamics and crowd evacuation processes with the usage of a Mamdani type fuzzy inference system. Major attributes of the model that affect its response, such as orientation, have been deployed as linguistic variables whose values are words rather than numbers. Thus, a basic concept of fuzzy logic is realised. Moreover, fuzzy if-then rules constitute the mechanism that deals with fuzzy consequents and fuzzy antecedents. The proposed model also maintains its CA prominent features, thus exploiting parallel activation of transition rules for all cells and efficient use of computational resources. In case of evacuation, the selection of the appropriate path is primarily addressed using the criterion of distance. To further speed up the execution of the Fuzzy CA model the concept of the inherent parallelization was considered through the GPU programming principles. Finally, validation process of the proposed model incorporates comparison of the corresponding fundamental diagram with those from the literature for a building that has been selected for hosting the museum ‘CONSTANTIN XENAKIS’, in Serres, Greece.

Prodromos Gavriilidis, Ioannis Gerakakis, Ioakeim G. Georgoudas, Giuseppe A. Trunfio, Georgios Ch. Sirakoulis

### Nondeterministic Cellular Automaton for Modelling Urban Traffic with Self-organizing Control

Controlling flow in networks by means of decentralized strategies have gained a lot of attention in recent years. Typical advantages of such approach – efficiency, scalability, versatility, fault tolerance – make it an interesting alternative to more traditional, global optimization. In the paper it is shown how the continuous, macroscopic, self-organizing control proposed by Lämmer and Helbing [10] can be implemented in the discrete, nondeterministic cellular automaton (CA) model of urban traffic. Using various examples, it is demonstrated that the decentralized approach outperforms the best nonresponsive solution based on fixed cycles. In order to analyse relatively large parameter space, an HPC cluster has been used to run multiple versions of a serial CA simulator. The presented model can serve as a test bed for testing other optimization methods and vehicle routing algorithms realized with the use of CA.

Jacek Szklarski

### Towards Multi-Agent Simulations Accelerated by GPU

At present, GPUs (Graphics Processing Units) are commonly used to speedup any kind of computations. In this paper we present how GPUs and Nvidia CUDA can be used to accelerate the updating of and agent state in Multi-Agent Simulations. We use the AgE (Agent Evolution) software framework written in Java, which supports agent-based computations. In our simulations agents represent living organisms that interact with the virtual habitat and with each other. At each step of the simulation thousands of agents update their state according to a defined set of rules. We use Java bindings for CUDA (JCUDA) to move massive computations to GPU.

Kamil Piętak, Paweł Topa

### Tournament-Based Convection Selection in Evolutionary Algorithms

One of the problems that single-threaded (non-parallel) evolutionary algorithms encounter is premature convergence and the lack of diversity in the population. To counteract this problem and improve the performance of evolutionary algorithms in terms of the quality of optimized solutions, a new subpopulation-based selection scheme – the convection selection – is introduced and analyzed in this work. This new selection scheme is compared against traditional selection of individuals in a single-population evolutionary processes. The experimental results indicate that the use of subpopulations with fitness-based assignment of individuals yields better results than both random assignment and a traditional, non-parallel evolutionary architecture.

### Multi-agent Systems Programmed Visually with Google Blockly

In this paper we propose a very user-friendly method for programming multi-agent systems. We use the well known visual programming library Blockly from Google. With this library the behaviour of agents can by programmed intuitively even by those not skilled in programming. We demonstrate this idea using an agent-based simulator named eVolutus, designed and implemented for conducting large scale ecological and evolutionary experiments.

Szymon Górowski, Robert Maguda, Paweł Topa

### Backmatter

Weitere Informationen