scroll identifier for mobile
main-content

## Über dieses Buch

The two-volume set LNCS 10777 and 10778 constitutes revised selected papers from the 12th International Conference on Parallel Processing and Applied Mathematics, PPAM 2017, held in Lublin, Poland, in September 2017.

The 49 regular papers presented in this volume were selected from 98 submissions. For the workshops and special sessions, that were held as integral parts of the PPAM 2017 conference, a total of 51 papers was accepted from 75 submissions.

The papers were organized in topical sections named as follows:

Part I: numerical algorithms and parallel scientific computing; particle methods in simulations; task-based paradigm of parallel computing; GPU computing; parallel non-numerical algorithms; performance evaluation of parallel algorithms and applications; environments and frameworks for parallel/distributed/cloud computing; applications of parallel computing; soft computing with applications; and special session on parallel matrix factorizations.

Part II: workshop on models, algorithms and methodologies for hybrid parallelism in new HPC systems; workshop power and energy aspects of computations (PEAC 2017); workshop on scheduling for parallel computing (SPC 2017); workshop on language-based parallel programming models (WLPP 2017); workshop on PGAS programming; minisymposium on HPC applications in physical sciences; minisymposium on high performance computing interval methods; workshop on complex collective systems.

## Inhaltsverzeichnis

### Advances in Incremental PCA Algorithms

We present a range of new incremental (single-pass streaming) algorithms for incremental principal components analysis (IPCA) and show that they are more effective than exiting ones. IPCA algorithms process the columns of a matrix A one at a time and attempt to build a basis for a low-dimensional subspace that spans the dominant subspace of A. We present a unified framework for IPCA algorithms, show that many existing ones are parameterizations of it, propose new sophisticated algorithms, and show that both the new algorithms and many existing ones can be implemented more efficiently than was previously known. We also show that many existing algorithms can fail even in easy cases and we show experimentally that our new algorithms outperform existing ones.

Tal Halpern, Sivan Toledo

### Algorithms for Forward and Backward Solution of the Fokker-Planck Equation in the Heliospheric Transport of Cosmic Rays

Motion of charged particles in an inhomogeneous turbulent medium as magnetic field is described by partial differential equations of the Fokker-Planck-Kolmogorov type. We present an algorithm of numerical solution of the four-dimensional Fokker-Planck equation in three-dimensional spherical coordinates system. The algorithm is based on Monte Carlo simulations of the stochastic motion of quasi-particles guided by the set of stochastic differential equations corresponding to the Fokker-Planck equation by the Ito formalism. We present the parallel algorithm in Julia programming language. We simulate the transport of cosmic rays in the heliosphere considering the full three-dimensional diffusion tensor. We compare forward- and backward-in-time solutions of the transport equation and discuss its computational advantages and disadvantages.

Anna Wawrzynczak, Renata Modzelewska, Agnieszka Gil

### Efficient Evaluation of Matrix Polynomials

We revisit the problem of evaluating matrix polynomials and introduce memory and communication efficient algorithms. Our algorithms, based on that of Patterson and Stockmeyer, are more efficient than previous ones, while being as memory-efficient as Van Loan’s variant. We supplement our theoretical analysis of the algorithms, with matching lower bounds and with experimental results showing that our algorithms outperform existing ones.

Niv Hoffman, Oded Schwartz, Sivan Toledo

### A Comparison of Soft-Fault Error Models in the Parallel Preconditioned Flexible GMRES

The effect of two soft fault error models on the convergence of the parallel flexible GMRES (FGMRES) iterative method solving an elliptical PDE problem on a regular grid is evaluated. We consider two types of preconditioners: an incomplete LU factorization with dual threshold (ILUT), and an algebraic recursive multilevel solver (ARMS) combined with random butterfly transformation (RBT). The experiments quantify the difference between two soft fault error models considered in this study and compare their potential impact on the convergence.

Evan Coleman, Aygul Jamal, Marc Baboulin, Amal Khabou, Masha Sosonkina

### Multilayer Approach for Joint Direct and Transposed Sparse Matrix Vector Multiplication for Multithreaded CPUs

One of the most common operations executed on modern high-performance computing systems is multiplication of a sparse matrix by a dense vector within a shared-memory computational node. Strongly related but far less studied problem is joint direct and transposed sparse matrix-vector multiplication, which is widely needed by certain types of iterative solvers. We propose a multilayer approach for joint sparse multiplication that balances the workload of threads. Measurements prove that our algorithm is scalable and achieve high computational performance for multiple benchmark matrices that arise from various scientific and engineering disciplines.

Ivan Šimeček, Daniel Langr, Ivan Kotenkov

### Comparison of Parallel Time-Periodic Navier-Stokes Solvers

In this paper we compare two different methods to compute time-periodic steady states of the Navier-Stokes equations. The first one is a traditional time-stepping scheme which has to be evolved until the state is reached. The second one uses periodic boundary conditions in time and uses a spectral discretization in time. The methods are compared with regard to accuracy and scalability by solving for a time-periodic Taylor-Green vortex. We show that the time-periodic steady state can be computed much faster with the spectral in time method than with the standard time-stepping method if the Womersley number is sufficiently large.

Peter Arbenz, Daniel Hupp, Dominik Obrist

### Blocked Algorithms for Robust Solution of Triangular Linear Systems

We consider the problem of computing a scaling $$\alpha$$ such that the solution $${\varvec{x}}$$ of the scaled linear system $${\varvec{Tx}} = \alpha {\varvec{b}}$$ can be computed without exceeding an overflow threshold $$\varOmega$$. Here $${\varvec{T}}$$ is a non-singular upper triangular matrix and $${\varvec{b}}$$ is a single vector, and $$\varOmega$$ is less than the largest representable number. This problem is central to the computation of eigenvectors from Schur forms. We show how to protect individual arithmetic operations against overflow and we present a robust scalar algorithm for the complete problem. Our algorithm is very similar to xLATRS in LAPACK. We explain why it is impractical to parallelize these algorithms. We then derive a robust blocked algorithm which can be executed in parallel using a task-based run-time system such as StarPU. The parallel overhead is increased marginally compared with regular blocked backward substitution.

Carl Christian Kjelgaard Mikkelsen, Lars Karlsson

### A Comparison of Accuracy and Efficiency of Parallel Solvers for Fractional Power Diffusion Problems

In this paper, we construct and investigate parallel solvers for three dimensional problems described by fractional powers of elliptic operators. The main aim is to make a scalability analysis of parallel versions of several state of the art solvers. The originality of this work is that we also consider the accuracy of the selected numerical algorithms. For comparison of accuracy, we use solutions obtained solving the test problem by the Fourier algorithm. Such analysis enables to compare the efficiency of the proposed parallel algorithms depending on the required accuracy of solution and on a number of processes used in computations.

Raimondas Čiegis, Vadimas Starikovičius, Svetozar Margenov, Rima Kriauzienė

### Efficient Cross Section Reconstruction on Modern Multi and Many Core Architectures

The classical Monte Carlo (MC) neutron transport employs energy lookup on long tables to compute the cross sections needed for the simulation. This process has been identified as an important performance hotspot of MC simulations, because poor cache utilization caused by random access patterns and large memory footprint makes it unfriendly to modern architectures. A former study [1] shows that such method presents little vectorization potential in a real-case simulation due to the memory-bound nature. In this paper, we revisit a cross section reconstruction method introduced by Hwang [2] to evaluate another solution. The reconstruction converts the problem from memory-bound to compute-bound. Only several variables for each resonance are required instead of the conventional pointwise table covering the entire resolved resonance region. Though the memory space is largely reduced, this method is really time-consuming. After a series of optimizations, results show that the reconstruction kernel benefits well from vectorization and can achieve 1806 GFLOPS (single precision) on a Knights Landing 7250, which represents 67% of its effective peak performance.

Yunsong Wang, François-Xavier Hugot, Emeric Brun, Fausto Malvagi, Christophe Calvin

### Parallel Assembly of ACA BEM Matrices on Xeon Phi Clusters

The paper presents parallelization of the boundary element method in distributed memory of a cluster equipped with many-core based compute nodes. A method for efficient distribution of boundary element matrices among MPI processes based on the cyclic graph decompositions is described. In addition, we focus on the intra-node optimization of the code, which is necessary in order to fully utilize the many-core processors with wide SIMD registers. Numerical experiments carried out on a cluster consisting of the Intel Xeon Phi processors of the Knights Landing generation are presented.

Michal Kravcenko, Lukas Maly, Michal Merta, Jan Zapletal

### Stochastic Bounds for Markov Chains on Intel Xeon Phi Coprocessor

The author presents an approach to find stochastic bounds for Markov chains with the use of Intel Xeon Phi coprocessor. A known algorithm is adapted to study the potential of the MIC architecture in algorithms needing a lot of memory access and exploit it in the best way.The paper also discusses possible sparse matrices storage schemes suitable to the investigated algorithm on Intel Xeon Phi coprocessor.The article shows also results of the experiments with the algorithm with different compile-time and runtime parameters (like scheduling, different number of threads, threads to cores mapping).

Jarosław Bylina

### Fast DEM Collision Checks on Multicore Nodes

Many particle simulations today rely on spherical or analytical particle shape descriptions. They find non-spherical, triangulated particle models computationally infeasible due to expensive collision detections. We propose a hybrid collision detection algorithm based upon an iterative solve of a minimisation problem that automatically falls back to a brute-force comparison-based algorithm variant if the problem is ill-posed. Such a hybrid can exploit the vector facilities of modern chips and it is well-prepared for the arising manycore era. Our approach pushes the boundary where non-analytical particle shapes and the aligning of more accurate first principle physics become manageable.

Konstantinos Krestenitis, Tobias Weinzierl, Tomasz Koziara

### A Space and Bandwidth Efficient Multicore Algorithm for the Particle-in-Cell Method

The Particle-in-Cell (PIC) method allows solving partial differential equation through simulations, with important applications in plasma physics. To simulate thousands of billions of particles on clusters of multicore machines, prior work has proposed hybrid algorithms that combine domain decomposition and particle decomposition with carefully optimized algorithms for handling particles processed on each multicore socket. Regarding the multicore processing, existing algorithms either suffer from suboptimal execution time, due to sorting operations or use of atomic instructions, or suffer from suboptimal space usage. In this paper, we propose a novel parallel algorithm for two-dimensional PIC simulations on multicore hardware that features asymptotically-optimal memory consumption, and does not perform unnecessary accesses to the main memory. In practice, our algorithm reaches 65% of the maximum bandwidth, and shows excellent scalability on the classical Landau damping and two-stream instability test cases.

Yann Barsamian, Arthur Charguéraud, Alain Ketterlin

### Load Balancing for Particle-in-Cell Plasma Simulation on Multicore Systems

Particle-in-cell plasma simulation is an important area of computational physics. The particle-in-cell method naturally allows parallel processing on distributed and shared memory. In this paper we address the problem of load balancing on multicore systems. While being well-studied for many traditional applications of the method, it is a relevant problem for the emerging area of particle-in-cell simulations with account for effects of quantum electrodynamics. Such simulations typically produce highly non-uniform, and sometimes volatile, particle distributions, which could require custom load balancing schemes. In this paper we present a computational evaluation of several standard and custom load balancing schemes for the particle-in-cell method on a high-end system with 96 cores on shared memory. We use a test problem with static non-uniform particle distribution and a real problem with account for quantum electrodynamics effects, which produce dynamically changing highly non-uniform distributions of particles and workload. For these problems the custom schemes result in increase of scaling efficiency by up to 20% compared to the standard OpenMP schemes.

Anton Larin, Sergey Bastrakov, Aleksei Bashinov, Evgeny Efimenko, Igor Surmin, Arkady Gonoskov, Iosif Meyerov

### The Impact of Particle Sorting on Particle-In-Cell Simulation Performance

The Particle-In-Cell (PIC) simulation method is a modern technique in studies of collisionless plasmas in applications to astrophysics and laboratory plasma physics. Inherent to this method is its parallel nature, which enables massively parallel MPI applications which can use thousands of CPU-cores on HPC systems. In order to achieve a good performance of a PIC code several techniques are available. In this work we study the impact of particle sorting on the performance of the PIC code THISMPI. We compare dual-pivot five-way quicksort with the standard quicksort. We focus on finding optimum sorting frequency.

Andrzej Dorobisz, Michał Kotwica, Jacek Niemiec, Oleh Kobzar, Artem Bohdan, Kazimierz Wiatr

#### Frontmatter

Task based parallel programming has shown competitive outcomes in many aspects of parallel programming such as efficiency, performance, productivity and scalability. Different approaches are used by different software development frameworks to provide these outcomes to the programmer, while making the underlying hardware architecture transparent to her. However, since programs are not portable between these frameworks, using one framework or the other is still a vital decision by the programmer whose concerns are expandability, adaptivity, maintainability and interoperability of the programs. In this work, we propose a unified programming interface that a programmer can use for working with different task based parallel frameworks transparently. In this approach we abstract the common concepts of task based parallel programming and provide them to the programmer in a single programming interface uniformly for all frameworks. We have tested the interface by running programs which implement matrix operations within frameworks that are optimized for shared and distributed memory architectures and accelerators, while the cooperation between frameworks is configured externally with no need to modify the programs. Further possible extensions of the interface and future potential research are also described.

Afshin Zafari

### Comparison of Time and Energy Oriented Scheduling for Task-Based Programs

The purpose of task scheduling is to find a beneficial assignment of tasks to execution units of a parallel system, where the specific goal is captured in a special optimization function, and tasks are usually described by corresponding properties, such as the execution time. However, today not only the parallel execution time is to be minimized, but also other metrics, such as the energy consumption. In this article, we investigate several scheduling algorithms with different frequency scaling policies. Our specific goal is to consider application specific scheduling with respect to time and energy. For this purpose, we use real measured data for the tasks leading to diverse effects concerning time, energy and power consumption. As application tasks we use the SPEC benchmarks.

Thomas Rauber, Gudula Rünger

### Experiments with Sparse Cholesky Using a Parametrized Task Graph Implementation

We describe the design of a sparse direct solver for symmetric positive-definite systems using the PaRSEC runtime system. In this approach the application is represented as a DAG of tasks and the runtime system runs the DAG on the target architecture. Portability of the code across different architectures is enabled by delegating to the runtime system the task scheduling and data management. Although runtime systems have been exploited widely in the context of dense linear algebra, the DAGs arising in sparse linear algebra algorithms remain a challenge for such tools because of their irregularity. In addition to overheads induced by the runtime system, the programming model used to describe the DAG impacts the performance and the scalability of the code. In this study we investigate the use of a Parametrized Task Graph (PTG) model for implementing a task-based supernodal method. We discuss the benefits and limitations of this model compared to the popular Sequential Task Flow model (STF) and conduct numerical experiments on a multicore system to assess our approach. We also validate the performance of our solver SpLLT by comparing it to the state-of-the-art solver MA87 from the HSL library.

Iain Duff, Florent Lopez

### A Task-Based Algorithm for Reordering the Eigenvalues of a Matrix in Real Schur Form

A task-based parallel algorithm for reordering the eigenvalues of a matrix in real Schur form is presented. The algorithm is realized on top of the StarPU runtime system. Only the aspects which are relevant for shared memory machines are discussed here, but the implementation can be configured to run on distributed memory machines as well. Various techniques to reduce the overhead and the core idle time are discussed. Computational experiments indicate that the new algorithm is between 1.5 and 6.6 times faster than a state of the art MPI-based implementation found in ScaLAPACK. With medium to large matrices, strong scaling efficiencies above 60% up to 28 CPU cores are reported. The overhead and the core idle time are shown to be negligible with the exception of the smallest matrices and highest core counts.

Mirko Myllykoski

### Radix Tree for Binary Sequences on GPU

In this paper, we present radix tree index structure (R-Trie) able to perform lookup over a set of keys of arbitrary length optimized for GPU processors. We present a fully parallel SIMD organized creation and search strategies. The R-Trie supports configurable bit stride for each level and nodes statistics for optimization purposes. We evaluate the performance using two search strategies and Longest Prefix Match (LPM) problem for computer networks. Unlike dedicated LPM algorithms we do not incorporate knowledge about the data or the network masks statistics into the tree construction or algorithm behavior. Our solution may be used in general purpose indexing structures where a batch search of massive number of keys is needed. (The research was funded by National Science Center, decision DEC-2012/07/D/ST6/02483.)

Krzysztof Kaczmarski, Albert Wolant

### A Comparison of Performance Tuning Process for Different Generations of NVIDIA GPUs and an Example Scientific Computing Algorithm

We consider the performance of a selected computational kernel from a scientific code on different generations of NVIDIA GPUs. The code that we use for tests is an OpenCL implementation of finite element numerical integration algorithm. In the current contribution we describe the performance tuning for the code, done by searching a parameter space associated with the code. The results of tuning for different generations of NVIDIA GPUs serve as a basis for analyses and conclusions.

Krzysztof Banaś, Filip Krużel, Jan Bielański, Kazimierz Chłoń

### NVIDIA GPUs Scalability to Solve Multiple (Batch) Tridiagonal Systems Implementation of cuThomasBatch

The solving of tridiagonal systems is one of the most computationally expensive parts in many applications, so that multiple studies have explored the use of NVIDIA GPUs to accelerate such computation. However, these studies have mainly focused on using parallel algorithms to compute such systems, which can efficiently exploit the shared memory and are able to saturate the GPUs capacity with a low number of systems, presenting a poor scalability when dealing with a relatively high number of systems. We propose a new implementation (cuThomasBatch) based on the Thomas algorithm. To achieve a good scalability using this approach is necessary to carry out a transformation in the way that the inputs are stored in memory to exploit coalescence (contiguous threads access to contiguous memory locations). The results given in this study proves that the implementation carried out in this work is able to beat the reference code when dealing with a relatively large number of Tridiagonal systems (2,000–256,000), being closed to $$3{\times }$$ (in double precision) and $$4{\times }$$ (in single precision) faster using one Kepler NVIDIA GPU.

Pedro Valero-Lara, Ivan Martínez-Pérez, Raül Sirvent, Xavier Martorell, Antonio J. Peña

### Two-Echelon System Stochastic Optimization with R and CUDA

Parallelizing of the supply chain simulator is considered in this paper. The simulator is a key element of the algorithm optimizing inventory levels and order sizes in a two-echelon logistic system. The mode of operation of the logistic system and the optimization problem are defined first. Then, the inventory optimization algorithm is introduced. Parallelization for CUDA platform is presented. Benchmarking of the parallelized code demonstrates high efficiency of the software hybrid.

Witold Andrzejewski, Maciej Drozdowski, Gang Mu, Yong Chao Sun

### Parallel Hierarchical Agglomerative Clustering for fMRI Data

This paper describes three parallel strategies for Ward’s algorithm with OpenMP or/and CUDA. Faced with the difficulty of a priori modelling of elicited brain responses by a complex paradigm in fMRI experiments, data-driven analysis have been extensively applied to fMRI data. A promising approach is clustering data which does not make stringent assumptions such as spatial independence of sources. Thirion et al. have shown that hierarchical agglomerative clustering (HAC) with Ward’s minimum variance criterion is a method of choice. However, HAC is computationally demanding, especially for distance computation. With our strategy, for single subject analysis, a speed-up of up to 7 was achieved on a workstation. For group analysis (concatenation of several subjects), a speed-up of up to 20 was achieved on a workstation.

Mélodie Angeletti, Jean-Marie Bonny, Franck Durif, Jonas Koko

### Two Parallelization Schemes for the Induction of Nondeterministic Finite Automata on PCs

In the paper we study the induction of minimal nondeterministic finite automata consistent with the sets of examples and counterexamples. The induced automata are minimal with respect to the number of states. We devise a generic parallel induction algorithm and two original parallelization schemes. The schemes take into account the possibility of solving the induction task on a PC with a multi-core processor. We consider theoretically different possible configurations of the parallelization schemes. We also provide some experimental results obtained for selected configurations.

Tomasz Jastrzab

### Approximating Personalized Katz Centrality in Dynamic Graphs

Dynamic graphs can capture changing relationships in many real datasets that evolve over time. One of the most basic questions about networks is the identification of the “most important” vertices in a network. Measures of vertex importance called centrality measures are used to rank vertices in a graph. In this work, we focus on Katz Centrality. Typically, scores are calculated through linear algebra but in this paper we present an new alternative, agglomerative method of calculating Katz scores and extend it for dynamic graphs. We show that our static algorithm is several orders of magnitude faster than the typical linear algebra approach while maintaining good quality of the scores. Furthermore, our dynamic graph algorithm is faster than pure static recomputation every time the graph changes and maintains high recall of the highly ranked vertices on both synthetic and real graphs.

### Graph-Based Speculative Query Execution for RDBMS

The paper concerns parallelized speculative support for query execution in RDBMS. The support is based on dynamic analysis of input query stream in databases serviced in SQLite. A multi-threaded middleware called the Speculative Layer is introduced which, based on a specific graph representation of a query stream, determines the most promising speculative queries for execution. The paper briefly presents the query graph modelling and analysis methods. Then, an extended version of speculative query execution algorithm is presented which allows combining results of multiple speculative queries for execution of one user input query. Experimental results are presented based on the proposed algorithm assessment in a multi-threaded testbed cooperating with a database serviced in SQLite.

Anna Sasak-Okoń, Marek Tudruj

### A GPU Implementation of Bulk Execution of the Dynamic Programming for the Optimal Polygon Triangulation

The optimal polygon triangulation problem for a convex polygon is an optimization problem to find a triangulation with minimum total weight. It is known that this problem can be solved using the dynamic programming technique in $$O(n^3)$$ time. The main contribution of this paper is to present an efficient parallel implementation of this $$O(n^3)$$-time algorithm for a lot of instances on the GPU (Graphics Processing Unit). In our proposed GPU implementation, we focused on the computation for a lot of instances and considered programming issues of the GPU architecture such as coalesced access of the global memory, warp divergence. Our implementation solves the optimal polygon triangulation problem for 1024 convex 1024-gons in 4.77 s on the NVIDIA TITAN X, while a conventional CPU implementation runs in 241.53 s. Thus, our GPU implementation attains a speedup factor of 50.6.

Kohei Yamashita, Yasuaki Ito, Koji Nakano

### Early Performance Evaluation of the Hybrid Cluster with Torus Interconnect Aimed at Molecular-Dynamics Simulations

In this paper, we describe the Desmos cluster that consists of 32 hybrid nodes connected by a low-latency high-bandwidth torus interconnect. This cluster is aimed at cost-effective classical molecular dynamics calculations. We present strong scaling benchmarks for GROMACS, LAMMPS and VASP and compare the results with other HPC systems. This cluster serves as a test bed for the Angara interconnect that supports 3D and 4D torus network topologies, and verifies its ability to unite MPP systems speeding-up effectively MPI-based applications. We describe the interconnect presenting typical MPI benchmarks.

Vladimir Stegailov, Alexander Agarkov, Sergey Biryukov, Timur Ismagilov, Mikhail Khalilov, Nikolay Kondratyuk, Evgeny Kushtanov, Dmitry Makagon, Anatoly Mukosey, Alexander Semenov, Alexey Simonov, Alexey Timofeev, Vyacheslav Vecher

### Load Balancing for CPU-GPU Coupling in Computational Fluid Dynamics

This paper investigates static load balancing models for CPU-GPU coupling from a computational fluid dynamics perspective. While able to generate a benefit, traditional load balancing models are found to be too inaccurate to predict the runtime of a preconditioned conjugate gradient solver. Hence, an expanded model is derived that accounts for the multi-step nature of the solver, i.e. several communication barriers per iteration. It is able to predict the runtime to a margin of 5%, rendering CPU-GPU coupling better predictable so that load balancing can be improved substantially.

Immo Huismann, Matthias Lieber, Jörg Stiller, Jochen Fröhlich

### Implementation and Performance Analysis of 2.5D-PDGEMM on the K Computer

In this study, we propose a 2D-compatible implementation of 2.5D parallel matrix multiplication (2.5D-PDGEMM), which was designed to perform computations of 2D distributed matrices on a 2D process grid. We evaluated the performance of our implementation using 16384 nodes (131072 cores) on the K computer, which is a highly parallel computer. The results show that our 2.5D implementation outperforms conventional 2D implementations including the ScaLAPACK PDGEMM routine, in terms of strong scaling, even when the cost for matrix redistribution between 2D and 2.5D distributions is included. We discussed the performance of our implementation by providing a breakdown of the performance and describing the performance model of the implementation.

Daichi Mukunoki, Toshiyuki Imamura

### An Approach for Detecting Abnormal Parallel Applications Based on Time Series Analysis Methods

The low efficiency of parallel program execution is one of the most serious problems in high-performance computing area. There are many researches and software tools aimed at analyzing and improving the performance of a particular program, but the task of detecting such applications that need to be analyzed is still far from being solved.In this research, methods for detecting abnormal behavior of the programs in the overall supercomputer task flow are being developed. There are no clear criteria for anomalous behavior, and also these criteria can differ significantly for different computing systems, therefore machine learning methods are being used. These methods take system monitoring data as an input, since they provide the most complete information about the dynamics of program execution.In this article we propose a method based on the time series analysis of dynamic characteristics describing the behavior of programs. In this method, the time series is divided into a set of intervals, where the anomalous ones are detected. After that the final classification of the entire application is performed based on the results of interval classification. The developed method is being tested on real-life data of the Petaflops-level Lomonosov-2 supercomputer.

### Prediction of the Inter-Node Communication Costs of a New Gyrokinetic Code with Toroidal Domain

We consider the communication costs of gyrokinetic plasma physics simulations running at large scale. For this we apply virtual decompositions of the toroidal domain in three dimensions and additional domain cloning to existing simulations done with the ORB5 code. The communication volume and the number of communication partners per timestep for every virtual task (node) are evaluated for the particles and the structured mesh. Thus the scaling properties of a code with the new domain decompositions are derived for simple models of a modern computer network and corresponding processing units. The effectiveness of the suggested decomposition has been shown. For a typical simulation with $$2\cdot 10^9$$ particles and a mesh of $$256\times 1024\times 512$$ grid points scaling to 2, 800 nodes should be achieved.

Andreas Jocksch, Noé Ohana, Emmanuel Lanti, Aaron Scheinberg, Stephan Brunner, Claudio Gheller, Laurent Villard

### D-Spline Performance Tuning Method Flexibly Responsive to Execution Time Perturbation

Various software automatic tuning methods have been proposed to search for the optimum parameter setting from among a combination of performance parameters. We have been studying a discrete spline (d-Spline)-based incremental performance parameter estimation (IPPE) method that does not require the approximation function to have differential continuity. In this method, a d-Spline generated from the minimum sample point is used to estimate the optimum value of the performance parameter. In prior methods, one measurement result was used to conduct sample point estimation; however, perturbations arising from the computing environment can affect estimates made in this manner. Such perturbations include disturbances introduced by the computing environment and OS jitters. In this study, we propose a method that considers execution time perturbation in performance parameter estimation by allowing for re-measurement under certain conditions by using an actual IPPE measurement. This lowers the inclusion of execution time perturbation in d-Spline approximation, thus enhancing the reliability of software automatic tuning.

Guning Fan, Masayoshi Mochizuki, Akihiro Fujii, Teruo Tanaka, Takahiro Katagiri

### Dfuntest: A Testing Framework for Distributed Applications

New ideas in distributed systems (algorithms or protocols) are commonly tested by simulation, because experimenting with a prototype deployed on a realistic platform is cumbersome. However, a prototype not only measures performance but also verifies assumptions about the underlying system. We developed dfuntest—a testing framework for distributed applications that defines abstractions and test structure, and automates experiments on distributed platforms. Dfuntest aims to be jUnit’s analogue for distributed applications; a framework that enables the programmer to write robust and flexible scenarios of experiments. Dfuntest requires minimal bindings that specify how to deploy and interact with the application. Dfuntest’s abstractions allow execution of a scenario on a single machine, a cluster, a cloud, or any other distributed infrastructure, e.g. on PlanetLab. A scenario is a procedure; thus, our framework can be used both for functional tests and for performance measurements. We show how to use dfuntest to deploy our DHT prototype on 60 PlanetLab nodes and verify whether the prototype maintains a correct topology.

### Security Monitoring and Analytics in the Context of HPC Processing Model

In this paper an overview of the problem of cybersecurity monitoring and analytics in HPC centers is performed from two intersecting points of view: challenges of assuring the necessary security level of HPC infrastructures themselves as well as new, not available earlier, opportunities to effectively analyze large volumes of heterogeneous data, facilitated by using large HPC clusters together with scalable analytic software. A major part of this paper is devoted to the most relevant methodologies and solutions that can be used by security analytics in order to at least partially face the challenge of analyzing large volumes of data potentially related with cyber-security events, in real-time or quasi-real-time. Particular solutions are considered in the context of their applicability in an HPC infrastructure. Relying on the results of experiments conducted within the SECOR project we have shown an approach of further development of the prepared architecture in HPC environment – within the confines of another R&D project, PROTECTIVE.

Mikołaj Dobski, Gerard Frankowski, Norbert Meyer, Maciej Miłostan, Michał Pilc

### Multidimensional Performance and Scalability Analysis for Diverse Applications Based on System Monitoring Data

The availability of high performance computing resources enables us to perform very large numerical simulations and in this way to tackle challenging real life problems. At the same time, in order to efficiently utilize the computational power at our disposal, the ever growing complexity of the computer architecture poses high demands on the algorithms and their implementation.Performing large scale high performance simulations can be done by utilizing available general libraries, writing libraries that suit particular classes of problems or developing software from scratch. Clearly, the possibilities to enhance the efficiency of the software tools in the three cases is very different, ranging from nearly impossible to full capacity. In this work we exemplify the efficiency of the three approaches on benchmark problems, using monitoring tools that provide a very rich spectrum of data on the performance of the applied codes as well as on the utilization of the supercomputer itself.

Maya Neytcheva, Sverker Holmgren, Jonathan Bull, Ali Dorostkar, Anastasia Kruchinina, Dmitry Nikitenko, Nina Popova, Pavel Shvets, Alexey Teplov, Vadim Voevodin, Vladimir Voevodin

### Bridging the Gap Between HPC and Cloud Using HyperFlow and PaaSage

A hybrid HPC/Cloud architecture is a potential solution to the ever-increasing demand for high-availability on-demand resources for eScience applications. eScience applications are primarily compute-intensive, and thus require HPC resources. They usually also include pre- and post-processing steps, which can be moved into the Cloud in order to keep costs low. We believe that currently no methodology exists to bridge the gap between HPC and Cloud in a seamless manner. The goal is to lower the gap for non-professionals in order to exploit external facilities through an automated deployment and scaling both vertically (HPC) and horizontally (Cloud). This paper demonstrates how representative eScience applications can easily be transferred from HPC to Cloud using the model-based cross-cloud deployment platform PaaSage.

Dennis Hoppe, Yosandra Sandoval, Anthony Sulistio, Maciej Malawski, Bartosz Balis, Maciej Pawlik, Kamil Figiela, Dariusz Krol, Michal Orzechowski, Jacek Kitowski, Marian Bubak

### A Memory Efficient Parallel All-Pairs Computation Framework: Computation – Communication Overlap

All-Pairs problems require each data element in a set of N data elements to be paired with every other data element for specific computation using the two data elements. Our framework aims to address recurring problems of scalability, distributing equal work load to all nodes and by reducing memory footprint. We reduce memory footprint of All-Pairs problems, by reducing memory requirement from $$N/\sqrt{P}$$ to 3N/P. A bio-informatics application is implemented to demonstrate the scalability ranging up to 512 cores for the data set we experimented, redundancy management, and speed up performance of the framework.

Venkata Kasi Viswanath Yeleswarapu, Arun K. Somani

### Automatic Parallelization of ANSI C to CUDA C Programs

Writing efficient general-purpose programs for Graphics Processing Units (GPU) is a complex task. In order to be able to program these processors efficiently, one has to understand their intricate architecture, memory subsystem as well as the interaction with the Central Processing Unit (CPU). The paper presents the GAP - an automatic parallelizer designed to translate sequential ANSI C code to parallel CUDA C programs. Developed and implemented compiler was tested on the series of ANSI C programs. The generated code performed very well, achieving significant speed-ups for the programs that expose high degree of data-parallelism. Thus, the idea of applying the automatic parallelization for generating the CUDA C code is feasible and realistic.

Jan Kwiatkowski, Dzanan Bajgoric

### Consistency Models for Global Scalable Data Access Services

Developing and deploying a global and scalable data access service is a challenging task. We assume that the globalization is achieved by creating and maintaining appropriate metadata while the scalability is achieved by limiting the number of entities taking part in keeping the metadata consistency. In this paper, we present different consistency and synchronization models for various metadata types chosen for implementation of global and scalable data access service.

Michał Wrzeszcz, Darin Nikolow, Tomasz Lichoń, Rafał Słota, Łukasz Dutka, Renata G. Słota, Jacek Kitowski

### Global State Monitoring in Optimization of Parallel Event–Driven Simulation

The paper presents results of experimental work in the field of optimization of parallel, event-driven simulation via application of global state monitoring. Discrete event simulation is a well known technique used for modelling and simulating complex parallel systems. Parallel simulation employs multiple simulated event queues processed in parallel. Absence of proper synchronization between parallel queues can cause massive simulation rollbacks, which slow down the simulation process. We propose a new method for parallel simulation control with monitoring of global program states, which prevent excessive number of rollbacks. Every queue process reports its local progress to a global synchronizer which monitors the global simulation state as timestamps of recently processed events in distributed queues. Based on this state the synchronizer checks the progress of simulation and sends signals limiting progress in too advanced queues. This control is done asynchronously, and thus it has small time overheads in case of correct simulation order. The paper describes the proposed approach and the experimental results of its basic program implementation.

Łukasz Maśko, Marek Tudruj

### High Performance Optimization of Independent Component Analysis Algorithm for EEG Data

Independent Component Analysis (ICA) is known as a signal cleaning method that allows the artifacts to be extracted and subsequently eliminated. It is especially essential while processing the EEG data. However, this is a time-consuming algorithm especially if we deal with a high-dimensional data and take care about the calculation accuracy. One of the known implementations of this algorithm, which can be found in MATLAB or the open library it++ – fastICA – does not use parallel implementations nor take benefit of the current capabilities of the Intel architecture. Also for large data, fastICA’s accuracy and stability decrease due to the reduction of data dimension. The paper introduces an implementation that uses Intel Cilk Plus, BLAS and MKL library built-in functions as well as array notation and OpenMP parallelization to optimize the algorithm.

Anna Gajos-Balińska, Grzegorz M. Wójcik, Przemysław Stpiczyński

### Continuous and Discrete Models of Melanoma Progression Simulated in Multi-GPU Environment

Existing computational models of cancer evolution mostly represent very general approaches for studying tumor dynamics in a homogeneous tissue. Here we present two very different cancer models: the heterogeneous continuous/discrete and purely discrete one, focusing on a specific cancer type – melanoma. This tumor proliferates in a complicated heterogeneous environment of the human skin. The results from simulations obtained for the two models are confronted in the context of their possible integration into a single multi-scale system. We demonstrate that the interaction between the tissue – represented by both the concentration fields (the continuous model) and the particles (the discrete model) – and the discrete network of blood vessels is the crucial component, which can increase the simulation time even one order of magnitude. To compensate this time lag, we developed GPU/CUDA implementations of the two melanoma models. Herein, we demonstrate that the continuous/discrete model, run on a multi-GPU cluster, almost fifteen times outperforms its multi-threaded CPU implementation.

Witold Dzwinel, Adrian Kłusek, Rafał Wcisło, Marta Panuszewska, Paweł Topa

### Early Experience on Using Knights Landing Processors for Lattice Boltzmann Applications

The Knights Landing (KNL) is the codename for the latest generation of Intel processors based on Intel Many Integrated Core (MIC) architecture. It relies on massive thread and data parallelism, and fast on-chip memory. This processor operates in standalone mode, booting an off-the-shelf Linux operating system. The KNL peak performance is very high – approximately 3 Tflops in double precision and 6 Tflops in single precision – but sustained performance depends critically on how well all parallel features of the processor are exploited by real-life applications. We assess the performance of this processor for Lattice Boltzmann codes, widely used in computational fluid-dynamics. In our OpenMP code we consider several memory data-layouts that meet the conflicting computing requirements of distinct parts of the application, and sustain a large fraction of peak performance. We make some performance comparisons with other processors and accelerators, and also discuss the impact of the various memory layouts on energy efficiency.

Enrico Calore, Alessandro Gabbana, Sebastiano Fabio Schifano, Raffaele Tripiccione

### Towards a Model of Semi-supervised Learning for the Syntactic Pattern Recognition-Based Electrical Load Prediction System

The paper is devoted to one of the key open problems of development of SPRELP system (the Syntactic Pattern Recognition-based Electrical Load Prediction System). The main module of SPRELP System is based on a GDPLL($$k$$) grammar that is built according to the unsupervised learning paradigm. The GDPLL($$k$$) grammar is generated by a grammatical inference algorithm. The algorithm doesn’t take into account an additional knowledge (the knowledge is partial and corresponds only to some examples) provided by a human expert. The accuracy of the forecast could be better if we took advantage of this knowledge. The problem of how to construct the model of a semi-supervised learning for SPRLP system that includes the additional expert knowledge is discussed in the paper. We also present several possible solutions.

Janusz Jurek

### Parallel Processing of Color Digital Images for Linguistic Description of Their Content

This paper presents different aspects of parallelization of a problem of processing color digital images in order to generate linguistic description of their content. A parallel architecture of an intelligent image recognition system is proposed. Fuzzy classification and inference is performed in parallel, based on the CIE chromaticity color model and granulation approach. In addition, the parallelization concerns e.g. processing a large collection of images or parts of a single image.

### Co-evolution of Fitness Predictors and Deep Neural Networks

Deep neural networks proved to be a very useful and powerful tool with many applications. In order to achieve good learning results, the network architecture has, however, to be carefully designed, which requires a lot of experience and knowledge. Using an evolutionary process to develop new network topologies can facilitate this process. The limiting factor is the speed of evaluation of a single specimen (a single network architecture), which includes learning based on a large dataset. In this paper we propose a new approach which uses subsets of the original training set to approximate the fitness. We describe a co-evolutionary algorithm and discuss its key elements. Finally we draw conclusions from experiments and outline plans for future work.

Włodzimierz Funika, Paweł Koperek

### Performance Evaluation of DBN Learning on Intel Multi- and Manycore Architectures

In our previous papers [12, 13], we proposed the parallel realization of the Deep Belief Network (DBN). This research confirmed the potential usefulness of the first generation of the Intel MIC architecture for implementing DBN and similar algorithms. In this work, we investigate how the Intel MIC and CPU platforms can be applied to implement efficiently the complete learning process using DBNs with layers corresponding to the Restricted Boltzman Machines. The focus is on the new generation of Intel MIC devices known as Knights Landing. Unlike the previous generation, called Knights Corner, they are delivered not as coprocessors, but as standalone processors.The learning procedure is based on the matrix approach, where learning samples are grouped into packages, and represented as matrices. We study the possible ways of improving the performance taking into account features of the Knights Landing architecture, and parameters of the learning algorithm. In particular, the influence of the package size on the accuracy of learning, as well as on the performance of computations are investigated using conventional CPU and Intel Xeon Phi. The performance advantages of Knights Landing over Knights Corner are presented and discussed.

Tomasz Olas, Wojciech K. Mleczko, Marcin Wozniak, Robert K. Nowicki, Pawel Gepner

### On the Tunability of a New Hessenberg Reduction Algorithm Using Parallel Cache Assignment

The reduction of a general dense square matrix to Hessenberg form is a well known first step in many standard eigenvalue solvers. Although parallel algorithms exist, the Hessenberg reduction is one of the bottlenecks in AED, a main part in state-of-the-art software for the distributed multishift QR algorithm. We propose a new NUMA-aware algorithm that fits the context of the QR algorithm and evaluate the sensitivity of its algorithmic parameters. The proposed algorithm is faster than LAPACK for all problem sizes and faster than ScaLAPACK for the relatively small problem sizes typical for AED.

Mahmoud Eljammaly, Lars Karlsson, Bo Kågström

### New Preconditioning for the One-Sided Block-Jacobi SVD Algorithm

New preconditioning for the one-sided block-Jacobi algorithm used for the computation of the singular value decomposition of a matrix A is proposed. To achieve the asymptotic quadratic convergence quickly, one can apply the Jacobi algorithm to the matrix $$AV_1$$ instead of A, where $$V_1$$ is the matrix of eigenvectors from the eigenvalue decomposition of the Gram matrix $$A^TA$$. In exact arithmetic, $$V_1$$ is also the matrix of right singular vectors of A so that the columns of $$AV_1$$ lie in $$\mathrm {span}(U)$$, where U is the matrix of left singular vectors. However, in finite arithmetic, this is not true in general, and the deviation of $$\mathrm {span}(AV_1)$$ from $$\mathrm {span}(U)$$ depends on the 2-norm condition number $$\kappa (A)$$. The performance of the new preconditioned one-sided block-Jacobi algorithm was compared with three other SVD procedures. In the case of well-conditioned matrix, the new algorithm is up to 25 times faster than the LAPACK Jacobi procedure DGESVJ.

Martin Bečka, Gabriel Okša, Eva Vidličková

### Structure-Preserving Technique in the Block SS–Hankel Method for Solving Hermitian Generalized Eigenvalue Problems

The block SS–Hankel method is one of the most efficient methods for solving interior generalized eigenvalue problems (GEPs) when only the eigenvalues are required. However, even if the target GEP is Hermitian, the block SS–Hankel method does not always preserve the Hermitian structure. To overcome this issue, in this paper, we propose a structure-preserving technique of the block SS–Hankel method for solving Hermitian GEPs. We also analyse the error bound of the proposed method and show that the proposed method improves the accuracy of the eigenvalues. The numerical results support the results of the analysis.

Akira Imakura, Yasunori Futamura, Tetsuya Sakurai

### On Using the Cholesky QR Method in the Full-Blocked One-Sided Jacobi Algorithm

The one-sided Jacobi method is known as an alternative of the bi-diagonalization based singular value decomposition (SVD) algorithms like QR, divide-and-conquer and MRRR, because of its accuracy and comparable performance. There is an extension of the one-sided Jacobi method called “full-blocked” method, which can further improve the performance by replacing level-1 BLAS like operations with matrix multiplications. The main part of the full-blocked one-sided Jacobi method (OSBJ) is computing the SVD of a pair of block columns of the input matrix. Thus, the computation method of this partial SVD is important for both accuracy and performance of OSBJ. Hari proposed three methods for this computation, and we found out that one of the method called “V2”, which computes the QR decomposition in this partial SVD using the Cholesky QR method, is the fastest and has comparable accuracy with other method. This is interesting considering that Cholesky QR is generally known as fast but unstable algorithm. In this article, we analyze the accuracy of V2 and explain why and when the Cholesky QR method used in it can compute the QR decomposition accurately. We also show the performance and accuracy comparisons with other computational methods.

Shuhei Kudo, Yusaku Yamamoto

### Parallel Divide-and-Conquer Algorithm for Solving Tridiagonal Eigenvalue Problems on Manycore Systems

We present a new parallel divide-and-conquer (DC) algorithm based on an execution scheduling by batched kernels for solving real-symmetric tridiagonal eigenvalue problems on manycore systems. Our algorithm has higher parallelism and requires less global synchronizations than a conventional algorithm. We compared the performance of the solver based on our algorithm with that of Intel MKL’s DC solver and PLASMA’s one on Xeon E5, Xeon Phi Knights Corner, and Xeon Phi Knights Landing. The numerical tests show that the implementation of our algorithm is comparable to Intel MKL on Xeon E5 and outperforms Intel MKL and PLASMA on the two Xeon Phi systems.

Yusuke Hirota, Toshiyuki Imamura

### Partial Inverses of Complex Block Tridiagonal Matrices

The algorithm detailed below extends previous work on inversion of block tridiagonal matrices from the Hermitian/symmetric case to the general case and allows for varying sub-block sizes. The blocks of the matrix are evenly distributed across p processes. Local sub-blocks are combined to form a matrix on each process. These matrices are inverted locally and the inverses are combined in a pairwise manner. At each combination step, the updates to the global inverse are represented by updating “matrix maps” on each process. The matrix maps are finally applied to the original local inverse to retrieve the block tridiagonal elements of the global inverse. This algorithm has been implemented in Fortran with MPI. Calculated inverses are compared with inverses obtained using the well known libraries ScaLAPACK and MUMPS. Results are given for matrices arising from DFT applications.

Louise Spellacy, Darach Golden

### Parallel Nonnegative Matrix Factorization Based on Newton Iteration with Improved Convergence Behavior

The Nonnegative Matrix Factorization (NMF) approximates a large nonnegative matrix as a product of two significantly smaller nonnegative matrices. Because of the nonnegativity constraints, all existing methods for NMF are iterative. Newton-type methods promise good convergence rate and can also be parallelized very well because Newton iterations can be performed in parallel without exchanging data between processes. However, previous attempts have revealed problematic convergence behavior, limiting their efficiency. Therefore, we combine Karush-Kuhn-Tucker (KKT) conditions and a reflective technique for constraint handling, take care of global convergence by backtracking line search, and apply a modified target function in order to satisfy KKT inequalities. By executing only few Newton iterations per outer iteration, the algorithm is turned into a so-called inexact method. Experiments show that this leads to faster convergence in the sequential as well as in the parallel case. Although shorter Newton phases increase the relative parallel communication overhead, speedups are still satisfactory.

Rade Kutil, Markus Flatz, Marián Vajteršic

### Backmatter

Weitere Informationen