
This book constitutes the refereed proceedings of the 4th Russian Supercomputing Days, RuSCDays 2018, held in Moscow, Russia, in September 2018.

The 59 revised full papers and one revised short paper presented were carefully reviewed and selected from 136 submissions. The papers are organized in topical sections on parallel algorithms; supercomputer simulation; high performance architectures, tools and technologies.

### A Parallel Algorithm for Studying the Ice Cover Impact onto Seismic Waves Propagation in the Shallow Arctic Waters

The seismic study of the Arctic transition zones in the summer season is troublesome because of the presence of large areas covered by shallow waters such as bays, lakes, rivers, and their estuaries. The winter season is more convenient and essentially facilitates logistic operations and the implementation of seismic acquisition. However, in the winter there is another complicating factor: intensive seismic noise generated by sources installed on the floating ice. To understand the peculiarities of seismic waves and the origin of such intensive noise, a representative series of numerical experiments has been performed. In order to simulate the interaction of seismic waves with irregular perturbations of the underside of the ice cover, a finite-difference technique based on grids locally refined in time and space is used. The need for such grids is primarily due to the different scales of the heterogeneities in the reference medium and the ice cover that must be taken into account. We use the domain decomposition method to separate the elastic/viscoelastic model into subdomains with different scales. Computations for each subdomain are carried out in parallel. The data exchange between the two groups of CPUs is done simultaneously with the coupling of the coarse and fine grids. The results of the numerical experiments prove that the main contribution to the noise comes from multiple conversions of flexural waves into body waves and vice versa, and they open ways to reduce this noise.

### An Efficient Parallel Algorithm for Numerical Solution of Low Dimension Dynamics Problems

The present work focuses on speeding up computer simulations of continuously variable transmission (CVT) dynamics. Each simulation consists of an initial value problem for ordinary differential equations (ODEs) with a highly nonlinear right-hand side. Despite the low dimension, simulations take considerable CPU time due to the internal stiffness of the ODEs, which leads to a large number of integration steps when a conventional numerical method is used. One way to speed up simulations is to parallelize the evaluation of the ODE right-hand side using the OpenMP technology. The other way is to apply a numerical method more suitable for stiff systems. The paper presents current results obtained by employing a combination of both approaches. Difficulties on the way towards good scalability are pointed out.

Stepan Orlov, Alexey Kuzin, Nikolay Shabrov
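The effect of stiffness described above can be seen on a toy problem (not the authors' CVT model): for the linear test ODE y' = -λy with λ = 50, an explicit method is stable only for steps h < 2/λ, while an implicit method tolerates any step, so a stiff-aware integrator needs far fewer steps.

```python
# Illustrative sketch: explicit vs. implicit Euler on y' = -lam * y.
# The equation and step sizes are assumptions for demonstration only.

def explicit_euler(lam, y0, h, n):
    y = y0
    for _ in range(n):
        y += h * (-lam * y)       # forward step; unstable if h*lam > 2
    return y

def implicit_euler(lam, y0, h, n):
    y = y0
    for _ in range(n):
        # Solve y_next = y + h * (-lam * y_next) exactly (linear case)
        y = y / (1.0 + h * lam)   # unconditionally stable
    return y

lam, y0, h = 50.0, 1.0, 0.1       # h = 0.1 > 2/lam = 0.04
bad = explicit_euler(lam, y0, h, 50)
good = implicit_euler(lam, y0, h, 50)
print(abs(bad) > 1e6, 0.0 < good < 1e-6)  # explicit blows up, implicit decays
```

The true solution decays to zero; the explicit scheme oscillates and diverges at this step size, which is exactly why a large number of small steps (or a method designed for stiffness) is required.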

### Analysis of Means of Simulation Modeling of Parallel Algorithms

At the ICMMG, an integrated approach to creating algorithms and software for exaflop computers is being developed. Within the framework of this approach, the study touches upon the scalability of parallel algorithms by using the method of simulation modeling with the help of the AGNES modeling system. Based on the JADE agent platform, AGNES has a number of essential shortcomings in the modeling of hundreds of thousands and millions of independent computing cores, which is why it is necessary to find an alternative tool for simulation modeling. Various instruments of agent and actor modeling, such as QP/C++, CAF, SObjectizer, Erlang, and Akka, were studied as applied to the modeling of millions of computing cores. As a result, on the basis of ease of implementation, scalability, and fault tolerance, the Erlang functional programming language was chosen, which was originally developed to create telephony programs. Today Erlang is meant for developing distributed computing systems and includes means for generating parallel lightweight processes and for their interaction through the exchange of asynchronous messages in accordance with an actor model. Testing the performance of this tool for the implementation of parallel algorithms on future exaflop supercomputers is carried out by investigating the scalability of a statistical simulation algorithm based on Monte Carlo methods on a million computing cores. The results obtained in this paper are compared with the results obtained earlier by using AGNES.

D. V. Weins, B. M. Glinskiy, I. G. Chernykh

### Block Lanczos-Montgomery Method over Large Prime Fields with GPU Accelerated Dense Operations

The solution of huge linear systems over large prime fields is a problem that arises in such applications as discrete logarithm computation. The Lanczos-Montgomery method is one of the methods to solve such problems. The main parallel resource of the method is the size of the block. However, the computational cost of dense matrix operations increases with block size, so parallel scaling is close to linear only while the complexity of such operations is relatively small. In this paper, the block Lanczos-Montgomery method with dense matrix operations accelerated on GPU is implemented. Scalability tests are performed (including tests with multiple GPUs per node) and compared to the CPU-only version.

Nikolai Zamarashkin, Dmitry Zheltkov

### Comparison of Dimensionality Reduction Schemes for Parallel Global Optimization Algorithms

This work considers parallel algorithms for solving multi-extremal optimization problems. The algorithms are developed within the framework of the information-statistical approach and implemented in the parallel solver Globalizer. The optimization problem is solved by reducing the multidimensional problem to a set of joint one-dimensional problems that are solved in parallel. Five types of Peano-type space-filling curves are employed to reduce the dimension. The results of computational experiments carried out on several hundred test problems are discussed.

Konstantin Barkalov, Vladislav Sovrasov, Ilya Lebedev
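The dimension-reduction idea above maps a one-dimensional index onto a multidimensional grid so that points close along the curve stay close in space. As an illustration (the paper uses Peano-type curves; here the closely related Hilbert curve in 2D, not the authors' construction), the classic `d2xy` routine converts a 1D index into grid coordinates:

```python
# Sketch: Hilbert curve mapping d in [0, n*n) to a cell (x, y) of an
# n-by-n grid (n a power of two). Consecutive d land in adjacent cells.

def d2xy(n, d):
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                  # rotate the current quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

n = 8                                # 8x8 grid, curve visits all 64 cells
path = [d2xy(n, d) for d in range(n * n)]
steps = [abs(a[0] - b[0]) + abs(a[1] - b[1]) for a, b in zip(path, path[1:])]
print(len(set(path)) == n * n, all(s == 1 for s in steps))
```

Because neighbors along the curve are neighbors on the grid, a one-dimensional search along the curve inherits locality from the multidimensional problem, which is what makes the reduction useful for global optimization.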

### Efficiency Estimation for the Mathematical Physics Algorithms for Distributed Memory Computers

The paper presents several models of parallel program runs on computer platforms with distributed memory. The prediction of parallel algorithm efficiency is based on the arithmetic and communication complexities of the algorithm. For some mathematical physics algorithms based on explicit schemes for the solution of the heat transfer equation, speedup estimates were obtained, and numerical experiments were performed to compare the actual and theoretically predicted speedups.

Igor Konshin
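A speedup model of the kind discussed splits the run time into an arithmetic part that shrinks as 1/p and a communication part that grows with the number of processes p. The concrete cost functions below are illustrative assumptions, not the paper's models:

```python
# Sketch: predicted speedup from arithmetic + communication costs.
# t_arith is serial compute time; communication is assumed linear in p.

def predicted_speedup(t_arith, t_comm_per_link, p):
    t_p = t_arith / p + t_comm_per_link * (p - 1)  # time on p processes
    return t_arith / t_p

for p in (1, 2, 8, 64, 1024):
    print(p, round(predicted_speedup(100.0, 0.05, p), 1))
```

Such a model predicts near-linear speedup while communication is negligible, saturation at moderate p, and eventual slowdown once communication dominates — the qualitative behavior the paper compares against measured speedups.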

### Extremely High-Order Optimized Multioperators-Based Schemes and Their Applications to Flow Instabilities and Sound Radiation

Multioperators-based schemes of up to 32nd order for fluid dynamics calculations are described. Their parallel implementation is outlined. The results of applying their versions to instability and sound radiation problems are presented. The extension to strongly discontinuous solutions is briefly outlined.

Andrei Tolstykh, Michael Lipavskii, Dmitrii Shirobokov, Eugenii Chigerev

### GPU-Based Parallel Computations in Multicriterial Optimization

In the present paper, an efficient approach is proposed for solving time-consuming multicriterial optimization problems, in which the optimality criteria may be multiextremal and computing the criteria values may require a large amount of computation. The proposed approach is based on the reduction of the multicriterial problems to scalar optimization ones with the use of the minimax convolution of the partial criteria, on dimensionality reduction with the use of Peano space-filling curves, and on the application of efficient information-statistical global optimization methods. An additional application of the block multistep scheme provides the opportunity for large-scale parallel computations with the use of graphics processing units (GPUs) with thousands of computational cores. The results of the numerical experiments demonstrate that this approach improves the computational efficiency of solving multicriterial optimization problems considerably, by factors of hundreds to thousands.

Victor Gergel, Evgeny Kozinov
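The minimax convolution mentioned above replaces several criteria with a single scalar objective F(x) = max_i λ_i f_i(x), which is then minimized. A minimal sketch with two toy criteria and a coarse grid search (illustrative assumptions, not the Globalizer algorithms):

```python
# Sketch: scalarizing a bi-criteria problem via the minimax convolution.

def f1(x):                       # first (toy) criterion
    return (x - 0.2) ** 2

def f2(x):                       # second (toy) criterion
    return (x - 0.8) ** 2

def minimax(x, lam=(0.5, 0.5)):  # F(x) = max_i lam_i * f_i(x)
    return max(lam[0] * f1(x), lam[1] * f2(x))

# Solve the scalarized 1D problem by grid search over [0, 1].
xs = [i / 1000 for i in range(1001)]
best = min(xs, key=minimax)
print(best)  # 0.5: the balance point where both weighted criteria are equal
```

Varying the weights λ traces out different Pareto-optimal trade-offs, which is why the scalarized problem is solved repeatedly, and in parallel, in the approach described.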

### LRnLA Algorithm ConeFold with Non-local Vectorization for LBM Implementation

We have achieved a performance of ~0.3 GLUps on a 4-core CPU for the D3Q19 Lattice Boltzmann method by taking an advanced time-space decomposition approach. The LRnLA algorithm ConeFold was used with a new non-local mirrored vectorization. The roofline model was used for the performance estimation and parameter choice. There are many possibilities for extension, so the developed kernel may become a foundation for more complex LBM variations.
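The roofline estimate used for such performance modeling is simple: attainable performance is the minimum of the peak compute rate and the memory bandwidth times the arithmetic intensity of the kernel. A minimal sketch (the hardware numbers are illustrative assumptions, not measurements from the paper):

```python
# Sketch: roofline model. Performance is compute-bound or memory-bound,
# whichever ceiling is lower at the kernel's arithmetic intensity.

def roofline(peak_gflops, bandwidth_gbs, intensity_flop_per_byte):
    return min(peak_gflops, bandwidth_gbs * intensity_flop_per_byte)

peak, bw = 200.0, 25.0           # assumed 4-core CPU: GFLOP/s and GB/s
for ai in (0.5, 2.0, 8.0, 16.0):
    print(ai, roofline(peak, bw, ai))
```

Low-intensity kernels such as a naive LBM update sit on the memory-bandwidth slope; time-space decomposition schemes like ConeFold raise the effective arithmetic intensity by reusing cached data across time steps, moving the kernel toward the compute ceiling.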

### Numerical Method for Solving a Diffraction Problem of Electromagnetic Wave on a System of Bodies and Screens

The three-dimensional vector problem of electromagnetic wave diffraction by systems of intersecting dielectric bodies and infinitely thin, perfectly conducting screens of irregular shape is considered. The original boundary value problem for Maxwell's equations is reduced to a system of integro-differential equations. Methods of surface and volume integral equations are used. The system of linear algebraic equations is obtained using the Galerkin method with compactly supported basis functions. The subhierarchical method is applied to solve the diffraction problem for scatterers of irregular shape. Several numerical results are presented, and a parallel algorithm is employed for the computations.

Mikhail Medvedik, Marina Moskaleva, Yury Smirnov

### Parallel Algorithm for One-Way Wave Equation Based Migration for Seismic Imaging

Seismic imaging is the final stage of seismic processing, reconstructing the internal subsurface structure. This procedure is one of the most time-consuming and requires huge computational resources to obtain high-quality amplitude-preserving images. In this paper, we present a parallel algorithm for seismic imaging based on the solution of the one-way wave equation. The algorithm includes parallelization of the data flow due to the processing of multiple source/receiver pairs. Wavefield extrapolation is performed by pseudo-spectral methods and applied via qFFT; each dataset is processed by a single MPI process. Common-offset vector images are constructed using the solutions from all datasets, which requires all-to-all MPI communications.

### Parallel Simulation of Community-Wide Information Spreading in Online Social Networks

Models of information spread in online social networks (OSNs) are in high demand these days. Most of them consider peer-to-peer interaction on a predefined topology of the friend network. However, in particular types of OSNs the largest information cascades are observed during community-user interaction, when communities play the role of superspreaders for their audience. In this paper, we consider the problem of the parallel simulation of community-wide information spreading in large-scale (up to dozens of millions of nodes) networks. The efficiency of the parallel algorithm is studied for synthetic and real-world social networks from VK.com using the Lomonosov supercomputer (Moscow State University, Russian Federation).

Sergey Kesarev, Oksana Severiukhina, Klavdiya Bochenina
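The superspreader mechanics described above can be sketched as a toy cascade model (assumed mechanics for illustration, not the paper's model): a community post exposes its entire audience at once, each exposed user reposts with probability p, and reposts then expose that user's own friends.

```python
import random

# Sketch: community-wide cascade. The community exposes all subscribers;
# reposts propagate peer-to-peer over the friend graph.

def simulate_cascade(subscribers, friends, p, seed=0):
    rng = random.Random(seed)
    exposed = set(subscribers)               # community post reaches audience
    frontier = [u for u in subscribers if rng.random() < p]  # first reposters
    infected = set(frontier)
    while frontier:
        nxt = []
        for u in frontier:
            for v in friends.get(u, ()):     # friends of a reposter see it
                if v not in exposed:
                    exposed.add(v)
                    if rng.random() < p:     # and repost with probability p
                        infected.add(v)
                        nxt.append(v)
        frontier = nxt
    return exposed, infected

friends = {0: [10, 11], 1: [12], 10: [13]}   # tiny hypothetical friend graph
exposed, infected = simulate_cascade([0, 1, 2], friends, p=1.0)
print(sorted(exposed))  # [0, 1, 2, 10, 11, 12, 13]
```

With p = 1 every exposed user reposts, so the cascade covers the community audience plus everyone reachable through friend links — the superspreader effect that makes community-initiated cascades much larger than purely peer-to-peer ones.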

### Technique for Teaching Parallel Programming via Solving a Computational Electrodynamics Problem

Three-dimensional problems of computational electrodynamics for regions of complex shape can be solved within a reasonable time only using multiprocessor computer systems. The paper discusses the process of converting sequential algorithms into more efficient programs using some special techniques, including object-oriented programming concepts. Special classes for data storage are recommended for use at the first stage of programming; many of these objects can be eliminated when optimizing the code. Special attention is paid to the testing of computer programs. As an example, the problem of electromagnetic wave diffraction by screens in three-dimensional waveguide structures and its particular cases are considered. The technique of constructing a parallel code for solving the diffraction problem is used in teaching parallel programming.

Sergey Mosin, Nikolai Pleshchinskii, Ilya Pleshchinskii, Dmitrii Tumakov

### The Algorithm for Transferring a Large Number of Radionuclide Particles in a Parallel Model of Ocean Hydrodynamics

The research is concerned with the description of an algorithm for transferring a large number, up to 10^6, of radionuclide particles in a general circulation model of the ocean, INMIO. Much attention is paid to the functioning of the algorithm under the parallelism of the original model. The organization of the information storage necessary in the course of model calculations is given in this paper. Important aspects of saving calculated results to external media are revealed. The algorithm for radionuclide particle decay is described. The results of an experiment obtained by running the original model in a configuration based on the Laptev Sea are presented.

Vladimir Bibin, Rashit Ibrayev, Maxim Kaurkin

### Aerodynamic Models of Complicated Constructions Using Parallel Smoothed Particle Hydrodynamics

In the current paper we consider new industrial tasks requiring calculations of air dynamics inside and outside huge, geometrically complicated building constructions. An example of such constructions are semi-open sports facilities, for which it is necessary to evaluate comfort conditions depending on external factors both at the design stage and during the further operation of the building. Among the distinguishing features of such a multiscale task are the considerable size of the building, with a scale of hundreds of meters, and the complicated geometry of external and internal details with characteristic sizes on the order of a meter. Such tasks require the use of supercomputer technologies and the creation of a 3D model adapted for computer modeling. We have developed specialized software for numerical aerodynamic simulations of such buildings utilizing the smoothed particle hydrodynamics (SPH) method for Nvidia Tesla GPUs based on the CUDA technology. The SPH method allows carrying out hydrodynamic calculations in the presence of a large number of complex internal surfaces, which can be taken from the 3D model of a building. We have paid particular attention to the parallel computing efficiency, accounting for boundary conditions on geometrically complex solid surfaces and on free boundaries. Test simulations of a football stadium are discussed.

Alexander Titov, Sergey Khrapov, Victor Radchenko, Alexander Khoperskov
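At the core of any SPH code is the kernel-weighted summation over neighboring particles. A self-contained 1D sketch of the density summation with the standard cubic spline kernel (illustrative setup, not the authors' GPU implementation):

```python
# Sketch: SPH density summation, 1D, Monaghan cubic spline kernel.

def w_cubic(r, h):
    # cubic spline kernel with support 2h; 1D normalization 2/(3h)
    q = abs(r) / h
    sigma = 2.0 / (3.0 * h)
    if q < 1.0:
        return sigma * (1.0 - 1.5 * q * q + 0.75 * q ** 3)
    if q < 2.0:
        return sigma * 0.25 * (2.0 - q) ** 3
    return 0.0

def density(x_i, positions, mass, h):
    # rho_i = sum_j m_j W(x_i - x_j, h)
    return sum(mass * w_cubic(x_i - x_j, h) for x_j in positions)

# Uniform particle line: the recovered density should be ~ mass / spacing.
dx, h, m = 0.1, 0.12, 1.0
xs = [i * dx for i in range(-40, 41)]
rho = density(0.0, xs, m, h)
print(round(rho, 2), m / dx)
```

In a production GPU code the same summation runs over neighbor lists built on a spatial grid, and the solid and free boundaries mentioned above enter through additional boundary particles or corrected kernels near surfaces.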

### Ballistic Resistance Modeling of Aramid Fabric with Surface Treatment

The minimization of mass and the reduction of the back-surface deflection of an armored panel, which lowers the level of trauma to the human body, are crucial tasks in the current development of body armor. A significant part of the bullet energy is dissipated due to the friction of yarns being pulled out of the ballistic fabrics in the body armor. We present a method for controlling the process of dry friction between yarns: surface treatment with various compositions (PVA suspension, rosin). This procedure causes only a slight increase in the weight of the fabric. We investigated the impact loading of aramid fabrics of plain weave P110 with different types of surface treatment and without it (the samples were located on the backing material, technical plasticine). The indenter speed was in the range of 100–130 m/s. We also developed a model of the impact loading of the considered samples in the explicit FE code LS-DYNA. The surface treatment of the fabric was taken into account in the model by only one parameter, the coefficient of dry friction. We considered several methods of parallelizing the task. Numerical experiments were conducted to study the scalability of the problem. We found that the surface treatment reduces the deflection of the fabric by up to 37% with an increase in weight of up to 5.1%. The numerical values of the depths of the dents in the technical plasticine are in good agreement with the experimental data.

Natalia Yu. Dolganina, Anastasia V. Ignatova, Alexandra A. Shabley, Sergei B. Sapozhnikov

### CardioModel – New Software for Cardiac Electrophysiology Simulation

The rise of supercomputing technologies during the last decade has enabled significant progress towards the creation of a personal, biologically relevant computer model of the human heart. In this paper we present a new code for the numerical simulation of cardiac electrophysiology on supercomputers. Having constructed a personal segmented tetrahedral grid of the human heart based on a tomogram, we solve the bidomain equations of cardiac electrophysiology using the finite element method, thus modeling the action potential propagation in the heart. The flexible object-oriented architecture of the software allows us to expand its capabilities by plugging in relevant cell models, preconditioners, and numerical methods for solving SLAEs. The results of the numerical modeling of the heart under normal conditions, as well as of a number of simulated pathologies, are in good agreement with theoretical expectations. The software achieves at least 75% scaling efficiency on 120 ranks on the Lobachevsky supercomputer.

Valentin Petrov, Sergey Lebedev, Anna Pirova, Evgeniy Vasilyev, Alexander Nikolskiy, Vadim Turlapov, Iosif Meyerov, Grigory Osipov

### Examination of Clastic Oil and Gas Reservoir Rock Permeability Modeling by Molecular Dynamics Simulation Using High-Performance Computing

“Digital rock” is a multi-purpose tool for solving a variety of tasks in the field of geological exploration and production of hydrocarbons at various stages; it is designed to improve the accuracy of geological study of subsurface resources, the efficiency of reproduction and usage of mineral resources, and the application of the results obtained in industry. This paper presents the results of numerical calculations and their comparison with full-scale natural examination. Laboratory studies have been supplemented with petrographic descriptions to deepen the insight into the behavior of the studied rock material. There is a general tendency to overestimate the permeability, which may be associated with the application of a rather crude resistive model for assessing permeability and with the fact that the porous cement has not been considered.

Vladimir Berezovsky, Marsel Gubaydullin, Alexander Yur’ev, Ivan Belozerov

### Hybrid Codes for Atomistic Simulations on the Desmos Supercomputer: GPU-acceleration, Scalability and Parallel I/O

In this paper, we compare different GPU accelerators and algorithms for classical molecular dynamics using the LAMMPS and GROMACS codes. BigDFT is considered as an example of a modern ab initio code that implements density functional theory algorithms in the wavelet basis and makes effective use of GPU acceleration. The efficiency of distributed storage managed by the BeeGFS parallel file system is analysed with respect to saving large molecular-dynamics trajectories. The results have been obtained using the Desmos supercomputer at JIHT RAS.

Nikolay Kondratyuk, Grigory Smirnov, Vladimir Stegailov

### INMOST Parallel Platform for Mathematical Modeling and Applications

In the present work we present INMOST, a programming platform for mathematical modelling, and its application to a couple of practical problems. INMOST consists of a number of tools: mesh and mesh data manipulation, automatic differentiation, linear solvers, and support for multiphysics modelling. The application of INMOST to black-oil reservoir simulation and to a blood coagulation problem is considered.

Kirill Terekhov, Yuri Vassilevski

### Maximus: A Hybrid Particle-in-Cell Code for Microscopic Modeling of Collisionless Plasmas

A second-order accurate, divergence-conserving hybrid particle-in-cell code, Maximus, has been developed for the microscopic modeling of collisionless plasmas. The main features of the code include a constrained transport algorithm for the exact conservation of magnetic field divergence, a Boris-type particle pusher, a weighted particle momentum deposit on the cells of the 3d spatial grid, the ability to model multispecies plasmas, and an adaptive time step. The code is efficiently parallelized for running on supercomputers by means of the message passing interface (MPI) technology; an analysis of parallelization efficiency and overall resource intensity is presented. A Maximus simulation of the shocked flow in the solar wind is shown to agree well with the observations of the Ion Release Module (IRM) aboard the Active Magnetospheric Particle Tracer Explorers interplanetary mission.

Julia Kropotina, Andrei Bykov, Alexandre Krassilchtchikov, Ksenia Levenfish
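The Boris-type pusher mentioned above is a standard leapfrog update that splits the Lorentz force into two half electric kicks around an exact-magnitude magnetic rotation. A sketch in its textbook form (units with charge and mass set to 1 are an assumption; this is not the Maximus source):

```python
# Sketch: Boris particle pusher, one velocity update v(t-dt/2) -> v(t+dt/2).

def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])

def boris_push(v, e, b, dt):
    vm = [v[i] + 0.5 * dt * e[i] for i in range(3)]       # half electric kick
    t = [0.5 * dt * b[i] for i in range(3)]               # rotation vector
    t2 = sum(c * c for c in t)
    s = [2.0 * c / (1.0 + t2) for c in t]
    vx = cross(vm, t)
    vprime = [vm[i] + vx[i] for i in range(3)]
    vs = cross(vprime, s)
    vplus = [vm[i] + vs[i] for i in range(3)]             # rotated velocity
    return [vplus[i] + 0.5 * dt * e[i] for i in range(3)]  # second half kick

# With E = 0 and uniform B the rotation conserves kinetic energy exactly.
v = [1.0, 0.0, 0.5]
for _ in range(1000):
    v = boris_push(v, (0.0, 0.0, 0.0), (0.0, 0.0, 2.0), 0.1)
speed2 = sum(c * c for c in v)
print(speed2)  # stays at 1.25 up to round-off
```

The exact energy conservation of the magnetic rotation is the property that makes Boris-type pushers the workhorse of particle-in-cell codes over long integration times.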

### Microwave Radiometry of Atmospheric Precipitation: Radiative Transfer Simulations with Parallel Supercomputers

In the present paper, the problems of the formation and observation of the spatial and angular distribution of the thermal radiation of a raining atmosphere in the millimeter wave band are addressed. Radiative transfer of microwave thermal radiation in a three-dimensional dichroic medium is simulated numerically using high-performance parallel computer systems. The governing role of the three-dimensional cellular inhomogeneity of the precipitating atmosphere in the formation of the thermal radiation field is shown.

Yaroslaw Ilyushin, Boris Kutuza

### Modeling Groundwater Flow in Unconfined Conditions of Variable Density Solutions in Dual-Porosity Media Using the GeRa Code

A mathematical model of variable-density flow in unconfined conditions and transport in dual-porosity media is introduced. We show the application of the model to a real object, a polygon of deep-well liquid radioactive waste injection. Several assumptions are justified to simplify the model, and it is discretized. The model is aimed at assessing the role of density changes in the contaminant propagation dynamics within the polygon. The method of parallelization is described, and the results of numerical experiments are presented herein.

Ivan Kapyrin, Igor Konshin, Vasily Kramarenko, Fedor Grigoriev

### New QM/MM Implementation of the MOPAC2012 in the GROMACS

Hybrid QM/MM simulations augmented with enhanced sampling techniques have proved to be advantageous in different usage scenarios, from studies of biological systems to drug and enzyme design. However, there are several factors that limit the applicability of the approach. First, typical biologically relevant systems are too large and hence computationally expensive for many QM methods. Second, a majority of fast non-ab-initio QM methods contain parameters for a very limited set of elements, which restrains their usage for applications involving radionuclides and other unusual compounds. Therefore, there is a persistent need for new tools that will expand both the type and the size of simulated objects. Here we present a novel combination of the widely accepted molecular modelling packages GROMACS and MOPAC2012 and demonstrate its applicability to the design of a catalytic antibody capable of organophosphorus compound hydrolysis.

Arthur O. Zalevsky, Roman V. Reshetnikov, Andrey V. Golovin

### Orlando Tools: Energy Research Application Development Through Convergence of Grid and Cloud Computing

The paper addresses the relevant problem of developing scientific applications (applied software packages) to solve large-scale problems in heterogeneous distributed computing environments that can include various infrastructures (clusters, Grid systems, clouds) and provide their integrated use. We propose a new approach to the development of applications for such environments, based on the integration of conceptual and modular programming. The application development is implemented with a special framework named Orlando Tools. In comparison to the known tools used for the development and execution of distributed applications in current practice, Orlando Tools executes application jobs in an integrated environment of virtual machines that includes both dedicated and non-dedicated resources. The distributed computing efficiency is improved through multi-agent management. Experiments in solving large-scale practical problems of energy security research show the effectiveness of the developed application in an environment that supports a hybrid computational model including Grid and cloud computing.

Alexander Feoktistov, Sergei Gorsky, Ivan Sidorov, Roman Kostromin, Alexei Edelev, Lyudmila Massel

### Parallel FDTD Solver with Static and Dynamic Load Balancing

The finite-difference time-domain method (FDTD) is widely used for modeling computational electrodynamics by numerically solving Maxwell's equations and finding an approximate solution at each time step. The overall computational time of FDTD solvers can become significant when large numerical grids are used. Parallel FDTD solvers usually help reduce the overall computational time; however, the problem of load balancing arises on parallel computational systems. Load balancing of the FDTD algorithm for homogeneous computational systems can be performed statically, before computations. In this article, static and dynamic load balancing of the FDTD algorithm for heterogeneous computational systems is described. Dynamic load balancing allows redistributing grid points between computational nodes and effectively managing computational resources during the computation for an arbitrary computational system. Dynamic load balancing can be turned into static if the data required for balancing were gathered during previous computations. Measurements for the presented algorithms are provided for the IBM Blue Gene/P supercomputer and a Tesla CMC server. Further directions for optimization are also discussed.

Gleb Balykov
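The static balancing step for a heterogeneous system can be sketched as distributing grid rows proportionally to measured node speeds (the weights and the rounding policy below are illustrative assumptions, not the article's algorithm):

```python
# Sketch: static load balancing. Each node receives a share of grid
# rows proportional to its measured speed; leftover rows from integer
# rounding go to the fastest nodes first.

def balance(total_rows, speeds):
    total_speed = sum(speeds)
    shares = [int(total_rows * s / total_speed) for s in speeds]
    leftover = total_rows - sum(shares)
    order = sorted(range(len(speeds)), key=lambda i: -speeds[i])
    for i in range(leftover):
        shares[order[i % len(speeds)]] += 1
    return shares

rows = balance(1000, [1.0, 2.0, 4.0, 1.0])  # node 2 is 4x faster
print(rows, sum(rows))
```

Dynamic balancing then amounts to re-measuring the effective speeds during the run and repartitioning with the same rule, which is also how measured data can later turn the dynamic scheme into a static one.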

### Parallel Supercomputer Docking Program of the New Generation: Finding Low Energy Minima Spectrum

The results of studies of the energy surfaces of protein-ligand complexes carried out with the help of the FLM docking program, which belongs to the new generation of gridless docking programs, are presented. It is demonstrated that the ability of the FLM docking program to find the global energy minimum is much higher than that of the “classical” SOL docking program, which uses a genetic algorithm and a precalculated grid of potentials of ligand atom interactions with the target protein. The optimal number of FLM local optimizations for the reliable finding of the global energy minimum and of all local minima with energies within 2 kcal/mol above the energy of the global minimum is found; this number is 250 thousand. For complexes with a ligand containing more than 60 atoms and having more than 12 torsions, and with more than 4500 protein atoms, the number of FLM local optimizations should be noticeably increased. There are several unique energy minima in this energy interval, and for most complexes these minima are located near the global minimum (RMSD < 3 Å). However, there are complexes where such minima are located far from the global minimum, with an RMSD (over all ligand atoms) > 5 Å.

Alexey Sulimov, Danil Kutov, Vladimir Sulimov

### Parallelization Strategy for Wavefield Simulation with an Elastic Iterative Solver

We present a parallelization strategy, via MPI and OpenMP, for our novel iterative method for simulating elastic waves in 3D inhomogeneous isotropic land media. The unique features of the solver are a preconditioner developed to assure the fast convergence of the Krylov-type iteration method at low time frequencies and the way the action of the forward modeling operator on a vector is computed. We successfully benchmark the accuracy of our solver against the exact solution and compare it to another iterative solver. The quality of the parallelization is justified by a weak and strong scaling analysis. Our modification allows simulation in large models, including a modified 2.5D Marmousi model comprising 90 million cells.

Mikhail Belonosov, Vladimir Cheverda, Victor Kostin, Dmitry Neklyudov

### Performance of Time and Frequency Domain Cluster Solvers Compared to Geophysical Applications

In the framework of frequency-domain full waveform inversion (FWI), we compare the performance of two MPI-based acoustic solvers. One of the solvers is the time-domain solver developed by the SEISCOPE consortium. The other is a frequency-domain multifrontal direct solver developed by us. For a high-contrast 3D velocity model, we perform a series of experiments with varying numbers of cluster nodes and shots, and conclude that in FWI applications the solvers complement each other in terms of performance. Theoretically, this conclusion follows from considerations of the structures of the solvers and their scalabilities. The relations between the number of cluster nodes, the size of the geophysical model, and the number of shots define which solver is preferable in terms of performance.

Victor Kostin, Sergey Solovyev, Andrey Bakulin, Maxim Dmitriev

### Population Annealing and Large Scale Simulations in Statistical Mechanics

Population annealing is a novel Monte Carlo algorithm designed for simulations of statistical mechanics systems with rugged free-energy landscapes. We discuss a realization of the algorithm for use on a hybrid computing architecture combining CPUs and GPGPUs. The particular advantage of this approach is that it is fully scalable up to many thousands of threads. We report on applications of the developed realization to several interesting problems, in particular the Ising and Potts models, and review applications of population annealing to further systems.

Lev Shchur, Lev Barash, Martin Weigel, Wolfhard Janke
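The algorithm itself is compact: lower the temperature in steps, resample the replica population according to the reweighting factors exp(-ΔβE), then equilibrate each replica with a few Metropolis moves. A toy sketch on a single continuous degree of freedom in a double-well energy (an illustrative system, not the Ising/Potts setups of the paper):

```python
import math, random

# Sketch: population annealing on E(x) = (x^2 - 1)^2, a double well.

def energy(x):
    return (x * x - 1.0) ** 2

def population_annealing(n_rep=2000, betas=(0.1, 0.5, 1.0, 3.0, 10.0), seed=1):
    rng = random.Random(seed)
    pop = [rng.uniform(-2.0, 2.0) for _ in range(n_rep)]
    for b_old, b_new in zip(betas, betas[1:]):
        # reweight by exp(-(b_new - b_old) * E) and resample the population
        ws = [math.exp(-(b_new - b_old) * energy(x)) for x in pop]
        pop = rng.choices(pop, weights=ws, k=n_rep)
        # a few Metropolis sweeps at the new inverse temperature
        for _ in range(5):
            for i in range(n_rep):
                x_try = pop[i] + rng.gauss(0.0, 0.3)
                d_e = energy(x_try) - energy(pop[i])
                if d_e <= 0 or rng.random() < math.exp(-b_new * d_e):
                    pop[i] = x_try
    return pop

pop = population_annealing()
mean_e = sum(energy(x) for x in pop) / len(pop)
print(mean_e < 0.2)  # population settles near the wells x = +/-1
```

Because every replica is reweighted and updated independently, the population maps naturally onto thousands of GPU threads, which is the scalability property the paper exploits.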

### Simulation and Optimization of Aircraft Assembly Process Using Supercomputer Technologies

Airframe assembly is mainly based on the riveting of large-scale aircraft parts, and manufacturers are highly concerned with accelerating this process. Simulation of riveting requires solving a contact problem in order to prevent the penetration of parts under the loads from fastening elements (fasteners). A specialized methodology is elaborated that allows reducing the dimension of the original problem and transforming it into a quadratic programming one, with input data provided by the disposition of fasteners and the initial gap field between the considered parts. When optimizing a manufacturing process, a detailed analysis of the assembly has to be done. This leads to series of similar computations that differ only in the input data sets provided by the variations of the gap and fastener locations. Thus, task parallelism can be exploited, and the problem can be efficiently solved by means of a supercomputer. The paper is devoted to the cluster version of the software complex developed for aircraft assembly simulation within the joint project between Peter the Great St. Petersburg Polytechnic University and Airbus SAS. The main features of the complex are described, and application cases are considered.

Tatiana Pogarskaia, Maria Churilova, Margarita Petukhova, Evgeniy Petukhov
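The quadratic programming subproblem at the heart of such a contact formulation can be sketched as minimizing 0.5·xᵀQx − bᵀx subject to non-penetration constraints x ≥ 0. The tiny matrices and the projected-gradient solver below are illustrative assumptions, not the project's actual formulation or code:

```python
# Sketch: QP with non-negativity constraints via projected gradient descent.

def qp_projected_gradient(q_mat, b, iters=5000, lr=0.1):
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        # gradient of 0.5*x'Qx - b'x is Qx - b
        g = [sum(q_mat[i][j] * x[j] for j in range(n)) - b[i] for i in range(n)]
        # gradient step, then projection onto the feasible set x >= 0
        x = [max(0.0, x[i] - lr * g[i]) for i in range(n)]
    return x

q_mat = [[2.0, 0.0], [0.0, 2.0]]
b = [4.0, -2.0]
x = qp_projected_gradient(q_mat, b)
print([round(v, 3) for v in x])  # unconstrained optimum (2, -1) clipped to (2, 0)
```

Each variation of the gap field and fastener layout yields one such QP with different data, and since the instances are independent, they parallelize trivially across cluster nodes — the task parallelism described above.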

### SL-AV Model: Numerical Weather Prediction at Extra-Massively Parallel Supercomputer

The SL-AV global atmosphere model is used for operational medium-range and long-range forecasts at the Hydrometcentre of Russia. The program complex uses a combination of the MPI and OpenMP technologies. Currently, a new version of the model with a horizontal resolution of about 10 km is being developed. In 2017, preliminary experiments showed the scalability of the SL-AV model program complex up to 9000 processor cores with an efficiency of about 45% for grid dimensions of 3024 × 1513 × 51. The profiling analysis for these experiments revealed bottlenecks in the code: non-optimal memory access in OpenMP threads in some parts of the code, time losses in the MPI data exchanges in the dynamical core, and the necessity to replace some numerical algorithms. A review of the model code improvements targeting increased parallel efficiency is presented. The new code is tested on the new Cray XC40 supercomputer installed at the Roshydromet Main Computer Center.

### Supercomputer Simulation Study of the Convergence of Iterative Methods for Solving Inverse Problems of 3D Acoustic Tomography with the Data on a Cylindrical Surface

This paper is dedicated to developing effective methods of 3D acoustic tomography. The inverse problem of acoustic tomography is formulated as a coefficient inverse problem for a hyperbolic equation, where the speed of sound and the absorption factor in three-dimensional space are unknown. Substantial difficulties in solving this inverse problem are due to its nonlinear nature. A method which uses short sounding pulses of two different central frequencies is proposed. The method employs an iterative parallel gradient-based minimization algorithm at the higher frequency, with the initial approximation of the unknown coefficients obtained by solving the inverse problem at the lower frequency. The efficiency of the proposed method is illustrated via a model problem. In the model problem, an easy-to-implement 3D tomographic scheme is used, with the data specified at a cylindrical surface. The developed algorithms can be efficiently parallelized using GPU clusters. Computer simulations show that a GPU cluster is capable of performing 3D image reconstruction within a reasonable time.
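
The two-frequency strategy can be illustrated on a toy one-dimensional problem: gradient descent on a smooth low-frequency misfit supplies the initial guess for the oscillatory high-frequency misfit, where a cold start may stall in a local minimum. The misfit function, step sizes and iteration counts below are illustrative assumptions, not the authors' actual functional.

```python
import math

def misfit(c, freq, c_true=1.5):
    # A toy misfit in one unknown (sound speed c); a higher sounding
    # frequency makes the landscape more oscillatory (more local minima).
    return (c - c_true) ** 2 + 0.1 * (1.0 - math.cos(freq * (c - c_true)))

def gradient_descent(f, x0, freq, lr, steps=2000, h=1e-6):
    x = x0
    for _ in range(steps):
        grad = (f(x + h, freq) - f(x - h, freq)) / (2 * h)  # central difference
        x -= lr * grad
    return x

# Stage 1: smooth low-frequency problem solved from a crude starting point.
c_low = gradient_descent(misfit, x0=0.0, freq=5.0, lr=0.05)
# Stage 2: high-frequency refinement seeded by the stage-1 result.
c_high = gradient_descent(misfit, x0=c_low, freq=50.0, lr=0.002)
```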

Sergey Romanov

### Supercomputer Technology for Ultrasound Tomographic Image Reconstruction: Mathematical Methods and Experimental Results

This paper is concerned with layer-by-layer ultrasound tomographic imaging methods for the differential diagnosis of breast cancer. The inverse problem of ultrasound tomography is formulated as a coefficient inverse problem for a hyperbolic differential equation. The scalar mathematical model takes into account wave phenomena such as diffraction, refraction, multiple scattering, and absorption of ultrasound. The algorithms were tested on real data obtained in experiments on a test bench for ultrasound tomography studies. Low-frequency ultrasound in the 100–500 kHz band was used for sounding. An important result of this study is an experimental confirmation of the adequacy of the underlying mathematical model. The ultrasound tomographic imaging methods developed have a spatial resolution of 2 mm, which is acceptable for medical diagnostics. The experiments were carried out using phantoms with parameters close to the acoustical properties of human soft tissues. The image reconstruction algorithms are designed for graphics processors. An architecture of a GPU cluster for ultrasound tomographic imaging is proposed, which can be employed as a computing device in a tomographic complex.

Alexander Goncharsky, Sergey Seryozhnikov

### The Parallel Hydrodynamic Code for Astrophysical Flow with Stellar Equations of State

In this paper, a new calculation method for numerical simulation of astrophysical flows on supercomputers is described. The co-design of parallel numerical algorithms for astrophysical simulations is described in detail. The hydrodynamical numerical model with stellar equations of state (EOS), numerical methods for solving the hyperbolic equations, and a short description of the parallel implementation of the code are presented. For problems using large amounts of RAM, for example the collapse of a molecular cloud core, our code was upgraded to support Intel Memory Drive Technology (IMDT). In this paper, we present the results of some IMDT performance tests based on the Siberian Supercomputer Center facilities equipped with Intel Optane memory. The results of numerical experiments on hydrodynamical simulations of a model stellar explosion are presented.

Igor Kulikov, Igor Chernykh, Vitaly Vshivkov, Vladimir Prigarin, Vladimir Mironov, Alexander Tutukov

### Three-Dimensional Simulation of Stokes Flow Around a Rigid Structure Using FMM/GPU Accelerated BEM

Composite materials play an important role in the aircraft, space, automotive and wind power industries. One of the most commonly used methods for the manufacture of composite materials is the impregnation of dry textiles by a viscous liquid binder. During the process, cavities (voids) of various sizes can form and then move in the liquid resin flowing through the complex system of channels formed by textile fibers. The presence of such cavities results in a substantial deterioration of the mechanical properties of the composites. As a result, the development and efficient implementation of numerical methods for 3D simulation of viscous liquid flow around rigid structures of various configurations is an important problem. In the present study, a mathematical model and its efficient numerical implementation for the study of hydrodynamic processes around a fixed structure at low Reynolds numbers are considered. The developed approach is based on the boundary element method for 3D problems, accelerated both via an advanced scalable algorithm (FMM) and via utilization of a heterogeneous computing architecture (multicore CPUs and graphics processors). This enables direct large-scale simulations on a personal workstation, which is confirmed by test and demo computations. The simulation results, details of the method, and the accuracy/performance of the algorithm are discussed. The results of the research may be used for the solution of problems related to microfluidic device construction and the theory of composite materials production, and are of interest for computational hydrodynamics as a whole.

Olga A. Abramova, Yulia A. Pityuk, Nail A. Gumerov, Iskander Sh. Akhatov

### Using of Hybrid Cluster Systems for Modeling of a Satellite and Plasma Interaction by the Molecular Dynamics Method

This article deals with a model of the interaction between a positively charged microsatellite and thermal space plasma. The model is based on the method of molecular dynamics (MMD). The minimum number of particles necessary for modeling, in the simplest geometric formulation for a microsatellite in the form of a sphere 10 cm in diameter, is ten million. This value is determined by the plasma parameters, including the value of the Debye radius, which is the basis for estimating the spatial dimensions of the modeling domain. For the solution, MPI and CUDA technologies are used, with one MPI process per node. An intermediate layer of multithreading, implemented on the basis of the C++ thread library, is also used; this provides more flexible control over all kinds of node memory (video memory, RAM), which yields a performance boost. On the GPU, the use of shared memory is optimized, and register allocation between threads and the peculiarities of calculating trigonometric functions are taken into account. The results of numerical simulation for a single-ion thermal plasma showed significant changes in the spatial distribution of the concentration around the satellite, which depend on three main parameters: the temperature of the plasma components, the velocity of the satellite relative to the plasma, and the potential of the spacecraft. The presence of a region of reduced ion concentration behind the satellite and a region of condensation in front of it is shown.

Leonid Zinin, Alexander Sharamet, Sergey Ishanov

### High Performance Architectures, Tools and Technologies

#### Frontmatter

This work describes a BOINC-based Desktop Grid implementation of an adaptive task scheduling algorithm for virtual drug screening. The algorithm is based on a game-theoretical mathematical model in which computing nodes act as players. The model allows adjusting the balance between the results retrieval rate and the search space coverage. We present the developed scheduling algorithm for a BOINC-based Desktop Grid and evaluate its performance by simulations. Experimental analysis shows that the proposed scheduling algorithm allows adjusting the results retrieval rate and the search space coverage in a flexible way so as to reach the maximal efficiency of a BOINC-based Desktop Grid.

Natalia Nikitina, Evgeny Ivashko

### Advanced Vectorization of PPML Method for Intel® Xeon® Scalable Processors

The Piecewise Parabolic Method on a Local Stencil (PPML) is very useful for numerical simulation of fluid dynamics and astrophysics problems. The main idea of the PPML method is the use of a piecewise parabolic numerical solution from the previous time step for computing the Riemann problem when solving a system of partial differential equations (PDEs). In this paper, we present a new version of the PDE solver based on the PPML method, optimized for the Intel Xeon Scalable processor family. The results of a performance comparison between different types of AVX-512 compatible Intel Xeon Scalable processors are presented. Special attention is paid to comparing the performance of Intel Xeon Phi (KNL) and Intel Xeon Scalable processors.

Igor Chernykh, Igor Kulikov, Boris Glinsky, Vitaly Vshivkov, Lyudmila Vshivkova, Vladimir Prigarin

### Analysis of Results of the Rating of Volunteer Distributed Computing Projects

Volunteer distributed computing (VDC) is a fairly popular way of conducting large scientific experiments. The organization of computational experiments on a certain subject implies the creation of a volunteer distributed computing project, in which computing resources are provided by volunteers. The community of volunteers numbers several million people around the world. To increase the computing power of a volunteer distributed computing project, technical methods for increasing the efficiency of computation can be used. However, no less important are methods of attracting new volunteers and motivating this virtual community to provide computing resources. The organizers of VDC projects are, as a rule, experts in applied fields, but not in the organization of volunteer distributed computing. To assist the organizers of VDC projects, the authors conducted a sociological study to determine the motivation of volunteers and created a multiparameter method and rating for evaluating various VDC projects. This article proposes a method for assessing the strengths and weaknesses of VDC projects based on this approach. The results of the multiparameter evaluation and rating of projects can help the organizers of VDC projects to increase the efficiency of computations, and provide the community of volunteers with a tool for comparing the various VDC projects.

Vladimir N. Yakimets, Ilya I. Kurochkin

### Application of the LLVM Compiler Infrastructure to the Program Analysis in SAPFOR

The paper proposes an approach to the implementation of program analysis in SAPFOR (System FOR Automated Parallelization). This is a software development suite focused on reducing the cost of manual program parallelization. It was primarily designed to perform source-to-source transformation of a sequential program for execution on parallel architectures with distributed memory. The LLVM (Low Level Virtual Machine) compiler infrastructure is used to examine a program. This paper focuses on establishing a correspondence between the properties of the program in the programming language and the properties of its low-level representation.

Nikita Kataev

### Batch of Tasks Completion Time Estimation in a Desktop Grid

This paper describes a statistical approach to estimating the completion time of a batch of tasks in a Desktop Grid. The approach is based on the Holt model. The results of numerical experiments based on statistics from the RakeSearch and LHC@home volunteer computing projects are given.
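
The Holt model mentioned above is standard linear-trend (double) exponential smoothing; a minimal sketch of how it can project task throughput is shown below. The smoothing parameters and sample counts are illustrative assumptions, not values from the paper.

```python
def holt_forecast(series, alpha=0.5, beta=0.3, horizon=1):
    """Smooth `series` with Holt's linear-trend model and
    forecast `horizon` steps ahead."""
    level, trend = series[0], series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        level = alpha * y + (1 - alpha) * (level + trend)          # level update
        trend = beta * (level - prev_level) + (1 - beta) * trend   # trend update
    return level + horizon * trend

# Cumulative completed tasks per monitoring interval (made-up numbers):
completed = [10.0, 22.0, 31.0, 44.0, 52.0]
projection = holt_forecast(completed, horizon=3)
```

Given the total batch size, the completion-time estimate is then the smallest horizon at which the projection reaches that size.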

Evgeny Ivashko, Valentina Litovchenko

### BOINC-Based Branch-and-Bound

The paper proposes an implementation of the Branch-and-Bound method for an enterprise grid based on the BOINC infrastructure. The load distribution strategy and the overall structure of the developed system are described, with special attention paid to specific issues such as incumbent updating and load distribution. The implemented system was experimentally tested on a moderate-size enterprise grid. The achieved results demonstrate an adequate efficiency of the proposed approach.

Andrei Ignatov, Mikhail Posypkin

### Comprehensive Collection of Time-Consuming Problems for Intensive Training on High Performance Computing

Training specialists capable of applying the models, methods, technologies and tools of parallel computing to solve problems is of great importance for further progress in many areas of modern science and technology. High-quality training of such engineers requires the development of an appropriate curriculum, largely focused on practice. In this paper, we present a new handbook of problems on parallel computing. The book contains methodological materials, problems, and examples of their solution. The final section describes the automatic solution verification software. The handbook of problems will be employed to train students of the Lobachevsky University of Nizhni Novgorod.

Iosif Meyerov, Sergei Bastrakov, Alexander Sysoyev, Victor Gergel

### Dependable and Coordinated Resources Allocation Algorithms for Distributed Computing

In this work, we introduce slot selection and co-allocation algorithms for parallel jobs in distributed computing with non-dedicated and heterogeneous resources. A single slot is a time span that can be assigned to a task, which is a part of a parallel job. The job launch requires a co-allocation of a specified number of slots starting and finishing synchronously. Some existing resource co-allocation algorithms assign a job to the first set of slots matching the resource request without any optimization (the first fit type), while other algorithms are based on an exhaustive search. In this paper, algorithms for efficient, dependable and coordinated slot selection are studied and compared with known approaches. The novelty of the proposed approach is in a general algorithm efficiently selecting a set of slots according to the specified criterion.
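
The "first fit" baseline that the paper contrasts with can be sketched as follows: scan candidate start times and take the first moment at which enough resources have a free slot covering a synchronous window, with no optimization criterion applied. The data layout and names below are illustrative assumptions.

```python
def first_fit_window(slots_by_cpu, n, duration):
    """Earliest start time t such that at least n resources each have a
    free slot covering [t, t + duration] (synchronous co-allocation).
    slots_by_cpu: one list of free (start, end) intervals per resource."""
    candidates = sorted({s for cpu in slots_by_cpu for s, _ in cpu})
    for t in candidates:
        fitting = [i for i, cpu in enumerate(slots_by_cpu)
                   if any(s <= t and t + duration <= e for s, e in cpu)]
        if len(fitting) >= n:
            return t, fitting[:n]   # first fit: no cost/criterion optimization
    return None

free = [[(0, 5), (8, 20)], [(2, 12)], [(0, 30)]]
window = first_fit_window(free, n=2, duration=6)  # -> (2, [1, 2])
```

The algorithms studied in the paper replace this greedy choice with an efficient selection of slots according to a specified criterion.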

Victor Toporkov, Dmitry Yemelyanov

### Deploying Elbrus VLIW CPU Ecosystem for Materials Science Calculations: Performance and Problems

Modern Elbrus-4S and Elbrus-8S processors show floating-point performance comparable to popular Intel processors in the field of high-performance computing. Tasks able to take advantage of the VLIW architecture show even greater efficiency on Elbrus processors. In this paper, the efficiency of the most popular materials science codes in the fields of classical molecular dynamics and quantum-mechanical calculations is considered. A comparative analysis of the performance of these codes on Elbrus processors and other modern processors is carried out.

### Design Technology for Reconfigurable Computer Systems with Immersion Cooling

In this paper, we consider the implementation of reconfigurable computer systems based on advanced Xilinx UltraScale and UltraScale+ FPGAs, and a design method for immersion cooling systems for computers containing 96–128 chips. We propose selection criteria for the key technical solutions for creating high-performance computer systems with liquid cooling. The construction of the computational block prototype and the results of its experimental thermal testing are presented. The results demonstrate the high energy efficiency of the proposed open cooling system and the existence of a power reserve for next-generation FPGAs. Effective cooling of 96–128 FPGAs with a total thermal power of 9.6–12.8 kW in a 3U computational module is the key feature of the considered system. Insensitivity to leakages and their consequences, and compatibility with traditional water cooling systems based on industrial chillers, are the advantages of the developed technical solution. These features allow the installation of liquid-cooled computer systems with no fundamental change to the computer hall infrastructure.

Ilya Levin, Alexey Dordopulo, Alexander Fedorov, Yuriy Doronchenko

### Designing a Parallel Programs on the Base of the Conception of Q-Determinant

The paper describes a method for designing parallel programs for numerical algorithms based on their representation in the form of a Q-determinant. The result of the method is a Q-effective program, which uses the parallelism resource of the algorithm completely. The results of this research can be applied to increase the efficiency of implementing algorithms on parallel computing systems. This should help to improve the performance of parallel computing systems.

Valentina Aleeva

### Enumeration of Isotopy Classes of Diagonal Latin Squares of Small Order Using Volunteer Computing

The paper is devoted to discovering new features of diagonal Latin squares of small order. We present an algorithm, based on a special kind of transformations, that constructs a canonical form of a given diagonal Latin square. Each canonical form corresponds to one isotopy class of diagonal Latin squares. The algorithm was implemented and used to enumerate the isotopy classes of diagonal Latin squares of order at most 8. For order 8 the computational experiment was conducted in a volunteer computing project. The algorithm was also used to estimate how long it would take to enumerate the isotopy classes of diagonal Latin squares of order 9 in the same volunteer computing project.
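
The canonical-form idea can be sketched generically: apply every transformation from the equivalence set to a square and keep the lexicographically smallest image; two squares belong to the same class exactly when their canonical forms coincide. The transformation set below (transposition plus symbol relabelling) is a simplified illustration, not the special transformations the paper uses for diagonal Latin squares.

```python
from itertools import permutations

def transpose(sq):
    return [list(col) for col in zip(*sq)]

def relabel(sq, perm):
    return [[perm[x] for x in row] for row in sq]

def canonical_form(sq):
    """Lexicographically smallest square equivalent to `sq` under the
    chosen transformation set; equal canonical forms <=> same class."""
    n = len(sq)
    best = None
    for base in (sq, transpose(sq)):
        for perm in permutations(range(n)):
            cand = tuple(tuple(row) for row in relabel(base, perm))
            if best is None or cand < best:
                best = cand
    return best

a = [[0, 1, 2], [2, 0, 1], [1, 2, 0]]
b = relabel(transpose(a), (2, 0, 1))          # an equivalent square
same_class = canonical_form(a) == canonical_form(b)
```

Enumerating isotopy classes then amounts to counting the distinct canonical forms produced over all squares of a given order.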

Eduard Vatutin, Alexey Belyshev, Stepan Kochemazov, Oleg Zaikin, Natalia Nikitina

### Interactive 3D Representation as a Method of Investigating Information Graph Features

The information graph of an algorithm is a structure of wide variety. It can tell a lot about algorithm features, such as computational complexity and the resource of parallelism, as well as about blocks of sequential operations within an algorithm. Graphs of different algorithms often share similar regular structures; their presence is an indicator of potentially similar algorithm behavior. A convenient, interactive 3D representation of an information graph is a good method of researching it: it can demonstrate the algorithm characteristics listed above and its structural features. In this article we investigate an approach to creating such representations, implement it in our AlgoView system, and give examples of using the resulting tool.

Alexander Antonov, Nikita Volkov

### On Sharing Workload in Desktop Grids

We consider two optimization problems of the trade-off between the risk of not getting an answer (due to failures or errors) and precision or accuracy in Desktop Grid computing. The models, while quite simple, are general enough to be applicable to optimizing real systems. We support the assumptions made with statistics collected in a Desktop Grid computing project.

Ilya Chernov

### On-the-Fly Calculation of Performance Metrics with Adaptive Time Resolution for HPC Compute Jobs

Performance monitoring is a method to debug performance issues in different types of applications. It uses various performance metrics obtained from the servers the application runs on, and may also use metrics produced by the application itself. The common approach to building performance monitoring systems is to store all the data in a database, then retrieve the data corresponding to a specific job and perform an analysis using that portion of the data. This approach works well when the data stream is not very large. For a large performance monitoring data stream it incurs much IO and imposes high requirements on the storage systems which process the data. In this paper we propose an adaptive on-the-fly approach to performance monitoring of High Performance Computing (HPC) compute jobs which significantly lowers the data streams to be written to storage. We used this approach to implement a performance monitoring system for an HPC cluster to monitor compute jobs. The output of our performance monitoring system is a time-series graph representing aggregated performance metrics for the job. The time resolution of the resulting graph is adaptive and depends on the duration of the analyzed job.
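
The adaptive-resolution idea can be sketched with a fixed point budget: a metric series is kept at full resolution until the budget is reached, then adjacent points are merged, so a long job ends up with coarser but bounded-size data. This is a schematic reconstruction under stated assumptions, not the authors' implementation.

```python
class AdaptiveSeries:
    """Keep at most `budget` aggregated points for a metric stream;
    when the buffer fills, adjacent points are merged (averaged), so
    time resolution adapts to job duration. `budget` must be even."""
    def __init__(self, budget=8):
        self.budget = budget
        self.step = 1          # raw samples aggregated into one point
        self.points = []       # list of (mean, sample_count) pairs
        self._acc = None       # (sum, count) of the point being filled

    def add(self, value):
        s, c = self._acc or (0.0, 0)
        s, c = s + value, c + 1
        if c == self.step:
            self.points.append((s / c, c))
            self._acc = None
            if len(self.points) == self.budget:
                # halve the resolution: merge adjacent pairs of points
                self.points = [((m1 * c1 + m2 * c2) / (c1 + c2), c1 + c2)
                               for (m1, c1), (m2, c2) in
                               zip(self.points[::2], self.points[1::2])]
                self.step *= 2
        else:
            self._acc = (s, c)

series = AdaptiveSeries(budget=4)
for v in range(16):          # 16 raw samples squeezed into <= 4 points
    series.add(float(v))
```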

### Residue Logarithmic Coprocessor for Mass Arithmetic Computations

The work is aimed at solving urgent problems of modern high-performance computing. The purpose of the study is to increase the speed, accuracy and reliability of mass arithmetic calculations. To achieve this goal, the author's methods of performing operations and transforming data in the prospective residue logarithmic number system are used. This number system makes it possible to combine the advantages of two non-conventional number systems: the residue number system and the logarithmic number system. The subject of study is a parallel-pipelined coprocessor implementing the proposed calculation methods. The study was carried out using the theory of computer design, as well as methods and means of experimental analysis of computers and systems. As a result of the research and development, new scientific and technical solutions are proposed that implement the proposed methods of data computation and coding. The proposed coprocessor offers higher speed, accuracy and reliability in processing real operands in comparison with known analogs based on the positional floating-point number system.
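
The appeal of the logarithmic component can be shown in a few lines: storing a positive value by its base-2 logarithm turns multiplication and division into addition and subtraction, while addition requires a Gaussian-logarithm correction. The float-based sketch below illustrates only this arithmetic; the residue (RNS) encoding of the exponent described in the paper is not modeled here.

```python
import math

def to_lns(x):
    return math.log2(x)          # store a positive value as log2(x)

def from_lns(e):
    return 2.0 ** e

def lns_mul(e1, e2):
    return e1 + e2               # x1 * x2  ->  exponent addition

def lns_add(e1, e2):
    hi, lo = max(e1, e2), min(e1, e2)
    # x1 + x2 -> hi + log2(1 + 2^(lo - hi)), the "Gaussian logarithm"
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))

product = from_lns(lns_mul(to_lns(6.0), to_lns(4.0)))  # ~24.0
total = from_lns(lns_add(to_lns(6.0), to_lns(4.0)))    # ~10.0
```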

Ilya Osinin

### Supercomputer Efficiency: Complex Approach Inspired by Lomonosov-2 History Evaluation

These days the number of supercomputer users and the jobs they execute is rapidly growing, especially for supercomputers providing computing time to external users. Supercomputers and their computing time are highly expensive, so their efficiency is crucial for both users and owners. There are several ways to increase operational efficiency; however, in most cases this involves a trade-off between efficiency metrics. This brings about a need to define "efficiency" in each specific case. We use historical data from the two largest Russian supercomputers to create a number of metrics in order to provide a definition of resource management "efficiency". The data from both the Lomonosov and Lomonosov-2 supercomputers consists of over one year of job execution history. Lomonosov and Lomonosov-2 efficiency in terms of CPU-hours utilization is considerably high; nevertheless, our global goal is to offer a way to maintain or improve this metric while maximizing the others examined in the paper.

Sergei Leonenkov, Sergey Zhumatiy

### Supercomputer Real-Time Experimental Data Processing: Technology and Applications

The study is focused on the technology of remote real-time processing of intensive data streams from experimental stands using supercomputers. The structure of the data distribution system, the software for data processing, and the optimized PIV algorithm are presented. Real-time data processing makes it possible to realize experiments with feedback, where external forcing depends on internal characteristics of the system. This technique is demonstrated in an experimental study of intensive cyclonic vortex formation from a localized heat source in a rotating layer of fluid. In this study the heating intensity depends on the velocity of the flow. The flow characteristics obtained by supercomputer real-time processing of PIV images are used as input parameters for the heating system. The concept of using the developed technology in the experimental stands of the aircraft industry is also described.

Vladislav A. Shchapov, Alexander M. Pavlinov, Elena N. Popova, Andrei N. Sukhanovskii, Stanislav L. Kalyulin, Vladimir Ya. Modorskii

### The Conception, Requirements and Structure of the Integrated Computational Environment

The general conception, main requirements and functional architecture of the integrated computational environment (ICE) for high-performance mathematical modeling of a wide class of multi-physics processes and phenomena on modern and future post-petaflops supercomputers are considered. The new-generation problems to be solved are described by multi-dimensional direct and inverse statements for systems of nonlinear differential and/or integral equations, as well as by variational and discrete inequalities. The objective of the ICE is to support all the main technological stages of large-scale computational experiments and to provide a permanent and extendable mathematical innovation structure for wide groups of users from various fields, based on advanced software and on the integration of external products. The technical requirements and architectural solutions of the proposed project are discussed.

V. P. Il’in

### The Elbrus-4C Based Node as Part of Heterogeneous Cluster for Oil and Gas Processing Researches

This paper briefly examines the advantages and disadvantages of the Elbrus architecture as a building block for a seismic processing cluster system. The configuration of a heterogeneous cluster system built for a research Oil and Gas company is examined in more detail. In this system, processing nodes of different architectures (x86, GPU and e2k) are integrated into a single computing cluster through a high-performance network. A heterogeneous cluster with an Elbrus node provides a good opportunity for cross-architectural software migration. To demonstrate the potential of Elbrus nodes, a multispectral data analysis application has been optimized for the e2k architecture. The paper includes performance results and scalability analysis for the implemented module using e2k and x86 nodes. It is anticipated that the heterogeneous cluster with an Elbrus node will form an integral part of the preparation process for the domestic supercomputer under development, based on Elbrus processors. The basic seismic processing software stack will naturally emerge from the use of the Elbrus node as part of the heterogeneous cluster.

Ekaterina Tyutlyaeva, Igor Odintsov, Alexander Moskovsky, Sergey Konyukhov, Alexander Kalyakin, Murad I. Neiman-zade

### The Multi-level Adaptive Approach for Efficient Execution of Multi-scale Distributed Applications with Dynamic Workload

Today, advanced research is based on complex simulations which require a lot of computational resources that are usually organized in a very complicated way from a technical point of view. This means that a scientist from physics, biology or even sociology has to struggle with all the technical issues on the way to building a distributed multi-scale application supported by a stack of specific technologies on high-performance clusters. As a result, the created applications have partly implemented logic and are extremely inefficient in execution. In this paper, we present an approach which frees the user from the necessity to care about efficiently resolving the imbalance of computations performed in different processes and on different scales of the application. An efficient balance of internal workload in distributed multi-scale applications may be achieved by introducing: a special multi-level model; a contract (or domain-specific language) to formulate the application in terms of this model; and a scheduler which operates on top of that model. The multi-level model consists of computing routines, computational resources and executed processes; it determines a mapping between them and serves as a means to evaluate the resulting performance of the whole application and its individual parts. The contract corresponds to a unified interface for application integration in the proposed framework, while the scheduling algorithm optimizes the execution process taking into consideration the main aspects of the computational environment.

Denis Nasonov, Nikolay Butakov, Michael Melnik, Alexandr Visheratin, Alexey Linev, Pavel Shvets, Sergey Sobolev, Ksenia Mukhina

### Using Resources of Supercomputing Centers with Everest Platform

High-performance computing plays an increasingly important role in modern science and technology. However, the lack of convenient interfaces and automation tools greatly complicates the widespread use of HPC resources among scientists. The paper presents an approach to solving these problems that relies on Everest, a web-based distributed computing platform. The platform enables convenient access to HPC resources by means of domain-specific computational web services, development and execution of many-task applications, and pooling of multiple resources for running distributed computations. The paper describes the improvements that have been made to the platform based on the experience of integration with the resources of supercomputing centers. The use of HPC resources via Everest is demonstrated with the example of a loosely coupled many-task application for solving global optimization problems.

Sergey Smirnov, Oleg Sukhoroslov, Vladimir Voloshinov