
2016 | Book

Supercomputing

Second Russian Supercomputing Days, RuSCDays 2016, Moscow, Russia, September 26–27, 2016, Revised Selected Papers


About this book

This book constitutes the refereed proceedings of the Second Russian Supercomputing Days, RuSCDays 2016, held in Moscow, Russia, in September 2016.

The 28 revised full papers presented were carefully reviewed and selected from 32 submissions. The papers are organized in two topical sections: the present of supercomputing (large tasks solving experience) and the future of supercomputing (new technologies).

Table of Contents

Frontmatter

The Present of Supercomputing: Large Tasks Solving Experience

Frontmatter
Accelerating Assembly Operation in Element-by-Element FEM on Multicore Platforms
Abstract
The speedup of element-by-element FEM algorithms depends not only on the peak processor performance but also on the access time to shared mesh data. Eliminating memory boundness would significantly speed up unstructured mesh computations on hybrid multicore architectures, where the gap between processor and memory performance continues to grow. The speedup can be achieved by ordering unknowns so that only elements that have no common nodes are processed in parallel. If vectors are composed with respect to this ordering, memory conflicts are minimized. The mesh was partitioned into layers using the neighborhood relationship. We evaluated several partitioning schemes (block, odd-even parity, and their modifications) on multicore platforms, using Gunther’s Universal Law of Computational Scalability. We performed numerical experiments with element-by-element matrix-vector multiplication on unstructured meshes on multicore processors accelerated by MIC and GPU. We achieved a 5-fold speedup on CPU, 40-fold on MIC, and 200-fold on GPU.
Sergey Kopysov, Alexander Novikov, Nikita Nedozhogin, Vladimir Rychkov
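The abstract above evaluates partitioning schemes with Gunther's Universal Law of Computational Scalability. As a minimal sketch of that law in its commonly cited form, C(N) = N / (1 + sigma (N - 1) + kappa N (N - 1)), the following Python functions compute the predicted relative capacity and the thread count at which it peaks. The function names and the parameter values in the usage lines are hypothetical and are not taken from the paper.

```python
import math


def usl_speedup(n, sigma, kappa):
    """Relative capacity C(n) under the Universal Scalability Law.

    n     -- number of threads or processors (n >= 1)
    sigma -- contention coefficient (serialized fraction)
    kappa -- coherency coefficient (cost of keeping shared data consistent)
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1))


def usl_peak(sigma, kappa):
    """Real-valued n at which C(n) is maximal (requires kappa > 0)."""
    return math.sqrt((1.0 - sigma) / kappa)


if __name__ == "__main__":
    # Hypothetical coefficients, chosen only to show the shape of the curve.
    sigma, kappa = 0.05, 0.001
    for n in (1, 2, 4, 8, 16, 32, 64):
        print(n, round(usl_speedup(n, sigma, kappa), 2))
    print("peak near n =", round(usl_peak(sigma, kappa), 1))
```

With kappa set to zero the expression reduces to Amdahl's law; a nonzero kappa models the retrograde scaling that shared mesh data can cause.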
Block Lanczos–Montgomery Method with Reduced Data Exchanges
Abstract
We propose a new implementation of the block Lanczos–Montgomery method with reduced data exchanges for the solution of large linear systems over finite fields. The theoretical estimates obtained for parallel complexity indicate that the data exchanges in the proposed implementation for record-high matrix sizes and block sizes 50 require less time than those in ideally parallelizable computations with dense blocks. According to numerical results, the acceleration depends almost linearly on the number of cores (up to 2,000 cores). Then, the dependence becomes close to the square root of the number of cores.
Nikolai Zamarashkin, Dmitry Zheltkov
ChronosServer: Fast In Situ Processing of Large Multidimensional Arrays with Command Line Tools
Abstract
Explosive growth of raster data volumes in numerical simulations, remote sensing and other fields stimulates the development of new efficient data processing techniques. For example, the in-situ approach queries data in diverse file formats, avoiding a time-consuming import phase. However, after data are read from a file, their further processing always takes place with code developed almost from scratch. Standalone command line tools are one of the most popular ways of in-situ processing of raster files. Decades of development and feedback have resulted in numerous feature-rich, elaborate, free and quality-assured tools optimized mostly for a single machine. The paper reports the current development state and first performance evaluation results of ChronosServer, a distributed system that partially delegates in-situ raster data processing to external tools. The new delegation approach is anticipated to readily provide a rich collection of raster operations at scale. ChronosServer already outperforms a state-of-the-art array DBMS by up to 193× on a single machine.
Ramon Antonio Rodriges Zalipynis
Dynamics of Formation and Fine Structure of Flow Pattern Around Obstacles in Laboratory and Computational Experiment
Abstract
Non-stationary dynamics and structure of stratified and homogeneous fluid flows around a plate and a wedge were studied on the basis of the fundamental set of equations using methods of laboratory and numerical modeling. Fields of various physical variables and their gradients were visualized over a wide range of the problem parameters. Eigen temporal and spatial scales of large (vortices, internal waves, wake) and fine flow components were defined. The same system of equations and numerical algorithm were used for the whole range of the parameters under consideration. The computation results are in good agreement with the data of laboratory experiments.
Yu. D. Chashechkin, Ya. V. Zagumennyi, N. F. Dimitrieva
EnOI-Based Data Assimilation Technology for Satellite Observations and ARGO Float Measurements in a High Resolution Global Ocean Model Using the CMF Platform
Abstract
A parallel implementation of the ensemble optimal interpolation (EnOI) data assimilation method for a high resolution general circulation ocean model is presented. The data assimilation algorithm is formulated as a service block of the Compact Modeling Framework (CMF 3.0), developed to provide the software environment for stand-alone and coupled models of global geophysical fluids. In CMF 3.0 the direct MPI approach is replaced by the PGAS communication paradigm implemented in the third-party Global Arrays (GA) toolkit, and multiple coupler functions are encapsulated in a set of simultaneously working parallel services. Performance tests of the data assimilation system have been carried out on the Lomonosov supercomputer.
Maxim Kaurkin, Rashit Ibrayev, Alexandr Koromyslov
Experience of Direct Numerical Simulation of Turbulence on Supercomputers
Abstract
Direct Numerical Simulation (DNS), i.e., numerical integration of the unsteady 3D Navier-Stokes equations, is the most rigorous approach to turbulence simulation, which ensures an accurate prediction of the characteristics of turbulent flows of any level of complexity. However, its application to complex real-life flows, e.g. the flows past airplanes, cars, etc., demands huge computational resources and, even according to optimistic assessments of the rate of growth of computer power, will become possible only at the end of the current century. Nonetheless, the most powerful supercomputers already allow DNS of some flows of high practical importance. The present paper demonstrates this with an example of DNS of a high Reynolds number transonic flow over a circular cylinder with an axisymmetric bump. The flow features shock wave formation and its interaction with a turbulent boundary layer on the cylinder surface, a phenomenon of great importance for the aerodynamic design of commercial airplanes.
Kirill V. Belyaev, Andrey V. Garbaruk, Mikhail L. Shur, Mikhail Kh. Strelets, Philippe R. Spalart
GPU-Accelerated Molecular Dynamics: Energy Consumption and Performance
Abstract
Energy consumption of hybrid systems is a pressing problem of modern high-performance computing. The trade-off between power consumption and performance becomes more and more prominent. In this paper, we discuss the energy and power efficiency of two modern hybrid minicomputers, Jetson TK1 and TX1. We use the Empirical Roofline Tool to obtain peak performance data and the molecular dynamics package LAMMPS as an example of a real-world benchmark. Using a precise wattmeter, we measure the Jetsons' power consumption profiles. The effectiveness of DVFS is examined as well. We determine the optimal GPU and DRAM frequencies that give the minimum energy-to-solution value.
Vyacheslav Vecher, Vsevolod Nikolskii, Vladimir Stegailov
Implementation and Evaluation of the PO-HEFT Problem-Oriented Workflow Scheduling Algorithm for Cloud Environments
Abstract
Modern computational experiments imply that the resources of the cloud computing environment are often used to solve a large number of tasks, which differ only in the values of a relatively small set of simulation parameters. Such sets of tasks may occur in multivariate calculations aimed at finding the simulation parameter values that optimize certain characteristics of the computational model. Applications of this type account for a large percentage of the load on modern HPC systems, which implies a need for methods and algorithms for efficient allocation of resources in order to optimize systems for solving such problems. The aim of this work is to implement the PO-HEFT problem-oriented scientific workflow scheduling algorithm and to compare it with other workflow scheduling algorithms.
Gleb Radchenko, Ivan Lyzhin, Ekaterina Nepovinnyh
Layer-by-Layer Partitioning of Finite Element Meshes for Multicore Architectures
Abstract
In this paper, we present new partitioning algorithms for unstructured meshes that prevent conflicts during parallel assembly of FEM matrices and vectors in shared memory. The algorithms use a criterion that determines if any two mesh cells are neighboring. This neighborhood criterion is used to partition the mesh into layers, which are then combined into blocks and assigned to different parallel processes/threads. The proposed partitioning algorithms are compared with the existing algorithms on quasi-structured and unstructured meshes by the number of potential conflicts and by the load imbalance.
Alexander Novikov, Natalya Piminova, Sergey Kopysov, Yulia Sagdeeva
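To make the idea of layering by a neighborhood criterion concrete, here is a minimal sketch that assumes two cells are neighbors when they share at least one node and builds layers by breadth-first expansion from a seed cell. This is an illustrative stand-in, not the authors' algorithm: the function name, the choice of seed, and the shared-node criterion are assumptions.

```python
from collections import defaultdict


def partition_into_layers(cells, seed=0):
    """Split mesh cells into layers by breadth-first expansion.

    cells -- list of tuples of node ids, one tuple per cell
    seed  -- index of the cell that forms layer 0 (hypothetical choice)

    Two cells count as neighbors if they share a node (an assumption).
    Cells not reachable from the seed are left out of the result.
    """
    # Map each node to the cells that contain it.
    node_to_cells = defaultdict(list)
    for cell_id, nodes in enumerate(cells):
        for node in nodes:
            node_to_cells[node].append(cell_id)

    visited = {seed}
    layers = [[seed]]
    while True:
        next_layer = []
        for cell_id in layers[-1]:
            for node in cells[cell_id]:
                for neighbor in node_to_cells[node]:
                    if neighbor not in visited:
                        visited.add(neighbor)
                        next_layer.append(neighbor)
        if not next_layer:
            break
        layers.append(sorted(next_layer))
    return layers


if __name__ == "__main__":
    # A strip of four 1D "cells" (line segments) sharing end nodes.
    strip = [(0, 1), (1, 2), (2, 3), (3, 4)]
    print(partition_into_layers(strip))
```

In a breadth-first layering, neighboring cells differ by at most one layer index, so cells in layers i and i + 2 share no nodes and can be assembled concurrently without write conflicts under the assumed criterion.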
Multilevel Parallelization: Grid Methods for Solving Direct and Inverse Problems
Abstract
In this paper we present grid methods which we have developed for solving direct and inverse problems, and their realization with different levels of optimization. We have focused on solving systems of hyperbolic equations using finite difference and finite volume numerical methods on multicore architectures. Several levels of parallelism have been applied: geometric decomposition of the computational domain, workload distribution over threads with OpenMP directives, and vectorization. The run-time efficiency of these methods has been investigated. These developments have been tested using the astrophysics code AstroPhi on the hybrid cluster Polytechnic RSC PetaStream (consisting of Intel Xeon Phi accelerators) and a geophysics (seismic wave) code on an Intel Core i7-3930K multicore processor. We present the results of the calculations and study MPI run-time energy efficiency.
Sofya S. Titarenko, Igor M. Kulikov, Igor G. Chernykh, Maxim A. Shishlenin, Olga I. Krivorot’ko, Dmitry A. Voronov, Mark Hildyard
Numerical Model of Shallow Water: The Use of NVIDIA CUDA Graphics Processors
Abstract
In the paper we discuss the main features of a software package for numerical simulations of surface water dynamics. We consider an approximation of the shallow water equations together with the parallel technologies for NVIDIA CUDA graphics processors. The numerical hydrodynamic code is based on the combined Lagrangian-Euler method (CSPH-TVD). We focus on the features of the parallel implementation for the Tesla line of graphics processors: C2070, K20, K40, K80. By using hierarchical grid systems at different spatial scales we increase the efficiency of computing resource usage and speed up our simulations of various flooding problems.
Tatyana Dyakonova, Alexander Khoperskov, Sergey Khrapov
Parallel Algorithm for Simulation of Fragmentation and Formation of Filamentous Structures in Molecular Clouds
Abstract
The report is devoted to numerical simulation of the interaction between the post-shock wave front of supernova blast remnants and the gas of two molecular clouds (MC). The dynamical formation of MC structures associated with Kelvin-Helmholtz and Richtmyer-Meshkov instabilities occurring in the zone of interaction between the cloud and the interstellar medium is simulated. The MC gas flow evolution is derived from the time-dependent equations of mass, momentum, and energy conservation. High resolution computational meshes (more than two billion nodes) were used in parallel computing on multiprocessor hybrid computers. In the model, two initially spatially separated clouds with different gas density distribution fields interact with the post-shock medium. The peculiarities of clump and shell fragmentation of clouds and the formation of filamentous rudiment structures are considered.
Boris Rybakin, Nikolai Smirnov, Valery Goryachev
Parallel Algorithms for a 3D Photochemical Model of Pollutant Transport in the Atmosphere
Abstract
In this paper, a numerical scheme for solving a system of convection-diffusion-kinetics equations of a mathematical model of transport of small pollutant components with their chemical interactions in the atmospheric boundary layer is presented. A new monotonized high-accuracy spline scheme is proposed to approximate the convective terms. Various approaches to parallelization of the computational algorithm are developed and tested. These are based on a two-dimensional decomposition of the calculation domain with synchronous or asynchronous interprocessor data communications for distributed memory computer systems.
Alexander Starchenko, Evgeniy Danilkin, Anastasiya Semenova, Andrey Bart
Parallel Computation of Normalized Legendre Polynomials Using Graphics Processors
Abstract
To carry out some calculations in physics and Earth sciences, for example, to determine spherical harmonics in geodesy or angular momentum in quantum mechanics, it is necessary to compute normalized Legendre polynomials. We consider the solution to this problem on modern graphics processing units, whose massively parallel architectures allow calculations to be performed for many arguments, orders and degrees of polynomials simultaneously. For higher degrees of a polynomial, computations are characterized by a considerable spread in numerical values and lead to overflow and/or underflow problems. In order to avoid such problems, support for extended-range arithmetic has been implemented.
Konstantin Isupov, Vladimir Knyazkov, Alexander Kuvaev, Mikhail Popov
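The abstract above mentions extended-range arithmetic as a remedy for overflow and underflow. One common way to realize the idea, sketched here under the assumption that a value is stored as a double-precision mantissa plus a separate unbounded integer exponent, is shown below. The class name XRange and its methods are hypothetical and do not describe the authors' GPU implementation.

```python
import math


class XRange:
    """Number stored as mantissa * 2**exponent.

    The mantissa is an ordinary float kept in [0.5, 1) in magnitude by
    math.frexp; the exponent is a Python int, so its range is unbounded.
    """

    def __init__(self, value, exponent=0):
        # frexp splits value into (m, e) with value == m * 2**e.
        mantissa, shift = math.frexp(value)
        self.mantissa = mantissa
        self.exponent = shift + exponent

    def __mul__(self, other):
        # Mantissas multiply without underflow; exponents simply add.
        return XRange(self.mantissa * other.mantissa,
                      self.exponent + other.exponent)

    def log10(self):
        """Base-10 logarithm of the magnitude (undefined for zero)."""
        return math.log10(abs(self.mantissa)) + self.exponent * math.log10(2.0)


if __name__ == "__main__":
    tiny = 1e-200
    print(tiny * tiny)                          # plain doubles underflow to zero
    print((XRange(tiny) * XRange(tiny)).log10())  # magnitude is still recoverable
```

Only multiplication is shown; a usable implementation would also need addition with exponent alignment and conversion back to a native float when the result is in range.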
Parallel Software for Simulation of Nonlinear Processes in Technical Microsystems
Abstract
The modern stage of industrial evolution is characterized by the introduction of nanotechnologies in production. Therefore, scientific research into various technological processes and facilities at different levels of detail, down to the atomic level, has become topical. Multiscale modeling of microsystems, which combines the methods of continuum mechanics and molecular dynamics, has become one of the possible approaches. This report presents elements of supercomputer technology and software which enable us to solve some problems of nanotechnologies within the chosen approach.
Sergey Polyakov, Viktoriia Podryga, Dmitry Puzyrkov, Tatiana Kudryashova
Performance of MD-Algorithms on Hybrid Systems-on-Chip Nvidia Tegra K1 & X1
Abstract
In this paper we consider the efficiency of hybrid systems-on-a-chip for high-performance calculations. Firstly, we build Roofline performance models for the systems considered using the Empirical Roofline Toolkit and compare the results with the theoretical estimates. Secondly, we use LAMMPS as an example of a molecular dynamics package to demonstrate its performance and efficiency in various configurations running on Nvidia Tegra K1 & X1. Following the Roofline approach, we attempt to distinguish compute-bound and memory-bound conditions for the MD algorithm using the Lennard-Jones liquid model. The results are discussed in the context of the LAMMPS performance on Intel Xeon CPUs and the Nvidia Tesla K80 GPU.
Vsevolod Nikolskii, Vyacheslav Vecher, Vladimir Stegailov
Revised Pursuit Algorithm for Solving Non-stationary Linear Programming Problems on Modern Computing Clusters with Manycore Accelerators
Abstract
This paper is devoted to a new edition of the parallel Pursuit algorithm proposed by the authors in previous works. The Pursuit algorithm uses Fejer’s mappings to build a pseudo-projection onto a polyhedron. The algorithm tracks changes in input data and corrects the calculation process. The previous edition of the algorithm assumed a cube-shaped pursuit region with K cells along each dimension. The total number of cells is \(K^n\), where n is the problem dimension. This resulted in high computational complexity of the algorithm. The new edition uses a cross-shaped pursuit region with one cross-bar per dimension. Such a region consists of only \(n(K-1)+1\) cells. The new algorithm is intended for cluster computing systems with Xeon Phi processors.
Irina Sokolinskaya, Leonid Sokolinsky
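The two cell counts quoted in the abstract above can be compared directly. A cross with one bar of K cells per dimension, all bars sharing the central cell, has nK - (n - 1) = n(K - 1) + 1 cells, against K^n for the full cube. The short sketch below tabulates both; the function names and the sample values of K and n are illustrative only.

```python
def cube_cells(k, n):
    """Cells in a cube-shaped region with k cells along each of n dimensions."""
    return k ** n


def cross_cells(k, n):
    """Cells in a cross-shaped region: n bars of k cells sharing one center."""
    return n * (k - 1) + 1


if __name__ == "__main__":
    # Sample sizes chosen for illustration; not taken from the paper.
    for k, n in ((3, 2), (5, 3), (7, 10)):
        print(f"K={k}, n={n}: cube={cube_cells(k, n)}, cross={cross_cells(k, n)}")
```

For K = 5 and n = 3 the cube has 125 cells and the cross has 13, and the gap widens exponentially with n because the cube grows as K^n while the cross grows linearly in n.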
Solving Multidimensional Global Optimization Problems Using Graphics Accelerators
Abstract
In the present paper an approach to solving global optimization problems using a nested optimization scheme is developed. The use of different algorithms at different nesting levels is the novel element. A complex serial algorithm (on CPU) is used at the upper level, and a simple parallel algorithm (on GPU) is used at the lower level. This computational scheme has been implemented in the ExaMin parallel solver. The results of computational experiments demonstrating the speedup when solving a series of test problems are presented.
Konstantin Barkalov, Ilya Lebedev
Supercomputer Simulation of Physicochemical Processes in Solid Fuel Ramjet Design Components for Hypersonic Flying Vehicle
Abstract
A step-by-step computer simulation approach to building a mathematical model of a scramjet is proposed. The report considers the development of 3D mathematical models of scramjet components that are further reduced to 1D models. Mathematical models of physicochemical processes in the combustor cooling system are discussed with the aim of subsequent engine performance optimization depending on the fuels used. The separate 1D component models are then used to make up a full-scale scramjet model. The one-dimensional models allow calculation times to be reduced significantly, while the simulation accuracy depends on the precision of the reduction from 3D models to 1D models.
Vadim Volokhov, Pavel Toktaliev, Sergei Martynenko, Leonid Yanovskiy, Aleksandr Volokhov, Dmitry Varlamov

The Future of Supercomputing: New Technologies

Frontmatter
Addition for Supercomputer Functionality
Abstract
The addition of an optical wireless switching network with advanced functionalities to a supercomputer system is proposed. The link structure of the nodes (computer devices) is a complete graph in which only the necessary links are realized. The switching units are located only in the sources and receivers. The structure of the network links can be changed quickly during execution of a single program instruction. Calculations may be executed on the data in a message without requiring additional time.
Gennady Stetsyura
Analysis of Processes Communication Structure for Better Mapping of Parallel Applications
Abstract
We consider a new approach to the classical problem of allocating parallel application processes to nodes of a high-performance computing system. A new algorithm which analyzes the communication structure of processes is presented. The obtained communication structure can be used to recommend mapping for a high-performance computing system with a given topology. The input for the proposed algorithm is the data representing the total length of messages sent between every two processes. A set of processes is analyzed as a system of particles which evolve under the influence of attractive and repulsive forces. The identified configuration of the particles reflects the communication structure of the underlying parallel application and can be used for effective mapping heuristics.
Ksenia Bobrik, Nina Popova
Experimental Comparison of Performance and Fault Tolerance of Software Packages Pyramid, X-COM, and BOINC
Abstract
The paper is devoted to the experimental comparison of performance and fault tolerance of software packages Pyramid, X-COM and BOINC. This paper contains the technique of carrying out the experiments and the results of these experiments. The performance comparison was carried out by assessing the overhead costs to arrange parallelization by data. In this case special tests simulating typical tasks of parallelization by data were designed by the authors. The comparison of fault tolerance was performed by simulating various emergency situations that arise during computations.
Anton Baranov, Evgeny Kiselev, Denis Chernyaev
Internet-Oriented Educational Course “Introduction to Parallel Computing”: A Simple Way to Start
Abstract
The educational course “Introduction to Parallel Computing” is discussed. A modern method of presenting educational materials for simultaneously teaching a large number of attendees (Massive Open Online Course, MOOC) has been applied. The course is delivered in the simplest form with wide use of presentational materials. Lectures of the course are subdivided into relatively small topics, which do not require significant effort to learn. This provides continuous success in learning and increases the motivation of the students. To evaluate progress in understanding the educational content being studied, the course contains test questionnaires and tasks in which the students develop parallel programs themselves. Automated validation and scalability evaluation of the programs are provided. These features can attract a large number of attendees and draw students’ attention to professional activity in the field of supercomputer technologies.
Victor Gergel, Valentina Kustikova
Parallel Computational Models to Estimate an Actual Speedup of Analyzed Algorithm
Abstract
The paper presents two models of parallel program runs on platforms with shared and distributed memory. By means of these models, we can estimate the speedup when running on a particular computer system. To estimate the speedup of an OpenMP program, the first model applies Amdahl’s law. The second model uses properties of the analyzed algorithm, such as its arithmetic and communication complexities. To estimate the speedup, the arithmetic performance and data transfer rate of the computer are used. For some algorithms, such as the preconditioned conjugate gradient method, speedup estimates were obtained, and numerical experiments were performed to compare the actual and theoretically predicted speedups.
Igor Konshin
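The first model in the abstract above relies on Amdahl's law, which bounds the speedup on p threads by S(p) = 1 / ((1 - f) + f / p), where f is the parallelizable fraction of the run time. A minimal sketch follows; the function name and the sample fraction are assumptions for illustration, and the paper's own model may include further terms.

```python
def amdahl_speedup(parallel_fraction, threads):
    """Speedup predicted by Amdahl's law.

    parallel_fraction -- share of the serial run time that parallelizes (0..1)
    threads           -- number of threads or processors (>= 1)
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / threads)


if __name__ == "__main__":
    # Hypothetical program in which 90% of the work parallelizes.
    for p in (1, 2, 4, 8, 16, 1024):
        print(p, round(amdahl_speedup(0.9, p), 3))
```

As the thread count grows, the estimate approaches 1 / (1 - f), so a program with f = 0.9 cannot exceed a tenfold speedup regardless of the hardware.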
Techniques for Solving Large-Scale Graph Problems on Heterogeneous Platforms
Abstract
The paper introduces techniques for solving various large-scale graph problems on hybrid architectures. The proposed approach is illustrated on the computation of minimum spanning tree and shortest paths. We provide a precise mathematical description accompanied by the information structure of required algorithms. Efficient parallel implementations of several graph algorithms are proposed based on this analysis. Hybrid computations allow using all the available resources on both multi-core CPUs and GPUs. Our implementation uses out-of-core memory algorithms to handle graphs that don’t fit in the main memory. Experimental results confirm high performance and scalability of the proposed solutions. Moreover, the proposed approach can be applied to other graph processing problems, which have recently rapidly increased in demand.
Ilya Afanasyev, Alexander Daryin, Jack Dongarra, Dmitry Nikitenko, Alexey Teplov, Vladimir Voevodin
The Elbrus Platform Feasibility Assessment for High-Performance Computations
Abstract
This paper examines the prospects of the Elbrus computing platform for high-performance computations. The results of the most representative HPC benchmarks (HPCC, NPB, HPCG) and their analysis are presented. The testbed node was equipped with four MCST Elbrus-4C processors and DDR3 DRAM with a total capacity of 48 Gb. Different factors affecting the performance of the FT and MG tests from the NPB benchmark suite were analyzed using the Paraver tool, hardware performance counters (HPC) and MPI communications data. The scalability of a geological application implementing the double-square-root (DSR) prestack migration method was investigated. Benchmark results show that code customization with platform-specific optimizations is required for the best performance. Nevertheless, the scalability analysis demonstrates that most tests are linearly scalable within a certain range of processor numbers.
Ekaterina Tyutlyaeva, Sergey Konyukhov, Igor Odintsov, Alexander Moskovsky
Using Machine Learning Methods to Detect Applications with Abnormal Efficiency
Abstract
At the moment a lot of supercomputing applications are inefficient in terms of their usage of available resources. To decrease the number of such inefficient applications, a tool for supercomputer task flow analysis and detection of inefficient application runs is needed. In this paper several supervised machine learning methods are considered to solve this issue. The classification performed by these methods is based on system monitoring data (e.g. CPU load, network usage, etc.). The experiments on real data show that the Random Forest algorithm is currently the best option to accomplish the given goal. At the moment the resulting classifier model is being tested on the “Lomonosov” supercomputer. The experiment results demonstrating the efficiency of the resulting model are also included in this paper.
Denis Shaykhislamov
Using Simulation for Performance Analysis and Visualization of Parallel Branch-and-Bound Methods
Abstract
Branch-and-Bound (B&B) is a fundamental algorithmic scheme for a large variety of global optimization methods. For many problems B&B requires an amount of computing resources far beyond the power of a single-CPU workstation, thus making parallelization almost inevitable. The approach proposed in this paper allows one to evaluate load balancing algorithms for parallel B&B with various numbers of processors, sizes of the search tree, and characteristics of the supercomputer’s interconnect. The proposed approach was implemented as a special tool that simulates the process of solving the optimization problem by the B&B method as a stochastic tree branching process. Data exchanges are modeled using the concept of logical time. The user-friendly graphical interface can render both real traces and ones produced by the simulator. It provides efficient visualization of the CPU load, data exchanges and progress of the optimization process.
Yury Evtushenko, Yana Golubeva, Yury Orlov, Mikhail Posypkin
Backmatter
Metadata
Title
Supercomputing
Editors
Vladimir Voevodin
Sergey Sobolev
Copyright Year
2016
Electronic ISBN
978-3-319-55669-7
Print ISBN
978-3-319-55668-0
DOI
https://doi.org/10.1007/978-3-319-55669-7