
Algorithms and Architectures for Parallel Processing

ICA3PP 2016 Collocated Workshops: SCDT, TAPEMS, BigTrust, UCER, DLMCS, Granada, Spain, December 14-16, 2016, Proceedings

  • 2016
  • Book

About this Book

This book constitutes the refereed workshop proceedings of the 16th International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP 2016, held in Granada, Spain, in December 2016. The 30 full papers presented were carefully reviewed and selected from 58 submissions. They cover many dimensions of parallel algorithms and architectures, encompassing fundamental theoretical approaches, practical experimental projects, and commercial components and systems trying to push beyond the limits of existing technologies, including experimental efforts, innovative systems, and investigations that identify weaknesses in existing parallel processing technology.

Table of Contents

Frontmatter

TAPEMS 2016: International Workshop on Theoretical Approaches to Performance Evaluation, Modeling, and Simulation

Frontmatter
OTFX: An In-memory Event Tracing Extension to the Open Trace Format 2
Abstract
In event-based performance analysis, the amount of collected data is one of the most pressing challenges. It can massively slow down application execution, overwhelm the underlying file system, and introduce significant measurement bias due to intermediate memory buffer flushes. To address these issues, we propose an in-memory event tracing approach that dynamically adapts the volume of application events to an amount guaranteed to fit into a single memory buffer, thereby avoiding file interaction entirely. These concepts include runtime filtering, enhanced encoding techniques, and novel strategies for runtime event reduction. They further include the hierarchical memory buffer, a multi-dimensional, hierarchical data structure that realizes these concepts with minimal overhead. We demonstrate the capabilities of our concepts with a prototype implementation called OTFX, based on the Open Trace Format 2, a state-of-the-art open-source tracing library used by the performance analyzers Vampir, Scalasca, and Tau.
Michael Wagner, Andreas Knüpfer, Wolfgang E. Nagel
Tuning the Blocksize for Dense Linear Algebra Factorization Routines with the Roofline Model
Abstract
The optimization of dense linear algebra operations is a fundamental task in the solution of many scientific computing applications. The Roofline Model (RM) is a tool that provides an estimate of the performance a computational kernel can attain on a hardware platform; the RM can therefore be used to investigate whether a computational kernel can be further accelerated. We present an approach, based on the RM, to optimize the algorithmic parameters of dense linear algebra kernels. In particular, we perform a basic analysis to identify the optimal values for the kernel parameters. As a proof of concept, we apply this technique to optimize a blocked algorithm for matrix inversion via Gauss-Jordan elimination. In addition, we extend the technique to multi-block computational kernels. An experimental evaluation validates the method and shows its convenience. We remark that the results obtained can be extended to other computational kernels similar to Gauss-Jordan elimination, e.g., matrix factorizations and the solution of linear least squares problems.
Peter Benner, Pablo Ezzatti, Enrique S. Quintana-Ortí, Alfredo Remón, Juan P. Silva
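The parameter selection described in the abstract can be illustrated with a small sketch. This is not the authors' method in detail: the peak rate, bandwidth, and the blocksize-to-intensity model below are illustrative assumptions.

```python
# Roofline-guided blocksize selection: a minimal sketch, not the paper's code.
# Platform numbers and the arithmetic-intensity model are illustrative.

PEAK_GFLOPS = 500.0      # hypothetical peak floating-point rate
BANDWIDTH_GBS = 50.0     # hypothetical sustained memory bandwidth

def attainable_gflops(intensity):
    """Roofline model: performance is capped by compute or by memory traffic."""
    return min(PEAK_GFLOPS, BANDWIDTH_GBS * intensity)

def intensity_for_block(b):
    """Toy model: a blocked BLAS-3 kernel performs O(b) flops per byte moved;
    here 8 bytes (one double) are amortized over b flops."""
    return b / 8.0

def best_blocksize(candidates):
    """Smallest blocksize whose roofline estimate already hits the compute peak
    (larger blocks would only increase cache pressure for no roofline gain)."""
    for b in sorted(candidates):
        if attainable_gflops(intensity_for_block(b)) >= PEAK_GFLOPS:
            return b
    return max(candidates)

print(best_blocksize([32, 64, 128, 256, 512]))
```

With these assumed numbers, blocks of 128 already place the kernel in the compute-bound region of the roofline.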
Network-Aware Optimization of MPDATA on Homogeneous Multi-core Clusters with Heterogeneous Network
Abstract
The communication layer of modern HPC platforms is becoming increasingly heterogeneous and hierarchical. As a result, even on platforms with homogeneous processors, the communication cost of many parallel applications will vary significantly depending on the mapping of their processes to the processors of the platform. The optimal mapping, minimizing the communication cost of the application, strongly depends on the network structure and performance as well as on the logical communication flow of the application. In our previous work, we proposed a general approach and two approximate heuristic algorithms aimed at minimizing the communication cost of data-parallel applications with a two-dimensional symmetric communication pattern on heterogeneous hierarchical networks, and tested these algorithms in the context of a parallel matrix multiplication application. In this paper, we develop a new algorithm built on top of one of these heuristic approaches in the context of a real-life application, MPDATA, which is one of the major parts of the EULAG geophysical model. We carefully study the communication flow of MPDATA and discover that even under the assumption of a perfectly homogeneous communication network, the logical communication links of this application have different bandwidths, which makes the optimization of its communication cost particularly challenging. We propose a new algorithm based on the cost functions of one of our general heuristic algorithms and apply it to the optimization of the communication cost of MPDATA, which has an asymmetric heterogeneous communication pattern. We also present experimental results demonstrating the performance gains due to this optimization.
Tania Malik, Lukasz Szustak, Roman Wyrzykowski, Alexey Lastovetsky
Formalizing Data Locality in Task Parallel Applications
Abstract
Task-based programming provides programmers with an intuitive abstraction to express parallelism, and runtimes with the flexibility to adapt the schedule and load-balancing to the hardware. Although many profiling tools have been developed to understand these characteristics, the interplay between task scheduling and data reuse in the cache hierarchy has not been explored. These interactions are particularly intriguing due to the flexibility task-based runtimes have in scheduling tasks, which may allow them to improve cache behavior.
This work presents StatTask, a novel statistical cache model that can predict cache behavior for arbitrary task schedules and cache sizes from a single execution, without programmer annotations. StatTask enables fast and accurate modeling of data locality in task-based applications for the first time. We demonstrate the potential of this new analysis for scheduling by examining applications from the BOTS benchmark suite and identifying several important opportunities for reuse-aware scheduling.
Germán Ceballos, Erik Hagersten, David Black-Schaffer
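A sketch of the kind of locality statistic such statistical cache models build on: reuse (stack) distances, from which LRU miss ratios for arbitrary cache sizes follow. This illustrates the general technique only; it is not the StatTask implementation.

```python
# Reuse-distance analysis: the number of distinct addresses touched between
# consecutive accesses to the same address. A fully associative LRU cache of
# C lines misses exactly when that distance is >= C (or the access is cold).

def reuse_distances(trace):
    """Per-access reuse distance; inf marks first-time (cold) accesses."""
    last_seen = {}
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            distances.append(len(set(trace[last_seen[addr] + 1:i])))
        else:
            distances.append(float("inf"))
        last_seen[addr] = i
    return distances

def lru_miss_ratio(trace, cache_lines):
    """Miss ratio of a fully associative LRU cache with cache_lines entries."""
    ds = reuse_distances(trace)
    return sum(1 for d in ds if d >= cache_lines) / len(ds)

trace = ["a", "b", "c", "a", "b", "c", "d", "a"]
print(reuse_distances(trace))
print(lru_miss_ratio(trace, cache_lines=4))
```

Because the distance histogram is schedule-dependent, a model like StatTask must recompute or re-sample it per candidate task schedule; this sketch only handles one fixed trace.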
Improving the Energy Efficiency of Evolutionary Multi-objective Algorithms
Abstract
Problems in which many objective functions have to be optimized simultaneously can easily be found in many fields of science and industry. Solving such problems in a reasonable amount of time while taking energy efficiency into account is still a relevant task. Most evolutionary multi-objective optimization algorithms based on parallel computing focus only on performance. In this paper, we propose a parallel implementation of the most time-consuming parts of evolutionary multi-objective algorithms, with particular attention to energy consumption. Specifically, we focus on the most computationally expensive part of the state-of-the-art evolutionary NSGA-II algorithm: the Non-Dominated Sorting (NDS) procedure. GPU platforms have been considered due to their high acceleration capacity and energy efficiency. A new version of the NDS procedure is proposed (referred to as EFNDS), with a made-to-measure data structure for storing the dominance information designed to take advantage of the GPU architecture. NSGA-II based on EFNDS is comparatively evaluated against another state-of-the-art GPU version and against a widely used sequential version. In the evaluation we adopt a benchmark that is scalable in the number of objectives as well as decision variables (the DTLZ test suite), using a large number of individuals (from 500 up to 30000). The results clearly indicate that our proposal achieves the best performance and energy efficiency for solving large-scale multi-objective optimization problems on GPUs.
J. J. Moreno, G. Ortega, E. Filatovas, J. A. Martínez, E. M. Garzón
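For reference, the baseline sequential NDS procedure that EFNDS accelerates can be sketched as below, following the standard fast non-dominated sort of NSGA-II; this is not the GPU EFNDS code or its data structure.

```python
# Fast non-dominated sorting (minimization): partition a population into
# fronts F0, F1, ... where F0 contains all non-dominated solutions.

def dominates(p, q):
    """p dominates q if it is no worse in every objective, better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def non_dominated_sort(points):
    n = len(points)
    dominated_by = [[] for _ in range(n)]   # indices each point dominates
    dom_count = [0] * n                     # how many points dominate each point
    fronts = [[]]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if dominates(points[i], points[j]):
                dominated_by[i].append(j)
            elif dominates(points[j], points[i]):
                dom_count[i] += 1
        if dom_count[i] == 0:
            fronts[0].append(i)
    k = 0
    while fronts[k]:
        nxt = []
        for i in fronts[k]:
            for j in dominated_by[i]:
                dom_count[j] -= 1
                if dom_count[j] == 0:       # all its dominators already ranked
                    nxt.append(j)
        k += 1
        fronts.append(nxt)
    return fronts[:-1]

pts = [(1, 5), (2, 2), (5, 1), (3, 4), (4, 5)]
print(non_dominated_sort(pts))
```

The all-pairs dominance comparison in the double loop is the O(MN^2) hotspot that makes NDS the natural target for GPU acceleration.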
A Parallel Model for Heterogeneous Cluster
Abstract
The LogP model was introduced to measure the effects of latency, occupancy, and bandwidth on distributed-memory multiprocessors. The idea was to characterize a distributed-memory multiprocessor by these key parameters and study their impact on performance in simulation environments. This work proposes a new model, based on LogP, that describes the performance of applications executing on a heterogeneous cluster. The model may be used, in the near future, to help choose the best way to split a parallel application to be executed on such an architecture. The model considers a heterogeneous cluster composed of distinct types of processors, accelerators, and networks.
Thiago Marques Soares, Rodrigo Weber dos Santos, Marcelo Lobosco
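The flavor of LogP-style cost estimation that such a model extends can be sketched as follows; the parameter values below are illustrative assumptions, not measurements from the paper.

```python
# Minimal LogP cost calculator for small messages. L = latency, o = per-message
# processor overhead, g = gap (minimum interval between message injections).
# A heterogeneous extension would carry one (L, o, g) set per processor/network
# pair; here a single homogeneous set keeps the sketch short.

L = 6.0   # network latency (us), illustrative
o = 2.0   # send/receive overhead (us), illustrative
g = 4.0   # gap between consecutive messages (us), illustrative

def p2p_time():
    """One small message: send overhead + wire latency + receive overhead."""
    return o + L + o

def pipelined_send_time(n_messages):
    """n back-to-back small messages to one receiver: the sender is throttled
    by max(g, o) per message, and the last message still pays the full o+L+o."""
    return (n_messages - 1) * max(g, o) + p2p_time()

print(p2p_time())
print(pipelined_send_time(10))
```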
Comparative Analysis of OpenACC Compilers
Abstract
OpenACC has been in development for a few years now. The OpenACC 2.5 specification was recently made public, and several initiatives are developing full implementations of the standard to make use of accelerator capabilities. Much remains to be done, but OpenACC for GPUs is currently reaching a good maturity level in various implementations of the standard, using CUDA and OpenCL as backends. Nvidia is investing in this project and has released an OpenACC Toolkit, including the PGI Compiler. There are, however, more developments out there. In this work, we analyze the different available OpenACC compilers developed by companies and universities in recent years. We check their performance and maturity, keeping in mind that OpenACC is designed to be used without extensive knowledge of parallel programming. Our results show that the compilers are on their way to a reasonable maturity, presenting different strengths and weaknesses.
Daniel Barba, Arturo Gonzalez-Escribano, Diego R. Llanos

BigTrust 2016: The 1st International Workshop on Trust, Security and Privacy for Big Data

Frontmatter
The Research of Recommendation System Based on User-Trust Mechanism and Matrix Decomposition
Abstract
A recommendation system is a tool that helps users quickly and effectively obtain useful resources in the face of large amounts of information. Collaborative filtering is a widely used recommendation technique that recommends resources to users based on the scores of similar neighbors, but it faces the problems of data sparseness and "cold start". Although recommendation systems based on trust models can solve these problems to some extent, their coverage still needs further improvement. To address these problems, the paper proposes a matrix decomposition algorithm combined with a user-trust mechanism (hereinafter referred to as UTMF). The algorithm uses matrix decomposition to fill the score matrix and incorporates users' trust rating information in the filling process. According to experiments on the Epinions data set, the UTMF algorithm can improve recommendation precision and effectively ease the cold-start problem.
PanPan Zhang, Bin Jiang
Traffic Sign Recognition Based on Parameter-Free Detector and Multi-modal Representation
Abstract
Traffic signs are difficult to detect in real traffic environments, so a traffic sign detection and recognition method is proposed in this paper. First, the color characteristics of the traffic sign are segmented, the region of interest is expanded, and edges are extracted. The edges are then roughly divided by line drawing and removal of spurious points. The turning-angle curvature is computed, and vertices are classified according to the relations between their curvatures. Standard shapes such as circles, triangles, and rectangles are detected by a parameter-free detector. To improve recognition accuracy, two different methods are used to classify the detected candidate regions of the traffic sign. The first method combines the dual-tree complex wavelet transform (DT-CWT) with 2D independent component analysis (2DICA) to represent candidate regions on the grayscale image and reduce feature dimension; a nearest-neighbor classifier is then employed to classify traffic sign images and reject noise regions. The second method is template matching based on the inner pictograms of traffic signs. The recognition results of the two methods are fused by decision rules. Experimental results show that the detection and recognition rates of the proposed algorithm remain high under conditions such as occluded traffic signs, uneven illumination, and color distortion, and that it achieves real-time processing.
Gu Mingqin, Chen Xiaohua, Zhang Shaoyong, Ren Xiaoping
Reversible Data Hiding Using Non-local Means Prediction
Abstract
In this paper, we propose a prediction-error expansion based reversible data hiding scheme that incorporates non-local means (NLM) prediction. The traditional local predictors reported in the literature rely on local correlation and perform poorly when predicting textural pixels. To remedy this, we propose to use NLM to achieve better prediction in texture regions by globally exploiting the self-similarity contained in the image itself. More specifically, textural pixels, distinguished by their local complexity, are predicted by NLM, while smooth pixels with high local correlation are predicted by a local predictor. The incorporation of NLM enables the proposed method to achieve accurate predictions in both smooth and texture regions. Optimal parameters in the method are obtained by minimizing the prediction-error entropy. Experimental results show that the proposed method yields an improvement over some state-of-the-art methods.
Yingying Fang, Bo Ou
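The prediction-error expansion mechanism underlying such schemes can be sketched on a single pixel. The paper's contribution is the NLM predictor; in this sketch a plain neighbor average stands in for any predictor.

```python
# Prediction-error expansion (PEE) for one pixel: the error e = x - pred is
# expanded to 2e + b to carry one bit b, and the process is exactly invertible
# as long as the decoder can recompute the same prediction.

def predict(neighbors):
    """Placeholder predictor: integer average of causal neighbors.
    (The paper would use NLM for textural pixels instead.)"""
    return round(sum(neighbors) / len(neighbors))

def embed(pixel, neighbors, bit):
    """Expand the prediction error to carry one bit reversibly."""
    e = pixel - predict(neighbors)
    return predict(neighbors) + 2 * e + bit

def extract(marked, neighbors):
    """Recover the hidden bit and restore the original pixel."""
    e2 = marked - predict(neighbors)
    bit = e2 & 1            # expanded error is 2e + b, so b is the parity
    e = (e2 - bit) // 2
    return predict(neighbors) + e, bit

marked = embed(130, [126, 128, 127], 1)
restored, bit = extract(marked, [126, 128, 127])
print(marked, restored, bit)
```

A real scheme additionally thresholds large errors, handles overflow/underflow with a location map, and, as in this paper, switches predictors per pixel based on local complexity.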
Secure Data Access in Hadoop Using Elliptic Curve Cryptography
Abstract
Big data analytics makes it possible to obtain valuable information from different data sources. It is important to maintain control of those data, because unauthorised copies could be used by other entities or companies interested in them. Hadoop is widely used for processing large volumes of information and is therefore ideal for developing big data applications. Its security model focuses on control within a cluster, preventing access by unauthorised users or encrypting data distributed among nodes. Sometimes, however, data theft is carried out by personnel who have access to the system and can therefore bypass most of the security features. In this paper, we present an extension to the Hadoop security model that controls information from the source, preventing data from being used by unauthorised users and improving corporate e-governance. We use an eToken with elliptic curve cryptography that ensures robust operation of the system and prevents it from being falsified, duplicated or manipulated.
Antonio F. Díaz, Ilia Blokhin, Julio Ortega, Raúl H. Palacios, Cristina Rodríguez-Quintana, Juan Díaz-García
Statistical Analysis of CCM.M-K1 International Comparison Based on Monte Carlo Method
Abstract
The Monte Carlo method is applied to the processing of the measurement results of CCM.M-K1. This method overcomes the limitations that apply in certain cases to the method described in the GUM. The CCM.M-K1 measurement results are introduced and analyzed, and the commercial software @RISK is used to pursue numerical simulation. The results are compared with the final report of CCM.M-K1, showing that the differences between the two are negligible.
Chang-qing Cai, Xiao-ping Ren, Guo-dong Hao, Jian Wang, Tao Huang
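The kind of Monte Carlo uncertainty evaluation applied here can be sketched as follows. The toy measurement model and the distributions are assumptions for illustration only; they are not the CCM.M-K1 data, and no commercial software is involved.

```python
# Monte Carlo propagation of measurement uncertainty (GUM Supplement 1 style):
# sample the input quantities from their assumed distributions, push each
# sample through the measurement model, and summarize the output distribution.

import random
import statistics

def measurand(m_ref, drift):
    """Toy model: reported mass = reference mass + drift correction."""
    return m_ref + drift

def monte_carlo(n=100_000, seed=42):
    rng = random.Random(seed)
    samples = [
        measurand(rng.gauss(1000.0, 0.005),    # reference: 1000 g, u = 5 mg
                  rng.uniform(-0.002, 0.002))  # drift: rectangular, +/- 2 mg
        for _ in range(n)
    ]
    return statistics.mean(samples), statistics.stdev(samples)

mean, u = monte_carlo()
print(round(mean, 3), round(u, 4))
```

Unlike the GUM's linear propagation, this approach needs no linearization of the model and delivers the full output distribution, from which coverage intervals can be read off directly.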

First International Workshop on Data Locality in Modern Computing Systems (DLMCS 2016)

Frontmatter
Redundancy Elimination in the ExaStencils Code Generator
Abstract
Optimizing the performance of compute-bound codes requires, among other techniques, the elimination of redundant computations. The well-known concept of common subexpression elimination can achieve this in part, and almost every production compiler conducts such an optimization. However, due to the conservative nature of these compilers, an external redundancy elimination can additionally increase performance. For stencil codes using finite volume discretizations, an extension to eliminate redundancies between loop iterations is also very promising. We integrated both a classic common subexpression elimination and an extended version in the ExaStencils code generator and present their impact on a real-world application.
Stefan Kronawitter, Sebastian Kuckuk, Christian Lengauer
A Dataflow IR for Memory Efficient RIPL Compilation to FPGAs
Abstract
Field programmable gate arrays (FPGAs) are fundamentally different from fixed processor architectures because their memory hierarchies can be tailored to the needs of an algorithm. FPGA compilers for high-level languages are therefore not hindered by fixed memory hierarchies; the constraint when compiling to FPGAs is the availability of resources.
In this paper we describe how the dataflow intermediate representation of our declarative FPGA image processing DSL RIPL (the Rathlin Image Processing Language) enables us to constrain memory use. We use five benchmarks to demonstrate that memory use with RIPL is comparable to the Vivado HLS OpenCV library without the need for language pragmas to guide hardware synthesis. The benchmarks also show that RIPL is more expressive than the Darkroom FPGA image processing language.
Robert Stewart, Greg Michaelson, Deepayan Bhowmik, Paulo Garcia, Andy Wallace

Ultrascale Computing for Early Researchers (UCER 2016)

Frontmatter
Exploring a Distributed Iterative Reconstructor Based on Split Bregman Using PETSc
Abstract
The proliferation in recent years of iterative algorithms for Computed Tomography is a result of the need to find new ways of obtaining high-quality images with low-dose acquisition methods. These iterative algorithms are, in many cases, computationally much more expensive than traditional analytic ones. Based on the resolution of large linear systems, they normally apply backprojection and projection operators iteratively, which reduces their performance compared with traditional algorithms. They also rely on large amounts of memory because they need to work with large coefficient matrices. As the resolution of the available detectors increases, the size of these matrices becomes unmanageable on standard workstations. In this work we propose a distributed implementation of an iterative reconstruction algorithm built with the help of the PETSc library. Our preliminary results show good scalability of the solution on one node (close to ideal) and the possibilities offered by a larger number of nodes. However, when the number of nodes increases, performance degrades due to the poor scalability of some fundamental parts of the algorithm as well as the increase in time spent in both MPI communication and reductions.
Estefania Serrano, Tom Vander Aa, Roel Wuyts, Javier Garcia Blas, Jesus Carretero, Monica Abella
Implementation of the Beamformer Algorithm for the NVIDIA Jetson
Abstract
Nowadays, the technology industry is intensively shifting its aim toward improving the Gflop/watt ratio of computation. Many processors implement the low-power design of the ARM architecture, e.g. the NVIDIA TK1, a chip that also includes a GPU embedded in the same die to improve performance at low energy consumption. These devices are very suitable target machines for applications that require mobility, e.g. those that manage and reproduce real acoustic environments. One of the algorithms most used in these reproduction environments is the beamformer algorithm. We have implemented the variant called Beamformer QR-LCMV, based on the QR decomposition, which is a very computationally demanding operation. We have explored different options, differing basically in the high-performance computing library used. We have also built our own version with the aim of approaching real-time processing on this type of low-power device.
Fran J. Alventosa, Pedro Alonso, Gema Piñero, Antonio M. Vidal
MARL-Ped+Hitmap: Towards Improving Agent-Based Simulations with Distributed Arrays
Abstract
Multi-agent systems allow the modelling of complex, heterogeneous, and distributed systems in a realistic way. MARL-Ped is a multi-agent system tool, based on the MPI standard, for simulating different scenarios of pedestrians who autonomously learn the best behavior by Reinforcement Learning. MARL-Ped uses one MPI process per agent by design, with a fixed fine granularity. This requirement limits the performance of the simulations whenever the number of processors is smaller than the number of agents. Hitmap, on the other hand, is a library that eases the programming of parallel applications based on distributed arrays. It includes abstractions for the automatic partition and mapping of arrays at runtime with arbitrary granularity, as well as functionalities to build flexible communication patterns that transparently adapt to the data partitions.
In this work, we present the methodology and techniques of granularity selection in Hitmap, applied to simulations of agent systems. As a first approximation, we use the MARL-Ped multi-agent pedestrian simulation software as a case study for intra-node scenarios. Hitmap allows agents to be transparently mapped to processes, reducing oversubscription and intra-node communication overheads. The evaluation results show significant advantages when using Hitmap, increasing flexibility, performance, and agent-number scalability for a fixed number of processing elements, and allowing better exploitation of isolated nodes.
Eduardo Rodriguez-Gutiez, Francisco Martinez-Gil, Juan Manuel Orduña, Arturo Gonzalez-Escribano
Efficiency of GPUs for Relational Database Engine Processing
Abstract
Relational database management systems (RDBMS) are still widely required by numerous business applications. Boosting performance without compromising functionality is a big challenge. To achieve this goal, we propose to boost an existing RDBMS by enabling it to use hardware architectures with high memory bandwidth, such as GPUs. In this paper we present a solution named CuDB. We compare the performance and energy efficiency of our approach across different GPU ranges, focusing on the technical specificities of GPUs that are most relevant for designing highly energy-efficient solutions for database processing.
Samuel Cremer, Michel Bagein, Saïd Mahmoudi, Pierre Manneback
Geocon: A Middleware for Location-Aware Ubiquitous Applications
Abstract
A core functionality of any location-aware ubiquitous system is storing, indexing, and retrieving information about entities that are commonly involved in these scenarios, such as users, places, events and other resources. The goal of this work is to design and provide the prototype of a service-oriented middleware, called Geocon, which can be used by mobile application developers to implement such functionality. In order to represent information about users, places, events and resources of mobile location-aware applications, Geocon defines a basic metadata model that can be extended to match most application requirements. The middleware includes a geocon-service for storing, searching and selecting metadata about users, resources, events and places of interest, and a geocon-client library that allows mobile applications to interact with the service through the invocation of local methods. The paper describes the metadata model and the components of the Geocon middleware. A prototype of Geocon is available at https://github.com/SCAlabUnical/Geocon.
Loris Belcastro, Giulio Di Lieto, Marco Lackovic, Fabrizio Marozzo, Paolo Trunfio
I/O-Focused Cost Model for the Exploitation of Public Cloud Resources in Data-Intensive Workflows
Abstract
Ultrascale computing systems will blur the line between HPC and cloud platforms, transparently offering the end-user every available computing resource, independently of its characteristics, location, and philosophy. However, this horizon is still far from complete. In this work, we propose a model for calculating the costs related to the deployment of data-intensive applications on IaaS cloud platforms. The model focuses especially on I/O-related costs in data-intensive applications and on the evaluation of alternative I/O solutions. This paper also evaluates the cost differences between a typical cloud storage service and our proposed in-memory I/O accelerator, Hercules, showing great flexibility potential in the price/performance trade-off. With Hercules, execution time reductions reach up to 25% in the best case, while costs remain similar to Amazon S3.
Francisco Rodrigo Duro, Javier Garcia Blas, Jesus Carretero
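A cost model of this general shape can be sketched as follows. The price constants and workload figures below are illustrative placeholders, not the paper's model and not Amazon's actual tariffs.

```python
# I/O-focused deployment cost sketch for an IaaS workflow run:
# compute hours + storage volume + per-request charges.
# All prices are hypothetical placeholders.

PRICE_VM_HOUR = 0.10            # $ per instance-hour
PRICE_STORAGE_GB_MONTH = 0.023  # $ per GB-month
PRICE_PUT_PER_1000 = 0.005      # $ per 1000 write requests
PRICE_GET_PER_1000 = 0.0004     # $ per 1000 read requests

def run_cost(vm_hours, stored_gb, months, puts, gets):
    compute = vm_hours * PRICE_VM_HOUR
    storage = stored_gb * months * PRICE_STORAGE_GB_MONTH
    requests = (puts / 1000 * PRICE_PUT_PER_1000
                + gets / 1000 * PRICE_GET_PER_1000)
    return compute + storage + requests

# Object store vs. an in-memory accelerator that trades a few extra VM hours
# (RAM-heavy I/O nodes) for far fewer storage requests, in the spirit of
# the Hercules comparison (numbers invented for illustration).
object_store = run_cost(vm_hours=100, stored_gb=500, months=1,
                        puts=2_000_000, gets=10_000_000)
in_memory = run_cost(vm_hours=110, stored_gb=50, months=1,
                     puts=20_000, gets=100_000)
print(round(object_store, 2), round(in_memory, 2))
```

Even in this toy setting, the request charges of I/O-intensive workloads dominate the bill, which is exactly the term an in-memory I/O layer attacks.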

SCDT-2016: Supercomputing Co-Design Technology Workshop

Frontmatter
Cellular ANTomata as Engines for Highly Parallel Pattern Processing
Abstract
One important approach to high-performance computing has a (relatively) simple physical computer architecture emulate virtual algorithmic architectures (VAAs) that are highly optimized for important application domains. We expose the Cellular ANTomaton (CAnt) computing model—cellular automata enhanced with mobile FSMs (Ants)—as a highly efficient VAA for a variety of pattern-processing problems that are inspired by biocomputing applications. We illustrate the CAnt model via a scalable design for an \(n \times n\) CAnt that solves the following bio-inspired problem in linear time.
  • The Pattern-Assembly Problem.
  • Inputs: a length-n master pattern \(\varPi \) and r test patterns \(\pi _0, \ldots , \pi _{r-1}\), of respective lengths \(m_0 \ge \cdots \ge m_{r-1}\).
  • The problem: Find every sequence \(\langle \pi _{j_0}, \ldots , \pi _{j_{s-1}} \rangle \) of \(\pi _k\)’s, possibly with repetitions, that “assemble” (i.e., concatenate) to produce \(\varPi \); i.e., \(\pi _{j_0} \cdots \pi _{j_{s-1}} = \varPi \).
  • Timing: \(m_1 + \cdots + m_r + O(n)\) steps, with a quite-small big-O constant.
Arnold L. Rosenberg
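As a point of reference, the Pattern-Assembly Problem can be solved sequentially by a small dynamic program that enumerates all assemblies. This is useful for checking results; it is of course not the parallel CAnt design and does not match its linear-time bound.

```python
# Pattern-Assembly by memoized recursion: find every sequence of test
# patterns (repetition allowed) whose concatenation equals the master pattern.

def assemblies(master, pieces):
    """All lists of pieces concatenating to master, in piece-list order."""
    memo = {}

    def from_pos(i):
        """All assemblies of the suffix master[i:]."""
        if i == len(master):
            return [[]]
        if i in memo:
            return memo[i]
        out = []
        for p in pieces:
            if master.startswith(p, i):
                out.extend([p] + rest for rest in from_pos(i + len(p)))
        memo[i] = out
        return out

    return from_pos(0)

print(assemblies("abab", ["ab", "abab", "b", "a"]))
```

The memo table plays the role of sharing partial results across overlapping decompositions; the CAnt design instead exploits spatial parallelism across the \(n \times n\) mesh.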
Educational and Research Systems for Evaluating the Efficiency of Parallel Computations
Abstract
In this paper we consider educational and research systems that can be used to estimate the efficiency of parallel computing. ParaLab allows parallel computation methods to be studied. With the ParaLib library, we can compare parallel programming languages and technologies. The Globalizer Lab system is capable of estimating the efficiency of algorithms for solving computationally intensive global optimization problems. These systems can build models of various high-performance systems, formulate the problems to be solved, perform computational experiments in simulation mode, and analyze the results. Crucially, the described systems support a visual representation of the parallel computation process. Combined, these systems can be useful for developing high-performance parallel programs that take into account the specific features of modern supercomputing systems.
Victor Gergel, Evgeny Kozinov, Alexey Linev, Anton Shtanyk
Generalized Approach to Scalability Analysis of Parallel Applications
Abstract
This article describes an approach to the scalability analysis of parallel applications, which is a major part of the algorithm description used in AlgoWiki, the Open Encyclopedia of Parallel Algorithmic Features. The proposed approach is based on a suggested definition of the generalized scalability of a parallel application. This study uses joint, structured data on an application's execution together with supercomputing co-design technologies. Parallel application properties are studied by analyzing data collected from all available sources of dynamic characteristics, along with information about the hardware and software platforms corresponding to the features of an algorithm and its implementation. This allows reasonable conclusions to be drawn regarding the potential reasons for changes in execution quality for any parallel application, and allows the scalability of various programs to be compared.
Alexander Antonov, Alexey Teplov
System Monitoring-Based Holistic Resource Utilization Analysis for Every User of a Large HPC Center
Abstract
The problem of effective resource utilization is very challenging nowadays, especially for HPC centers running top-level supercomputing facilities with high energy consumption and a significant number of workgroups. The weakness of many system-monitoring-based approaches to efficiency study is their basic orientation toward professionals and the analysis of specific jobs, with low availability for regular users. We propose an all-round performance analysis approach covering single-application performance as well as project-level and overall system resource utilization, based on system monitoring data, which promises to be an effective, low-cost technique aimed at all types of HPC center users. Every user of an HPC center can access details on any of his executed jobs to better understand application behavior and sequences of job runs, including scalability studies, helping in turn to perform appropriate optimizations and implement co-design techniques. By taking all levels (user, project manager, administrator) into consideration, the approach helps improve the output of HPC centers.
Dmitry Nikitenko, Konstantin Stefanov, Sergey Zhumatiy, Vadim Voevodin, Alexey Teplov, Pavel Shvets
Co-design of a Particle-in-Cell Plasma Simulation Code for Intel Xeon Phi: A First Look at Knights Landing
Abstract
Three-dimensional particle-in-cell laser-plasma simulation is an important area of computational physics. Solving state-of-the-art problems requires large-scale simulation on a supercomputer using specialized codes. A growing demand for computational resources inspires research into improving efficiency and into co-design for supercomputers based on many-core architectures. This paper presents the first performance results of the particle-in-cell plasma simulation code PICADOR on the recently introduced Knights Landing generation of Intel Xeon Phi. A straightforward rebuild of the code yields a 2.43x speedup compared to the previous Knights Corner generation. Further code optimization results in an additional 1.89x speedup. The optimization performed is beneficial not only for Knights Landing, but also for high-end CPUs and Knights Corner. The optimized version achieves 100 GFLOPS double-precision performance on a Knights Landing device, with speedups of 2.35x compared to a 14-core Haswell CPU and 3.47x compared to a 61-core Knights Corner Xeon Phi.
Igor Surmin, Sergey Bastrakov, Zakhar Matveev, Evgeny Efimenko, Arkady Gonoskov, Iosif Meyerov
Efficient Distributed Computations with DIRAC
Abstract
High Energy Physics (HEP) experiments at the LHC collider at CERN were among the first scientific communities with very high computing requirements. Nowadays, researchers in other scientific domains need similar computational power and storage capacity. For the HEP experiments, the solution was found in the form of the computational grid: a distributed computing infrastructure integrating a large number of computing centers based on commodity hardware. These infrastructures are very well suited for high-throughput applications used to analyze large volumes of data with trivial parallelization into multiple independent execution threads. More advanced applications in HEP and other scientific domains can exploit complex parallelization techniques using multiple interacting execution threads. A growing number of High Performance Computing (HPC) centers, or supercomputers, support this mode of operation. One of the software toolkits developed for building distributed computing systems is the DIRAC Interware. It allows seamless integration of computing and storage resources based on different technologies into a single coherent system. This product was very successful in solving the problems of large HEP experiments and was upgraded to offer a general-purpose solution. The DIRAC Interware can also help include HPC centers in a common federation to achieve goals similar to those of computational grids. However, the integration of HPC centers imposes certain requirements on their internal organization and external connectivity, presenting a complex co-design problem. A distributed infrastructure including supercomputers is planned for construction; it will be applied to interdisciplinary large-scale problems of modern science and technology.
Viktor Gergel, Vladimir Korenkov, Andrei Tsaregorodtsev, Alexey Svistunov
The Co-design of Astrophysical Code for Massively Parallel Supercomputers
Abstract
The rapid growth of supercomputer technologies has become a driver for the development of the natural sciences. Most discoveries in astronomy, in elementary particle physics, in the design of new materials, and in DNA research are connected with numerical simulation and with supercomputers. Supercomputer simulation has become an important tool for processing the great volume of observational and experimental data accumulated by mankind. Modern scientific challenges make work on computer systems and on scientific software design highly topical. The architecture of future exascale systems is still being discussed; nevertheless, it is necessary to develop algorithms and software for such systems right now — software capable of using tens and hundreds of thousands of processors and of transmitting and storing large volumes of data. In the present work, a technology for the development of such algorithms and software is proposed. As an example of its use, the software development process is considered for several problems of astrophysics.
Boris Glinsky, Igor Kulikov, Igor Chernykh, Dmitry Weins, Alexey Snytnikov, Vladislav Nenashev, Andrey Andreev, Vitaly Egunov, Egor Kharkov
Hardware-Specific Selection of the Most Fast-Running Software Components
Abstract
Software development problems include, in particular, the selection of the fastest-running software components among those available. In this paper we propose to solve this problem by developing a prediction model that estimates a software component's runtime. The model is built as a function of algorithm parameters and computational system characteristics. We also studied which of these features are the most representative. As a result of these studies, a two-stage scheme for prediction model development, based on linear and non-linear machine learning algorithms, has been formulated. The paper presents a comparative analysis of runtime prediction results for several linear algebra problems on 84 personal computers and servers. The proposed approach shows an error of less than 22% for computational systems represented in the training data set.
Alexey Sidnev
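The two-stage idea described in the abstract above — a linear model refined by a non-linear stage — can be illustrated with a minimal pure-Python sketch. The feature (problem size), the synthetic measurements, and the nearest-neighbour residual correction are our own illustrative choices, not the paper's actual model:

```python
# Illustrative two-stage runtime-prediction sketch.
# Stage 1: ordinary least squares, runtime ~ a * size + b.
# Stage 2: non-linear correction using the residual of the
#          nearest training point (a stand-in for the paper's
#          non-linear machine learning stage).

def linear_fit(xs, ys):
    """Least-squares fit for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def two_stage_model(xs, ys):
    a, b = linear_fit(xs, ys)
    residuals = [y - (a * x + b) for x, y in zip(xs, ys)]

    def predict(x):
        # Stage 2: add the residual of the closest training point.
        i = min(range(len(xs)), key=lambda j: abs(xs[j] - x))
        return a * x + b + residuals[i]

    return predict

# Synthetic training measurements (sizes vs. runtimes in seconds).
sizes = [100, 200, 300, 400, 500]
times = [1.0, 2.1, 3.0, 4.5, 5.1]
predict = two_stage_model(sizes, times)
print(round(predict(400), 2))   # → 4.5 (training point recovered exactly)
```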
Automated Parallel Simulation of Heart Electrical Activity Using Finite Element Method
Abstract
In this paper we present an approach to the parallel simulation of the heart's electrical activity using the finite element method with the help of the FEniCS automated scientific computing framework. FEniCS allows scientific software to be developed in near-mathematical notation and provides automatic parallelization on MPI clusters. We implemented the ten Tusscher–Panfilov (TP06) cell model of cardiac electrical activity. Scalability testing of the implementation was performed using up to 240 CPU cores, and a 95x speedup was achieved. We evaluated various combinations of the parallel Krylov linear solvers and preconditioners available in FEniCS. The best performance was provided by the conjugate gradient and biconjugate gradient stabilized solvers with the successive over-relaxation preconditioner. Since the FEniCS-based implementation of the TP06 model uses notation close to the mathematical one, it can be utilized by computational mathematicians, biophysicists, and other researchers without extensive parallel computing skills.
Andrey Sozykin, Timofei Epanchintsev, Vladimir Zverev, Svyatoslav Khamzin, Aleksandr Bersenev
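The best-performing Krylov method named in the abstract, the conjugate gradient method, fits in a few lines. The sketch below is a plain, unpreconditioned CG on a tiny symmetric positive-definite system — only the underlying iteration, without the SOR preconditioner or the FEniCS machinery the paper actually uses:

```python
# Plain conjugate gradient for A x = b, with A symmetric positive-definite.
# Minimal sketch of the Krylov iteration; the paper's solver additionally
# applies a successive over-relaxation (SOR) preconditioner via FEniCS.

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg(A, b, tol=1e-12, max_iter=100):
    x = [0.0] * len(b)
    r = b[:]                       # residual r = b - A x, with x = 0
    p = r[:]
    rs = dot(r, r)
    for _ in range(max_iter):
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)    # step length along search direction
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = cg(A, b)
print([round(v, 4) for v in x])   # → [0.0909, 0.6364], i.e. [1/11, 7/11]
```

For a 2x2 system CG converges in at most two iterations; in the full cardiac simulation the same iteration runs on the large sparse systems assembled by FEniCS at each time step.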
Using hStreams Programming Library for Accelerating a Real-Life Application on Intel MIC
Abstract
The main goal of this paper is to assess the suitability of the hStreams programming library for porting a real-life scientific application to heterogeneous platforms with Intel Xeon Phi coprocessors. This emerging library offers a higher level of abstraction to provide effective concurrency among tasks and control over overall performance. In our study, we focus on applying the FIFO streaming model to a parallel application that implements a numerical model of alloy solidification. We show how scientific applications can benefit from multiple streams. To take full advantage of hStreams, we propose a decomposition of the studied application that distributes the tasks belonging to its computational core among two logical streams within two logical/physical domains. Effectively overlapping computations with data transfers is another goal achieved in this way. The proposed approach allows us to execute the whole application 3.5 times faster than the original parallel version running on two CPUs.
Lukasz Szustak, Kamil Halbiniak, Adam Kulawik, Roman Wyrzykowski, Piotr Uminski, Marcin Sasinowski
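The FIFO streaming model the paper builds on can be sketched in plain Python: tasks enqueued into a stream execute in enqueue order, while two independent streams run concurrently, so a data transfer in one stream can overlap a computation in the other. This is only a schematic illustration; the real hStreams library dispatches streams to logical/physical domains on the Xeon Phi:

```python
import queue
import threading

# Schematic FIFO-stream sketch: tasks within one stream run in FIFO
# order on a worker thread; separate streams run concurrently, so a
# "transfer" in one stream overlaps a "compute" in another.

class Stream:
    def __init__(self):
        self.tasks = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            task = self.tasks.get()
            if task is None:          # sentinel: stop after draining
                break
            task()

    def enqueue(self, fn, *args):
        self.tasks.put(lambda: fn(*args))

    def sync(self):                   # wait for all enqueued tasks
        self.tasks.put(None)
        self.thread.join()

results = []
lock = threading.Lock()

def transfer(block):
    with lock:
        results.append(f"transferred {block}")

def compute(block):
    with lock:
        results.append(f"computed {block}")

transfer_stream, compute_stream = Stream(), Stream()
for blk in range(2):
    transfer_stream.enqueue(transfer, blk)  # overlaps the compute stream
    compute_stream.enqueue(compute, blk)
transfer_stream.sync()
compute_stream.sync()
print(sorted(results))
```

Within each stream the FIFO order is preserved (block 0 before block 1); across streams the interleaving is nondeterministic, which is exactly the overlap the decomposition in the paper exploits.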
Backmatter
Title
Algorithms and Architectures for Parallel Processing
Edited by
Jesus Carretero
Javier Garcia-Blas
Victor Gergel
Vladimir Voevodin
Iosif Meyerov
Juan A. Rico-Gallego
Juan C. Díaz-Martín
Pedro Alonso
Juan Durillo
José Daniel Garcia Sánchez
Alexey L. Lastovetsky
Fabrizio Marozzo
Qin Liu
Zakirul Alam Bhuiyan
Karl Fürlinger
Josef Weidendorfer
José Gracia
Copyright Year
2016
Electronic ISBN
978-3-319-49956-7
Print ISBN
978-3-319-49955-0
DOI
https://doi.org/10.1007/978-3-319-49956-7
