
About This Book

This book constitutes the proceedings of the 5th Latin American High Performance Computing Conference, CARLA 2018, held in Bucaramanga, Colombia, in September 2018.

The 24 papers presented in this volume were carefully reviewed and selected from 38 submissions. They are organized in topical sections on:

Artificial Intelligence; Accelerators; Applications; Performance Evaluation; Platforms and Infrastructures; Cloud Computing.



Artificial Intelligence


Parallel and Distributed Processing for Unsupervised Patient Phenotype Representation

Data-driven healthcare promises to detect new patterns for inpatient care, treatment, prevention, and comprehension of disease, and to predict the duration of a hospital stay, its cost, or whether death is likely to occur during hospitalization.
Building precise patient phenotype representations from clinical data is challenging: high-dimensional, noisy data with missing values must be projected into a new low-dimensional space. Likewise, training unsupervised learning models on growing clinical datasets raises many algorithmic issues, such as time to model convergence and memory capacity.
This paper presents the DiagnoseNET framework, which automates patient phenotype extraction and applies the resulting representations to predict different medical targets. It provides high-level features including full workflow orchestration, with stage pipelining for mining clinical data and using unsupervised feature representations to initialize supervised models, and data resource management for training parallel and distributed deep neural networks.
As a case study, we used a clinical dataset from admission and hospital services to build a general-purpose inpatient phenotype representation applicable to different medical targets; the first target is to classify the main purpose of inpatient care.
The research focuses on managing the data according to its dimensions, the model complexity, the number of workers selected, and the memory capacity for training unsupervised stacked denoising autoencoders on a mini-cluster of Jetson TX2 boards.
Mapping tasks to the available computational resources is therefore a key factor in minimizing the number of epochs needed for the model to converge, reducing execution time and maximizing energy efficiency.
John Anderson García Henao, Frédéric Precioso, Pascal Staccini, Michel Riveill

Evolutionary Algorithms for Convolutional Neural Network Visualisation

Deep Learning is based on deep neural networks trained on huge sets of examples. It has enabled computers to compete with, or even outperform, humans at many tasks, from playing Go to driving vehicles.
Still, it remains hard to understand how these networks actually operate. An observer can see any individual local behaviour, but gains little insight into the global decision-making process.
However, there is a class of neural networks widely used for image processing, convolutional networks, where each layer contains features working in parallel. By their structure, these features keep some spatial information across a network’s layers. Visualisation of this spatial information at different locations in a network, notably on input data that maximise the activation of a given feature, can give insights on the way the model works.
This paper investigates the use of Evolutionary Algorithms to evolve such input images that maximise feature activation. Compared with pre-existing approaches, ours is currently more computationally expensive but more widely applicable.
Nicolas Bernard, Franck Leprévost
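The activation-maximisation loop described in the abstract can be sketched with a simple (1+lambda) evolution strategy. The `toy_activation` fitness below is a stand-in for a real CNN feature activation; the paper's actual networks and fitness functions are not reproduced here.

```python
import random

random.seed(0)

def evolve_image(fitness, n_pixels=64, generations=200, lam=8, sigma=0.1):
    """(1+lambda) evolution strategy: mutate a flat grey-scale 'image' in [0, 1],
    keeping a child whenever it activates the feature more than the parent."""
    parent = [random.random() for _ in range(n_pixels)]
    best = fitness(parent)
    for _ in range(generations):
        for _ in range(lam):
            child = [min(1.0, max(0.0, p + random.gauss(0.0, sigma)))
                     for p in parent]
            f = fitness(child)
            if f > best:
                parent, best = child, f
    return parent, best

def toy_activation(img, w=8):
    """Stand-in 'feature': total brightness of the top-left 4x4 patch."""
    return sum(img[r * w + c] for r in range(4) for c in range(4))

img, score = evolve_image(toy_activation)
```

With a real convolutional network, `fitness` would instead return the activation of the chosen feature map for the candidate image.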

Breast Cancer Classification: A Deep Learning Approach for Digital Pathology

Breast cancer is the second leading cause of cancer death among women. It is not a single disease, but comprises many different biological entities with distinct pathological features and clinical implications. Pathologists face a substantial increase in the workload and complexity of digital pathology in cancer diagnosis due to the advent of personalized medicine, so diagnostic protocols have to focus equally on efficiency and accuracy. Computerized image processing technology has been shown to improve efficiency, accuracy, and consistency in histopathology evaluations, and can provide decision support to ensure diagnostic consistency. We propose using deep learning and convolutional neural networks (CNNs) to classify a subset of breast cancer histopathological images of benign and malignant breast tumors from the publicly available BreakHis dataset. We design a workflow featuring patch extraction from whole-slide images, CNN training, and performance evaluation to solve this problem.
Pablo Guillén-Rondon, Melvin Robinson, Jerry Ebalunode

Where Do HPC and Cognitive Science Meet in Latin America?

In the last few decades there has been a noticeable shift of attention in the high-performance computing (HPC) applications development community from deterministic to heuristic models of problem solving, mainly due to the observation that models based on human knowledge and expertise have proven to be good approaches to solving complex problems. A shift of artificial intelligence (AI) toward HPC has also occurred, as AI researchers now find in HPC the means to build more complex models of human cognition. This is the case in general, and it is also true in the Latin American region. On the other hand, in this region there seems to be an estrangement between the cognitive science (CS) and AI communities, perhaps due to the shift of AI toward HPC and the resulting change of attention of AI researchers. However, there is a noticeable increase in the number of academic programs in the region focusing on CS. In this article we provide evidence for these assertions and propose a list of recommendations on how to bring the HPC and CS communities closer together in the region, as well as the potential benefits of such a process.
Alvaro de la Ossa Osegueda



A Hybrid Reinforcement Learning and Cellular Automata Model for Crowd Simulation on the GPU

We present a GPU-based hybrid model for crowd simulations. The model uses reinforcement learning to guide groups of pedestrians towards a goal while adapting to environmental dynamics, and a cellular automaton to describe individual pedestrians’ interactions. In contrast to traditional multi-agent reinforcement learning methods, our model encodes the learned navigation policy into a navigation map, which is used by the cellular automaton’s update rule to calculate the next simulation step. As a result, reinforcement learning is independent of the number of agents, allowing the simulation of large crowds. Implementation of this model on the GPU allows interactive simulations of several hundreds of pedestrians.
Sergio Ruiz, Benjamín Hernández
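The split between the learned navigation map and the cellular-automaton update can be illustrated on a small grid. In this sketch a BFS distance-to-goal map stands in for the learned policy (the paper derives its map via reinforcement learning), and the update rule moves each agent to its lowest-valued free neighbour; this is an illustration, not the authors' GPU code.

```python
from collections import deque

def navigation_map(grid, goal):
    """BFS distance-to-goal map over free cells (grid value 0 = free).
    Stands in for the navigation map encoding the learned policy."""
    rows, cols = len(grid), len(grid[0])
    dist = [[float("inf")] * cols for _ in range(rows)]
    dist[goal[0]][goal[1]] = 0
    q = deque([goal])
    while q:
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 \
               and dist[nr][nc] > dist[r][c] + 1:
                dist[nr][nc] = dist[r][c] + 1
                q.append((nr, nc))
    return dist

def step(agents, dist):
    """CA-style update: each agent moves to its lowest-valued free neighbour."""
    occupied = set(agents)
    out = []
    for r, c in agents:
        best = (dist[r][c], (r, c))
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < len(dist) and 0 <= nc < len(dist[0]) \
               and (nr, nc) not in occupied and dist[nr][nc] < best[0]:
                best = (dist[nr][nc], (nr, nc))
        occupied.discard((r, c))
        occupied.add(best[1])
        out.append(best[1])
    return out
```

Because the map is computed once per environment change, the per-step cost depends only on the number of agents, which is what makes the approach independent of the learning phase.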

In-situ Visualization of the Propagation of the Electric Potential in a Human Atrial Model Using GPU

Computational heart-tissue models involve the solution of non-linear partial and ordinary differential equations. Applying discretization methods (finite differences, finite elements) to solve them results in sets of operations between matrices on the order of millions of entries, and hence in programs with long execution times.
The present work simulates human atrial tissue using the Courtemanche electrical model [1]. Cell coupling is handled with the finite difference method; the CPU implementation uses the Armadillo C++ library [2], and acceleration was achieved through the CUDA library [3] on an NVIDIA Tesla K40 card.
Additionally, the visualization was implemented using ParaView Catalyst [4]: two computing nodes allow the numerical method to run on one node while the other performs the visualization simultaneously.
A novel process for visualizing the human atrium was implemented, and a 200x speedup was achieved using CUDA and ArrayFire [5].
John H. Osorio, Andres P. Castano, Oscar Henao, Juan Hincapie

GPU Acceleration for Directional Variance Based Intra-prediction in HEVC

HEVC (High Efficiency Video Coding) greatly improves the efficiency of intra-prediction in video compression. However, such gains are achieved with an encoder of significantly increased computational complexity. In this paper we present a Graphics Processing Unit (GPU) implementation of our modified intra-prediction algorithm, Mean Directional Variance in Sliding Window (MDV-SW). MDV-SW detects the texture orientation of a block of input pixels and enables easy parallelization of intra-prediction by doubling the number of detectable texture orientations and eliminating the data dependency, using pixels from the original image instead of reconstructed pixels as reference samples. Once this dependency was removed, we were able to calculate all intra-prediction blocks of a frame in parallel on hardware accelerators, specifically the GPU. Results show that the GPU implementation speeds up execution by 10x compared to the sequential implementation.
Derek Nola, Elena G. Paraschiv, Damián Ruiz-Coll, María Pantoja, Gerardo Fernández-Escribano

Fast Marching Method in Seismic Ray Tracing on Parallel GPU Devices

The sequential fast marching method relies on a serial priority queue, which in turn implies high complexity for large volumes of data. In this paper, an algorithm to compute the shortest path in the fast marching method for 3D data on graphics processing unit (GPU) devices is introduced. Numerical simulations show that the proposed algorithm achieves speedups of 2\(\times \) and 3\(\times \) compared to the sequential algorithm.
Jorge Monsegny, Jonathan Monsalve, Kareth León, Maria Duarte, Sandra Becerra, William Agudelo, Henry Arguello
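The serial priority-queue formulation that the paper accelerates can be sketched as a heap-ordered front expansion on a 2D grid; arrival time grows as distance divided by the local propagation speed. This is a simplified first-order sketch, not the authors' 3D GPU algorithm.

```python
import heapq

def fast_marching(speed, source):
    """Sequential fast-marching baseline on a 4-neighbour grid: expand the
    front in order of arrival time T, kept in a priority queue (heap)."""
    rows, cols = len(speed), len(speed[0])
    T = [[float("inf")] * cols for _ in range(rows)]
    T[source[0]][source[1]] = 0.0
    heap = [(0.0, source)]
    while heap:
        t, (r, c) = heapq.heappop(heap)
        if t > T[r][c]:
            continue  # stale heap entry, a shorter time was already found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nt = t + 1.0 / speed[nr][nc]
                if nt < T[nr][nc]:
                    T[nr][nc] = nt
                    heapq.heappush(heap, (nt, (nr, nc)))
    return T
```

The heap enforces a strictly serial processing order, which is exactly the bottleneck a parallel GPU formulation must relax.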

Improving Performance and Energy Efficiency of Geophysics Applications on GPU Architectures

Energy and performance of parallel systems are an increasing concern for new large-scale systems. Research has been conducted in response to this challenge with the aim of building more energy-efficient systems. In this context, this paper proposes optimization methods to accelerate performance and increase the energy efficiency of geophysics applications, exploiting algorithmic and GPU memory characteristics. The optimizations we developed for Graphics Processing Unit (GPU) stencil applications achieve a performance improvement of up to 44.65% compared with the read-only version. The computational results show that combining read-only memory, Z-axis internalization, and reuse of specific architecture registers increases energy efficiency by up to 54.11% when shared memory is used and by up to 44.53% when read-only memory is used.
Pablo J. Pavan, Matheus S. Serpa, Emmanuell Diaz Carreño, Víctor Martínez, Edson Luiz Padoin, Philippe O. A. Navaux, Jairo Panetta, Jean-François Mehaut

FleCSPHg: A GPU Accelerated Framework for Physics and Astrophysics Simulations

This paper presents FleCSPHg, a GPU-accelerated framework dedicated to Smoothed Particle Hydrodynamics (SPH) and gravitation (FMM) computation. Astrophysical simulations, here the coalescence of binary neutron stars, are used as test cases. In this context we show the efficiency of the tree data structure in two settings: near-neighbor search for SPH, and the N-body algorithm for the gravitation computation.
FleCSPHg is based on FleCSI and FleCSPH, developed at Los Alamos National Laboratory. This work is a first step toward a multi-physics framework for tree-based methods.
The paper details the SPH and FMM methods and the simulation we propose. It describes FleCSI and FleCSPH and our strategy for dividing the workload between CPU and GPU: the CPU performs the tree traversal and generates tasks at a specific depth for the GPU; these tasks are offloaded to the GPU and gathered back on the CPU at the end of the traversal.
The computation is up to 3.5 times faster in the GPU version than on a classical CPU. We also give details on the binary neutron star coalescence simulation itself.
Julien Loiseau, François Alin, Christophe Jaillet, Michaël Krajecki



Comparison of Tree Based Strategies for Parallel Simulation of Self-gravity in Agglomerates

This article presents an algorithm conceived to improve the computational efficiency of simulations in ESyS-Particle that involve a large number of particles. ESyS-Particle applies the Discrete Element Method to simulate the interaction of agglomerates of particles. The proposed algorithm is based on the Barnes & Hut method, in which the domain is divided and organized in an octree. The algorithm is compared to a variant that uses a binary tree instead of an octree. Experimental evaluation is performed over two scenarios: a collapsing cube, and two agglomerates orbiting each other. The evaluation comprises the performance analysis of both scenarios using the two algorithms, including a comparison of the results obtained and an analysis of numerical accuracy. Results indicate that the octree version is faster and more accurate than the binary tree version.
Nestor Rocchetti, Sergio Nesmachnow, Gonzalo Tancredi
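The Barnes & Hut idea underlying the octree version can be sketched in 2D with a quadtree: a distant cell is treated as a single pseudo-particle at its centre of mass whenever it subtends a small enough opening angle theta. This is an illustrative sketch (with G = 1), not ESyS-Particle code.

```python
import math

class Node:
    """Quadtree cell: square of half-width `half` centred at (cx, cy)."""
    def __init__(self, cx, cy, half):
        self.cx, self.cy, self.half = cx, cy, half
        self.mass = 0.0
        self.mx = self.my = 0.0  # mass-weighted coordinate sums
        self.body = None         # single body stored in a leaf
        self.kids = None

    def insert(self, x, y, m):
        if self.kids is None and self.body is None and self.mass == 0.0:
            self.body = (x, y, m)          # empty leaf: store the body
        else:
            if self.kids is None:          # occupied leaf: subdivide
                h = self.half / 2
                self.kids = [Node(self.cx + dx * h, self.cy + dy * h, h)
                             for dx in (-1, 1) for dy in (-1, 1)]
                bx, by, bm = self.body
                self.body = None
                self._child(bx, by).insert(bx, by, bm)
            self._child(x, y).insert(x, y, m)
        self.mass += m
        self.mx += m * x
        self.my += m * y

    def _child(self, x, y):
        # kids order: (-1,-1), (-1,+1), (+1,-1), (+1,+1)
        return self.kids[(2 if x >= self.cx else 0) + (1 if y >= self.cy else 0)]

def accel(node, x, y, theta=0.5, eps=1e-9):
    """Gravitational acceleration at (x, y); far cells approximated by their
    centre of mass when cell_size / distance < theta."""
    if node is None or node.mass == 0.0:
        return (0.0, 0.0)
    comx, comy = node.mx / node.mass, node.my / node.mass
    dx, dy = comx - x, comy - y
    d = math.hypot(dx, dy)
    if node.kids is None or (2 * node.half) / (d + eps) < theta:
        if d < eps:
            return (0.0, 0.0)              # skip self-interaction
        f = node.mass / d ** 3
        return (f * dx, f * dy)
    ax = ay = 0.0
    for k in node.kids:
        kx, ky = accel(k, x, y, theta, eps)
        ax += kx
        ay += ky
    return (ax, ay)
```

Setting theta to 0 disables the approximation and recovers the direct pairwise sum, which is a convenient way to check the tree's accounting.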

Parallel Implementations of Self-gravity Calculation for Small Astronomical Bodies on Xeon Phi

This article presents parallel implementations of the Mass Approximation Distance Algorithm for self-gravity calculation on Xeon Phi. The proposed method is relevant for performing simulations of realistic systems modeling small astronomical bodies, which are agglomerates of thousands to millions of particles. Specific strategies and optimizations are described for execution on the Xeon Phi architecture. The experimental analysis evaluates the computational efficiency of the proposed implementations on realistic scenarios, reporting the best implementation options. Performance improvements of up to 146.4\(\times \) are reported for scenarios with more than one million particles.
Sebastián Caballero, Andrés Baranzano, Sergio Nesmachnow

Visualization of a Jet in Turbulent Crossflow

Direct Numerical Simulation (DNS) with high spatial and temporal resolution of a jet transversely issuing into a turbulent boundary layer subject to a very strong favorable pressure gradient (FPG) has been performed. The analysis is done by prescribing accurate turbulent information (instantaneous velocity and temperature) at the inlet of a computational domain for simulations of spatially developing turbulent boundary layers based on the Dynamic Multiscale Approach (JFM, 670, pp. 581–605, 2011). Scientific visualization of flow parameters is carried out with the main purpose of gaining better insight into the complex set of vortical structures that emerge from the jet-crossflow interaction. An interface has been created to convert the original binary output files of the parallel flow solver PHASTA into readable input documents for the Autodesk Maya software: specifically, a set of scripts that create customized Maya nCache files from structured datasets. Inside Maya, standard tools and techniques commonly utilized in feature-film production are used to produce high-end renderings of the converted files. The major effect of strong FPG on crossflow jets has been identified as a damping of the counter-rotating vortex pair (CVP) system.
Guillermo Araya, Guillermo Marin, Fernando Cucchietti, Irene Meta, Rogeli Grima

Acceleration of Hydrology Simulations Using DHSVM for Multi-thousand Runs and Uncertainty Assessment

Hydrology is the study of water resources; it tracks attributes of water such as its quality and movement. As a tool, hydrology allows researchers to investigate topics such as the impacts of wildfires, logging, and commercial development. Due to the cost and difficulty of collecting complete sets of data, researchers rely on simulations. The Distributed Hydrology Soil Vegetation Model (DHSVM) is a software package that uses mathematical models to numerically represent watersheds. In this paper we present an acceleration of DHSVM. As hydrology research produces large amounts of data, and accurate simulation of realistic hydrology events can take prohibitive amounts of time, accelerating these simulations is a crucial task. The paper implements and analyzes various high-performance computing (HPC) improvements to the original code base at different levels: compiler, multicore, and distributed computing. Results show that compiler optimization provides improvements of 220% on a single computer, and multicore features improve execution times by about 440% compared to a sequential implementation.
Andrew Adriance, Maria Pantoja, Chris Lupo

Fine-Tuning an OpenMP-Based TVD–Hopmoc Method Using Intel® Parallel Studio XE Tools on Intel® Xeon® Architectures

This paper is concerned with parallelizing the TVD–Hopmoc method for numerical time integration of evolutionary differential equations. Using Intel® Parallel Studio XE tools, we studied three OpenMP implementations of the TVD–Hopmoc method (naive, CoP, and EWS-Sync), with executions performed on Intel® Xeon® Many Integrated Core and Scalable processors. Our implementation, named EWS-Sync, defines an array that represents threads, and the scheme synchronizes only adjacent threads. Moreover, this approach reduces OpenMP scheduling time by employing an explicit work-sharing strategy: instead of letting the OpenMP API perform thread scheduling implicitly, the implementation of the 1-D TVD–Hopmoc method partitions among threads the array that represents the computational mesh of the numerical method. The scheme also diminishes OpenMP spin time by avoiding barriers, using an explicit synchronization mechanism in which a thread waits only for its two adjacent threads. Numerical simulations show that this approach achieves promising performance gains in shared memory for multi-core and many-core environments.
Frederico L. Cabral, Carla Osthoff, Roberto P. Souto, Gabriel P. Costa, Sanderson L. Gonzaga de Oliveira, Diego Brandão, Mauricio Kischinhevsky
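The explicit work-sharing idea, partitioning the mesh array into contiguous per-thread chunks up front instead of relying on the runtime's implicit scheduling, can be sketched as follows (a hypothetical helper, not the authors' code):

```python
def ews_partition(n, nthreads):
    """Explicit work-sharing: split mesh indices [0, n) into nthreads
    contiguous chunks, as balanced as possible. Each thread then owns one
    chunk for the whole run and only needs to coordinate with the threads
    owning the neighbouring chunks."""
    base, extra = divmod(n, nthreads)
    chunks, start = [], 0
    for t in range(nthreads):
        size = base + (1 if t < extra else 0)  # spread the remainder
        chunks.append((start, start + size))
        start += size
    return chunks
```

Because each thread's stencil touches only the boundary cells of the two neighbouring chunks, a full barrier can be replaced by pairwise waits on those two neighbours, which is the synchronization saving the abstract describes.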

Performance Evaluation


Performance Evaluation of Stencil Computations Based on Source-to-Source Transformations

Stencil computations are common in High Performance Computing (HPC) applications; they consist of a pattern that replicates the same calculation across a data domain. The Finite-Difference Method is an example of stencil computation and is used to solve real problems in diverse areas modeled by Partial Differential Equations (electromagnetics, fluid dynamics, geophysics, etc.). Although a large body of literature on the optimization of this class of applications is available, performance evaluation and optimization across different HPC architectures remain a challenge. In this work, we implemented the 7-point Jacobi stencil in a source-to-source transformation framework (BOAST) to evaluate the performance of different HPC architectures. The results show that the same source code can be executed on current architectures with improved performance, helping programmers develop applications without depending on hardware features.
Víctor Martínez, Matheus S. Serpa, Pablo J. Pavan, Edson Luiz Padoin, Philippe O. A. Navaux
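The 7-point Jacobi stencil evaluated in the paper averages each interior point with its six axis-aligned neighbours on a 3D grid. A minimal reference version for clarity (the BOAST-generated kernels are not reproduced here):

```python
import copy

def jacobi7(u, iters=1):
    """Plain 7-point Jacobi sweeps on a 3D grid stored as nested lists.
    Each interior point becomes the average of itself and its six
    neighbours; boundary values are held fixed."""
    nz, ny, nx = len(u), len(u[0]), len(u[0][0])
    for _ in range(iters):
        v = copy.deepcopy(u)  # Jacobi reads old values, writes new ones
        for k in range(1, nz - 1):
            for j in range(1, ny - 1):
                for i in range(1, nx - 1):
                    v[k][j][i] = (u[k][j][i]
                                  + u[k - 1][j][i] + u[k + 1][j][i]
                                  + u[k][j - 1][i] + u[k][j + 1][i]
                                  + u[k][j][i - 1] + u[k][j][i + 1]) / 7.0
        u = v
    return u
```

Every interior point is updated independently from the previous iterate, which is what makes this pattern a natural target for source-to-source generation of parallel variants.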

Benchmarking LAMMPS: Sensitivity to Task Location Under CPU-Based Weak-Scaling

This investigation summarizes a set of executions completed on the supercomputers Stampede at TACC (USA), Helios at IFERC (Japan), and Eagle at PSNC (Poland) with the molecular dynamics solver LAMMPS, compiled for CPUs. A communication-intensive benchmark based on long-distance interactions tackled by the Fast Fourier Transform operator has been selected to test its sensitivity to rather different patterns of task location, and hence to identify the best way to run further simulations for this family of problems. Weak-scaling tests show that the execution time of LAMMPS is closely linked to the cluster topology, as revealed by the varying execution times observed when scaling up to thousands of MPI tasks. Notably, two clusters exhibit time savings (up to 61% within the parallelization range) when the MPI task mapping follows a concentration pattern over as few nodes as possible. Besides being useful from the user's standpoint, this result may also help improve cluster throughput by, for instance, adding live-migration decisions to scheduling policies in cases of communication-intensive behaviour detected in characterization tests. It also opens a similar avenue for more efficient usage of the cluster from the energy-consumption point of view.
José A. Moríñigo, Pablo García-Muller, Antonio J. Rubio-Montero, Antonio Gómez-Iglesias, Norbert Meyer, Rafael Mayo-García

Analyzing Communication Features and Community Structure of HPC Applications

A few exascale machines are scheduled to become operational in the next couple of years. Reaching this milestone required the HPC community to overcome obstacles in programmability, power management, memory hierarchy, and reliability. Similar challenges must be faced in the pursuit of further performance gains. In particular, the design of interconnects stands out as a major hurdle. Computer networks for extreme-scale systems will need a deeper understanding of the communication characteristics of the applications that will run on those systems. We analyzed a set of nine representative HPC applications and created a catalog of well-defined communication patterns that constitute building blocks for modern scientific codes. Furthermore, we found little difference between popular community-detection algorithms, which tend to form few but relatively large communities.
Manfred Calvo, Diego Jiménez, Esteban Meneses

Power Efficiency Analysis of a Deep Learning Workload on an IBM “Minsky” Platform

The rise of Deep Learning techniques has attracted special attention to GPUs usage for better performance of model computation. Most frameworks for Cognitive Computing include support to offload model training and inferencing to graphics hardware, and this is so common that GPU designers are reserving die area for special function units tailored to accelerating Deep Learning computation. Measuring the capability of a hardware platform to run these workloads is a major concern for vendors and consumers of this exponentially growing market. In a previous work [9] we analyzed the execution times of the Fathom AI workloads [2] in CPUs and CPUs+GPUs. In this work we measure the Fathom workloads in the POWER8-based “Minsky” [15] platform, profiling power consumption and energy efficiency in GPUs. We explore alternative forms of execution via GPU power and frequency capping with the aim of reducing Energy-to-Solution (ETS) and Energy-Delay-Product (EDP). We show important ETS savings of up to 27% with half of the workloads decreasing the EDP. We also expose the advantages of frequency capping with respect to power capping in NVIDIA GPUs.
Mauricio D. Mazuecos Pérez, Nahuel G. Seiler, Carlos Sergio Bederián, Nicolás Wolovick, Augusto J. Vega

Platforms and Infrastructures


Orlando Tools: Development, Training, and Use of Scalable Applications in Heterogeneous Distributed Computing Environments

We address concepts and principles of the development, training, and use of applications in heterogeneous environments that integrate different computational infrastructures, including HPC clusters, grids, and clouds. Existing differences between the grid and cloud computing models significantly complicate problem-solving processes in such environments for end users. In this regard, we propose a toolkit named Orlando Tools for creating scalable applications for solving large-scale scientific and applied problems. It provides mechanisms for subject domain specification, problem formulation, problem-solving time prediction, problem-solving scheme execution, monitoring, etc. The toolkit also supports hands-on skills training for end users. To demonstrate the practicability and benefits of Orlando Tools, we present an example of the development and use of a scalable application for solving practical problems of warehouse logistics.
Andrei Tchernykh, Alexander Feoktistov, Sergei Gorsky, Ivan Sidorov, Roman Kostromin, Igor Bychkov, Olga Basharina, Vassil Alexandrov, Raul Rivera-Rodriguez

Methodology for Tailored Linux Distributions Development for HPC Embedded Systems

Hardware for embedded devices has increasing capabilities, yet popular Linux distributions incorporate large sets of applications and services that make intensive use of resources, limiting the hardware that can run these distributions. This work develops a methodology to build a lightweight operating system oriented to scientific applications, mainly considering their behavior in terms of the type of resource most used.
Gilberto Díaz, Pablo Rojas, Carlos Barrios

Cloud Computing


Cost and QoS Optimization of Cloud-Based Content Distribution Networks Using Evolutionary Algorithms

This work addresses the multi-objective resource provisioning problem for building cloud-based CDNs. The optimization objectives are the minimization of VM, network, and storage costs, and the maximization of QoS for the end user. A brokering model is proposed in which a single cloud-based CDN hosts multiple content providers by applying a resource-sharing strategy. Following this model, an offline multi-objective evolutionary approach is applied to optimize resource provisioning, while a greedy heuristic is proposed for online routing of content. Experimental results indicate the proposed approach may reduce total costs by up to 10.6% while maintaining high QoS values.
Santiago Iturriaga, Gerardo Goñi, Sergio Nesmachnow, Bernabé Dorronsoro, Andrei Tchernykh
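The multi-objective selection at the heart of an evolutionary approach keeps only non-dominated candidates. A minimal Pareto-dominance filter over (cost, negated-QoS) pairs, both to be minimized, illustrates the idea (this is a generic sketch, not the authors' algorithm):

```python
def dominates(a, b):
    """a dominates b when a is no worse in every objective and strictly
    better in at least one (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and \
           any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Non-dominated subset of candidate (cost, -QoS) provisioning plans."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

The evolutionary search repeatedly applies such a filter to its population, so the final output is a set of cost/QoS trade-offs rather than a single solution.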

Bi-objective Analysis of an Adaptive Secure Data Storage in a Multi-cloud

Security issues in cloud computing, and the solutions proposed in the literature, are among the most active topics of research, yet many problems related to cloud storage remain unsolved. In this paper, we focus on an adaptive model of data storage based on Secret Sharing Schemes (SSS) and the Residue Number System (RNS). We propose five strategies to minimize information loss and the time to upload data to and download data from the cloud, and we evaluate these strategies on seven Cloud Storage Providers (CSPs). We study the correlation of system settings with the probability of information loss, data redundancy, speed of access to CSPs, and encoding/decoding speeds. We demonstrate that strategies that choose the CSPs with the best upload speeds and then, after storing, migrate data to the CSPs with the lowest probability of information loss or the best download speeds show better performance.
Esteban C. Lopez-Falcon, Vanessa Miranda-López, Andrei Tchernykh, Mikhail Babenko, Arutyun Avetisyan
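The RNS part of such a scheme splits an integer into residues modulo pairwise-coprime bases, so each share can be stored at a different CSP and the original recombined with the Chinese Remainder Theorem. A minimal sketch without the secret-sharing redundancy the paper adds (the moduli here are chosen purely for illustration):

```python
from math import prod

def rns_encode(x, moduli):
    """Split an integer into residues; each residue can go to a different CSP."""
    return [x % m for m in moduli]

def rns_decode(residues, moduli):
    """Chinese Remainder Theorem reconstruction (requires Python 3.8+
    for the three-argument pow modular inverse)."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # modular inverse of Mi mod m
    return x % M

moduli = [251, 253, 255, 256]  # pairwise coprime; product ~4.1e9
secret = 123456789
shares = rns_encode(secret, moduli)
assert rns_decode(shares, moduli) == secret
```

In the basic split shown here all residues are needed to reconstruct the value; the threshold behaviour and loss tolerance studied in the paper come from adding redundant residues on top of this encoding.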

Fault Characterization and Mitigation Strategies in Desktop Cloud Systems

Desktop cloud platforms, such as UnaCloud and CernVM, run clusters of virtual machines that take advantage of idle resources on desktop computers. These platforms execute virtual machines alongside the applications started by users on those desktops. Although this improves the utilization of computer resources, desktop user actions, such as turning off the computer or running certain applications, may conflict with the virtual machines. Desktop clouds commonly run applications based on technologies such as TensorFlow or Hadoop that rely on master-worker architectures and are sensitive to failures in specific nodes. To support these applications, it is important to understand which failures may interrupt the execution of these clusters, which faults may cause errors, and which strategies can be used to mitigate or tolerate them. Using the UnaCloud platform as a case study, this paper presents an analysis of (1) the failures that may occur in desktop clouds and (2) the mitigation strategies available to improve dependability.
Carlos E. Gómez, Jaime Chavarriaga, Harold E. Castro

