
About this book

This book constitutes the refereed proceedings of the First HPCLATAM - CLCAR Joint Latin American High Performance Computing Conference, CARLA 2014, held in Valparaiso, Chile, in October 2014. The 17 revised full papers and the one paper presented were carefully reviewed and selected from 42 submissions. The papers are organized in topical sections on grid and cloud computing; HPC architectures and tools; parallel programming; scientific computing.



Track: GPU Computing

Efficient Symmetric Band Matrix-Matrix Multiplication on GPUs

Matrix-matrix multiplication is an important linear algebra operation with a myriad of applications in scientific and engineering computing. Due to the relevance and inherent parallelism of this operation, there exist many high-performance implementations for a variety of hardware platforms. Exploiting the structure of the matrices involved in the operation generally provides significant time and memory savings. This is the case, e.g., when one of the matrices is a symmetric band matrix. This work presents two efficient specialized implementations of the operation when a symmetric band matrix is involved and the target architecture contains a graphics processor (GPU). In particular, both implementations exploit the structure of the matrices to leverage the vast parallelism of the underlying hardware. The experimental results show remarkable reductions in the computation time over the tuned implementations of the same operation provided by MKL and CUBLAS.
Ernesto Dufrechou, Pablo Ezzatti, Enrique S. Quintana-Ortí, Alfredo Remón
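
To illustrate what exploiting the symmetric band structure means, here is a minimal CPU sketch in Python. It is not the authors' GPU implementation: the compact storage layout (`ab[d][j]` holding the entry on the d-th subdiagonal) and the function name `sbmm` are assumptions chosen for illustration. Only the main diagonal and the k subdiagonals are stored, and each stored off-diagonal entry is used twice thanks to symmetry.

```python
def sbmm(ab, k, B):
    """Return C = A @ B for a symmetric band matrix A with bandwidth k.

    ab[d][j] holds A[j + d][j] for d = 0..k, i.e. the main diagonal
    followed by the k subdiagonals. Entries outside the band are zero
    and are never stored or touched.
    """
    n = len(ab[0])          # order of A
    m = len(B[0])           # number of columns of B
    C = [[0.0] * m for _ in range(n)]
    for d in range(k + 1):
        for j in range(n - d):
            a = ab[d][j]    # a = A[j + d][j] = A[j][j + d] by symmetry
            i = j + d
            for c in range(m):
                C[i][c] += a * B[j][c]
                if d:       # mirrored entry above the main diagonal
                    C[j][c] += a * B[i][c]
    return C
```

The work per column of B is proportional to n·k rather than n², which is where the time and memory savings over a dense multiplication come from.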

Track: Grid and Cloud Computing

Adaptive Spot-Instances Aware Autoscaling for Scientific Workflows on the Cloud

This paper deals with the problem of autoscaling for scientific workflows on the cloud. Autoscaling is a process in which infrastructure scaling (i.e., determining the number and type of instances to acquire for executing an application) is interleaved with the scheduling of tasks to reduce the time and monetary cost of executions. This work proposes a novel strategy called Spot Instances Aware Autoscaling (SIAA), designed for the optimized execution of scientific workflow applications. SIAA takes advantage of the lower prices of Amazon EC2-like spot instances to achieve better performance and cost savings. To deal with execution efficiency, SIAA uses a novel heuristic scheduling algorithm to optimize workflow makespan and reduce the effect of task failures that may occur due to the use of spot instances. Experiments were carried out using several types of real-world scientific workflows. Results demonstrate that SIAA greatly outperforms state-of-the-art autoscaling mechanisms in terms of makespan (up to 88.0%) and cost of execution (up to 43.6%).
David A. Monge, Carlos García Garino

SI-Based Scheduling of Parameter Sweep Experiments on Federated Clouds

Scientists and engineers often require huge amounts of computing power to execute their experiments. This work focuses on the federated Cloud model, where custom virtual machines (VMs) are launched on appropriate hosts belonging to different providers to execute scientific experiments and minimize response time. Here, scheduling is performed at three levels. First, at the broker level, datacenters are selected by their network latencies via three policies (Lowest-Latency-Time-First, First-Latency-Time-First, and Latency-Time-In-Round). Second, at the infrastructure level, two Cloud VM schedulers based on Ant Colony Optimization (ACO) and Particle Swarm Optimization (PSO) are implemented for mapping VMs to appropriate datacenter hosts. Finally, at the VM level, jobs are assigned for execution to the preallocated VMs. Simulated experiments show that combining the broker-level policies with ACO and PSO succeeds in reducing the response time compared to combining the same policies with Genetic Algorithms.
Elina Pacini, Cristian Mateos, Carlos García Garino

Distributed Cache Strategies for Machine Learning Classification Tasks over Cluster Computing Resources

Scaling machine learning (ML) methods to learn from large datasets requires devising distributed data architectures and algorithms that support their iterative nature, in which the same data records are processed several times. Data caching becomes key to minimizing data transmission across iterations at each node and thus contributes to overall scalability. In this work we propose a two-level caching architecture (disk and memory) and benchmark different caching strategies in a distributed machine learning setup over a cluster with no shared RAM. Our results strongly favour strategies where (1) datasets are partitioned and preloaded throughout the distributed memory of the cluster nodes and (2) algorithms use data locality information to synchronize computations at each iteration. This supports the convergence towards models where “computing goes to data”, as observed in other Big Data contexts, and allows us to align strategies for parallelizing ML algorithms and to configure computing infrastructures appropriately.
John Edilson Arévalo Ovalle, Raúl Ramos-Pollan, Fabio A. González
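
As a rough illustration of a two-level (memory and disk) cache for data partitions that are re-read on every iteration, consider the following Python sketch. The class name `TwoLevelCache`, the `loader` callback standing in for a remote fetch, and the LRU eviction policy are all assumptions made for this example; the abstract does not describe the paper's actual design at this level of detail.

```python
import os
import pickle
from collections import OrderedDict


class TwoLevelCache:
    """Memory (LRU) cache in front of a local disk cache (hypothetical design)."""

    def __init__(self, loader, mem_capacity, disk_dir):
        self.loader = loader              # fetches a partition from the remote source
        self.mem_capacity = mem_capacity  # max partitions kept in memory
        self.disk_dir = disk_dir          # existing directory for the disk level
        self.mem = OrderedDict()
        self.stats = {"memory": 0, "disk": 0, "remote": 0}

    def _path(self, key):
        return os.path.join(self.disk_dir, f"{key}.pkl")

    def get(self, key):
        if key in self.mem:               # level 1: memory hit
            self.mem.move_to_end(key)
            self.stats["memory"] += 1
            return self.mem[key]
        path = self._path(key)
        if os.path.exists(path):          # level 2: local disk hit
            with open(path, "rb") as f:
                value = pickle.load(f)
            self.stats["disk"] += 1
        else:                             # miss: fetch remotely, persist locally
            value = self.loader(key)
            with open(path, "wb") as f:
                pickle.dump(value, f)
            self.stats["remote"] += 1
        self.mem[key] = value
        if len(self.mem) > self.mem_capacity:
            self.mem.popitem(last=False)  # evict the least recently used partition
        return value
```

After the first pass over the data, every partition is served from memory or local disk, so later iterations cause no further remote transfers.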

A Flexible Strategy for Distributed and Parallel Execution of a Monolithic Large-Scale Sequential Application

A wide range of scientific computing applications still rely on algorithms provided by large legacy codes or libraries, which rarely take advantage of multi-core architectures and are hardly ever distributed. In this paper we propose a flexible strategy for the execution of such legacy codes, identifying the main modules involved in the process. The key technologies involved and a tentative implementation are presented, shedding light on the challenges and limitations that surround this problem. Finally, a case study is presented for a large-scale, single-threaded, stochastic geostatistical simulation in the context of mining and geological modeling applications. Successful execution, running time, and speedup results are shown using a workstation cluster of up to eleven nodes.
Felipe Navarro, Carlos González, Óscar Peredo, Gerson Morales, Álvaro Egaña, Julián M. Ortiz

A Model to Calculate Amazon EC2 Instance Performance in Frost Prediction Applications

Frosts are one of the main causes of economic losses in the Province of Mendoza, Argentina. Although frosts occur every year, they can be predicted using Agricultural Monitoring Systems (AMS). AMS provide information to start and stop frost defense systems and thus reduce economic losses. In recent years, the emergence of infrastructures called Sensor Clouds has improved AMS in several aspects, such as scalability, reliability, and fault tolerance. Sensor Clouds use Wireless Sensor Networks (WSN) to collect data in the field and Cloud Computing to store and process these data. Currently, Cloud providers like Amazon offer different instances to store and process data cost-effectively. The variety of instances on offer creates a need for tools that determine which instance type is most appropriate, in terms of execution time and economic cost, for running agro-meteorological applications. In this paper we present a model to estimate the execution time and economic cost of Amazon EC2 instances for frost prediction applications.
Lucas Iacono, José Luis Vázquez-Poletti, Carlos García Garino, Ignacio Martín Llorente

Ensemble Learning of Run-Time Prediction Models for Data-Intensive Scientific Workflows

Workflow applications for in-silico experimentation involve the processing of large amounts of data. One of the core issues for the efficient management of such applications is the prediction of task performance. This paper proposes a novel approach that enables the construction of models for predicting task running times in data-intensive scientific workflows. Ensemble Machine Learning techniques are used to produce robust combined models with high predictive accuracy. Information derived from workflow systems and the characteristics and provenance of the data are exploited to improve the accuracy of the models. The proposed approach has been tested on Bioinformatics workflows for Gene Expression Analysis over homogeneous and heterogeneous computing environments. The results highlight the advantage of using ensemble models over single/standalone prediction models. Ensemble learning techniques reduced the prediction error by up to 24.9% in comparison with single-model strategies.
David A. Monge, Matěj Holec, Filip Železný, Carlos García Garino
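
The general idea of combining several run-time predictors into one ensemble can be sketched as follows. This is a toy Python example under stated assumptions: the three base learners (`fit_mean`, `fit_linear`, `fit_nearest`), the single input feature (e.g. input data size), and simple averaging as the combination rule are all hypothetical choices, not the techniques evaluated in the paper.

```python
from statistics import mean


def fit_mean(xs, ys):
    """Baseline: always predict the mean observed running time."""
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Least-squares line through (feature, running time) pairs."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)


def fit_nearest(xs, ys):
    """1-nearest-neighbour: reuse the running time of the closest past task."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def fit_ensemble(xs, ys, fitters=(fit_mean, fit_linear, fit_nearest)):
    """Train every base model and average their predictions."""
    models = [fit(xs, ys) for fit in fitters]
    return lambda x: mean(model(x) for model in models)
```

Averaging tends to dampen the errors of any single model, which is the intuition behind preferring ensembles to standalone predictors.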

Implications of CPU Dynamic Performance and Energy-Efficient Technologies on the Intrusiveness Generated by Desktop Grids Based on Virtualization

We evaluate how dynamic performance and energy-efficient technologies, features introduced in modern processor architectures, affect the intrusiveness that Desktop Grids based on virtualization generate on desktops. Such intrusiveness is defined as the degradation in performance perceived by an end-user who is using a desktop while it is opportunistically utilized by a Desktop Grid system. To measure it, we deploy virtual machines on a selection of desktops representing recent processor architectures. We then benchmark CPU-intensive workloads executed simultaneously on both the virtual and the physical environment. The results show that dynamic performance and energy-efficient technologies, when incorporated in the supporting desktops, directly affect the level of intrusiveness an end-user perceives. Furthermore, depending on the processor architecture, the intrusiveness varies from 3% to 100%. Finally, we propose policies aimed at minimizing such intrusiveness according to the supporting processor architectures and end-user profiles.
Germán Sotelo, Eduardo Rosales, Harold Castro

Efficient Fluorescence Microscopy Analysis over a Volunteer Grid/Cloud Infrastructure

This work presents a distributed computing algorithm over volunteer grid/cloud computing systems for Fluorescence Correlation Spectroscopy, a computational biology technique for obtaining quantitative information about the motion of molecules in living cells. High performance computing is needed to cope with long computing times when performing complex simulations, and volunteer grid/cloud computing emerges as a powerful paradigm for solving this kind of problem by using many computing resources distributed around the world in a coordinated way. The proposed algorithm applies a domain decomposition technique to perform many simulations using different cell models at the same time. The experimental evaluation, performed on a volunteer distributed computing infrastructure, demonstrates that efficient execution times are obtained when using the OurGrid middleware.
Miguel Da Silva, Sergio Nesmachnow, Maximiliano Geier, Esteban Mocskos, Juan Angiolini, Valeria Levi, Alfredo Cristobal

Track: HPC Architectures and Tools

Multiobjective Energy-Aware Datacenter Planning Accounting for Power Consumption Profiles

Energy efficiency is one of the major concerns when operating datacenters nowadays, as the large amount of energy used by parallel computing infrastructures impacts on both the energy cost and the electricity grid. Power consumption can be lowered by dynamically adjusting the power demand of datacenters, but conflicting objectives such as temperature and quality of service must be taken into account. This paper proposes a multiobjective evolutionary approach to solve the energy-aware scheduling problem in datacenters, regarding power consumption, temperature, and quality of service when controlling servers and cooling infrastructures. Accurate results are reported for both best solutions regarding each of the problem objectives and best trade-off solutions.
Sergio Nesmachnow, Cristian Perfumo, Íñigo Goiri

An Empirical Study of the Robustness of Energy-Aware Schedulers for High Performance Computing Systems under Uncertainty

This article presents an empirical evaluation of energy-aware schedulers under uncertainties in both the execution time of tasks and the energy consumption of the computing infrastructure. We address an important problem with direct application in current clusters and distributed computing systems by analyzing how the list scheduling techniques proposed in a previous work behave when considering errors in the execution time estimation of tasks and realistic deviations in the power consumption. The experimental evaluation is performed over realistic workloads and scenarios, and validated by in-situ measurements using a power distribution unit. Results demonstrate that errors in real-world scenarios have a significant impact on the accuracy of the scheduling algorithms. Different online and offline scheduling approaches were evaluated, and the online approach showed improvements of up to 32% in computing performance and up to 18% in energy consumption over the offline approach using the same scheduling algorithm.
Santiago Iturriaga, Sebastián García, Sergio Nesmachnow

MBSPDiscover: An Automatic Benchmark for MultiBSP Performance Analysis

Multi-Bulk Synchronous Parallel (MultiBSP) is a recently proposed parallel programming model for multicore machines that extends the classic BSP model. MultiBSP is very useful for designing algorithms and estimating their running time, both of which are hard to do in High Performance Computing applications. For a correct estimation of the running time, the main parameters of the MultiBSP model need to be determined for each multicore architecture. This article presents a benchmark proposal for measuring the parameters that characterize the communication and synchronization cost of the model. Our approach automatically discovers the hierarchical structure of the multicore architecture by using a specific tool (hwloc) that provides runtime information about the machine. We describe the design, the implementation, and the results of benchmarking two multicore machines. Furthermore, we report the validation of the proposed method by using a real MultiBSP implementation of the vector inner product algorithm and comparing the predicted execution time against the real execution time.
Marcelo Alaniz, Sergio Nesmachnow, Brice Goglin, Santiago Iturriaga, Veronica Gil Costa, Marcela Printista

Flexicache: Highly Reliable and Low Power Cache under Supply Voltage Scaling

Processors supporting a wide range of supply voltages are necessary to achieve high performance at nominal supply voltage and to reduce power consumption at low supply voltage. However, when the supply voltage is lowered below the safe margin (especially close to the threshold voltage level), the memory cell failure rate increases drastically. Thus, it is essential to provide reliability solutions for memory structures. This paper proposes a novel, reliable L1 cache design, Flexicache, which automatically configures itself for different supply voltages in order to tolerate different fault rates. Flexicache is a circuit-driven solution that achieves in-cache replication with no increase in access latency and a minimal increase in energy consumption. It defines three operating modes: Single Version Mode, Double Version Mode and Triple Version Mode. Compared to the best previous proposal, Flexicache can provide 34% higher energy reduction for L1 caches with 2× higher error correction capability in the low-voltage mode.
Gulay Yalcin, Azam Seyedi, Osman S. Unsal, Adrian Cristal
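
To give a feel for why keeping two or three versions of a word helps, the Python sketch below shows two textbook replication schemes: comparing two copies to detect a fault, and bitwise majority voting over three copies to correct one. This is an assumption for illustration only. The abstract does not say how Flexicache reconciles its replicas, and the function names are hypothetical.

```python
def detect_mismatch(a, b):
    """Two copies: a nonzero XOR means at least one copy is faulty."""
    return (a ^ b) != 0


def majority_vote(a, b, c):
    """Three copies: each output bit takes the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)
```

With three copies, a bit is recovered correctly as long as at most one copy is wrong in that bit position, even if different copies fail in different positions.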

Track: Parallel Programming

A Parallel Discrete Firefly Algorithm on GPU for Permutation Combinatorial Optimization Problems

The parallelism provided by low-cost environments such as multi-core and GPU processors has encouraged the design of algorithms that can exploit it. In recent years, the GPU has proven to be a successful environment for implementing different bio-inspired algorithms without major additional performance costs. Among these techniques, the Firefly Algorithm (FA) is a recent method based on the flashing light of fireflies. As a population-based algorithm whose operations have a low level of divergence, it is well suited to highly parallel execution on GPUs. In this work we describe the design of a Discrete Firefly Algorithm (GPU-DFA) to solve permutation combinatorial problems. Two well-known permutation optimization problems (the Travelling Salesman Problem and the DNA Fragment Assembly Problem) were employed to test GPU-DFA. We have evaluated numerical efficacy and performance with respect to a CPU-DFA version. Results demonstrate that our algorithm is a fast and robust procedure for the treatment of heterogeneous permutation combinatorial problems.
Pablo Vidal, Ana Carolina Olivera

A Parallel Multilevel Data Decomposition Algorithm for Orientation Estimation of Unmanned Aerial Vehicles

Fast orientation estimation of unmanned aerial vehicles is important for maintaining stable flight as well as for performing more complex tasks such as obstacle avoidance, search, and mapping. Orientation can be estimated by fusing different sensors such as accelerometers, gyroscopes and magnetometers. However, magnetometers suffer from high distortion in indoor flights, so information from cameras can be used as a replacement. This article presents a multilevel decomposition method to process images sent from an unmanned aerial vehicle to a ground station composed of a heterogeneous set of desktop computers. The multilevel decomposition is performed using an alternative hierarchy called Master/Taskmaster/Slaves in order to minimize network latency. Results show that this hierarchy can double the speed of the traditional Master/Slave approach.
Claudio Paz, Sergio Nesmachnow, Julio H. Toloza
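
A two-level data decomposition in the spirit of a Master/Taskmaster/Slaves hierarchy can be sketched as below. This is a hypothetical Python illustration: splitting an image by contiguous row blocks, and the names `split` and `multilevel_decompose`, are assumptions, since the abstract does not specify how the images are partitioned.

```python
def split(seq, parts):
    """Split seq into `parts` contiguous chunks whose sizes differ by at most one."""
    q, r = divmod(len(seq), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + q + (1 if i < r else 0)
        chunks.append(seq[start:end])
        start = end
    return chunks


def multilevel_decompose(rows, taskmasters, slaves_per_taskmaster):
    """Master splits rows among taskmasters; each taskmaster splits again for its slaves."""
    return [split(block, slaves_per_taskmaster) for block in split(rows, taskmasters)]
```

In such a scheme the master sends one large block to each taskmaster instead of many small messages to every worker, which is one way an intermediate level can reduce the cost of network latency.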

Track: Scientific Computing

A Numerical Solution for Wootters Correlation

This paper describes QDsim, a parallel application designed to compute the quantum concurrence by calculating the Wootters correlation of a quantum system. The system consists of two two-level quantum dots inside a resonant cavity. A Beowulf-like cluster was used for running QDsim. The application was developed using open, portable and scalable software and can be controlled via a GUI client from a remote terminal over either the Internet or a local network. A serial version and three parallel models (shared memory, distributed memory, and hybrid distributed/shared memory) using two different partitioning schemes were implemented to assess their performance. Results showed that the hybrid model using domain decomposition achieves the highest performance (12.2X speedup over the sequential version), followed by the distributed memory model (6.6X speedup). In both cases, the numerical error is within 1×10⁻⁴, which is accurate enough for estimating the correlation trend.
Abdul Hissami, Alberto Pretel, E. Tamura
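
For reference, the standard definition of the concurrence of a two-qubit state due to Wootters, which is presumably the quantity computed here, can be written as follows. The symbols are the usual textbook ones rather than notation taken from the paper: \rho is the two-qubit density matrix, \rho^{*} its complex conjugate in the computational basis, and \sigma_y the Pauli y matrix.

```latex
\tilde{\rho} = (\sigma_y \otimes \sigma_y)\, \rho^{*}\, (\sigma_y \otimes \sigma_y),
\qquad
C(\rho) = \max\bigl(0,\; \lambda_1 - \lambda_2 - \lambda_3 - \lambda_4\bigr)
```

Here \lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \lambda_4 are the square roots of the eigenvalues of \rho\tilde{\rho}. The concurrence ranges from 0 for a separable state to 1 for a maximally entangled one.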

Solving Nonlinear, High-Order Partial Differential Equations Using a High-Performance Isogeometric Analysis Framework

In this paper we present PetIGA, a high-performance implementation of Isogeometric Analysis built on top of PETSc. We show its use in solving nonlinear and time-dependent problems, such as phase-field models, by taking advantage of the high-continuity of the basis functions granted by the isogeometric framework. In this work, we focus on the Cahn-Hilliard equation and the phase-field crystal equation.
Adriano M. A. Côrtes, Philippe Vignal, Adel Sarmiento, Daniel García, Nathan Collier, Lisandro Dalcin, Victor M. Calo

Alya Multiphysics Simulations on Intel’s Xeon Phi Accelerators

In this paper we describe the porting of Alya, our HPC-based multiphysics simulation code, to Intel’s Xeon Phi, and assess the code's performance. This is a continuation of a short white paper in which the solid mechanics module was tested. Here, we add two more tests and assess the code in a much wider context. From the physical point of view, we solve a complex multiphysics problem (combustion in a kiln furnace) and a single-physics problem with an explicit scheme (compressible flow around a wing). From the architecture point of view, we perform new tests using multiple accelerators on different hosts.
Mariano Vázquez, Guillaume Houzeaux, Félix Rubio, Christian Simarro

