
About this book

This book constitutes the refereed proceedings of the 19th Symposium on High Performance Computing Systems, WSCAD 2018, held in São Paulo, Brazil, in October 2018.
The 12 revised full papers presented were carefully reviewed and selected from 61 submissions. The papers in this book are organized according to the following topics: cloud computing; performance; processors and memory architectures; power and energy.



Cloud Computing


An Interference-Aware Strategy for Co-locating High Performance Computing Applications in Clouds

Cross-interference may occur when applications share a physical machine, negatively affecting their performance. This problem frequently arises when high performance applications are executed in clouds. Some works in the literature have considered it when proposing strategies for Virtual Machine Placement. However, they have neither employed a suitable method for predicting interference nor considered minimizing the number of used physical machines and the interference at the same time. In this paper, we present a solution based on the Iterated Local Search framework to solve the Interference-aware Virtual Machine Placement Problem for HPC applications in Clouds (IVMP). This problem aims to minimize, at the same time, the interference suffered by HPC applications that share physical machines and the number of physical machines used to allocate them. Experiments were conducted in a real scenario, using applications from the oil and gas industry and from the HPCC benchmark. They showed that our method reduced interference by more than 40%, while using the same number of physical machines as the most widely employed heuristics for the problem.
Maicon Melo Alves, Luan Teylo, Yuri Frota, Lúcia Maria de A. Drummond
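As background, the Iterated Local Search framework the paper builds on alternates a local search with a perturbation step and an acceptance test. Below is a minimal generic sketch; the objective, neighborhood, and perturbation are toy placeholders, not the authors' IVMP formulation:

```python
import random

def iterated_local_search(f, x0, neighbors, perturb, iters=100, seed=0):
    """Generic ILS skeleton: local search + perturbation + acceptance test."""
    rng = random.Random(seed)

    def local_search(x):
        improved = True
        while improved:
            improved = False
            for y in neighbors(x):
                if f(y) < f(x):          # first-improvement move
                    x, improved = y, True
                    break
        return x

    best = local_search(x0)
    current = best
    for _ in range(iters):
        candidate = local_search(perturb(current, rng))
        if f(candidate) < f(current):    # accept only improvements
            current = candidate
        if f(current) < f(best):
            best = current
    return best

# Toy stand-in for the bi-objective IVMP cost (interference + machine count).
f = lambda x: (x - 42) ** 2
neighbors = lambda x: [x - 1, x + 1]
perturb = lambda x, rng: x + rng.randint(-10, 10)
print(iterated_local_search(f, 0, neighbors, perturb))  # → 42
```

In the actual IVMP, a solution would be an assignment of VMs to physical machines and the neighborhood would move or swap VMs; the skeleton above is unchanged by that substitution.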

Automatic Minimization of Execution Budgets of SPITS Programs in AWS

Cloud computing platforms offer a wide variety of computational resources with different performance specifications at different prices. In this work, we investigate how Spot instances and Availability Zones on Amazon Web Services (AWS) can be used to reduce the processing budget. We also propose instance selection algorithms for AWS that minimize the execution budget of programs implemented with the Scalable Partially Idempotent Task System (SPITS) programming model. Our results show that the proposed method can identify and dynamically adjust the virtual machine types that offer the best price per performance ratio. We conclude that our algorithms can minimize the budget given a long enough execution time, except in short executions, where the startup overhead dominates the budget difference.
Nicholas T. Okita, Tiago A. Coimbra, Charles B. Rodamilans, Martin Tygel, Edson Borin

Analysis of Virtualized Congestion Control in Applications Based on Hadoop MapReduce

Among the existing applications for processing massive volumes of data, Hadoop MapReduce (HMR) is widely used in clouds, generating internal network flows of different volumes and periodicities. Providers face the challenge of managing data centers with a wide range of operating systems and features. The diversity of TCP algorithms and parameters constitutes a heterogeneous communication scenario prone to degrading communication-intensive applications. Because they have total control over the data center, providers can apply Virtualized Congestion Control (VCC) to deploy optimized algorithms; from the tenant's perspective, the virtualization is performed transparently. Several technologies have made such virtualization possible: Explicit Congestion Notification (ECN), for example, identifies congestion by monitoring queue occupancy. Although promising, the specialized literature lacks a deep analysis of the impact of VCC on applications. Our work characterizes the impact of VCC on HMR in scenarios where applications compete for network resources using optimized and non-optimized TCP stacks. We identified that HMR performance is substantially influenced by the data volume, depending on the employed TCP stack. Moreover, we highlight some limitations of VCC.
Vilson Moro, Maurício Aronne Pillon, Charles Christian Miers, Guilherme Piêgas Koslovski

Performance

Improving Oil and Gas Simulation Performance Using Thread and Data Mapping

Oil and gas have been among the most important commodities for over a century. To improve their extraction, companies invest in new technology, which reduces extraction costs and allows new areas to be explored. Computer science has also been employed to support advances in oil and gas extraction technologies: techniques such as computer simulation can be used to evaluate scenarios faster and at lower cost. Several mathematical models that simulate oil and gas extraction are based on wave propagation. To achieve high performance, the simulation software must be written considering the characteristics of the underlying hardware. In this context, our work shows how thread and data mapping policies can improve the performance of a wave propagation model provided by Petrobras, a multinational corporation in the petroleum industry. In our experiments, smart mapping policies reduced the execution time by up to 48.6% on an Intel multi-core Xeon.
Matheus S. Serpa, Eduardo H. M. Cruz, Jairo Panetta, Antônio Azambuja, Alexandre S. Carissimi, Philippe O. A. Navaux

SMCis: Scientific Applications Monitoring and Prediction for HPC Environments

Understanding the computational requirements of scientific applications and their relation to power consumption is fundamental to overcoming the current barriers to exascale computing. However, this imposes challenging tasks: monitoring a wide range of parameters in heterogeneous environments, enabling fine-grained profiling of performance and power consumption across different components, remaining language independent, and avoiding code instrumentation. Considering these challenges, this work proposes SMCis, an application monitoring tool developed to collect all these aspects effectively and accurately and to correlate the data graphically within an analysis and visualization environment. In addition, SMCis integrates and facilitates the use of machine learning tools for developing predictive models of runtime and power consumption.
Gabrieli Silva, Vinícius Klôh, André Yokoyama, Matheus Gritz, Bruno Schulze, Mariza Ferro

Video7 Extended Architecture: Project Design and Statistical Analysis

The increasing amount of digital audio and video brings the need for appropriate tools to store and manage such data. Non-relational (NoSQL) databases are one storage option, and the diversity of existing systems motivates proposing an architecture for managing that content across different types of databases. This work extends the Video7 architecture for storing and retrieving streaming audio and video files in non-relational key-value, tabular, and document databases. Based on the architecture and the suggested project design, a tool was implemented using the Apache HBase, Apache Cassandra, Project Voldemort, Redis, and MongoDB databases and subjected to stress routines. The purpose of the stress routines is to measure insertion and query times, as well as the transfer rates in response to requests to a media server. The Kruskal-Wallis test was used to validate the measurements. The Redis database presented the best performance in the submitted routines, while Project Voldemort and Apache Cassandra performed worse than the other databases.
Vanderson S. de O. L. Sampaio, Douglas D. J. de Macedo, André Britto
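The validation step above relies on the Kruskal-Wallis test, a rank-based comparison of several independent samples. A small pure-Python sketch of the underlying H statistic (average ranks for ties, no tie correction; the paper's actual measurements and tooling are not shown here):

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic over k independent samples.

    H = 12 / (N (N + 1)) * sum_i(R_i^2 / n_i) - 3 (N + 1),
    where R_i is the rank sum of group i over the pooled data.
    """
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Assign each distinct value the average of the ranks it spans (ties).
    rank = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j
    return 12 / (n * (n + 1)) * sum(
        sum(rank[v] for v in g) ** 2 / len(g) for g in groups
    ) - 3 * (n + 1)

# Hypothetical latency samples (ms) from two storage backends:
print(round(kruskal_wallis_h([1, 2, 3], [4, 5, 6]), 3))  # → 3.857
```

The resulting H is then compared against a chi-squared critical value with k-1 degrees of freedom; in practice one would use a library routine such as `scipy.stats.kruskal`, which also applies the tie correction omitted here.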

Parallel Stream Processing with MPI for Video Analytics and Data Visualization

The amount of data generated is increasing exponentially, but processing data and producing fast results is a technological challenge. Parallel stream processing can be employed to handle high-frequency and big data flows. The MPI parallel programming model offers low-level and flexible mechanisms for dealing with distributed architectures such as clusters. This paper uses MPI to accelerate video analytics and data visualization applications so that insight can be obtained as soon as the data arrives. Experiments were conducted with a Domain-Specific Language for Geospatial Data Visualization and a Person Recognizer video application. We applied the same stream parallelism strategy and two task distribution strategies. The dynamic task distribution achieved better performance than the static distribution in the HPC cluster. The data visualization achieved lower throughput than the video analytics due to I/O-intensive operations. Overall, the MPI programming model shows promising performance for stream processing applications.
Adriano Vogel, Cassiano Rista, Gabriel Justo, Endrius Ewald, Dalvan Griebler, Gabriele Mencagli, Luiz Gustavo Fernandes
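The paper's MPI applications are not reproduced here, but the gap between static and dynamic task distribution can be illustrated with a tiny scheduling simulation (hypothetical task costs, makespan in abstract time units; this is a model of the scheduling idea, not the authors' code):

```python
import heapq
from collections import deque

def static_makespan(costs, workers):
    """Round-robin assignment fixed up front, ignoring task cost."""
    loads = [0.0] * workers
    for i, c in enumerate(costs):
        loads[i % workers] += c
    return max(loads)

def dynamic_makespan(costs, workers):
    """Each worker pulls the next task as soon as it becomes idle."""
    queue = deque(costs)
    finish = [0.0] * workers          # next-free time per worker (min-heap)
    heapq.heapify(finish)
    while queue:
        t = heapq.heappop(finish)     # earliest-idle worker
        heapq.heappush(finish, t + queue.popleft())
    return max(finish)

costs = [5, 1, 5, 1, 5, 1]            # skewed task durations
print(static_makespan(costs, 2), dynamic_makespan(costs, 2))  # → 15 11
```

With uniform task costs the two strategies tie; the dynamic scheme wins precisely when costs are skewed or variable, which matches the abstract's finding that dynamic distribution performed better on the HPC cluster.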

Tangible Assets to Improve Research Quality: A Meta Analysis Case Study

This paper presents a meta-analysis of the publications from all 18 previous editions of WSCAD in order to understand how performance results are validated and reported. The meta-analysis extracts terms (keywords) from these papers belonging to three categories: statistics, metrics, and tests. Of the 426 papers analyzed, 93% mention at least one of the terms considered, indicating a concern that results must be reported for a paper to be considered relevant to this conference. Nevertheless, the analysis shows that only 3% of the papers apply reliable statistical tests to validate their results. The paper describes the meta-analysis and proposes a direction for adopting a guideline to improve the reporting of results in this conference and others on related subjects.
Alessander Osorio, Marina Dias, Gerson Geraldo H. Cavalheiro

Processors and Memory Architectures


High-Performance RISC-V Emulation

RISC-V is an open ISA that has been attracting attention worldwide due to its fast growth and adoption. It is already supported by GCC, Clang, and the Linux kernel. However, none of the currently available RISC-V emulators provide good, near-native emulation performance. In this work, we investigate whether faster emulators for RISC-V can be created. Since Dynamic Binary Translation (DBT) is the most common and fastest technique for implementing emulators, we focus our investigation on the quality of the translated code, arguably the most important source of overhead when emulating code with DBT. To this end, we implemented and evaluated an LLVM-based Static Binary Translation (SBT) engine to investigate whether it is possible to produce high-quality translations from RISC-V to x86 and ARM. We explored different translation techniques and designed an SBT engine that produces translated code only 1.2x/1.3x slower than native x86/ARM code, which supports the claim that it is possible to build near-native RISC-V emulators for x86 and ARM hosts. We also analyze the main sources of overhead, compare the code produced by our SBT against that of a popular DBT, and provide insights into the potential performance impact of the proposed techniques on DBTs.
Leandro Lupori, Vanderson Martins do Rosario, Edson Borin

Evaluation and Mitigation of Timing Side-Channel Leakages on Multiple-Target Dynamic Binary Translators

Timing side-channel attacks are an important issue for cryptographic algorithms. If the execution time of an implementation depends on secret information, an adversary may recover the latter by measuring the former. Different approaches have emerged to exploit information leakage in cryptographic implementations and to protect them against these attacks, and recent works extend these concerns to dynamic execution systems [3, 15, 24]. However, little has been said about cross-ISA emulation and its impact on timing leakages. In this paper, we investigate the impact of dynamic binary translators on the constant-time property of known cryptographic implementations, using different Region Formation Techniques (RFTs). We show that emulation may have a significant impact by inserting non-constant-time constructions during translation, leading to significant timing leakages in the QEMU and HQEMU emulators. These leakages are verified using a statistical approach. To guarantee the constant-time property, we implemented a solution in the QEMU dynamic binary translator that mitigates the inserted timing side-channels.
Otávio Oliveira Napoli, Vanderson Martins do Rosario, Diego Freitas Aranha, Edson Borin
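The constant-time property at stake can be illustrated with a comparison routine. The sketch below (function names are illustrative, not from the paper) contrasts an early-exit comparison, whose timing leaks the position of the first mismatching byte, with an XOR-accumulating one whose runtime depends only on the input length; a binary translator that recompiles the latter into branchy code would reintroduce exactly the kind of leak the paper measures:

```python
def leaky_equals(a: bytes, b: bytes) -> bool:
    """Early-exit comparison: runtime depends on where the first
    mismatch occurs, leaking secret-dependent information."""
    if len(a) != len(b):
        return False
    for x, y in zip(a, b):
        if x != y:
            return False   # exits early -> timing side-channel
    return True

def constant_time_equals(a: bytes, b: bytes) -> bool:
    """Accumulate all byte differences with XOR/OR; the loop always
    runs to completion, so runtime depends only on the length."""
    if len(a) != len(b):
        return False
    diff = 0
    for x, y in zip(a, b):
        diff |= x ^ y      # no secret-dependent branch in the loop
    return diff == 0

secret_mac = b"s3cret-mac"                        # hypothetical MAC value
print(constant_time_equals(secret_mac, b"s3cret-mac"))  # → True
print(constant_time_equals(secret_mac, b"guess!-mac"))  # → False
```

In production Python one would call `hmac.compare_digest` instead; the point here is that the constant-time guarantee lives in the compiled control flow, which is precisely what a DBT may alter.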

A GPU-Based Parallel Reduction Implementation

Reduction operations aggregate a finite set of numeric elements into a single value. They are extensively employed in many computational tasks and can be performed in parallel when multiple processing units are available. This work presents a GPU-based approach to parallel reduction that employs techniques such as loop unrolling, persistent threads, and algebraic expressions. It avoids thread divergence and surpasses the methods currently in use. Experiments conducted to evaluate the approach show that the strategy performs efficiently on both AMD and NVIDIA hardware, under both OpenCL and CUDA, making it portable.
Walid Abdala Rfaei Jradi, Hugo Alexandre Dantas do Nascimento, Wellington Santos Martins
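The paper's kernels are written for GPUs in OpenCL and CUDA, but the pairwise combination pattern they implement can be sketched sequentially. In the sketch below, each pass of the inner loop corresponds to one parallel step in which every active GPU thread combines a pair of elements; loop unrolling and persistent threads, being GPU-side optimizations, are not represented:

```python
def tree_reduce(values, op):
    """Pairwise (tree) reduction in log2(n) sequential passes; on a GPU
    all combinations within one pass execute in parallel."""
    data = list(values)
    stride = 1
    while stride < len(data):
        # "Thread" at index i combines data[i] with data[i + stride].
        for i in range(0, len(data) - stride, 2 * stride):
            data[i] = op(data[i], data[i + stride])
        stride *= 2
    return data[0]

print(tree_reduce(range(1, 9), lambda a, b: a + b))  # → 36
```

The tree shape is what makes the operation parallelizable: it requires only that `op` be associative, which also covers reductions like `max` or product.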

Power and Energy


Evaluating Cache Line Behavior Predictors for Energy Efficient Processors

Cache memories, which nowadays occupy almost half of the area of modern processors and are essential to system performance, are responsible for a growing static energy consumption. Many techniques have been proposed to save some of this energy and optimize cache performance; cache line reuse predictors and dead line predictors are some examples. These mechanisms predict when a cache line becomes dead, in order to turn it off, and may also apply other policies to such lines, such as replacement prioritization or bypassing their installation in the cache. However, not all mechanisms implement all of these policies, which affect cache behavior in different ways. This paper evaluates the impact of the priority and bypass policies on two dead line predictors: the Dead Block and Early Write Back Predictor (DEWP) and the Skewed Dead Block Predictor (SDP). Both mechanisms turn off dead cache lines using the Gated-Vdd technique to save static energy; we analyze how each policy (priority replacement and cache bypass) affects energy savings and system performance.
Rodrigo Machniewicz Sokulski, Emmanuell Diaz Carreno, Marco Antonio Zanata Alves
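Neither DEWP nor SDP is reproduced here, but the general idea behind counting-based dead line prediction can be sketched: learn how many accesses a line typically receives between fill and eviction, then predict it dead once that count is reached again. The state below is purely illustrative; the actual predictors use richer, hardware-oriented structures:

```python
class DeadLinePredictor:
    """Illustrative counting-based dead line predictor (not DEWP/SDP)."""

    def __init__(self):
        self.learned = {}   # line address -> typical access count
        self.live = {}      # line address -> accesses since last fill

    def fill(self, addr):
        self.live[addr] = 0            # line installed in the cache

    def access(self, addr):
        self.live[addr] = self.live.get(addr, 0) + 1

    def evict(self, addr):
        # Record how many uses the line had; train the predictor.
        self.learned[addr] = self.live.pop(addr, 0)

    def predict_dead(self, addr):
        n = self.learned.get(addr)
        return n is not None and self.live.get(addr, 0) >= n

p = DeadLinePredictor()
p.fill(0x40); p.access(0x40); p.access(0x40); p.evict(0x40)  # learn: 2 uses
p.fill(0x40); p.access(0x40)
print(p.predict_dead(0x40))  # 1 of 2 expected uses → False
p.access(0x40)
print(p.predict_dead(0x40))  # reached 2 uses → True
```

A line predicted dead is the one a Gated-Vdd scheme would power off, a replacement policy would prioritize for eviction, or a bypass policy would refuse to install.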

