About this book

This book constitutes the thoroughly refereed post-conference proceedings of the 23rd International Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP 2020, held in New Orleans, LA, USA, in May 2020.*

The 6 revised full papers presented were carefully reviewed and selected from 8 submissions. In addition, one invited paper and one keynote paper were included in the workshop. The papers cover topics within the fields of resource management and scheduling. They focus on several interesting problems such as resource contention and workload interference, new scheduling policies, scheduling ultrasound simulation workflows, and walltime prediction.

* The conference was held virtually due to the COVID-19 pandemic.



Towards Interference-Aware Dynamic Scheduling in Virtualized Environments

Our previous work shows that multiple applications contending for shared resources in virtualized environments are susceptible to cross-application interference, which can lead to significant performance degradation and, consequently, an increase in the number of broken SLAs. Nevertheless, the state of the art in resource scheduling for virtualized environments still relies mainly on resource capacity, adopting heuristics such as bin packing and overlooking this source of overhead. In recent years, however, interference-aware scheduling has gained traction, with the investigation of ways to classify applications by their interference levels and the proposal of static cost models and policies for scheduling co-hosted cloud applications. Preliminary results in this area already show considerable improvement in resource usage and in the reduction of broken SLAs, but we strongly believe there are still opportunities for improvement in application classification and proactive dynamic scheduling strategies. This paper presents the state of the art in interference-aware scheduling for virtualized environments and discusses the challenges and advantages of a dynamic scheme.
Vinícius Meyer, Uillian L. Ludwig, Miguel G. Xavier, Dionatrã F. Kirchoff, Cesar A. F. De Rose

Towards Hybrid Isolation for Shared Multicore Systems

Co-locating and running multiple applications on a multicore system is inevitable for data centers to achieve high resource efficiency. However, it causes performance degradation due to contention for shared resources, such as cache and memory bandwidth. Several approaches use software or hardware isolation techniques to mitigate resource contention. Nevertheless, the existing approaches have not fully exploited the differences between isolation techniques with respect to application characteristics to maximize performance. Software techniques bring more flexibility than hardware ones in terms of performance while sacrificing strictness and responsiveness. In contrast, hardware techniques provide stricter and faster isolation than software ones. In this paper, we illustrate the trade-offs between software and hardware isolation techniques and also show the benefit of coordinated enforcement of multiple isolation techniques. We further propose HIS, a hybrid isolation system that dynamically uses either the software or the hardware isolation technique. Our preliminary results show that HIS can improve the performance of foreground applications by 1.7–2.14\(\times \) compared with static isolation for the selected benchmarks.
Yoonsung Nam, Byeonghun Yoo, Yongjun Choi, Yongseok Son, Hyeonsang Eom
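The choice between software and hardware isolation described above can be sketched as a simple per-phase policy. The thresholds, inputs, and the decision rule are illustrative assumptions, not HIS's actual logic:

```python
def choose_isolation(phase_len_ms, contention,
                     strict_threshold=0.5, short_phase_ms=10.0):
    """Pick an isolation mechanism for one application phase.
    contention: observed pressure on the shared resource, in [0, 1].
    Hardware isolation (e.g. cache partitioning) when contention is
    high or the phase is too short for software throttling to react;
    software isolation otherwise, for its flexibility."""
    if contention >= strict_threshold or phase_len_ms < short_phase_ms:
        return "hardware"
    return "software"
```

The two thresholds encode the trade-off the abstract describes: hardware isolation is strict and fast-acting, software isolation is flexible but slow to react.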

Improving Resource Isolation of Critical Tasks in a Workload

Typical cluster schedulers co-locate critical tasks and background batch tasks to improve the utilization of resources in the cluster. However, this leads to resource contention and interference between the diverse co-located tasks. To ensure guaranteed resource allocation and predictability, critical tasks are executed within containers as these provide resource isolation using container resource allocation mechanisms. Linux-based containers achieve resource allocation and isolation using a kernel feature known as Control Groups (cgroups). Cgroups allow the division of CPU time into shares which can be allocated to different groups of tasks. In our study, we run workloads on servers with different hardware configurations and measure the CPU time per second, or the CPU bandwidth, that the critical tasks in the workloads can consume. Our workloads have been generated using a cluster trace published by Google, and contain a mixture of critical and background tasks. The results of the experiments show that under high CPU load conditions, the CPU bandwidth consumed by the critical tasks is inadequate and unstable because of the poor resource isolation offered by cgroups. However, when these tasks are scheduled with the careful use of the SCHED_DEADLINE policy, which is based on the Global Earliest Deadline First and Constant Bandwidth Server algorithms, they steadily consume their required CPU bandwidth irrespective of the load on the CPU. As a result, when critical tasks are scheduled using SCHED_DEADLINE, they experience 3\(\times \)–40\(\times \) smaller delays than under cgroups.
Meghana Thiyyakat, Subramaniam Kalambur, Dinkar Sitaram
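The contrast between proportional cgroups shares and a SCHED_DEADLINE reservation can be illustrated with a back-of-the-envelope model. The numbers and the model are illustrative assumptions, not measurements from the paper:

```python
def shares_bandwidth(critical_shares, bg_shares, n_bg):
    """CPU fraction a critical task receives under cgroups cpu.shares:
    purely proportional, so it shrinks as background tasks are added."""
    return critical_shares / (critical_shares + bg_shares * n_bg)

def deadline_bandwidth(runtime_us, period_us):
    """CPU fraction reserved by SCHED_DEADLINE's Constant Bandwidth
    Server: runtime/period, enforced regardless of other load."""
    return runtime_us / period_us

# A critical task asking for 25% of one CPU (25 ms every 100 ms).
reserved = deadline_bandwidth(25_000, 100_000)
under_load = [shares_bandwidth(1024, 256, n) for n in (0, 4, 16, 64)]
```

Under shares, the critical task's fraction collapses from 100% to under 6% as 64 background tasks arrive, while the CBS reservation stays at 25% by construction — the qualitative effect the experiments above quantify.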

Optimizing Biomedical Ultrasound Workflow Scheduling Using Cluster Simulations

Therapeutic ultrasound plays an increasing role in the treatment of oncological diseases, in drug delivery, and in neurostimulation. To maximize the treatment outcome, thorough pre-operative planning using complex numerical models that consider patient anatomy is crucial. From the computational point of view, treatment planning can be seen as the execution of a complex workflow consisting of many different tasks with various computational requirements on a remote cluster or in the cloud. Since these resources are precious, workflow scheduling plays an important part in the whole process.
This paper describes an extended version of the k-Dispatch workflow management system that uses historical performance data collected on similar workflows to choose a suitable amount of computational resources and to estimate the execution time and cost of particular tasks. The paper also introduces the extensions to the Alea cluster simulator needed to estimate the queuing and total execution time of the whole workflow. The conjunction of both systems then allows for fine-grained optimization of the workflow execution parameters with respect to the current cluster utilization. The experimental results show that this approach is able to reduce the computational time by 26%.
Marta Jaros, Dalibor Klusáček, Jiri Jaros
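The role historical performance data plays here can be sketched as picking the cheapest recorded configuration that still meets a deadline. The function, its inputs, and the cost model are hypothetical simplifications, not k-Dispatch's actual interface:

```python
def pick_resources(history, deadline_s, cost_per_node_s):
    """Pick a node count using recorded runtimes of similar workflows.
    history: {node_count: runtime_s} from past executions."""
    feasible = {n: t for n, t in history.items() if t <= deadline_s}
    if not feasible:
        # Nothing meets the deadline: fall back to the fastest known setup.
        return min(history, key=history.get)
    # Cheapest feasible configuration (cost = nodes * runtime * rate).
    return min(feasible, key=lambda n: n * feasible[n] * cost_per_node_s)
```

The cluster simulator's contribution is then to supply the queuing delay that this static picture omits, so the deadline check can account for current cluster utilization.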

Evaluating Controlled Memory Request Injection to Counter PREM Memory Underutilization

Modern heterogeneous systems-on-chip (HeSoCs) feature high-performance multi-core CPUs tightly integrated with data-parallel accelerators. Such HeSoCs heavily rely on shared resources, which hinders their adoption in the context of real-time systems. The predictable execution model (PREM) has proven effective at preventing uncontrolled execution time lengthening due to memory interference in HeSoCs sharing main memory (DRAM). However, PREM allows only one task at a time to access memory, which inherently under-utilizes the available memory bandwidth in modern HeSoCs. In this paper, we conduct a thorough experimental study aimed at assessing the potential benefits of extending PREM so as to inject controlled amounts of memory requests coming from tasks other than the one currently granted exclusive DRAM access. Focusing on a state-of-the-art HeSoC, the NVIDIA TX2, we extensively characterize the relation between the injected bandwidth and the latency experienced by the task under test. The results confirm that for various types of workload it is possible to exploit the available bandwidth much more efficiently than with standard PREM arbitration, often close to its maximum, while keeping latency inflation below 10%. We discuss possible practical implementation directions, highlighting the expected benefits and technical challenges.
Roberto Cavicchioli, Nicola Capodieci, Marco Solieri, Marko Bertogna, Paolo Valente, Andrea Marongiu
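The idea of injecting a controlled amount of memory requests can be modeled as a per-window request budget. This throttle is an illustrative model only, not the arbitration mechanism evaluated in the paper:

```python
from collections import Counter

def throttle(arrivals, budget, window):
    """Delay request arrival times so that at most `budget` requests
    are released per `window` seconds; deferred requests are released
    at the start of the first window with spare budget."""
    counts = Counter()          # window index -> requests released
    released = []
    for t in sorted(arrivals):
        w = int(t // window)
        while counts[w] >= budget:
            w += 1              # window full: push to the next one
        counts[w] += 1
        released.append(max(t, w * window))
    return released
```

Raising the budget trades more injected bandwidth for higher latency of the task holding exclusive DRAM access, which is exactly the relation the study characterizes.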

Accelerating 3-Way Epistasis Detection with CPU+GPU Processing

A Single Nucleotide Polymorphism (SNP) is a DNA variation occurring when a single nucleotide differs between individuals of a species. Some conditions can be explained with a single SNP. However, the combined effect of multiple SNPs, known as epistasis, allows genotype to be better correlated with a number of complex traits. We propose a highly optimized CPU+GPU approach for epistasis detection. The GPU portion of the approach relies only on CUDA cores to score sets of SNPs, based on the co-occurrence of genetic variants and a specific outcome (case or control), making it suitable for a large number of computing devices. Considering datasets with different shapes (more SNPs than patients, or vice versa) and sizes, and combining analytical modeling with an experimental evaluation on five CPU+GPU configurations covering GPU architectures from the last five years, we show that the performance achieved by our proposal is close to what is theoretically possible on the targeted GPUs. Compared, in 3-way epistasis detection, with MPI3SNP, a state-of-the-art GPU-based approach that also does not rely on specialized hardware cores, the proposal is on average \(3.83\times \), \(2.72\times \), \(2.44\times \) and \(2.71\times \) faster on systems with a Titan X (Maxwell 2.0), a Titan XP (Pascal), a Titan V (Volta) and a GeForce 2070 SUPER (Turing) GPU, respectively.
Ricardo Nobre, Sergio Santander-Jiménez, Leonel Sousa, Aleksandar Ilic
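The scoring step can be illustrated by the data structure it operates on: for each triple of SNPs, occurrences of the 27 genotype combinations are counted separately for cases and controls. The exhaustive search and pluggable score below are a simplified sketch, not the paper's optimized kernels:

```python
from itertools import combinations

def contingency(genotypes, phenotypes, triple):
    """27-cell case/control table for one SNP triple.
    genotypes[s][p] in {0, 1, 2}; phenotypes[p] is 0 (control) or 1 (case)."""
    i, j, k = triple
    table = [[0, 0] for _ in range(27)]
    for p, pheno in enumerate(phenotypes):
        cell = genotypes[i][p] * 9 + genotypes[j][p] * 3 + genotypes[k][p]
        table[cell][pheno] += 1
    return table

def exhaustive_3way(genotypes, phenotypes, score):
    """Score every SNP triple; lower score = stronger association."""
    best = None
    for triple in combinations(range(len(genotypes)), 3):
        s = score(contingency(genotypes, phenotypes, triple))
        if best is None or s < best[0]:
            best = (s, triple)
    return best
```

The cubic number of triples times the cost of building each table is what makes GPU acceleration of the counting step worthwhile.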

Walltime Prediction and Its Impact on Job Scheduling Performance and Predictability

For more than two decades, researchers have been analyzing the impact of inaccurate job walltime (runtime) estimates on the performance of job scheduling algorithms, especially backfilling. In this paper, we extend these existing works by focusing on the overall impact that improved walltime estimates have on both job scheduling performance and predictability. For this purpose, we evaluate this impact in several steps. First, we present a simple walltime predictor and analyze its accuracy with respect to the original user walltime estimates captured in real-life workload traces. Next, we use these traces and a simulator to assess the impact of improved estimates on general performance (backfilling ratio and wait time) as well as on predictability. We show that even a simple predictor can significantly decrease user-based errors in runtime estimates, while also slightly improving job wait times and the backfilling ratio. Concerning predictions, we show that the walltime predictor significantly decreases errors in job wait time forecasting while having little effect on the ability of the scheduler to provide solid advance predictions about which nodes will be used by a given waiting job.
Dalibor Klusáček, Mehmet Soysal
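A predictor of the kind evaluated here can be as simple as capping the user's estimate with the maximum of that user's two most recent runtimes — a classic baseline; the paper's exact predictor may differ:

```python
from collections import defaultdict

def make_predictor():
    """Per-user walltime predictor: use the maximum of the user's two
    most recent runtimes, never exceeding the user's own estimate;
    fall back to the estimate until two runs have been observed."""
    history = defaultdict(list)   # user -> past runtimes (seconds)

    def predict(user, user_estimate):
        runs = history[user]
        if len(runs) < 2:
            return user_estimate
        return min(user_estimate, max(runs[-2:]))

    def record(user, runtime):
        history[user].append(runtime)

    return predict, record
```

Capping at the user estimate keeps the prediction safe for backfilling: a job is never predicted to run longer than the limit after which the system would kill it anyway.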

PDAWL: Profile-Based Iterative Dynamic Adaptive WorkLoad Balance on Heterogeneous Architectures

While High Performance Computing systems are increasingly based on heterogeneous cores, their effectiveness depends on how well the scheduler can allocate workloads onto appropriate computing devices and how well communication and computation can be overlapped. With different types of resources integrated into one system, the complexity of the scheduler increases correspondingly. Moreover, for applications with varying problem sizes on different heterogeneous resources, the optimal scheduling approach may vary accordingly. We thus present PDAWL, an event-driven, profile-based Iterative Dynamic Adaptive Work-Load balance scheduling approach that dynamically and adaptively adjusts workload to efficiently utilize heterogeneous resources. It combines online scheduling (DAWL), which can adaptively adjust the workload based on the available real-time heterogeneous resources, with offline machine learning (a profile-based estimation model), which builds a device-specific communication and computation estimation model. Our scheduling approach is tested on control-regular applications, a Stencil kernel (based on a Jacobi algorithm) and Sparse Matrix-Vector Multiplication (SpMV), in an event-driven runtime system. Experimental results show that PDAWL either is on par with or far outperforms the better of CPU-only and GPU-only execution.
Tongsheng Geng, Marcos Amaris, Stéphane Zuckerman, Alfredo Goldman, Guang R. Gao, Jean-Luc Gaudiot
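The iterative workload adjustment at the heart of such a scheduler can be sketched as moving the CPU/GPU split toward equal finish times after each measured chunk. The update rule is an illustrative simplification, not PDAWL's actual algorithm:

```python
def rebalance(frac_gpu, t_cpu, t_gpu, step=0.5):
    """One adjustment step: move the GPU share of the workload toward
    the split that equalizes CPU and GPU finish times.  frac_gpu must
    be strictly between 0 and 1; t_cpu and t_gpu are the measured
    times of the last chunk's CPU and GPU portions."""
    rate_cpu = (1 - frac_gpu) / t_cpu   # work units per second on CPU
    rate_gpu = frac_gpu / t_gpu         # work units per second on GPU
    ideal = rate_gpu / (rate_cpu + rate_gpu)
    return frac_gpu + step * (ideal - frac_gpu)
```

Repeated measure-and-rebalance steps converge to the device-speed ratio; the offline profile model's role is to start this loop near the right split instead of at a blind 50/50 guess.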

