Optimizing latency and throughput of application workflows on clusters☆
Introduction
Data-analysis steps in a wide range of applications [1], [2], [3] can be expressed as workflows that operate on a stream of input data: tasks in the workflow repeatedly receive input data items from their predecessors, compute on them, and write the output to their successors. Efficient execution of such workflows is gauged by two metrics: latency and throughput. Latency is the time to process an individual data item through the workflow, while throughput is the aggregate rate at which data items are processed. It is often desirable to optimize one metric while meeting a requirement on the other. Applications with real-time constraints, for example, can have strict throughput requirements and desire low latency, whereas interactive query processing may have strict latency constraints and desire a high aggregate query-processing rate. Meeting these requirements while minimizing resource usage is also important, especially in settings such as supercomputer centers, where resources (e.g., a compute cluster) have an associated cost and are contended for by multiple clients.
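To make the two metrics concrete, consider a small illustrative calculation (the stage times are invented, not from the paper) for a linear pipeline in which each stage runs on its own processor and stages overlap on successive data items:

```python
# Hypothetical three-stage linear pipeline; per-item processing
# times in seconds, one stage per processor.
stage_times = [2.0, 5.0, 3.0]

# Latency: time for a single data item to traverse all stages.
latency = sum(stage_times)            # 10.0 s

# Throughput: in steady state a pipelined execution emits one item
# per "period" of its slowest (bottleneck) stage.
throughput = 1.0 / max(stage_times)   # 0.2 items/s
```

The example shows why the two metrics pull in different directions: clustering stages onto fewer processors can shorten communication (helping latency) but lengthens the bottleneck period (hurting throughput).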
This paper presents a novel approach for scheduling streaming workflows on a cluster such that the resulting schedule meets both throughput and latency requirements. We integrate the algorithms presented in our previous work [4], [5] and include more extensive analysis and proofs. Our algorithm optimizes the latency of streaming workflows while meeting throughput requirements. We also describe how this approach can be applied to maximize throughput while meeting latency requirements.
Our algorithm employs pipelined, task and data parallelism in an integrated manner to meet the performance goals described above. Throughput requirements are satisfied by leveraging pipelined parallelism and through intelligent clustering, duplication and/or replication of tasks. Pipelined parallelism is the concurrent execution of dependent tasks that operate on different instances of the input data stream, while data parallelism is the concurrent processing of multiple data items by replicas of a task. Latency is minimized by exploiting task parallelism, which is the concurrent execution of independent tasks or task clusters on the same instance of the data stream, and by minimizing communication costs along the critical path of the task graph through duplication and clustering. Here, duplication refers to the redundant allocation of tasks, working on the same data item, to multiple processors, and clustering refers to the mapping of two or more communicating tasks to the same processor to avoid expensive communication. We employ a flexible k-port communication model, in which each processor can communicate with at most k distinct processors simultaneously. The value of k is determined by the ratio of the network-card capacity of a processor to the link capacity.
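A minimal sketch of how k might be derived under this model (the function name and the capacities are illustrative, not from the paper):

```python
def port_count(nic_capacity_gbps: float, link_capacity_gbps: float) -> int:
    """Number of simultaneous point-to-point transfers a processor can
    sustain under the k-port model: the ratio of its network-card
    capacity to the capacity of a single link."""
    return int(nic_capacity_gbps // link_capacity_gbps)

# e.g. a 10 Gb/s network card over 2.5 Gb/s links gives k = 4,
# so at most 4 concurrent transfers per processor.
k = port_count(10.0, 2.5)
```

With k = 1 this degenerates to the classical one-port model; larger k lets the scheduler overlap more of a task's outgoing communications.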
We compare the proposed approach against two previously proposed schemes: Filter Copy Pipeline (FCP) [6] and EXPERT (EXploiting Pipeline Execution undeR Time constraints) [7]. Experimental evaluations are done using synthetic benchmarks and task graphs derived from real applications in the domains of Image Analysis, Video Processing and Computer Vision [7], [8], [9], [10]. We show that our algorithm is able to generate low latency schedules that meet throughput requirements, even when previously proposed approaches fail.
Section snippets
Related work
Several algorithms for scheduling streaming workflows focus on maximizing the throughput. These algorithms leverage pipelined parallelism between dependent tasks to improve the throughput. Lee et al. [11] propose a three-step mapping methodology for maximizing the throughput of applications comprising a sequence of computation stages, each consisting of a set of identical sequential tasks. Jonsson and Vasell [12] and Hary and Ozguner [13] discuss heuristics for maximizing the throughput of
Task graph and system model
An application workflow can be represented as a connected, weighted directed acyclic graph (DAG), G = (V, E), where V, the set of vertices, represents non-homogeneous sequential application components (tasks) and E, the set of edges, represents precedence constraints and data dependences. There are two distinguished vertices (tasks) in the task graph: the source task which precedes all other tasks and the sink task which succeeds all other tasks.
The task graph G acts on a stream of data, where
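A minimal sketch of such a workflow DAG in Python (the task names, compute costs and communication volumes are invented for illustration; the paper's model is more general). The recursive helper computes the longest compute-plus-communication path from a task to the sink, a simple lower bound on the achievable per-item latency:

```python
import functools
from collections import defaultdict

tasks = {            # task -> per-item compute cost (illustrative)
    "source": 0.0,
    "decode": 4.0,
    "filter": 6.0,
    "merge":  2.0,
    "sink":   0.0,
}
edges = {            # (u, v) -> per-item communication cost (illustrative)
    ("source", "decode"): 1.0,
    ("decode", "filter"): 1.0,
    ("decode", "merge"):  0.5,
    ("filter", "merge"):  0.5,
    ("merge",  "sink"):   1.0,
}

succ = defaultdict(list)             # adjacency list derived from edges
for (u, v) in edges:
    succ[u].append(v)

@functools.lru_cache(maxsize=None)
def path_length(t):
    """Longest compute + communication path from task t to the sink."""
    return tasks[t] + max(
        (edges[(t, s)] + path_length(s) for s in succ[t]), default=0.0
    )
```

Clustering two communicating tasks onto one processor removes the corresponding edge cost from this path, which is the lever the latency-minimization step exploits.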
Optimizing latency under throughput constraint
Given a workflow-DAG G, P homogeneous processors and a throughput constraint T, our throughput constrained latency optimization heuristic (TCLO) generates a mapping and schedule of G on P that minimizes the latency while satisfying the throughput constraint. TCLO is designed for a k-port communication model and takes into account communication contention and its impact, if it is on the critical path, while deriving low latency schedules that meet the throughput requirement. Consider the
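The throughput constraint itself admits a simple necessary test: under pipelined execution, a processor that spends more than 1/T time per data item cannot sustain throughput T. The sketch below is an illustrative feasibility check of that kind (our own simplification, not the TCLO heuristic, and it ignores communication costs and replication):

```python
def meets_throughput(assignment, cost, T):
    """assignment: processor id -> list of tasks clustered onto it;
    cost: task -> per-item compute time (seconds);
    T: required throughput (items/second).
    A pipelined schedule can sustain T only if every processor's
    aggregate per-item work fits within one period, 1/T."""
    period = 1.0 / T
    return all(sum(cost[t] for t in group) <= period
               for group in assignment.values())

cost = {"a": 2.0, "b": 2.0, "c": 2.0}
ok = meets_throughput({0: ["a", "b"], 1: ["c"]}, cost, T=0.2)   # 4.0 <= 5.0
bad = meets_throughput({0: ["a", "b", "c"]}, cost, T=0.2)       # 6.0 > 5.0
```

When a single task's cost alone exceeds the period, no clustering helps, which is where data-parallel replication of that task comes in.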
Performance analysis
This section presents an experimental evaluation of the performance of our scheduling heuristic, TCLO, against previously proposed schemes: Filter Copy Pipeline (FCP) [6] and EXploiting Pipeline Execution undeR Time constraints (EXPERT, abbreviated as EXP in this section) [7], and FCP-e and EXP-e, modified versions of the above schemes. When FCP and EXP fail to utilize all processors and do not meet the throughput requirement T, FCP-e recursively calls FCP on the remaining processors until T is
Optimizing throughput under latency constraint
The solution described in Section 4 for generating schedules that optimize latency while meeting throughput requirements can be used to solve the inverse problem of optimizing throughput while meeting latency constraints. In this section, we describe an approach that applies binary search combined with a bounded look-ahead, using the algorithm proposed in Section 4, to generate schedules that optimize throughput under latency constraints. Proposition 4 For a given workflow and a fixed number of
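The binary-search idea can be sketched as follows, treating the latency-optimizing scheduler as a black-box feasibility oracle (the function names and the monotonicity assumption are ours, for illustration):

```python
def max_feasible_throughput(feasible, lo, hi, eps=1e-3):
    """Binary search for the largest throughput T in [lo, hi] for which
    feasible(T) holds, e.g. 'the latency-optimizing scheduler, run with
    throughput constraint T, returns a schedule within the latency
    bound'. Assumes feasibility is monotone: if T is feasible, so is
    any smaller throughput."""
    if not feasible(lo):
        return None                  # even the loosest constraint fails
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        if feasible(mid):
            lo = mid                 # mid works; try a higher throughput
        else:
            hi = mid                 # mid fails; back off
    return lo

# Toy oracle: schedules meeting the latency bound exist up to T = 0.35,
# so the search converges to roughly 0.35.
best = max_feasible_throughput(lambda T: T <= 0.35, 0.0, 1.0)
```

Each probe costs one scheduler invocation, so the number of invocations grows only logarithmically with the desired throughput resolution.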
Conclusions and future work
This work presents heuristics for scheduling streaming application workflows with stringent performance requirements. Through coordinated leveraging of pipelined, task and data parallelism and use of techniques like task duplication, the proposed algorithm, TCLO, minimizes the latency of streaming workflows, while satisfying strict throughput requirements. We also describe a binary search based algorithm using TCLO, that optimizes throughput of streaming workflows while meeting latency
Acknowledgments
The authors thank Yves Robert and Anne Benoit for their valuable discussions and constructive reviews of this work. We would also like to thank the anonymous referees for their valuable comments, which helped us improve the presentation of this paper.
References (36)
- et al., A pipeline-based approach for scheduling video processing algorithms on NOW, IEEE Transactions on Parallel and Distributed Systems (2003)
- et al., Mapping pipeline skeletons onto heterogeneous platforms, Journal of Parallel and Distributed Computing (2008)
- A. Choudhary, W. Liao, D. Weiner, P. Varshney, R. Linderman, M. Linderman, Design, implementation and evaluation of...
- V.S. Kumar, B. Rutt, T. Kurc, U. Catalyurek, J. Saltz, S. Chow, S. Lamont, M. Martone, Imaging and visual...
- M. Yang, T. Gandhi, R. Kasturi, L. Coraor, O. Camps, J. McCandless, Real-time obstacle detection system for high speed...
- N. Vydyanathan, Ü.V. Çatalyürek, T.M. Kurç, P. Sadayappan, J.H. Saltz, Toward optimizing latency under throughput...
- N. Vydyanathan, Ü.V. Çatalyürek, T. Kurç, P. Sadayappan, J. Saltz, A duplication based algorithm for scheduling of...
- et al., Executing multiple pipelined data analysis operations in the grid
- F. Guirado, A. Ripoll, C. Roig, E. Luque, Optimizing latency under throughput requirements for streaming applications on...
- B. Agarwalla, N. Ahmed, D. Hilley, U. Ramachandran, Streamline: A scheduling heuristic for streaming application on the...
- Scheduling pipelined communication in distributed memory multiprocessors for real-time applications, ACM SIGARCH Computer Architecture News
- Precedence-constrained task allocation onto point-to-point networks for pipelined execution, IEEE Transactions on Parallel and Distributed Systems
- Streamline: scheduling streaming applications in a wide area environment, Multimedia Systems
- Resource allocation in a middleware for streaming data
☆ This research was supported in part by the National Science Foundation under Grants #CCF-0342615, #CNS-0403342 and #CNS-0643969.