Elsevier

Parallel Computing

Volume 37, Issues 10–11, October–November 2011, Pages 694-712

Optimizing latency and throughput of application workflows on clusters

https://doi.org/10.1016/j.parco.2010.05.003

Abstract

Scheduling, in many application domains, involves the optimization of multiple performance metrics. For example, application workflows with real-time constraints have strict throughput requirements and also desire low latency, or response time. In this paper, we present a novel algorithm for scheduling workflows that act on a stream of input data. Our algorithm focuses on two performance metrics, latency and throughput, and minimizes the latency of workflows while satisfying strict throughput requirements. We also describe how the same approach can be used to solve the problem of meeting latency requirements while maximizing throughput. We leverage pipelined, task, and data parallelism in a coordinated manner to meet these objectives, and we investigate the benefit of task duplication in alleviating communication overheads in the pipelined schedule for different workflow characteristics. The proposed algorithm is designed for a realistic bounded multi-port communication model, in which each processor can simultaneously communicate with at most k distinct processors. Experimental evaluation using synthetic benchmarks, as well as benchmarks derived from real applications, shows that our algorithm consistently produces lower-latency schedules that meet throughput requirements, even when previously proposed schemes fail.

Introduction

Data-analysis steps in a wide range of applications [1], [2], [3] can be expressed as workflows that operate on a stream of input data: tasks in the workflow repeatedly receive input data items from their predecessors, compute on them, and write the output to their successors. Efficient execution of such workflows is gauged by two metrics: latency and throughput. Latency is the time to process an individual data item through the workflow, while throughput measures the aggregate rate at which data items are processed. It is often desirable to optimize one metric while meeting requirements on the other. Applications with real-time constraints, for example, can have strict throughput requirements and desire low latency, whereas interactive query processing may have strict latency constraints and desire a high aggregate query-processing rate. Meeting performance requirements while minimizing resource usage is also important, especially in settings such as supercomputer centers, where resources (e.g., a compute cluster) have an associated cost and are contended for by multiple clients.

This paper presents a novel approach for the scheduling of streaming workflows on a cluster machine such that the resulting schedule meets throughput and latency requirements. We integrate the algorithms presented in our previous work [4], [5] and include more extensive analysis and proofs. Our algorithm optimizes latency of streaming workflows while meeting throughput requirements. We also describe how the above approach can be applied to maximize throughput while meeting latency requirements.

Our algorithm employs pipelined, task, and data parallelism in an integrated manner to meet the performance goals described above. Throughput requirements are satisfied by leveraging pipelined parallelism and through intelligent clustering, duplication, and/or replication of tasks. Pipelined parallelism is the concurrent execution of dependent tasks that operate on different instances of the input data stream, while data parallelism is the concurrent processing of multiple data items by replicas of a task. Latency is minimized by exploiting task parallelism, which is the concurrent execution of independent tasks or task clusters on the same instance of the data stream, and by minimizing communication costs along the critical path of the task graph through duplication and clustering. Here, duplication refers to the redundant allocation of tasks, working on the same data item, to multiple processors, and clustering refers to the mapping of two or more communicating tasks to the same processor to avoid expensive communication. We employ a flexible k-port communication model, in which each processor can communicate with at most k distinct processors simultaneously. The value of k is determined by the ratio of the network-card capacity of a processor to the link capacity.
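The determination of k from hardware capacities, as described above, can be sketched in a few lines. The function name and the capacity figures below are hypothetical, chosen only to illustrate the ratio.

```python
# Illustrative sketch of the bounded multi-port (k-port) model: each
# processor may communicate with at most k distinct peers simultaneously,
# where k is the ratio of network-card capacity to per-link capacity.

def port_count(nic_capacity_gbps: float, link_capacity_gbps: float) -> int:
    """Maximum number of simultaneous peers: floor of the capacity ratio."""
    return max(1, int(nic_capacity_gbps // link_capacity_gbps))

# A hypothetical 10 Gb/s network card over 2.5 Gb/s links gives k = 4.
print(port_count(10.0, 2.5))  # -> 4
```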

We compare the proposed approach against two previously proposed schemes: Filter Copy Pipeline (FCP) [6] and EXPERT (EXploiting Pipeline Execution undeR Time constraints) [7]. Experimental evaluations are done using synthetic benchmarks and task graphs derived from real applications in the domains of Image Analysis, Video Processing and Computer Vision [7], [8], [9], [10]. We show that our algorithm is able to generate low latency schedules that meet throughput requirements, even when previously proposed approaches fail.

Section snippets

Related work

Several algorithms for scheduling streaming workflows focus on maximizing the throughput. These algorithms leverage pipelined parallelism between dependent tasks to improve the throughput. Lee et al. [11] propose a three-step mapping methodology for maximizing the throughput of applications comprising a sequence of computation stages, each consisting of a set of identical sequential tasks. Jonsson and Vasell [12] and Hary and Ozguner [13] discuss heuristics for maximizing the throughput of

Task graph and system model

An application workflow can be represented as a connected, weighted directed acyclic graph (DAG), G = (V, E), where V, the set of vertices, represents non-homogeneous sequential application components (tasks) and E, the set of edges, represents precedence constraints and data dependences. There are two distinguished vertices (tasks) in the task graph: the source task which precedes all other tasks and the sink task which succeeds all other tasks.

The task graph G acts on a stream of data, where
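The model above can be made concrete with a small sketch: a weighted DAG with distinguished source and sink vertices, where the latency of an unreplicated, fully distributed mapping is the longest source-to-sink path summing task compute times and edge communication times. All task names and weights here are hypothetical.

```python
# Minimal sketch of the workflow model: a weighted DAG with one source
# and one sink; latency is the critical (longest) path from source to
# sink over task compute times plus edge communication times.

from collections import defaultdict

compute = {"src": 0, "A": 4, "B": 6, "C": 3, "sink": 0}   # per-item task times
edges = {("src", "A"): 1, ("src", "B"): 1,                 # edge comm. times
         ("A", "C"): 2, ("B", "C"): 1, ("C", "sink"): 1}

succ = defaultdict(list)
for (u, v) in edges:
    succ[u].append(v)

def critical_path(u: str) -> float:
    """Longest path from task u to the sink (compute + communication)."""
    if not succ[u]:
        return compute[u]
    return compute[u] + max(edges[(u, v)] + critical_path(v) for v in succ[u])

print(critical_path("src"))  # src -> B -> C -> sink: 0+1+6+1+3+1+0 = 12
```

Clustering two communicating tasks onto one processor corresponds to zeroing the weight of the edge between them, which is how duplication and clustering shorten the critical path.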

Optimizing latency under throughput constraint

Given a workflow-DAG G, P homogeneous processors, and a throughput constraint T, our throughput-constrained latency optimization heuristic (TCLO) generates a mapping and schedule of G on P that minimizes latency while satisfying the throughput constraint. TCLO is designed for a k-port communication model and accounts for communication contention, and its impact when it lies on the critical path, while deriving low-latency schedules that meet the throughput requirement. Consider the
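The full TCLO heuristic is beyond a short snippet, but one feasibility test common to pipelined scheduling of this kind can be sketched: a pipelined schedule sustains throughput T only if every processor finishes its work on one data item within the period 1/T, so a task whose compute time exceeds the period must be replicated (data parallelism). The function below is an illustrative sketch of that replication count, not the TCLO algorithm itself.

```python
import math

def copies_needed(task_time: float, throughput: float) -> int:
    """Replicas required so each copy, serving items round-robin,
    keeps its per-item work within the period 1/throughput."""
    return max(1, math.ceil(task_time * throughput))

# A hypothetical task taking 2.5 s per item, under a required throughput
# of 1 item/s, needs 3 replicas to keep up with the stream.
print(copies_needed(2.5, 1.0))  # -> 3
```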

Performance analysis

This section presents an experimental evaluation of the performance of our scheduling heuristic, TCLO, against two previously proposed schemes, Filter Copy Pipeline (FCP) [6] and EXploiting Pipeline Execution undeR Time constraints (EXPERT, abbreviated EXP in this section) [7], as well as FCP-e and EXP-e, modified versions of these schemes. When FCP and EXP fail to utilize all processors and do not meet the throughput requirement T, FCP-e recursively calls FCP on the remaining processors until T is

Optimizing throughput under latency constraint

The solution described in Section 4 for generating schedules that optimize latency while meeting throughput requirements can be used to solve the inverse problem of optimizing throughput while meeting latency constraints. In this section, we describe an approach that applies a binary search combined with a bounded look-ahead, using the algorithm proposed in Section 4, to generate schedules that optimize throughput under latency constraints.
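The binary-search idea can be sketched as follows: probe candidate throughput values with the latency-optimizing scheduler and keep the largest throughput whose resulting latency meets the bound. Here `tclo_latency` is a hypothetical stand-in (latency rising with the throughput target, which holds when tightening T forces more replication and communication); the real TCLO mapping heuristic would take its place.

```python
# Hedged sketch of the inverse problem: maximize throughput T subject to
# latency(T) <= L, assuming latency is non-decreasing in T.

def tclo_latency(throughput: float) -> float:
    # Hypothetical stand-in for the latency of the schedule TCLO
    # produces under throughput constraint `throughput`.
    return 10.0 + 5.0 * throughput

def max_throughput(latency_bound: float, t_lo: float = 0.0,
                   t_hi: float = 8.0, eps: float = 1e-6) -> float:
    """Binary search for the largest feasible throughput in [t_lo, t_hi]."""
    while t_hi - t_lo > eps:
        mid = (t_lo + t_hi) / 2
        if tclo_latency(mid) <= latency_bound:
            t_lo = mid          # feasible: try a higher throughput
        else:
            t_hi = mid          # infeasible: lower the target
    return t_lo

# With the stub, 10 + 5T <= 20 gives T = 2.
print(round(max_throughput(20.0), 3))
```

The bounded look-ahead mentioned above would refine this by probing a few values beyond the first feasible point, since a real scheduler's latency need not be perfectly monotone in T.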

Proposition 4

For a given workflow and a fixed number of

Conclusions and future work

This work presents heuristics for scheduling streaming application workflows with stringent performance requirements. By leveraging pipelined, task, and data parallelism in a coordinated manner and using techniques such as task duplication, the proposed algorithm, TCLO, minimizes the latency of streaming workflows while satisfying strict throughput requirements. We also describe a binary-search-based algorithm using TCLO that optimizes throughput of streaming workflows while meeting latency

Acknowledgments

The authors thank Yves Robert and Anne Benoit for their valuable discussions and constructive reviews of this work. We would also like to thank the anonymous referees for their valuable comments, which helped us improve the presentation of this paper.

References (36)

  • M.-T. Yang et al.

    A pipeline-based approach for scheduling video processing algorithms on NOW

    IEEE Transactions on Parallel and Distributed Systems

    (2003)
  • A. Benoit et al.

    Mapping pipeline skeletons onto heterogeneous platforms

    Journal of Parallel and Distributed Computing

    (2008)
  • A. Choudhary, W. Liao, D. Weiner, P. Varshney, R. Linderman, M. Linderman, Design, implementation and evaluation of...
  • V.S. Kumar, B. Rutt, T. Kurc, U. Catalyurek, J. Saltz, S. Chow, S. Lamont, M. Martone, Imaging and visual...
  • M. Yang, T. Gandhi, R. Kasturi, L. Coraor, O. Camps, J. McCandless, Real-time obstacle detection system for high speed...
  • N. Vydyanathan, Ü.V. Çatalyürek, T.M. Kurç, P. Sadayappan, J.H. Saltz, Toward optimizing latency under throughput...
  • N. Vydyanathan, Ü.V. Çatalyürek, T. Kurç, P. Sadayappan, J. Saltz, A duplication based algorithm for scheduling of...
  • M. Spencer et al.

    Executing multiple pipelined data analysis operations in the grid

  • F. Guirado, A. Ripoll, C. Roig, E. Luque, Optimizing latency under throughput requirements for streaming applications on...
  • B. Agarwalla, N. Ahmed, D. Hilley, U. Ramachandran, Streamline: A scheduling heuristic for streaming application on the...
  • S.B. Shukla et al.

    Scheduling pipelined communication in distributed memory multiprocessors for real-time applications

    ACM SIGARCH Computer Architecture News

    (1991)
  • The placenta image analysis pipeline,...
  • M. Lee, W. Liu, V.K. Prasanna, A mapping methodology for designing software task pipelines for embedded signal...
  • J. Jonsson, J. Vasell, Real-time scheduling for pipelined execution of data flow graphs on a realistic multiprocessor...
  • S.L. Hary et al.

    Precedence-constrained task allocation onto point-to-point networks for pipelined execution

    IEEE Transactions on Parallel and Distributed Systems

    (1999)
  • V. Suhendra, C. Raghavan, T. Mitra, Integrated scratchpad memory optimization and task scheduling for MPSoC...
  • B. Agarwalla et al.

    Streamline: scheduling streaming applications in a wide area environment

    Multimedia Systems

    (2007)
  • L. Chen et al.

    Resource allocation in a middleware for streaming data

This research was supported in part by the National Science Foundation under Grants #CCF-0342615, #CNS-0403342 and #CNS-0643969.
