Abstract
To reduce the impact of network congestion on big data jobs, cluster management frameworks use various heuristics to schedule compute tasks and/or network flows. Most of these schedulers treat the job input data as fixed and greedily schedule the tasks and flows that are ready to run. However, a large fraction of production jobs are recurring with predictable characteristics, which allows us to plan ahead for them. Coordinating the placement of data and tasks of these jobs significantly improves their network locality and frees up bandwidth, which can be used by other jobs running on the cluster. With this intuition, we develop Corral, a scheduling framework that uses characteristics of future workloads to determine an offline schedule that (i) jointly places data and compute to achieve better data locality, and (ii) isolates jobs both spatially (by scheduling them in different parts of the cluster) and temporally, improving their performance. We implement Corral on Apache YARN and evaluate it on a 210-machine cluster using production workloads. Compared to YARN's capacity scheduler, Corral reduces the makespan of these workloads by up to 33% and the median job completion time by up to 56%, with a 20-90% reduction in data transferred across racks.
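To make the abstract's idea concrete, the following is a minimal, hypothetical sketch of offline joint data/compute placement with spatial isolation. It is not Corral's actual algorithm (which the abstract does not specify); the `plan_offline` function, its inputs, and the proportional rack-allocation rule are all illustrative assumptions. Each recurring job, known in advance, is given its own set of racks where both its input data and its tasks would be placed, so shuffle traffic stays rack-local and jobs do not compete for cross-rack bandwidth.

```python
# Illustrative sketch only (not Corral's algorithm): assign each recurring
# job's data AND tasks to a dedicated set of racks, chosen offline from
# predicted job sizes, to get data locality and spatial isolation.

def plan_offline(jobs, num_racks):
    """jobs: list of (job_id, predicted_work) pairs.
    Returns {job_id: [rack indices]} co-locating data and compute."""
    total_work = sum(work for _, work in jobs)
    rack_load = [0.0] * num_racks
    placement = {}
    # Place the largest jobs first, a common offline-packing heuristic.
    for job_id, work in sorted(jobs, key=lambda j: -j[1]):
        # Racks granted in proportion to the job's share of total work.
        k = max(1, round(num_racks * work / total_work))
        # Pick the currently least-loaded racks (spatial isolation).
        racks = sorted(range(num_racks), key=lambda r: rack_load[r])[:k]
        for r in racks:
            rack_load[r] += work / k
        placement[job_id] = racks
    return placement

plan = plan_offline([("A", 6.0), ("B", 3.0), ("C", 1.0)], num_racks=4)
# → {'A': [0, 1], 'B': [2], 'C': [3]}: each job gets disjoint racks.
```

With these inputs the three jobs end up on disjoint rack sets, so all of each job's traffic stays inside its own racks; the temporal-isolation half of Corral's design (scheduling jobs at different times) is not modeled here.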
Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can. Published in SIGCOMM '15: Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication.