Abstract
To reduce the impact of network congestion on big data jobs, cluster management frameworks use various heuristics to schedule compute tasks and/or network flows. Most of these schedulers treat the job input data as fixed and greedily schedule the tasks and flows that are ready to run. However, a large fraction of production jobs are recurring with predictable characteristics, which allows us to plan ahead for them. Coordinating the placement of data and tasks of these jobs significantly improves their network locality and frees up bandwidth, which can be used by other jobs running on the cluster. With this intuition, we develop Corral, a scheduling framework that uses characteristics of future workloads to determine an offline schedule that (i) jointly places data and compute to achieve better data locality, and (ii) isolates jobs both spatially (by scheduling them in different parts of the cluster) and temporally, improving their performance. We implement Corral on Apache YARN and evaluate it on a 210-machine cluster using production workloads. Compared to YARN's capacity scheduler, Corral reduces the makespan of these workloads by up to 33% and the median job completion time by up to 56%, with a 20-90% reduction in data transferred across racks.
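To make the abstract's idea concrete, the following is a minimal, hypothetical sketch of offline joint data/compute placement with spatial isolation. It is not Corral's actual algorithm (which the abstract does not specify); the `plan_offline` function, its inputs, and the proportional rack-allocation rule are all illustrative assumptions. Each recurring job, known in advance, is given its own set of racks where both its input data and its tasks would be placed, so shuffle traffic stays rack-local and jobs do not compete for cross-rack bandwidth.

```python
# Illustrative sketch only (not Corral's algorithm): assign each recurring
# job's data AND tasks to a dedicated set of racks, chosen offline from
# predicted job sizes, to get data locality and spatial isolation.

def plan_offline(jobs, num_racks):
    """jobs: list of (job_id, predicted_work) pairs.
    Returns {job_id: [rack indices]} co-locating data and compute."""
    total_work = sum(work for _, work in jobs)
    rack_load = [0.0] * num_racks
    placement = {}
    # Place the largest jobs first, a common offline-packing heuristic.
    for job_id, work in sorted(jobs, key=lambda j: -j[1]):
        # Racks granted in proportion to the job's share of total work.
        k = max(1, round(num_racks * work / total_work))
        # Pick the currently least-loaded racks (spatial isolation).
        racks = sorted(range(num_racks), key=lambda r: rack_load[r])[:k]
        for r in racks:
            rack_load[r] += work / k
        placement[job_id] = racks
    return placement

plan = plan_offline([("A", 6.0), ("B", 3.0), ("C", 1.0)], num_racks=4)
# → {'A': [0, 1], 'B': [2], 'C': [3]}: each job gets disjoint racks.
```

With these inputs the three jobs end up on disjoint rack sets, so all of each job's traffic stays inside its own racks; the temporal-isolation half of Corral's design (scheduling jobs at different times) is not modeled here.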
Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can. Published in SIGCOMM '15: Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication.