Abstract
Many high-level parallel programming languages allow for fine-grained parallelism. As in the popular work-time framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A common concern in executing such programs is to schedule tasks to processors dynamically so as to minimize not only the execution time, but also the amount of space (memory) needed. Without careful scheduling, the parallel execution on p processors can use a factor of p or more times the space of a sequential implementation of the same program.
This paper first identifies a class of parallel schedules that are provably efficient in both time and space. For any computation with w units of work and critical path length d, and for any sequential schedule that takes space s1, we provide a parallel schedule that takes fewer than w/p + d steps on p processors and requires less than s1 + p·d space. This matches the lower bound that we show, and significantly improves upon the best previous bound of s1·p space for the common case where d ≪ s1.
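The w/p + d time bound is the classical greedy-scheduling (Brent-style) bound: any schedule that, at each step, executes up to p of the currently ready tasks finishes within w/p + d steps. A minimal sketch in Python (the function names and the example DAG are illustrative, not from the paper):

```python
def greedy_schedule_steps(preds, p):
    """Simulate a greedy p-processor schedule of a unit-task DAG.
    preds maps each task to the set of tasks it depends on.
    Each step, up to p ready tasks (all dependencies done) execute."""
    done = set()
    steps = 0
    while len(done) < len(preds):
        ready = [t for t in preds if t not in done and preds[t] <= done]
        for t in ready[:p]:  # any p ready tasks suffice for the bound
            done.add(t)
        steps += 1
    return steps

def dag_depth(preds):
    """Length of the longest dependency chain (the critical path d)."""
    memo = {}
    def depth(t):
        if t not in memo:
            memo[t] = 1 + max((depth(u) for u in preds[t]), default=0)
        return memo[t]
    return max(depth(t) for t in preds)

# Example: a fork-join DAG -- one root, ten parallel children, one sink.
preds = {0: set()}
preds.update({i: {0} for i in range(1, 11)})
preds[11] = set(range(1, 11))
# w = 12 unit tasks, critical path d = 3; with p = 3 processors a greedy
# schedule needs at most w/p + d = 7 steps (here it takes 6).
```

The space bound is the novel part: the paper shows that, among all such greedy schedules, a suitably chosen one also stays within s1 + p·d space of the sequential execution.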
The paper then describes a scheduler for implementing high-level languages with nested parallelism that generates schedules in this class. During program execution, as the structure of the computation is revealed, the scheduler keeps track of the active tasks, allocates the tasks to the processors, and performs the necessary task synchronization. The scheduler is itself a parallel algorithm, and incurs at most a constant factor overhead in time and space, even when the scheduling granularity is individual units of work. The algorithm is the first efficient solution to the scheduling problem discussed here, even if space considerations are ignored.
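As a rough illustration of the schedule class (a simplified sketch, not the paper's algorithm: it handles only spawn trees with no join synchronization, and all names are invented), a p-DFS schedule keeps the ready tasks ordered as a one-processor depth-first execution would run them, and executes the earliest p at each step:

```python
def pdfs_schedule(children, root, p):
    """Run a task tree under a p-DFS schedule: ready tasks are kept
    in depth-first (sequential) order, and the first p run each step.
    children maps a task to the child tasks it spawns on completion.
    Returns the schedule as a list of steps (lists of tasks)."""
    ready = [root]
    steps = []
    while ready:
        batch, rest = ready[:p], ready[p:]
        steps.append(batch)
        # Each completed task is replaced, in place, by its children,
        # which preserves the 1-DFS order among the ready tasks.
        spawned = []
        for t in batch:
            spawned.extend(children.get(t, []))
        ready = spawned + rest
    return steps

# Example: spawn tree a -> (b -> (c, d), e).  With p = 2 the schedule
# runs [a], then [b, e], then [c, d]; with p = 1 it is exactly the
# sequential depth-first order a, b, c, d, e.
children = {'a': ['b', 'e'], 'b': ['c', 'd']}
```

Prioritizing tasks by their sequential order is what ties the parallel space usage to the sequential space s1: the parallel execution never runs far ahead of what the sequential execution would have allocated.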
Index Terms
- Provably efficient scheduling for languages with fine-grained parallelism