skip to main content
article

Instruction scheduling for a tiled dataflow architecture

Published:20 October 2006Publication History
Skip Abstract Section

Abstract

This paper explores hierarchical instruction scheduling for a tiled processor. Our results show that at the top level of the hierarchy, a simple profile-driven algorithm effectively minimizes operand latency. After this schedule has been partitioned into large sections, the bottom-level algorithm must more carefully analyze program structure when producing the final schedule.Our analysis reveals that at this bottom level, good scheduling depends upon carefully balancing instruction contention for processing elements and operand latency between producer and consumer instructions. We develop a parameterizable instruction scheduler that more effectively optimizes this trade-off. We use this scheduler to determine the contention-latency sweet spot that generates the best instruction schedule for each application. To avoid this application-specific tuning, we also determine the parameters that produce the best performance across all applications. The result is a contention-latency setting that generates instruction schedules for all applications in our workload that come within 17% of the best schedule for each.

References

  1. S.P. Amarasinghe and M.S. Lam. Communication optimization and code generation for distributed memory machines. In Proceedings of the Conference on Programming Language Design and Implementation, 1993.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. J.M. Anderson and M.S. Lam. Global optimizations for parallelism and locality on scalable parallel machines. In Proceedings of the Conference on Programming Language Design and Implementation, 1993.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Arvind and R. Nikhil. Executing a program on the mit tagged-token dataflow architecture. IEEE Transactions on Computers, 39(3), 1990.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. D. Buell et al. Splash 2: FPGAs in a Custom Computing Machine. IEEE Computer Society, 1996.]]Google ScholarGoogle Scholar
  5. T.M. Chilimbi, M.D. Hill, and J.R. Larus. Cache-conscious structure layout. In Proceedings of the Conference on Programming Language Design and Implementation, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. K. Coons, X. Chen, S. Kushwaha, K. McKinley, and D. Burger. A spatial path scheduling algorithm for EDGE architectures. In Symposium on Architectural Support for Programming Languages and Operating Systems, 2006.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. D.E. Culler, A. Sah, K.E. Schauser, T. von Eicken, and J. Wawrzynek. Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 1991.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A.L. Davis. The architecure and system method of DDM1: A recursively structured data driven machine. In Proceedings of the Annual Symposium on Computer Architecture, Palo Alto, California, April 3-5, 1978. IEEE Computer Society and ACM SIGARCH.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. J.B. Dennis. A preliminary architecture for a basic dataflow processor. In Proceedings of the Symposium on Computer Architecture, 1975.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. G. Desoli. Instruction assignment for clustered VLIW DSP compilers: A new approach. Technical Report HPL-98-13, Hewlett-Packard Laboratories, January 1998.]]Google ScholarGoogle Scholar
  11. J. Ellis. Bulldog: A Compiler for VLIW Architectures. PhD thesis, MIT, 1986.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J.R. Ellis. Bulldog: A Compiler for VLIW Architectures. ACM doctoral dissertation award; 1985. The MIT Press, 1986.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. P. Faraboschi, G. Brown, J.A. Fisher, G. Desoli, and F. Homewood. Lx: A technology platform for customizable VLIW embedded processing. In International Symposium on Computer Architecture, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. V.G. Grafe, G.S. Davidson, J.E. Hoch, and V.P. Holmes. The Epsilon dataflow processor. In Proceedings of the International Symposium on Computer Architecture, 1989.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. J.R. Gurd, C.C. Kirkham, and I. Watson. The Manchester prototype dataflow computer. Communications of the ACM, 28(1), 1985.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. D.R. Kerns and S.J. Eggers. Balanced scheduling: Instruction scheduling when memory latency is uncertain. In Proceedings of the Conference on Programming Language Design and Implementation, 1993.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. M. Kishi, H. Yasuhara, and Y. Kawamura. DDDP-A distributed data driven processor. In Proceedings of the International Symposium on Computer Architecture, 1983.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. W. Lee, R. Barua, M. Frank, D. Srikrishna, J. Babb, V. Sarkar, and S. Amarasinghe. Space-time scheduling of instruction-level parallelism on a Raw machine. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. W. Lee et al. Space-time scheduling of instruction-level parallelism on a Raw machine. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. W. Lee, D. Puppin, S. Swenson, and S. Amarasinghe. Convergent scheduling. In Proceedings of the International Symposium on Microarchitecture, 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J.L. Lo and S.J. Eggers. Improving balanced scheduling with compiler optimizations that increase instruction-level parallelism. In Proceedings of the Conference on Programming Language Design and Implementation, 1995.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. P.G. Lowney, S.M. Freudenberger, T.J. Karzes, W.D. Lichtenstein, R.P. Nix, J.S. O'Donnell, and J. Ruttenberg. The multiflow trace scheduling compiler. J. Supercomputing, 1993.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally, and M. Horowitz. Smart memories: A modular reconfigurable architecture. In International Symposium on Computer Architecture, 2002.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. R. Nagarajan, S.K. Kushwaha, D. Burger, K.S. McKinley, C. Lin, and S.W. Keckler. Static placement, dynamic issue (SPDI) scheduling for EDGE architectures. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2004.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. R. Nagarajan, K. Sankaralingam, D. Burger, and S. Keckler. A design space evaluation of grid processor architectures. In Proceedings of the International Symposium on Microarchitecture, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. E. Özer, S. Banerjia, and T.M. Conte. Unified assign and schedule: A new approach to scheduling for clustered register file microarchitectures. In Proceedings of the International Symposium on Microarchitecture, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. G. Papadopoulos and D. Culler. Monsoon: An explicit token-store architecture. In Proceedings of the International Symposium on Computer Architecture, 1990.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. G.M. Papadopoulos and K.R. Traub. Multithreading: A revisionist view of dataflow architectures. In Proceedings of the International Symposium on Computer Architecture, 1991.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. T.A. Proebsting and C.N. Fischer. Linear-time, optimal code scheduling for delayed-load architectures. In Proceedings of the Conference on Programming Language Design and Implementation, 1991.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. G. Rivera and C.-W. Tseng. Data transformations for eliminating conflict misses. In Proceedings of the Conference on Programming Language Design and Implementation, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. S. Sakai, y. Yamaguchi, K. Hiraki, Y. Kodama, and T. Yuba. An architecture of a dataflow single chip processor. In Proceedings of the International Symposium on Computer Architecture, 1989.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. J. Sanchez and A. Gonzalez. Instruction scheduling for clustered VLIW architectures. In Proceedings of the International Symposium on System Synthesis, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. T. Shimada, K. Hiraki, K. Nishida, and S. Sekiguchi. Evaluation of a prototype data flow processor of the sigma-1 for scientific computations. In Proceedings of the International Symposium on Computer Architecture, 1986.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Y. Song and Z. Li. New tiling techniques to improve cache temporal locality. In Proceedings of the Conference on Programming Language Design and Implementation, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. SPEC. Spec CPU 2000 benchmark specifications. SPEC2000 Benchmark Release, 2000.]]Google ScholarGoogle Scholar
  36. S. Swanson, K. Michelson, A. Schwerin, and M. Oskin. WaveScalar. In Proceedings of the International Symposium on Microarchitecture, 2003.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. S. Swanson, A. Putnam, M. Mercaldi, K. Michelson, A. Petersen, A. Schwerin, M. Oskin, and S. Eggers. Area-performance trade-offs in tiled dataflow architectures. In Proceedings of the International Symposium on Computer Architecture, 2006.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. R. von Hanxleden and K. Kennedy. Give-n-take - a balanced code placement framework. In Proceedings of the Conference on Programming Language Design and Implementation, 1994.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, and A. Agarwal. Baring it all to software: Raw machines. Computer, 30(9), 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. K. Wilken, J. Liu, and M. Heffernan. Optimal instruction scheduling using integer programming. In Proceedings of the Conference on Programming Language Design and Implementation, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. M.E. Wolf and M.S. Lam. A data locality optimizing algorithm. In Proceedings of the Conference on Programming Language Design and Implementation, 1991.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. T. Yang and A. Gerasoulis. PYRROS: static task scheduling and code generation for message passing multiprocessors. In Proceedings of the International Conference on Supercomputing, 1992.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. J. Zalamea, J. Llosa, E. Ayguade, and M. Valero. Modulo scheduling with integrated register spilling for clustered vliw architectures. In Proceedings of International Symposium on Microarchitecture, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Instruction scheduling for a tiled dataflow architecture

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM SIGARCH Computer Architecture News
        ACM SIGARCH Computer Architecture News  Volume 34, Issue 5
        Proceedings of the 2006 ASPLOS Conference
        December 2006
        425 pages
        ISSN:0163-5964
        DOI:10.1145/1168919
        Issue’s Table of Contents
        • cover image ACM Conferences
          ASPLOS XII: Proceedings of the 12th international conference on Architectural support for programming languages and operating systems
          October 2006
          440 pages
          ISBN:1595934510
          DOI:10.1145/1168857

        Copyright © 2006 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 20 October 2006

        Check for updates

        Qualifiers

        • article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader