The STAMPede approach to thread-level speculation

Published: 01 August 2005

Abstract

Multithreaded processor architectures are becoming increasingly commonplace: many current and upcoming designs support chip multiprocessing, simultaneous multithreading, or both. While it is relatively straightforward to use these architectures to improve the throughput of a multithreaded or multiprogrammed workload, the real challenge is how to easily create parallel software that allows single programs to effectively exploit all of this raw performance potential. One promising technique for overcoming this problem is Thread-Level Speculation (TLS), which enables the compiler to optimistically create parallel threads despite uncertainty as to whether those threads are actually independent. In this article, we propose and evaluate a design for supporting TLS that seamlessly scales both within a chip and beyond, because it is a straightforward extension of write-back invalidation-based cache coherence (which itself scales both up and down). Our experimental results demonstrate that our scheme performs well on single-chip multiprocessors where the first-level caches are either private or shared. For our private-cache design, two of the 13 general-purpose applications studied improve by 86% and 56%, four others improve by more than 8%, and the average improvement across all applications is 16%, confirming that TLS is a promising way to exploit the naturally-multithreaded processing resources of future computer systems.
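To make the TLS idea above concrete, the sketch below emulates in plain C what the hardware described in the article does underneath: loop iterations run as speculative epochs, each epoch buffers its writes and records its speculative reads, and an epoch is squashed and re-executed whenever a logically earlier epoch commits a write to a location the later epoch has already read. Everything here (epoch_t, tls_read, tls_write, the software tables, and the contrived loop body) is an illustrative assumption for exposition only; in the actual design this bookkeeping is carried out by extensions to write-back invalidation-based cache coherence, not by software.

#include <stdio.h>
#include <string.h>

#define N   8          /* number of loop iterations (epochs)      */
#define MEM 16         /* size of the tiny shared "memory"        */

static int memory[MEM];          /* committed architectural state  */

typedef struct {
    int  writes[MEM];            /* speculative write buffer       */
    char wrote[MEM], read[MEM];  /* per-location access marks      */
    int  violated;
} epoch_t;

/* Speculative load: prefer this epoch's own buffered value; only reads
 * that reach shared memory need to be tracked for violations.          */
static int tls_read(epoch_t *e, int addr) {
    if (e->wrote[addr]) return e->writes[addr];
    e->read[addr] = 1;
    return memory[addr];
}

/* Speculative store: buffered, invisible to other epochs until commit. */
static void tls_write(epoch_t *e, int addr, int val) {
    e->wrote[addr] = 1;
    e->writes[addr] = val;
}

/* The loop body being speculated on: iteration i reads location i and
 * writes location i+2, so iterations i and i+2 truly conflict -- the
 * kind of cross-iteration dependence a compiler may be unable to rule
 * out statically.                                                      */
static void body(epoch_t *e, int i) {
    int v = tls_read(e, i);
    tls_write(e, (i + 2) % MEM, v + i);
}

int main(void) {
    epoch_t epochs[N];
    memset(epochs, 0, sizeof epochs);
    for (int a = 0; a < MEM; a++) memory[a] = a;

    /* "Run" every epoch speculatively (sequentially here for clarity;
     * a real system runs them on separate processors).                 */
    for (int i = 0; i < N; i++) body(&epochs[i], i);

    /* Commit epochs in program order.  When epoch i commits a write,
     * any later epoch that speculatively read that location is violated
     * and re-executed -- the read-after-write check that, in the paper's
     * design, piggybacks on invalidation-based cache coherence.        */
    for (int i = 0; i < N; i++) {
        for (int a = 0; a < MEM; a++) {
            if (!epochs[i].wrote[a]) continue;
            for (int j = i + 1; j < N; j++)
                if (epochs[j].read[a]) epochs[j].violated = 1;
            memory[a] = epochs[i].writes[a];
        }
        for (int j = i + 1; j < N; j++) {      /* squash and restart   */
            if (!epochs[j].violated) continue;
            memset(&epochs[j], 0, sizeof epochs[j]);
            body(&epochs[j], j);
        }
    }

    for (int a = 0; a < MEM; a++) printf("%d ", memory[a]);
    printf("\n");
    return 0;
}

Running this sketch commits the epochs in program order, re-executes only the iterations whose speculative reads were violated, and ends with the same memory state as sequential execution, which is the correctness guarantee TLS provides while still extracting parallelism from the non-conflicting iterations.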

