Abstract
Multithreaded processor architectures are becoming increasingly commonplace: many current and upcoming designs support chip multiprocessing, simultaneous multithreading, or both. While it is relatively straightforward to use these architectures to improve the throughput of a multithreaded or multiprogrammed workload, the real challenge is creating parallel software that allows single programs to effectively exploit all of this raw performance potential. One promising technique for overcoming this problem is Thread-Level Speculation (TLS), which enables the compiler to optimistically create parallel threads despite uncertainty as to whether those threads are actually independent. In this article, we propose and evaluate a design for supporting TLS that seamlessly scales both within a chip and beyond, because it is a straightforward extension of write-back invalidation-based cache coherence (which itself scales both up and down). Our experimental results demonstrate that our scheme performs well on single-chip multiprocessors where the first-level caches are either private or shared. For our private-cache design, the performance of two of the 13 general-purpose applications studied improves by 86% and 56%, that of four others improves by more than 8%, and performance improves by an average of 16% across all applications. These results confirm that TLS is a promising way to exploit the naturally multithreaded processing resources of future computer systems.
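To make the TLS execution model described above concrete, here is a minimal software sketch: speculative epochs run optimistically against a snapshot of memory and buffer their writes, then commit in program order; any epoch that read a location written by a logically earlier epoch suffers a dependence violation and is squashed and re-executed. This is an illustrative serial analogue, not the paper's hardware coherence scheme, and all names here (`tls_execute`, `make_body`) are hypothetical.

```python
def tls_execute(epoch_bodies, memory):
    """Run each epoch speculatively against a snapshot of `memory`,
    then commit in program order, squashing and re-executing any
    epoch that read a location a logically earlier epoch wrote
    (a read-after-write dependence violation). Returns the number
    of violations detected."""

    def execute(body, view):
        reads, writes = set(), {}
        def load(addr):
            reads.add(addr)
            # Forward the epoch's own buffered store if present.
            return writes[addr] if addr in writes else view.get(addr, 0)
        def store(addr, value):
            writes[addr] = value           # buffer; do not touch memory yet
        body(load, store)
        return reads, writes

    # Speculative phase: every epoch optimistically sees only the
    # pre-loop snapshot, as if all epochs began at the same instant.
    snapshot = dict(memory)
    speculative = [execute(body, snapshot) for body in epoch_bodies]

    # Commit phase, in program order.
    dirty = set()                          # locations written by committed epochs
    violations = 0
    for body, (reads, writes) in zip(epoch_bodies, speculative):
        if reads & dirty:                  # epoch consumed a stale value
            violations += 1
            # Re-execute against up-to-date memory; because commits
            # proceed in order, the retry cannot itself be violated.
            reads, writes = execute(body, memory)
        memory.update(writes)              # commit the write buffer
        dirty |= writes.keys()
    return violations

# A loop with a cross-iteration dependence: A[i + 1] = A[i] + 1.
def make_body(i):
    def body(load, store):
        store(i + 1, load(i) + 1)
    return body

mem = {0: 1}
violations = tls_execute([make_body(i) for i in range(4)], mem)
# mem is now {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}; epochs 1-3 each
# violated once and were re-executed.
```

In the hardware scheme the article proposes, the analogous bookkeeping (speculative read/write sets and violation detection) is carried at cache-line granularity by an extended invalidation-based coherence protocol rather than in software; the sketch only mirrors the logical epoch ordering and squash-and-retry behavior.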
Index Terms
- The STAMPede approach to thread-level speculation