Abstract
Multithreaded processor architectures are becoming increasingly commonplace: many current and upcoming designs support chip multiprocessing, simultaneous multithreading, or both. While it is relatively straightforward to use these architectures to improve the throughput of a multithreaded or multiprogrammed workload, the real challenge is creating parallel software that allows single programs to effectively exploit all of this raw performance potential. One promising technique for overcoming this problem is Thread-Level Speculation (TLS), which enables the compiler to optimistically create parallel threads despite uncertainty as to whether those threads are actually independent. In this article, we propose and evaluate a design for supporting TLS that seamlessly scales both within a chip and beyond, because it is a straightforward extension of write-back invalidation-based cache coherence (which itself scales both up and down). Our experimental results demonstrate that our scheme performs well on single-chip multiprocessors where the first-level caches are either private or shared. For our private-cache design, the performance of two of the 13 general-purpose applications studied improves by 86% and 56%, that of four others improves by more than 8%, and performance improves by an average of 16% across all applications. These results confirm that TLS is a promising way to exploit the naturally multithreaded processing resources of future computer systems.
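To make the TLS execution model described above concrete, here is a minimal software sketch: speculative epochs run optimistically against a snapshot of memory and buffer their writes, then commit in program order; any epoch that read a location written by a logically earlier epoch suffers a dependence violation and is squashed and re-executed. This is an illustrative serial analogue, not the paper's hardware coherence scheme, and all names here (`tls_execute`, `make_body`) are hypothetical.

```python
def tls_execute(epoch_bodies, memory):
    """Run each epoch speculatively against a snapshot of `memory`,
    then commit in program order, squashing and re-executing any
    epoch that read a location a logically earlier epoch wrote
    (a read-after-write dependence violation). Returns the number
    of violations detected."""

    def execute(body, view):
        reads, writes = set(), {}
        def load(addr):
            reads.add(addr)
            # Forward the epoch's own buffered store if present.
            return writes[addr] if addr in writes else view.get(addr, 0)
        def store(addr, value):
            writes[addr] = value           # buffer; do not touch memory yet
        body(load, store)
        return reads, writes

    # Speculative phase: every epoch optimistically sees only the
    # pre-loop snapshot, as if all epochs began at the same instant.
    snapshot = dict(memory)
    speculative = [execute(body, snapshot) for body in epoch_bodies]

    # Commit phase, in program order.
    dirty = set()                          # locations written by committed epochs
    violations = 0
    for body, (reads, writes) in zip(epoch_bodies, speculative):
        if reads & dirty:                  # epoch consumed a stale value
            violations += 1
            # Re-execute against up-to-date memory; because commits
            # proceed in order, the retry cannot itself be violated.
            reads, writes = execute(body, memory)
        memory.update(writes)              # commit the write buffer
        dirty |= writes.keys()
    return violations

# A loop with a cross-iteration dependence: A[i + 1] = A[i] + 1.
def make_body(i):
    def body(load, store):
        store(i + 1, load(i) + 1)
    return body

mem = {0: 1}
violations = tls_execute([make_body(i) for i in range(4)], mem)
# mem is now {0: 1, 1: 2, 2: 3, 3: 4, 4: 5}; epochs 1-3 each
# violated once and were re-executed.
```

In the hardware scheme the article proposes, the analogous bookkeeping (speculative read/write sets and violation detection) is carried at cache-line granularity by an extended invalidation-based coherence protocol rather than in software; the sketch only mirrors the logical epoch ordering and squash-and-retry behavior.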
Index Terms
- The STAMPede approach to thread-level speculation