
Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading

Published: 01 August 1997

Abstract

To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide-issue superscalar processors exploit ILP by executing multiple instructions from a single program in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads in parallel on different processors. Unfortunately, both parallel processing styles statically partition processor resources, thus preventing them from adapting to dynamically changing levels of ILP and TLP in a program. With insufficient TLP, processors in an MP will be idle; with insufficient ILP, multiple-issue hardware on a superscalar is wasted. This article explores parallel processing on an alternative architecture, simultaneous multithreading (SMT), which allows multiple threads to compete for and share all of the processor's resources every cycle.

The most compelling reason for running parallel applications on an SMT processor is its ability to use thread-level parallelism and instruction-level parallelism interchangeably. By permitting multiple threads to share the processor's functional units simultaneously, the processor can use both ILP and TLP to accommodate variations in parallelism. When a program has only a single thread, all of the SMT processor's resources can be dedicated to that thread; when more TLP exists, this parallelism can compensate for a lack of per-thread ILP.
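To make this resource-sharing argument concrete, here is a minimal, hypothetical issue-slot model (a sketch of the idea, not the authors' simulator): it assumes an 8-wide machine, four thread contexts, and a made-up per-thread ILP distribution, and compares a statically partitioned multiprocessor, where each core may issue only from its own thread, against an SMT that fills the same total issue width from any ready thread.

```python
import random

ISSUE_WIDTH = 8    # total issue slots per cycle (assumed for illustration)
NUM_THREADS = 4    # hardware thread contexts, or MP cores
CYCLES = 10_000

random.seed(0)

def ready_instructions():
    """Per-cycle ILP available in each thread (hypothetical distribution)."""
    return [random.randint(0, 4) for _ in range(NUM_THREADS)]

per_core = ISSUE_WIDTH // NUM_THREADS  # static partition: four 2-wide cores
mp_issued = smt_issued = 0

for _ in range(CYCLES):
    ready = ready_instructions()
    # MP: each core is capped at its own width; surplus ILP in one
    # thread cannot use idle slots in another core.
    mp_issued += sum(min(r, per_core) for r in ready)
    # SMT: all threads compete for the shared issue width every cycle.
    smt_issued += min(sum(ready), ISSUE_WIDTH)

print(f"MP  slot utilization: {mp_issued / (CYCLES * ISSUE_WIDTH):.1%}")
print(f"SMT slot utilization: {smt_issued / (CYCLES * ISSUE_WIDTH):.1%}")
```

Whenever one thread's ILP exceeds its core's share while another thread falls short, the partitioned machine wastes slots that the SMT recovers; that, in miniature, is why SMT can trade TLP and ILP against each other.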

We examine two alternative on-chip parallel architectures for the next generation of processors. We compare SMT and small-scale, on-chip multiprocessors in their ability to exploit both ILP and TLP. First, we identify the hardware bottlenecks that prevent multiprocessors from effectively exploiting ILP. Then, we show that because of its dynamic resource sharing, SMT avoids these inefficiencies and benefits from being able to run more threads on a single processor. The use of TLP is especially advantageous when per-thread ILP is limited. The ease of adding additional thread contexts on an SMT (relative to adding additional processors on an MP) allows simultaneous multithreading to expose more parallelism, further increasing functional unit utilization and attaining a 52% average speedup (versus a four-processor, single-chip multiprocessor with comparable execution resources). This study also addresses an often-cited concern regarding the use of thread-level parallelism or multithreading: interference in the memory system and branch prediction hardware.
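For concreteness, the 52% figure unpacks with simple arithmetic (a restatement of the quoted number, not an additional result):

$$\text{speedup} = \frac{T_{\text{MP4}}}{T_{\text{SMT}}} = 1.52 \quad\Longrightarrow\quad T_{\text{SMT}} = \frac{T_{\text{MP4}}}{1.52} \approx 0.66\,T_{\text{MP4}}$$

that is, the SMT completes the same work in roughly two-thirds of the four-processor chip's execution time.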

We find that multiple threads cause interthread interference in the caches and place greater demands on the memory system, thus increasing average memory latencies. By exploiting thread-level parallelism, however, SMT hides these additional latencies, so that they have only a small impact on total program performance. We also find that for parallel applications, the additional threads have minimal effects on branch prediction.
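The latency-hiding claim can be illustrated with a hedged back-of-the-envelope model (a construction for this page, not the paper's methodology): if each thread is stalled on memory some fraction p of its cycles, and stalls are treated as independent across threads, the probability that every context is stalled at once, leaving the processor nothing to issue, falls geometrically with the number of threads, even if interthread cache interference raises p itself.

```python
def fully_idle_fraction(p: float, threads: int) -> float:
    """Probability that all hardware contexts are stalled on memory at
    once, assuming each stalls a fraction p of cycles, independently."""
    return p ** threads

for t in (1, 2, 4, 8):
    # Assume interference raises the per-thread stall fraction from
    # 0.30 (one thread) to 0.40 (multithreaded) -- illustrative numbers.
    p = 0.30 if t == 1 else 0.40
    print(f"{t} thread(s): fully idle {fully_idle_fraction(p, t):.4%} of cycles")
```

Under these assumed numbers, one thread leaves the machine idle 30% of the time, while eight threads with a worse per-thread stall rate are all stalled simultaneously only about 0.07% of the time; this is the sense in which SMT hides the longer average memory latencies it induces.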



          Reviews

          David Michael Bowen

One of the highest compliments that an idea can be given is for others to ask, “Why didn't I think of that?” As the push toward faster and faster computers relies increasingly on making use of the various levels of parallelism in programs, we all feel some frustration over the lack of a single architecture that is best for all types of problems. Some algorithms split nicely into independent threads that will run efficiently on massively parallel systems. Others show a lot of instruction-level parallelism that can be exploited by wide-issue architectures. A good match between architecture and algorithm can produce great computational efficiency; a mismatch is likely to produce computational paralysis.

The results reported here receive my “Why didn't I think of that?” award for asking the question, “What if we built a CPU that could execute instructions from multiple threads at the same time?” and then going out to find the answer in quantitative style. The authors begin by describing their simultaneous multithreading (SMT) architecture and two comparative designs. All start from a base processor similar to the MIPS R10000. The instruction fetch mechanism and register file of the SMT design are modified so that up to four instructions from two of the eight threads the processor is working on can be issued at any clock cycle. The comparative designs are single-chip multiprocessors, one with two CPUs and one with four, so the total resources for each chip are comparable to those of the SMT design. The authors then discuss the benchmarks and, finally, the results of their simulations.

The results are interesting not only because they show the success of the SMT design, but because they analyze the causes of inefficiencies in the competing designs and study possible ways to avoid them. While this research group has talked about this design at several architecture conferences, and these talks are included in the conference proceedings, this paper appears to be the first time the work has been discussed in the general journal literature. The ideas are interesting, and I hope we will see them included in a commercial product soon. The paper is worth reading, even by people who do not consider computer architecture their primary interest. I recommend it as outside reading for advanced classes in computer architecture.
