
Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading

Published: 01 August 1997

Abstract

To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide-issue superscalar processors exploit ILP by executing multiple instructions from a single program in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads in parallel on different processors. Unfortunately, both parallel processing styles statically partition processor resources, thus preventing them from adapting to dynamically changing levels of ILP and TLP in a program. With insufficient TLP, processors in an MP will be idle; with insufficient ILP, multiple-issue hardware on a superscalar is wasted. This article explores parallel processing on an alternative architecture, simultaneous multithreading (SMT), which allows multiple threads to compete for and share all of the processor's resources every cycle.

The most compelling reason for running parallel applications on an SMT processor is its ability to use thread-level parallelism and instruction-level parallelism interchangeably. By permitting multiple threads to share the processor's functional units simultaneously, the processor can use both ILP and TLP to accommodate variations in parallelism. When a program has only a single thread, all of the SMT processor's resources can be dedicated to that thread; when more TLP exists, this parallelism can compensate for a lack of per-thread ILP.
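To make this resource-sharing argument concrete, here is a minimal, hypothetical issue-slot model (a sketch of the idea, not the authors' simulator): it assumes an 8-wide machine, four thread contexts, and a made-up per-thread ILP distribution, and compares a statically partitioned multiprocessor, where each core may issue only from its own thread, against an SMT that fills the same total issue width from any ready thread.

```python
import random

ISSUE_WIDTH = 8    # total issue slots per cycle (assumed for illustration)
NUM_THREADS = 4    # hardware thread contexts, or MP cores
CYCLES = 10_000

random.seed(0)

def ready_instructions():
    """Per-cycle ILP available in each thread (hypothetical distribution)."""
    return [random.randint(0, 4) for _ in range(NUM_THREADS)]

per_core = ISSUE_WIDTH // NUM_THREADS  # static partition: four 2-wide cores
mp_issued = smt_issued = 0

for _ in range(CYCLES):
    ready = ready_instructions()
    # MP: each core is capped at its own width; surplus ILP in one
    # thread cannot use idle slots in another core.
    mp_issued += sum(min(r, per_core) for r in ready)
    # SMT: all threads compete for the shared issue width every cycle.
    smt_issued += min(sum(ready), ISSUE_WIDTH)

print(f"MP  slot utilization: {mp_issued / (CYCLES * ISSUE_WIDTH):.1%}")
print(f"SMT slot utilization: {smt_issued / (CYCLES * ISSUE_WIDTH):.1%}")
```

Whenever one thread's ILP exceeds its core's share while another thread falls short, the partitioned machine wastes slots that the SMT recovers; that, in miniature, is why SMT can trade TLP and ILP against each other.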

We examine two alternative on-chip parallel architectures for the next generation of processors. We compare SMT and small-scale, on-chip multiprocessors in their ability to exploit both ILP and TLP. First, we identify the hardware bottlenecks that prevent multiprocessors from effectively exploiting ILP. Then, we show that because of its dynamic resource sharing, SMT avoids these inefficiencies and benefits from being able to run more threads on a single processor. The use of TLP is especially advantageous when per-thread ILP is limited. The ease of adding additional thread contexts on an SMT (relative to adding additional processors on an MP) allows simultaneous multithreading to expose more parallelism, further increasing functional unit utilization and attaining a 52% average speedup (versus a four-processor, single-chip multiprocessor with comparable execution resources). This study also addresses an often-cited concern regarding the use of thread-level parallelism or multithreading: interference in the memory system and branch prediction hardware.
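For concreteness, the 52% figure unpacks with simple arithmetic (a restatement of the quoted number, not an additional result):

$$\text{speedup} = \frac{T_{\text{MP4}}}{T_{\text{SMT}}} = 1.52 \quad\Longrightarrow\quad T_{\text{SMT}} = \frac{T_{\text{MP4}}}{1.52} \approx 0.66\,T_{\text{MP4}}$$

that is, the SMT completes the same work in roughly two-thirds of the four-processor chip's execution time.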

We find that multiple threads cause interthread interference in the caches and place greater demands on the memory system, thus increasing average memory latencies. By exploiting thread-level parallelism, however, SMT hides these additional latencies, so that they have only a small impact on total program performance. We also find that for parallel applications, the additional threads have minimal effects on branch prediction.
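The latency-hiding claim can be illustrated with a hedged back-of-the-envelope model (a construction for this page, not the paper's methodology): if each thread is stalled on memory some fraction p of its cycles, and stalls are treated as independent across threads, the probability that every context is stalled at once, leaving the processor nothing to issue, falls geometrically with the number of threads, even if interthread cache interference raises p itself.

```python
def fully_idle_fraction(p: float, threads: int) -> float:
    """Probability that all hardware contexts are stalled on memory at
    once, assuming each stalls a fraction p of cycles, independently."""
    return p ** threads

for t in (1, 2, 4, 8):
    # Assume interference raises the per-thread stall fraction from
    # 0.30 (one thread) to 0.40 (multithreaded) -- illustrative numbers.
    p = 0.30 if t == 1 else 0.40
    print(f"{t} thread(s): fully idle {fully_idle_fraction(p, t):.4%} of cycles")
```

Under these assumed numbers, one thread leaves the machine idle 30% of the time, while eight threads with a worse per-thread stall rate are all stalled simultaneously only about 0.07% of the time; this is the sense in which SMT hides the longer average memory latencies it induces.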



          Reviews

          David Michael Bowen

One of the highest compliments that an idea can be given is for others to ask, “Why didn't I think of that?” As the push toward faster and faster computers relies increasingly on making use of the various levels of parallelism in programs, we all feel some frustration over the lack of a single architecture that is best for all types of problems. Some algorithms split nicely into independent threads that will run efficiently on massively parallel systems. Others show a lot of instruction-level parallelism that can be exploited by wide-issue architectures. A good match between architecture and algorithm can produce great computational efficiency; a mismatch is likely to produce computational paralysis.

The results reported here receive my “Why didn't I think of that?” award for asking the question, “What if we built a CPU that could execute instructions from multiple threads at the same time?” and then going out to find the answer in quantitative style. The authors begin by describing their simultaneous multithreading (SMT) architecture and two comparative designs. All start from a base processor similar to the MIPS R10000. The instruction fetch mechanism and register file of the SMT design are modified so that up to four instructions from two of the eight threads the processor is working on can be issued at any clock cycle. The comparative designs are single-chip multiprocessors, one with two CPUs and one with four, so the total resources for each chip are comparable to those of the SMT design. The authors then discuss the benchmarks and, finally, the results of their simulations.

The results are interesting not only because they show the success of the SMT design, but because they analyze the causes of inefficiencies in the competing designs and study possible ways to avoid them. While this research group has talked about this design at several architecture conferences, and these talks are included in the conference proceedings, this paper appears to be the first time the work has been discussed in the general journal literature. The ideas are interesting, and I hope we will see them included in a commercial product soon. The paper is worth reading, even by people who do not consider computer architecture their primary interest. I recommend it as outside reading for advanced classes in computer architecture.
