Abstract
Transaction processing has emerged as the killer application for commercial servers. Most servers are engaged in transactional workloads such as processing search requests, serving middleware, evaluating decisions, managing databases, and powering online commerce. Currently, commercial servers are built from one or more high-performance superscalar processors. However, commercial server applications exhibit high cache miss rates, large memory footprints, and low instruction-level parallelism (ILP), which leads to poor utilization on traditional ILP-focused superscalar processors [11]. In addition, these ILP-focused processors have been optimized primarily to deliver maximum performance by employing high clock rates and large amounts of speculation. As a result, the performance/Watt of successive generations of traditional ILP-focused processors on server workloads has been flat [4] or even decreasing. This stagnation in processor performance/Watt, coupled with the continued decrease in server hardware acquisition costs and likely increases in future power and cooling costs, is leading to a situation where the total cost of server ownership will soon be determined predominantly by power [4]. In this paper, we argue that exploiting thread-level parallelism (TLP) via a large number of simple cores on a chip multiprocessor (CMP) leads to much better performance/Watt for server workloads. As a case study, we compare Sun's TLP-oriented Niagara processor against the ILP-oriented dual-core Pentium Extreme Edition from Intel, showing that the Niagara processor has a significant performance/Watt advantage for throughput-oriented server applications.
References
- A. Agarwal, J. Kubiatowicz, D. Kranz, B.-H. Lim, D. Yeung, G. D'Souza, and M. Parkin, "Sparcle: An Evolutionary Processor Design for Large-scale Multiprocessors," IEEE Micro, June 1993, pages 48--61.
- L. Barroso, K. Gharachorloo, and E. Bugnion, "Memory System Characterization of Commercial Workloads," Proceedings of the 25th Annual International Symposium on Computer Architecture, June 1998, pages 3--14.
- L. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese, "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing," Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.
- L. Barroso, "The Price of Performance," ACM Queue, Vol. 3, Number 7, September 2005.
- S. Chaudhry, P. Caprioli, S. Yip, and M. Tremblay, "High-Performance Throughput Computing," IEEE Micro, May/June 2005, pages 32--45.
- J. Clabes, J. Friedrich, and M. Sweet, "Design and Implementation of the POWER5™ Microprocessor," ISSCC Dig. Tech. Papers, Feb. 2004, pages 56--57.
- J. D. Davis, et al., "Maximizing CMT Throughput with Mediocre Cores," Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, Sep. 2005, pages 51--62.
- J. Dundas and T. Mudge, "Improving Data Cache Performance by Pre-executing Instructions Under a Cache Miss," Proceedings of the 1997 International Conference on Supercomputing, July 1997, pages 68--75.
- M. Hrishikesh, et al., "The Optimal Logic Depth per Pipeline Stage Is 6 to 8 FO4 Inverter Delays," Proceedings of the 29th Annual International Symposium on Computer Architecture, May 2002, pages 14--24.
- P. Kongetira, K. Aingaran, and K. Olukotun, "Niagara: A 32-way Multithreaded SPARC Processor," IEEE Micro, March/April 2005, pages 21--29.
- S. Kunkel, R. Eickemeyer, M. Lip, T. Mullins, "A Performance Methodology for Commercial Servers," IBM Journal of Research and Development, Vol. 44, Number 6, 2000.
- J. Laudon, A. Gupta, and M. Horowitz, "Interleaving: A Multithreading Technique Targeting Multiprocessors and Workstations," Proceedings of the 6th International Symposium on Architectural Support for Programming Languages and Operating Systems, October 1994, pages 308--318.
- J. Lo, L. Barroso, S. Eggers, K. Gharachorloo, et al., "An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors," Proceedings of the 25th Annual International Symposium on Computer Architecture, June 1998, pages 39--50.
- D. Marr, "Hyper-Threading Technology in the Netburst® Microarchitecture," 14th Hot Chips, August 2002.
- S. Mukherjee, M. Kontz, and S. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives," Proceedings of the 29th Annual International Symposium on Computer Architecture, May 2002, pages 99--110.
- O. Mutlu, H. Kim, J. Stark, and Y. N. Patt, "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors," Proceedings of the 9th International Symposium on High Performance Computer Architecture, February 2003.
- S. Naffziger, T. Grutkowski, and B. Stackhouse, "The Implementation of a 2-core Multi-Threaded Itanium® Family Processor," IEEE International Solid-State Circuits Conference (ISSCC), Feb. 2005, pages 182--183.
- C. Poirier, R. McGowen, C. Bostak, and S. Naffziger, "Power and Temperature Control on a 90nm Itanium®-Family Processor," ISSCC, Feb. 2005, pages 304--305.
- Standard Performance Evaluation Corporation, SPEC*, http://www.spec.org, Warrenton, VA.
- Transaction Processing Performance Council, TPC-*, http://www.tpc.org, San Francisco, CA.
- D. Tullsen, S. Eggers, and H. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Proceedings of the 22nd Annual International Symposium on Computer Architecture, June 1995, pages 392--403.
- T. Vijaykumar, I. Pomeranz, and K. Cheng, "Transient-Fault Recovery Using Simultaneous Multithreading," Proceedings of the 29th Annual International Symposium on Computer Architecture, May 2002, pages 87--98.
- "XML Processing Performance in Java and .NET," http://java.sun.com/performance/reference/whitepapers/XML_Test-1_0.pdf
Index Terms
- Performance/Watt: the new server focus