Enhancing operating system support for multicore processors by using hardware performance monitoring

Abstract
Multicore processors exhibit hardware characteristics that differ from those of previous-generation single-core systems and traditional SMP (symmetric multiprocessing) multiprocessor systems. These characteristics present new performance opportunities and challenges. In this paper, we show how hardware performance monitors can provide a fine-grained, closely coupled feedback loop for dynamic optimizations performed by a multicore-aware operating system. These optimizations are made possible by the advanced capabilities of the hardware performance monitoring units found in commodity processors, such as execution pipeline stall breakdowns and data address sampling. We present three case studies showing how a multicore-aware operating system can use these online capabilities to (1) determine cache partition sizes, which reduces contention in the shared cache among applications; (2) detect memory regions with poor cache usage, which allows these regions to be isolated to reduce cache pollution; and (3) detect sharing among threads, which allows threads to be clustered to improve locality. Using realistic applications from standard benchmark suites, we achieved the following performance improvements: (1) up to 27% improvement in IPC (instructions per cycle) from cache partition sizing; (2) up to 10% reduction in cache miss rates from reduced cache pollution, yielding up to 7% improvement in IPC; and (3) up to 70% reduction in remote cache accesses from thread clustering, yielding up to 7% application-level improvement.
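To give a flavor of the first case study, the sketch below shows how an operating system might size shared-cache partitions from per-application miss-rate curves (MRCs) derived from hardware performance counters. This is an illustrative greedy marginal-utility allocation under assumed inputs; the function name `partition_sizes`, the unit granularity (e.g. page colors), and the example curves are hypothetical, not the paper's actual mechanism.

```python
# Hypothetical sketch: sizing shared-cache partitions from per-application
# miss-rate curves (MRCs). mrcs[i][a] = misses of application i when it
# holds a+1 cache units (e.g. page colors). Example data is illustrative.

def partition_sizes(mrcs, total_units):
    """Greedily hand out cache units, one at a time, to the application
    whose next unit yields the largest reduction in misses."""
    alloc = [1] * len(mrcs)              # start each application with one unit
    for _ in range(total_units - len(mrcs)):
        # marginal benefit of one more unit for each application
        gains = [mrc[a - 1] - mrc[a] if a < len(mrc) else 0.0
                 for mrc, a in zip(mrcs, alloc)]
        best = max(range(len(mrcs)), key=lambda i: gains[i])
        alloc[best] += 1
    return alloc

# Example: app 0 is cache-sensitive, app 1 is a streaming workload that
# gains almost nothing from extra cache space.
mrc_sensitive = [100, 60, 35, 20, 12, 8, 6, 5]   # misses vs. units held
mrc_streaming = [90, 88, 87, 86, 86, 86, 86, 86]
print(partition_sizes([mrc_sensitive, mrc_streaming], 8))  # → [7, 1]
```

The greedy policy naturally starves the streaming workload of cache space it cannot use, which is the intuition behind partition sizing in case study (1); the paper's own approach builds the MRCs online from PMU data address samples rather than assuming them as inputs.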