research-article

Enhancing operating system support for multicore processors by using hardware performance monitoring

Published: 21 April 2009

Abstract

Multicore processors have hardware characteristics that differ from those of previous-generation single-core systems and traditional SMP (symmetric multiprocessing) systems. These characteristics bring new performance opportunities and challenges. In this paper, we show how hardware performance monitors can provide a fine-grained, closely coupled feedback loop for dynamic optimizations performed by a multicore-aware operating system. These optimizations are made possible by the advanced capabilities of the hardware performance monitoring units now found in commodity processors, such as execution pipeline stall breakdown and data address sampling. We present three case studies showing how a multicore-aware operating system can use these online capabilities for (1) determining cache partition sizes, which helps reduce contention among applications in the shared cache; (2) detecting memory regions with poor cache usage, which helps isolate these regions to reduce cache pollution; and (3) detecting sharing among threads, which helps cluster threads to improve locality. Using realistic applications from standard benchmark suites, we achieved the following performance improvements: (1) up to 27% improvement in IPC (instructions per cycle) due to cache partition sizing; (2) up to 10% reduction in cache miss rates due to reduced cache pollution, resulting in up to 7% improvement in IPC; and (3) up to 70% reduction in remote cache accesses due to thread clustering, resulting in up to 7% application-level improvement.
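
To make the first case study more concrete, the following is a minimal sketch of how partition sizes might be chosen once per-application miss counts are available from online monitoring. The abstract does not specify the sizing algorithm, so everything here is an assumption for illustration: the function name `choose_partition`, the idea of dividing the shared cache into equal "units", and the miss-count numbers are all hypothetical.

```python
def choose_partition(mrc_a, mrc_b, total_units):
    """Pick a two-way split of a shared cache that minimizes combined misses.

    mrc_a[k] and mrc_b[k] are hypothetical miss counts observed (or
    estimated) when an application is given k cache units. Each
    application must receive at least one unit.
    """
    best = None
    for units_a in range(1, total_units):
        units_b = total_units - units_a
        misses = mrc_a[units_a] + mrc_b[units_b]
        # Keep the first split with the lowest combined miss count.
        if best is None or misses < best[0]:
            best = (misses, units_a, units_b)
    return best[1], best[2]


# Hypothetical miss counts, indexed by the number of cache units granted.
app_a = [100, 60, 30, 25, 24]  # benefits strongly from more cache
app_b = [80, 70, 68, 67, 66]   # largely insensitive to cache size

split = choose_partition(app_a, app_b, total_units=4)
print(split)
```

With these made-up numbers, the cache-sensitive application receives three of the four units, because giving it more space removes far more misses than the insensitive application loses. An exhaustive search is used only because the example is tiny; other objectives (for example fairness rather than total misses) would lead to different splits.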

Published in

ACM SIGOPS Operating Systems Review, Volume 43, Issue 2
April 2009
119 pages
ISSN: 0163-5980
DOI: 10.1145/1531793

Copyright © 2009 Authors

Publisher

Association for Computing Machinery
New York, NY, United States
