TLB Improvements for Chip Multiprocessors: Inter-Core Cooperative Prefetchers and Shared Last-Level TLBs

Published: 01 April 2013

Abstract

Translation Lookaside Buffers (TLBs) are critical to overall system performance. Much past research has addressed uniprocessor TLBs, lowering access times and miss rates. However, as chip multiprocessors (CMPs) become ubiquitous, TLB design and performance must be reevaluated. Our article begins by performing a thorough TLB performance evaluation of sequential and parallel benchmarks running on a real-world, modern CMP system using hardware performance counters. This analysis demonstrates the need for further improvement of TLB hit rates for both classes of applications, and it also shows that the data TLB has a significantly higher miss rate than the instruction TLB in both cases.

In response to the characterization data, we propose and evaluate both Inter-Core Cooperative (ICC) TLB prefetchers and Shared Last-Level (SLL) TLBs as alternatives to the commercial norm of private, per-core L2 TLBs. ICC prefetchers eliminate 19% to 90% of Data TLB (D-TLB) misses across parallel workloads while requiring only modest changes in hardware. SLL TLBs eliminate 7% to 79% of D-TLB misses for parallel workloads and 35% to 95% of D-TLB misses for multiprogrammed sequential workloads. This corresponds to 27% and 21% increases in hit rates as compared to private, per-core L2 TLBs, respectively, and is achieved with even more modest hardware requirements.
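
To make the contrast between private, per-core L2 TLBs and a shared last-level TLB concrete, the following toy model may help. It is only a sketch under stated assumptions and is not the simulation methodology of the article: the fully associative LRU organization, the names `LruTlb` and `count_misses`, the entry counts, and the contrived trace in which four cores touch the same eight pages one after another are all hypothetical, and L1 TLBs, access latencies, and prefetching are ignored.

```python
from collections import OrderedDict


class LruTlb:
    """Toy fully associative TLB with LRU replacement (an assumed organization)."""

    def __init__(self, entries):
        self.entries = entries
        self.pages = OrderedDict()  # virtual page number -> present flag

    def access(self, vpn):
        """Return True on a hit; on a miss, install vpn and evict the LRU entry if full."""
        if vpn in self.pages:
            self.pages.move_to_end(vpn)  # mark as most recently used
            return True
        self.pages[vpn] = True
        if len(self.pages) > self.entries:
            self.pages.popitem(last=False)  # drop the least recently used entry
        return False


def count_misses(trace, cores, entries_per_core, shared):
    """Count TLB misses for a trace of (core, vpn) pairs.

    shared=False: each core owns a private TLB of entries_per_core entries.
    shared=True:  all cores use one TLB with the same total capacity.
    """
    if shared:
        one = LruTlb(entries_per_core * cores)
        tlbs = [one] * cores  # every core refers to the same structure
    else:
        tlbs = [LruTlb(entries_per_core) for _ in range(cores)]
    return sum(0 if tlbs[core].access(vpn) else 1 for core, vpn in trace)


# Hypothetical parallel workload: every core walks the same 8 pages in turn.
CORES, PAGES = 4, 8
trace = [(core, vpn) for core in range(CORES) for vpn in range(PAGES)]

private_misses = count_misses(trace, CORES, PAGES, shared=False)
shared_misses = count_misses(trace, CORES, PAGES, shared=True)
print(private_misses, shared_misses)  # prints "32 8"
```

In this contrived trace each private TLB takes its own cold miss on every page, whereas the shared structure misses only on the first core's accesses. The ratio is an artifact of the trace and should not be read as a reproduction of the percentages reported above.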

Because of their benefits for parallel applications, their applicability to sequential workloads, and their readily implementable hardware, SLL TLBs and ICC TLB prefetchers hold great promise for CMPs.

      • Published in

        ACM Transactions on Architecture and Code Optimization, Volume 10, Issue 1
        April 2013
        151 pages
        ISSN: 1544-3566
        EISSN: 1544-3973
        DOI: 10.1145/2445572

        Copyright © 2013 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 1 April 2013
        • Accepted: 1 December 2012
        • Revised: 1 October 2012
        • Received: 1 October 2011

        Qualifiers

        • research-article
        • Research
        • Refereed
