TLB Improvements for Chip Multiprocessors: Inter-Core Cooperative Prefetchers and Shared Last-Level TLBs

Authors:
Daniel Lustig, Princeton University
Abhishek Bhattacharjee, Rutgers University
Margaret Martonosi, Princeton University

ACM Transactions on Architecture and Code Optimization, Volume 10, Issue 1, Article No. 2, pp. 1–38
https://doi.org/10.1145/2445572.2445574

Published: 01 April 2013

Abstract

Translation Lookaside Buffers (TLBs) are critical to overall system performance. Much past research has addressed uniprocessor TLBs, lowering access times and miss rates. However, as Chip MultiProcessors (CMPs) become ubiquitous, TLB design and performance must be reevaluated. Our article begins by performing a thorough TLB performance evaluation of sequential and parallel benchmarks running on a real-world, modern CMP system using hardware performance counters. This analysis demonstrates the need for further improvement of TLB hit rates for both classes of application, and it also points out that the data TLB has a significantly higher miss rate than the instruction TLB in both cases.

In response to the characterization data, we propose and evaluate both Inter-Core Cooperative (ICC) TLB prefetchers and Shared Last-Level (SLL) TLBs as alternatives to the commercial norm of private, per-core L2 TLBs. ICC prefetchers eliminate 19% to 90% of Data TLB (D-TLB) misses across parallel workloads while requiring only modest changes in hardware. SLL TLBs eliminate 7% to 79% of D-TLB misses for parallel workloads and 35% to 95% of D-TLB misses for multiprogrammed sequential workloads. These reductions correspond to 27% and 21% increases in hit rates, respectively, compared to private, per-core L2 TLBs, and they are achieved with even more modest hardware requirements.

Because of their benefits for parallel applications, their applicability to sequential workloads, and their readily implementable hardware, SLL TLBs and ICC TLB prefetchers hold great promise for CMPs.
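
The contrast the abstract draws between private, per-core L2 TLBs and a shared last-level TLB can be illustrated with a toy model. The sketch below is not the authors' simulator or methodology. It assumes a fully associative TLB with LRU replacement, ignores the L1 TLBs, access latency, and address-space identifiers, and uses hypothetical names throughout (`LruTlb`, `count_misses`, `shared_page_trace`). It only shows why one core's fill can save another core a miss when threads of a parallel program touch the same pages.

```python
from collections import OrderedDict


class LruTlb:
    """Toy fully associative TLB with LRU replacement (an assumption)."""

    def __init__(self, entries):
        self.entries = entries
        self.pages = OrderedDict()  # virtual page number -> present flag

    def access(self, vpn):
        """Look up a virtual page number, filling on a miss. True on a hit."""
        if vpn in self.pages:
            self.pages.move_to_end(vpn)  # mark as most recently used
            return True
        if len(self.pages) >= self.entries:
            self.pages.popitem(last=False)  # evict the least recently used
        self.pages[vpn] = True
        return False


def count_misses(trace, num_cores, entries_per_core, shared):
    """Replay (core, vpn) accesses and return the total number of misses.

    shared=False: each core has its own TLB of entries_per_core entries.
    shared=True: all cores use one TLB with the same total capacity.
    """
    if shared:
        tlbs = [LruTlb(num_cores * entries_per_core)] * num_cores
    else:
        tlbs = [LruTlb(entries_per_core) for _ in range(num_cores)]
    misses = 0
    for core, vpn in trace:
        if not tlbs[core].access(vpn):
            misses += 1
    return misses


def shared_page_trace(num_cores, num_pages):
    """Hypothetical trace: every core touches the same pages in turn."""
    return [(core, vpn) for vpn in range(num_pages) for core in range(num_cores)]


if __name__ == "__main__":
    trace = shared_page_trace(num_cores=2, num_pages=4)
    print("private misses:", count_misses(trace, 2, 4, shared=False))
    print("shared misses: ", count_misses(trace, 2, 4, shared=True))
```

In this two-core trace, each private TLB misses once per page, for eight misses in total, while the shared structure misses four times because the second core finds the entry the first core already filled. The numbers reflect only this artificial trace and say nothing about the percentages reported in the abstract.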

Index Terms

1. Computer systems organization
  1. Architectures
    1. Parallel architectures
      1. Multiple instruction, multiple data
2. Hardware
  1. Integrated circuits
    1. Semiconductor memory

Published in

ACM Transactions on Architecture and Code Optimization Volume 10, Issue 1
April 2013
151 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/2445572

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 April 2013
- Accepted: 1 December 2012
- Revised: 1 October 2012
- Received: 1 October 2011

Author Tags
TLB prefetching
Translation lookaside buffer
performance evaluation
shared last-level TLB
simulation
Qualifiers
- research-article
- Research
- Refereed

Article Metrics
- Total Citations: 59
- Total Downloads: 1,030
- Downloads (Last 12 months): 130
- Downloads (Last 6 weeks): 11