Abstract
Translation Lookaside Buffers (TLBs) are critical to overall system performance. Much past research has addressed uniprocessor TLBs, lowering access times and miss rates. However, as Chip MultiProcessors (CMPs) become ubiquitous, TLB design and performance must be reevaluated. Our article begins by performing a thorough TLB performance evaluation of sequential and parallel benchmarks running on a real-world, modern CMP system using hardware performance counters. This analysis demonstrates the need for further improvement of TLB hit rates for both classes of application, and it also points out that the data TLB has a significantly higher miss rate than the instruction TLB in both cases.
In response to the characterization data, we propose and evaluate both Inter-Core Cooperative (ICC) TLB prefetchers and Shared Last-Level (SLL) TLBs as alternatives to the commercial norm of private, per-core L2 TLBs. ICC prefetchers eliminate 19% to 90% of Data TLB (D-TLB) misses across parallel workloads while requiring only modest changes in hardware. SLL TLBs eliminate 7% to 79% of D-TLB misses for parallel workloads and 35% to 95% of D-TLB misses for multiprogrammed sequential workloads. This corresponds to 27% and 21% increases in hit rates, respectively, as compared to private, per-core L2 TLBs, and is achieved with even more modest hardware requirements.
Because of their benefits for parallel applications, their applicability to sequential workloads, and their readily implementable hardware, SLL TLBs and ICC TLB prefetchers hold great promise for CMPs.
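As a rough illustration of why a shared last-level TLB can help parallel workloads, the toy model below compares private per-core TLBs against a single shared TLB of the same total capacity. It is only a sketch and is not the evaluation infrastructure behind the results above. The names `LruTlb` and `count_misses`, the LRU replacement policy, the fully associative organization, and the example trace are all assumptions made for illustration. The model also ignores L1 TLBs and access latency, and assumes all cores run threads of one address space.

```python
from collections import OrderedDict


class LruTlb:
    """Hypothetical fully associative TLB with LRU replacement."""

    def __init__(self, entries):
        self.entries = entries
        self.table = OrderedDict()  # virtual page number -> present

    def access(self, vpn):
        """Return True on a hit; on a miss, fill the entry (evicting the LRU one)."""
        if vpn in self.table:
            self.table.move_to_end(vpn)  # mark as most recently used
            return True
        if len(self.table) >= self.entries:
            self.table.popitem(last=False)  # evict least recently used
        self.table[vpn] = True
        return False


def count_misses(trace, num_cores, entries_per_core, shared):
    """Count last-level TLB misses for a trace of (core, vpn) pairs.

    shared=False: each core has its own TLB of entries_per_core entries.
    shared=True:  all cores use one TLB of num_cores * entries_per_core entries.
    """
    if shared:
        tlb = LruTlb(num_cores * entries_per_core)
        tlbs = [tlb] * num_cores
    else:
        tlbs = [LruTlb(entries_per_core) for _ in range(num_cores)]
    return sum(0 if tlbs[core].access(vpn) else 1 for core, vpn in trace)


# Assumed trace: core 0 touches four pages, then core 1 touches the same pages.
trace = [(0, page) for page in range(4)] + [(1, page) for page in range(4)]
private_misses = count_misses(trace, num_cores=2, entries_per_core=4, shared=False)
shared_misses = count_misses(trace, num_cores=2, entries_per_core=4, shared=True)
print(private_misses, shared_misses)
```

In this contrived trace, the second core's accesses hit on translations the first core already brought into the shared structure, whereas with private TLBs each core misses on every page it touches for the first time. Real workloads share translations far less uniformly, so the gap here should not be read as representative.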