ABSTRACT
Poor placement of data blocks in memory may negatively impact application performance because of an increase in the cache conflict miss rate [18]. For dynamically allocated structures this placement is typically determined by the memory allocator. Cache index-oblivious allocators may inadvertently place blocks on a restricted fraction of the available cache indexes, artificially and needlessly increasing the conflict miss rate. While some allocators are less vulnerable to this phenomena, no general-purpose malloc allocator is index-aware and methodologically addresses this concern. We demonstrate that many existing state-of-the-art allocators are index-oblivious, admitting performance pathologies for certain block sizes. We show that a simple adjustment within the allocator to control the spacing of blocks can provide better index coverage, which in turn reduces the superfluous conflict miss rate in various applications, improving performance with no observed negative consequences. The result is an index-aware allocator. Our technique is general and can easily be applied to most memory allocators and to various processor architectures.
Furthermore, we can reduce inter-thread and inter-process conflict misses for processors where threads concurrently share the level-1 cache such as the Sun UltraSPARC-T2™ and Intel "Nehalem" by coloring the placement of blocks so that allocations for different threads and processes start on different cache indexes.
- The Solaris Schedctl Facility. US Patent #5937187.Google Scholar
- hoard.org, 2010 (accessed 2010-1-14). hoard.org.Google Scholar
- Amino-CBBS, 2010 (accessed 2010-1-5). amino-cbbs.sourceforge.net.Google Scholar
- BioPerf, 2010 (accessed 2010-5-19). http://www.bioperf.org/.Google Scholar
- dlmalloc, 2010 (accessed 2010-7-2). http://gee.cs.oswego.edu/dl/html/malloc.html.Google Scholar
- tcmalloc: version 1.6, 2010 (accessed November 9, 2010). http://code.google.com/p/google-perftools/.Google Scholar
- D. Bader, Y. Li, T. Li, and V. Sachdeva. Bioperf: a benchmark suite to evaluate high-performance computer architecture on bioinformatics applications. IEEE Workload Characterization Symposium, 0:163--173, 2005.Google ScholarCross Ref
- E. D. Berger, K. S. McKinley, R. D. Blumofe, and P. R. Wilson. Hoard: a scalable memory allocator for multithreaded applications. In Proc. ninth international conference on Architectural Support for Programming Languages and Operating Systems, pages 117--128, New York, NY, USA, 2000. ACM. Google ScholarDigital Library
- J. Bonwick. The slab allocator: an object-caching kernel memory allocator. In USTC'94: Proceedings of the USENIX Summer 1994 Technical Conference on USENIX Summer 1994 Technical Conference, pages 6--6, Berkeley, CA, USA, 1994. USENIX Association. Google ScholarDigital Library
- B. Calder, C. Krintz, S. John, and T. Austin. Cache-conscious data placement. SIGPLAN Not., 33(11):139--149, 1998. Google ScholarDigital Library
- T. M. Chilimbi, M. D. Hill, and J. R. Larus. Making pointer-based data structures cache conscious. Computer, 33(12):67--74, 2000. Google ScholarDigital Library
- Y. Chou, B. Fahs, and S. Abraham. Microarchitecture optimizations for exploiting memory-level parallelism. SIGARCH Comput. Archit. News, 32(2):76, 2004. Google ScholarDigital Library
- D. Dice. Dave Dice's blog, 2010 (accessed 2010-7-21). thttp://blogs.sun.com/dave/entry/solaris_scheduling_and_cpuids.Google Scholar
- D. Dice and A. Garthwaite. Mostly lock-free malloc. In Proc. 3rd International Symposium on Memory Management, pages 163--174, New York, NY, USA, 2002. ACM. Google ScholarDigital Library
- D. Dice, Y. Lev, V. J. Marathe, M. Moir, D. Nussbaum, and M. Oleszewski. Simplifying concurrent algorithms by exploiting hardware transactional memory. In SPAA '10: Proceedings of the 22nd ACM symposium on Parallelism in algorithms and architectures, pages 325--334, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
- J. Evans. A scalable concurrent malloc(3) implementation for freebsd, 2006.Google Scholar
- A. González, M. Valero, N. Topham, and J. M. Parcerisa. Eliminating cache conflict misses through xor-based placement functions. In ICS '97: Proceedings of the 11th international conference on Supercomputing, pages 76--83, New York, NY, USA, 1997. ACM. Google ScholarDigital Library
- M. D. Hill and A. J. Smith. Evaluating associativity in cpu caches. IEEE Trans. Comput., 38(12):1612--1630, 1989. Google ScholarDigital Library
- M. S. Johnstone and P. R. Wilson. The memory fragmentation problem: Solved? In ISMM, pages 26--36, 1998. Google ScholarDigital Library
- N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. SIGARCH Comput. Archit. News, 18(3a):364--373, 1990. Google ScholarDigital Library
- R. E. Kessler and M. D. Hill. Page placement algorithms for large real-indexed caches. ACM Trans. Comput. Syst., 10(4):338--359, 1992. Google ScholarDigital Library
- T. Kistler and M. Franz. Automated data-member layout of heap objects to improve memory-hierarchy performance. ACM Trans. Program. Lang. Syst., 22(3):490--505, 2000. Google ScholarDigital Library
- K. C. Knowlton. A fast storage allocator. Commun. ACM, 8(10):623--624, 1965. Google ScholarDigital Library
- C. Lattner and V. Adve. Automatic pool allocation: improving performance by controlling data structure layout in the heap. In PLDI '05: Proceedings of the 2005 ACM SIGPLAN conference on Programming language design and implementation, pages 129--142, New York, NY, USA, 2005. ACM. Google ScholarDigital Library
- A. R. Lebeck and D. A. Wood. Cache profiling and the spec benchmarks: A case study. Computer, 27(10):15--26, 1994. Google ScholarDigital Library
- V. B. Lvin, G. Novark, E. D. Berger, and B. G. Zorn. Archipelago: trading address space for reliability and security. SIGOPS Oper. Syst. Rev., 42(2):115--124, 2008. Google ScholarDigital Library
- M. M. Michael. Scalable Lock-free Dynamic Memory Allocation. In Proc. ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, pages 35--46, 2004. Google ScholarDigital Library
- R. Min and Y. Hu. Improving performance of large physically indexed caches by decoupling memory addresses from cache addresses. IEEE Trans. Comput., 50(11):1191--1201, 2001. Google ScholarDigital Library
- E. Petrank and D. Rawitz. The hardness of cache conscious data placement. In POPL '02: Proceedings of the 29th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 101--112, New York, NY, USA, 2002. ACM. Google ScholarDigital Library
- T. Romer, D. Lee, B. N. Bershad, and J. B. Chen. Dynamic page mapping policies for cache conflict resolution on standard hardware. In In 1st USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 255--266, 1994.Google Scholar
- D. Sanchez and K. Christos. The zcache: Decoupling ways and associativity. In MICRO 43: Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2010. Google ScholarDigital Library
- A. Seznec. A case for two-way skewed-associative caches. In ISCA '93: Proceedings of the 20th annual international symposium on Computer architecture, pages 169--178, New York, NY, USA, 1993. ACM. Google ScholarDigital Library
- D. D. Sleator and R. E. Tarjan. Self-adjusting binary search trees. J. ACM, 32(3):652--686, 1985. Google ScholarDigital Library
- V. Čakarević, P. Radojković, J. Verdú, A. Pajuelo, F. J. Cazorla, M. Nemirovsky, and M. Valero. Characterizing the resource-sharing levels in the ultrasparc t2 processor. In MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 481--492, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
- Z. Wang and R. B. Lee. A novel cache architecture with enhanced performance and security. In MICRO 41: Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture, pages 83--93, Washington, DC, USA, 2008. IEEE Computer Society. Google ScholarDigital Library
- C. B. Zilles. Benchmark health considered harmful. SIGARCH Comput. Archit. News, 29(3):4--5, 2001. Google ScholarDigital Library
- M. Zukowski, S. Héman, and P. Boncz. Architecture-conscious hashing. In DaMoN '06: Proceedings of the 2nd international workshop on Data management on new hardware, page 6, New York, NY, USA, 2006. ACM. Google ScholarDigital Library
Index Terms
- Cache index-aware memory allocation
Recommendations
Cache index-aware memory allocation
ISMM '11Poor placement of data blocks in memory may negatively impact application performance because of an increase in the cache conflict miss rate [18]. For dynamically allocated structures this placement is typically determined by the memory allocator. Cache ...
Reducing traffic generated by conflict misses in caches
CF '04: Proceedings of the 1st conference on Computing frontiersOff-chip memory accesses are a major source of power consumption in embedded processors. In order to reduce the amount of traffic between the processor and the off-chip memory as well as to hide the memory latency, nearly all embedded processors have a ...
Unison Cache: A Scalable and Effective Die-Stacked DRAM Cache
MICRO-47: Proceedings of the 47th Annual IEEE/ACM International Symposium on MicroarchitectureRecent research advocates large die-stacked DRAM caches in many core servers to break the memory latency and bandwidth wall. To realize their full potential, die-stacked DRAM caches necessitate low lookup latencies, high hit rates and the efficient use ...
Comments