ABSTRACT
Many cache misses in scientific programs are due to conflicts caused by limited set associativity. We examine two compile-time data-layout transformations for eliminating conflict misses, concentrating on misses occuring on every loop iteration. Inter-variable padding adjusts variable base addresses, while intra-variable padding modifies array dimension sizes. Two levels of precision are evaluated. PADLITE only uses array and column dimension sizes, relying on assumptions about common array reference patterns. PAD analyzes programs, detecting conflict misses by linearizing array references and calculating conflict distances between uniformly-generated references. The Euclidean algorithm for computing the gcd of two numbers is used to predict conflicts between different array columns for linear algebra codes. Experiments on a range of programs indicate PADLITE can eliminate conflicts for benchmarks, but PAD is more effective over a range of cache and problem sizes. Padding reduces cache miss rates by 16% on average for a 16K direct-mapped cache. Execution times are reduced by 6% on average, with some SPEC95 programs improving up to 15%.
- 1.J. Anderson, S. Amarasinghe, and M. Lam. Data and computation transformation for multiprocessors. In Proceedings of the Fifth A UM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Santa Barbara, CA, July 1995. Google ScholarDigital Library
- 2.B. Bershad, D. Lee, T. Romer, and B. Chen. Avoiding confiict misses dynamically in large direct-mapped caches. In Proceedings oat the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VI), San Jose, CA, October 1994. Google ScholarDigital Library
- 3.W. Bolosky, R. Fitzgerald, and M. Scott. Simple but effective tecluxiques for NUMA memory management. In Proceedings of the Twelfth Symposium on Operating Systems Principles, Litchfield Park, AZ, December 1989. Google ScholarDigital Library
- 4.E. Bugnion, J. Anderson, T. Mowry, M. Rosenblum, and M. Lam. Compiler-directed page coloring for multiprocessors. In Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VIII), Boston, MA, October 1996. Google ScholarDigital Library
- 5.D. Calahan. Performance evaluation of static and dynamic memory systems on the Cray-2. In Proceedings of the Second international Conference on Supercomputing, St. Malo, France, July 1988. Google ScholarDigital Library
- 6.M. Cierniak and W. Li. Unifying data and control transformations for distributed shared-memory machines. In Proceedings of the SIGPLAN '95 Conference or& Programming Language Design and Implementation, La Jolla, CA, June 1995. Google ScholarDigital Library
- 7.S. Coleman and K. S. MCKinley. Tile size selection using cache organization and data layout. In Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995. Google ScholarDigital Library
- 8.J. Ferrante, V. Sarkar, and W. Thrash. On estimating and enhancing cache effectiveness. In U. Basmrjee, D. Gelernter, A. Nicolau, and D. Padua, editors, Languages and Compilers .for Parallel Computing, Fourth International Workshop, Santa Clara, CA, August 1991. Springer-Verlag. Google ScholarDigital Library
- 9.D. Gannon, W. Jalby, and K. Galliveal. Strategies for cache and local memory management by global program transformation. Journal o} Parallel and Distributed Computing, 5(5):587-616, October 1988. Google ScholarDigital Library
- 10.S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: An analytical representation of cache misses. In Proceedings of the 1997 ACM International Conference on Supercomputing, Vienna, Austria, July 1997. Google ScholarDigital Library
- 11.A. Gonzalez, M. Valero, N. Topham, and J. Parcerisa. ELiminating cache conflict misses through XOR-based placement functions. In Proceedings of the 1997 ACM International Conference on Supercompuiing, Vienna, Austria, July 1997. Google ScholarDigital Library
- 12.T. Jeremiassen and S. Eggers. Reducing false sharing on shared memory multiprocessors through compile time data transformations. In Proceedings of the Fifth A CM SIGPLAN Symposium on Principles and Practice o.f Parallel Programming, Santa Barbara, CA, July 1995. Google ScholarDigital Library
- 13.N. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th International Symposium on Computer Architecture, Seattle, WA, May 1990. Google ScholarDigital Library
- 14.M. Kandemir, J. Ramanujam, and A. Choudhary. A compiler algorithm for optimizing locality in loop nests. In Proceedings of the 1997 A CM International Conference on Supercomputing, Vienna, Austria, July 1997. Google ScholarDigital Library
- 15.N. Manjikian and T. Abdelrahman. Fusion of loops for parallelism and locality. IEEE Transactions on Parallel and Distributed Systems, 8(2):193-209, February 1997. Google ScholarDigital Library
- 16.S. McFarling. Program optimization for instruction caches. In Proceedings of the Third International Conference on Architectural Support .for Programming Languages and Operating Systems (ASPLOS-III), pages 257-270, Boston, MA, April 1989. Google ScholarDigital Library
- 17.K. S. MCKinley, S. Carr, and C.-W. Tseng. Improving data locality with loop transformations. A CM Transactions on Programming Languages and Systems, 18(4):424-453, July 1996. Google ScholarDigital Library
- 18.K. S. MCKinley and O. Temam. A quantitative analysis of loop nest locality. In Proceedings o} the Eighth international Conference on Architectural Support }or Programming Languages and Operating Systems (ASPLOS-VIII), Boston, MA, October 1996. Google ScholarDigital Library
- 19.M. O'Boyle and P. Knijnenburg. Non-singular data transformations: Definition, validity, and applications. In Proceedings of the 1997 ACM International Conference on Supercomputing, Vienna, Austria, Jtdy 1997. Google ScholarDigital Library
- 20.O. Temam, C. Fricker, and W. Jalby. Cache interference phenomena. In Proceedings of the 1994 A CM SIGMET- RICS Conference on Measurement ~ Modeling Computer Systems, Santa Clara, CA, May 1994. Google ScholarDigital Library
- 21.J. TorrelIas, M. Lam, and J. Hennessy. Shared data placement optimizations to reduce multiprocessor cache miss rates. In Proceedings of the 1990 International Conference on Parallel Processing, St. Charles, IL, August 1990.Google Scholar
- 22.R. Wilson et al. SUIF: An infrastructure for research on parallelizing and optimizing compilers. A CM SIGPLAN Notices, 29(12):31-37, December 1994. Google ScholarDigital Library
- 23.M. E. Wolf and M. Lain. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, Toronto, Canada, June 1991. Google ScholarDigital Library
Index Terms
- Data transformations for eliminating conflict misses
Recommendations
Effective padding of multidimensional arrays to avoid cache conflict misses
PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and ImplementationCaches are used to significantly improve performance. Even with high degrees of set associativity, the number of accessed data elements mapping to the same set in a cache can easily exceed the degree of associativity. This can cause conflict misses and ...
Effective padding of multidimensional arrays to avoid cache conflict misses
PLDI '16Caches are used to significantly improve performance. Even with high degrees of set associativity, the number of accessed data elements mapping to the same set in a cache can easily exceed the degree of associativity. This can cause conflict misses and ...
Data transformations for eliminating conflict misses
Many cache misses in scientific programs are due to conflicts caused by limited set associativity. We examine two compile-time data-layout transformations for eliminating conflict misses, concentrating on misses occuring on every loop iteration. Inter-...
Comments