ABSTRACT
In this paper, we propose memory reduction as a new approach to data locality enhancement. Under this approach, the compiler reduces the size of the data repeatedly referenced in a collection of nested loops. Between reuses, the data are then more likely to remain in higher-speed memory devices such as the cache. Specifically, we present an optimal algorithm that combines loop shifting, loop fusion, and array contraction to reduce the temporary array storage required to execute a collection of loops. When applied to 20 benchmark programs, our technique reduces the memory requirement, counting both data and code, by 51% on average. The transformed programs achieve a speedup of 1.40 on average, due to the reduced footprint and, consequently, the improved data locality.
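As a hypothetical illustration (not taken from the paper, with invented array names and sizes), the sketch below shows the kind of transformation the abstract describes: two loops that communicate through a temporary array are fused, and the array is contracted to a scalar, so each intermediate value is consumed immediately after it is produced.

```c
#include <stdio.h>

#define N 1000

/* Before: the temporary array t holds N elements. The first loop writes
 * all of t, and the second loop reads it back, by which time much of t
 * may already have been evicted from cache. */
void before(const double a[N], const double b[N], double c[N])
{
    double t[N];                     /* N-element temporary array */
    for (int i = 0; i < N; i++)
        t[i] = a[i] + b[i];
    for (int i = 0; i < N; i++)
        c[i] = 2.0 * t[i];
}

/* After loop fusion and array contraction: the loops are fused and t
 * shrinks to a single scalar, eliminating the N-element temporary and
 * keeping the value in a register between its definition and its use. */
void after(const double a[N], const double b[N], double c[N])
{
    for (int i = 0; i < N; i++) {
        double t = a[i] + b[i];      /* contracted temporary */
        c[i] = 2.0 * t;
    }
}

int main(void)
{
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }
    after(a, b, c);
    printf("c[10] = %g\n", c[10]);   /* expect 60 */
    return 0;
}
```

If the second loop also read t[i-1], direct fusion would be illegal; shifting the consumer loop by one iteration restores legality and contracts t to two scalars instead of one. Choosing such shifts and fusions to minimize the remaining temporary storage is what the paper's optimal algorithm addresses.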