Abstract
Modern computer systems have increasingly complex memory systems. Common machine models for algorithm analysis do not reflect many of the features of these systems, e.g., large register sets, lockup-free caches, cache hierarchies, associativity, cache line fetching, and streaming behavior. Inadequate models lead to poor algorithmic choices and an incomplete understanding of algorithm behavior on real machines.A key step toward developing better models is to quantify the performance effects of features not reflected in the models. This paper explores the effect of memory system features on sorting performance. We introduce a new cache-conscious sorting algorithm, R-MERGE, which achieves better performance in practice over algorithms that are superior in the theoretical models. R-MERGE is designed to minimize memory stall cycles rather than cache misses by considering features common to many system designs.
Supplemental Material
Available for Download
- AGGARWAL, A., ALPERN, B., CHANDRA, A. K., AND SNIR, M. 1987. A model for hierarchical memory. Volume 19 (New York, 1987), pp. 305-314.]]Google Scholar
- AGGARWAL, A., CHANDRA, A., AND SNIR, M. 1987. Hierarchical memory with block transfer. Volume 28 (Los Angeles, 1987), pp. 204-216.]]Google Scholar
- AGGARWAL, A. AND VITTER, J.S. 1988. The Input/Output complexity of sorting and related problems. Commun. ACM 31, 9, 1116-1127.]] Google ScholarDigital Library
- ALPERN, B., CARTER, L., FEIG, E., AND SELKER, T. 1994. The uniform memory hierarchy model of computation. 12, 2-3, 72-109.]]Google Scholar
- ANDERSON, J. M., BERC, L. M., DEAN, J., GHEMAWAT, S., HENZINGER, M. R., LEUNG, S. A., SITES, R. L., VANDERVOORDE, M. T., WALDSPURGER, C. A., AND WEIHL, W. E. 1997. Continuous profiling: Where have all the cycles gone? In Proceedings of the 16th Symposium on Operating Systems Principles (SOSP-97), Volume 31,5 of Operating Systems Review (New York, Oct. 5-8 1997), pp. 1-14. ACM Press.]] Google ScholarDigital Library
- ARGE, L., CHASE, J., VITTER, J. S., AND WICKREMESINGHE, R. 2000. Efficient sorting using registers and caches. In WAE, Workshop on Algorithm Engineering, Lecture Notes in Computer Science (Sept. 2000). Springer. Journal version to appear in the ACM Journal of Experimental Algorithmics.]] Google ScholarDigital Library
- CALLAHAN, D., CARR, S., AND KENNEDY, K. 1990. Improving register allocation for subscripted variables. In Proceedings of the SIGPLAN '90 Conference on Programming Language Design and Implementation (White Plains, NY, June 1990).]] Google ScholarDigital Library
- EDMONDSON, J. H., RUBINFELD, P. I., BANNON, P. J., BENSCHNEIDER, B. J., BERNSTEIN, D., CASTELINO, R. W., COOPER, E. M., DEVER, D. E., DONCHIN, D. R., FISCHER, T. C., JAIN, A. K., MEHTA, S., MEYER, J. E., PRESTON, R. P., RAJAGOPALAN, V., SOMANATHAN, C., TAYLOR, S. A., AND WOLRICH, G.M. 1995. Internal organization of the Alpha 21164, a 300-MHz 64-bit quad-issue CMOS RISC microprocessor. Digital Technical Journal 7, 1, 119-135.]] Google ScholarDigital Library
- FRIGO, LEISERSON, PROKOP, AND RAMACHANDRAN. 1999. Cache-oblivious algorithms. In FOCS: IEEE Symposium on Foundations of Computer Science (FOCS) (1999).]] Google ScholarDigital Library
- HENNESSY, J. L. AND PATTERSON, D. A. 1995. Computer Architecture: A Quantitative Approach (second ed.). Morgan Kaufmann.]] Google ScholarDigital Library
- KNUTH, D. E. 1998. Sorting and Searching (2nd ed.), Volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, MA.]] Google ScholarDigital Library
- LADNER, R., FIX, J., AND LAMARCA, A. 1999. Cache performance analysis of traversals and random accesses. In Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms (1999).]] Google ScholarDigital Library
- LAM, M. S., ROTHBERG, E. E., AND WOLF, M. E. 1991. The cache performance and optimizations of blocked algorithms. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (April 1991).]] Google ScholarDigital Library
- LAMARCA, A. AND LADNER, R. E. 1997. The influence of caches on the performance of sorting. Volume 7 (January 1997), pp. 370-379.]]Google Scholar
- RAHMAN, N. AND RAMAN, R. 1999. Analysing cache effects in distribution sorting. In 3rd Workshop on Algorithm Engineering (July 1999).]] Google ScholarDigital Library
- RAHMAN, N. AND RAMAN, R. 2000. Adapting radix sort to the memory hierarchy. In ALENEX, Workshop on Algorithm Engineering and Experimentation (2000).]]Google Scholar
- RANADE, A., KOTHARI, S., AND UDUPA, R. 2000. Register efficient mergesorting. In International Conference on High Performance Computing (2000).]] Google ScholarDigital Library
- SANDERS, P. 1999a. Accessing multiple sequences through set associative caches. In J. WIEDERMANN, P. VAN EMDE BOAS, AND M. NIELSEN Eds., Automata, Languages and Programruing, 26th International Colloquium, ICALP'99, Prague, Czech Republic, July 11-15, 1999, Proceedings, Volume 1644 of Lecture Notes in Computer Science (1999). Springer.]] Google ScholarDigital Library
- SANDERS, P. 1999b. Fast priority queues for cached memory. In ALENEX, Workshop on Algorithm Engineering and Expermentation, Lecture Notes in Computer Science (1999).]] Google ScholarDigital Library
- SEN, S. AND CHATTERJEE, S. 2000. Towards a theory of cache-efficient algorithms. In Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms (2000).]] Google ScholarDigital Library
- SRIVASTAVA, A. AND EUSTACE, A. 1994. ATOM: A system for building customized program analysis tools. In Proceedings of the SIGPLAN '93 Conference on Programming Language Design and Implementation (June 1994), pp. 196-205.]] Google ScholarDigital Library
- VITTER, J. S. AND SHRIVER, E.A.M. 1994. Algorithms for parallel memory I: Two-level memories. 12, 2-3, 110-147.]]Google Scholar
Index Terms
- Efficient sorting using registers and caches
Recommendations
Efficient Sorting Using Registers and Caches
WAE '00: Proceedings of the 4th International Workshop on Algorithm EngineeringModern computer systems have increasingly complex memory systems. Common machine models for algorithm analysis do not reflect many of the features of these systems, e.g., large register sets, lockup-free caches, cache hierarchies, associativity, cache ...
Reducing energy and delay using efficient victim caches
ISLPED '03: Proceedings of the 2003 international symposium on Low power electronics and designIn this paper, we investigate methods for improving the hit rates in the first level of memory hierarchy. Particularly, we propose victim cache structures to reduce the number of accesses to more power consuming structures such as level 2 caches. We ...
Comments