skip to main content
research-article

Data reorganization in memory using 3D-stacked DRAM

Published:13 June 2015Publication History
Skip Abstract Section

Abstract

In this paper we focus on common data reorganization operations such as shuffle, pack/unpack, swap, transpose, and layout transformations. Although these operations simply relocate the data in the memory, they are costly on conventional systems mainly due to inefficient access patterns, limited data reuse and roundtrip data traversal throughout the memory hierarchy. This paper presents a two pronged approach for efficient data reorganization, which combines (i) a proposed DRAM-aware reshape accelerator integrated within 3D-stacked DRAM, and (ii) a mathematical framework that is used to represent and optimize the reorganization operations.

We evaluate our proposed system through two major use cases. First, we demonstrate the reshape accelerator in performing a physical address remapping via data layout transform to utilize the internal parallelism/locality of the 3D-stacked DRAM structure more efficiently for general purpose workloads. Then, we focus on offloading and accelerating commonly used data reorganization routines selected from the Intel Math Kernel Library package. We evaluate the energy and performance benefits of our approach by comparing it against existing optimized implementations on state-of-the-art GPUs and CPUs. For the various test cases, in-memory data reorganization provides orders of magnitude performance and energy efficiency improvements via low overhead hardware.

References

  1. "CACTI 6.5, HP labs," http://www.hpl.hp.com/research/cacti/.Google ScholarGoogle Scholar
  2. "DDR3-1600 dram datasheet, MT41J256M4, Micron," http://www.micron.com/parts/dram/ddr3-sdram.Google ScholarGoogle Scholar
  3. "Intel math kernel library (MKL)," http://software.intel.com/en-us/articles/intel-mkl/.Google ScholarGoogle Scholar
  4. "McPAT 1.0, HP labs," http://www.hpl.hp.com/research/mcpat/.Google ScholarGoogle Scholar
  5. "Performance application programming interface (PAPI)," http://icl.cs.utk.edu/papi/.Google ScholarGoogle Scholar
  6. "Gromacs," http://www.gromacs.org, 2008.Google ScholarGoogle Scholar
  7. "Itrs interconnect working group, winter update," http://www.itrs.net/, Dec 2012.Google ScholarGoogle Scholar
  8. "Memory scheduling championship (MSC)," http://www.cs.utah.edu/rajeev/jwac12/, 2012.Google ScholarGoogle Scholar
  9. "High bandwidth memory (HBM) dram," JEDEC, JESD235, 2013.Google ScholarGoogle Scholar
  10. "Intel 64 and ia-32 architectures software developers," http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.pdf, October 2014.Google ScholarGoogle Scholar
  11. B. Akin, F. Franchetti, and J. C. Hoe, "FFTS with near-optimal memory access through block data layouts," in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2014, Florence, Italy, May 4--9, 2014, 2014, pp. 3898--3902.Google ScholarGoogle Scholar
  12. B. Akin, F. Franchetti, and J. C. Hoe, "Understanding the design space of dram-optimized hardware FFT accelerators," in IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors, ASAP 2014, Zurich, Switzerland, June 18--20, 2014, 2014, pp. 248--255.Google ScholarGoogle Scholar
  13. B. Akin, J. C. Hoe, and F. Franchetti, "Hamlet: Hardware accelerated memory layout transform within 3d-stacked DRAM," in IEEE High Performance Extreme Computing Conference, HPEC 2014, Waltham, MA, USA, September 9--11, 2014, 2014, pp. 1--6.Google ScholarGoogle Scholar
  14. B. Akin, P. A. Milder, F. Franchetti, and J. C. Hoe, "Memory bandwidth efficient two-dimensional fast fourier transform algorithm and implementation for large problem sizes," in 2012 IEEE 20th Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2012, 29 April -- 1 May 2012, Toronto, Ontario, Canada, 2012, pp. 188--191. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. A. Basu, J. Gandhi, J. Chang, M. D. Hill, and M. M. Swift, "Efficient virtual memory for big memory servers," in Proceedings of the 40th Annual International Symposium on Computer Architecture. ACM, 2013, pp. 237--248. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. G. Baumgartner, A. Auer, D. Bernholdt, A. Bibireata, V. Choppella, D. Cociorva, X. Gao, R. Harrison, S. Hirata, S. Krishnamoorthy, S. Krishnan, C. Lam, Q. Lu, M. Nooijen, R. Pitzer, J. Ramanujam, P. Sadayappan, and A. Sibiryakov, "Synthesis of high-performance parallel programs for a class of ab initio quantum chemistry models," Proceedings of the IEEE, vol. 93, no. 2, pp. 276--292, Feb 2005.Google ScholarGoogle ScholarCross RefCross Ref
  17. C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The parsec benchmark suite: Characterization and architectural implications," in Proceedings of the 17th international conference on Parallel architectures and compilation techniques. ACM, 2008, pp. 72--81. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. A. Buluç, J. T. Fineman, M. Frigo, J. R. Gilbert, and C. E. Leiserson, "Parallel sparse matrix-vector and matrix-transpose-vector multiplication using compressed sparse blocks," in Proceedings of the twenty-first annual symposium on Parallelism in algorithms and architectures. ACM, 2009, pp. 233--244. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. J. Carter, W. Hsieh, L. Stoller, M. Swanson, L. Zhang, E. Brunvand, A. Davis, C.-C. Kuo, R. Kuramkote, M. Parker, L. Schaelicke, and T. Tateyama, "Impulse: building a smarter memory controller," in High-Performance Computer Architecture, 1999. Proceedings. Fifth International Symposium On, Jan 1999, pp. 70--79. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. N. Chatterjee, R. Balasubramonian, M. Shevgoor, S. Pugsley, A. Udipi, A. Shafiee, K. Sudan, M. Awasthi, and Z. Chishti, "Usimm: the utah simulated memory module," 2012.Google ScholarGoogle Scholar
  21. S. Che, J. W. Sheaffer, and K. Skadron, "Dymaxion: Optimizing memory access patterns for heterogeneous systems," in Proc. of Intl. Conf. for High Perf. Comp., Networking, Storage and Analysis (SC), 2011, pp. 13:1--13:11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. K. Chen, S. Li, N. Muralimanohar, J.-H. Ahn, J. Brockman, and N. Jouppi, "CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory," in Design, Automation Test in Europe (DATE), 2012, pp. 33--38. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. T. O. Dickson, Y. Liu, S. V. Rylov, B. Dang, C. K. Tsang, P. S. Andry, J. F. Bulzacchelli, H. A. Ainspan, X. Gu, L. Turlapati et al., "An 8x 10-gb/s source-synchronous i/o system based on high-density silicon carrier interconnects," Solid-State Circuits, IEEE Journal of, vol. 47, no. 4, pp. 884--896, 2012.Google ScholarGoogle ScholarCross RefCross Ref
  24. X. Dong, Y. Xie, N. Muralimanohar, and N. P. Jouppi, "Simple but effective heterogeneous main memory with on-chip memory controller support," in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society, 2010, pp. 1--11. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. R. G. Dreslinski, D. Fick, B. Giridhar, G. Kim, S. Seo, M. Fojtik, S. Satpathy, Y. Lee, D. Kim, N. Liu, M. Wieckowski, G. Chen, D. Sylvester, D. Blaauw, and T. Mudge, "Centip3de: A many-core prototype exploring 3d integration and near-threshold computing," Commun. ACM, vol. 56, no. 11, pp. 97--104, Nov. 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. A. Farmahini-Farahani, J. H. Ahn, K. Morrow, and N. S. Kim, "Nda: Near-dram acceleration architecture leveraging commodity dram devices and standard memory modules," in High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on, Feb 2015, pp. 283--295.Google ScholarGoogle Scholar
  27. M. Frigo and S. G. Johnson, "The design and implementation of FFTW3," Proceedings of the IEEE, Special issue on "Program Generation, Optimization, and Platform Adaptation", vol. 93, no. 2, pp. 216--231, 2005.Google ScholarGoogle Scholar
  28. M. Gokhale, B. Holmes, and K. Iobst, "Processing in memory: the terasys massively parallel pim array," Computer, vol. 28, no. 4, pp. 23--31, Apr 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. K. Goto and R. A. v. d. Geijn, "Anatomy of high-performance matrix multiplication," ACM Trans. Math. Softw., vol. 34, no. 3, pp. 12:1--12:25, May 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. C. Gou, G. Kuzmanov, and G. N. Gaydadjiev, "Sams multi-layout memory: Providing multiple views of data to boost simd performance," in Proceedings of the 24th ACM International Conference on Supercomputing, ser. ICS '10, 2010, pp. 179--188. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Q. Guo, N. Alachiotis, B. Akin, F. Sadi, G. Xu, T. M. Low, L. Pileggi, J. C. Hoe, and F. Franchetti, "3d-stacked memory-side acceleration: Accelerator and system design," in In the Workshop on Near-Data Processing (WoNDP) (Held in conjunction with MICRO-47.), 2014.Google ScholarGoogle Scholar
  32. J. L. Henning, "Spec cpu2006 benchmark descriptions," ACM SIGARCH Computer Architecture News, vol. 34, no. 4, pp. 1--17, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. M. Islam, M. Scrback, K. Kavi, M. Ignatowski, and N. Jayasena, "Improving node-level map-reduce performance using processing-in-memory technologies," in 7th Workshop on UnConventional High Performance Computing held in conjunction with the EuroPar 2014, ser. UCHPC2014, 2014.Google ScholarGoogle Scholar
  34. J. Jeddeloh and B. Keeth, "Hybrid memory cube new dram architecture increases density and performance," in VLSI Technology (VLSIT), 2012 Symposium on, June 2012, pp. 87--88.Google ScholarGoogle Scholar
  35. M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee, "Improving locality using loop and data transformations in an integrated framework," in Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture, ser. MICRO 31, 1998, pp. 285--297. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Y. Kang, W. Huang, S.-M. Yoo, D. Keen, Z. Ge, V. Lam, P. Pattnaik, and J. Torrellas, "Flexram: Toward an advanced intelligent memory system," in Computer Design (ICCD), 2012 IEEE 30th International Conference on. IEEE, 2012, pp. 5--14. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. S. W. Keckler, W. J. Dally, B. Khailany, M. Garland, and D. Glasco, "Gpus and the future of parallel computing," IEEE Micro, vol. 31, no. 5, pp. 7--17, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. G. Kestor, R. Gioiosa, D. Kerbyson, and A. Hoisie, "Quantifying the energy cost of data movement in scientific applications," in Workload Characterization (IISWC), 2013 IEEE International Symposium on, Sept 2013, pp. 56--65.Google ScholarGoogle Scholar
  39. D. H. Kim, K. Athikulwongse, M. Healy, M. Hossain, M. Jung, I. Khorosh, G. Kumar, Y.-J. Lee, D. Lewis, T.-W. Lin, C. Liu, S. Panth, M. Pathak, M. Ren, G. Shen, T. Song, D. H. Woo, X. Zhao, J. Kim, H. Choi, G. Loh, H.-H. Lee, and S.-K. Lim, "3d-maps: 3d massively parallel processor with stacked memory," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International, Feb 2012, pp. 188--190.Google ScholarGoogle Scholar
  40. G. H. Loh, "3d-stacked memory architectures for multi-core processors," in Proc. of the 35th Annual International Symposium on Computer Architecture, (ISCA), 2008, pp. 453--464. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. M. Mansuri, J. E. Jaussi, J. T. Kennedy, T. Hsueh, S. Shekhar, G. Balamurugan, F. O'Mahony, C. Roberts, R. Mooney, and B. Casper, "A scalable 0.128-to-1tb/s 0.8-to-2.6 pj/b 64-lane parallel i/o in 32nm cmos," in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2013 IEEE International. IEEE, 2013, pp. 402--403.Google ScholarGoogle Scholar
  42. G. M. Morton, A computer oriented geodetic data base and a new technique in file sequencing. International Business Machines Company, 1966.Google ScholarGoogle Scholar
  43. M. Oskin, F. T. Chong, and T. Sherwood, "Active pages: A computation model for intelligent memory," in ISCA, 1998, pp. 192--203. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. N. Park, B. Hong, and V. Prasanna, "Tiling, block data layout, and memory hierarchy performance," IEEE Transactions on Parallel and Distributed Systems, vol. 14, no. 7, pp. 640--654, July 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick, "A case for intelligent ram," Micro, IEEE, vol. 17, no. 2, pp. 34--44, Mar 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. J. T. Pawlowski, "Hybrid memory cube (HMC)," in Hotchips, 2011.Google ScholarGoogle Scholar
  47. J. W. Poulton, W. J. Dally, X. Chen, J. G. Eyles, T. H. Greer, S. G. Tell, J. M. Wilson, and C. T. Gray, "A 0.54 pj/b 20 gb/s ground-referenced single-ended short-reach serial link in 28 nm cmos for advanced packaging applications," 2013.Google ScholarGoogle Scholar
  48. S. Pugsley, J. Jestes, H. Zhang, R. Balasubramonian, V. Srinivasan, A. Buyuktosunoglu, A. Davis, and F. Li, "NDC: Analyzing the impact of 3D-stacked memory+logic devices on mapreduce workloads," in Proc. of IEEE Intl. Symp. on Perf. Analysis of Sys. and Soft. (ISPASS), 2014.Google ScholarGoogle Scholar
  49. M. Püschel, P. A. Milder, and J. C. Hoe, "Permuting streaming data using rams," J. ACM, vol. 56, no. 2, pp. 10:1--10:34, Apr. 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo, "SPIRAL: Code generation for DSP transforms," Proc. of IEEE, special issue on "Program Generation, Optimization, and Adaptation", vol. 93, no. 2, pp. 232--275, 2005.Google ScholarGoogle Scholar
  51. L. E. Ramos, E. Gorbatov, and R. Bianchini, "Page placement in hybrid memory systems," in Proceedings of the international conference on Supercomputing. ACM, 2011, pp. 85--95. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. G. Ruetsch and P. Micikevicius, "Optimizing matrix transpose in CUDA," Nvidia CUDA SDK Application Note, 2009.Google ScholarGoogle Scholar
  53. V. Seshadri, Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry, "Rowclone: Fast and energy-efficient in-dram bulk data copy and initialization," in Proc. of the IEEE/ACM Intl. Symp. on Microarchitecture, ser. MICRO-46, 2013, pp. 185--197. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. K. Sudan, N. Chatterjee, D. Nellans, M. Awasthi, R. Balasubramonian, and A. Davis, "Micro-pages: Increasing dram efficiency with locality-aware data placement," in Proc. of Arch. Sup. for Prog. Lang. and OS, ser. ASPLOS XV, 2010, pp. 219--230. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. I.-J. Sung, G. Liu, and W.-M. Hwu, "Dl: A data layout transformation system for heterogeneous computing," in Innovative Parallel Computing (InPar), 2012, May 2012, pp. 1--11.Google ScholarGoogle Scholar
  56. C. Van Loan, Computational frameworks for the fast Fourier transform. SIAM, 1992. Google ScholarGoogle ScholarCross RefCross Ref
  57. C. Weis, I. Loi, L. Benini, and N. Wehn, "Exploration and optimization of 3-d integrated dram subsystems," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 32, no. 4, pp. 597--610, April 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. D. H. Woo, N. H. Seong, D. L. Lewis, and H.-H. Lee, "An optimized 3d-stacked memory architecture by exploiting excessive, high-density tsv bandwidth," in High Performance Computer Architecture (HPCA), 2010 IEEE 16th International Symposium on. IEEE, 2010, pp. 1--12.Google ScholarGoogle Scholar
  59. J. Xiong, J. Johnson, R. W. Johnson, and D. Padua, "SPL: A language and compiler for DSP algorithms," in Programming Languages Design and Implementation (PLDI), 2001, pp. 298--308. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. D. Zhang, N. Jayasena, A. Lyashevsky, J. L. Greathouse, L. Xu, and M. Ignatowski, "Top-pim: Throughput-oriented programmable processing in memory," in Proceedings of the 23rd International Symposium on High-performance Parallel and Distributed Computing, ser. HPDC '14. New York, NY, USA: ACM, 2014, pp. 85--98. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. Z. Zhang, Z. Zhu, and X. Zhang, "A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality," in In Proceedings of the 33rd Annual International Symposium on Microarchitecture. ACM Press, 2000, pp. 32--41. Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. L. Zhao, R. Iyer, S. Makineni, L. Bhuyan, and D. Newell, "Hardware support for bulk data movement in server platforms," in Proc. of IEEE Intl. Conf. on Computer Design, (ICCD), Oct 2005, pp. 53--60. Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. Q. Zhu, B. Akin, H. Sumbul, F. Sadi, J. Hoe, L. Pileggi, and F. Franchetti, "A 3d-stacked logic-in-memory accelerator for application-specific data intensive computing," in 3D Systems Integration Conference (3DIC), 2013 IEEE International, Oct 2013, pp. 1--7.Google ScholarGoogle Scholar

Index Terms

  1. Data reorganization in memory using 3D-stacked DRAM

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM SIGARCH Computer Architecture News
        ACM SIGARCH Computer Architecture News  Volume 43, Issue 3S
        ISCA'15
        June 2015
        745 pages
        ISSN:0163-5964
        DOI:10.1145/2872887
        Issue’s Table of Contents
        • cover image ACM Conferences
          ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
          June 2015
          768 pages
          ISBN:9781450334020
          DOI:10.1145/2749469

        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 13 June 2015

        Check for updates

        Qualifiers

        • research-article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader