research-article

Balancing Programmability and Silicon Efficiency of Heterogeneous Multicore Architectures

Published: 01 June 2012

Abstract

Multicore architectures provide scalable performance with a lower hardware design effort than single-core processors. This article presents a design methodology and an embedded multicore architecture, focusing on reducing software design complexity and boosting performance density. First, we analyze the characteristics of task-level parallelism in modern multimedia workloads. These characteristics are used to formulate requirements for the programming model. We then translate the programming model requirements into an architecture specification, including a novel low-complexity implementation of cache coherence and a hardware synchronization unit. Our evaluation demonstrates that the novel coherence mechanism substantially simplifies hardware design, while reducing performance by less than 18% relative to a complex snooping technique. Multicores have already proven to be more area- and energy-efficient than a single processor core. However, multicore architectures in embedded systems still compete with highly efficient function-specific hardware accelerators. In this article we identify five architectural methods to boost the performance density of multicores: microarchitectural downscaling, asymmetric multicore architectures, multithreading, generic accelerators, and conjoining. We then present a novel methodology to explore multicore design spaces, including the architectural methods that improve performance density. The methodology is based on a complex formula that computes the performance of heterogeneous multicore systems. Using this design space exploration methodology for HD and QuadHD H.264 video decoding, we estimate that the required multicore areas in 45 nm CMOS are 2.5 mm² and 8.6 mm², respectively. These results suggest that heterogeneous multicores are cost-effective for embedded applications and can provide good programmability support.
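As a rough reading aid for the two area estimates above, the following sketch compares how the estimated silicon area grows against how the per-frame pixel count grows. The frame sizes (1920×1080 for HD, 3840×2160 for QuadHD) are assumptions, since the abstract does not state them, and the function name `scaling_ratios` is purely illustrative. This is not the article's performance formula, only arithmetic on the quoted numbers.

```python
# Areas quoted in the abstract (45 nm CMOS), in mm^2.
AREA_HD_MM2 = 2.5
AREA_QUADHD_MM2 = 8.6

# ASSUMPTION: frame sizes are not given in the abstract; these are
# the commonly used resolutions for HD and QuadHD video.
HD_PIXELS = 1920 * 1080
QUADHD_PIXELS = 3840 * 2160


def scaling_ratios(area_small, area_large, pixels_small, pixels_large):
    """Return (area ratio, pixel ratio) between two design points."""
    return area_large / area_small, pixels_large / pixels_small


area_ratio, pixel_ratio = scaling_ratios(
    AREA_HD_MM2, AREA_QUADHD_MM2, HD_PIXELS, QUADHD_PIXELS
)
print(f"area ratio:  {area_ratio:.2f}")
print(f"pixel ratio: {pixel_ratio:.2f}")
```

Under these assumed resolutions, the estimated area grows somewhat less than the pixel count does. Frame rates and bit rates are not given in the abstract, so the comparison says nothing about throughput per unit area.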

            • Published in

              ACM Transactions on Embedded Computing Systems, Volume 11S, Issue 1
              June 2012
              283 pages
              ISSN: 1539-9087
              EISSN: 1558-3465
              DOI: 10.1145/2180887

              Copyright © 2012 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 1 June 2012
              • Accepted: 1 March 2010
              • Revised: 1 October 2009
              • Received: 1 February 2009

              Qualifiers

              • research-article
              • Research
              • Refereed
