research-article

Balancing Programmability and Silicon Efficiency of Heterogeneous Multicore Architectures

Authors:
Andrei Terechko, NXP Semiconductors, The Netherlands
Jan Hoogerbrugge, NXP Semiconductors, The Netherlands
Ghiath Alkadi, NXP Semiconductors, The Netherlands
Surendra Guntur, NXP Semiconductors, The Netherlands
Anirban Lahiri, NXP Semiconductors, The Netherlands
Marc Duranton, NXP Semiconductors, The Netherlands
Clemens Wüst, NXP Semiconductors, The Netherlands
Phillip Christie, NXP Semiconductors, The Netherlands
Axel Nackaerts, NXP Semiconductors, The Netherlands
Aatish Kumar, NXP Semiconductors, The Netherlands

ACM Transactions on Embedded Computing Systems, Volume 11S, Issue 1, Article No. 14, pp. 1–32. https://doi.org/10.1145/2180887.2180890

Published: 01 June 2012

Abstract

Multicore architectures provide scalable performance with a lower hardware design effort than single-core processors. This article presents a design methodology and an embedded multicore architecture that focus on reducing software design complexity and boosting performance density. First, we analyze the characteristics of task-level parallelism in modern multimedia workloads and use them to formulate requirements for the programming model. We then translate these requirements into an architecture specification, which includes a novel low-complexity implementation of cache coherence and a hardware synchronization unit. Our evaluation demonstrates that the novel coherence mechanism substantially simplifies hardware design while reducing performance by less than 18% relative to a complex snooping technique. Compared to a single processor core, multicores have already proven to be more area- and energy-efficient. However, multicore architectures in embedded systems still compete with highly efficient function-specific hardware accelerators. In this article we identify five architectural methods to boost the performance density of multicores: microarchitectural downscaling, asymmetric multicore architectures, multithreading, generic accelerators, and conjoining. We then present a novel methodology for exploring multicore design spaces that takes these methods into account. The methodology is based on a complex formula that computes the performance of heterogeneous multicore systems. Using this design space exploration methodology for HD and QuadHD H.264 video decoding, we estimate that the required areas of multicores in 45 nm CMOS are 2.5 mm² and 8.6 mm², respectively. These results suggest that heterogeneous multicores are cost-effective for embedded applications and can provide good programmability support.
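
The abstract does not reproduce the article's performance formula, so the following is only an illustrative sketch of how performance density (performance per unit of silicon area) can be compared between a symmetric and an asymmetric multicore. It swaps in the well-known Hill and Marty extension of Amdahl's law, not the authors' model. The function names, the square-root core performance rule, and the per-core area of 0.5 mm² are assumptions chosen for illustration.

```python
import math


def core_perf(r):
    # Assumption (not from the article): a core built from r base-core
    # equivalents (BCEs) delivers sqrt(r) times the base-core performance.
    return math.sqrt(r)


def symmetric_speedup(f, n, r):
    # Hill-Marty symmetric multicore: a budget of n BCEs is split into
    # n / r identical cores of r BCEs each; f is the parallel fraction.
    serial_time = (1.0 - f) / core_perf(r)
    parallel_time = f * r / (core_perf(r) * n)
    return 1.0 / (serial_time + parallel_time)


def asymmetric_speedup(f, n, r):
    # Hill-Marty asymmetric multicore: one large core of r BCEs plus
    # n - r base cores; the parallel phase uses all of them together.
    serial_time = (1.0 - f) / core_perf(r)
    parallel_time = f / (core_perf(r) + (n - r))
    return 1.0 / (serial_time + parallel_time)


def performance_density(speedup, n, bce_area_mm2):
    # Performance per mm^2 for a chip made of n BCEs.
    # bce_area_mm2 is a hypothetical per-BCE area, not a value from the article.
    return speedup / (n * bce_area_mm2)


if __name__ == "__main__":
    f, n, r = 0.9, 16, 4          # illustrative workload and area budget
    bce_area = 0.5                # hypothetical mm^2 per base core
    for name, fn in (("symmetric", symmetric_speedup),
                     ("asymmetric", asymmetric_speedup)):
        s = fn(f, n, r)
        d = performance_density(s, n, bce_area)
        print(f"{name}: speedup={s:.2f}, density={d:.3f} per mm^2")
```

Under these assumptions the asymmetric configuration achieves a higher speedup for the same area budget, which is the kind of trade-off a design space exploration would quantify with measured core areas and workload profiles.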

Published in

ACM Transactions on Embedded Computing Systems Volume 11S, Issue 1
June 2012
283 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/2180887

Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States

Journal Family
ACM Journals for the Design of Smart and Connected Systems
Publication History
- Published: 1 June 2012
- Accepted: 1 March 2010
- Revised: 1 October 2009
- Received: 1 February 2009
Author Tags
Parallelism
accelerators
design space exploration
embedded systems
heterogeneous
multimedia
multiprocessor
processor architecture
programming models
Qualifiers
- research-article
- Research
- Refereed

Article Metrics
- Total Citations: 1
- Total Downloads: 371
- Downloads (Last 12 months): 3
- Downloads (Last 6 weeks): 0