Abstract
Multicore architectures provide scalable performance with a lower hardware design effort than single-core processors. Our article presents a design methodology and an embedded multicore architecture, focusing on reducing software design complexity and boosting performance density. First, we analyze the characteristics of Task-Level Parallelism in modern multimedia workloads. These characteristics are used to formulate requirements for the programming model. We then translate the programming model requirements into an architecture specification, including a novel low-complexity implementation of cache coherence and a hardware synchronization unit. Our evaluation demonstrates that the novel coherence mechanism substantially simplifies hardware design while reducing performance by less than 18% relative to a complex snooping technique. Multicores have already proven to be more area- and energy-efficient than a single processor core. However, multicore architectures in embedded systems still compete with highly efficient function-specific hardware accelerators. In this article we identify five architectural methods to boost the performance density of multicores: microarchitectural downscaling, asymmetric multicore architectures, multithreading, generic accelerators, and conjoining. We then present a novel methodology to explore multicore design spaces, including the architectural methods that improve performance density. The methodology is based on a complex formula that computes the performance of heterogeneous multicore systems. Using this design space exploration methodology for HD and QuadHD H.264 video decoding, we estimate that the required areas of multicores in 45 nm CMOS are 2.5 mm² and 8.6 mm², respectively. These results suggest that heterogeneous multicores are cost-effective for embedded applications and can provide good programmability support.
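The abstract does not give the performance formula itself, so the sketch below illustrates the idea of performance density (performance per unit area) for an asymmetric multicore using a well-known stand-in: Hill and Marty's asymmetric extension of Amdahl's law. It is not the article's formula. The function names, the square-root performance assumption for the big core, and the use of base-core units as the area measure are all assumptions made for illustration.

```python
import math


def perf(r: float) -> float:
    """Assumed performance of a core built from r base-core units of area.

    Uses the common square-root assumption (Pollack's rule); the article's
    own core performance model is not stated in the abstract.
    """
    return math.sqrt(r)


def asymmetric_speedup(f: float, n: int, r: int) -> float:
    """Hill-Marty speedup of one big core (r units) plus n - r base cores.

    f is the parallel fraction of the workload, n the total area budget in
    base-core units. The serial part runs on the big core alone; the
    parallel part runs on the big core and all base cores together.
    """
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be in [0, 1]")
    if not 1 <= r <= n:
        raise ValueError("r must satisfy 1 <= r <= n")
    serial_time = (1.0 - f) / perf(r)
    parallel_time = f / (perf(r) + (n - r))
    return 1.0 / (serial_time + parallel_time)


def performance_density(f: float, n: int, r: int) -> float:
    """Speedup per base-core unit of area (hypothetical density metric)."""
    return asymmetric_speedup(f, n, r) / n


if __name__ == "__main__":
    # Sweep the big-core size for a fixed area budget and parallel fraction.
    for r in (1, 2, 4, 8, 16):
        print(r, round(performance_density(0.9, 16, r), 3))
```

Sweeping `r` for a fixed area budget in this way is a toy version of design space exploration: each candidate configuration is scored by estimated performance per area, and the best-scoring one is kept.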
Index Terms
- Balancing Programmability and Silicon Efficiency of Heterogeneous Multicore Architectures