Skip to main content
main-content
Top

About this book

To satisfy the higher requirements of digitally converged embedded systems, this book describes heterogeneous multicore technology that uses various kinds of low-power embedded processor cores on a single chip. With this technology, heterogeneous parallelism can be implemented on an SoC, and greater flexibility and superior performance per watt can then be achieved. This book defines the heterogeneous multicore architecture and explains in detail several embedded processor cores including CPU cores and special-purpose processor cores that achieve highly arithmetic-level parallelism. The authors developed three multicore chips (called RP-1, RP-2, and RP-X) according to the defined architecture with the introduced processor cores. The chip implementations, software environments, and applications running on the chips are also explained in the book.

Provides readers an overview and practical discussion of heterogeneous multicore technologies from both a hardware and software point of view;Discusses a new, high-performance and energy efficient approach to designing SoCs for digitally converged, embedded systems;Covers hardware issues such as architecture and chip implementation, as well as software issues such as compilers, operating systems, and application programs;Describes three chips developed according to the defined heterogeneous multicore architecture, including chip implementations, software environments, and working applications.

Table of Contents

Frontmatter

Chapter 1. Background

Abstract
Since the mid-1990s, the concept of “digital convergence” has been proposed and discussed from both technological and business viewpoints [1]. In the twenty-first century, “digital convergence” has become stronger and stronger in various digital fields. It is especially notable in the recent trend in digital consumer products such as cellular phones, car information systems, and digital TVs (Fig. 1.1) [2, 3]. This trend will become more widespread in various embedded systems, and it will expand the conventional market due to the development of new functional products and also lead to the creation of new markets for goods such as robots.
Kunio Uchiyama, Fumio Arakawa, Hironori Kasahara, Tohru Nojiri, Hideyuki Noda, Yasuhiro Tawara, Akio Idehara, Kenichi Iwata, Hiroaki Shikano

Chapter 2. Heterogeneous Multicore Architecture

Abstract
In order to satisfy the high-performance and low-power requirements for advanced embedded systems with greater flexibility, it is necessary to develop parallel processing on chips by taking advantage of the advances being made in semiconductor integration. Figure 2.1 illustrates the basic architecture of our heterogeneous multicore [1, 2]. Several low-power CPU cores and special purpose processor (SPP) cores, such as a digital signal processor, a media processor, and a dynamically reconfigurable processor, are embedded on a chip. In the figure, the number of CPU cores is m. There are two types of SPP cores, SPPa and SPPb, on the chip. The values n and k represent the respective number of SPPa and SPPb cores. Each processor core includes a processing unit (PU), a local memory (LM), and a data transfer unit (DTU) as the main elements. The PU executes various kinds of operations. For example, in a CPU core, the PU includes arithmetic units, register files, a program counter, control logic, etc., and executes machine instructions. With some SPP cores like the dynamic reconfigurable processor, the PU executes a large quantity of data in parallel using its array of arithmetic units. The LM is a small-size and low-latency memory and is mainly accessed by the PU in the same core during the PU’s execution. Some cores may have caches as well as an LM or may only have caches without an LM. The LM is necessary to meet the real-time requirements of embedded systems. The access time to a cache is non-deterministic because of cache misses. On the other hand, the access to an LM is deterministic. By putting a program and data in the LM, we can accurately estimate the execution cycles of a program that has hard real-time requirements. A data transfer unit (DTU) is also embedded in the core to achieve parallel execution of internal operation in the core and data transfer operations between cores and memories. Each PU in a core processes the data on its LM or its cache, and the DTU simultaneously executes memory-to-memory data transfer between cores. The DTU is like a direct memory controller (DMAC) and executes a command that transfers data between several kinds of memories, then checks and waits for the end of the data transfer, etc. Some DTUs are capable of command chaining, where multiple commands are executed in order. The frequency and voltage controller (FVC) connected to each core controls the frequency, voltage, and power supply of each core independently and reduces the total power consumption of the chip. If the frequencies or power supplies of the core’s PU, DTU, and LM can be independently controlled, the FVC can vary their frequencies and power supplies individually. For example, the FVC can stop the frequency of the PU and run the frequencies of the DTU and LM when the core is executing only data transfers. The on-chip shared memory (CSM) is a medium-sized on-chip memory that is commonly used by cores. Each core is connected to the on-chip interconnect, which may be several types of buses or crossbar switches. The chip is also connected to the off-chip main memory, which has a large capacity but high latency.
Kunio Uchiyama, Fumio Arakawa, Hironori Kasahara, Tohru Nojiri, Hideyuki Noda, Yasuhiro Tawara, Akio Idehara, Kenichi Iwata, Hiroaki Shikano

Chapter 3. Processor Cores

Abstract
The processor cores described in this chapter are well tuned for embedded systems. They are SuperHTM RISC engine family processor cores (SH cores) as typical embedded CPU cores, flexible engine/generic ALU array (FE–GA or shortly called FE as flexible engine) as a reconfigurable processor core, MX core as a massively parallel SIMD-type processor, and video processing unit (VPU) as a video processing accelerator. We can implement heterogeneous multicore processor chips with them, and three implemented prototype chips, RP-1, RP-2, and RP-X, are introduced in the Chap. 4.
Kunio Uchiyama, Fumio Arakawa, Hironori Kasahara, Tohru Nojiri, Hideyuki Noda, Yasuhiro Tawara, Akio Idehara, Kenichi Iwata, Hiroaki Shikano

Chapter 4. Chip Implementations

Abstract
Three prototype multicore chips, RP-1, RP-2, and RP-X, were implemented with the highly efficient cores described in Chap. 3. The details of the chips are described in this chapter. The multicore architecture makes it possible to enhance the performance while maintaining the efficiency, but not to enhance the efficiency. Therefore, a multicore with inefficient cores is still inefficient, and the highly efficient cores are the key components to realize a high-performance and highly efficient SoC. However, the multicore requires different technologies from that of a single core to maximize its capabilities. The prototype chips are useful for researching and developing such technologies and have been utilized for developing and evaluating software environments, application programs, and systems (see Chaps. 5 and 6).
Kunio Uchiyama, Fumio Arakawa, Hironori Kasahara, Tohru Nojiri, Hideyuki Noda, Yasuhiro Tawara, Akio Idehara, Kenichi Iwata, Hiroaki Shikano

Chapter 5. Software Environments

Abstract
Linux is one of the operating systems capable of symmetric multiprocessing (SMP). In this subsection, we describe the work of porting SMP Linux to the RP-1, RP-2, and RP-X multicore chips which are explained in Chap. 4, and extending functions of Linux to use the new processor’s features. Each chip has the following special points:
Kunio Uchiyama, Fumio Arakawa, Hironori Kasahara, Tohru Nojiri, Hideyuki Noda, Yasuhiro Tawara, Akio Idehara, Kenichi Iwata, Hiroaki Shikano

Chapter 6. Application Programs and Systems

Abstract
This chapter describes the evaluation of a heterogeneous multicore architecture consisting of a widely used advanced audio codec (AAC) [1] audio encoder implemented on a fabricated chip. The AAC encoder is supported for audio playback by various embedded systems. The processing scheme on the heterogeneous multicore architecture with support of hierarchical memories and data transfer units was newly investigated, and the execution time and power consumption of the encoding were measured.
Kunio Uchiyama, Fumio Arakawa, Hironori Kasahara, Tohru Nojiri, Hideyuki Noda, Yasuhiro Tawara, Akio Idehara, Kenichi Iwata, Hiroaki Shikano

Backmatter

Additional information