
About this book

This book constitutes the refereed proceedings of the 16th National Conference on Computer Engineering and Technology, NCCET 2012, held in Shanghai, China, in August 2012. The 27 papers presented were carefully reviewed and selected from 108 submissions. They are organized in topical sections named: microprocessor and implementation; design of integration circuit; I/O interconnect; and measurement, verification, and others.



Session 1: Microprocessor and Implementation

A Method of Balancing the Global Multi-mode Clock Network in Ultra-large Scale CPU

Balancing a global multi-mode clock tree is a long-discussed problem. An unbalanced clock tree can cause many problems, such as timing violations and increased density and power consumption. This article presents a balancing method that adopts a redundant clock mux. The basic idea is to maximize reuse of the clock tree across modes while keeping the sub-clock trees within the sub-blocks unchanged. The method was verified on a demo chip in a 40 nm process, where it substantially reduced density, leakage, and power consumption.
Zhuo Ma, Zhenyu Zhao, Yang Guo, Lunguo Xie, Jinshan Yu

Hardware Architecture for the Parallel Generation of Long-Period Random Numbers Using MT Method

Random numbers are extremely important in scientific and computational applications. The Mersenne Twister (MT) is one of the most widely used high-quality pseudo-random number generators (PRNGs) based on binary linear recurrences. This paper proposes a hardware architecture for the parallel generation of long-period random numbers using the MT19937 method. The design is implemented on a Xilinx XC6VLX240T FPGA device and is capable of producing multiple samples each period. This lets it obtain higher throughput than the non-parallel architecture and software. The samples generated by the design are applied to a Monte Carlo simulation for estimating the value of π, achieving an accuracy of 99.99%.
Shengfei Wu, Jiang Jiang, Yuzhuo Fu
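
The Monte Carlo check mentioned in the abstract can be sketched in software. This is a minimal sketch, not the authors' hardware design: it relies on the fact that CPython's `random.Random` is itself an MT19937 generator, and the function name `estimate_pi`, the seed, and the sample count are illustrative assumptions.

```python
import math
import random


def estimate_pi(num_samples, seed=2012):
    """Estimate pi by sampling points in the unit square.

    The fraction of points that fall inside the quarter circle of
    radius 1 approximates pi / 4.
    """
    rng = random.Random(seed)  # CPython's generator is MT19937 (seed is arbitrary)
    inside = 0
    for _ in range(num_samples):
        x = rng.random()
        y = rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples


if __name__ == "__main__":
    estimate = estimate_pi(200_000)
    print(f"estimate = {estimate:.5f}, error = {abs(estimate - math.pi):.5f}")
```

The statistical error of this estimator shrinks only with the square root of the sample count, which is one reason a generator that delivers several samples per period is attractive for this kind of workload.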

MGTE: A Multi-level Hybrid Verification Platform for a 16-Core Processor

With the wide application of multi-core, multi-thread processors in various computing fields, the simulation and verification of processors have become increasingly important. In this paper, a multi-level hybrid verification platform called MGTE is designed and developed for PX-16, a 16-core processor. MGTE supports software simulation and hardware emulation at the module level, sub-system level, or full-chip level, so it can verify the processor throughout all design phases, from the details to the whole. MGTE also supports hybrid verification of behavior models, RTL code, and netlists, which improves simulation performance. The results show that MGTE can effectively ease the functional verification and preliminary performance evaluation of the PX-16 processor.
Xiaobo Yan, Rangyu Deng, Caixia Sun, Qiang Dou

An Efficient Parallel SURF Algorithm for Multi-core Processor

In this paper, we propose an efficient parallel SURF algorithm for a multi-core processor, which adopts a data-level parallel method to implement parallel keypoint extraction and matching. The computing tasks are assigned to four DSP cores for parallel processing. The multi-core processor uses QLink and SDP for data communication and synchronization among the DSP cores, respectively, which fully exploits the multi-level parallelism and the strong computing power of the multi-core processor. The parallel SURF algorithm is tested on 5 different image samples with scale change, rotation, change in illumination, addition of noise, and affine transformation. The experimental results show that the parallel SURF algorithm adapts well to various distorted images and has image matching ability close to that of the sequential algorithm, with an average speedup of 3.61.
Zhong Liu, Binchao Xing, Yueyue Chen

A Study of Cache Design in Stream Processor

Stream architecture is a newly developed high-performance processor architecture oriented to multimedia processing. FT64 is a 64-bit programmable stream processor that aims at exploiting the parallelism and locality of applications. In this paper, we first inspect the memory access characteristics of FT64 with and without a cache. Second, we propose an improved cache design method. Making use of the stream data type used by FT64, the improved method avoids loading data from memory when a stream store instruction that fully modifies a cache block misses. The experiments show that performance improves by 20.7% and 25.8% when a normal cache and an improved cache are used, respectively. Finally, we study the influence of cache capacity and associativity on performance. The results show that better performance can be achieved with a small cache and an associativity of 2 or 4.
Chiyuan Ma, Zhenyu Zhao

Design and Implementation of Dynamically Reconfigurable Token Coherence Protocol for Many-Core Processor

Efficiently maintaining cache coherence in a many-core processor remains a big challenge today. Traditional protocols offer either low cache miss latency (like snoopy protocols) or independence from bus-like interconnects (like directory protocols). Recently, Token Coherence has been proposed to capture the main characteristics of traditional protocols. However, since Token Coherence relies on broadcast-based transient requests and inefficient persistent requests, it is only suitable for small systems. To make Token Coherence scalable in many-core architectures, this paper introduces a dynamically reconfigurable mechanism to Token Coherence. Based on sub-nets, this mechanism can significantly reduce the average execution time and communication cost in a 16-core processor. This dynamically reconfigurable mechanism therefore makes Token Coherence applicable to many-core architectures.
Chuan Zhou, Yuzhuo Fu, Jiang Jiang, Xing Han, Kaikai Yang

Dynamic and Online Task Scheduling Algorithm Based on Virtual Compute Group in Many-Core Architecture

Efficient task scheduling for a series of applications on mesh-based many-core processors is very challenging, especially when resource occupation and release are required in some running task phases. In this paper, we present a dynamic and online heuristic mapping for efficient task scheduling based on the Virtual Computing Group (VCG), together with an algorithm that manages free resources based on rectangle topology. This method quickly finds proper rectangle resources for a task, partitions processing elements (PEs) into a Virtual Computing Group by constructing a subnet, and maps communicating subtasks onto adjacent PEs according to data dependency and communication dependency. Compared with the existing algorithms, our mapping algorithm can reduce the total execution time and enhance the system throughput by 10% in simulations.
Ziyang Liu, Yuzhuo Fu, Jiang Jiang, Xing Han

ADL and High Performance Processor Design

Architecture Description Languages (ADLs) can model many computer-related problems and are widely used in software and hardware design. In processor design, many institutes and companies use an ADL as a rapid-prototyping language for processors and use it to generate processor simulators, test benches, and compiler utilities. This paper analyzes and compares three processor description languages. We also describe the disadvantages of modern ADLs when used in high-performance processor design and give some suggestions for further ADL development.
Liu Yang, Xiaoqiang Ni, Yusong Tan, Hengzhu Liu

Session 2: Design of Integration Circuit

The Design of the ROHC Header Compression Accelerator

The ROHC (Robust Header Compression) packet header compression protocol reduces the extra overhead introduced by the packetizing of the Internet transport protocols and uses wireless bandwidth more effectively, so it is widely used. Previous studies have mainly focused on software implementation and the optimization of key parameters. This paper introduces the ROHC header compression scheme applied in the wireless environment and designs the framework of the ROHC header compression scheme in U-mode. A header compressor for IPv4/UDP/RTP headers has also been realized according to the principle of ROHC under U-mode. The modules and the implementation of the compressor are described in this paper. The performance of the ROHC header compression system is analyzed through experiments. The results show that the hardware accelerator correctly implements the ROHC packet header compression protocol and significantly reduces the overhead of packet headers, effectively improving link utilization, while also offering good usability and flexibility.
Mengmeng Yan, Shengbing Zhang

A Hardware Implementation of Nussinov RNA Folding Algorithm

RNA secondary structure prediction, or RNA folding, is a compute-intensive task that is used in many bioinformatics applications. Exploiting the parallelism of this kind of algorithm is one of the most relevant areas in computational biology. In this paper, we propose a parallel way to implement the Nussinov algorithm in hardware. We implement our work on a Xilinx FPGA. The total number of clock cycles needed to complete the algorithm is about half of that needed by serial software, and we also partly resolve the fixed-length limitation of existing hardware implementations with efficient resource usage.
Qilong Su, Jiang Jiang, Yuzhuo Fu
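
For readers unfamiliar with the Nussinov algorithm, the recurrence can be sketched as a serial dynamic program. This is a minimal software sketch, not the authors' FPGA design; the function name `nussinov_max_pairs` and the restriction to Watson-Crick pairs (A-U, G-C) are assumptions made for brevity.

```python
def nussinov_max_pairs(seq):
    """Return the maximum number of nested base pairs in an RNA sequence."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] holds the best pair count for the subsequence seq[i..j].
    dp = [[0] * n for _ in range(n)]
    for span in range(1, n):          # fill by increasing subsequence length
        for i in range(n - span):
            j = i + span
            best = max(dp[i + 1][j], dp[i][j - 1])   # leave i or j unpaired
            if (seq[i], seq[j]) in pairs:            # pair i with j
                inner = dp[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)
            for k in range(i + 1, j):                # split into two parts
                best = max(best, dp[i][k] + dp[k + 1][j])
            dp[i][j] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    print(nussinov_max_pairs("GGGAAAUCC"))
```

All cells with the same `span` depend only on cells with shorter spans, so they can be computed independently of one another. That property is what makes the recurrence a natural candidate for parallel evaluation.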

A Configurable Architecture for 1-D Discrete Wavelet Transform

This work presents a novel configurable architecture for the 1-dimensional discrete wavelet transform (DWT) that can be configured into different types of filters with different lengths. The architecture adopts a polyphase filter structure and a MAC loop based filter (MLBF) to achieve high computing performance and strong generality. A loop unrolling approach is used to eliminate the data hazards caused by pipelining. The hardware usage of the configurable architecture is fixed for any kind of wavelet function.
Qing Sun, Jiang Jiang, Yuzhuo Fu

A Comparison of Folded Architectures for the Discrete Wavelet Transform

The multi-level discrete wavelet transform (DWT), which performs multiresolution decomposition of a signal by cascading filter banks, employs a folded architecture to enhance hardware utilization. This work compares folded architectures for the DWT based on three filter structures: the direct form filter, the linear systolic array, and the lifting structure. We generalize the design of these architectures in terms of DWT levels, filter taps, and pipeline insertion in the critical path. A figure of merit for assessing all three architectures under different specifications is proposed. A detailed quantitative comparison among the architectures is presented for different combinations of specifications. The results show that variations in DWT levels, filter taps, and pipeline insertions have different impacts on the three architectures. Overall, the folded architecture based on the lifting structure gives the most desirable figure of merit, and the one based on the linear systolic array demonstrates the best scalability.
Jia Zhou, Jiang Jiang
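
The lifting structure named in the abstract can be illustrated with the simplest possible wavelet. This is a minimal sketch under stated assumptions: the abstract does not name a wavelet, so the unnormalized Haar wavelet is used here for brevity, and all function names are hypothetical. The loop in `multilevel_dwt` reuses one lifting stage for every level, which is the software analogue of folding, not a model of the authors' hardware.

```python
def haar_lifting_forward(signal):
    """One DWT level via lifting: split, predict, update."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    even, odd = signal[0::2], signal[1::2]
    detail = [o - e for e, o in zip(even, odd)]        # predict step
    approx = [e + d / 2 for e, d in zip(even, detail)]  # update step
    return approx, detail


def haar_lifting_inverse(approx, detail):
    """Undo the lifting steps in reverse order."""
    even = [s - d / 2 for s, d in zip(approx, detail)]
    odd = [d + e for e, d in zip(even, detail)]
    out = []
    for e, o in zip(even, odd):
        out.extend([e, o])
    return out


def multilevel_dwt(signal, levels):
    """Apply the same lifting stage repeatedly to the approximation band."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_lifting_forward(approx)
        details.append(detail)
    return approx, details


if __name__ == "__main__":
    print(multilevel_dwt([4, 6, 10, 12, 8, 6, 5, 5], levels=2))
```

Because each lifting step is undone by simply reversing its sign, the inverse transform reconstructs the input exactly.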

A High Performance DSP System with Fault Tolerant for Space Missions

Space missions are very demanding on system reliability. With the development of space-based remote sensor technologies, space missions also place increasingly high requirements on system performance. Conventional techniques mainly focus on system reliability, at the expense of system performance.
In this paper, a flexible, DSP-based, high-performance system is presented. The system can dynamically adapt its level of redundancy according to varying radiation levels. A compare-point and fast recovery mechanism is proposed to improve system performance. Some design ideas and implementation methods are also described. The system performance is evaluated and analyzed. Running the correlation function benchmark on this system shows that it provides high performance while maintaining certified reliability.
Kang Xia, Ao Shen, Yuzhuo Fu, Ting Liu, Jiang Jiang

The Design and Realization of Campus Information Release Platform Based on Android Framework

With the popularity of mobile terminals, there is a new trend to release all kinds of campus information through intelligent mobile terminals. The efficient, intelligent, and popular features of the Android smartphone platform are combined with the campus information system to achieve synchronized and convenient release of all types of campus information and to strengthen communication between the various campuses of the same university. In this paper, we design and realize a campus information release platform based on the Android framework. This platform can effectively reduce the complexity of the information release system and strengthen the real-time performance of information, thereby promoting the information construction of the campus.
Jie Wang, Xue Yu, Yu Zeng, Dongri Yang

A Word-Length Optimized Hardware Gaussian Random Number Generator Based on the Box-Muller Method

In this paper, we propose a hardware Gaussian random number generator based on the Box-Muller method. To reduce resource complexity, an efficient word-length optimization model is proposed to find the optimal word lengths for signals. Experimental results show that our word-length optimized fixed-point generator runs as fast as 403.7 MHz on a Xilinx Virtex-6 FPGA device and is capable of generating 2 samples every clock cycle, which is 12.6 times faster than its corresponding dedicated software version. It uses 442 Slices, 1517 FFs, and 1517 LUTs, which is only about 1% of the device, and saves almost 85% and 71% of area in comparison to the corresponding IEEE double and single floating-point generators, respectively. The statistical quality of the Gaussian samples produced by our design is verified by a common empirical test, the chi-square (χ²) test.
Yuan Li, Jiang Jiang, Minxuan Zhang, Shaojun Wei
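
The Box-Muller method turns two uniform samples into two independent Gaussian samples, which matches the two outputs per cycle reported in the abstract. The following is a minimal floating-point sketch, not the authors' word-length optimized fixed-point design; the function names and the seed are illustrative assumptions.

```python
import math
import random


def box_muller_pair(u1, u2):
    """Map uniforms u1 in (0, 1] and u2 in [0, 1) to two N(0, 1) samples."""
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def gaussian_samples(count, seed=0):
    """Generate `count` Gaussian samples, two per Box-Muller evaluation."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < count:
        u1 = 1.0 - rng.random()   # shift [0, 1) to (0, 1] so log(u1) is defined
        u2 = rng.random()
        samples.extend(box_muller_pair(u1, u2))
    return samples[:count]


if __name__ == "__main__":
    data = gaussian_samples(100_000)
    mean = sum(data) / len(data)
    var = sum((x - mean) ** 2 for x in data) / len(data)
    print(f"mean = {mean:.4f}, variance = {var:.4f}")
```

The logarithm, square root, and trigonometric evaluations are the costly parts of this transform, which is why the word lengths chosen for those signals matter in a hardware implementation.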

Session 3: I/O Interconnect

DAMQ Sharing Scheme for Two Physical Channels in High Performance Router

Communication in large-scale interconnection networks can be made more efficient by designing faster routers and by using larger buffers and a larger number of ports and channels, but all of these incur significant hardware costs. In this paper we present a dual-port shared buffer scheme for routers. The proposed scheme is based on a dynamically allocated multi-queue (DAMQ) and a four-port register file. Two physical channels share the same input buffer space. This provides a larger available buffer space per channel when the load is unbalanced among physical channels and virtual channels. We give the detailed organization of the shared buffer and the management of idle buffers. Simulation results show that the proposed method achieves similar performance using only 75% of the buffer size of a traditional implementation and outperforms it by 5% to 10% in throughput at the same size.
Yongqing Wang, Minxuan Zhang

Design and Implementation of Dynamic Reliable Virtual Channel for Network-on-Chip

Reliability issues such as soft errors, caused by scaling IC technology, low supply voltage, and heavy thermal effects, have made fault-tolerant design a challenge for NoC (Network-on-Chip). The router is a core element of the NoC, and the flip-flop-based virtual channel, which occupies most of the router's area, is the router element most sensitive to soft errors. To address this problem, a dynamic reliable virtual channel architecture is proposed in this paper. It detects the utilization of the virtual channel and adjusts the physical configuration to flexibly support no-protection, dual redundancy, and TMR (triple modular redundancy) requirements. Compared with a typical TMR virtual channel design, the synthesis results show that our method can switch among several fault-tolerant structures with nearly 3 times the resource utilization in the ideal case and only 13.8% extra area cost.
Peng Wu, Yuzhuo Fu, Jiang Jiang
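
The two redundancy modes named in the abstract differ in what they can do with a corrupted copy. The following minimal sketch shows the word-level logic only, with hypothetical function names; it does not model the dynamic reconfiguration of buffers that the paper proposes.

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three copies: each output bit takes the value
    held by at least two of the three inputs, masking a single upset."""
    return (a & b) | (a & c) | (b & c)


def dual_mismatch(a, b):
    """Dual redundancy can detect a disagreement but cannot tell which
    copy is wrong, so it only reports an error flag."""
    return a != b


if __name__ == "__main__":
    stored = 0b1010
    # Flip a different single bit in each copy; the vote still recovers the word.
    print(bin(tmr_vote(stored ^ 0b0001, stored ^ 0b0010, stored ^ 0b0100)))
    print(dual_mismatch(stored, stored ^ 0b0001))
```

TMR stores three copies of every flit, which is the cost that motivates switching to weaker protection when the buffers are needed for traffic.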

HCCM: A Hierarchical Cross-Connected Mesh for Network on Chip

With the continuous development of semiconductor technology, more and more IP cores can be integrated on a single chip. The interconnection structure then plays a decisive role in the area and performance of a system on chip and has a profound influence on the transmission capability of the system. Based on distributed routing lookup, we propose a new regular interconnection network named HCCM (Hierarchical Cross-Connected Mesh), which consists of an N × N mesh interconnection of N² subnets, where every subnet is a fully connected 2 × 2 interconnection. This paper also presents a new hierarchical routing algorithm, HXY (Hierarchical XY). The simulation results demonstrate that the HCCM topology is superior to the Mesh and Xmesh topologies in average communication delay and normalized throughput.
Liguo Zhang, Huimin Du, Jianyuan Liu

Efficient Broadcast Scheme Based on Sub-network Partition for Many-Core CMPs on Gem5 Simulator

Networks-on-chip (NoC) have been proposed to achieve scalable, higher-bandwidth communication in many-core CMPs. To use IC resources efficiently, sub-network partitioning oriented to NoC is proposed, which divides the whole NoC into regions to meet the traffic isolation demand of the cache coherence protocol. We apply region segmentation to a mesh-based NoC: the PEs (processing elements) to which tasks are mapped aggregate into a logical sub-network, and routing between these PEs is implemented in the corresponding physical sub-network, in which an efficient tree-based broadcast scheme based on the multicast XY routing algorithm is carried out. The Gem5 simulator is used for the evaluation, and experimental results show that our approach has a considerably lower average packet latency than multiple unicast.
Kaikai Yang, Yuzhuo Fu, Xing Han, Jiang Jiang
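
The contrast between multiple unicast and a tree-based broadcast can be made concrete by counting link traversals. This is a minimal sketch under stated assumptions: it uses plain dimension-ordered XY routing on a mesh with `(x, y)` coordinates, forms the tree as the union of the XY paths from one source, and counts traversed links rather than latency. The function names are hypothetical, and this is not the authors' Gem5 model.

```python
def xy_route(src, dst):
    """Return the hops from src to dst: travel along X first, then along Y."""
    x, y = src
    hops = []
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        hops.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        hops.append((x, y))
    return hops


def link_counts(src, dests):
    """Compare link traversals: one packet per destination versus a tree
    in which a link shared by several XY paths is traversed only once."""
    unicast_traversals = 0
    tree_links = set()
    for dst in dests:
        prev = src
        for hop in xy_route(src, dst):
            unicast_traversals += 1
            tree_links.add((prev, hop))
            prev = hop
    return unicast_traversals, len(tree_links)


if __name__ == "__main__":
    source = (0, 0)
    others = [(x, y) for x in range(4) for y in range(4) if (x, y) != source]
    print(link_counts(source, others))
```

For a broadcast from a corner of a 4 × 4 mesh, the unicast packets together traverse 48 links while the tree uses 15, one incoming link per destination.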

A Quick Method for Mapping Cores Onto 2D-Mesh Based Networks on Chip

With the development of NoC, efficiently mapping a complex application onto a specified NoC platform has become an urgent task. In this paper, an approach called constraint-cluster based simulated annealing (CCSA) is proposed to tackle the mapping problem in 2D-mesh NoC in order to optimize communication energy and execution time. Unlike other methods, our method considers the relationship among cores that are partitioned into several clusters and sets constraints according to that relationship. Experimental results show that the proposed approach achieves shorter execution time with lower energy consumption than other algorithms. In the VOPD application (4x4), the reduction of execution time is about 75.64% compared with normal simulated annealing. In a larger application (8x8 vodx4), CCSA can save 68.89%. The energy consumption is the lowest among all the compared algorithms.
Zhenlong Song, Yong Dou, Mingling Zheng, Weixia Xu
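
The baseline that CCSA is compared against, plain simulated annealing, can be sketched as follows. This is a minimal sketch of the unconstrained baseline only, not CCSA: the cost model (traffic volume times Manhattan distance), the cooling schedule, and all names and parameter values are assumptions chosen for illustration.

```python
import math
import random


def comm_cost(mapping, traffic, width):
    """Sum of traffic volume times Manhattan distance between mapped tiles."""
    cost = 0
    for (a, b), volume in traffic.items():
        ar, ac = divmod(mapping[a], width)   # tile index -> (row, column)
        br, bc = divmod(mapping[b], width)
        cost += volume * (abs(ar - br) + abs(ac - bc))
    return cost


def anneal_mapping(num_cores, traffic, width, steps=5000,
                   t0=10.0, alpha=0.999, seed=0):
    """Search for a low-cost placement by swapping the tiles of two cores."""
    rng = random.Random(seed)
    mapping = list(range(num_cores))         # core i sits on tile mapping[i]
    cost = comm_cost(mapping, traffic, width)
    best, best_cost = mapping[:], cost
    temp = t0
    for _ in range(steps):
        i, j = rng.sample(range(num_cores), 2)
        mapping[i], mapping[j] = mapping[j], mapping[i]
        new_cost = comm_cost(mapping, traffic, width)
        delta = new_cost - cost
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = mapping[:], cost
        else:
            mapping[i], mapping[j] = mapping[j], mapping[i]   # undo the swap
        temp *= alpha
    return best, best_cost


if __name__ == "__main__":
    demo_traffic = {(0, 3): 10, (1, 2): 10}  # hypothetical core-to-core volumes
    print(anneal_mapping(4, demo_traffic, width=2))
```

Every candidate move here is an unrestricted swap. A cluster-based variant would presumably restrict or guide those moves, which is where the constraints described in the abstract would enter.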

Session 4: Measurement, Verification, and Others

A Combined Hardware/Software Measurement for ARM Program Execution Time

At present there is no accurate end-to-end dynamic measurement of ARM program execution time, because the results given by hardware counters in ARM microprocessors are not precise enough and the timing cost of instrumentation methods is difficult to calculate. Therefore, this paper proposes a combined hardware/software measurement of ARM program execution time. It sets the precision of measurement in the system boot loader code, encapsulates access to timers in the Linux kernel, and then measures the execution time of the program using the timer and its corresponding interrupt during program execution. Experimental results show that, compared with instrumentation methods and hardware counters, our method is an efficient way to obtain accurate and precise execution time measurements for ARM programs. Additional experiments combining curve fitting techniques with our method show that it can be used to predict the execution time of a program under different input data.
Liangliang Kong, Jianhui Jiang

A Low-Complexity Parallel Two-Sided Jacobi Complex SVD Algorithm and Architecture for MIMO Beamforming Systems

Singular Value Decomposition (SVD) is a very important matrix factorization technique in engineering applications. In multiple-input multiple-output (MIMO) systems, SVD is applied in transmit beamforming, which provides high diversity advantages. This paper proposes a low-complexity parallel two-sided Jacobi complex SVD algorithm and architecture that are suitable for any m × n (m ≤ 4, n ≤ 4) matrix. It performs two 2 × 2 complex SVD procedures in parallel and employs master-slave CORDIC (coordinate rotation digital computer) to reduce the decomposition time. The proposed parallel algorithm for 4 × 4 complex SVD saves 52% of the decomposition time compared with the Golub-Kahan-Reinsch algorithm. Meanwhile, the Bit Error Rate (BER) performance of the proposed algorithm is almost the same as that of the ideal SVD.
Weihua Ding, Jiangpeng Li, Guanghui He, Jun Ma

A Thermal-Aware Task Mapping Algorithm for Coarse Grain Reconfigurable Computing System

Ever-growing power density has made thermal effects one of the most crucial issues for modern VLSI designs; for example, reports have shown that more than 50% of IC failures are related to thermal issues. However, thermal issues for Coarse Grain Reconfigurable Architectures (CGRAs) have seldom been addressed. In this paper, a thermal-aware task mapping algorithm called the Max-Min algorithm is developed for the REmus reconfigurable architecture. It uses a compact thermal model based on an equivalent thermal circuit to iteratively optimize the power dissipation on modern CGRAs. Experiments based on Hotspot simulation show that the algorithm can reduce the maximum temperature by 3~9 and narrow the temperature distribution range by 7~15. Compared to the previous intuitive random algorithm, the Max-Min algorithm can significantly reduce the number of optimization iterations while preserving the same result.
Shizhuo Tang, Naifeng Jing, Weiguang Sheng, Weifeng He, Zhigang Mao

DC Offset Mismatch Calibration for Time-Interleaved ADCs in High-Speed OFDM Receivers

Zero Intermediate Frequency (zero-IF) receivers with two analog-to-digital converters (ADCs) in the In-Phase and Quadrature (IQ) branches are widely used in emerging multi-Gigabit wireless Orthogonal Frequency Division Multiplexing (OFDM) systems. Because ordinary ADCs cannot meet the sampling rate demands of such systems, two time-interleaved analog-to-digital converters (TI-ADCs) are an attractive alternative for improving sampling speed in the receiver. However, without calibration, the mismatches among the parallel sub-ADCs can degrade performance significantly. Targeting the DC offset mismatch of the TI-ADCs, this paper proposes a calibration algorithm based on decorrelation least-mean-squares (LMS) and recursive-least-squares (RLS) that uses the comb-type pilots in the OFDM frame and can calibrate the two TI-ADCs in the IQ branches simultaneously. The calibration algorithm converges quickly. Simulation results show that the BER performance is improved by the proposed algorithm.
Yulong Zheng, Zhiting Yan, Jun Ma, Guanghui He

A Novel Graph Model for Loop Mapping on Coarse-Grained Reconfigurable Architectures

Coarse-Grained Reconfigurable Architectures (CGRAs) provide more opportunities for accelerating data-intensive applications, such as multimedia programs. However, the optimization of critical loops is still a challenging issue, since application mapping tools for CGRAs are lacking. To address this challenge, we first perform program feature analysis on the kernel loops of applications. We then propose a novel graph model called PIA-CDTG that contains these features. We implement an efficient task mapping method with a genetic algorithm based on the graph model. Experimental results show that the mapping method with PIA-CDTG is more effective than other feature-unaware methods and makes execution attain high efficiency and availability.
Ziyu Yang, Ming Yan, Dawei Wang, Sikun Li

Memristor Working Condition Analysis Based on SPICE Model

Memristors are novel devices that behave like nonlinear resistors with memory. The concept was first proposed and described by Leon Chua in 1971. In 2008, HP Labs proved its existence by announcing the first physical implementation as crossbar structures. Memristors have shown many advantages, such as non-volatility and no leakage current. The logic value can be measured in terms of impedance, and logic values can be stored without power consumption, which may have a significant effect on digital circuits. The working conditions of a nonlinear dopant drift model of a memristor are studied in detail, and a set of precise working conditions has been found. The transition time between the off and on states of a memristor is proposed as a measure of the switching behavior.
Zhuo Bi, Ying Zhang, Yunchuan Xu

On Stepsize of Fast Subspace Tracking Methods

Choosing a stepsize that balances convergence rate against steady-state error level or stability is a problem in some subspace tracking schemes. Methods in the DPM or Oja class may sometimes show sparks in their steady-state error, even with a rather small stepsize. A study of the schemes’ updating routine finds that, if a proper basis is chosen to describe the estimated subspace, the update affects not all basis vectors but one specific vector. That vector moves only in the plane defined by the new input and the previous estimate. By analyzing the vectors’ relationship in that plane, the movement of that vector is constrained to a reasonable range as an amendment to the algorithms that fixes the sparks problem. The simulation confirms that the amendment eliminates the sparks.
Zhu Cheng, Zhan Wang, Haitao Liu, Majid Ahmadi

