2015 | Book

Computer Engineering and Technology

18th CCF Conference, NCCET 2014, Guiyang, China, July 29 – August 1, 2014, Revised Selected Papers

Edited by: Weixia Xu, Liquan Xiao, Jinwen Li, Chengyi Zhang, Zhenzhen Zhu

Publisher: Springer Berlin Heidelberg

Book series: Communications in Computer and Information Science

About this book

This book constitutes the refereed proceedings of the 18th National Conference on Computer Engineering and Technology, NCCET 2014, held in Guiyang, China, during July/August 2014. The 18 papers presented were carefully reviewed and selected from 85 submissions. They are organized in topical sections on processor architecture; computer application and software optimization; technology on the horizon.

Table of Contents

Frontmatter

Processor Architecture

An Efficient Vector Memory Unit for SIMD DSP
Abstract
The SIMD DSP is highly efficient for embedded applications whose parallel data are aligned. However, typical embedded algorithms such as FFT and FIR contain many unaligned and irregular data accesses. On a SIMD architecture with alignment restrictions, vectorizing these algorithms requires many additional shuffle instructions, and the resulting loss of computational efficiency grows as the SIMD width increases. This paper proposes an efficient vector memory unit (VMU) with 16 memory blocks on a 16-way SIMD DSP, M-DSP. Each memory block contains four groups of multi-bank memory with lowest-order-bit interleaved addressing and provides double bandwidth as needed to reduce parallel vector access conflicts. A high-bandwidth data shuffle unit capable of aligning dual vector accesses is integrated into the vector access pipeline; it efficiently supports both unaligned accesses and the special vector access patterns of FFT. Experimental results show that the VMU provides conflict-free parallel accesses between DMA and vector load/store operations with no more than 10% area overhead, and that M-DSP achieves an ideal speedup for the FFT and FIR algorithms.
Haiyan Chen, Zhong Liu, Sheng Liu, Sheng Ma
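The bank conflicts mentioned in the abstract above can be illustrated with a minimal sketch. It assumes plain low-order interleaving (bank index = word address modulo the number of banks) and 16 banks; the function names and the bank count are illustrative assumptions, not details taken from the paper.

```python
def bank_of(address: int, num_banks: int = 16) -> int:
    """Low-order interleaving: consecutive word addresses hit consecutive banks."""
    return address % num_banks


def count_conflicts(addresses, num_banks: int = 16) -> int:
    """Count accesses that collide with an earlier access to the same bank."""
    seen = set()
    conflicts = 0
    for addr in addresses:
        bank = bank_of(addr, num_banks)
        if bank in seen:
            conflicts += 1
        else:
            seen.add(bank)
    return conflicts


def strided_vector(base: int, stride: int, lanes: int = 16):
    """Word addresses touched by one vector access of `lanes` elements."""
    return [base + i * stride for i in range(lanes)]


# An unaligned unit-stride access still spreads over all 16 banks.
unit_stride_conflicts = count_conflicts(strided_vector(base=3, stride=1))
# A stride of 2 only reaches the even banks, so half the lanes collide.
stride2_conflicts = count_conflicts(strided_vector(base=0, stride=2))
print(unit_stride_conflicts, stride2_conflicts)
```

Under these assumptions, unit-stride accesses are conflict-free regardless of alignment, while power-of-two strides concentrate on a subset of banks, which is the kind of pattern that extra bandwidth or a shuffle stage is meant to absorb.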
An Analytical Model for Matrix Multiplication on Many Threaded Vector Processors
Abstract
Vector processing can enhance peak performance, while multi-threading can improve efficiency. MTV is a new architecture that combines the two to achieve both high computing performance and high throughput. Matrix multiplication is the kernel of many scientific applications. A parallel matrix multiplication algorithm is presented and an analytical performance model is built. Based on the model, the performance of MTV is evaluated and critical configurations are given to guide the design of MTV processors.
Yongwen Wang, Jun Gao, Bingcai Sui, Chengyi Zhang, Weixia Xu
HMCPA: Heuristic Method Utilizing Critical Path Analysis for Design Space Exploration of Superscalar Microprocessors
Abstract
Microprocessor design space exploration attempts to determine the optimal parameter configuration that satisfies target requirements within limited time. Current mainstream superscalar microprocessors typically use out-of-order execution and fully utilize instruction-level parallelism. However, the increasing complexity of superscalar microprocessor design leads to an ever-larger design space, which makes it challenging to determine the optimal design point. To address this problem, this paper proposes a heuristic method utilizing critical path analysis (HMCPA) to perform design space exploration of superscalar microprocessors. Profiling a program running on a simulator enables the program dependence graph to be built from the detailed information generated during the simulation. The critical path of the dependence graph can then be obtained and further analyzed to determine the performance bottleneck under the current design configuration. Based on this bottleneck information, design space exploration can finally be conducted efficiently. Experimental results show that, compared with the traversal and simulated annealing methods, HMCPA effectively reduces the number of design points that need to be explored and determines the optimal configuration quickly.
Fangyan Qin, Lei Wang, Yu Deng, Yongwen Wang, Tianlei Zhao
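As a rough illustration of the critical path step described in the abstract above, the sketch below computes the longest latency-weighted path through a small dependence graph. The graph, the latencies, and the function name are hypothetical; the paper's actual graph model and bottleneck analysis are not specified here.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(latency, deps):
    """Return (total_cycles, path) for the longest path in a dependence DAG.

    latency: node -> cycles spent in that node
    deps:    node -> list of nodes it depends on (its predecessors)
    """
    graph = {node: deps.get(node, []) for node in latency}
    finish = {}     # earliest finish time of each node
    best_pred = {}  # predecessor that determines that finish time
    for node in TopologicalSorter(graph).static_order():
        pred = max(graph[node], key=lambda p: finish[p], default=None)
        start = finish[pred] if pred is not None else 0
        finish[node] = start + latency[node]
        best_pred[node] = pred
    # Walk back from the node that finishes last.
    node = max(finish, key=finish.get)
    total = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = best_pred[node]
    return total, path[::-1]


# Hypothetical four-instruction example.
latency = {"load": 3, "add": 1, "mul": 4, "store": 1}
deps = {"add": ["load"], "mul": ["load"], "store": ["add", "mul"]}
total, path = critical_path(latency, deps)
print(total, path)
```

In this toy example the multiply dominates the path, so a heuristic of this kind would point at the multiplier latency rather than the adder as the resource worth exploring first.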
Low Latency Multicasting Scheme for Bufferless Hybrid NoC-Bus 3D On-Chip Networks
Abstract
In this paper, we propose a novel multicast routing algorithm for the 3D bufferless hybrid interconnection network to enhance overall system performance. The proposed algorithm makes use of the single-hop, broadcast (bus-based) interlayer communication of the 3D NoC-Bus mesh architecture. Simulations with six different synthetic workloads show that our architecture with the proposed multicast routing algorithm achieves higher system performance than the DRM_noPR multicast routing algorithm.
Chaoyun Yao, Chaochao Feng, Mingxuan Zhang, Shaojun Wei
A Highly-Efficient Crossbar Allocator Architecture for High-Radix Switch
Abstract
This contribution explores allocator design for high-radix switches and implements a highly efficient allocator, PWF (Parallel WaveFront), to achieve high throughput. Based on the wavefront allocator, the PWF allocator performs fast allocation within one cycle to avoid a timing loop. It uses a parallelized matching strategy with cyclical priority to provide allocation fairness, and a greedy policy to reach the maximal match number. Implemented in 32 nm CMOS technology, PWF increases area and power consumption by 32.8% and 36.8%, respectively, compared to the wavefront allocator, and its critical path delay for an 8x8x8 switch is less than 0.5 ns, which satisfies the requirement of GHz-level frequency design for high-radix switches. In terms of allocation efficiency, PWF reduces the request schedule time by 61.2% and 65.7%, and increases the average number of immediately scheduled requests by 38.9% and 46.7%, in comparison with the wavefront allocator. The efficiency improvement is also reflected in distinctly lower average schedule time and average response time compared with the wavefront and DRRM allocators, which shows clear advantages in allocation performance and fairness.
Mingche Lai, Lei Gao
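For readers unfamiliar with the baseline named in the abstract above, here is a minimal software sketch of a conventional wavefront allocator, not of PWF itself. It assumes a square request matrix and a rotating priority diagonal; the function and variable names are illustrative.

```python
def wavefront_allocate(requests, priority_diagonal=0):
    """Grant a conflict-free subset of requests, sweeping wrapped diagonals.

    requests[i][j] is truthy when input i requests output j.
    Cells on one wrapped diagonal share no row or column, so hardware can
    evaluate them in parallel; here they are simply visited in a loop.
    """
    n = len(requests)
    row_free = [True] * n
    col_free = [True] * n
    grants = []
    for step in range(n):
        diagonal = (priority_diagonal + step) % n
        for i in range(n):
            j = (diagonal - i) % n  # cells with (i + j) % n == diagonal
            if requests[i][j] and row_free[i] and col_free[j]:
                row_free[i] = False
                col_free[j] = False
                grants.append((i, j))
    return grants


requests = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]
grants_p0 = wavefront_allocate(requests, priority_diagonal=0)
grants_p1 = wavefront_allocate(requests, priority_diagonal=1)
print(grants_p0, grants_p1)
```

With this request matrix, starting at diagonal 0 yields two grants while starting at diagonal 1 yields three, which shows why the choice and rotation of the priority diagonal affects both match size and fairness.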

Application Specific Processors

FPGA Implementation of FastICA Algorithm for On-line EEG Signal Separation
Abstract
Fast independent component analysis (FastICA) is an efficient and popular algorithm for solving blind source separation (BSS) problems. FastICA is widely used to identify artifacts and interference in mixed signals such as electroencephalogram (EEG), magnetoencephalography (MEG), and electrocardiogram (ECG) recordings. In this paper, we propose a Scalable Macro-pipelined FastICA Architecture (SMFA) which aims to exploit architectural scalability and temporal parallelism. The SMFA has strong data processing ability for on-line EEG signals and is capable of coping with different types of input data. The FastICA algorithm based on the proposed SMFA is implemented on a field programmable gate array (FPGA). It is a key module of an ongoing project which aims to evaluate human fatigue levels on-line from EEG. Experimental results demonstrate the effectiveness of the presented FastICA architecture.
Dongsheng Zhao, Jiang Jiang, Chang Wang, Baoliang Lu, Yongxin Zhu
FPGA Based Low-Latency Market Data Feed Handler
Abstract
Financial market data refers to price and trading data transmitted between financial exchange instruments and traders. Delivery of financial market feeds requires massive data processing with ultra-low latency. The FAST protocol is a financial technology standard for compressing data streams during network transmission. This paper presents the design and implementation of a hardware accelerator for financial market data in the FAST protocol. We propose a parallel data decoding architecture for the field analysis process, which is the key feature of our design. The decoder parses and filters FAST-format messages and, thanks to an additional parallel structure not found in typical handlers, achieves a 40% speedup in decoding time over previous attempts. The filter function is reconfigurable for various user preferences and future protocol updates. Tests on a large volume of source data indicate an average latency of 1.6 μs per message.
Liyuan Zhou, Jiang Jiang, Ruochen Liao, Tianyi Yang, Chang Wang
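To make the field-level decoding mentioned in the abstract above more concrete, the following sketch decodes one unsigned integer using FAST stop-bit encoding, where each byte carries 7 data bits and a set high bit marks the last byte of the field. It is a software illustration only; the function name is an assumption and it says nothing about how the paper's hardware decoder is organized.

```python
def decode_stop_bit_uint(data: bytes, offset: int = 0):
    """Decode one stop-bit encoded unsigned integer.

    Returns (value, next_offset). Bytes are consumed most significant
    group first; the byte with bit 7 set terminates the field.
    """
    value = 0
    pos = offset
    while True:
        if pos >= len(data):
            raise ValueError("truncated stop-bit field")
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if byte & 0x80:  # stop bit set: this was the last byte
            return value, pos


# Two fields back to back: 942 (two bytes) followed by 1 (one byte).
stream = bytes([0x07, 0xAE, 0x81])
first, pos = decode_stop_bit_uint(stream)
second, pos = decode_stop_bit_uint(stream, pos)
print(first, second, pos)
```

Because a field boundary is only known once its stop bit is seen, a sequential decoder like this one handles one byte per step, which is the dependency a parallel field-analysis architecture has to work around.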
FPGA-Based Real-Time Emulation for High-Speed Double-Precision Small Time-Step Electromagnetic Transient System
Abstract
Real-time emulation of electromagnetic transient (EMT) systems involves a heavy computational load and requires low latency. In this paper, a Field Programmable Gate Array (FPGA)-based EMT system is proposed for small time-step emulation. The system takes advantage of a fully pipelined computing structure and can support the simulation of a large-scale power grid with various types of components. Moreover, our proposed node injected current accumulation (NICA) structure solves the computation-heavy current vector updating problem in large-scale grids and brings the time-step latency, which normally takes about 10 μs, down to only 2 μs. Finally, a specified power grid system is emulated on an FPGA; the design supports up to 74 three-phase buses with only 2 μs latency per time-step. The emulation results are also compared with other similar designs and show the superiority of our system.
Jianxing Li, Chunhui Ding, Guanghui He, Qing Mu
Nodal-Analysis-Based FPGA Implementation for Real-Time Electromagnetic Transient Emulation System
Abstract
Electromagnetic transient (EMT) simulation of power systems is widely applied in the planning, design, and operation of the modern grid. However, large-scale real-time EMT simulation requires significant computational power, which makes a small simulation timestep difficult to achieve. A field-programmable gate array (FPGA)-based configurable EMT emulation system is proposed in this paper. A parallel nodal algorithm with pipelined double-precision floating-point calculation is designed to achieve high accuracy and a small timestep. In addition, a novel nodal equation calculation (NEC) structure is designed to save area and latency. Moreover, the NEC module is reused to merge the nodal current vectors, which further improves the emulation scale of the system. The proposed real-time EMT emulation system on FPGA achieves a timestep of 2 μs and can emulate a configurable power network of up to 74 nodes.
Xiaozhang Gong, Tianyi Yang, Xing Zhang, Guanghui He
Low Complexity Algorithm and VLSI Design of Joint Demosaicing and Denoising for Digital Still Camera
Abstract
In this paper, we propose a low-complexity algorithm to jointly demosaic and denoise Bayer-format images, which combines the Hamilton and Adams (HA) method for interpolation with an Epsilon filter for noise removal. Instead of a 5x5 filtering window, our method adopts one 7x1 horizontal Epsilon filter and one 1x3 vertical Epsilon filter, which reduces hardware cost significantly while maintaining high performance. Simulation results show that the proposed algorithm improves the mean PSNR by 1 dB compared to algorithms that treat the two processes independently. Furthermore, only 4 line buffers are consumed, and only simple logic operators such as adders and shifters are used for computation. For real-time implementation, a 5-stage pipelined VLSI architecture with 24 kb of SRAM for line buffers is presented. The prototype of the joint processor is verified on a Xilinx FPGA device and consumes about 36.6K gates for computational logic after synthesis with TSMC 90 nm technology. The joint processor achieves a throughput of 6 Gbps at 250 MHz.
Liang Hong, Wei Jin, Guanghui He, Weifeng He, Zhigang Mao
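The edge-preserving behavior of an Epsilon filter, as referenced in the abstract above, can be sketched in one dimension. This is one common simplified form, assumed here for illustration: neighbors that differ from the center sample by more than a threshold are replaced by the center value before averaging. The threshold, the equal weights, and the border clamping are assumptions; the paper's exact filter is not reproduced.

```python
def epsilon_filter_1d(row, epsilon, radius=3):
    """Apply a (2 * radius + 1)-tap epsilon filter along one image row.

    radius=3 gives a 7-tap horizontal window. Samples further than
    `epsilon` from the center value are treated as belonging to another
    region (for example across an edge) and do not pull the average.
    """
    n = len(row)
    out = []
    for c in range(n):
        center = row[c]
        total = 0
        for k in range(-radius, radius + 1):
            idx = min(max(c + k, 0), n - 1)  # clamp at the row borders
            sample = row[idx]
            if abs(sample - center) > epsilon:
                sample = center
            total += sample
        out.append(total / (2 * radius + 1))
    return out


# A sharp edge is left untouched because the far side exceeds epsilon.
edge = epsilon_filter_1d([10, 10, 10, 10, 100, 100, 100, 100], epsilon=20)
# A small bump (within epsilon) is smoothed toward its neighbors.
bump = epsilon_filter_1d([10, 10, 10, 17, 10, 10, 10], epsilon=20)
print(edge, bump[3])
```

Splitting the window into separate horizontal and vertical passes, as the abstract describes, keeps the number of rows that must be buffered small, which is where most of the hardware saving would come from.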

Computer Application and Software Optimization

A Slide-Window-Based Hardware XML Parsing Accelerator
Abstract
XML plays an extremely important role in fields such as web services and database systems. However, XML parsing is generally known to be a bottleneck in related applications, since a general-purpose processor takes dozens of cycles to process every single character of an XML file. As a result, software XML parsing performs poorly, and a hardware accelerator is an appropriate alternative for efficient XML parsing. Several hardware XML parsers with good performance have already been proposed. To further improve XML parsing performance, we propose a slide-window-based XML parsing accelerator (SWXPA) which introduces data-level parallelism. We implement our design on a Xilinx Virtex-6 board, achieving an average of 0.33 cycles per byte (CPB) and a throughput of 3.0 Gbps.
Linan Huang, Jiang Jiang, Chang Wang, Yanghan Wang, Yan Pei
Design of Fully Pipelined Dual-Mode Double Precision Reduction Circuit on FPGAs
Abstract
This paper proposes a fully pipelined dual-mode double-precision floating-point reduction circuit on field-programmable gate arrays (FPGAs), which is capable of supporting one double-precision operation or two parallel single-precision operations. By combining a tree-traversal structure with a striding-mode structure, the reduction circuit can handle multiple data sets with arbitrary combinations of lengths without stalls or buffer requirements, and it generates in-order results. Experimental results show that the proposed circuit supports dual-mode double-precision floating-point reduction at the cost of only a 7% increase in absolute latency for a double-precision vector of the same length, compared with previous single-mode double-precision reduction circuits.
Song Guo, Yong Dou, Yuanwu Lei
Design of Cloud Server Based on Godson Processors
Abstract
In contrast to existing cloud computing systems based on high-performance processors and traditional Ethernet networks, this paper presents a cloud computing system with 32 Godson processors that uses a HyperTransport switch (HT switch) as its interconnection fabric. The HT switch makes it possible to build a cloud server based on Godson processors with a high performance-to-cost ratio and a high performance-per-watt ratio, which better meets the requirements of cloud computing workloads. As the key interconnection fabric of the cloud server, the HT switch architecture is discussed in detail. To evaluate the performance of HT switch-based multiprocessor systems, a prototype system is implemented and performance test results are presented.
Chaoqun Sha, Gongbo Li, Chenming Zheng, Yanping Gao, Xiaojun Yang, Chungjin Hu
A New Storage System for Exabytes Storage
Abstract
A novel storage system, NebulaStorage, is proposed to address the storage wall challenge for high-performance computers. NebulaStorage adopts a new storage architecture in which the computing and storage subsystems are loosely coupled. It uses embedded thin storage nodes to build massive storage, and uses software-based erasure codes to replace the popular RAID disk array. Simulation tests show that its scalability and fault tolerance are better than those of the mainstream Lustre parallel file system.
Haitao Chen, Jinwen Li, Wei Zhang
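The idea of replacing RAID hardware with a software erasure code, as mentioned in the abstract above, can be illustrated with the simplest possible code: a single XOR parity block that tolerates the loss of any one block. This is a deliberately minimal stand-in chosen for illustration; the abstract does not say which erasure code NebulaStorage uses, and the block contents and function name are hypothetical.

```python
def xor_blocks(blocks):
    """XOR equally sized byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)


# Three data blocks, imagined as living on three separate storage nodes.
data = [b"node", b"stor", b"age!"]
parity = xor_blocks(data)  # stored on a fourth node

# Simulate losing the second node and rebuilding its block from the rest.
recovered = xor_blocks([data[0], data[2], parity])
print(recovered)
```

Production systems typically use codes that survive several simultaneous failures, but the recovery principle is the same: surviving blocks plus redundancy are combined in software to regenerate what was lost.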
A High-Accuracy Clock Synchronization Method in Distributed Real-Time System
Abstract
Clock synchronization is essential in distributed real-time systems. Currently, the most popular algorithm, the Network Time Protocol (NTP), cannot meet these needs well due to its low accuracy (about 10 milliseconds) and high cost. An improved high-accuracy clock synchronization method is therefore proposed in this paper to overcome errors and offsets. With this method, the clock error among computer nodes in a distributed real-time system can be kept below 2 milliseconds, and high availability can be achieved. The method has been applied in a national key engineering project.
Hongliang Li, Xuan Feng, Song Shi, Fang Zheng, Xianghui Xie
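As background for the NTP baseline named in the abstract above, the sketch below shows the standard NTP offset and round-trip delay calculation from four timestamps. It illustrates the baseline only, not the improved method proposed in the paper; the timestamps are made-up integers in milliseconds.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """Standard NTP estimates from one request/response exchange.

    t0: client send time      (client clock)
    t1: server receive time   (server clock)
    t2: server send time      (server clock)
    t3: client receive time   (client clock)
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2  # server clock minus client clock
    delay = (t3 - t0) - (t2 - t1)         # round trip minus server processing
    return offset, delay


# Client clock is 50 ms behind the server; each network leg takes 10 ms
# and the server spends 2 ms processing the request.
offset, delay = ntp_offset_and_delay(t0=1000, t1=1060, t2=1062, t3=1022)
print(offset, delay)
```

The offset estimate is exact only when the two network legs take equal time; asymmetric or jittery paths feed directly into the error, which is one reason millisecond-level accuracy is hard to guarantee with this calculation alone.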
Soft-Input Soft-Output Parallel Stack Algorithm for MIMO Detection
Abstract
This paper presents a reduced-complexity soft-input soft-output parallel stack algorithm (SISO-PSA) for multiple-input multiple-output (MIMO) wireless communication systems employing Turbo iterative processing at the receiver. The proposed algorithm incorporates hybrid enumeration and a modified tree pruning criterion to support soft inputs, which results in significant savings in computational complexity. Moreover, a leaf enumeration scheme is proposed to reduce the number of expanded leaf nodes. In addition, parallelism at the algorithm level provides high throughput while reducing area compared to hardware-level parallelism, which makes the algorithm very suitable for VLSI implementation. Simulation results show that, in a 4×4 64-QAM MIMO system, the proposed algorithm achieves better performance than the SISO K-Best algorithm (K=50) and SISO-FSD, with 60% memory saving and significantly reduced computational complexity in terms of the number of visited nodes.
Fan Luo, Zhiting Yan, Guanghui He, Jun Ma, Zhigang Mao

Technology on the Horizon

Current Reduction Phenomenon in Graphene-Based Device
Abstract
A current reduction phenomenon was observed in a back-gate graphene-based field effect transistor. The drain current ID became smaller in the next measurement even though the sweep range of the back gate bias VBG increased. We attribute this phenomenon to contaminants that are inevitably produced during device fabrication and may serve as trap centers at the electrode-graphene interface, weakening the extent of p-type doping by trapping electrons when VBG is positive.
Honghui Sun, Liang Fang
Dynamic Mapping Optimization for LSQ Soft Error Rate Reduction under 3D Integration Technology
Abstract
With the progress of integrated circuit technology, the soft error problem is getting worse and has become a challenge that researchers must face. 3D integration technology can stack several circuit layers in the vertical direction, and 3D chips have a shielding effect that can reduce the soft error rate of the inner circuit layers. In this paper, based on observations of load/store queue (LSQ) access behavior, we propose a dynamic mapping optimization method that uses 3D integration technology to reduce the soft error rate of the LSQ. Experimental results show that the proposed method significantly reduces the soft error rate, by 86.6% and 85.7% on average for the load queue and store queue, respectively.
Chao Song, Min-xuan Zhang
Backmatter
Metadata
Title
Computer Engineering and Technology
Edited by
Weixia Xu
Liquan Xiao
Jinwen Li
Chengyi Zhang
Zhenzhen Zhu
Copyright year
2015
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-662-45815-0
Print ISBN
978-3-662-45814-3
DOI
https://doi.org/10.1007/978-3-662-45815-0
