
2016 | Book

Computer Engineering and Technology

20th CCF Conference, NCCET 2016, Xi'an, China, August 10-12, 2016, Revised Selected Papers

Edited by: Weixia Xu, Liquan Xiao, Jinwen Li, Chengyi Zhang, Zhenzhen Zhu

Publisher: Springer Singapore

Book series: Communications in Computer and Information Science


About this book

This book constitutes the refereed proceedings of the 20th CCF Conference on Computer Engineering and Technology, NCCET 2016, held in Xi'an, China, in August 2016. The 21 full papers presented were carefully reviewed and selected from 120 submissions. They are organized in topical sections on processor architecture; application specific processors; computer application and software optimization; technology on the horizon.

Table of contents

Frontmatter

Processor Architecture

Frontmatter
Single/Double Precision Floating-Point Division and Square Root Unit Based on SRT-8 Algorithm
Abstract
To meet the precision requirements of different applications and to reduce latency for low-precision operations, a unified structure for IEEE-754 double-precision and SIMD single-precision floating-point division and square root based on the SRT-8 algorithm is introduced. Special instructions are designed, and an independent mantissa computing unit and normalization unit are implemented. Moreover, parallel adders and a QDS structure are adopted to hide the latency of the look-up table, fast addend generation is used to shorten the critical path, and “on-the-fly” conversion is employed to save area. Experimental results show that the proposed design achieves low latency and low hardware overhead.
Yuanxi Peng, Tingting He, Yuanwu Lei, Baozhou Zhu
Language-Extension-Based Vectorizing Compiling Scheme on SDR-DSP
Abstract
In this paper we propose a Language-Extension-based Vectorizing Compiling Scheme (LEVCS) for a newly developed DSP. The DSP is mainly designed for Software-Defined Radio (SDR) and is called SDR-DSP. The SDR-DSP architecture mixes the styles of VLIW (Very Long Instruction Word) and SIMD (Single Instruction Multiple Data). To exploit the potential of SDR-DSP and achieve high performance, vectorization is one of the essential methods. Because auto-vectorization techniques cannot satisfy the requirements of typical applications, LEVCS is used to direct the vectorization. The C-extending programming language used in LEVCS is called SDR-DSP-C. LEVCS uses flexible data reorganization to make vectorization on SDR-DSP more efficient. We use LEVCS to vectorize five benchmark kernels: Fast Fourier Transform (FFT), Finite Impulse Response filter (FIR), Infinite Impulse Response filter (IIR), dot product (Dotprod) and sum of vectors (vecsum). Experimental results show that LEVCS is functionally correct and achieves speedups of 2.883–8.074 compared to TI-DSPs.
Xiaoqiang Ni, Liu Yang, Chiyuan Ma
A Methodology for Performance Verification of Microprocessors
Abstract
The tested performance of a microprocessor chip is more important than the predicted performance of its model. However, performance deviations are often introduced during the design stages. In order to identify and fix performance defects, a hierarchical performance verification methodology is proposed. Parameter-sensitive performance models and coverage-driven stimuli are built at the unit level. Implementation-oriented performance calibration and RTL-simulation-based benchmarks are made at the core level. Prototyping and counter-based performance analysis systems are built at the system level. An example is given to demonstrate the application and effectiveness of the proposed methodology.
Yongwen Wang, Libo Huang, Zhong Zheng
A Novel L1 Cache Based on Volatile STT-RAM
Abstract
Spin-transfer torque random access memory (STT-RAM) is one of the most promising substitutes for universal main memory and cache due to its excellent scalability, high density and low leakage power. Nevertheless, the current non-volatile STT-RAM cache architecture also has drawbacks, such as long write latency and high write energy, which limit the application of STT-RAM in top-level cache design. To solve these problems, we relax the retention time of STT-RAM to explore its different write performance, and propose a novel L1 cache architecture implemented with volatile STT-RAM, together with its related refresh scheme. The performance of the proposed design is the same as that of an SRAM L1 cache, while its overall power consumption is only 63.8% of the latter's.
Zhang Hongguang, Zhang Minxuan
A New DVFS Algorithm Design for Multi-core Processor Chip
Abstract
With the development of CMOS process technology, more than 3 billion transistors are integrated on a chip. However, the increasing power density has become a serious problem that slows performance improvement. Therefore, optimizing the power consumption of multi-core processors is a pressing issue in processor design. This paper proposes a dual-threshold adaptive DVFS algorithm to dynamically control the processor voltage and frequency. Experimental results show that, compared with the traditional single-threshold algorithm, dual-threshold adaptive DVFS saves more power with no obvious performance reduction. The performance of most benchmarks stays above 90% of the original performance, while power is reduced by up to 35%.
Chengyi Zhang, Jiming Wang, Minxuan Zhang, Xiangdi Wu
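
The abstract does not describe the algorithm's internals. As a rough illustration of the general idea of a two-threshold controller, the following minimal sketch steps the frequency up when utilization exceeds an upper threshold, steps it down when utilization falls below a lower threshold, and holds it otherwise. The frequency levels, the threshold values and the names `LEVELS_MHZ`, `next_level` and `run` are all hypothetical and are not taken from the paper.

```python
# Hypothetical frequency levels in MHz (assumption, not from the paper).
LEVELS_MHZ = [800, 1200, 1600, 2000]

def next_level(level, utilization, low=0.30, high=0.80):
    """Return the next index into LEVELS_MHZ for a utilization in [0, 1].

    The low/high thresholds are illustrative assumptions.
    """
    if utilization > high and level < len(LEVELS_MHZ) - 1:
        return level + 1   # busy: raise frequency one step
    if utilization < low and level > 0:
        return level - 1   # idle: lower frequency one step
    return level           # between the thresholds: hold

def run(trace, level=0):
    """Apply the controller to a utilization trace; return the chosen frequencies."""
    chosen = []
    for utilization in trace:
        level = next_level(level, utilization)
        chosen.append(LEVELS_MHZ[level])
    return chosen

print(run([0.9, 0.9, 0.5, 0.1, 0.1, 0.1]))
```

The band between the two thresholds acts as hysteresis: a utilization that hovers around a single cut-off would otherwise toggle the level on every sample.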

Application Specific Processors

Frontmatter
A Novel Low-Power and High-PSNR Architecture Based on ARC for DCT/IDCT
Abstract
Discrete cosine transform (DCT) and its inverse (IDCT) play a key role in image and video systems. In this paper, we propose an efficient DCT/IDCT architecture based on adaptive recoding coordinate rotation digital computer (ARC), which has been validated on an FPGA platform. Compared to the state-of-the-art DCT, the proposed architecture dissipates 8.2% less power and improves PSNR by 3.21 dB while maintaining nearly the same area and speed. Compared with the newest DCT/IDCT architecture, it uses 37.6% fewer hardware resources, dissipates 31.6% less power, runs 2.15 times faster and improves PSNR slightly.
Yiliu Feng, Jianfeng Zhang, Hengzhu Liu
Microsecond-Level Temperature Variation of Logic Circuits and Influences of Infrared Cameras’ Parameters on Hardware Trojans Detection
Abstract
Hardware Trojans pose a serious threat to integrated circuit security. A novel approach that uses chips’ infrared radiation to detect Trojans on a time scale of seconds was proposed in 2014. However, the temperature differences between the normal areas of Trojan-free chips and the corresponding infected areas of chips with Trojans are distinguishable on a microsecond scale. As a result, second-level detection can adversely affect detection accuracy because the temperatures equalize. On the other hand, an infrared camera’s ability to detect Trojans is determined by three parameters: the noise equivalent temperature difference (NETD), the pixel size and the frame frequency. Determining the influence of these three parameters would benefit Trojan detection with infrared cameras. In this paper, we use finite element analysis to simulate the microsecond-level temperature variations of a silicon substrate of fixed pixel size while the logic circuits on the substrate vary and operate under different challenges. We find that, for a normal NETD, the time over which different cases are distinguishable is on a microsecond scale. Based on our simulation results, an increasing step size (ISS) approach is proposed to capture dies’ microsecond-level infrared maps accurately using low-frame-frequency infrared cameras. Finally, we analyze the temperature variations while a fixed logic circuit under a fixed challenge operates on silicon substrates of different pixel sizes. Based on the results, we obtain the relationship between NETD and pixel size in Trojan detection.
Yongkang Tang, Jianye Wang, Shaoqing Li, Jihua Chen, Binbin Yang
BFDir: A Space-Efficient Coherence Directory Based on Bloom Filter
Abstract
Directory-based coherence is widely used in modern CMP systems. As the number of cores increases, it is increasingly regarded as the only candidate for maintaining on-chip cache coherence. However, the limitations of traditional coherence directories pose serious challenges as system size keeps growing. Hardware overhead and redundant message broadcasting dramatically degrade the scalability and performance of the system. In this paper, a space-efficient coherence directory, BFDir, is proposed. It dramatically reduces the directory size because the share list is shortened by a Bloom filter. It also does not incur the message broadcasting found in limited directories. The evaluation results show that, for 32-core CMP systems, 59% of the share-list overhead can be avoided at the expense of a 2.77% average performance loss compared to a full-map directory; 22% of the share-list overhead can be avoided at the expense of a 0.16% average performance loss compared to a 16-bit coarse directory; and 48% of invalid messages are saved and performance is improved by 2.31% compared to an 8-bit coarse directory.
Jicheng Chen, Yaqian Zhao, Hongzhi Shi, Yihan Li
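
To illustrate why a Bloom filter can stand in for a per-core share list, here is a minimal sketch of a generic Bloom filter over core IDs. It is not the BFDir design, whose organization the abstract does not describe. The class name `BloomSharers`, the 16-bit vector, the two hash functions and the use of SHA-256 as the hash are all assumptions made for illustration.

```python
import hashlib

class BloomSharers:
    """Generic Bloom filter over core IDs (illustrative, not the BFDir layout)."""

    def __init__(self, bits=16, hashes=2):
        self.bits = bits      # width of the bit vector (assumed value)
        self.hashes = hashes  # number of hash functions (assumed value)
        self.vector = 0

    def _positions(self, core):
        # Derive one bit position per hash function from a salted digest.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{core}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.bits

    def add(self, core):
        """Record that `core` holds a copy of the cache line."""
        for p in self._positions(core):
            self.vector |= 1 << p

    def may_share(self, core):
        """True if `core` might hold a copy; False means it definitely does not."""
        return all((self.vector >> p) & 1 for p in self._positions(core))

    def candidates(self, n_cores):
        """Cores that must receive an invalidation (a superset of the true sharers)."""
        return [c for c in range(n_cores) if self.may_share(c)]

sharers = BloomSharers()
for core in (3, 17, 29):
    sharers.add(core)
print(sharers.candidates(32))
```

A Bloom filter never reports a false negative, so every true sharer is always in the candidate list; false positives only cost extra invalidation messages, which is the space versus traffic trade-off the abstract quantifies.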
FPGA-Based High Throughput TDMP LDPC Decoder
Abstract
In this paper, a high-throughput decoder architecture for quasi-cyclic low-density parity-check (QC-LDPC) codes is presented. Using the Normalized Min-Sum algorithm and the turbo-decoding message-passing (TDMP) algorithm, the proposed design increases the degree of parallelism to improve throughput at the cost of hardware resource usage. Based on the proposed architecture, we implemented an (8176, 7154) Euclidean geometry-based QC-LDPC code decoder on a Xilinx Kintex7 (XC7K325T-2) board. The FPGA implementation results show that the decoder achieves a total decoding throughput of 1.6 Gbps at a clock frequency of 105 MHz with 10 iterations.
Ruochen Liao, Yuzhuo Fu, Ting Liu
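
For readers unfamiliar with the Normalized Min-Sum algorithm named in the abstract, the following minimal sketch shows the standard check-node update: each outgoing message takes the product of the signs and the minimum magnitude of the other incoming messages, scaled by a normalization factor. The function name `check_node_update` and the factor `alpha=0.75` are assumptions. The abstract does not state the factor used in the paper, and this software loop says nothing about the paper's hardware architecture.

```python
def check_node_update(llrs, alpha=0.75):
    """Normalized Min-Sum update for one check node.

    llrs  -- incoming log-likelihood ratios, one per connected variable node
             (at least two entries)
    alpha -- normalization factor; 0.75 is an assumed, commonly used value
    """
    messages = []
    for i in range(len(llrs)):
        others = llrs[:i] + llrs[i + 1:]   # exclude the edge being updated
        sign = 1.0
        for value in others:
            if value < 0:
                sign = -sign               # product of the other signs
        magnitude = min(abs(value) for value in others)
        messages.append(alpha * sign * magnitude)
    return messages

print(check_node_update([2.0, -1.0, 4.0]))  # [-0.75, 1.5, -0.75]
```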
A Dynamic Multi-precision Fixed-Point Data Quantization Strategy for Convolutional Neural Network
Abstract
In recent years, deep learning, represented by the Convolutional Neural Network (CNN), has been one of the hottest research topics. Models based on the CNN inference process are used in more and more computer vision applications. The execution speed of the inference process is critical for applications, and hardware acceleration is the method most often considered. To relieve memory pressure, data quantization strategies are often used in hardware implementations. In this paper, a dynamic multi-precision fixed-point data quantization strategy for CNN is proposed and used to quantize the floating-point data in the trained CNN inference process. Results show that, when 8/4-bit quantization is used, our quantization strategy for the LeNet model can reduce the accuracy loss from 22.2% to 5.9% at most, compared with the previous static quantization strategy. When 16-bit quantization is used, our strategy introduces only 0.03% accuracy loss while halving the memory footprint and bandwidth requirement compared with a 32-bit floating-point implementation.
Lei Shan, Minxuan Zhang, Lin Deng, Guohui Gong
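
As a rough illustration of what dynamic fixed-point quantization can look like, the sketch below picks the number of fractional bits per group of values so that the largest magnitude still fits in the word, then rounds and saturates. This is a generic scheme and not necessarily the strategy proposed in the paper. The function names `choose_frac_bits`, `quantize` and `dequantize` are hypothetical.

```python
import math

def choose_frac_bits(values, word_bits):
    """Pick the fractional length so the largest magnitude fits in word_bits."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return word_bits - 1
    int_bits = math.floor(math.log2(max_abs)) + 1  # bits left of the binary point
    return word_bits - 1 - int_bits                # one bit is kept for the sign

def quantize(values, word_bits):
    """Return (integer codes, fractional length) for one group of values."""
    frac = choose_frac_bits(values, word_bits)
    scale = 2.0 ** frac
    lo = -(1 << (word_bits - 1))
    hi = (1 << (word_bits - 1)) - 1
    codes = [min(hi, max(lo, round(v * scale))) for v in values]  # round, saturate
    return codes, frac

def dequantize(codes, frac):
    """Map integer codes back to real values."""
    return [c / 2.0 ** frac for c in codes]

codes, frac = quantize([0.5, -0.25, 0.125], word_bits=8)
print(codes, frac)              # [64, -32, 16] 7
print(dequantize(codes, frac))  # [0.5, -0.25, 0.125]
```

Choosing the fractional length per group (for example per layer) rather than once for the whole network is what makes such a scheme dynamic: groups with small values get more fractional bits and therefore finer resolution.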

Computer Application and Software Optimization

Frontmatter
Optimization of Two Bottleneck Programs in SAR System on GPGPU
Abstract
Synthetic Aperture Radar (SAR) is a modern high-resolution microwave imaging radar that operates in all weather, day and night, to provide remote sensing and generate high-resolution images of the land illuminated by the radar beam. Unlike optical sensors, SAR requires post-processing of the acquired data to form the final image. In this article, we use General Purpose Graphics Processing Units (GPGPU) to accelerate two SAR algorithms, PGA (Phase Gradient Autofocus) and PDE (Partial Differential Equations), which are two computationally intensive algorithms in the post-processing stage of the system. Our work shows that the GPU architecture has different acceleration effects on the two algorithms. PGA achieves an acceleration of 21.7% and PDE achieves a speedup of 2.58× on GPGPU. We analyse the reasons for the results and conclude that the GPU is a promising platform for accelerating the SAR system.
Yang Zhang, Zuocheng Xing, Cang Liu, Chuan Tang, Lirui Chen, Qinglin Wang
A Channel-Level RAID5 Schema Based Physical Address in SSD
Abstract
Flash-based solid state disks (SSDs) are widely used for their high performance, low power and concurrency features. As storage capacity increases, the reliability problem of SSDs is becoming increasingly serious. In this paper, we implement a technique that constructs RAID-5 inside the SSD to enhance its reliability while maintaining its performance. First, we construct RAID-5 stripes based on SSD physical addresses, which means that no mapping tables are needed to store stripe information. Second, our schema constructs dynamic stripes with a log structure to solve the inherent small-write problem associated with conventional RAID-5. Third, because the data in a stripe are correlated, we perform garbage collection on stripe groups. Finally, we conduct extensive simulations using real-world traces and synthetic benchmarks in SSDsim [1]. The experimental results show that achieving inner-channel RAID-5 to improve SSD reliability costs less than 7% of the performance and 6% of the storage capacity of the SSD.
Ya Feng, Yuxuan Xing, Nong Xiao, Fang Liu
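
The reliability gain of RAID-5 rests on XOR parity: the parity page of a stripe is the XOR of its data pages, so any single lost page can be rebuilt from the survivors. The following minimal sketch shows only that property. It does not model the paper's physical-address stripe construction, log-structured dynamic stripes or garbage collection. The function names `xor_pages` and `rebuild` are hypothetical.

```python
def xor_pages(pages):
    """Byte-wise XOR of equally sized pages."""
    result = bytearray(len(pages[0]))
    for page in pages:
        for i, byte in enumerate(page):
            result[i] ^= byte
    return bytes(result)

def rebuild(stripe, parity, lost):
    """Recover the page at index `lost` from the other pages and the parity."""
    survivors = [page for i, page in enumerate(stripe) if i != lost]
    return xor_pages(survivors + [parity])

stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]  # three data pages
parity = xor_pages(stripe)                        # parity page of the stripe
print(rebuild(stripe, parity, lost=1) == stripe[1])  # True
```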
Unification Protection Design for a Certain Type of Vehicle-Borne Server
Abstract
The vehicle-borne environment strongly affects a vehicle-borne server during normal operation. In this paper, several measures are taken to keep the server working normally in this environment, including anti-vibration design, thermal design, electromagnetic compatibility design and noise reduction design. For each design, the methods and the main structure diagram are given. A vehicle-borne server product has been developed successfully according to this unified protection design method. The server has undergone many environmental tests to ensure that it works reliably.
Chengkuan Sun, Jianping Cai, Wanli Sha, Shangyong Liang
Monaural Speech Separation on Many Integrated Core Architecture
Abstract
Monaural speech separation is a challenging problem in practical audio analysis applications. Non-negative matrix factorization (NMF) is one of the most effective methods for solving this problem because it can learn meaningful features from a speech dataset in a supervised manner. Recently, a semi-supervised method, transductive NMF (TNMF), has shown great power in separating speech from different individuals by incorporating both training and testing data when learning the dictionary. However, both NMF-based and TNMF-based monaural speech separation approaches have high computational complexity, which prevents real-time processing. In this paper, we implement TNMF-based monaural speech separation on the many integrated core (MIC) architecture to meet the requirement of real-time speech separation. This approach exploits parallelism with OpenMP and performs the compute-intensive matrix manipulations on a MIC coprocessor. The experimental results confirm the efficiency of our implementation of monaural speech separation on the MIC architecture.
Wang He, Xu Weixia, Guan Naiyang, Yang Canqun
An AWGR-Based High Performance Optical Interconnect Architecture for Exascale Systems
Abstract
The next milestone objective of HPC is exascale computing, which involves millions of nodes in the system. One of the key barriers to realizing exascale computing is the fundamental challenge of communication networks. We propose a high-performance optical interconnect architecture based on the arrayed waveguide grating router (AWGR) with WDM wavelength routing. The inherent parallelism of AWGRs and multi-hop switching provide high network scalability. Theoretical analysis and simulation show that it performs better than the fat-tree architecture.
Shi Xu, Lei Zhang, Zhiling Li
Accelerating Nyström Kernel Independent Component Analysis with Many Integrated Core Architecture
Abstract
Kernel independent component analysis (KICA) penalizes the correlations among components in a reproducing kernel Hilbert space (RKHS) and performs well in many practical tasks such as speech separation due to its robustness to varying source distributions. Recently, Nyström-KICA (NKICA) incorporated a low-rank approximation and a low-complexity sampling method to reduce the computational complexity of KICA. In this paper, we show that the computational cost of NKICA can be further decreased by implementing the algorithm on the many integrated core (MIC) architecture to meet the requirement of large-scale data processing. In particular, we parallelize the critical segments with OpenMP and perform the intensive matrix manipulations on a MIC coprocessor. This MIC-based approach has been evaluated on both a simulated dataset and the TIMIT dataset. The experimental results confirm the efficiency of our implementation of NKICA on the MIC architecture and show that it achieves a consistent speedup of around 10 on average, and of 12.3 at best, compared with execution on a single CPU.
Lei Shan, He Wang, Weixia Xu, Canqun Yang, Minxuan Zhang

Technology on the Horizon

Frontmatter
A High-Radix Switch Architecture Based on Silicon Photonic and 3D Integration
Abstract
The design of high-radix switch chips is becoming a challenging research field in EHPC (Exascale High-Performance Computing). Recent developments in silicon photonic and 3D integration technologies have inspired new methods of designing high-radix switch chips. In this paper, we propose a high-radix switch architecture called Graphein, which improves the radix and bandwidth while lowering the power consumption of switch chips through 3D integration and silicon photonic technology. The simulation results show that the average latencies under both random and hotspot patterns are less than 10 cycles, and the throughput under the random pattern is more than 95%. Compared to the hi-rise architecture, the proposed architecture ensures that packets from different source ports receive fairer service, thereby yielding a more concentrated latency distribution. In addition, the power consumption of the Graphein chip is about 19.2 W, which satisfies the power constraint on a high-radix switch chip.
Jian Jie, Xiao Liquan, Lai Mingche, Xu Shi
A Radiation Hardening Algorithm on 2nd Order CDR
Abstract
A radiation hardening algorithm named state-conservation for 2nd order clock and data recovery (CDR) systems is presented in this paper. The proposed algorithm is used to resist single event transients (SET) in the CDR tracking loop. A MATLAB model is established to quickly evaluate the sensitive positions of the system. A circuit model of a 5 Gbps half-rate CDR together with the hardening algorithm is set up in the Cadence design environment to verify the effect of the proposed algorithm. The simulation results show that SET does not lead to any data errors and that no loop delay is added. Compared to the RHBD standard-cell technique, the hardening algorithm saves about 15.3% in area and reduces power consumption by about 47.8%.
Hu Chunmei, Chen Shuming, Liu Yao, Chen Jianjun, Xu Jingyan
Sub-threshold Performance Driven Choice in Tunneling CNFETs
Abstract
The working mechanism of Tunneling Carbon Nanotube Field Effect Transistors (TCNFETs) is first analyzed by defining the sub-threshold plunging voltage and subdividing the sub-threshold region into a Band-To-Band Tunneling (BTBT) burst region, a BTBT sharp region and a BTBT smooth region. Then, the effects of device parameters such as source/drain doping level, oxide thickness and working voltage on the transfer characteristics are studied, with particular attention to the effect of the BTBT burst region. In conclusion, a reference flow for choosing device parameters and the corresponding criteria are presented. Research results show that: (1) the BTBT burst region makes a non-negligible contribution to the sub-threshold slope; (2) proper device parameters contribute to an ultra-low sub-threshold slope; (3) BTB tunneling at the channel-drain interface has a negative effect on device performance, which cannot even be suppressed for TCNFETs with a small enough energy gap.
Hailiang Zhou, Xiantuo Tang, Minxuan Zhang, Yue Hao
A Novel Separated Pre-discharging Sense Amplifier for STT-MRAM
Abstract
This paper presents a novel sense amplifier for Spin Transfer Torque Magnetic Random Access Memory (STT-MRAM), named the Separated Pre-discharging Sense Amplifier (SPDSA). By inverting the pre-charging path of the Separated Pre-charging Sense Amplifier (SPCSA) into a pre-discharging path, a pair of inverters that are used to transfer the voltage can be eliminated, and thus the area overhead of SPCSA is reduced. We develop a compact magnetoresistance model for the MTJ to perform hybrid CMOS/magnetic HSPICE simulations. Based on 45 nm CMOS technology, simulation results show that, compared with SPCSA, SPDSA reduces power consumption by 35.6% and improves read reliability by 29%.
Huan Li, Zhenyu Zhao, Quan Deng, Peng Li, Haoyue Tang, Lianhua Qu
Dynamic Response Characteristics of the PCB Under Thermo-Acoustic Load
Abstract
The thermal buckling temperature of the PCB is very low. The dynamic response characteristics of the PCB differ greatly between the pre-buckling and post-buckling states. The snap-through motion between multiple post-buckled equilibrium positions introduces a high level of alternating stress, which reduces the fatigue life of the structure. The vibration equation for the PCB under thermo-acoustic load is derived in this paper. The thermal post-buckling equilibrium path is solved using the finite element method. The effect of the thermo-acoustic load on the dynamic response is analyzed by studying how the dynamic response characteristics of the PCB differ between pre-buckling and post-buckling. The conclusions provide a reference for stochastic dynamics calculations that consider thermal buckling and for predicting the PCB fatigue life under thermo-acoustic load. Furthermore, the work lays a foundation for structural optimization aimed at increasing the fatigue life of the PCB.
Cunxian Cao, Jiangfeng Huang, Daoqing Qu, Miao Zhang
Backmatter
Metadata
Title
Computer Engineering and Technology
Edited by
Weixia Xu
Liquan Xiao
Jinwen Li
Chengyi Zhang
Zhenzhen Zhu
Copyright year
2016
Publisher
Springer Singapore
Electronic ISBN
978-981-10-3159-5
Print ISBN
978-981-10-3158-8
DOI
https://doi.org/10.1007/978-981-10-3159-5
