Skip to main content

Über dieses Buch

This book contains extended and revised versions of the best papers presented at the 20th IFIP WG 10.5/IEEE International Conference on Very Large Scale Integration, VLSI-SoC 2012, held in Santa Cruz, CA, USA, in October 2012. The 12 papers included in the book were carefully reviewed and selected from the 33 full papers presented at the conference. The papers cover a wide range of topics in VLSI technology and advanced research. They address the current trend toward increasing chip integration and technology process advancements bringing about stimulating new challenges both at the physical and system-design levels, as well as in the test of these systems.



FPGA-Based High-Speed Authenticated Encryption System

The Advanced Encryption Standard (AES) running in the Galois/Counter Mode of Operation represents a de facto standard in the field of hardware-accelerated, block-cipher-based high-speed authenticated encryption (AE) systems. We propose hardware architectures supporting the Ethernet standard IEEE 802.3ba utilizing different cryptographic primitives suitable for AE applications. Our main design goal was to achieve high throughput on FPGA platforms. Compared to previous works aiming at data rates beyond 100 Gbit/s, our design makes use of an alternative block cipher and an alternative mode of operation, namely Serpent and the offset codebook mode of operation, respectively. Using four cipher cores for the encryption part of the AE architecture, we achieve a throughput of 141 Gbit/s on an Altera Stratix IV FPGA. The design requires 39 kALMs and runs at a maximum clock frequency of 275 MHz. This represents, to the best of our knowledge, the fastest full implementation of an AE scheme on FPGAs to date. In order to make the design applicable in a real-world environment, we developed a custom-designed printed circuit board for the Stratix IV FPGA, suitable to process data with up to 100 Gbit/s.

Michael Muehlberghuber, Christoph Keller, Frank K. Gürkaynak, Norbert Felber

A Smart Memory Accelerated Computed Tomography Parallel Backprojection

As nanoscale lithography challenges mandate greater pattern regularity and commonality for logic and memory circuits, new opportunities are created to affordably synthesize more powerful smart memory blocks for specific applications. Leveraging the ability to embed logic inside the memory block boundary, we demonstrate the synthesis of smart memory architectures that exploits the inherent memory address patterns of the backprojection algorithm to enable efficient parallel image reconstruction at minimum hardware overhead. An end-to-end design framework in sub-20nm CMOS technologies was constructed for the physical synthesis of smart memories and evaluation of the huge design space. Our experimental results show that customizing memory for the computerized tomography (CT) parallel backprojection can achieve more than 30% area and power savings while offering significant performance improvements with marginal sacrifice of image accuracy.

Qiuling Zhu, Larry Pileggi, Franz Franchettis

Trinocular Stereo Vision Using a Multi Level Hierarchical Classification Structure

A real-time trinocular stereo vision processor is proposed which combines a window matching architecture with a classification architecture. A pair wise segmented window matching for both the center-right and center-left image pairs as their scaled down image pairs is performed. The resulting cost functions are combined which results into nine different cost curves. A multi level hierarchical classifier is used to select the most promising disparity value. The classifier makes use of features provided by the calculated cost curves and the pixels’ spatial neighborhood information. Evaluation and classifier training has been performed using an indoor dataset. The system is prototyped on an FPGA board equipped with three CMOS cameras. Special care has been taken to reduce the latency and the memory footprint.

Andy Motten, Luc Claesen, Yun Pan

Spatially-Varying Image Warping: Evaluations and VLSI Implementations

Spatially-varying, non-linear image warping has gained growing interest due to the appearance of image domain warping applications such as aspect ratio retargeting or stereo remapping/stereo-to-multiview conversion. In contrast to the more common global image warping, e.g., zoom or rotation, the image transformation is now a spatially-varying mapping that, in principle, enables arbitrary image transformations. A practical constraint is that transformed pixels keep their relative ordering, i.e., there are no fold-overs. In this work, we analyze and compare spatially-varying image warping techniques in terms of quality and computational performance. In particular, aliasing artifacts, interpolation quality (sharpness), number of arithmetical operations, and memory bandwidth requirements are considered. Further, we provide an architecture based on Gaussian filtering and an architecture with bicubic interpolation and compare corresponding VLSI implementations.

Pierre Greisen, Michael Schaffner, Danny Luu, Val Mikos, Simon Heinzle, Frank K. Gürkaynak, Aljoscha Smolic

An Ultra-Low-Power Application-Specific Processor with Sub-VT Memories for Compressed Sensing

Compressed sensing (CS) is a universal low-complexity data compression technique for signals that have a sparse representation in some domain. While CS data compression can be done both in the analog- and digital domain, digital implementations are often used on low-power sensor nodes, where an ultra-low-power (ULP) processor carries out the algorithm on Nyquist-rate sampled data. In such systems an energy-efficient implementation of the CS compression kernel is a vital ingredient to maximize battery lifetime. In this paper, we propose an application-specific instruction-set processor (ASIP) processor that has been optimized for CS data compression and for operation in the subthreshold (sub-V


) regime. The design is equipped with specific sub-V


capable standard-cell based memories, to enable low-voltage operation with low leakage. Our results show that the proposed ASIP accomplishes 62× speed-up and 11.6× power savings with respect to a straightforward CS implementation running on the baseline low-power processor without instruction set extensions.

Jeremy Constantin, Ahmed Dogan, Oskar Andersson, Pascal Meinerzhagen, Joachim Rodrigues, David Atienza, Andreas Burg

Configurable Low-Latency Interconnect for Multi-core Clusters

Shared L1 memories are of interest for tightly-coupled processor clusters in programmable accelerators as they provide a convenient shared memory abstraction while avoiding cache coherence overheads. The performance of a shared-L1 memory critically depends on the architecture of the low-latency interconnect between processors and memory banks, which needs to provide ultra-fast access to the largest possible L1 working set. The advent of 3D technology provides new opportunities to improve the interconnect delay and the form factor. In this chapter we propose a network architecture, 3D-LIN, based on 3D integration technology. The network can be configured based on user specifications and technology constraints to provide fast access to L1 memories on multiple stacked dies. The extracted results from the physical synthesis of 3D-LIN permit to explore trade-offs between memory size and network latency from a planar design to multiple memory layers stacked on top of logic, evaluating the improvement in both form factor and latency.

Giulia Beanato, Igor Loi, Giovanni De Micheli, Yusuf Leblebici, Luca Benini

A Hexagonal Processor and Interconnect Topology for Many-Core Architecture with Dense On-Chip Networks

Network-on-Chips (NoCs) are used to connect large numbers of processors in many-core processor architecture because they perform better than less scalable methods such as global shared buses. Among all NoC design parameters, NoC topologies define how nodes are placed and connected and greatly affect the performance, energy efficiency, and circuit area of many-core processor arrays. Due to its simplicity and the fact that processor tiles are traditionally square or rectangular, 2D mesh is mostly used for existing on-chip networks. However, efficiently mapping applications can be a challenge for cases that require communication between processors that are not adjacent on the 2D mesh. Motivated by the fact that applications often have largely localized communication patterns, we have proposed an 8-neighbor mesh topology and a 6-neighbor topology with hexagonal-shaped processor tiles, both of which increase local connectivity while keep much of the simplicity of a mesh-based topology. We have physically designed a 16-bit DSP processor and the corresponding processor arrays which utilize all three topologies. A 1080p H.264/AVC residual video encoder and a 54 Mbps 802.11a/11g OFDM wireless LAN baseband receiver are mapped onto all topologies. The 6-neighbor hexagonal grid topology incurs a 2.9% area increase per tile compared to the 4-neighbor 2D mesh, but its much more effective inter-processor interconnect yields an average total application area reduction of 21%, an average power reduction of 17%, and a total application inter-processor communication distance reduction of 19%.

Zhibin Xiao, Bevan Baas

Fault-Tolerant Techniques to Manage Yield and Power Constraints in Network-on-Chip Interconnections

The use of fault-tolerant mechanism is essential to ensure the correct functionality of integrated circuits after manufacturing due to the massive number of faults that may occur during the process. In this work, we propose a set of fault-tolerant techniques to cope with faulty wires in Network-on-Chip (NoC). The most appropriate technique is chosen by taking into account the number of faulty wires and their location in the NoC. The goal is to combine different techniques to reduce overheads in area, delay and power. The use of testing and diagnosis can minimize costs associated with embedded fault-tolerant mechanisms once the architecture adapts itself to work in different faulty scenarios. The proposed fault-tolerant strategy uses a lightweight adaptive routing combined with data splitting, which is able to send the data in one clock cycle. The power penalty has a low correlation with the number of faulty interconnections. Results for MPEG4 and VOPD applications running on the NoC with different faulty case-study scenarios show that the proposed techniques can tolerate many faulty interconnections with a low area, performance and power overheads.

Anelise Kologeski, Caroline Concatto, Fernanda Lima Kastensmidt, Luigi Carro

On the Automatic Generation of Software-Based Self-Test Programs for Functional Test and Diagnosis of VLIW Processors

Software-Based Self-Test (SBST) approaches have shown to be an effective solution to detect permanent faults, both at the end of the production process, and during the operational phase. However, when Very Long Instruction Word (VLIW) processors are addressed these techniques require some optimization steps in order to properly exploit the parallelism intrinsic in these architectures. In this chapter we present a new method that, starting from previously known algorithms, automatically generates an effective test program able to still reach high fault coverage on the VLIW processor under test, while minimizing the test duration and the test code size. Moreover, using this method, a set of small SBST programs can be generated aimed at the diagnosis of the VLIW processor. Experimental results gathered on a case study show the effectiveness of the proposed approach.

Davide Sabena, Luca Sterpone, Matteo Sonza Reorda

SEU-Aware Low-Power Memories Using a Multiple Supply Voltage Array Architecture

Electric devices should be resilient because reliability issues are increasingly problematic as technology scales down and the supply voltage is lowered. Specifically, the Soft-Error Rate (SER) increases due to the reduced feature size and the reduced charge. This paper describes an adaptive method to lower memory power using a dual



in a column-based



memory with Built-In Current Sensors (BICS). Using our method, we reduce the memory power by about 40% and increase the error immunity of the memory without the significant power overhead as in previous methods.

Seokjoong Kim, Matthew R. Guthaus

CMOS Implementation of Threshold Gates with Hysteresis

NULL Convention Logic (NCL) is one of the mainstream asynchronous logic design paradigms. NCL circuits use threshold gates with hysteresis. In this chapter, the transistor-level CMOS design of NCL gates is investigated, and various gate styles are introduced and compared to each other. In addition, a novel approach to design static NCL gates is introduced. The new approach is based on integrating each pair of pull-up and pull-down transistor networks into one composite transistor network. The new static gates are then compared to the original ones in terms of delay, area, and energy consumption. It will be shown that the new gate style is significantly faster with negligible area and energy overhead.

Farhad A. Parsan, Scott C. Smith

Simulation and Experimental Characterization of a Unified Memory Device with Two Floating-Gates

The operation of a novel unified memory device using two floating-gates is described through experimental characterization of a fabricated proof-of-concept device and confirmed through simulation. The dynamic, nonvolatile, and concurrent modes of the device are described in detail. Simulations show that the device compares favorably to conventional memory devices. Applications enabled by this unified memory device are discussed, highlighting the dramatic impact this device could have on next generation memory architectures.

Neil Di Spigna, Daniel Schinke, Srikant Jayanti, Veena Misra, Paul Franzon


Weitere Informationen

Premium Partner

Neuer Inhalt

BranchenIndex Online

Die B2B-Firmensuche für Industrie und Wirtschaft: Kostenfrei in Firmenprofilen nach Lieferanten, Herstellern, Dienstleistern und Händlern recherchieren.



Product Lifecycle Management im Konzernumfeld – Herausforderungen, Lösungsansätze und Handlungsempfehlungen

Für produzierende Unternehmen hat sich Product Lifecycle Management in den letzten Jahrzehnten in wachsendem Maße zu einem strategisch wichtigen Ansatz entwickelt. Forciert durch steigende Effektivitäts- und Effizienzanforderungen stellen viele Unternehmen ihre Product Lifecycle Management-Prozesse und -Informationssysteme auf den Prüfstand. Der vorliegende Beitrag beschreibt entlang eines etablierten Analyseframeworks Herausforderungen und Lösungsansätze im Product Lifecycle Management im Konzernumfeld.
Jetzt gratis downloaden!