ABSTRACT
Long Short-Term Memory (LSTM) networks are widely used in speech recognition. To achieve higher prediction accuracy, machine learning scientists have built increasingly larger models. Such large models are both computation intensive and memory intensive; deploying them results in high power consumption and drives up the total cost of ownership (TCO) of a data center. To speed up prediction and make it energy efficient, we first propose a load-balance-aware pruning method that compresses the LSTM model size by 20x (10x from pruning and 2x from quantization) with negligible loss of prediction accuracy. The pruned model is amenable to parallel processing; a minimal sketch of the pruning idea appears below. Next, we propose a scheduler that encodes and partitions the compressed model across multiple processing elements (PEs) for parallelism and schedules the complicated LSTM data flow. Finally, we design the hardware architecture, named the Efficient Speech Recognition Engine (ESE), which operates directly on the sparse LSTM model.
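The core of load-balance-aware pruning is to prune each PE's share of a weight matrix to the same sparsity, so the nonzeros (and hence the work) are evenly distributed across PEs. The sketch below illustrates this in NumPy, assuming row-interleaved assignment of matrix rows to PEs and simple magnitude-based pruning; the function name and parameters are ours for illustration, not from the paper.

```python
import numpy as np

def load_balance_aware_prune(W, num_pes=32, sparsity=0.9):
    """Minimal sketch of load-balance-aware pruning (assumptions ours):
    rows of W are interleaved across `num_pes` PEs, and each PE's
    submatrix is pruned to the same sparsity, so every PE ends up with
    the same number of nonzeros and none sits idle during the sparse
    matrix-vector multiplications that dominate LSTM inference."""
    W = W.copy()
    for pe in range(num_pes):
        sub = W[pe::num_pes, :]                      # rows owned by this PE
        keep = int(round((1.0 - sparsity) * sub.size))
        if keep <= 0:
            W[pe::num_pes, :] = 0.0
            continue
        flat = np.abs(sub).ravel()
        # magnitude threshold that keeps the `keep` largest weights
        thresh = np.partition(flat, flat.size - keep)[flat.size - keep]
        W[pe::num_pes, :] = sub * (np.abs(sub) >= thresh)
    return W

# Usage: prune a 1024x512 weight matrix to 90% sparsity, balanced over 32 PEs.
# With distinct weight magnitudes (no ties), every PE keeps the same count.
W = np.random.randn(1024, 512).astype(np.float32)
W_pruned = load_balance_aware_prune(W, num_pes=32, sparsity=0.9)
assert all(np.count_nonzero(W_pruned[pe::32, :]) ==
           np.count_nonzero(W_pruned[0::32, :]) for pe in range(32))
```

Because every PE holds an equal number of nonzeros, the sparse matrix-vector product finishes in the same number of cycles on each PE, which is what makes the pruned model friendly to parallel hardware.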
Implemented on a Xilinx KU060 FPGA running at 200 MHz, ESE achieves 282 GOPS working directly on the sparse LSTM network, corresponding to 2.52 TOPS on the dense one, and processes a full LSTM for speech recognition with a power dissipation of 41 W. Evaluated on the LSTM speech recognition benchmark, ESE is 43x and 3x faster than Core i7-5930K CPU and Pascal Titan X GPU implementations, respectively, and achieves 40x and 11.5x higher energy efficiency than the CPU and GPU.
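For intuition on how the sparse and dense throughput figures relate (the arithmetic here is ours, not from the paper): with roughly 10x compression from pruning, each operation retained in the sparse model stands in for about nine dense operations, so

```latex
% Back-of-the-envelope check (our arithmetic):
% ~10x pruning => each sparse op replaces roughly 9 dense ops
282\ \mathrm{GOPS_{sparse}} \times 8.9\ \tfrac{\mathrm{dense\ ops}}{\mathrm{sparse\ op}}
  \approx 2.52\ \mathrm{TOPS_{dense\text{-}equivalent}}
```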