ABSTRACT
This work exploits the tolerance of Deep Neural Networks (DNNs) to reduced-precision numerical representations and, specifically, their recently demonstrated ability to tolerate a different representation precision per layer while maintaining accuracy. This flexibility enables improvements over conventional DNN implementations, which use a single, uniform representation. This work proposes Proteus, which reduces the data traffic and storage footprint DNNs need, resulting in reduced energy and improved area efficiency for DNN implementations. Proteus uses a different representation per layer for both the data (neurons) and the weights (synapses) processed by DNNs. Proteus is a layered extension over existing DNN implementations: it converts between the numerical representation used by the DNN execution engines and the shorter, layer-specific fixed-point representation used when reading and writing data values to memory, be it on-chip buffers or off-chip memory. Proteus uses a novel memory layout for DNN data, enabling a simple, low-cost, and low-energy conversion unit.
We evaluate Proteus as an extension to a state-of-the-art accelerator [7] that uses a uniform 16-bit fixed-point representation. On five popular DNNs, Proteus reduces data traffic among layers by 43% on average while keeping accuracy within 1% even of a single-precision floating-point implementation. As a result, Proteus improves energy by 15% with no performance loss. Proteus also reduces the data footprint by at least 38%, and hence the amount of on-chip buffering needed, resulting in an implementation that requires 20% less area overall. These area savings can be used to reduce cost by building smaller chips, to process larger DNNs within the same on-chip area, or to incorporate an additional three execution engines, increasing peak compute bandwidth by 18%.
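To make the per-layer conversion concrete, below is a minimal software sketch of the narrowing Proteus performs in hardware between the execution engines and memory. It is illustrative only: the layer table, the particular bit widths, and the round-to-nearest, saturating policy are assumptions made for this sketch, and the paper's actual design additionally relies on its custom memory layout to keep the pack/unpack unit simple.

```python
import numpy as np

# Hypothetical per-layer formats: (total bits incl. sign, fraction bits).
# These widths are placeholders; the paper profiles each network to pick
# the shortest per-layer format that preserves accuracy.
LAYER_FORMATS = {
    "conv1": (10, 6),
    "conv2": (8, 5),
    "fc1":   (9, 4),
}

def to_layer_fixed(values, layer):
    """Convert values from the engine's representation to the shorter,
    layer-specific fixed-point codes used for memory traffic.
    Rounds to nearest and saturates to the two's-complement range."""
    bits, frac = LAYER_FORMATS[layer]
    scale = 2.0 ** frac
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return np.clip(np.round(values * scale), lo, hi).astype(np.int32)

def from_layer_fixed(codes, layer):
    """Expand layer-specific codes back to the uniform representation
    expected by the execution engines (e.g., 16-bit fixed point)."""
    _, frac = LAYER_FORMATS[layer]
    return codes.astype(np.float32) / (2.0 ** frac)

# Example round trip. A layer that tolerates 9 of 16 bits cuts its
# memory traffic by 1 - 9/16 ~= 44%, in line with the 43% average above.
acts = np.array([0.71, -1.30, 2.05], dtype=np.float32)
codes = to_layer_fixed(acts, "fc1")
print(acts, "->", from_layer_fixed(codes, "fc1"))
```

In hardware, the equivalent of `to_layer_fixed` sits on the write path to the buffers and DRAM and its inverse on the read path, so the compute datapath itself is unchanged.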
REFERENCES
- AMD. AMD Graphics Cores Next (GCN) Architecture. Whitepaper. https://www.amd.com/Documents/GCN_Architecture_whitepaper.pdf, 2012.
- S. Anwar, K. Hwang, and W. Sung. Fixed point optimization of deep convolutional neural networks for object recognition. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1131--1135, Apr. 2015.
- K. Asanovic and N. Morgan. Using simulations of reduced precision arithmetic to design a neuro-microprocessor. Journal of VLSI Signal Processing, pages 33--44, 1993.
- I. Buck. NVIDIA's next-gen Pascal GPU architecture to provide 10X speedup for deep learning apps. http://blogs.nvidia.com/blog/2015/03/17/pascal/, 2015.
- M. Burrows and D. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Systems Research Center, 1994.
- T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, 2014.
- Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam. DaDianNao: A machine-learning supercomputer. In 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 609--622, Dec. 2014.
- D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. Mitosis detection in breast cancer histology images with deep neural networks. In MICCAI, 2013.
- M. Courbariaux, Y. Bengio, and J. David. Low precision arithmetic for deep learning. CoRR, abs/1412.7024, 2014.
- G. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30--42, Jan. 2012.
- Z. Deng, C. Xu, Q. Cai, and P. Faraboschi. Reduced-precision memory value approximation for deep learning. 2015.
- H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger. Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA '11, pages 365--376, New York, NY, USA, 2011. ACM.
- R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524, 2013.
- S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan. Deep learning with limited numerical precision. CoRR, abs/1502.02551, 2015.
- S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient inference engine on compressed deep neural network. CoRR, abs/1602.01528, 2016.
- S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. CoRR, abs/1510.00149, 2015.
- J. Holt and T. Baker. Back propagation simulations using limited precision calculations. In IJCNN-91-Seattle International Joint Conference on Neural Networks, volume 2, pages 121--126, July 1991.
- J. L. Holt and J.-N. Hwang. Finite precision error analysis of neural network hardware implementations. IEEE Transactions on Computers, 42:281--290, 1993.
- Y. Jia. Caffe model zoo. https://github.com/BVLC/caffe/wiki/Model-Zoo, 2015.
- Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
- P. Judd, J. Albericio, T. Hetherington, T. Aamodt, N. Enright Jerger, R. Urtasun, and A. Moshovos. Reduced-precision strategies for bounded memory in deep neural nets. arXiv:1511.05236 [cs], Nov. 2015.
- J. Kim, K. Hwang, and W. Sung. X1000 real-time phoneme recognition VLSI using feed-forward deep neural networks. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7510--7514, May 2014.
- A. Krizhevsky. cuda-convnet: High-performance C++/CUDA implementation of convolutional neural networks. https://code.google.com/p/cuda-convnet/.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097--1105. Curran Associates, Inc., 2012.
- D. Larkin and A. Kinane. Towards hardware acceleration of neuroevolution for multimedia processing applications on mobile devices.
- Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278--2324, Nov. 1998.
- M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
- N. Muralimanohar and R. Balasubramonian. CACTI 6.0: A tool to understand large caches.
- G. Pekhimenko, V. Seshadri, O. Mutlu, P. B. Gibbons, M. A. Kozuch, and T. C. Mowry. Base-delta-immediate compression: Practical data compression for on-chip caches. In Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques, PACT '12, pages 377--388, New York, NY, USA, 2012. ACM.
- M. Poremba, S. Mittal, D. Li, J. Vetter, and Y. Xie. DESTINY: A tool for modeling emerging 3D NVM and eDRAM caches. In Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1543--1546, Mar. 2015.
- R. Presley and R. Haggard. A fixed point implementation of the backpropagation learning algorithm. In Proceedings of the 1994 IEEE Southeastcon, pages 136--138, Apr. 1994.
- P. Rosenfeld, E. Cooper-Balis, and B. Jacob. DRAMSim2: A cycle accurate memory system simulator. IEEE Computer Architecture Letters, 10(1):16--19, Jan. 2011.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. arXiv:1409.0575 [cs], Sept. 2014.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 2015.
- F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu. 1-bit stochastic gradient descent and application to data-parallel distributed training of speech DNNs. In Interspeech 2014, Sept. 2014.
- A. Strey and N. Avellana. A new concept for parallel neurocomputer architectures, 1996.
- Synopsys. Design Compiler. http://www.synopsys.com/Tools/Implementation/RTLSynthesis/DesignCompiler/Pages/default.aspx.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
- Y. Xie and M. A. Jabri. Training algorithms for limited precision feedforward neural networks. Technical report, 1991.