Published in: Knowledge and Information Systems 11/2020

19-07-2020 | Regular paper

Fast LSTM by dynamic decomposition on cloud and distributed systems

Authors: Yang You, Yuxiong He, Samyam Rajbhandari, Wenhan Wang, Cho-Jui Hsieh, Kurt Keutzer, James Demmel

Abstract

Long short-term memory (LSTM) is a powerful deep learning technique that has been widely used in real-world data-mining applications such as language modeling and machine translation. In this paper, we aim to minimize the latency of LSTM inference on cloud systems without losing accuracy. If an LSTM model does not fit in cache, the latency due to data movement will likely exceed the latency due to computation; in this case, we reduce the number of model parameters. If, as in most applications we consider, the LSTM model fits in the cache of the cloud server's processors, we instead focus on reducing the number of floating-point operations, which has a linear impact on inference latency. Thus, our system dynamically reduces either model parameters or flops, depending on which contributes more to latency. Our inference system is based on singular value decomposition (SVD) and canonical polyadic (CP) decomposition, and it delivers both high accuracy and low latency. We evaluate it on models from a series of real-world applications, including language modeling, computer vision, question answering, and sentiment analysis; users can either start from pre-trained models or train from scratch. Our system achieves a 15× average speedup across six real-world applications without losing inference accuracy. We also design and implement a distributed optimization system with dynamic decomposition, which significantly reduces energy cost and accelerates the training process.
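
To illustrate the flop and parameter reduction the abstract describes, the sketch below (not the authors' implementation; the matrix sizes, the chosen rank, and all variable names are illustrative assumptions) applies a truncated SVD to an LSTM layer's stacked gate weight matrix and compares the cost and accuracy of the original and factored matrix-vector products.

```python
# Minimal sketch: SVD-based low-rank factorization of an LSTM weight matrix.
# Sizes and rank are assumed for illustration only.
import numpy as np

hidden, inp, rank = 1024, 1024, 128      # rank << hidden controls the speedup
W = np.random.randn(4 * hidden, inp)     # stacked LSTM gate weights (i, f, g, o)

# Truncated SVD: W ≈ (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * s[:rank]               # shape (4*hidden, rank)
B = Vt[:rank, :]                         # shape (rank, inp)

x = np.random.randn(inp)
y_full = W @ x                           # original matvec: 4*hidden*inp multiplies
y_low = A @ (B @ x)                      # factored matvec: rank*(4*hidden + inp) multiplies

print("relative error:", np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full))
print("parameter ratio:", (A.size + B.size) / W.size)
```

With these assumed sizes, the factored form stores roughly 0.16× the parameters and performs proportionally fewer multiplies per inference step; a random matrix has no low-rank structure, so the approximation error shown here is pessimistic compared with trained LSTM weights.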


Metadata

Title: Fast LSTM by dynamic decomposition on cloud and distributed systems
Authors: Yang You, Yuxiong He, Samyam Rajbhandari, Wenhan Wang, Cho-Jui Hsieh, Kurt Keutzer, James Demmel
Publication date: 19-07-2020
Publisher: Springer London
Published in: Knowledge and Information Systems, Issue 11/2020
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI: https://doi.org/10.1007/s10115-020-01487-8
