Published in: Knowledge and Information Systems 11/2020

19-07-2020 | Regular paper

Fast LSTM by dynamic decomposition on cloud and distributed systems

Authors: Yang You, Yuxiong He, Samyam Rajbhandari, Wenhan Wang, Cho-Jui Hsieh, Kurt Keutzer, James Demmel

Abstract

Long short-term memory (LSTM) is a powerful deep learning technique that has been widely used in real-world data-mining applications such as language modeling and machine translation. In this paper, we aim to minimize the latency of LSTM inference on cloud systems without losing accuracy. If an LSTM model does not fit in cache, the latency due to data movement will likely exceed the latency due to computation; in this case, we reduce the number of model parameters. If, as in most applications we consider, the LSTM model fits in the cache of the cloud server's processors, we instead focus on reducing the number of floating-point operations, which has a linear impact on inference latency. Thus, our system dynamically reduces either model parameters or flops, depending on which contributes more to latency. Our inference system is based on singular value decomposition (SVD) and canonical polyadic (CP) decomposition, and it delivers both high accuracy and low latency. We evaluate it on models from a series of real-world applications, including language modeling, computer vision, question answering, and sentiment analysis; users can either start from pre-trained models or train from scratch. Our system achieves a 15× average speedup across six real-world applications without losing inference accuracy. We also design and implement a distributed optimization system with dynamic decomposition, which significantly reduces energy cost and accelerates the training process.
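
To illustrate the flop and parameter reduction the abstract describes, the sketch below (not the authors' implementation; the matrix sizes, the chosen rank, and all variable names are illustrative assumptions) applies a truncated SVD to an LSTM layer's stacked gate weight matrix and compares the cost and accuracy of the original and factored matrix-vector products.

```python
# Minimal sketch: SVD-based low-rank factorization of an LSTM weight matrix.
# Sizes and rank are assumed for illustration only.
import numpy as np

hidden, inp, rank = 1024, 1024, 128      # rank << hidden controls the speedup
W = np.random.randn(4 * hidden, inp)     # stacked LSTM gate weights (i, f, g, o)

# Truncated SVD: W ≈ (U[:, :rank] * s[:rank]) @ Vt[:rank, :]
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :rank] * s[:rank]               # shape (4*hidden, rank)
B = Vt[:rank, :]                         # shape (rank, inp)

x = np.random.randn(inp)
y_full = W @ x                           # original matvec: 4*hidden*inp multiplies
y_low = A @ (B @ x)                      # factored matvec: rank*(4*hidden + inp) multiplies

print("relative error:", np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full))
print("parameter ratio:", (A.size + B.size) / W.size)
```

With these assumed sizes, the factored form stores roughly 0.16× the parameters and performs proportionally fewer multiplies per inference step; a random matrix has no low-rank structure, so the approximation error shown here is pessimistic compared with trained LSTM weights.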


Metadata

Title: Fast LSTM by dynamic decomposition on cloud and distributed systems
Authors: Yang You, Yuxiong He, Samyam Rajbhandari, Wenhan Wang, Cho-Jui Hsieh, Kurt Keutzer, James Demmel
Publication date: 19-07-2020
Publisher: Springer London
Published in: Knowledge and Information Systems, Issue 11/2020
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI: https://doi.org/10.1007/s10115-020-01487-8
