
A comparative performance analysis of different activation functions in LSTM networks for classification

  • Original Article
  • Neural Computing and Applications

Abstract

In recurrent neural networks such as the long short-term memory (LSTM), the sigmoid and hyperbolic tangent functions are commonly used as activation functions in the network units. Other activation functions developed for neural networks have not been thoroughly analyzed in LSTMs. While many researchers have adopted LSTM networks for classification tasks, no comprehensive study is available on the choice of activation functions for the gates in these networks. In this paper, we compare 23 different activation functions in a basic LSTM network with a single hidden layer. The performance of different activation functions and different numbers of LSTM blocks in the hidden layer is analyzed for classification of records in the IMDB, Movie Review, and MNIST data sets. The quantitative results on all data sets demonstrate that the lowest average error is achieved with the Elliott activation function and its modifications. In particular, this family of functions yields better results than the sigmoid activation function, which is commonly used in LSTM networks.
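The Elliott function contrasted with the sigmoid in the abstract is not defined on this page. As a rough illustration only, the sketch below assumes the standard Elliott form f(x) = x / (1 + |x|) and a hypothetical (0, 1)-rescaled variant (elliott_gate) that could stand in for the sigmoid on the LSTM gates; the helper names (lstm_step, W, U, b) are illustrative and not taken from the paper.

```python
# Minimal sketch (not the authors' code) of swapping the gate activation in an LSTM cell.
import numpy as np

def sigmoid(x):
    """Standard logistic gate activation, range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def elliott(x):
    """Elliott activation, range (-1, 1); a tanh-like saturating curve without exponentials."""
    return x / (1.0 + np.abs(x))

def elliott_gate(x):
    """Elliott rescaled to (0, 1) so it can replace the sigmoid on the gates
    (one plausible 'modification'; the paper's exact variants may differ)."""
    return 0.5 * x / (1.0 + np.abs(x)) + 0.5

def lstm_step(x_t, h_prev, c_prev, W, U, b, gate_act=elliott_gate, cell_act=np.tanh):
    """One LSTM step with a pluggable gate activation.
    W: (4h, d) input weights, U: (4h, h) recurrent weights, b: (4h,) bias."""
    z = W @ x_t + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o = gate_act(i), gate_act(f), gate_act(o)   # input, forget, output gates
    c_t = f * c_prev + i * cell_act(g)                # cell state update
    h_t = o * cell_act(c_t)                           # hidden state
    return h_t, c_t
```

Swapping gate_act between sigmoid and elliott_gate is, in miniature, the kind of substitution the paper evaluates across 23 activation functions.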


Notes

  1. Movie Review data set: http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz

  2. IMDB data set: http://ai.stanford.edu/amaas/data/sentiment/

  3. MNIST data set: http://yann.lecun.com/exdb/mnist/


Author information

Corresponding author

Correspondence to Hoda Mashayekhi.

Ethics declarations

Conflict of interest

We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.

Appendix

See Tables 9, 10 and 11.

Table 9 Average test error values for each activation function for the Movie Review data set
Table 10 Average test error values for each activation function for the IMDB data set
Table 11 Average error values for each activation function for the MNIST data set

About this article

Cite this article

Farzad, A., Mashayekhi, H. & Hassanpour, H. A comparative performance analysis of different activation functions in LSTM networks for classification. Neural Comput & Applic 31, 2507–2521 (2019). https://doi.org/10.1007/s00521-017-3210-6
