
On the role of multimodal learning in the recognition of sign language

Published in: Multimedia Tools and Applications

Abstract

Sign Language Recognition (SLR) has become one of the most important research areas in the field of human-computer interaction. SLR systems aim to automatically translate sign language into text or speech, in order to reduce the communication gap between deaf and hearing people. The aim of this paper is to exploit multimodal learning techniques for accurate SLR, making use of data provided by Kinect and Leap Motion. In this regard, single-modality approaches as well as different multimodal methods, mainly based on convolutional neural networks, are proposed. Our main contribution is a novel multimodal end-to-end neural network that explicitly models private feature representations, which are specific to each modality, and shared feature representations, which are similar between modalities. By imposing such regularization in the learning process, the underlying idea is to increase the discriminative ability of the learned features and, hence, improve the generalization capability of the model. Experimental results demonstrate that multimodal learning yields an overall improvement in sign recognition performance. In particular, the novel neural network architecture outperforms the current state-of-the-art methods on the SLR task.
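To make the private/shared factorization concrete, below is a minimal PyTorch sketch of one way such a network can be regularized: one private encoder per modality, a pair of shared encoders whose outputs are pulled together, and an orthogonality penalty that pushes private and shared features apart. This is an illustrative sketch only; all layer sizes, loss weights, and names are assumptions, and simple linear encoders stand in for the convolutional ones the paper describes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrivateSharedNet(nn.Module):
    """Illustrative two-stream network with private and shared encoders."""

    def __init__(self, dim_kinect, dim_leap, feat_dim, n_classes):
        super().__init__()
        # Private encoders capture modality-specific structure.
        self.private_kinect = nn.Sequential(nn.Linear(dim_kinect, feat_dim), nn.ReLU())
        self.private_leap = nn.Sequential(nn.Linear(dim_leap, feat_dim), nn.ReLU())
        # Shared encoders map both modalities into a common feature space.
        self.shared_kinect = nn.Sequential(nn.Linear(dim_kinect, feat_dim), nn.ReLU())
        self.shared_leap = nn.Sequential(nn.Linear(dim_leap, feat_dim), nn.ReLU())
        # The classifier sees the concatenation of all four representations.
        self.classifier = nn.Linear(4 * feat_dim, n_classes)

    def forward(self, x_kinect, x_leap):
        p_k = self.private_kinect(x_kinect)
        p_l = self.private_leap(x_leap)
        s_k = self.shared_kinect(x_kinect)
        s_l = self.shared_leap(x_leap)
        logits = self.classifier(torch.cat([p_k, p_l, s_k, s_l], dim=1))
        return logits, (p_k, p_l, s_k, s_l)

def total_loss(logits, labels, feats, alpha=0.1, beta=0.1):
    p_k, p_l, s_k, s_l = feats
    # Similarity term: shared codes of the two modalities should agree.
    sim = F.mse_loss(s_k, s_l)
    # Difference term: private and shared codes should be near-orthogonal.
    diff = (s_k.t() @ p_k).pow(2).mean() + (s_l.t() @ p_l).pow(2).mean()
    return F.cross_entropy(logits, labels) + alpha * sim + beta * diff
```

A training step would compute `logits, feats = model(x_kinect, x_leap)` and backpropagate `total_loss`; the two auxiliary terms play the role of the regularization described in the abstract, aligning the shared codes across modalities while keeping the private codes complementary rather than redundant.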




Acknowledgements

This work was funded by the Project “NanoSTIMA: Macro-to-Nano Human Sensing: Towards Integrated Multimodal Health Monitoring and Analytics/NORTE-01-0145-FEDER-000016”, financed by the North Portugal Regional Operational Programme (NORTE 2020) under the PORTUGAL 2020 Partnership Agreement and through the European Regional Development Fund (ERDF), and also by Fundação para a Ciência e a Tecnologia (FCT) within the PhD and BPD grants SFRH/BD/102177/2014 and SFRH/BPD/101439/2014.

Author information


Correspondence to Pedro M. Ferreira.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Ferreira, P.M., Cardoso, J.S. & Rebelo, A. On the role of multimodal learning in the recognition of sign language. Multimed Tools Appl 78, 10035–10056 (2019). https://doi.org/10.1007/s11042-018-6565-5

