Skip to main content
Top
Published in:
Cover of the book

2017 | OriginalPaper | Chapter

Low-Resource Speech Recognition and Keyword-Spotting

Authors : Mark J. F. Gales, Kate M. Knill, Anton Ragni

Published in: Speech and Computer

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

The IARPA Babel program ran from March 2012 to November 2016. The aim of the program was to develop agile and robust speech technology that can be rapidly applied to any human language in order to provide effective search capability on large quantities of real world data. This paper will describe some of the developments in speech recognition and keyword-spotting during the lifetime of the project. Two technical areas will be briefly discussed with a focus on techniques developed at Cambridge University: the application of deep learning for low-resource speech recognition; and efficient approaches for keyword spotting. Finally a brief analysis of the Babel speech language characteristics and language performance will be presented.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Footnotes
1
For a complete movie of the activation functions for stimulated training see: http://​mi.​eng.​cam.​ac.​uk/​~mjfg/​bneStimu.​avi.
 
2
All markers such as accents are stripped from the grapheme to yield the root grapheme. Thus Latin scripts have 26 graphemes. These accuracies include silence at the beginning and end of sentences, and between all words.
 
Literature
2.
go back to reference Beyerlein, P.: Discriminative model combination. In: Proceedings of ASRU (1997) Beyerlein, P.: Discriminative model combination. In: Proceedings of ASRU (1997)
3.
go back to reference Chen, X., Ragni, A., Liu, X., Vasilakes, J., Knill, K.M., Gales, M.J.: Recurrent neural network language models for keyword search. In: ICASSP (2017) Chen, X., Ragni, A., Liu, X., Vasilakes, J., Knill, K.M., Gales, M.J.: Recurrent neural network language models for keyword search. In: ICASSP (2017)
4.
go back to reference Cui, J., Kingsbury, B., Ramabhadran, B., Sethy, A., Audhkhasi, K., Cui, X., Kislal, E., Mangu, L., Nussbaum-Thom, M., Picheny, M., et al.: Multilingual representations for low resource speech recognition and keyword search. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 259–266. IEEE (2015) Cui, J., Kingsbury, B., Ramabhadran, B., Sethy, A., Audhkhasi, K., Cui, X., Kislal, E., Mangu, L., Nussbaum-Thom, M., Picheny, M., et al.: Multilingual representations for low resource speech recognition and keyword search. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 259–266. IEEE (2015)
5.
go back to reference Cui, X., Goel, V., Kingsbury, B.: Data augmentation for deep neural network acoustic modeling. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 23(9), 1469–1477 (2015)CrossRef Cui, X., Goel, V., Kingsbury, B.: Data augmentation for deep neural network acoustic modeling. IEEE/ACM Trans. Audio Speech Lang. Process. (TASLP) 23(9), 1469–1477 (2015)CrossRef
6.
go back to reference Evermann, G., Woodland, P.C.: Large vocabulary decoding and confidence estimation using word posterior probabilities. In: Proceedings of ICASSP (2000) Evermann, G., Woodland, P.C.: Large vocabulary decoding and confidence estimation using word posterior probabilities. In: Proceedings of ICASSP (2000)
7.
go back to reference Evermann, G., Woodland, P.: Posterior probability decoding, confidence estimation and system combination. In: Proceedings of Speech Transcription Workshop, vol. 27, p. 78, Baltimore (2000) Evermann, G., Woodland, P.: Posterior probability decoding, confidence estimation and system combination. In: Proceedings of Speech Transcription Workshop, vol. 27, p. 78, Baltimore (2000)
8.
go back to reference Fiscus, J.G.: A post-processing system to yield reduced word error rates: recogniser output voting error reduction (ROVER). In: Proceedings of ASRU (1997) Fiscus, J.G.: A post-processing system to yield reduced word error rates: recogniser output voting error reduction (ROVER). In: Proceedings of ASRU (1997)
9.
go back to reference Fiscus, J.G., et al.: Results of the 2006 spoken term detection evaluation. In: Proceedings of ACM SIGIR Workshop on Searching Spontaneous Conversational Speech (2007) Fiscus, J.G., et al.: Results of the 2006 spoken term detection evaluation. In: Proceedings of ACM SIGIR Workshop on Searching Spontaneous Conversational Speech (2007)
10.
go back to reference Gales, M.J., Knill, K.M., Ragni, A.: Unicode-based graphemic systems for limited resource languages. In: ICASSP, pp. 5186–5190. IEEE (2015) Gales, M.J., Knill, K.M., Ragni, A.: Unicode-based graphemic systems for limited resource languages. In: ICASSP, pp. 5186–5190. IEEE (2015)
11.
go back to reference Gales, M.J., Knill, K.M., Ragni, A., Rath, S.P.: Speech recognition and keyword spotting for low-resource languages: Babel project research at cued. In: SLTU. pp. 16–23 (2014) Gales, M.J., Knill, K.M., Ragni, A., Rath, S.P.: Speech recognition and keyword spotting for low-resource languages: Babel project research at cued. In: SLTU. pp. 16–23 (2014)
12.
go back to reference Grezl, F., Karafiat, M., Janda, M.: Study of probabilistic and bottle-neck features in multilingual environment. In: Proceedings of ASRU (2011) Grezl, F., Karafiat, M., Janda, M.: Study of probabilistic and bottle-neck features in multilingual environment. In: Proceedings of ASRU (2011)
14.
go back to reference Hartmann, W., Ng, T., Hsiao, R., Tsakalidis, S., Schwartz, R.: Two-stage data augmentation for low-resourced speech recognition. In: Interspeech 2016, pp. 2378–2382 (2016) Hartmann, W., Ng, T., Hsiao, R., Tsakalidis, S., Schwartz, R.: Two-stage data augmentation for low-resourced speech recognition. In: Interspeech 2016, pp. 2378–2382 (2016)
15.
go back to reference Hermansky, H., Ellis, D., Sharma, S.: Tandem connectionist feature extraction for conventional HMM systems. In: Proceedings of ICASSP (2000) Hermansky, H., Ellis, D., Sharma, S.: Tandem connectionist feature extraction for conventional HMM systems. In: Proceedings of ICASSP (2000)
16.
go back to reference Hinton, G., Deng, L., et al.: Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Process. Mag. 29(6), 82–97 (2012)CrossRef Hinton, G., Deng, L., et al.: Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Process. Mag. 29(6), 82–97 (2012)CrossRef
17.
go back to reference Kanthak, S., Ney, H.: Context-dependent acoustic modelling using graphemes for large-vocabulary speech recognition. In: Proceedings of ICASSP (2002) Kanthak, S., Ney, H.: Context-dependent acoustic modelling using graphemes for large-vocabulary speech recognition. In: Proceedings of ICASSP (2002)
18.
go back to reference Killer, M., Stüker, S., Schultz, T.: Grapheme based speech recognition. In: Proceedings of EUROSPEECH (2003) Killer, M., Stüker, S., Schultz, T.: Grapheme based speech recognition. In: Proceedings of EUROSPEECH (2003)
19.
go back to reference Mamou, J., Cui, J., Cui, X., Gales, M.J., Kingsbury, B., Knill, K., Mangu, L., Nolden, D., Picheny, M., Ramabhadran, B., et al.: System combination and score normalization for spoken term detection. In: ICASSP, pp. 8272–8276. IEEE (2013) Mamou, J., Cui, J., Cui, X., Gales, M.J., Kingsbury, B., Knill, K., Mangu, L., Nolden, D., Picheny, M., Ramabhadran, B., et al.: System combination and score normalization for spoken term detection. In: ICASSP, pp. 8272–8276. IEEE (2013)
20.
go back to reference Mendels, G., Cooper, E., Soto, V., Hirschberg, J., Gales, M.J., Knill, K.M., Ragni, A., Wang, H.: Improving speech recognition and keyword search for low resource languages using web data. In: INTERSPEECH, pp. 829–833 (2015) Mendels, G., Cooper, E., Soto, V., Hirschberg, J., Gales, M.J., Knill, K.M., Ragni, A., Wang, H.: Improving speech recognition and keyword search for low resource languages using web data. In: INTERSPEECH, pp. 829–833 (2015)
21.
go back to reference Mikolov, T., Karafiát, M., Burget, L., Cernockỳ, J., Khudanpur, S.: Recurrent neural network based language model. In: Interspeech, vol. 2, p. 3 (2010) Mikolov, T., Karafiát, M., Burget, L., Cernockỳ, J., Khudanpur, S.: Recurrent neural network based language model. In: Interspeech, vol. 2, p. 3 (2010)
22.
go back to reference Miller, D.R.H., Kleber, M., et al.: Rapid and accurate spoken term detection. In: Proceedings of Interspeech (2007) Miller, D.R.H., Kleber, M., et al.: Rapid and accurate spoken term detection. In: Proceedings of Interspeech (2007)
23.
go back to reference Ragni, A., Knill, K.M., Rath, S.P., Gales, M.J.F.: Data augmentation for low resource languages. In: Proceedings of InterSpeech (2014) Ragni, A., Knill, K.M., Rath, S.P., Gales, M.J.F.: Data augmentation for low resource languages. In: Proceedings of InterSpeech (2014)
24.
go back to reference Ragni, A., Dakin, E., Chen, X., Gales, M.J., Knill, K.M.: Multi-language neural network language models. Interspeech 8, 3042–3046 (2016)CrossRef Ragni, A., Dakin, E., Chen, X., Gales, M.J., Knill, K.M.: Multi-language neural network language models. Interspeech 8, 3042–3046 (2016)CrossRef
25.
go back to reference Ragni, A., Wu, C., Gales, M.J., Vasilakes, J., Knill, K.M.: Stimulated training for automatic speech recognition and keyword search in limited resource conditions. In: ICASSP (2017) Ragni, A., Wu, C., Gales, M.J., Vasilakes, J., Knill, K.M.: Stimulated training for automatic speech recognition and keyword search in limited resource conditions. In: ICASSP (2017)
26.
go back to reference Rath, S.P., Knill, K.M., Ragni, A., Gales, M.J.: Combining tandem and hybrid systems for improved speech recognition and keyword spotting on low resource languages. In: INTERSPEECH, pp. 835–839 (2014) Rath, S.P., Knill, K.M., Ragni, A., Gales, M.J.: Combining tandem and hybrid systems for improved speech recognition and keyword spotting on low resource languages. In: INTERSPEECH, pp. 835–839 (2014)
27.
go back to reference Swietojanski, P., Ghoshal, A., Renals, S.: Revisiting hybrid and gmm-hmm system combination techniques. In: ICASSP, pp. 6744–6748. IEEE (2013) Swietojanski, P., Ghoshal, A., Renals, S.: Revisiting hybrid and gmm-hmm system combination techniques. In: ICASSP, pp. 6744–6748. IEEE (2013)
28.
go back to reference Szoke, I., Burget, L., Cernocky, J., Fapso, M.: Sub-word modeling of out of vocabulary words in spoken term detection. In: Proceedings of SLT (2008) Szoke, I., Burget, L., Cernocky, J., Fapso, M.: Sub-word modeling of out of vocabulary words in spoken term detection. In: Proceedings of SLT (2008)
29.
go back to reference Tan, S., Sim, K.C., Gales, M.: Improving the interpretability of deep neural networks with stimulated learning. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 617–623. IEEE (2015) Tan, S., Sim, K.C., Gales, M.: Improving the interpretability of deep neural networks with stimulated learning. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 617–623. IEEE (2015)
30.
go back to reference Vergyri, D., Shafran, I., et al.: The SRI/OGI 2006 spoken term detection system. In: Proceedings of Interspeech (2007) Vergyri, D., Shafran, I., et al.: The SRI/OGI 2006 spoken term detection system. In: Proceedings of Interspeech (2007)
31.
go back to reference Wang, H., Ragni, A., Gales, M.J., Knill, K.M., Woodland, P.C., Zhang, C.: Joint decoding of tandem and hybrid systems for improved keyword spotting on low resource languages. In: INTERSPEECH, pp. 3660–3664 (2015) Wang, H., Ragni, A., Gales, M.J., Knill, K.M., Woodland, P.C., Zhang, C.: Joint decoding of tandem and hybrid systems for improved keyword spotting on low resource languages. In: INTERSPEECH, pp. 3660–3664 (2015)
32.
go back to reference Wu, C., Karanasou, P., Gales, M.J., Sim, K.C.: Stimulated deep neural network for speech recognition. In: Proceedings of Interspeech, pp. 400–404 (2016) Wu, C., Karanasou, P., Gales, M.J., Sim, K.C.: Stimulated deep neural network for speech recognition. In: Proceedings of Interspeech, pp. 400–404 (2016)
33.
go back to reference Yang, J., Zhang, C., Ragni, A., Gales, M.J., Woodland, P.C.: System combination with log-linear models. In: ICASSP, pp. 5675–5679. IEEE (2016) Yang, J., Zhang, C., Ragni, A., Gales, M.J., Woodland, P.C.: System combination with log-linear models. In: ICASSP, pp. 5675–5679. IEEE (2016)
34.
go back to reference Zhang, L., Karakos, D., Hartmann, W., Hsiao, R., Schwartz, R., Tsakalidis, S.: Enhancing low resource keyword spotting with automatically retrieved web documents. In: Interspeech (2015) Zhang, L., Karakos, D., Hartmann, W., Hsiao, R., Schwartz, R., Tsakalidis, S.: Enhancing low resource keyword spotting with automatically retrieved web documents. In: Interspeech (2015)
Metadata
Title
Low-Resource Speech Recognition and Keyword-Spotting
Authors
Mark J. F. Gales
Kate M. Knill
Anton Ragni
Copyright Year
2017
DOI
https://doi.org/10.1007/978-3-319-66429-3_1

Premium Partner