2017 | Original Paper | Book Chapter

Low-Resource Speech Recognition and Keyword-Spotting

Authors: Mark J. F. Gales, Kate M. Knill, Anton Ragni

Published in: Speech and Computer

Publisher: Springer International Publishing

Abstract

The IARPA Babel program ran from March 2012 to November 2016. The aim of the program was to develop agile and robust speech technology that can be rapidly applied to any human language in order to provide effective search capability over large quantities of real-world data. This paper describes some of the developments in speech recognition and keyword spotting during the lifetime of the project. Two technical areas are briefly discussed, with a focus on techniques developed at Cambridge University: the application of deep learning to low-resource speech recognition, and efficient approaches to keyword spotting. Finally, a brief analysis of the characteristics of the Babel languages and of per-language performance is presented.

Footnotes
1
For a complete movie of the activation functions for stimulated training see: http://mi.eng.cam.ac.uk/~mjfg/bneStimu.avi.
 
2
All markers, such as accents, are stripped from the grapheme to yield the root grapheme (see the sketch below); thus Latin scripts have 26 graphemes. These accuracies include silence at the beginning and end of sentences, and between all words.
 
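To make footnote 2 concrete, the following is a minimal sketch of one way the marker stripping could be done, assuming Unicode NFD decomposition is used to separate base characters from combining marks; the helper name root_grapheme is hypothetical, and the paper does not specify how this is actually implemented in the Babel/CUED pipeline.

```python
import unicodedata

def root_grapheme(char: str) -> str:
    """Strip markers such as accents from a grapheme to yield its root grapheme.

    Hypothetical helper illustrating footnote 2; not the authors' actual code.
    """
    # NFD decomposition separates the base character from its combining marks.
    decomposed = unicodedata.normalize("NFD", char)
    # Drop combining marks (Unicode category 'Mn') and keep the base character.
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn").lower()

# With this mapping, accented Latin letters collapse onto the 26 root graphemes.
print(root_grapheme("é"), root_grapheme("Ü"), root_grapheme("ç"))  # -> e u c
```

The lower-casing step is an additional assumption here, included only so that upper- and lower-case letters map onto the same 26 root graphemes for Latin scripts.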
Metadata
Title
Low-Resource Speech Recognition and Keyword-Spotting
Authors
Mark J. F. Gales
Kate M. Knill
Anton Ragni
Copyright Year
2017
DOI
https://doi.org/10.1007/978-3-319-66429-3_1