2015 | OriginalPaper | Book chapter

9. Feature Representation Learning in Deep Neural Networks

Authors: Dong Yu, Li Deng

Published in: Automatic Speech Recognition

Publisher: Springer London

Abstract

In this chapter, we show that deep neural networks (DNNs) jointly learn the feature representation and the classifier. Through many layers of nonlinear processing, DNNs transform the raw input features into a more invariant and discriminative representation that can be better classified by the log-linear model. In addition, DNNs learn a hierarchy of features. The lower-level features typically capture local patterns, which are very sensitive to changes in the raw features. The higher-level features, however, are built upon the low-level features and are more abstract and more invariant to variations in the raw features. We demonstrate that the learned high-level features are robust to speaker and environment variations.
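The split between learned features and a log-linear classifier can be illustrated with a minimal NumPy sketch. It is not the chapter's implementation: the layer sizes, sigmoid activations, random weights, and the helper names `hidden_features` and `log_linear_classifier` are all assumptions chosen for illustration. The hidden layers produce a hierarchy of features, and the softmax layer on top is log-linear in the last hidden activation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def hidden_features(x, weights, biases):
    """Pass x through the hidden layers and return each layer's activation.

    Entry 0 is the lowest-level feature, entry -1 the highest-level one.
    """
    activations = []
    h = x
    for W, b in zip(weights, biases):
        h = sigmoid(W @ h + b)
        activations.append(h)
    return activations


def log_linear_classifier(h, W_out, b_out):
    """Softmax layer: log p(y | h) = w_y . h + b_y - log(normalizer)."""
    scores = W_out @ h + b_out
    shifted = scores - scores.max()  # shift for numerical stability
    return shifted - np.log(np.exp(shifted).sum())


# Hypothetical sizes: a 40-dimensional input frame, three hidden layers, 10 classes.
rng = np.random.default_rng(0)
layer_sizes = [40, 64, 64, 64]
num_classes = 10
weights = [rng.normal(0.0, 0.1, size=(n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]
W_out = rng.normal(0.0, 0.1, size=(num_classes, layer_sizes[-1]))
b_out = np.zeros(num_classes)

x = rng.normal(size=layer_sizes[0])           # stand-in for a raw input feature vector
features = hidden_features(x, weights, biases)
log_probs = log_linear_classifier(features[-1], W_out, b_out)
```

In a trained network, the weights of all layers would be optimized jointly with the classifier. Here they are random, so the sketch shows only the structure, not the learned invariance.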

Footnotes
1
Good raw features still help, though, since existing DNN learning algorithms may produce an underperforming system even if only a linear transformation, such as the discrete cosine transform (DCT), is applied to the log filter-bank features.
 
2
This behavior can be alleviated by dynamically adding small random noise to each training sample during training.
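A minimal sketch of such dynamic noise injection follows, assuming zero-mean Gaussian noise and a simple shuffled minibatch loop. The function name `noisy_minibatches`, the noise level, and the toy data are hypothetical and not taken from the chapter.

```python
import numpy as np


def noisy_minibatches(features, labels, batch_size, noise_std, rng):
    """Yield shuffled minibatches, drawing fresh noise every time a batch is built.

    The stored features are never modified, so each epoch sees a
    differently perturbed copy of every training sample.
    """
    order = rng.permutation(len(features))
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        noise = rng.normal(0.0, noise_std, size=features[idx].shape)
        yield features[idx] + noise, labels[idx]


# Toy data: 8 samples with 4 features each (illustrative values only).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))
y = np.arange(8)

epoch1 = list(noisy_minibatches(X, y, batch_size=4, noise_std=0.01, rng=rng))
epoch2 = list(noisy_minibatches(X, y, batch_size=4, noise_std=0.01, rng=rng))
```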
 
3
Huang et al. [8] also tried other definitions, such as the number of vowels per second and the speaking rate normalized by the average duration of different phonemes. They reported that the WER pattern is very similar no matter which definition is used.
 
References
1.
Andreou, A., Kamm, T., Cohen, J.: Experiments in vocal tract normalization. In: Proceedings of the CAIP Workshop: Frontiers in Speech Recognition II (1994)
2.
Davis, S., Mermelstein, P.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust., Speech Signal Process. 28(4), 357–366 (1980)
3.
Flego, F., Gales, M.J.: Discriminative adaptive training with VTS and JUD. In: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 170–175 (2009)
4.
Gales, M.J., Woodland, P.: Mean and variance adaptation within the MLLR framework. Comput. Speech Lang. 10(4), 249–264 (1996)
5.
Gales, M.J.: Maximum likelihood linear transformations for HMM-based speech recognition. Comput. Speech Lang. 12(2), 75–98 (1998)
6.
Gunawardana, A., Mahajan, M., Acero, A., Platt, J.C.: Hidden conditional random fields for phone classification. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 1117–1120 (2005)
7.
Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)
8.
Huang, Y., Yu, D., Liu, C., Gong, Y.: A comparative analytic study on the Gaussian mixture and context dependent deep neural network hidden Markov models. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH) (2014)
10.
Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multi-stage architecture for object recognition? In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2146–2153 (2009)
11.
Kalinli, O., Seltzer, M.L., Droppo, J., Acero, A.: Noise adaptive training for robust automatic speech recognition. IEEE Trans. Audio, Speech Lang. Process. 18(8), 1889–1901 (2010)
12.
Kim, D.Y., Kwan Un, C., Kim, N.S.: Speech recognition in noisy environments using first-order vector Taylor series. Speech Commun. 24(1), 39–49 (1998)
13.
Kumar, N., Andreou, A.G.: Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Commun. 26(4), 283–297 (1998)
14.
Li, J., Deng, L., Yu, D., Gong, Y., Acero, A.: High-performance HMM adaptation with joint compensation of additive and convolutive distortions via vector Taylor series. In: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 65–70 (2007)
15.
Li, J., Deng, L., Yu, D., Gong, Y., Acero, A.: HMM adaptation using a phase-sensitive acoustic distortion model for environment-robust speech recognition. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4069–4072 (2008)
16.
Li, J., Yu, D., Huang, J.T., Gong, Y.: Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM. In: Proceedings of the IEEE Spoken Language Technology Workshop (SLT), pp. 131–136 (2012)
17.
Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1150–1157 (1999)
18.
Mohamed, A.R., Hinton, G., Penn, G.: Understanding how deep belief networks perform acoustic modelling. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4273–4276 (2012)
19.
Moreno, P.J., Raj, B., Stern, R.M.: A vector Taylor series approach for environment-independent speech recognition. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, pp. 733–736 (1996)
20.
Parihar, N., Picone, J.: Aurora working group: DSR front end LVCSR evaluation AU/384/02. Institute for Signal and Information Processing, Mississippi State University, Technical Report (2002)
21.
Povey, D., Kanevsky, D., Kingsbury, B., Ramabhadran, B., Saon, G., Visweswariah, K.: Boosted MMI for model and feature-space discriminative training. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4057–4060 (2008)
22.
Povey, D., Kingsbury, B., Mangu, L., Saon, G., Soltau, H., Zweig, G.: fMPE: Discriminatively trained features for speech recognition. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 961–964 (2005)
23.
Povey, D., Woodland, P.C.: Minimum phone error and I-smoothing for improved discriminative training. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. I–105 (2002)
24.
Ragni, A., Gales, M.: Derivative kernels for noise robust ASR. In: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 119–124 (2011)
25.
Ratnaparkhi, A.: A simple introduction to maximum entropy models for natural language processing. IRCS Technical Reports Series, p. 81 (1997)
26.
Sainath, T.N., Kingsbury, B., Mohamed, A.R., Ramabhadran, B.: Learning filter banks within a deep neural network framework. In: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) (2013)
27.
Seide, F., Li, G., Chen, X., Yu, D.: Feature engineering in context-dependent deep neural networks for conversational speech transcription. In: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 24–29 (2011)
28.
Seltzer, M., Yu, D., Wang, Y.: An investigation of deep neural networks for noise robust speech recognition. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2013)
29.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013)
30.
Wang, Y., Gales, M.J.: Speaker and noise factorization for robust speech recognition. IEEE Trans. Audio, Speech Lang. Process. 20(7), 2149–2158 (2012)
31.
Yu, D., Deng, L., Acero, A.: Hidden conditional random field with distribution constraints for phone classification. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 676–679 (2009)
32.
Yu, D., Ju, Y.C., Wang, Y.Y., Zweig, G., Acero, A.: Automated directory assistance system - from theory to practice. In: Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 2709–2712 (2007)
33.
Yu, D., Seltzer, M.L., Li, J., Huang, J.T., Seide, F.: Feature learning in deep neural networks - studies on speech recognition tasks. In: Proceedings of the ICLR (2013)
34.
Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional neural networks. arXiv preprint arXiv:1311.2901 (2013)
Metadata
Title
Feature Representation Learning in Deep Neural Networks
Authors
Dong Yu
Li Deng
Copyright year
2015
Publisher
Springer London
DOI
https://doi.org/10.1007/978-1-4471-5779-3_9