Published in: International Journal of Speech Technology 3/2018

07.12.2017

Improvement of phone recognition accuracy using speech mode classification

Authors: Kumud Tripathi, K. Sreenivasa Rao


Abstract

In this work, we have developed a speech mode classification (SMC) model for improving the performance of a phone recognition system (PRS). We explore vocal tract system, excitation source, and prosodic features for developing the SMC model. These features are extracted from the voiced regions of a speech signal. In this study, conversation, extempore, and read speech are considered as three different modes of speech. The vocal tract component of speech is captured using Mel-frequency cepstral coefficients (MFCCs). The excitation source features are captured through Mel power differences of spectrum in sub-bands (MPDSS) and residual Mel-frequency cepstral coefficients (RMFCCs) of the speech signal. The prosodic information is extracted from pitch and intensity. Speech mode classification models are developed using the above features independently and in fusion. Experiments were carried out on a Bengali speech corpus to analyze the accuracy of the SMC model using artificial neural networks (ANNs), naive Bayes, support vector machines (SVMs), and k-nearest neighbors (KNN). The four classification models are combined using a maximum voting approach for optimal performance. From the results, it is observed that the SMC model developed using the fusion of vocal tract system, excitation source, and prosodic features yields the best performance of 98%. Finally, the proposed speech mode classifier is integrated with the PRS, and the accuracy of the phone recognition system is observed to improve by 11.08%.
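The pipeline described in the abstract translates naturally into a short script. Below is a minimal, illustrative sketch, not the authors' implementation: it assumes librosa and scikit-learn are available, omits the MPDSS excitation feature (no standard library implementation), approximates voiced-region restriction via pYIN's unvoiced-frame marking, and assumes utterance-level mean/std summarization of frame-level features, which the abstract does not specify.

```python
import numpy as np
import librosa
import scipy.signal
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier


def utterance_features(path, sr=16000, n_mfcc=13, lpc_order=12):
    """One fixed-length vector per utterance: vocal tract (MFCC),
    excitation source (LP-residual MFCC), and prosody (pitch, intensity)."""
    y, _ = librosa.load(path, sr=sr)

    # Vocal tract system: standard MFCCs.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # Excitation source: MFCCs of the LP residual (RMFCC). Whole-utterance
    # LPC is a simplification; frame-wise LPC is the usual practice.
    a = librosa.lpc(y, order=lpc_order)           # inverse-filter coefficients
    residual = scipy.signal.lfilter(a, [1.0], y)  # LP residual signal
    rmfcc = librosa.feature.mfcc(y=residual, sr=sr, n_mfcc=n_mfcc)

    # Prosody: pitch from voiced frames only (pYIN returns NaN for unvoiced
    # frames, loosely matching the voiced-region restriction), and intensity
    # approximated by frame-wise RMS energy.
    f0, _, _ = librosa.pyin(y, fmin=50, fmax=400, sr=sr)
    f0 = f0[np.isfinite(f0)]
    if f0.size == 0:          # guard against a fully unvoiced input
        f0 = np.zeros(1)
    rms = librosa.feature.rms(y=y)

    # Summarize each feature stream by mean and std over frames.
    def stats(x):
        x = np.atleast_2d(x)
        return np.concatenate([x.mean(axis=1), x.std(axis=1)])

    return np.concatenate([stats(mfcc), stats(rmfcc), stats(f0), stats(rms)])


# The four classifiers named in the abstract, fused by maximum (hard) voting.
ensemble = VotingClassifier(
    estimators=[
        ("ann", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)),
        ("nb", GaussianNB()),
        ("svm", SVC(kernel="rbf")),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ],
    voting="hard",
)

# Usage: X = np.stack([utterance_features(p) for p in wav_paths]),
# labels drawn from {"read", "extempore", "conversation"}, then
# ensemble.fit(X_train, y_train) and ensemble.predict(X_test).
```

Here `voting="hard"` is scikit-learn's majority-vote decision fusion, which corresponds to the maximum voting combination of the four classifiers described in the abstract.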


Metadata
Title
Improvement of phone recognition accuracy using speech mode classification
Authors
Kumud Tripathi
K. Sreenivasa Rao
Publication date
07.12.2017
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 3/2018
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-017-9483-4
