Published in: International Journal of Speech Technology 4/2017

19.09.2017

Pitch segmentation of speech signals based on short-time energy waveform

Abstract

In general, speech consists of quasi-repetitive patterns called pitches, which represent the fundamental period of speech and the tonal information of the voice. Extracting pitch information, which is crucial for many speech processing techniques, is usually hampered by noise and by interference from high-order harmonic components. This paper introduces a novel, noise-robust method for determining the fundamental frequency of speech and for pitch segmentation, based on a short-time energy waveform (SEW), defined as the moving average of the squared signal. When a moving average filter with a window size close to the fundamental period is applied, nearly repetitive patterns with fewer ripples, synchronized with the actual pitches, can be clearly observed in the SEW. The DC component of the SEW is removed using morphological top-hat and bottom-hat transforms. The fundamental frequency is determined as the frequency corresponding to the largest peak of the power spectrum of the DC-removed SEW. Finally, a time-domain window search is performed to locate the local extrema associated with pitches. Compared to traditional pitch detection techniques, the proposed technique yields pitch segmentation results with a higher rate of accuracy and greater noise robustness.
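
The four steps named in the abstract (SEW, morphological DC removal, spectral peak, window search) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the flat structuring element, the combination "top-hat minus bottom-hat", the 50 to 500 Hz search band, and the 20% search tolerance are all assumptions made for illustration. The test signal is synthetic, and the window is set to about 0.7 of the period because a moving average exactly one period long would flatten a perfectly periodic signal to a constant.

```python
import math

def short_time_energy(x, win):
    """SEW: centred moving average (odd length `win`) of the squared signal."""
    prefix = [0.0]
    for v in x:
        prefix.append(prefix[-1] + v * v)      # running sum of squares
    half, n = win // 2, len(x)
    return [(prefix[min(n, i + half + 1)] - prefix[max(0, i - half)]) / win
            for i in range(n)]

def _sliding(x, size, pick):
    """Apply `pick` (min or max) over a centred window of `size` samples."""
    half = size // 2
    return [pick(x[max(0, i - half):i + half + 1]) for i in range(len(x))]

def remove_baseline(x, size):
    """Top-hat minus bottom-hat with a flat structuring element (assumed form)."""
    opened = _sliding(_sliding(x, size, min), size, max)   # erosion, then dilation
    closed = _sliding(_sliding(x, size, max), size, min)   # dilation, then erosion
    return [(v - o) - (c - v) for v, o, c in zip(x, opened, closed)]

def dominant_frequency(y, fs, fmin=50.0, fmax=500.0):
    """Frequency (Hz) of the largest power-spectrum peak within [fmin, fmax]."""
    n = len(y)
    best_f, best_p = 0.0, -1.0
    for k in range(max(1, math.ceil(fmin * n / fs)), int(fmax * n / fs) + 1):
        w = 2.0 * math.pi * k / n
        re = sum(v * math.cos(w * i) for i, v in enumerate(y))
        im = sum(v * math.sin(w * i) for i, v in enumerate(y))
        power = re * re + im * im              # squared magnitude of DFT bin k
        if power > best_p:
            best_f, best_p = k * fs / n, power
    return best_f

def locate_pitch_marks(y, period, tol=0.2):
    """Window search: pick one local maximum per expected pitch period."""
    slack = max(1, int(period * tol))
    pos = max(range(min(period, len(y))), key=y.__getitem__)   # first mark
    marks = [pos]
    while pos + period + slack < len(y):
        lo, hi = pos + period - slack, pos + period + slack + 1
        pos = max(range(lo, hi), key=y.__getitem__)
        marks.append(pos)
    return marks

# Synthetic voiced-like signal (an assumption, not data from the paper):
# a decaying 800 Hz burst restarted every 80 samples, i.e. 100 Hz at fs = 8 kHz.
fs, period_true = 8000, 80
x = [math.exp(-(n % period_true) / 15.0) * math.sin(0.2 * math.pi * (n % period_true))
     for n in range(1600)]

sew = short_time_energy(x, win=57)              # window of about 0.7 periods
flat = remove_baseline(sew, size=121)           # structuring element of about 1.5 periods
f0 = dominant_frequency(flat, fs)
marks = locate_pitch_marks(flat, period=round(fs / f0))
print(f0, marks[:5])
```

The structuring element is chosen longer than one pitch period so that the opening and closing follow the lower and upper envelopes of the SEW instead of its per-period peaks; subtracting them leaves the periodic pattern and discards the slowly varying offset.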

Metadata
Title
Pitch segmentation of speech signals based on short-time energy waveform
Publication date
19.09.2017
Published in
International Journal of Speech Technology / Issue 4/2017
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-017-9459-4
