Published in: International Journal of Speech Technology 1/2019

12.12.2018

Enhancement of esophageal speech obtained by a voice conversion technique using time dilated Fourier cepstra

Authors: Imen Ben Othmane, Joseph Di Martino, Kaïs Ouni


Abstract

This paper presents a novel speaking-aid system for enhancing esophageal speech (ES). The proposed method improves the quality of esophageal speech by combining a voice conversion technique with a time dilation algorithm. In the proposed system, a deep neural network (DNN) serves as a nonlinear mapping function for vocal tract vector transformation. The converted frames are then used to determine realistic excitation and phase vectors from the target training space by means of a frame selection algorithm. Next, to preserve the identity of the esophageal speaker, we retain the source vocal tract features and apply a time dilation algorithm to them in order to reduce the unpleasant esophageal noises. Finally, the converted speech is reconstructed from the dilated source vocal tract frames and the predicted excitation and phase. DNN- and Gaussian mixture model (GMM)-based voice conversion systems were evaluated using objective and subjective measures; these evaluations also assess the changes in speech quality and intelligibility of the transformed signals. Experimental results demonstrate that the proposed methods provide considerable improvement in the intelligibility and naturalness of the converted esophageal speech.
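The frame selection step described above (picking realistic excitation/phase vectors from the target training space for each converted vocal-tract frame) can be sketched as a nearest-neighbor lookup. This is a minimal illustrative sketch, not the authors' implementation: the function and variable names are hypothetical, and a plain Euclidean distance over cepstral vectors is assumed.

```python
import numpy as np

def select_excitation_frames(converted_cepstra, target_cepstra, target_excitations):
    """For each converted vocal-tract frame, return the excitation vector of the
    nearest target training frame (Euclidean distance in cepstral space)."""
    # Pairwise squared distances: (n_converted, n_target) via broadcasting.
    d = ((converted_cepstra[:, None, :] - target_cepstra[None, :, :]) ** 2).sum(axis=2)
    nearest = d.argmin(axis=1)          # index of closest target frame per converted frame
    return target_excitations[nearest]  # realistic excitation vectors from the target space

# Toy usage: two converted frames matched against three target training frames.
converted = np.array([[0.0, 0.0], [1.0, 1.0]])
targets = np.array([[0.1, 0.0], [0.9, 1.1], [5.0, 5.0]])
excitations = np.array([[10.0], [20.0], [30.0]])
selected = select_excitation_frames(converted, targets, excitations)
```

In practice, approximate nearest-neighbor structures (e.g. k-d trees, as in the cited Arya work) would replace the brute-force distance matrix for large training spaces.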


Metadata
Title
Enhancement of esophageal speech obtained by a voice conversion technique using time dilated Fourier cepstra
Authors
Imen Ben Othmane
Joseph Di Martino
Kaïs Ouni
Publication date
12.12.2018
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 1/2019
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-018-09579-1
