2017 | OriginalPaper | Book chapter

Analysis of Features and Metrics for Alignment in Text-Dependent Voice Conversion

Authors: Nirmesh J. Shah, Hemant A. Patil

Published in: Pattern Recognition and Machine Intelligence

Publisher: Springer International Publishing

Abstract

Voice Conversion (VC) is a technique that converts the perceived speaker identity from a source speaker to a target speaker. In text-dependent VC, given a parallel training speech database of the source and target speakers, the first task is to align the source and target speakers' spectral features at the frame level before learning the mapping function. The accuracy of this alignment affects the learning of the mapping function and hence the quality of the converted voice. The impact of alignment has not been explored much in the VC literature. Most alignment techniques align acoustic features (namely, spectral features such as Mel Cepstral Coefficients (MCC)). However, spectral features represent both speaker-specific and speech-specific information. In this paper, we analyze the use of different speaker-independent features for the alignment task, namely unsupervised posterior features: Gaussian Mixture Model (GMM)-based posterior features and GMM-UBM-based posterior features, i.e., those from a GMM obtained by Maximum A Posteriori (MAP) adaptation of a Universal Background Model (UBM). In addition, we propose to use different metrics, such as symmetric Kullback-Leibler (KL) and cosine distances, instead of the Euclidean distance for the alignment. Our analysis, based on % Phone Accuracy (PA), correlates with the subjective scores of the developed VC systems with a Pearson correlation coefficient of 0.98.
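The role of the distance metric in frame-level alignment can be illustrated with a minimal sketch. It assumes dynamic time warping (DTW) as the alignment procedure, which the abstract does not name explicitly, and it uses small hand-made posterior vectors rather than real GMM or GMM-UBM posteriors. The function names (`euclidean`, `cosine_distance`, `symmetric_kl`, `dtw_align`) are hypothetical, and the symmetric KL distance is taken here as KL(p||q) + KL(q||p); some authors use half of that sum.

```python
import numpy as np


def euclidean(p, q):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(p - q))


def cosine_distance(p, q):
    """One minus the cosine similarity of two non-zero vectors."""
    return 1.0 - float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))


def symmetric_kl(p, q, eps=1e-10):
    """Symmetric KL distance, KL(p||q) + KL(q||p), between two posteriors.

    Entries are floored at eps and renormalized so that the logarithm is
    defined (an assumption of this sketch, not taken from the paper).
    """
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum((p - q) * (np.log(p) - np.log(q))))


def dtw_align(X, Y, dist):
    """Align frame sequences X and Y with plain DTW under the metric `dist`.

    Returns the warping path as (source_frame, target_frame) index pairs
    and the accumulated cost of that path.
    """
    n, m = len(X), len(Y)
    cost = np.array([[dist(x, y) for y in Y] for x in X])
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = cost[i - 1, j - 1] + best_prev
    # Backtrack from the last frame pair to the first one.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return path, float(acc[n, m])


# Toy posterior sequences over three mixture components (illustrative only).
source = np.array([[0.90, 0.10, 0.00],
                   [0.80, 0.10, 0.10],
                   [0.10, 0.80, 0.10],
                   [0.00, 0.10, 0.90]])
target = np.array([[0.90, 0.05, 0.05],
                   [0.10, 0.80, 0.10],
                   [0.10, 0.70, 0.20],
                   [0.05, 0.05, 0.90]])

for name, metric in [("euclidean", euclidean),
                     ("cosine", cosine_distance),
                     ("symmetric KL", symmetric_kl)]:
    path, total = dtw_align(source, target, metric)
    print(f"{name}: cost={total:.3f}, path={path}")
```

Passing the metric as a function keeps the DTW recursion unchanged while the local distance is swapped, which is the kind of comparison the abstract describes. In a real system the rows would be MCC vectors or posterior vectors computed from a trained GMM, not hand-written numbers.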

Metadata
Title
Analysis of Features and Metrics for Alignment in Text-Dependent Voice Conversion
Authors
Nirmesh J. Shah
Hemant A. Patil
Copyright year
2017
DOI
https://doi.org/10.1007/978-3-319-69900-4_38