
2017 | Original Paper | Book Chapter

Analysis of Features and Metrics for Alignment in Text-Dependent Voice Conversion

Written by: Nirmesh J. Shah, Hemant A. Patil

Published in: Pattern Recognition and Machine Intelligence

Publisher: Springer International Publishing

Abstract

Voice Conversion (VC) is a technique that converts the perceived speaker identity from a source speaker to a target speaker. In text-dependent VC, given a parallel training speech database of the source and target speakers, the first task is to align the two speakers’ spectral features at the frame level before learning the mapping function. The accuracy of this alignment affects the learning of the mapping function and hence the quality of the converted voice. The impact of alignment has not been explored much in the VC literature. Most alignment techniques align acoustic features (namely, spectral features such as Mel Cepstral Coefficients (MCC)). However, spectral features represent both speaker-specific and speech-specific information. In this paper, we analyze the use of different speaker-independent features for the alignment task, namely unsupervised posterior features: Gaussian Mixture Model (GMM)-based posterior features and GMM-UBM-based posterior features, i.e., those obtained by Maximum A Posteriori (MAP) adaptation from a Universal Background Model (UBM). In addition, we propose to use different metrics for the alignment, such as the symmetric Kullback-Leibler (KL) and cosine distances, instead of the Euclidean distance. Our analysis, based on % Phone Accuracy (PA), correlates with the subjective scores of the developed VC systems with a Pearson correlation coefficient of 0.98.
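To make the role of the distance metric concrete, the following is a minimal sketch of frame-level alignment with an interchangeable metric. It assumes dynamic time warping (DTW) as the aligner, which the abstract itself does not name, and it uses small hand-made posterior vectors rather than real GMM or GMM-UBM posteriors. The function names (euclidean, cosine_distance, symmetric_kl, dtw_align) are hypothetical and are not taken from the chapter. The symmetric KL distance is written here as KL(p||q) + KL(q||p); some authors use half of that sum.

```python
import numpy as np


def euclidean(x, y):
    """Euclidean distance between two feature vectors."""
    return float(np.linalg.norm(x - y))


def cosine_distance(x, y):
    """One minus the cosine similarity of two feature vectors."""
    return float(1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))


def symmetric_kl(p, q, eps=1e-10):
    """Symmetric KL distance, KL(p||q) + KL(q||p), for posterior vectors."""
    # Clip and renormalise so that log() never sees a zero (assumed safeguard).
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))


def dtw_align(src, tgt, dist):
    """Align two frame sequences with basic DTW; return (path, total cost)."""
    n, m = len(src), len(tgt)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(src[i - 1], tgt[j - 1])
            cost[i, j] = d + min(cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])

    # Backtrack from the last cell to the first to recover the frame pairs.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [
            (cost[i - 1, j - 1], i - 1, j - 1),
            (cost[i - 1, j], i - 1, j),
            (cost[i, j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
        path.append((i - 1, j - 1))
    path.reverse()
    return path, float(cost[n, m])


# Toy posteriors over three classes; the target repeats its first frame.
src_post = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]])
tgt_post = np.array(
    [[0.8, 0.1, 0.1], [0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
)

for name, metric in [
    ("euclidean", euclidean),
    ("cosine", cosine_distance),
    ("symmetric KL", symmetric_kl),
]:
    path, total = dtw_align(src_post, tgt_post, metric)
    print(name, path, total)
```

Passing the metric as a function keeps the aligner unchanged when switching between Euclidean, cosine, and symmetric KL distances, which is the kind of comparison the abstract describes. The symmetric KL variant is only meaningful for probability-like inputs such as posteriors, not for raw MCC vectors.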

Title: Analysis of Features and Metrics for Alignment in Text-Dependent Voice Conversion
Written by: Nirmesh J. Shah, Hemant A. Patil
Publisher: Springer International Publishing
Book: Pattern Recognition and Machine Intelligence
Print ISBN: 978-3-319-69899-1
Electronic ISBN: 978-3-319-69900-4
Copyright year: 2017
DOI: https://doi.org/10.1007/978-3-319-69900-4_38
