International Journal of Speech Technology

26.09.2018

A method to compensate the influence of speech codec in speaker recognition

Authors: José R. Calvo de Lara, Flavio J. Reyes Diaz, Gabriel Hernández Sierra, Orlando Jimenez Alcazar

Published in: International Journal of Speech Technology | Issue 4/2018

Abstract

Recognizing a person by their voice, or “speaker recognition”, is a biometric specialty increasingly used in electronic commerce, electronic banking transactions, and forensic investigations, among others. Speaker recognition relies on the discriminative information contained in a person's speech, and its main challenge is “session variability”: the variability between different speech samples of the same person used for training and evaluation. When speech is transmitted over the Internet, for example, the coding–decoding process (“codec”) causes a loss of this information and degrades the effectiveness of speaker recognition. Some methods have been proposed to mitigate this effect. This work studies how strongly some commonly used codec types affect this information and proposes our own solution to compensate for the session variability caused by the codec. The influence of several codec types on sample quality was first evaluated with a set of synthesized speech samples. Experiments were then carried out with speech samples from international competitions, retransmitted over two different codecs, and the effect on speaker recognition effectiveness was measured. Finally, the variability compensation was applied, improving the recognition effectiveness, measured by the equal error rate, by 20.8% for the G.722 codec and 27.8% for the GSM 6.20 codec.

Speaker Recognition Evaluation of the National Institute of Standards and Technology (NIST), USA. https://www.nist.gov/itl/iad/mig/speaker-recognition.

Asterisk is the world’s most popular open source communications project that lets you create telephony apps for IP PBXs, VoIP gateways, and conference servers. Available at: https://www.asterisk.org/.

Cepstral coefficients on a linear or Mel scale, standardized with respect to their mean and variance, plus energy and its derivatives, usually obtained every 20 ms of speech, with a dimension F that can vary from 39 to 60 depending on the application.
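
The following is a minimal sketch of how such a feature vector could be assembled; it is not the authors' front end. It assumes that each inner list holds the static coefficients of one frame (for example 12 cepstral coefficients plus energy), that standardization is done per utterance, and that derivatives are simple two-point differences. The function names `cmvn`, `deltas`, and `stack_features` are hypothetical.

```python
from statistics import fmean, pstdev
from typing import List

Frames = List[List[float]]  # one inner list per frame (e.g. one every 20 ms)


def cmvn(frames: Frames) -> Frames:
    """Standardize each coefficient to zero mean and unit variance over the utterance."""
    columns = list(zip(*frames))
    means = [fmean(col) for col in columns]
    # Assumption: a constant coefficient keeps a divisor of 1.0 to avoid division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(x - m) / s for x, m, s in zip(frame, means, stds)] for frame in frames]


def deltas(frames: Frames) -> Frames:
    """Approximate time derivatives from frames t-1 and t+1 (edge frames are replicated)."""
    n = len(frames)
    result = []
    for t in range(n):
        previous = frames[max(t - 1, 0)]
        following = frames[min(t + 1, n - 1)]
        result.append([(b - a) / 2.0 for a, b in zip(previous, following)])
    return result


def stack_features(frames: Frames) -> Frames:
    """Concatenate standardized static features with first and second derivatives."""
    static = cmvn(frames)
    first = deltas(static)
    second = deltas(first)
    return [s + d1 + d2 for s, d1, d2 in zip(static, first, second)]


if __name__ == "__main__":
    # Toy input: 5 frames of 13 static coefficients give vectors of dimension F = 39.
    toy = [[float(t * (i + 1)) for i in range(13)] for t in range(5)]
    print(len(stack_features(toy)[0]))
```

With 13 static coefficients per frame the stacked vector has F = 39; larger static sets move F toward the upper end of the range mentioned above.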

VoIP (Voice over IP): coded voice transmitted over Internet protocols.

As a convention, each codec is identified in this work as “codec name (bit rate)”.

The threshold was set using the a priori known target and impostor labels of the development samples. With the evaluation scores, it is possible to determine the probabilities of acceptance and false rejection of the targets, as well as the probabilities of rejection and false acceptance of the impostors. The score at the EER point of the DET curve, where the probabilities of false acceptance and false rejection are equal, is then used as the threshold to accept or reject the result of a comparison.
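
As an illustration of the EER operating point described here, the sketch below sweeps every observed score as a candidate threshold and keeps the one where the false acceptance and false rejection rates are closest. It is a simplified approximation under stated assumptions, not the evaluation tooling used in the paper: a trial is accepted when its score is greater than or equal to the threshold, and the function name `equal_error_rate` and the toy scores are hypothetical.

```python
from typing import List, Tuple


def equal_error_rate(target_scores: List[float],
                     impostor_scores: List[float]) -> Tuple[float, float]:
    """Return (eer, threshold) where false acceptance and false rejection are closest.

    Assumption: a comparison is accepted when score >= threshold.
    """
    candidates = sorted(set(target_scores) | set(impostor_scores))
    best_gap = float("inf")
    best_eer = 1.0
    best_threshold = candidates[0]
    for threshold in candidates:
        # False rejection: targets scoring below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        # False acceptance: impostors scoring at or above the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap = gap
            best_eer = (far + frr) / 2.0
            best_threshold = threshold
    return best_eer, best_threshold


if __name__ == "__main__":
    # Toy development scores with known labels (made-up values for illustration).
    targets = [0.9, 0.8, 0.7, 0.4]
    impostors = [0.6, 0.3, 0.2, 0.1]
    eer, threshold = equal_error_rate(targets, impostors)
    print(eer, threshold)
```

The returned threshold can then be applied unchanged to new comparisons, accepting those whose score reaches it.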

2008 NIST Speaker Recognition Evaluation Plan, April 3, 2008.

Private Branch Exchange (PBX): shares one or several telephone lines among a group of users.

“There is something there, in the air, that changes the meaning of things. That gentle wind flies, touches your face, as you count the leaves of the trees. The water runs looking for the fields. When I open the doors of my house, I think: this country, one more morning. At my age my strength begins to run out, I am hardly young anymore, and the death of my wife in the war weighs me down. When the body reaches that hour, the science of doctors can not stop the passage of time. As a child, back in my land, I used to spend my days rummaging from one place to another. Little by little, the cars of the city were calling my attention; My mother said to be careful, but I thought I was very old, so I had no interest or time for my own sign. But I’m still, it’s true; How many good things I found among your people. If I count the beloved summers then there are not seven, nor nine, nor twenty. It must be that I am a child again in this sad body.”

MOS (mean opinion score): a numerical indication of the perceptual quality of the voice after it has been processed (encoded, compressed, encrypted, etc.) and transmitted over the telephone channel. It is obtained from a survey conducted on a population of samples, in which users are asked to rate the perceived voice quality with values from 1 (worst case) to 5 (best case). The grades are averaged to obtain the MOS. The MOS scale is: (1) Impossible to communicate. (2) Very poor quality, almost impossible to communicate. (3) Poor quality, unclear and irritating, but still functional. (4) Failure to communicate can be perceived, but it is still possible to clearly hear the speaker. (5) Perfect conversation, as in a face-to-face conversation or a radio reception.
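
The averaging step can be sketched as follows. This is only an illustration of the arithmetic: the function name `mean_opinion_score` and the example ratings are hypothetical, and the sketch assumes integer ratings on the 1 to 5 scale.

```python
from statistics import fmean
from typing import Iterable


def mean_opinion_score(ratings: Iterable[int]) -> float:
    """Average listener ratings given on the 1 (worst) to 5 (best) scale."""
    values = list(ratings)
    if not values:
        raise ValueError("at least one rating is required")
    if any(r not in (1, 2, 3, 4, 5) for r in values):
        raise ValueError("ratings must be integers from 1 to 5")
    return fmean(values)


if __name__ == "__main__":
    # Made-up ratings from four listeners for one processed speech sample.
    print(mean_opinion_score([4, 5, 3, 4]))
```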

Medium bit rate codec, commonly used in VoIP communications.

Low bit rate codec, commonly used in mobile telephony.

Title: A method to compensate the influence of speech codec in speaker recognition
Authors: José R. Calvo de Lara
Flavio J. Reyes Diaz
Gabriel Hernández Sierra
Orlando Jimenez Alcazar
Publication date: 26.09.2018
Publisher: Springer US
Published in: International Journal of Speech Technology / Issue 4/2018
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI: https://doi.org/10.1007/s10772-018-9547-0