Published in: International Journal of Speech Technology 4/2018

26 September 2018

A method to compensate the influence of speech codec in speaker recognition

Written by: José R. Calvo de Lara, Flavio J. Reyes Diaz, Gabriel Hernández Sierra, Orlando Jimenez Alcazar


Abstract

Recognizing a person by their voice, or “speaker recognition”, is a biometric specialty increasingly used in electronic commerce, electronic banking transactions, and forensic investigations, among other applications. Speaker recognition relies on the discriminative information contained in a person’s speech. Its main challenge is “session variability”: the variability between different speech samples of the same person used for training and evaluation. When speech is transmitted over the internet, for example, the coding–decoding process (“codec”) causes a loss of this information and reduces the effectiveness of speaker recognition. Some methods have been proposed to mitigate this effect. This work studies how strongly this information is affected by some commonly used codec types and proposes our own solution to compensate for the session variability caused by the codec. The influence of several codec types on sample quality was evaluated first with a set of synthesized speech samples. Experiments were then carried out with speech samples from international competitions, retransmitted over two different codecs, and the effect on speaker recognition effectiveness was measured. Finally, the variability compensation was applied, improving recognition effectiveness, as measured by the equal error rate, by 20.8% for the G.722 codec and 27.8% for the GSM 6.20 codec.


Footnotes
1
Speaker Recognition Evaluation of the National Institute of Standards and Technology (NIST), USA. https://www.nist.gov/itl/iad/mig/speaker-recognition.
 
2
Asterisk is the world’s most popular open source communications project that lets you create telephony apps for IP PBXs, VoIP Gateways and Conference Servers. Available at: https://www.asterisk.org/.
 
3
Cepstral coefficients on a linear or Mel scale, normalized with respect to their mean and variance, plus energy and its derivatives, usually obtained every 20 ms of speech, with a dimension F that can vary from 39 to 60, depending on the application.
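
The mean and variance normalization mentioned in this footnote can be sketched as follows. This is a minimal illustration, not the authors' front end: the function name `cmvn`, the per-utterance scope of the statistics, and the small `eps` guard are assumptions made here, and the input values are made up.

```python
from statistics import fmean, pstdev


def cmvn(frames, eps=1e-10):
    """Normalize each coefficient to zero mean and unit variance.

    `frames` is a list of feature vectors (one per analysis frame),
    all of the same dimension F. Statistics are computed over the
    whole utterance (an assumption of this sketch).
    """
    n_dims = len(frames[0])
    # Collect the values of each coefficient across all frames.
    columns = [[frame[d] for frame in frames] for d in range(n_dims)]
    means = [fmean(col) for col in columns]
    # Guard against division by zero for constant coefficients.
    stds = [max(pstdev(col), eps) for col in columns]
    return [
        [(frame[d] - means[d]) / stds[d] for d in range(n_dims)]
        for frame in frames
    ]


# Toy example with three frames of dimension 2 (made-up values).
normalized = cmvn([[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]])
```

In practice F would be 39 to 60 as stated above; the toy vectors only keep the example readable.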
 
4
VoIP (Voice over IP): coded voice transmitted over Internet protocols.
 
5
As a convention to identify each codec, we will use in this work: “codec name (bit rate)”.
 
6
The threshold was set using the a priori known target and impostor labels of the development samples. From the evaluation scores, it is possible to determine the probabilities of correct acceptance and false rejection of the targets, as well as the probabilities of correct rejection and false acceptance of the impostors. The score at the EER point of the DET curve, where the probabilities of false acceptance and false rejection are equal, is then used as the threshold to accept or reject the result of a comparison.
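
The threshold selection described in this footnote can be illustrated with a small sketch. It is not the authors' implementation: the function name `equal_error_rate`, the rule "accept when score >= threshold", and the toy scores are assumptions. With finite data the two error rates rarely coincide exactly, so the sketch picks the candidate threshold where they are closest and reports their average.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Return (eer, threshold) from labeled development scores.

    A trial is accepted when its score is >= threshold (assumed
    convention). Every observed score is tried as a threshold.
    """
    candidates = sorted(set(target_scores) | set(impostor_scores))
    best = None
    for thr in candidates:
        # False rejection: target trials scoring below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        # False acceptance: impostor trials at or above the threshold.
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2.0, thr)
    return best[1], best[2]


# Made-up scores for illustration only.
targets = [0.9, 0.8, 0.7, 0.4]
impostors = [0.6, 0.3, 0.2, 0.1]
eer, threshold = equal_error_rate(targets, impostors)
# For these toy scores: eer == 0.25 at threshold == 0.6
```

The returned threshold would then be applied unchanged to the evaluation trials.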
 
7
2008 NIST Speaker Recognition Evaluation Plan, April 3, 2008.
 
8
Private Branch Exchange (PBX): a private telephone switch that shares one or several telephone lines among a group of users.
 
9
“There is something there, in the air, that changes the meaning of things. That gentle wind flies, touches your face, as you count the leaves of the trees. The water runs looking for the fields. When I open the doors of my house, I think: this country, one more morning. At my age my strength begins to run out, I am hardly young anymore, and the death of my wife in the war weighs me down. When the body reaches that hour, the science of doctors cannot stop the passage of time. As a child, back in my land, I used to spend my days rummaging from one place to another. Little by little, the cars of the city were calling my attention; my mother said to be careful, but I thought I was very old, so I had no interest or time for my own sign. But I’m still, it’s true; how many good things I found among your people. If I count the beloved summers then there are not seven, nor nine, nor twenty. It must be that I am a child again in this sad body.”
 
10
MOS (mean opinion score): a numerical indication of the perceptual quality of the voice after it has been processed (encoded, compressed, encrypted, etc.) and transmitted over the telephone channel. It is obtained from a survey in which users rate the perceived voice quality of a set of samples on a scale from 1 (worst case) to 5 (best case). The ratings are averaged to obtain the MOS. The MOS scale is: (1) Impossible to communicate. (2) Very poor quality, almost impossible to communicate. (3) Poor quality, unclear and irritating, but still functional. (4) Communication impairments can be perceived, but it is still possible to hear the speaker clearly. (5) Perfect conversation, as in a face-to-face conversation or radio reception.
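
The averaging step in this footnote amounts to an arithmetic mean of the listener ratings. A minimal sketch, with a hypothetical function name and made-up ratings:

```python
def mean_opinion_score(ratings):
    """Average listener ratings on the 1 (worst) to 5 (best) scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating out of range: {r}")
    return sum(ratings) / len(ratings)


# Four made-up listener ratings for one processed sample.
mos = mean_opinion_score([4, 5, 3, 4])
# mos == 4.0
```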
 
11
Medium bit rate codec, commonly used in VoIP communications.
 
12
Low bit rate codec, commonly used in mobile telephony.
 
Metadata
Title
A method to compensate the influence of speech codec in speaker recognition
Written by
José R. Calvo de Lara
Flavio J. Reyes Diaz
Gabriel Hernández Sierra
Orlando Jimenez Alcazar
Publication date
26 September 2018
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 4/2018
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-018-9547-0
