Published in: International Journal of Speech Technology 3/2012

1 September 2012

Speaker recognition using pyramid match kernel based support vector machines

Authors: A. D. Dileep, C. Chandra Sekhar

Abstract

Gaussian mixture model (GMM) based approaches are commonly used for speaker recognition tasks. GMM parameters are typically estimated with the expectation-maximization method, which is a non-discriminative learning method. Discriminative classifier based approaches to speaker recognition include support vector machine (SVM) based classifiers that use dynamic kernels such as the generalized linear discriminant sequence kernel, the probabilistic sequence kernel, the GMM supervector kernel, the GMM-UBM mean interval (GUMI) kernel and the intermediate matching kernel. Recently, the pyramid match kernel (PMK), which uses grids in the feature space as histogram bins, and the vocabulary-guided PMK (VGPMK), which uses clusters in the feature space as histogram bins, have been proposed for recognizing objects in an image represented as a set of local feature vectors. In the PMK, a set of feature vectors is mapped onto a multi-resolution histogram pyramid. The kernel between a pair of examples is computed by comparing their pyramids with a weighted histogram intersection function at each level of the pyramid. We propose to use the PMK-based SVM classifier for speaker identification and verification from the speech signal of an utterance represented as a set of local feature vectors. The main issue in building the PMK-based SVM classifier is the construction of the pyramid of histograms. We first propose to form hard clusters with the k-means clustering method, using an increasing number of clusters at successive levels of the pyramid, to design the codebook-based PMK (CBPMK). We then propose the GMM-based PMK (GMMPMK), which uses soft clustering. We compare the performance of the GMM-based approaches with that of the PMK-based and other dynamic kernel SVM-based approaches to speaker identification and verification. The 2002 and 2003 NIST speaker recognition corpora are used to evaluate the different approaches.
Our results show that the dynamic kernel SVM-based approaches perform significantly better than the state-of-the-art GMM-based approaches. For the speaker recognition task, the GMMPMK-based SVM performs better than SVMs using many other dynamic kernels and comparably to SVMs using the state-of-the-art GUMI kernel. The storage requirements of the GMMPMK-based SVMs are lower than those of SVMs using any other dynamic kernel.
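
The codebook-based construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The following are assumptions made only for illustration: the function names (`kmeans`, `assign`, `build_codebooks`, `histograms`, `cbpmk`), the branching factor of 4 (level `l` has `4**l` clusters), the level weights (1 at the finest level, divided by the branching factor for each coarser level), and the random data standing in for real speech features. Because the codebooks here are trained independently per level, the count of new matches at a coarser level is not guaranteed to be non-negative, so the sketch clips it at zero as a simplification.

```python
import numpy as np

def assign(vectors, centroids):
    """Index of the nearest centroid (squared Euclidean) for each vector."""
    d2 = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(data, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a (k, d) array of centroids."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        labels = assign(data, centroids)
        for c in range(k):
            members = data[labels == c]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[c] = members.mean(axis=0)
    return centroids

def build_codebooks(pooled, levels=3, branching=4):
    """One codebook per pyramid level; level l has branching**l clusters."""
    return [kmeans(pooled, branching ** l) for l in range(levels)]

def histograms(vectors, codebooks):
    """Hard-assignment histogram of a feature-vector set at every level."""
    return [np.bincount(assign(vectors, cb), minlength=len(cb))
            for cb in codebooks]

def cbpmk(x, y, codebooks, branching=4):
    """Weighted sum of new histogram-intersection matches across levels."""
    hx, hy = histograms(x, codebooks), histograms(y, codebooks)
    inter = [np.minimum(a, b).sum() for a, b in zip(hx, hy)]
    top = len(codebooks) - 1
    value = float(inter[top])  # every match at the finest level counts fully
    for l in range(top):
        # Matches first seen at coarser level l (clipped: an assumption here).
        new = max(inter[l] - inter[l + 1], 0)
        value += new / branching ** (top - l)
    return value

# Illustrative data only: random 13-dimensional "feature vectors".
rng = np.random.default_rng(1)
pooled = rng.normal(size=(500, 13))   # stand-in for pooled training features
x = rng.normal(size=(40, 13))         # utterance 1 as a set of local features
y = rng.normal(size=(55, 13))         # utterance 2, different length
codebooks = build_codebooks(pooled)
print(cbpmk(x, y, codebooks))
```

Kernel values computed this way for all pairs of training utterances form a Gram matrix that an SVM accepting precomputed kernels can be trained on. The soft-clustering variant named in the abstract would replace the hard counts in `histograms` with accumulated component responsibilities from a GMM.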

Metadata
Title
Speaker recognition using pyramid match kernel based support vector machines
Authors
A. D. Dileep
C. Chandra Sekhar
Publication date
1 September 2012
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 3/2012
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-012-9154-4
