Published in: International Journal of Speech Technology 3/2018

22 November 2017

Sparse coding of i-vector/JFA latent vector over ensemble dictionaries for language identification systems

Authors: Om Prakash Singh, Rohit Sinha

Abstract

Computing a sparse representation (SR) over an exemplar dictionary is time-consuming and computationally expensive when the dictionary is large, and storing such a dictionary requires a large amount of memory. To reduce the latency and to achieve some diversity, a language identification (LID) system based on an ensemble of exemplar dictionaries is explored. Full diversity is obtained when each exemplar dictionary contains only one feature vector from each language class. Achieving full diversity requires a large number of dictionaries, so the SR of a test utterance must be computed as many times. Another way to reduce the latency is to use a learned dictionary. Such a dictionary may contain an unequal number of atoms per language class, and it is not guaranteed that information from every language class is present; this depends entirely on the amount of data and its variation. Motivated by this, a language-specific dictionary is learned for each language, and these dictionaries are then concatenated to form a single learned dictionary. Furthermore, to overcome the problem of the ensemble exemplar dictionary based LID system, we investigate an LID system based on an ensemble of learned-exemplar dictionaries. The proposed approach achieves the same diversity and latency as the ensemble of exemplar dictionaries with a reduced number of learned dictionaries. The proposed techniques are applied to two spoken utterance representations: the i-vector and the JFA latent vector. The experiments are performed on the 2007 NIST LRE, 2009 NIST LRE and AP17-OLR datasets in the closed-set condition.
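
The ensemble idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name a sparse solver or a scoring rule, so plain matching pursuit and a sum of absolute coefficients per language are assumed here as stand-ins. All function names (`build_ensemble`, `identify`, and so on) are hypothetical, and the toy vectors are synthetic placeholders for i-vectors or JFA latent vectors, not NIST LRE or AP17-OLR data.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit Euclidean norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matching_pursuit(atoms, y, n_nonzero):
    """Greedy sparse code of y over unit-norm atoms.

    Assumption: plain matching pursuit is used as a simple stand-in solver.
    """
    residual = list(y)
    coeffs = [0.0] * len(atoms)
    for _ in range(n_nonzero):
        corr = [dot(atom, residual) for atom in atoms]
        k = max(range(len(atoms)), key=lambda i: abs(corr[i]))
        coeffs[k] += corr[k]
        residual = [r - corr[k] * a for r, a in zip(residual, atoms[k])]
    return coeffs


def build_ensemble(train, n_dicts, rng):
    """Build n_dicts exemplar dictionaries, each holding exactly one
    randomly drawn training vector per language (the full-diversity case)."""
    langs = sorted(train)
    return [
        [(lang, normalize(rng.choice(train[lang]))) for lang in langs]
        for _ in range(n_dicts)
    ]


def identify(ensemble, y, n_nonzero=2):
    """Sparse-code y over every dictionary and accumulate a score per language.

    Assumption: the score is the summed absolute coefficient of each
    language's atom across the ensemble.
    """
    y = normalize(y)
    scores = {}
    for dictionary in ensemble:
        atoms = [atom for _, atom in dictionary]
        coeffs = matching_pursuit(atoms, y, n_nonzero)
        for (lang, _), c in zip(dictionary, coeffs):
            scores[lang] = scores.get(lang, 0.0) + abs(c)
    return max(scores, key=scores.get), scores


# Synthetic toy data: three "languages" clustered around distinct directions.
rng = random.Random(0)
centers = {"lang_a": [1, 0, 0, 0], "lang_b": [0, 1, 0, 0], "lang_c": [0, 0, 1, 0]}


def sample(center):
    return [c + rng.gauss(0.0, 0.05) for c in center]


train = {lang: [sample(c) for _ in range(20)] for lang, c in centers.items()}
ensemble = build_ensemble(train, n_dicts=5, rng=rng)
predicted, scores = identify(ensemble, sample(centers["lang_b"]))
print(predicted, scores)
```

Each dictionary in this sketch has only as many atoms as there are languages, so a single sparse coding step is cheap, but the test vector is coded once per dictionary. That is the trade-off the abstract describes: full diversity with one exemplar per class requires many dictionaries and therefore many SR computations.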

Metadata
Title
Sparse coding of i-vector/JFA latent vector over ensemble dictionaries for language identification systems
Authors
Om Prakash Singh
Rohit Sinha
Publication date
22 November 2017
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 3/2018
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-017-9476-3
