Published in: International Journal of Speech Technology 4/2013

1 December 2013

A unified framework for domain independent online speaker indexing in eigen-voice space using an index tree of reference models

Authors: M. H. Moattar, M. M. Homayounpour

Abstract

Speaker indexing, referred to in the literature as speaker diarization, is an important task in audio indexing and retrieval. It comprises two important and usually separate stages: speaker segmentation and speaker clustering. Speaker indexing can be performed online or offline. This paper focuses mainly on domain-independent online speaker indexing. For this purpose, the proposed framework should be parameter free, requiring no application-specific parameters such as utterance duration or threshold settings. To reduce the dependency on parameters, traditional speaker segmentation is reformulated as a voting-based homogeneous speech segmentation, in which several approaches are applied in parallel to decide whether a change point exists. In online indexing, data insufficiency is encountered at each time slice. In the proposed framework, a set of reference speaker models is used as side information to facilitate online tracking. To improve indexing accuracy, adaptation approaches in the eigen-voice decomposition space are proposed. To reduce the computational cost of tracking, an index structure over the reference models is proposed to speed up the search in the model space. The proposed framework is evaluated on the 2002 Rich Transcription Broadcast News and Conversational Telephone Speech Corpus (Garofolo et al. 2002) as well as a synthetic dataset. The indexing errors of the proposed framework on telephone conversations, broadcast news, and the synthetic dataset are 7.51 %, 6.36 %, and 9.34 %, respectively. In addition, with the index tree structure, the tracking run time of the proposed framework is improved by 32 %.
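
The voting idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which segmentation approaches vote or how their decisions are combined, so everything below is an assumption made for illustration: the three toy detectors (`mean_shift`, `variance_ratio`, `range_gap`), the one-dimensional feature windows, and the simple majority rule in `vote_change_point` are hypothetical and are not the authors' method.

```python
from statistics import mean, pvariance
from typing import Callable, Sequence

# A detector looks at the window before and after a candidate boundary
# and returns True if it believes the speaker changed there.
Detector = Callable[[Sequence[float], Sequence[float]], bool]


def mean_shift(left: Sequence[float], right: Sequence[float]) -> bool:
    # Toy detector (assumption): the window means differ by more than
    # the pooled standard deviation of the two windows.
    pooled_var = (pvariance(left) + pvariance(right)) / 2
    return abs(mean(left) - mean(right)) > pooled_var ** 0.5


def variance_ratio(left: Sequence[float], right: Sequence[float]) -> bool:
    # Toy detector (assumption): one window is more than four times
    # as variable as the other.
    a, b = pvariance(left), pvariance(right)
    return max(a, b) > 4 * min(a, b)


def range_gap(left: Sequence[float], right: Sequence[float]) -> bool:
    # Toy detector (assumption): the value ranges do not overlap.
    return max(left) < min(right) or max(right) < min(left)


DETECTORS: Sequence[Detector] = (mean_shift, variance_ratio, range_gap)


def vote_change_point(
    left: Sequence[float],
    right: Sequence[float],
    detectors: Sequence[Detector] = DETECTORS,
) -> bool:
    # Run every detector independently and accept the change point
    # only if a strict majority of them agree.
    votes = sum(1 for detect in detectors if detect(left, right))
    return 2 * votes > len(detectors)


if __name__ == "__main__":
    before = [0.0, 0.1, -0.1, 0.05]
    after = [5.0, 5.1, 4.9, 5.05]
    print(vote_change_point(before, after))   # two of three detectors vote yes
    print(vote_change_point(before, before))  # no detector votes yes
```

In this sketch no single detector decides alone, which is one plausible way a voting scheme could reduce sensitivity to any one detector's settings; how the paper achieves its parameter-free behaviour is not described in the abstract.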

References
Ajmera, J., McCowan, I., & Bourlard, H. (2004). Robust speaker change detection. IEEE Signal Processing Letters, 11(8), 649–651.
Anguera, X., & Hernando, J. (2004). XBIC: nueva medida para segmentacion de locutor hacia el indexado automatico de la senal de voz. In III jornadas en tecnologia del habla, Valencia, Spain.
Anguera, X., Wooters, C., & Hernando, J. (2006). Frame purification for cluster comparison in speaker diarization. In Second international workshop on multimodal user authentication.
Attias, H. (1999). Inferring parameters and structure of latent variable models by variational Bayes. In 15th conf. uncertainty artif. intell., Stockholm, Sweden (pp. 21–30).
Barras, C., Zhu, X., Meignier, S., & Gauvain, J. L. (2006). Multistage speaker diarization of broadcast news. IEEE Transactions on Audio, Speech, and Language Processing, 14(5), 1505–1512.
Berrani, S., Amsaleg, L., & Gros, P. (2003). Robust content-based image searches for copyright protection. In ACM workshop on multimedia databases, New Orleans, USA (pp. 70–77).
Bijankhan, M. (2002). Great farsdat database (Technical report). Iran Research center on Intelligent Signal Processing.
Bimbot, F., Magrin-Chagnolleau, I., & Mathan, L. (1995). Second order statistical measures for text-independent speaker identification. Speech Communication, 17(1–2), 177–192.
Boehm, C., & Pernkopf, F. (2009). Effective metric-based speaker segmentation in the frequency domain. In ICASSP (pp. 4081–4084).
Chen, S. S., & Gopalakrishnan, P. S. (1998). Clustering via the Bayesian information criterion with applications in speech recognition. In Proc. of ICASSP, USA (Vol. 2, pp. 645–648).
Chen, K., et al. (2000). Fast speaker adaptation using eigenspace-based maximum likelihood linear regression. In Interspeech (pp. 742–745).
Chu, S. M., Tang, H., & Huang, T. S. (2009). Fishervoice and semi-supervised speaker clustering. In ICASSP (pp. 4089–4092).
Davy, M., Doncarli, C., & Tourneret, J. (2000). Supervised classification using MCMC methods. In Proc. ICASSP (pp. 33–36).
Delacourt, P., & Wellekens, C. J. (2000). DISTBIC: a speaker based segmentation for audio indexing. Speech Communication, 32(1–2), 111–127.
Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B. Methodological, 39(1), 1–38.
Desobry, F., & Davy, M. (2003). Support vector-based online detection of abrupt changes. In ICASSP (Vol. 5, pp. 872–875).
Evans, N. W. D., Fredouille, C., & Bonastre, J. F. (2009). Speaker diarization using unsupervised discriminant analysis of inter-channel delay features. In ICASSP (pp. 4061–4064).
Fernandez, D., Otero, P. L., & Mateo, C. G. (2009). An adaptive threshold computation for unsupervised speaker segmentation. In Proc. of interspeech, Brighton, UK (pp. 843–849).
Garofolo, J. S., Lamel, L. F., Fisher, W. M., Fiscus, J. G., Pallett, D. S., & Dahlgren, N. L. (1993). The DARPA TIMIT acoustic-phonetic continuous speech corpus CDROM. Linguistic Data Consortium.
Garofolo, J., et al. (2002). NIST rich transcription 2002 evaluation: a preview. In LREC.
Gauvain, J. L., Lamel, L., & Adda, G. (1998). Partitioning and transcription of broadcast news data. In Proc. of interspeech, Sydney, Australia (Vol. 4, pp. 1335–1338).
Han, K. J., & Narayanan, S. (2007). A robust stopping criterion for agglomerative hierarchical clustering in a speaker diarization system. In Proc. of interspeech, Antwerp, Belgium.
Han, K. J., & Narayanan, S. S. (2008). Agglomerative hierarchical speaker clustering using incremental Gaussian mixture cluster modeling. In Interspeech (pp. 20–23).
Huang, C. H., Chien, J. T., & Wang, H. M. (2004). A new eigenvoice approach to speaker adaptation. In International symposium on Chinese spoken language processing (ISCSLP), Hong Kong.
Hung, J., Wang, H., & Lee, L. (2000). Automatic metric based speech segmentation for broadcast news via principal component analysis. In Proc. of interspeech, Beijing, China.
Iso, K. (2010). Speaker clustering using vector quantization and spectral clustering. In ICASSP (pp. 4986–4989).
Izmirli, O. (2000). Using a spectral flatness based feature for audio segmentation and retrieval (Abstract). In Proc. of the international symposium on music information retrieval (ISMIR2000), Plymouth, Massachusetts, USA.
Jolliffe, I. T. (1986). Principal component analysis. Berlin: Springer.
Kemp, T., Schmidt, M., Westphal, M., & Waibel, A. (2000). Strategies for automatic segmentation of audio data. In Proc. of ICASSP, Istanbul, Turkey (Vol. 3, pp. 1423–1426).
Kim, H., Elter, D., & Sikora, T. (2005). Hybrid speaker-based segmentation system using model-level clustering. In Proc. of ICASSP, Philadelphia, USA (Vol. I, pp. 745–748).
Koshinaka, T., Nagatomo, K., & Shinoda, K. (2009). Online speaker clustering using incremental learning of an ergodic hidden Markov model. In ICASSP (pp. 4093–4096).
Kotti, M., Moschou, V., & Kotropoulos, C. (2008). Speaker segmentation and clustering. Signal Processing, 88(5), 1091–1124.
Kuhn, R., Junqua, J. C., Nguyen, P., & Niedzielski, N. (2000). Rapid speaker adaptation in eigenvoice space. IEEE Transactions on Speech and Audio Processing, 8(4), 695–707.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86.
Kwok, J. T., Mak, B., & Ho, S. (2004). Eigenvoice speaker adaptation via composite kernel PCA. In NIPS 16, Cambridge: MIT Press.
Kwon, S., & Narayanan, S. (2004a). Unsupervised speaker indexing using generic models. IEEE Transactions on Speech and Audio Processing, 13, 1004–1013.
Kwon, S., & Narayanan, S. (2004b). Speaker model quantization for unsupervised speaker indexing. In Interspeech (pp. 1517–1520).
Lopez, J. F., & Ellis, D. P. W. (2000). Using acoustic condition clustering to improve acoustic change detection on broadcast news. In Proc. of interspeech, Beijing, China.
Lu, L., & Zhang, H. (2002). Speaker change detection and tracking in real-time news broadcast analysis. In Proc. of the ACM multimedia, France (pp. 602–610).
Lu, L., & Zhang, H. (2005). Unsupervised speaker segmentation and tracking in real-time audio content analysis. Multimedia Systems, 10(4), 332–343.
Mami, Y., & Charlet, D. (2002). Speaker identification by location in an optimal space of anchor models. In Proc. ICSLP, Denver, Colorado, USA (pp. 1333–1336).
Markov, K., & Nakamura, S. (2007). Never-ending learning with dynamic hidden Markov network. In Proc. of interspeech.
Markov, K., & Nakamura, S. (2008). Improved novelty detection for online GMM based speaker diarization. In Interspeech, Brisbane, Australia (pp. 363–366).
Moattar, M. H., & Homayounpour, M. M. (2009). A simple but efficient real-time voice activity detection algorithm. In 17th European signal processing conference (Eusipco) (pp. 2549–2553).
Moh, Y., Nguyen, P., & Junqua, J. C. (2003). Toward domain independent clustering. In Proc. of ICASSP (Vol. II, pp. 85–88).
Muthusamy, Y. K., et al. (1992). The OGI multi-language telephone speech corpus. In Interspeech (pp. 895–898).
Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models (pp. 355–368). Cambridge: MIT Press.
Nguyen, T. H., Cheng, E. S., & Li, H. (2008). T-test distance and clustering criterion for speaker diarization. In Interspeech (pp. 36–39).
Nguyen, T. H., Li, H., & Cheng, E. S. (2009). Cluster criterion functions in spectral subspace and their application in speaker clustering. In ICASSP (pp. 4085–4088).
Ning, H., Liu, M., Tang, H., & Huang, T. (2006). A spectral clustering approach to speaker diarization. In Interspeech (pp. 2178–2181).
Nishida, M., & Kawahara, T. (2003). Unsupervised speaker indexing using speaker model selection based on Bayesian information criterion. In ICASSP (Vol. 1, pp. 172–175).
Omar, M., Chaudhari, U., & Ramaswamy, G. (2005). Blind change detection for audio segmentation. In ICASSP.
Otero, P. L., Fernandez, L. D., & Mateo, C. G. (2010). Novel strategies for reducing the false alarm rate in a speaker segmentation system. In Proc. of ICASSP (pp. 4970–4973).
Rabiner, L. R., & Juang, B. H. (1993). Fundamentals of speech recognition. Englewood Cliffs: Prentice-Hall.
Reynolds, D. A. (1995). Speaker identification and verification using Gaussian mixture speaker models. Speech Communication, 17(1–2), 91–108.
Reynolds, D. A., Quatieri, T. F., & Dunn, R. B. (2000). Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1–3), 19–41.
Rodriguez, L. J., Penagarikano, M., & Bordel, G. (2007). A simple but effective approach to speaker tracking in broadcast news. In IbPRIA, part II (pp. 48–55).
Siegler, M. A., Jain, U., Raj, B., & Stern, R. M. (1997). Automatic segmentation, classification and clustering of broadcast news audio. In DARPA speech recognition workshop, Chantilly (pp. 97–99).
Sivakumaran, P., Fortuna, J., & Ariyaeeinia, A. (2001). On the use of the Bayesian information criterion in multiple speaker detection. In Eurospeech, Scandinavia.
Sun, H., et al. (2010). Speaker diarization system for RT-07 and RT-09 meeting room audio. In ICASSP (pp. 4982–4985).
Tang, H., Chu, S. M., & Huang, T. S. (2009). Generative model-based speaker clustering via mixture of von Mises-Fisher distributions. In ICASSP (pp. 4101–4104).
Tranter, S. E., Yu, K., Evermann, G., & Woodland, P. C. (2004). Generating and evaluating segmentations for automatic speech recognition of conversational telephone speech. In Proc. of ICASSP, Montreal, Canada (pp. 433–477).
Tritschler, A., & Gopinath, R. (1999). Improved speaker segmentation and segment clustering using the Bayesian information criterion. In EuroSpeech (pp. 679–682).
Tsai, W. H., Cheng, S. S., & Wang, H. M. (2007). Automatic speaker clustering using a voice characteristic reference space and maximum purity estimation. IEEE Transactions on Audio, Speech, and Language Processing, 15(4), 1461–1474.
Valente, F., & Wellekens, C. (2004). Variational Bayesian speaker clustering. In Speaker odyssey, Toledo, Spain.
Valente, F., & Wellekens, C. (2005). Variational Bayesian adaptation for speaker clustering. In Proc. of ICASSP, Lisbon, Portugal.
Valente, F., Motlicek, P., & Vijayasenan, D. (2010). Variational Bayesian speaker diarization of meeting recordings. In ICASSP (pp. 4954–4957).
Wang, D., Lu, L., & Zhang, H. J. (2003). Speech segmentation without speech recognition. In Proc. of ICASSP, Hong Kong (Vol. 1, pp. 468–471).
Wang, W., Lv, P., Zhao, Q., & Yan, Y. (2007). A decision-tree-based online speaker clustering. In Lecture notes in computer science (Vol. 4477, pp. 555–562). Berlin: Springer.
Wu, J., & Chang, E. (2001). Cohorts based custom models for rapid speaker and dialect adaptation. In Proc. eurospeech (pp. 1261–1264).
Zamalloa, M., et al. (2010). Low latency online speaker tracking on the AMI corpus of meeting conversations. In ICASSP (pp. 4962–4965).
Zdansky, J. (2006). BINSEG: an efficient speaker-based segmentation technique. In Interspeech, Pennsylvania (pp. 2186–2189).
Zezula, P., Amato, G., Dohnal, V., & Batko, M. (2006). Similarity search: the metric space approach. In Advances in database systems (Vol. 32). ISBN 0-387-29146-6
Zhou, B., & Hansen, J. (2002). Improved structural maximum likelihood eigenspace mapping for rapid speaker adaptation. In Interspeech, Denver, Colorado (pp. 554–564).
Zhou, B., & Hansen, J. H. L. (2005). Efficient audio stream segmentation via the combined T2 statistic and the Bayesian information criterion. IEEE Transactions on Speech and Audio Processing, 13(4), 467–474.
Metadata
Title
A unified framework for domain independent online speaker indexing in eigen-voice space using an index tree of reference models
Authors
M. H. Moattar
M. M. Homayounpour
Publication date
1 December 2013
Publisher
Springer US
Published in
International Journal of Speech Technology, Issue 4/2013
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-013-9190-8
