2015 | OriginalPaper | Book chapter

8. Speaker Diarization: An Emerging Research

Authors: Trung Hieu Nguyen, Eng Siong Chng, Haizhou Li

Published in: Speech and Audio Processing for Coding, Enhancement and Recognition

Publisher: Springer New York

Abstract

Speaker diarization is the task of determining “Who spoke when?”: the objective is to annotate a continuous audio recording with speaker labels for the time regions in which each speaker talks. The labels need not be the actual speaker identities (recovering those is the task of speaker identification), as long as the same label is assigned to all regions uttered by the same speaker. These regions may overlap, since multiple speakers can talk simultaneously. Speaker diarization is thus essentially the combination of two processes: segmentation, in which speaker turns are detected, and unsupervised clustering, in which segments from the same speaker are grouped. The clustering is regarded as an unsupervised problem because there is no prior information about the number of speakers, their identities, or the acoustic conditions (Meignier et al., Comput Speech Lang 20(2–3):303–330, 2006; Zhou and Hansen, IEEE Trans Speech Audio Process 13(4):467–474, 2005). This chapter presents the fundamentals of speaker diarization and the most significant recent work on the topic.
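
As a rough illustration of the two-stage view in the abstract, the Python sketch below first segments a feature sequence at abrupt changes and then groups the segments bottom-up without knowing the number of speakers. Everything in it is an assumption made for illustration, not the chapter's method: the one-dimensional toy features, the threshold values, and the function names `segment`, `cluster` and `diarize` are hypothetical. Practical systems work on multidimensional acoustic features with statistical models, and this sketch does not handle overlapping speech.

```python
from statistics import mean


def segment(frames, change_threshold):
    """Segmentation: place a boundary wherever consecutive frames differ sharply."""
    boundaries = [0]
    for i in range(1, len(frames)):
        if abs(frames[i] - frames[i - 1]) > change_threshold:
            boundaries.append(i)
    boundaries.append(len(frames))
    # Each segment is a half-open frame range (start, end).
    return [(boundaries[k], boundaries[k + 1]) for k in range(len(boundaries) - 1)]


def cluster(frames, segments, merge_threshold):
    """Unsupervised clustering: repeatedly merge the two closest clusters."""
    clusters = [[seg] for seg in segments]

    def centroid(c):
        # Mean feature value over all frames belonging to the cluster.
        return mean(x for (s, e) in c for x in frames[s:e])

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(centroid(clusters[i]) - centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > merge_threshold:
            break  # no pair is similar enough; the speaker count is now fixed
        _, i, j = best
        clusters[i].extend(clusters.pop(j))
    return clusters


def diarize(frames, change_threshold=1.0, merge_threshold=1.0):
    """Return (start, end, label) triples; labels are arbitrary, not identities."""
    segments = segment(frames, change_threshold)
    clusters = cluster(frames, segments, merge_threshold)
    labelled = [(s, e, f"spk{k}") for k, c in enumerate(clusters) for (s, e) in c]
    return sorted(labelled)


# Toy "recording": three distinct feature levels, the first one recurring.
frames = [0.1, 0.0, 0.2, 5.0, 5.1, 4.9, 0.1, 0.0, 9.8, 10.0]
result = diarize(frames)
```

The labels `spk0`, `spk1`, ... only mark which regions belong together; renaming them consistently would give an equally valid diarization, which mirrors the abstract's point that labels need not be true identities.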

Literatur
5.
Zurück zum Zitat A.G. Adam, S.S. Kajarekar, H. Hermansky, A new speaker change detection method for two-speaker segmentation, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2002), vol. 4 (2002), pp. 3908–3911 A.G. Adam, S.S. Kajarekar, H. Hermansky, A new speaker change detection method for two-speaker segmentation, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2002), vol. 4 (2002), pp. 3908–3911
6.
Zurück zum Zitat A.G. Adami, L. Burget, S. Dupont, H. Garudadri, F. Grezl, H. Hermansky, P. Jain, S.S. Kajarekar, N. Morgan, S. Sivadas, Qualcomm-ICSI-OGI features for ASR, in Interspeech (2002) A.G. Adami, L. Burget, S. Dupont, H. Garudadri, F. Grezl, H. Hermansky, P. Jain, S.S. Kajarekar, N. Morgan, S. Sivadas, Qualcomm-ICSI-OGI features for ASR, in Interspeech (2002)
7.
Zurück zum Zitat J. Ajmera, H. Bourlard, I. Lapidot, I. McCowan, Unknown-multiple speaker clustering using HMM, in Interspeech (2002) J. Ajmera, H. Bourlard, I. Lapidot, I. McCowan, Unknown-multiple speaker clustering using HMM, in Interspeech (2002)
8.
Zurück zum Zitat J. Ajmera, I. McCowan, H. Bourlard, Robust speaker change detection. IEEE Signal Process. Lett. 11(8), 649–651 (2004)CrossRef J. Ajmera, I. McCowan, H. Bourlard, Robust speaker change detection. IEEE Signal Process. Lett. 11(8), 649–651 (2004)CrossRef
9.
Zurück zum Zitat J. Ajmera, C. Wooters, A robust speaker clustering algorithm, in 2003 IEEE Workshop on Automatic Speech Recognition and Understanding, 2003 (ASRU’03) (2003), pp. 411–416 J. Ajmera, C. Wooters, A robust speaker clustering algorithm, in 2003 IEEE Workshop on Automatic Speech Recognition and Understanding, 2003 (ASRU’03) (2003), pp. 411–416
10.
Zurück zum Zitat J. Allen, How do humans process and recognize speech? IEEE Trans. Speech Audio Process. 2(4), 567–577 (1994)CrossRef J. Allen, How do humans process and recognize speech? IEEE Trans. Speech Audio Process. 2(4), 567–577 (1994)CrossRef
12.
Zurück zum Zitat X. Anguera, M. Aguilo, C. Wooters, C. Nadeu, J. Hernando, Hybrid speech/non-speech detector applied to speaker diarization of meetings, in IEEE Odyssey 2006: The Speaker and Language Recognition Workshop (2006), pp. 1–6 X. Anguera, M. Aguilo, C. Wooters, C. Nadeu, J. Hernando, Hybrid speech/non-speech detector applied to speaker diarization of meetings, in IEEE Odyssey 2006: The Speaker and Language Recognition Workshop (2006), pp. 1–6
13.
Zurück zum Zitat X. Anguera, J. Hernando, Evolutive speaker segmentation using a repository system, in Proceedings of International Conference on Speech and Language Processing, Jeju Island, 2004 X. Anguera, J. Hernando, Evolutive speaker segmentation using a repository system, in Proceedings of International Conference on Speech and Language Processing, Jeju Island, 2004
14.
Zurück zum Zitat X. Anguera, J. Hernando, Xbic: real-time cross probabilities measure for speaker segmentation. University of California Berkeley, ICSIBerkeley Technical Report (2005) X. Anguera, J. Hernando, Xbic: real-time cross probabilities measure for speaker segmentation. University of California Berkeley, ICSIBerkeley Technical Report (2005)
15.
Zurück zum Zitat X. Anguera, C. Wooters, J. Hernando, Automatic cluster complexity and quantity selection: towards robust speaker diarization, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 248–256 X. Anguera, C. Wooters, J. Hernando, Automatic cluster complexity and quantity selection: towards robust speaker diarization, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 248–256
16.
Zurück zum Zitat X. Anguera, C. Wooters, J. Pardo, Robust speaker diarization for meetings: ICSI RT06s evaluation system, in Ninth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2006) X. Anguera, C. Wooters, J. Pardo, Robust speaker diarization for meetings: ICSI RT06s evaluation system, in Ninth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2006)
17.
Zurück zum Zitat X. Anguera, C. Wooters, J. Pardo, J. Hernando, Automatic weighting for the combination of TDOA and acoustic features in speaker diarization for meetings, in Proceedings of ICASSP (2007), pp. 241–244 X. Anguera, C. Wooters, J. Pardo, J. Hernando, Automatic weighting for the combination of TDOA and acoustic features in speaker diarization for meetings, in Proceedings of ICASSP (2007), pp. 241–244
18.
Zurück zum Zitat X. Anguera, C. Wooters, B. Peskin, M. Aguiló, Robust speaker segmentation for meetings: the ICSI-SRI spring 2005 diarization system, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 402–414 X. Anguera, C. Wooters, B. Peskin, M. Aguiló, Robust speaker segmentation for meetings: the ICSI-SRI spring 2005 diarization system, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 402–414
19.
Zurück zum Zitat C. Barras, X. Zhu, S. Meignier, J.L. Gauvain, Improving speaker diarization, in RT-04F Workshop (2004) C. Barras, X. Zhu, S. Meignier, J.L. Gauvain, Improving speaker diarization, in RT-04F Workshop (2004)
20.
Zurück zum Zitat M. Ben, M. Betser, F. Bimbot, G. Gravier, Speaker diarization using bottom-up clustering based on a parameter-derived distance between adapted GMMs, in Eighth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2004) M. Ben, M. Betser, F. Bimbot, G. Gravier, Speaker diarization using bottom-up clustering based on a parameter-derived distance between adapted GMMs, in Eighth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2004)
21.
Zurück zum Zitat F. Bimbot, L. Mathan, Text-free speaker recognition using an arithmetic-harmonic sphericity measure, in Third European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1993) F. Bimbot, L. Mathan, Text-free speaker recognition using an arithmetic-harmonic sphericity measure, in Third European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1993)
22.
Zurück zum Zitat J.F. Bonastre, P. Delacourt, C. Fredouille, T. Merlin, C. Wellekens, A speaker tracking system based on speaker turn detection for NIST evaluation, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000 (ICASSP’00), vol. 2 (2000), pp. 1177–1180 J.F. Bonastre, P. Delacourt, C. Fredouille, T. Merlin, C. Wellekens, A speaker tracking system based on speaker turn detection for NIST evaluation, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000 (ICASSP’00), vol. 2 (2000), pp. 1177–1180
23.
Zurück zum Zitat S. Bozonnet, N. Evans, C. Fredouille, The lia-eurecom RT’09 speaker diarization system: enhancements in speaker modelling and cluster purification, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4958–4961. doi:10.1109/ICASSP.2010.5495088 S. Bozonnet, N. Evans, C. Fredouille, The lia-eurecom RT’09 speaker diarization system: enhancements in speaker modelling and cluster purification, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4958–4961. doi:10.​1109/​ICASSP.​2010.​5495088
24.
Zurück zum Zitat J. Campbell et al., Speaker recognition: a tutorial. Proc. IEEE 85(9), 1437–1462 (1997)CrossRef J. Campbell et al., Speaker recognition: a tutorial. Proc. IEEE 85(9), 1437–1462 (1997)CrossRef
26.
Zurück zum Zitat G.C. Carter, A.H. Nuttall, P.G. Cable, The smoothed coherence transform. Proc. IEEE 61(10), 1497–1498 (1973)CrossRef G.C. Carter, A.H. Nuttall, P.G. Cable, The smoothed coherence transform. Proc. IEEE 61(10), 1497–1498 (1973)CrossRef
27.
Zurück zum Zitat S. Cassidy, The Macquarie speaker diarization system for RT04s, in NIST 2004 Spring Rich Transcription Evaluation Workshop, Montreal, 2004 S. Cassidy, The Macquarie speaker diarization system for RT04s, in NIST 2004 Spring Rich Transcription Evaluation Workshop, Montreal, 2004
28.
Zurück zum Zitat M. Cettolo, M. Vescovi, Efficient audio segmentation algorithms based on the BIC, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’03), vol. 6 (2003) M. Cettolo, M. Vescovi, Efficient audio segmentation algorithms based on the BIC, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’03), vol. 6 (2003)
29.
Zurück zum Zitat S. Chen, P. Gopalakrishnan, Speaker, environment and channel change detection and clustering via the Bayesian information criterion, in Proceedings of DARPA Broadcast News Transcription and Understanding Workshop (1998), pp. 127–132 S. Chen, P. Gopalakrishnan, Speaker, environment and channel change detection and clustering via the Bayesian information criterion, in Proceedings of DARPA Broadcast News Transcription and Understanding Workshop (1998), pp. 127–132
30.
Zurück zum Zitat T. Cover, J. Thomas, Elements of Information Theory (Wiley-Interscience, London, 2006)MATH T. Cover, J. Thomas, Elements of Information Theory (Wiley-Interscience, London, 2006)MATH
31.
Zurück zum Zitat S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28(4), 357–366 (1980) [see also IEEE Transactions on Signal Processing] S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28(4), 357–366 (1980) [see also IEEE Transactions on Signal Processing]
33.
Zurück zum Zitat P. Delacourt, D. Kryze, C. Wellekens, Detection of speaker changes in an audio document, in Sixth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1999) P. Delacourt, D. Kryze, C. Wellekens, Detection of speaker changes in an audio document, in Sixth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1999)
34.
Zurück zum Zitat P. Delacourt, C. Wellekens, DISTBIC: a speaker-based segmentation for audio data indexing. Speech Commun. 32(1–2), 111–126 (2000)CrossRef P. Delacourt, C. Wellekens, DISTBIC: a speaker-based segmentation for audio data indexing. Speech Commun. 32(1–2), 111–126 (2000)CrossRef
35.
Zurück zum Zitat A. Dempster, N. Laird, D. Rubin et al., Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. 39(1), 1–38 (1977)MathSciNetMATH A. Dempster, N. Laird, D. Rubin et al., Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. 39(1), 1–38 (1977)MathSciNetMATH
36.
Zurück zum Zitat R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification (Wiley, London, 2012) R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification (Wiley, London, 2012)
38.
Zurück zum Zitat E. El-Khoury, C. Senac, R. Andre-Obrecht, Speaker diarization: towards a more robust and portable system, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2007 (ICASSP 2007), vol. 4 (2007), pp. 489–492. doi:10.1109/ICASSP.2007.366956 E. El-Khoury, C. Senac, R. Andre-Obrecht, Speaker diarization: towards a more robust and portable system, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2007 (ICASSP 2007), vol. 4 (2007), pp. 489–492. doi:10.​1109/​ICASSP.​2007.​366956
39.
Zurück zum Zitat D.P. Ellis, J.C. Liu, Speaker turn segmentation based on between-channel differences, in NIST ICASSP 2004 Meeting Recognition Workshop, Montreal, 2004, pp. 112–117 D.P. Ellis, J.C. Liu, Speaker turn segmentation based on between-channel differences, in NIST ICASSP 2004 Meeting Recognition Workshop, Montreal, 2004, pp. 112–117
41.
Zurück zum Zitat J.G. Fiscus, J. Ajot, J.S. Garofolo, The rich transcription 2007 meeting recognition evaluation, in Multimodal Technologies for Perception of Humans (Springer, Berlin, 2008), pp. 373–389 J.G. Fiscus, J. Ajot, J.S. Garofolo, The rich transcription 2007 meeting recognition evaluation, in Multimodal Technologies for Perception of Humans (Springer, Berlin, 2008), pp. 373–389
42.
Zurück zum Zitat J.G. Fiscus, J. Ajot, M. Michel, J.S. Garofolo, The Rich Transcription 2006 Spring Meeting Recognition Evaluation (Springer, Berlin, 2006) J.G. Fiscus, J. Ajot, M. Michel, J.S. Garofolo, The Rich Transcription 2006 Spring Meeting Recognition Evaluation (Springer, Berlin, 2006)
43.
Zurück zum Zitat J.G. Fiscus, N. Radde, J.S. Garofolo, A. Le, J. Ajot, C. Laprun, The rich transcription 2005 spring meeting recognition evaluation, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 369–389 J.G. Fiscus, N. Radde, J.S. Garofolo, A. Le, J. Ajot, C. Laprun, The rich transcription 2005 spring meeting recognition evaluation, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 369–389
44.
Zurück zum Zitat E. Fox, E. Sudderth, M. Jordan, A. Willsky, An HDP-HMM for systems with state persistence, in Proceedings of the 25th International Conference on Machine Learning (ACM, New York, 2008), pp. 312–319 E. Fox, E. Sudderth, M. Jordan, A. Willsky, An HDP-HMM for systems with state persistence, in Proceedings of the 25th International Conference on Machine Learning (ACM, New York, 2008), pp. 312–319
45.
Zurück zum Zitat E.B. Fox, E.B. Sudderth, M.I. Jordan, A.S. Willsky, A sticky HDP-HMM with application to speaker diarization. Ann. Appl. Stat. 5(2A), 1020–1056 (2011)MathSciNetCrossRefMATH E.B. Fox, E.B. Sudderth, M.I. Jordan, A.S. Willsky, A sticky HDP-HMM with application to speaker diarization. Ann. Appl. Stat. 5(2A), 1020–1056 (2011)MathSciNetCrossRefMATH
46.
Zurück zum Zitat A. Friedland, B. Vinyals, C. Huang, D. Muller, Fusing short term and long term features for improved speaker diarization, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009) (2009), pp. 4077–4080. doi:10.1109/ICASSP.2009.4960524 A. Friedland, B. Vinyals, C. Huang, D. Muller, Fusing short term and long term features for improved speaker diarization, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009) (2009), pp. 4077–4080. doi:10.​1109/​ICASSP.​2009.​4960524
47.
Zurück zum Zitat G. Friedland, A. Janin, D. Imseng, X. Anguera Miro, L. Gottlieb, M. Huijbregts, M. Knox, O. Vinyals, The ICSI RT-09 speaker diarization system. IEEE Trans. Audio Speech Lang. Process. 20(2), 371–381 (2012). doi:10.1109/TASL.2011.2158419 CrossRef G. Friedland, A. Janin, D. Imseng, X. Anguera Miro, L. Gottlieb, M. Huijbregts, M. Knox, O. Vinyals, The ICSI RT-09 speaker diarization system. IEEE Trans. Audio Speech Lang. Process. 20(2), 371–381 (2012). doi:10.​1109/​TASL.​2011.​2158419 CrossRef
49.
Zurück zum Zitat R. Gangadharaiah, B. Narayanaswamy, N. Balakrishnan, A novel method for two-speaker segmentation, in Interspeech (2004) R. Gangadharaiah, B. Narayanaswamy, N. Balakrishnan, A novel method for two-speaker segmentation, in Interspeech (2004)
50.
Zurück zum Zitat J. Gauvain, C. Lee, Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process. 2(2), 291–298 (1994)CrossRef J. Gauvain, C. Lee, Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process. 2(2), 291–298 (1994)CrossRef
51.
Zurück zum Zitat J.L. Gauvain, L. Lamel, G. Adda, Partitioning and transcription of broadcast news data, in ICSLP, vol. 98 (1998), pp. 1335–1338 J.L. Gauvain, L. Lamel, G. Adda, Partitioning and transcription of broadcast news data, in ICSLP, vol. 98 (1998), pp. 1335–1338
52.
Zurück zum Zitat J.T. Geiger, F. Wallhoff, G. Rigoll, GMM-UBM based open-set online speaker diarization, in Interspeech (2010), pp. 2330–2333 J.T. Geiger, F. Wallhoff, G. Rigoll, GMM-UBM based open-set online speaker diarization, in Interspeech (2010), pp. 2330–2333
53.
Zurück zum Zitat H. Gish, M.H. Siu, R. Rohlicek, Segregation of speakers for speech recognition and speaker identification, in International Conference on Acoustics, Speech, and Signal Processing, 1991 (ICASSP-91) (1991), pp. 873–876 H. Gish, M.H. Siu, R. Rohlicek, Segregation of speakers for speech recognition and speaker identification, in International Conference on Acoustics, Speech, and Signal Processing, 1991 (ICASSP-91) (1991), pp. 873–876
54.
Zurück zum Zitat T. Hain, S. Johnson, A. Tuerk, P. Woodland, S. Young, Segment generation and clustering in the HTK broadcast news transcription system, in Proceedings of DARPA Broadcast News Transcription and Understanding Workshop, vol. 1998 (1998) T. Hain, S. Johnson, A. Tuerk, P. Woodland, S. Young, Segment generation and clustering in the HTK broadcast news transcription system, in Proceedings of DARPA Broadcast News Transcription and Understanding Workshop, vol. 1998 (1998)
55.
Zurück zum Zitat J. Hansen, B. Zhou, M. Akbacak, R. Sarikaya, B. Pellom, Audio stream phrase recognition for a national gallery of the spoken word:“ One Small Step”, in Sixth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2000) J. Hansen, B. Zhou, M. Akbacak, R. Sarikaya, B. Pellom, Audio stream phrase recognition for a national gallery of the spoken word:“ One Small Step”, in Sixth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2000)
56.
Zurück zum Zitat H. Hermansky, Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 87(4), 1738–1752 (1990)CrossRef H. Hermansky, Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 87(4), 1738–1752 (1990)CrossRef
57.
Zurück zum Zitat H. Hermansky, N. Morgan, A. Bayya, P. Kohn, RASTA-PLP speech analysis technique, in IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992 (ICASSP-92), vol. 1 (1992), pp. 121–124 H. Hermansky, N. Morgan, A. Bayya, P. Kohn, RASTA-PLP speech analysis technique, in IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992 (ICASSP-92), vol. 1 (1992), pp. 121–124
58.
Zurück zum Zitat M. Huijbregts, R. Ordelman, F. de Jong, Annotation of heterogeneous multimedia content using automatic speech recognition. Lecture Notes in Computer Science Semantic Multimedia, vol. 4816, (Springer Berlin Heldeberg 2007), pp. 78–90 M. Huijbregts, R. Ordelman, F. de Jong, Annotation of heterogeneous multimedia content using automatic speech recognition. Lecture Notes in Computer Science Semantic Multimedia, vol. 4816, (Springer Berlin Heldeberg 2007), pp. 78–90
59.
Zurück zum Zitat D. Imseng, G. Friedland, An adaptive initialization method for speaker diarization based on prosodic features, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4946–4949 D. Imseng, G. Friedland, An adaptive initialization method for speaker diarization based on prosodic features, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4946–4949
60.
Zurück zum Zitat D. Istrate, C. Fredouille, S. Meignier, L. Besacier, J.F. Bonastre, NIST RT’05S evaluation: pre-processing techniques and speaker diarization on multiple microphone meetings, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 428–439 D. Istrate, C. Fredouille, S. Meignier, L. Besacier, J.F. Bonastre, NIST RT’05S evaluation: pre-processing techniques and speaker diarization on multiple microphone meetings, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 428–439
61.
Zurück zum Zitat H. Jin, F. Kubala, R. Schwartz, Automatic speaker clustering, in Proceedings of the DARPA Speech Recognition Workshop (1997), pp. 108–111 H. Jin, F. Kubala, R. Schwartz, Automatic speaker clustering, in Proceedings of the DARPA Speech Recognition Workshop (1997), pp. 108–111
62.
Zurück zum Zitat Q. Jin, T. Schultz, Speaker segmentation and clustering in meetings, in Interspeech, vol. 4 (2004), pp. 597–600 Q. Jin, T. Schultz, Speaker segmentation and clustering in meetings, in Interspeech, vol. 4 (2004), pp. 597–600
63.
Zurück zum Zitat S. Johnson, Who spoke when?-automatic segmentation and clustering for determining speaker turns, in Sixth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1999) S. Johnson, Who spoke when?-automatic segmentation and clustering for determining speaker turns, in Sixth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1999)
64.
Zurück zum Zitat S.E. Johnson, J. Woodland, Speaker clustering using direct maximisation of the MLLR-adapted likelihood, in Proceedings of ICSLP 98 (1998), pp. 1775–1779 S.E. Johnson, J. Woodland, Speaker clustering using direct maximisation of the MLLR-adapted likelihood, in Proceedings of ICSLP 98 (1998), pp. 1775–1779
65.
Zurück zum Zitat T. Kemp, M. Schmidt, M. Westphal, A. Waibel, Strategies for automatic segmentation of audio data, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000 (ICASSP’00), vol. 3 (2000), pp. 1423–1426 T. Kemp, M. Schmidt, M. Westphal, A. Waibel, Strategies for automatic segmentation of audio data, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000 (ICASSP’00), vol. 3 (2000), pp. 1423–1426
67.
Zurück zum Zitat H. Kim, D. Ertelt, T. Sikora, Hybrid speaker-based segmentation system using model-level clustering, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1 (2005), pp. 745–748 H. Kim, D. Ertelt, T. Sikora, Hybrid speaker-based segmentation system using model-level clustering, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1 (2005), pp. 745–748
68.
Zurück zum Zitat B.E. Kingsbury, N. Morgan, S. Greenberg, Robust speech recognition using the modulation spectrogram. Speech Commun. 25(1), 117–132 (1998)CrossRef B.E. Kingsbury, N. Morgan, S. Greenberg, Robust speech recognition using the modulation spectrogram. Speech Commun. 25(1), 117–132 (1998)CrossRef
69.
Zurück zum Zitat C. Knapp, G. Carter, The generalized correlation method for estimation of time delay. IEEE Trans. Acoust. Speech Signal Process. 24(4), 320–327 (1976)CrossRef C. Knapp, G. Carter, The generalized correlation method for estimation of time delay. IEEE Trans. Acoust. Speech Signal Process. 24(4), 320–327 (1976)CrossRef
70.
Zurück zum Zitat T. Koshinaka, K. Nagatomo, K. Shinoda, Online speaker clustering using incremental learning of an ergodic hidden Markov model, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009) (2009), pp. 4093–4096. doi:10.1109/ICASSP.2009.4960528 T. Koshinaka, K. Nagatomo, K. Shinoda, Online speaker clustering using incremental learning of an ergodic hidden Markov model, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009) (2009), pp. 4093–4096. doi:10.​1109/​ICASSP.​2009.​4960528
71.
Zurück zum Zitat R. Kuhn, J.C. Junqua, P. Nguyen, N. Niedzielski, Rapid speaker adaptation in eigenvoice space. IEEE Trans. Speech Audio Process. 8(6), 695–707 (2000)CrossRef R. Kuhn, J.C. Junqua, P. Nguyen, N. Niedzielski, Rapid speaker adaptation in eigenvoice space. IEEE Trans. Speech Audio Process. 8(6), 695–707 (2000)CrossRef
72.
Zurück zum Zitat I. Lapidot, SOM as likelihood estimator for speaker clustering, in Eighth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2003) I. Lapidot, SOM as likelihood estimator for speaker clustering, in Eighth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2003)
73.
Zurück zum Zitat K. Laskowski, C. Fugen, T. Schultz, Simultaneous multispeaker segmentation for automatic meeting recognition, in Proceedings of EUSIPCO, Poznan, 2007, pp. 1294–1298 K. Laskowski, C. Fugen, T. Schultz, Simultaneous multispeaker segmentation for automatic meeting recognition, in Proceedings of EUSIPCO, Poznan, 2007, pp. 1294–1298
74.
Zurück zum Zitat K. Laskowski, Q. Jin, T. Schultz, Crosscorrelation-based multispeaker speech activity detection, in Eighth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2004) K. Laskowski, Q. Jin, T. Schultz, Crosscorrelation-based multispeaker speech activity detection, in Eighth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2004)
75.
Zurück zum Zitat K. Laskowski, G. Karlsruhe, T. Schultz, A geometric interpretation of non-target-normalized maximum cross-channel correlation for vocal activity detection in meetings, in Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pp. 89–92. Association for Computational Linguistics (2007) K. Laskowski, G. Karlsruhe, T. Schultz, A geometric interpretation of non-target-normalized maximum cross-channel correlation for vocal activity detection in meetings, in Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers, pp. 89–92. Association for Computational Linguistics (2007)
76.
Zurück zum Zitat K. Laskowski, T. Schultz, Unsupervised learning of overlapped speech model parameters for multichannel speech activity detection in meetings, in Proceedings of ICASSP (2006), pp. 993–996 K. Laskowski, T. Schultz, Unsupervised learning of overlapped speech model parameters for multichannel speech activity detection in meetings, in Proceedings of ICASSP (2006), pp. 993–996
77.
Zurück zum Zitat V.B. Le, O. Mella, D. Fohr, et al., Speaker diarization using normalized cross likelihood ratio, in Interspeech, vol. 7 (2007), pp. 1869–1872 V.B. Le, O. Mella, D. Fohr, et al., Speaker diarization using normalized cross likelihood ratio, in Interspeech, vol. 7 (2007), pp. 1869–1872
78.
Zurück zum Zitat D.A. van Leeuwen, The TNO speaker diarization system for NIST RT05s meeting data, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 440–449 D.A. van Leeuwen, The TNO speaker diarization system for NIST RT05s meeting data, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 440–449
79.
Zurück zum Zitat D.A. van Leeuwen, M. Konečný, Progress in the AMIDA speaker diarization system for meeting data, in Multimodal Technologies for Perception of Humans (Springer, Berlin, 2008), pp. 475–483 D.A. van Leeuwen, M. Konečný, Progress in the AMIDA speaker diarization system for meeting data, in Multimodal Technologies for Perception of Humans (Springer, Berlin, 2008), pp. 475–483
80.
Zurück zum Zitat D. Lilt, F. Kubala, Online speaker clustering, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004 (ICASSP’04), vol. 1 (2004), pp. 333–336 D. Lilt, F. Kubala, Online speaker clustering, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004 (ICASSP’04), vol. 1 (2004), pp. 333–336
81.
Zurück zum Zitat D. Liu, F. Kubala, Fast speaker change detection for broadcast news transcription and indexing, in Sixth European Conference on Speech Communication and Technology (1999) D. Liu, F. Kubala, Fast speaker change detection for broadcast news transcription and indexing, in Sixth European Conference on Speech Communication and Technology (1999)
82.
Zurück zum Zitat J. López, D. Ellis, Using acoustic condition clustering to improve acoustic change detection on broadcast news, in Sixth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2000) J. López, D. Ellis, Using acoustic condition clustering to improve acoustic change detection on broadcast news, in Sixth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2000)
83.
Zurück zum Zitat L. Lu, H. Zhang, Real-time unsupervised speaker change detection, in International Conference on Pattern Recognition, vol. 16 (2002), pp. 358–361 L. Lu, H. Zhang, Real-time unsupervised speaker change detection, in International Conference on Pattern Recognition, vol. 16 (2002), pp. 358–361
84.
Zurück zum Zitat J. Luque, C. Segura, J. Hernando, Clustering initialization based on spatial information for speaker diarization of meetings, in Interspeech (2008), pp. 383–386 J. Luque, C. Segura, J. Hernando, Clustering initialization based on spatial information for speaker diarization of meetings, in Interspeech (2008), pp. 383–386
85.
Zurück zum Zitat J. Makhoul, Linear prediction: a tutorial review. Proc. IEEE 63(4), 561–580 (1975)CrossRef J. Makhoul, Linear prediction: a tutorial review. Proc. IEEE 63(4), 561–580 (1975)CrossRef
86.
Zurück zum Zitat A. Malegaonkar, A. Ariyaeeinia, P. Sivakumaran, J. Fortuna, Unsupervised speaker change detection using probabilistic pattern matching. IEEE Signal Process. Lett. 13(8), 509–512 (2006)CrossRef A. Malegaonkar, A. Ariyaeeinia, P. Sivakumaran, J. Fortuna, Unsupervised speaker change detection using probabilistic pattern matching. IEEE Signal Process. Lett. 13(8), 509–512 (2006)CrossRef
87.
Zurück zum Zitat K. Markov, S. Nakamura, Never-ending learning system for on-line speaker diarization, in IEEE Workshop on Automatic Speech Recognition Understanding, 2007 (ASRU) (2007), pp. 699–704. doi:10.1109/ASRU.2007.4430197 K. Markov, S. Nakamura, Never-ending learning system for on-line speaker diarization, in IEEE Workshop on Automatic Speech Recognition Understanding, 2007 (ASRU) (2007), pp. 699–704. doi:10.​1109/​ASRU.​2007.​4430197
88.
K. Markov, S. Nakamura, Improved novelty detection for online GMM based speaker diarization, in Interspeech (2008), pp. 363–366
89.
S. Meignier, J. Bonastre, S. Igounet, E-HMM approach for learning and adapting sound models for speaker indexing, in 2001: A Speaker Odyssey-The Speaker Recognition Workshop (ISCA, Pittsburgh, 2001)
91.
X.A. Miró, Robust speaker diarization for meetings, Ph.D. thesis, Universitat Politècnica de Catalunya, Barcelona (2006)
92.
D. Moraru, S. Meignier, L. Besacier, J.F. Bonastre, I. Magrin-Chagnolleau, The ELISA consortium approaches in speaker segmentation during the NIST 2002 speaker recognition evaluation, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003 (ICASSP’03), vol. 2 (2003), p. II-89
93.
D. Moraru, S. Meignier, C. Fredouille, L. Besacier, J.F. Bonastre, The ELISA consortium approaches in broadcast news speaker segmentation during the NIST 2003 rich transcription evaluation, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004 (ICASSP’04), vol. 1 (2004), p. I-373
94.
K. Mori, S. Nakagawa, Speaker change detection and speaker clustering using VQ distortion for broadcast news speech recognition, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2001 (ICASSP’01), vol. 1 (2001)
95.
R.M. Neal, G.E. Hinton, A view of the EM algorithm that justifies incremental, sparse, and other variants, in Learning in Graphical Models (Springer, Berlin, 1998), pp. 355–368
96.
A.Y. Ng, M.I. Jordan, Y. Weiss et al., On spectral clustering: analysis and an algorithm. Adv. Neural Inf. Process. Syst. 2, 849–856 (2002)
97.
P. Nguyen, L. Rigazio, Y. Moh, J. Junqua, Rich transcription 2002 site report, Panasonic Speech Technology Laboratory (PSTL), in Proceedings of the 2002 Rich Transcription Workshop (2002)
98.
M. Nishida, T. Kawahara, Unsupervised speaker indexing using speaker model selection based on Bayesian information criterion, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003 (ICASSP’03), vol. 1 (2003), pp. 172–175
99.
J.M. Pardo, X. Anguera, C. Wooters, Speaker diarization for multi-microphone meetings using only between-channel differences, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 257–264
100.
J.M. Pardo, X. Anguera, C. Wooters, Speaker diarization for multiple distant microphone meetings: mixing acoustic features and inter-channel time differences, in Interspeech (2006)
101.
J.M. Pardo, R. Barra-Chicote, R. San-Segundo, R. de Córdoba, B. Martínez-González, Speaker diarization features: the UPM contribution to the RT09 evaluation. IEEE Trans. Audio Speech Lang. Process. 20(2), 426–435 (2012)
102.
J. Pelecanos, S. Sridharan, Feature warping for robust speaker verification, in 2001: A Speaker Odyssey-The Speaker Recognition Workshop (2001)
103.
L. Perez-Freire, C. Garcia-Mateo, A multimedia approach for audio segmentation in TV broadcast news, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004 (ICASSP’04), vol. 1 (2004)
104.
T. Pfau, D. Ellis, A. Stolcke, Multispeaker speech activity detection for the ICSI meeting recorder, in Proceedings of ASRU, vol. 1 (2001)
105.
L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)
106.
W.M. Rand, Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971)
107.
D. Reynolds, E. Singer, B. Carlson, G. O’Leary, J. McLaughlin, M. Zissman, Blind clustering of speech utterances based on speaker and language characteristics, in Fifth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 1998)
108.
D.A. Reynolds, T.F. Quatieri, R.B. Dunn, Speaker verification using adapted Gaussian mixture models. Digit. Signal Process. 10(1), 19–41 (2000)
109.
D.A. Reynolds, R.C. Rose, Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. Speech Audio Process. 3(1), 72–83 (1995)
110.
D.A. Reynolds, P. Torres-Carrasquillo, The MIT Lincoln Laboratory RT-04F diarization systems: applications to broadcast audio and telephone conversations. Technical Report, DTIC Document (2004)
111.
M. Roch, Y. Cheng, Speaker segmentation using the MAP-adapted Bayesian information criterion, in ODYSSEY04-The Speaker and Language Recognition Workshop (ISCA, Pittsburgh, 2004)
112.
P.R. Roth, Effective measurements using digital signal analysis. IEEE Spectr. 8(4), 62–70 (1971)
113.
J. Rougui, M. Rziza, D. Aboutajdine, M. Gelgon, J. Martinez, F. Rabat, Fast incremental clustering of Gaussian mixture speaker models for scaling up retrieval in on-line broadcast, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2006 (ICASSP 2006), vol. 5 (2006)
114.
M. Rouvier, S. Meignier, A global optimization framework for speaker diarization, in Odyssey 2012-The Speaker and Language Recognition Workshop (2012)
115.
M.A. Sato, S. Ishii, On-line EM algorithm for the normalized Gaussian network. Neural Comput. 12(2), 407–432 (2000)
116.
117.
E. Shriberg, L. Ferrer, S. Kajarekar, A. Venkataraman, A. Stolcke, Modeling prosodic feature sequences for speaker recognition. Speech Commun. 46(3), 455–472 (2005)
118.
S. Shum, N. Dehak, E. Chuangsuwanich, D.A. Reynolds, J.R. Glass, Exploiting intra-conversation variability for speaker diarization, in Interspeech (2011), pp. 945–948
119.
S. Shum, N. Dehak, R. Dehak, J. Glass, Unsupervised methods for speaker diarization: an integrated and iterative approach. IEEE Trans. Audio Speech Lang. Process. 21(10), 2015–2028 (2013). doi:10.1109/TASL.2013.2264673
120.
S. Shum, N. Dehak, J. Glass, On the use of spectral and iterative methods for speaker diarization. System 1(w2), 2 (2012)
121.
M.A. Siegler, U. Jain, B. Raj, R.M. Stern, Automatic segmentation, classification and clustering of broadcast news audio, in Proceedings of DARPA Broadcast News Workshop (1997), p. 11
122.
J. Silovsky, J. Prazak, Speaker diarization of broadcast streams using two-stage clustering based on i-vectors and cosine distance scoring, in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2012), pp. 4193–4196
123.
R. Sinha, S.E. Tranter, M.J. Gales, P.C. Woodland, The Cambridge University March 2005 speaker diarisation system, in Interspeech (2005), pp. 2437–2440
124.
P. Sivakumaran, J. Fortuna, A.M. Ariyaeeinia, On the use of the Bayesian information criterion in multiple speaker detection, in Interspeech (2001), pp. 795–798
125.
A. Solomonoff, A. Mielke, M. Schmidt, H. Gish, Clustering speakers by their voices, in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2 (1998), pp. 757–760
126.
S. Stevens, J. Volkmann, The relation of pitch to frequency: a revised scale. Am. J. Psychol. 53(3), 329–353 (1940)
127.
H. Sun, B. Ma, S. Kalayar Khine, H. Li, Speaker diarization system for RT07 and RT09 meeting room audio, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4982–4985
129.
130.
S. Tranter, Two-way cluster voting to improve speaker diarisation performance, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005 (ICASSP’05), vol. 1 (2005)
131.
A. Tritschler, R. Gopinath, Improved speaker segmentation and segments clustering using the Bayesian information criterion, in Sixth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1999), pp. 679–682
132.
W. Tsai, H. Wang, On maximizing the within-cluster homogeneity of speaker voice characteristics for speech utterance clustering, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Toulouse, 2006
133.
W.H. Tsai, S.S. Cheng, Y.H. Chao, H.M. Wang, Clustering speech utterances by speaker using eigenvoice-motivated vector space models, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005 (ICASSP’05), vol. 1 (2005), pp. 725–728
134.
W.H. Tsai, S.S. Cheng, H.M. Wang, Speaker clustering of speech utterances using a voice characteristic reference space, in Eighth International Conference on Spoken Language Processing (2004)
135.
F. Valente, Infinite models for speaker clustering, in Ninth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2006)
136.
F. Valente, C. Wellekens, Variational Bayesian speaker clustering, in ODYSSEY04-The Speaker and Language Recognition Workshop (ISCA, Pittsburgh, 2004)
137.
F. Valente, C. Wellekens, Variational Bayesian adaptation for speaker clustering, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005 (ICASSP’05), vol. 1 (2005)
138.
D. Van Leeuwen, T. Factors, The TNO speaker diarization system for NIST RT05s meeting data, in Machine Learning for Multimodal Interaction. Lecture Notes in Computer Science, vol. 3869 (Springer, Berlin, 2006), p. 440
139.
A. Vandecatseye, J. Martens, A fast, accurate and stream-based speaker segmentation and clustering algorithm, in Eighth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2003)
140.
D. Vijayasenan, F. Valente, Speaker diarization of meetings based on large TDOA feature vectors, in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2012), pp. 4173–4176. doi:10.1109/ICASSP.2012.6288838
141.
D. Vijayasenan, F. Valente, H. Bourlard, Agglomerative information bottleneck for speaker diarization of meetings data, in IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU) (2007), pp. 250–449
142.
D. Vijayasenan, F. Valente, H. Bourlard, Combination of agglomerative and sequential clustering for speaker diarization, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2008 (ICASSP 2008) (2008), pp. 4361–4364. doi:10.1109/ICASSP.2008.4518621
143.
D. Vijayasenan, F. Valente, H. Bourlard, Integration of TDOA features in information bottleneck framework for fast speaker diarization, in Interspeech (2008), pp. 40–43
144.
D. Vijayasenan, F. Valente, H. Bourlard, Mutual information based channel selection for speaker diarization of meetings data, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009) (2009), pp. 4065–4068. doi:10.1109/ICASSP.2009.4960521
145.
D. Vijayasenan, F. Valente, H. Bourlard, An information theoretic combination of MFCC and TDOA features for speaker diarization. IEEE Trans. Audio Speech Lang. Process. 19(2), 431–438 (2011). doi:10.1109/TASL.2010.2048603
146.
D. Vijayasenan, F. Valente, H. Bourlard, Multistream speaker diarization of meetings recordings beyond MFCC and TDOA features. Speech Commun. 54(1), 55–67 (2012)
147.
O. Vinyals, G. Friedland, Modulation spectrogram features for improved speaker diarization, in Interspeech (2008), pp. 630–633
148.
A. Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 13(2), 260–269 (1967)
149.
H. Wang, S. Cheng, METRIC-SEQDAC: a hybrid approach for audio segmentation, in Eighth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2004)
150.
N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series: With Engineering Applications, vol. 8 (MIT Press, Cambridge, 1964)
151.
A. Willsky, H. Jones, A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Trans. Automat. Contr. 21(1), 108–112 (1976)
152.
C. Wooters, J. Fung, B. Peskin, X. Anguera, Towards robust speaker segmentation: the ICSI-SRI fall 2004 diarization system, in RT-04F Workshop, vol. 23 (2004)
153.
C. Wooters, M. Huijbregts, The ICSI RT07s speaker diarization system, in Multimodal Technologies for Perception of Humans (Springer, Berlin, 2008), pp. 509–519
154.
S. Wrigley, G. Brown, V. Wan, S. Renals, Feature selection for the classification of crosstalk in multi-channel audio, in Eighth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2003)
155.
S. Wrigley, G. Brown, V. Wan, S. Renals, Speech and crosstalk detection in multichannel audio. IEEE Trans. Speech Audio Process. 13(1), 84–91 (2005)
156.
T. Wu, L. Lu, K. Chen, H. Zhang, UBM-based real-time speaker segmentation for broadcasting news, in ICME 2003, vol. 2 (2003), pp. 721–724
157.
K. Yamanishi, J.I. Takeuchi, G. Williams, P. Milne, On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms, in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, New York, 2000), pp. 320–324
158.
M. Zamalloa, L.J. Rodríguez-Fuentes, G. Bordel, M. Penagarikano, J.P. Uribe, Low-latency online speaker tracking on the AMI corpus of meeting conversations, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4962–4965
159.
B. Zhou, J. Hansen, Efficient audio stream segmentation via the combined T² statistic and Bayesian information criterion. IEEE Trans. Speech Audio Process. 13(4), 467–474 (2005)
160.
B. Zhou, J.H. Hansen, Unsupervised audio stream segmentation and clustering via the Bayesian information criterion, in Interspeech (2000), pp. 714–717
161.
X. Zhu, C. Barras, L. Lamel, J.L. Gauvain, Speaker diarization: from broadcast news to lectures, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 396–406
162.
X. Zhu, C. Barras, S. Meignier, J.L. Gauvain, Combining speaker identification and BIC for speaker diarization, in Interspeech, vol. 5 (2005), pp. 2441–2444
163.
P. Zochova, V. Radova, Modified DISTBIC algorithm for speaker change detection, in Ninth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2005)
164.
E. Zwicker, E. Terhardt, Analytical expressions for critical-band rate and critical bandwidth as a function of frequency. J. Acoust. Soc. Am. 68, 1523 (1980)
Metadata
Title
Speaker Diarization: An Emerging Research
Authors
Trung Hieu Nguyen
Eng Siong Chng
Haizhou Li
Copyright year
2015
Publisher
Springer New York
DOI
https://doi.org/10.1007/978-1-4939-1456-2_8
