2015 | OriginalPaper | Book chapter

8. Speaker Diarization: An Emerging Research

Authors: Trung Hieu Nguyen, Eng Siong Chng, Haizhou Li

Published in: Speech and Audio Processing for Coding, Enhancement and Recognition

Publisher: Springer New York

Abstract

Speaker diarization is the task of determining “Who spoke when?”: the objective is to annotate a continuous audio recording with speaker labels for the time regions in which each speaker talks. The labels need not be the actual speaker identities (recovering those is speaker identification); it is enough that the same label is assigned to all regions uttered by the same speaker. These regions may overlap, since multiple speakers can talk simultaneously. Speaker diarization is thus essentially the combination of two processes: segmentation, in which speaker turns are detected, and unsupervised clustering, in which segments from the same speaker are grouped. The clustering is treated as an unsupervised problem because there is no prior information about the number of speakers, their identities, or the acoustic conditions (Meignier et al., Comput Speech Lang 20(2–3):303–330, 2006; Zhou and Hansen, IEEE Trans Speech Audio Process 13(4):467–474, 2005). This chapter presents the fundamentals of speaker diarization and the most significant work on the topic in recent years.
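
The two-stage view described in the abstract can be illustrated with a toy sketch. This is not the chapter's method: the `Segment` type, the one-dimensional `feature` value (a stand-in for a real acoustic representation), the centroid distance, and the stopping `threshold` are all assumptions made for illustration. Segmentation is taken as already done, and a simple agglomerative clustering then groups segments without knowing the number of speakers in advance.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float    # segment start time in seconds
    end: float      # segment end time in seconds
    feature: float  # hypothetical 1-D stand-in for an acoustic feature


def cluster_segments(segments, threshold):
    """Merge the closest pair of clusters until the closest pair is farther
    apart than `threshold`; the number of speakers is never given."""
    clusters = [[i] for i in range(len(segments))]

    def centroid(cluster):
        return sum(segments[i].feature for i in cluster) / len(cluster)

    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = abs(centroid(clusters[a]) - centroid(clusters[b]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]

    # Labels are arbitrary names, not speaker identities.
    labels = [None] * len(segments)
    for k, cluster in enumerate(clusters):
        for i in cluster:
            labels[i] = f"spk{k}"
    return labels


def diarize(segments, threshold=1.0):
    labels = cluster_segments(segments, threshold)
    return [(s.start, s.end, lab) for s, lab in zip(segments, labels)]


segments = [
    Segment(0.0, 2.5, 0.1),
    Segment(2.0, 4.0, 5.2),  # overlaps the previous segment in time
    Segment(4.0, 6.0, 0.3),
    Segment(6.0, 7.5, 9.8),
]
result = diarize(segments)
for start, end, label in result:
    print(f"{start:5.1f} {end:5.1f} {label}")
```

In this toy input the first and third segments receive the same label, and the second segment keeps its own label even though it overlaps the first in time. Renaming the labels consistently would give an equally valid diarization, which is the sense in which the labels are not identities.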

References

1.

The ISL Meeting Corpus (2004), https://catalog.ldc.upenn.edu/LDC2004S05. Accessed 25 Aug 2014

2.

The ICSI Meeting Corpus (2004), https://catalog.ldc.upenn.edu/LDC2004S02. Accessed 24 Aug 2014

3.

NIST Meeting Room Pilot Corpus (2004), https://catalog.ldc.upenn.edu/LDC2004S09. Accessed 24 Aug 2014

4.

The AMI corpus (2007), http://groups.inf.ed.ac.uk/ami/download/. Accessed 25 Aug 2014

5.

A.G. Adami, S.S. Kajarekar, H. Hermansky, A new speaker change detection method for two-speaker segmentation, in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2002), vol. 4 (2002), pp. 3908–3911

6.

A.G. Adami, L. Burget, S. Dupont, H. Garudadri, F. Grezl, H. Hermansky, P. Jain, S.S. Kajarekar, N. Morgan, S. Sivadas, Qualcomm-ICSI-OGI features for ASR, in Interspeech (2002)

7.

J. Ajmera, H. Bourlard, I. Lapidot, I. McCowan, Unknown-multiple speaker clustering using HMM, in Interspeech (2002)

8.

J. Ajmera, I. McCowan, H. Bourlard, Robust speaker change detection. IEEE Signal Process. Lett. 11(8), 649–651 (2004)

9.

J. Ajmera, C. Wooters, A robust speaker clustering algorithm, in 2003 IEEE Workshop on Automatic Speech Recognition and Understanding, 2003 (ASRU’03) (2003), pp. 411–416

10.

J. Allen, How do humans process and recognize speech? IEEE Trans. Speech Audio Process. 2(4), 567–577 (1994)

11.

X. Anguera, BeamformIt acoustic beamformer (2009), http://www.xavieranguera.com/beamformit/. Accessed 24 Aug 2014

12.

X. Anguera, M. Aguilo, C. Wooters, C. Nadeu, J. Hernando, Hybrid speech/non-speech detector applied to speaker diarization of meetings, in IEEE Odyssey 2006: The Speaker and Language Recognition Workshop (2006), pp. 1–6

13.

X. Anguera, J. Hernando, Evolutive speaker segmentation using a repository system, in Proceedings of International Conference on Speech and Language Processing, Jeju Island, 2004

14.

X. Anguera, J. Hernando, XBIC: real-time cross probabilities measure for speaker segmentation. University of California Berkeley, ICSI Berkeley Technical Report (2005)

15.

X. Anguera, C. Wooters, J. Hernando, Automatic cluster complexity and quantity selection: towards robust speaker diarization, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 248–256

16.

X. Anguera, C. Wooters, J. Pardo, Robust speaker diarization for meetings: ICSI RT06s evaluation system, in Ninth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2006)

17.

X. Anguera, C. Wooters, J. Pardo, J. Hernando, Automatic weighting for the combination of TDOA and acoustic features in speaker diarization for meetings, in Proceedings of ICASSP (2007), pp. 241–244

18.

X. Anguera, C. Wooters, B. Peskin, M. Aguiló, Robust speaker segmentation for meetings: the ICSI-SRI spring 2005 diarization system, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 402–414

19.

C. Barras, X. Zhu, S. Meignier, J.L. Gauvain, Improving speaker diarization, in RT-04F Workshop (2004)

20.

M. Ben, M. Betser, F. Bimbot, G. Gravier, Speaker diarization using bottom-up clustering based on a parameter-derived distance between adapted GMMs, in Eighth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2004)

21.

F. Bimbot, L. Mathan, Text-free speaker recognition using an arithmetic-harmonic sphericity measure, in Third European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1993)

22.

J.F. Bonastre, P. Delacourt, C. Fredouille, T. Merlin, C. Wellekens, A speaker tracking system based on speaker turn detection for NIST evaluation, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000 (ICASSP’00), vol. 2 (2000), pp. 1177–1180

23.

S. Bozonnet, N. Evans, C. Fredouille, The LIA-EURECOM RT’09 speaker diarization system: enhancements in speaker modelling and cluster purification, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4958–4961. doi:10.1109/ICASSP.2010.5495088

24.

J. Campbell et al., Speaker recognition: a tutorial. Proc. IEEE 85(9), 1437–1462 (1997)

25.

W. Campbell, D. Sturim, D. Reynolds, Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett. 13(5), 308–311 (2006). doi:10.1109/LSP.2006.870086

26.

G.C. Carter, A.H. Nuttall, P.G. Cable, The smoothed coherence transform. Proc. IEEE 61(10), 1497–1498 (1973)

27.

S. Cassidy, The Macquarie speaker diarization system for RT04s, in NIST 2004 Spring Rich Transcription Evaluation Workshop, Montreal, 2004

28.

M. Cettolo, M. Vescovi, Efficient audio segmentation algorithms based on the BIC, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP’03), vol. 6 (2003)

29.

S. Chen, P. Gopalakrishnan, Speaker, environment and channel change detection and clustering via the Bayesian information criterion, in Proceedings of DARPA Broadcast News Transcription and Understanding Workshop (1998), pp. 127–132

30.

T. Cover, J. Thomas, Elements of Information Theory (Wiley-Interscience, London, 2006)

31.

S. Davis, P. Mermelstein, Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28(4), 357–366 (1980) [see also IEEE Transactions on Signal Processing]

32.

N. Dehak, P. Kenny, R. Dehak, P. Dumouchel, P. Ouellet, Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2011). doi:10.1109/TASL.2010.2064307

33.

P. Delacourt, D. Kryze, C. Wellekens, Detection of speaker changes in an audio document, in Sixth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1999)

34.

P. Delacourt, C. Wellekens, DISTBIC: a speaker-based segmentation for audio data indexing. Speech Commun. 32(1–2), 111–126 (2000)

35.

A. Dempster, N. Laird, D. Rubin et al., Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. 39(1), 1–38 (1977)

36.

R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification (Wiley, London, 2012)

37.

C. Eckart, Optimal rectifier systems for the detection of steady signals. Scripps Institution of Oceanography, UC San Diego (1952), http://escholarship.org/uc/item/3676p6rt

38.

E. El-Khoury, C. Senac, R. Andre-Obrecht, Speaker diarization: towards a more robust and portable system, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2007 (ICASSP 2007), vol. 4 (2007), pp. 489–492. doi:10.1109/ICASSP.2007.366956

39.

D.P. Ellis, J.C. Liu, Speaker turn segmentation based on between-channel differences, in NIST ICASSP 2004 Meeting Recognition Workshop, Montreal, 2004, pp. 112–117

40.

T. Ferguson, A Bayesian analysis of some nonparametric problems. Ann. Stat. 1(2), 209–230 (1973)

41.

J.G. Fiscus, J. Ajot, J.S. Garofolo, The rich transcription 2007 meeting recognition evaluation, in Multimodal Technologies for Perception of Humans (Springer, Berlin, 2008), pp. 373–389

42.

J.G. Fiscus, J. Ajot, M. Michel, J.S. Garofolo, The Rich Transcription 2006 Spring Meeting Recognition Evaluation (Springer, Berlin, 2006)

43.

J.G. Fiscus, N. Radde, J.S. Garofolo, A. Le, J. Ajot, C. Laprun, The rich transcription 2005 spring meeting recognition evaluation, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 369–389

44.

E. Fox, E. Sudderth, M. Jordan, A. Willsky, An HDP-HMM for systems with state persistence, in Proceedings of the 25th International Conference on Machine Learning (ACM, New York, 2008), pp. 312–319

45.

E.B. Fox, E.B. Sudderth, M.I. Jordan, A.S. Willsky, A sticky HDP-HMM with application to speaker diarization. Ann. Appl. Stat. 5(2A), 1020–1056 (2011)

46.

G. Friedland, O. Vinyals, Y. Huang, C. Muller, Fusing short term and long term features for improved speaker diarization, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009) (2009), pp. 4077–4080. doi:10.1109/ICASSP.2009.4960524

47.

G. Friedland, A. Janin, D. Imseng, X. Anguera Miro, L. Gottlieb, M. Huijbregts, M. Knox, O. Vinyals, The ICSI RT-09 speaker diarization system. IEEE Trans. Audio Speech Lang. Process. 20(2), 371–381 (2012). doi:10.1109/TASL.2011.2158419

48.

G. Friedland, O. Vinyals, Y. Huang, C. Muller, Prosodic and other long-term features for speaker diarization. IEEE Trans. Audio Speech Lang. Process. 17(5), 985–993 (2009). doi:10.1109/TASL.2009.2015089

49.

R. Gangadharaiah, B. Narayanaswamy, N. Balakrishnan, A novel method for two-speaker segmentation, in Interspeech (2004)

50.

J. Gauvain, C. Lee, Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process. 2(2), 291–298 (1994)

51.

J.L. Gauvain, L. Lamel, G. Adda, Partitioning and transcription of broadcast news data, in ICSLP, vol. 98 (1998), pp. 1335–1338

52.

J.T. Geiger, F. Wallhoff, G. Rigoll, GMM-UBM based open-set online speaker diarization, in Interspeech (2010), pp. 2330–2333

53.

H. Gish, M.H. Siu, R. Rohlicek, Segregation of speakers for speech recognition and speaker identification, in International Conference on Acoustics, Speech, and Signal Processing, 1991 (ICASSP-91) (1991), pp. 873–876

54.

T. Hain, S. Johnson, A. Tuerk, P. Woodland, S. Young, Segment generation and clustering in the HTK broadcast news transcription system, in Proceedings of DARPA Broadcast News Transcription and Understanding Workshop, vol. 1998 (1998)

55.

J. Hansen, B. Zhou, M. Akbacak, R. Sarikaya, B. Pellom, Audio stream phrase recognition for a national gallery of the spoken word: “One Small Step”, in Sixth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2000)

56.

H. Hermansky, Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 87(4), 1738–1752 (1990)

57.

H. Hermansky, N. Morgan, A. Bayya, P. Kohn, RASTA-PLP speech analysis technique, in IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992 (ICASSP-92), vol. 1 (1992), pp. 121–124

58.

M. Huijbregts, R. Ordelman, F. de Jong, Annotation of heterogeneous multimedia content using automatic speech recognition, in Semantic Multimedia. Lecture Notes in Computer Science, vol. 4816 (Springer, Berlin, Heidelberg, 2007), pp. 78–90

59.

D. Imseng, G. Friedland, An adaptive initialization method for speaker diarization based on prosodic features, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4946–4949

60.

D. Istrate, C. Fredouille, S. Meignier, L. Besacier, J.F. Bonastre, NIST RT’05S evaluation: pre-processing techniques and speaker diarization on multiple microphone meetings, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 428–439

61.

H. Jin, F. Kubala, R. Schwartz, Automatic speaker clustering, in Proceedings of the DARPA Speech Recognition Workshop (1997), pp. 108–111

62.

Q. Jin, T. Schultz, Speaker segmentation and clustering in meetings, in Interspeech, vol. 4 (2004), pp. 597–600

63.

S. Johnson, Who spoke when?-automatic segmentation and clustering for determining speaker turns, in Sixth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1999)

64.

S.E. Johnson, P.C. Woodland, Speaker clustering using direct maximisation of the MLLR-adapted likelihood, in Proceedings of ICSLP 98 (1998), pp. 1775–1779

65.

T. Kemp, M. Schmidt, M. Westphal, A. Waibel, Strategies for automatic segmentation of audio data, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2000 (ICASSP’00), vol. 3 (2000), pp. 1423–1426

66.

P. Kenny, G. Boulianne, P. Dumouchel, Eigenvoice modeling with sparse training data. IEEE Trans. Speech Audio Process. 13(3), 345–354 (2005). doi:10.1109/TSA.2004.840940

67.

H. Kim, D. Ertelt, T. Sikora, Hybrid speaker-based segmentation system using model-level clustering, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1 (2005), pp. 745–748

68.

B.E. Kingsbury, N. Morgan, S. Greenberg, Robust speech recognition using the modulation spectrogram. Speech Commun. 25(1), 117–132 (1998)

69.

C. Knapp, G. Carter, The generalized correlation method for estimation of time delay. IEEE Trans. Acoust. Speech Signal Process. 24(4), 320–327 (1976)

70.

T. Koshinaka, K. Nagatomo, K. Shinoda, Online speaker clustering using incremental learning of an ergodic hidden Markov model, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009) (2009), pp. 4093–4096. doi:10.1109/ICASSP.2009.4960528

71.

R. Kuhn, J.C. Junqua, P. Nguyen, N. Niedzielski, Rapid speaker adaptation in eigenvoice space. IEEE Trans. Speech Audio Process. 8(6), 695–707 (2000)

72.

I. Lapidot, SOM as likelihood estimator for speaker clustering, in Eighth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2003)

73.

K. Laskowski, C. Fugen, T. Schultz, Simultaneous multispeaker segmentation for automatic meeting recognition, in Proceedings of EUSIPCO, Poznan, 2007, pp. 1294–1298

74.

K. Laskowski, Q. Jin, T. Schultz, Crosscorrelation-based multispeaker speech activity detection, in Eighth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2004)

75.

K. Laskowski, T. Schultz, A geometric interpretation of non-target-normalized maximum cross-channel correlation for vocal activity detection in meetings, in Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers (Association for Computational Linguistics, 2007), pp. 89–92

76.

K. Laskowski, T. Schultz, Unsupervised learning of overlapped speech model parameters for multichannel speech activity detection in meetings, in Proceedings of ICASSP (2006), pp. 993–996

77.

V.B. Le, O. Mella, D. Fohr, et al., Speaker diarization using normalized cross likelihood ratio, in Interspeech, vol. 7 (2007), pp. 1869–1872

78.

D.A. van Leeuwen, The TNO speaker diarization system for NIST RT05s meeting data, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 440–449

79.

D.A. van Leeuwen, M. Konečný, Progress in the AMIDA speaker diarization system for meeting data, in Multimodal Technologies for Perception of Humans (Springer, Berlin, 2008), pp. 475–483

80.

D. Liu, F. Kubala, Online speaker clustering, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004 (ICASSP’04), vol. 1 (2004), pp. 333–336

81.

D. Liu, F. Kubala, Fast speaker change detection for broadcast news transcription and indexing, in Sixth European Conference on Speech Communication and Technology (1999)

82.

J. López, D. Ellis, Using acoustic condition clustering to improve acoustic change detection on broadcast news, in Sixth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2000)

83.

L. Lu, H. Zhang, Real-time unsupervised speaker change detection, in International Conference on Pattern Recognition, vol. 16 (2002), pp. 358–361

84.

J. Luque, C. Segura, J. Hernando, Clustering initialization based on spatial information for speaker diarization of meetings, in Interspeech (2008), pp. 383–386

85.

J. Makhoul, Linear prediction: a tutorial review. Proc. IEEE 63(4), 561–580 (1975)

86.

A. Malegaonkar, A. Ariyaeeinia, P. Sivakumaran, J. Fortuna, Unsupervised speaker change detection using probabilistic pattern matching. IEEE Signal Process. Lett. 13(8), 509–512 (2006)

87.

K. Markov, S. Nakamura, Never-ending learning system for on-line speaker diarization, in IEEE Workshop on Automatic Speech Recognition Understanding, 2007 (ASRU) (2007), pp. 699–704. doi:10.1109/ASRU.2007.4430197

88.

K. Markov, S. Nakamura, Improved novelty detection for online GMM based speaker diarization, in Interspeech (2008), pp. 363–366

89.

S. Meignier, J. Bonastre, S. Igounet, E-HMM approach for learning and adapting sound models for speaker indexing, in 2001: A Speaker Odyssey-The Speaker Recognition Workshop (ISCA, Pittsburgh, 2001)

90.

S. Meignier, D. Moraru, C. Fredouille, J.F. Bonastre, L. Besacier, Step-by-step and integrated approaches in broadcast news speaker diarization. Comput. Speech Lang. 20(2–3), 303–330 (2006). doi:10.1016/j.csl.2005.08.002

91.

X.A. Miró, Robust speaker diarization for meetings, Ph.D. thesis, Universitat Politècnica de Catalunya, Barcelona (2006)

92.

D. Moraru, S. Meignier, L. Besacier, J.F. Bonastre, I. Magrin-Chagnolleau, The ELISA consortium approaches in speaker segmentation during the NIST 2002 speaker recognition evaluation, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003 (ICASSP’03), vol. 2 (2003), p. II-89

93.

D. Moraru, S. Meignier, C. Fredouille, L. Besacier, J.F. Bonastre, The ELISA consortium approaches in broadcast news speaker segmentation during the NIST 2003 rich transcription evaluation, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004 (ICASSP’04), vol. 1 (2004), p. I-373

94.

K. Mori, S. Nakagawa, Speaker change detection and speaker clustering using VQ distortion for broadcast news speech recognition, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2001 (ICASSP’01), vol. 1 (2001)

95.

R.M. Neal, G.E. Hinton, A view of the EM algorithm that justifies incremental, sparse, and other variants, in Learning in Graphical Models (Springer, Berlin, 1998), pp. 355–368

96.

A.Y. Ng, M.I. Jordan, Y. Weiss et al., On spectral clustering: analysis and an algorithm. Adv. Neural Inf. Process. Syst. 2, 849–856 (2002)

97.

P. Nguyen, L. Rigazio, Y. Moh, J. Junqua, Rich transcription 2002 site report, Panasonic Speech Technology Laboratory (PSTL), in Proceedings of the 2002 Rich Transcription Workshop (2002)

98.

M. Nishida, T. Kawahara, Unsupervised speaker indexing using speaker model selection based on Bayesian information criterion, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003 (ICASSP’03), vol. 1 (2003), pp. 172–175

99.

J.M. Pardo, X. Anguera, C. Wooters, Speaker diarization for multi-microphone meetings using only between-channel differences, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 257–264

100.

J.M. Pardo, X. Anguera, C. Wooters, Speaker diarization for multiple distant microphone meetings: mixing acoustic features and inter-channel time differences, in Interspeech (2006)

101.

J.M. Pardo, R. Barra-Chicote, R. San-Segundo, R. de Córdoba, B. Martínez-González, Speaker diarization features: the UPM contribution to the RT09 evaluation. IEEE Trans. Audio Speech Lang. Process. 20(2), 426–435 (2012)

102.

J. Pelecanos, S. Sridharan, Feature warping for robust speaker verification, in 2001: A Speaker Odyssey-The Speaker Recognition Workshop (2001)

103.

L. Perez-Freire, C. Garcia-Mateo, A multimedia approach for audio segmentation in TV broadcast news, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004 (ICASSP’04), vol. 1 (2004)

104.

T. Pfau, D. Ellis, A. Stolcke, Multispeaker speech activity detection for the ICSI meeting recorder, in Proceedings of ASRU, vol. 1 (2001)

105.

L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77(2), 257–286 (1989)

106.

W.M. Rand, Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971)

107.

D. Reynolds, E. Singer, B. Carlson, G. O’Leary, J. McLaughlin, M. Zissman, Blind clustering of speech utterances based on speaker and language characteristics, in Fifth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 1998)

108.

D.A. Reynolds, T.F. Quatieri, R.B. Dunn, Speaker verification using adapted Gaussian mixture models. Digit. Signal Process. 10(1), 19–41 (2000)

109.

D.A. Reynolds, R.C. Rose, Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. Speech Audio Process. 3(1), 72–83 (1995)

110.

D.A. Reynolds, P. Torres-Carrasquillo, The MIT Lincoln laboratory RT-04F diarization systems: applications to broadcast audio and telephone conversations. Technical Report, DTIC Document (2004)

111.

M. Roch, Y. Cheng, Speaker segmentation using the MAP-adapted Bayesian information criterion, in ODYSSEY04-The Speaker and Language Recognition Workshop (ISCA, Pittsburgh, 2004)

112.

P.R. Roth, Effective measurements using digital signal analysis. IEEE Spectr. 8(4), 62–70 (1971)

113.

J. Rougui, M. Rziza, D. Aboutajdine, M. Gelgon, J. Martinez, F. Rabat, Fast incremental clustering of Gaussian mixture speaker models for scaling up retrieval in on-line broadcast, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2006 (ICASSP 2006), vol. 5 (2006)

114.

M. Rouvier, S. Meignier, A global optimization framework for speaker diarization, in Odyssey 2012-The Speaker and Language Recognition Workshop (2012)

115.

M.A. Sato, S. Ishii, On-line EM algorithm for the normalized Gaussian network. Neural Comput. 12(2), 407–432 (2000)

116.

G. Schwarz, Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)

117.

E. Shriberg, L. Ferrer, S. Kajarekar, A. Venkataraman, A. Stolcke, Modeling prosodic feature sequences for speaker recognition. Speech Commun. 46(3), 455–472 (2005)

118.

S. Shum, N. Dehak, E. Chuangsuwanich, D.A. Reynolds, J.R. Glass, Exploiting intra-conversation variability for speaker diarization, in Interspeech (2011), pp. 945–948

119.

S. Shum, N. Dehak, R. Dehak, J. Glass, Unsupervised methods for speaker diarization: an integrated and iterative approach. IEEE Trans. Audio Speech Lang. Process. 21(10), 2015–2028 (2013). doi:10.1109/TASL.2013.2264673

120.

S. Shum, N. Dehak, J. Glass, On the use of spectral and iterative methods for speaker diarization. System 1(w2), 2 (2012)

121.

M.A. Siegler, U. Jain, B. Raj, R.M. Stern, Automatic segmentation, classification and clustering of broadcast news audio, in Proceedings of DARPA Broadcast News Workshop (1997), p. 11

122.

J. Silovsky, J. Prazak, Speaker diarization of broadcast streams using two-stage clustering based on i-vectors and cosine distance scoring, in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2012), pp. 4193–4196

123.

R. Sinha, S.E. Tranter, M.J. Gales, P.C. Woodland, The Cambridge university March 2005 speaker diarisation system, in Interspeech (2005), pp. 2437–2440

124.

P. Sivakumaran, J. Fortuna, A.M. Ariyaeeinia, On the use of the Bayesian information criterion in multiple speaker detection, in Interspeech (2001), pp. 795–798

125.

A. Solomonoff, A. Mielke, M. Schmidt, H. Gish, Clustering speakers by their voices, in Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 2 (1998), pp. 757–760

126.

S. Stevens, J. Volkmann, The relation of pitch to frequency: a revised scale. Am. J. Psychol. 53(3), 329–353 (1940)

127.

H. Sun, B. Ma, S. Kalayar Khine, H. Li, Speaker diarization system for RT07 and RT09 meeting room audio, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4982–4985

128.

H. Tang, S. Chu, M. Hasegawa-Johnson, T. Huang, Partially supervised speaker clustering. IEEE Trans. Pattern Anal. Mach. Intell. 34(5), 959–971 (2012). doi:10.1109/TPAMI.2011.174

129.

Y. Teh, M. Jordan, M. Beal, D. Blei, Hierarchical Dirichlet processes. J. Am. Stat. Assoc. 101(476), 1566–1581 (2006)

130.

S. Tranter, Two-way cluster voting to improve speaker diarisation performance, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005 (ICASSP’05), vol. 1 (2005)

131.

A. Tritschler, R. Gopinath, Improved speaker segmentation and segments clustering using the Bayesian information criterion, in Sixth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 1999), pp. 679–682

132.

W. Tsai, H. Wang, On maximizing the within-cluster homogeneity of speaker voice characteristics for speech utterance clustering, in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, Toulouse, 2006

133.

W.H. Tsai, S.S. Cheng, Y.H. Chao, H.M. Wang, Clustering speech utterances by speaker using eigenvoice-motivated vector space models, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005 (ICASSP’05), vol. 1 (2005), pp. 725–728

134.

W.H. Tsai, S.S. Cheng, H.M. Wang, Speaker clustering of speech utterances using a voice characteristic reference space, in Eighth International Conference on Spoken Language Processing (2004)

135.

F. Valente, Infinite models for speaker clustering, in Ninth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2006)

136.

F. Valente, C. Wellekens, Variational Bayesian speaker clustering, in ODYSSEY04-The Speaker and Language Recognition Workshop (ISCA, Pittsburgh, 2004)

137.

F. Valente, C. Wellekens, Variational Bayesian adaptation for speaker clustering, in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005 (ICASSP’05), vol. 1 (2005)

138.

D. Van Leeuwen, The TNO speaker diarization system for NIST RT05s meeting data, in Machine Learning for Multimodal Interaction. Lecture Notes in Computer Science, vol. 3869 (Springer, Berlin, Heidelberg, 2006), pp. 440

139.

A. Vandecatseye, J. Martens, A fast, accurate and stream-based speaker segmentation and clustering algorithm, in Eighth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2003)

140.

D. Vijayasenan, F. Valente, Speaker diarization of meetings based on large TDOA feature vectors, in 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2012), pp. 4173–4176. doi:10.1109/ICASSP.2012.6288838

141.

D. Vijayasenan, F. Valente, H. Bourlard, Agglomerative information bottleneck for speaker diarization of meetings data, in IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU) (2007), pp. 250–449

142.

D. Vijayasenan, F. Valente, H. Bourlard, Combination of agglomerative and sequential clustering for speaker diarization, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2008 (ICASSP 2008) (2008), pp. 4361–4364. doi:10.1109/ICASSP.2008.4518621

143.

D. Vijayasenan, F. Valente, H. Bourlard, Integration of TDOA features in information bottleneck framework for fast speaker diarization, in Interspeech (2008), pp. 40–43

144.

D. Vijayasenan, F. Valente, H. Bourlard, Mutual information based channel selection for speaker diarization of meetings data, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2009 (ICASSP 2009) (2009), pp. 4065–4068. doi:10.1109/ICASSP.2009.4960521

145.

D. Vijayasenan, F. Valente, H. Bourlard, An information theoretic combination of MFCC and TDOA features for speaker diarization. IEEE Trans. Audio Speech Lang. Process. 19(2), 431–438 (2011). doi:10.1109/TASL.2010.2048603

146.

D. Vijayasenan, F. Valente, H. Bourlard, Multistream speaker diarization of meetings recordings beyond MFCC and TDOA features. Speech Commun. 54(1), 55–67 (2012)

147.

O. Vinyals, G. Friedland, Modulation spectrogram features for improved speaker diarization, in Interspeech (2008), pp. 630–633

148.

A. Viterbi, Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 13(2), 260–269 (1967)

149.

H. Wang, S. Cheng, METRIC-SEQDAC: a hybrid approach for audio segmentation, in Eighth International Conference on Spoken Language Processing (ISCA, Pittsburgh, 2004)

150.

N. Wiener, Extrapolation, Interpolation, and Smoothing of Stationary Time Series: With Engineering Applications, vol. 8 (MIT Press, Cambridge, 1964)

151.

A. Willsky, H. Jones, A generalized likelihood ratio approach to the detection and estimation of jumps in linear systems. IEEE Trans. Automat. Contr. 21(1), 108–112 (1976)

152.

C. Wooters, J. Fung, B. Peskin, X. Anguera, Towards robust speaker segmentation: the ICSI-SRI fall 2004 diarization system, in RT-04F Workshop, vol. 23 (2004)

153.

C. Wooters, M. Huijbregts, The ICSI RT07s speaker diarization system, in Multimodal Technologies for Perception of Humans (Springer, Berlin, 2008), pp. 509–519

154.

S. Wrigley, G. Brown, V. Wan, S. Renals, Feature selection for the classification of crosstalk in multi-channel audio, in Eighth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2003)

155.

S. Wrigley, G. Brown, V. Wan, S. Renals, Speech and crosstalk detection in multichannel audio. IEEE Trans. Speech Audio Process. 13(1), 84–91 (2005)

156.

T. Wu, L. Lu, K. Chen, H. Zhang, UBM-based real-time speaker segmentation for broadcasting news, in ICME 2003, vol. 2 (2003), pp. 721–724

157.

K. Yamanishi, J.I. Takeuchi, G. Williams, P. Milne, On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms, in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, New York, 2000), pp. 320–324

158.

M. Zamalloa, L.J. Rodríguez-Fuentes, G. Bordel, M. Penagarikano, J.P. Uribe, Low-latency online speaker tracking on the AMI corpus of meeting conversations, in 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP) (2010), pp. 4962–4965

159.

B. Zhou, J. Hansen, Efficient audio stream segmentation via the combined T² statistic and Bayesian information criterion. IEEE Trans. Speech Audio Process. 13(4), 467–474 (2005)

160.

B. Zhou, J.H. Hansen, Unsupervised audio stream segmentation and clustering via the Bayesian information criterion, in Interspeech (2000), pp. 714–717

161.

X. Zhu, C. Barras, L. Lamel, J.L. Gauvain, Speaker diarization: from broadcast news to lectures, in Machine Learning for Multimodal Interaction (Springer, Berlin, 2006), pp. 396–406

162.

X. Zhu, C. Barras, S. Meignier, J.L. Gauvain, Combining speaker identification and BIC for speaker diarization, in Interspeech, vol. 5 (2005), pp. 2441–2444

163.

P. Zochova, V. Radova, Modified DISTBIC algorithm for speaker change detection, in Ninth European Conference on Speech Communication and Technology (ISCA, Pittsburgh, 2005)

164.

E. Zwicker, E. Terhardt, Analytical expressions for critical-band rate and critical bandwidth as a function of frequency. J. Acoust. Soc. Am. 68, 1523 (1980)

Title: Speaker Diarization: An Emerging Research
Authors: Trung Hieu Nguyen, Eng Siong Chng, Haizhou Li
Publisher: Springer New York
Book: Speech and Audio Processing for Coding, Enhancement and Recognition
Print ISBN: 978-1-4939-1455-5
Electronic ISBN: 978-1-4939-1456-2
Copyright year: 2015
DOI: https://doi.org/10.1007/978-1-4939-1456-2_8
